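The Tidyverse is a collection of R packages designed for data science. The packages share a common design philosophy, grammar, and data structures, which gives a consistent framework for importing, tidying, manipulating, and visualizing data, and for preparing it for modeling.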

1. What is the Tidyverse?

A cohesive set of packages for data science workflows:

# Install once (not needed in every session), then load the core tidyverse
install.packages("tidyverse")
library(tidyverse)

# Core packages include (lubridate also joined the core in tidyverse 2.0.0):
# - ggplot2 (visualization)
# - dplyr (data manipulation)
# - tidyr (data tidying)
# - readr (data import)
# - purrr (functional programming)
# - tibble (modern data frames)
# - stringr (string manipulation)
# - forcats (factor handling)
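
To confirm which packages a bare library(tidyverse) call attached, one option is to compare the list above against the search path. This is a minimal sketch that assumes the tidyverse is already installed; the `core` and `attached` names are illustrative only.

```r
library(tidyverse)

# The eight core packages named in the comment list above
core <- c("ggplot2", "dplyr", "tidyr", "readr",
          "purrr", "tibble", "stringr", "forcats")

# search() reports attached packages as "package:<name>"
attached <- paste0("package:", core) %in% search()
names(attached) <- core
attached
```

tidyverse_packages() lists every package in the collection, including non-core ones that still have to be loaded individually.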

Tidyverse Basics Quiz

Which package is NOT part of the core tidyverse?

dplyr
data.table
tidyr

2. Data Import (readr)

Fast and user-friendly data import functions:

# Read CSV files
students <- read_csv("students.csv")

# Read TSV files
survey <- read_tsv("survey_results.tsv")

# Column specification
sales <- read_csv(
  "sales_data.csv",
  col_types = cols(
    date = col_date("%Y-%m-%d"),
    region = col_factor(),
    revenue = col_double()
  )
)

# Write data back to files
write_csv(students, "students_clean.csv")
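
The file names above are placeholders, so the column specification cannot be run as written. The sketch below writes a small, made-up data set to a temporary file and reads it back with the same `cols()` specification, to show how the parsed types can be checked. The sample values and the `path` variable are assumptions for illustration.

```r
library(readr)

# Hypothetical sample data, written to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    date = c("2024-01-05", "2024-02-10"),
    region = c("north", "south"),
    revenue = c(1200.5, 980)
  ),
  path
)

# Read the file back with an explicit column specification
sales <- read_csv(
  path,
  col_types = cols(
    date = col_date("%Y-%m-%d"),
    region = col_factor(),
    revenue = col_double()
  )
)

# First class of each column, to confirm the requested types were applied
sapply(sales, function(col) class(col)[1])

# problems() returns one row per value that failed to parse
problems(sales)
```

Supplying `col_types` also stops readr from guessing types, which tends to make imports more predictable when the input files change over time.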

Data Import Quiz

Which function reads comma-separated values?

read_table()
read_csv()
read_data()

3. Data Wrangling (dplyr)

The grammar of data manipulation:

# Create a tibble (modern data frame)
tib <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  score = c(88, 92, 85, 79, 95)
)

# Key verbs:
# Filter rows
tib %>% filter(score > 85)

# Select columns
tib %>% select(name, score)

# Create new columns
tib %>% mutate(grade = ifelse(score >= 90, "A", "B"))

# Summarize data
tib %>% summarize(avg_score = mean(score))

# Grouped operations
tib %>%
  mutate(grade = ifelse(score >= 90, "A", "B")) %>%
  group_by(grade) %>%
  summarize(count = n(), avg = mean(score))
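
When more than two grades are needed, nested ifelse() calls become hard to read. One possible alternative is case_when() combined with arrange(), sketched here on the same `tib` data. The 80-point threshold for "B" and the "C" fallback are assumptions added for illustration.

```r
library(dplyr)
library(tibble)

tib <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  score = c(88, 92, 85, 79, 95)
)

# case_when() checks the conditions in order and uses the first one that matches
graded <- tib %>%
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",   # assumed threshold, not from the original example
    TRUE ~ "C"           # fallback for every remaining row
  )) %>%
  arrange(desc(score))   # highest score first

# One row per grade; .groups = "drop" returns an ungrouped tibble
grade_summary <- graded %>%
  group_by(grade) %>%
  summarize(count = n(), avg = mean(score), .groups = "drop")

grade_summary
```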

dplyr Quiz

Which dplyr verb creates new columns?

summarize()
mutate()
filter()

4. Data Tidying (tidyr)

Getting your data into a tidy format:

# Pivoting data
sales <- tibble(
  product = c("A", "B", "C"),
  Q1 = c(120, 150, 90),
  Q2 = c(140, 110, 95)
)

# Wide to long
sales_long <- sales %>%
  pivot_longer(cols = c(Q1, Q2), 
               names_to = "quarter", 
               values_to = "revenue")

# Long to wide
sales_wide <- sales_long %>%
  pivot_wider(names_from = quarter, 
              values_from = revenue)

# Handling missing values
df <- tibble(x = c(1, NA, 3), y = c("a", NA, "c"))
df %>% drop_na()    # Remove rows with NAs
df %>% replace_na(list(x = 0, y = "unknown"))  # Replace NAs
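
A quick way to build confidence in a pivot is to check that going long and then wide again recovers the original table. The sketch below repeats the `sales` example and compares the two versions; `round_trip_ok` is an illustrative name.

```r
library(tidyr)
library(tibble)

sales <- tibble(
  product = c("A", "B", "C"),
  Q1 = c(120, 150, 90),
  Q2 = c(140, 110, 95)
)

# Wide to long: one row per product/quarter combination
sales_long <- pivot_longer(sales, cols = c(Q1, Q2),
                           names_to = "quarter", values_to = "revenue")

# Long back to wide
sales_wide <- pivot_wider(sales_long,
                          names_from = quarter, values_from = revenue)

# 3 products x 2 quarters
nrow(sales_long)

# Compare as plain data frames to ignore tibble-specific attributes
round_trip_ok <- isTRUE(all.equal(as.data.frame(sales),
                                  as.data.frame(sales_wide)))
round_trip_ok
```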

tidyr Quiz

Which function converts data from wide to long format?

spread()
pivot_longer()
make_long()

5. Data Visualization (ggplot2)

Create publication-quality graphics:

# Basic scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# With aesthetics
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Engine Size vs Highway MPG",
       x = "Engine Displacement (L)",
       y = "Highway MPG")

# Faceting
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class)

# Save the most recently displayed plot
ggsave("mpg_plot.png", width = 8, height = 6, dpi = 300)
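
Because ggsave() defaults to the last plot displayed, scripts that build several plots can end up saving the wrong one. A minimal sketch of a more explicit pattern is to store the plot in a variable and pass it through the `plot` argument. The temporary output path `out` is an assumption so the example does not write into the working directory.

```r
library(ggplot2)

# Keep the plot in a variable instead of relying on the last plot displayed
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Engine Size vs Highway MPG",
       x = "Engine Displacement (L)",
       y = "Highway MPG")

# Hypothetical output location; replace with a real path as needed
out <- tempfile(fileext = ".png")

# Pass the plot explicitly so the saved file never depends on display order
ggsave(out, plot = p, width = 8, height = 6, dpi = 300)

file.exists(out)
```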

ggplot2 Quiz

Which function adds points to a ggplot?

geom_line()
geom_point()
add_points()

6. Functional Programming (purrr)

Elegant alternatives to loops:

# Map functions
numbers <- list(1:5, 6:10, 11:15)

# Apply function to each element
map(numbers, mean)        # Returns list
map_dbl(numbers, mean)    # Returns numeric vector

# Anonymous function via the formula shorthand (.x is the current element)
map(numbers, ~ .x * 2)    # Double each element

# Working with data frames
mtcars %>%
  split(.$cyl) %>%        # Split by cylinder
  map(~ lm(mpg ~ wt, data = .x)) %>%  # Model for each group
  map(summary)            # Get summaries

# Safely handle errors: each element of the output is a list with `result` and `error`
safe_log <- safely(log)
results <- map(list(1, 10, "a"), safe_log)
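
The output of a safely()-wrapped function needs one more step before it is useful: separating the successes from the failures. The sketch below shows one way to do that, plus possibly() as a shorter alternative when a default value is enough. The names `succeeded`, `values`, `maybe_log`, and `with_default` are illustrative.

```r
library(purrr)

safe_log <- safely(log)
results <- map(list(1, 10, "a"), safe_log)

# A call succeeded when its `error` slot is NULL
succeeded <- map_lgl(results, ~ is.null(.x$error))

# Pull the numeric results out of the successful calls only
values <- map_dbl(results[succeeded], "result")

# possibly() replaces the error with a default value instead of capturing it
maybe_log <- possibly(log, otherwise = NA_real_)
with_default <- map_dbl(list(1, 10, "a"), maybe_log)

succeeded
values
with_default
```

safely() keeps the error objects available for inspection, while possibly() discards them, so the choice depends on whether the failures need to be reported.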

purrr Quiz

Which map variant returns a numeric vector?

map()
map_dbl()
map_all()

7. String Manipulation (stringr)

Consistent functions for working with strings:

# Basic string operations
text <- c("Data Science", "Machine Learning", "R Programming")

str_length(text)      # Count characters
str_to_upper(text)    # Convert to uppercase
str_sub(text, 1, 4)   # Extract substrings

# Pattern matching
str_detect(text, "Science")    # TRUE/FALSE
str_subset(text, "ing")        # Return matches
str_extract(text, "[A-Za-z]+") # Extract first match (here, the first word)

# String splitting
dates <- c("2023-01-15", "2022-11-30")
str_split(dates, "-", simplify = TRUE)

# String replacement
str_replace(text, " ", "_")    # Replace first space
str_replace_all(text, " ", "_") # Replace all spaces
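
Splitting on a delimiter works for well-formed dates, but a regular expression with capture groups also validates the shape of each string. A minimal sketch using str_match() on the same `dates` vector follows; the column names assigned to `parts` are chosen here for readability and are not produced by stringr.

```r
library(stringr)

dates <- c("2023-01-15", "2022-11-30")

# str_match() returns a character matrix: full match first, then one column per group
parts <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
colnames(parts) <- c("full", "year", "month", "day")

# Convert a captured group to integers
years <- as.integer(parts[, "year"])

parts
years
```

Strings that do not fit the pattern produce a row of NA values, which makes malformed input easy to spot.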

stringr Quiz

Which function detects if a pattern exists?

str_find()
str_detect()
str_locate()

8. Factor Handling (forcats)

Tools for working with categorical variables:

# Create factors
survey <- tibble(
  response = factor(c("Agree", "Neutral", "Disagree", "Agree"))
)

# Reorder factors
survey %>%
  mutate(response = fct_relevel(response, "Disagree", "Neutral", "Agree"))

# Count factor levels
fct_count(survey$response)

# Lump small categories (fct_lump_n() is the current replacement for the general fct_lump())
countries <- factor(c("USA", "UK", "France", "USA", "Germany", "UK"))
fct_lump_n(countries, n = 2)  # Keep the 2 most frequent levels, others become "Other"

# Reorder by frequency
survey %>%
  mutate(response = fct_infreq(response))
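
Reordering a factor changes its levels, not its values, so the effect is easiest to see by inspecting levels() before and after. The sketch below uses a made-up `priority` factor without ties in the counts; the data and variable names are assumptions for illustration.

```r
library(forcats)

# Hypothetical data: "high" appears 3 times, "medium" twice, "low" once
priority <- factor(c("low", "high", "high", "medium", "high", "medium"))

# Default level order is alphabetical
levels(priority)

# Order levels by how often they occur, most frequent first
by_freq <- fct_infreq(priority)
levels(by_freq)

# Set an explicit order by hand
manual <- fct_relevel(priority, "low", "medium", "high")
levels(manual)

# Reverse an existing level order
reversed <- fct_rev(by_freq)
levels(reversed)
```

The level order matters mostly downstream, for example in the ordering of bars and legend entries in ggplot2.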

forcats Quiz

Which function reorders factors by frequency?

fct_sort()
fct_infreq()
fct_count()

