R Data Manipulation with dplyr
The dplyr package provides a grammar of data manipulation with intuitive functions that make data wrangling efficient and readable.
1. dplyr Basics
Core dplyr functions for data manipulation:
# Load dplyr
library(dplyr)
# Create a sample data frame
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)
# View the data
glimpse(df)
# Basic verbs:
# - filter(): subset rows
# - select(): select columns
# - mutate(): create new columns
# - arrange(): sort rows
# - summarize(): aggregate data
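The five verbs listed above can each be tried on the sample data. The following is a minimal sketch, assuming the same sample data as above (rebuilt here so the block runs on its own); the column name score_pct and the result variable names are illustrative and not part of the original tutorial.

```r
library(dplyr)

# Rebuild the sample data so this block is self-contained
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# One call per verb; each returns a new tibble and leaves df unchanged
older      <- filter(df, age >= 35)                  # keep rows where age is at least 35
names_only <- select(df, name, score)                # keep two columns
with_pct   <- mutate(df, score_pct = score / 100)    # add a derived column (illustrative name)
by_score   <- arrange(df, desc(score))               # sort by score, highest first
overall    <- summarize(df, avg_score = mean(score)) # collapse to a one-row summary
```

Because none of these calls modifies df, a result has to be assigned to a name (as done here) if it is needed later.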
dplyr Basics Quiz
Which dplyr function is used to select columns?
2. Filtering and Selecting
Subset your data by rows and columns:
# Filter rows (like WHERE in SQL)
df %>% filter(age > 30)
# Multiple conditions (comma-separated conditions are combined with AND)
df %>% filter(age > 30, score >= 90)
# Select columns
df %>% select(name, score)
# Select helpers:
df %>% select(starts_with("s")) # score
df %>% select(ends_with("e")) # name, age, score
df %>% select(contains("a")) # name, age
# Rename columns
df %>% rename(student_name = name)
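A few variations on the filtering and selecting calls above may be useful. This is a sketch under the assumption that the sample data is the same as in section 1 (rebuilt here so the block is self-contained); the result variable names and the new column name student are illustrative.

```r
library(dplyr)

# Rebuild the sample data so this block is self-contained
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# Comma-separated conditions must all hold (AND)
both <- df %>% filter(age > 30, score >= 90)

# Use | when either condition is enough (OR)
either <- df %>% filter(age < 30 | score >= 95)

# Use %in% to match against a set of values
picked <- df %>% filter(name %in% c("Alice", "Eve"))

# Drop a column with a minus sign instead of listing the ones to keep
no_id <- df %>% select(-id)

# select() can rename while selecting (student is an illustrative name)
renamed <- df %>% select(student = name, score)
```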
Filtering Quiz
How do you select columns whose names start with "date"?
3. Creating and Modifying Columns
Transform and create new variables with mutate:
# Create new column
df %>% mutate(score_per_age = score / age)
# Modify existing column
df %>% mutate(age = age + 1)
# Multiple mutations
df %>% mutate(
  age_group = ifelse(age < 35, "Young", "Mature"),
  score_centered = score - mean(score)
)
# Conditional mutation with case_when
df %>% mutate(
  grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    TRUE ~ "C"
  )
)
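The mutate() calls above print their result but do not change df. The sketch below, which assumes the same sample data as section 1, assigns the result to a new name so the added columns can be reused; df_graded is an illustrative name. It also shows that case_when() conditions are checked in order, so the first matching condition decides the value.

```r
library(dplyr)

# Rebuild the sample data so this block is self-contained
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# mutate() returns a new tibble; assign it to keep the new columns
df_graded <- df %>%
  mutate(
    age_group = ifelse(age < 35, "Young", "Mature"),
    grade = case_when(
      score >= 90 ~ "A", # checked first
      score >= 80 ~ "B", # only reached when score is below 90
      TRUE ~ "C"         # fallback for everything else
    )
  )

# The original df is unchanged
has_grade_in_df <- "grade" %in% names(df)
```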
Mutation Quiz
Which function is best for complex conditional column creation?
4. Summarizing and Grouping
Aggregate data with group_by and summarize:
# Basic summary
df %>% summarize(
  avg_score = mean(score),
  max_age = max(age)
)
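# Grouped operations (age_group must exist first; the earlier mutate() calls did not modify df)
df %>%
  mutate(age_group = ifelse(age < 35, "Young", "Mature")) %>%
  group_by(age_group) %>%
  summarize(
    count = n(),
    mean_score = mean(score),
    sd_score = sd(score)
  )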
# Count rows per distinct value (n_distinct() gives the number of distinct values)
df %>% count(name)
# Window functions
df %>%
  mutate(rank = dense_rank(desc(score)))
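Grouping also changes how mutate() behaves: window functions are then applied within each group. The following sketch assumes the sample data from section 1 and the same age_group rule used in section 3; df_groups, group_stats, ranked, and rank_in_group are illustrative names.

```r
library(dplyr)

# Rebuild the sample data so this block is self-contained
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# Store the grouping column so later steps can use it
df_groups <- df %>%
  mutate(age_group = ifelse(age < 35, "Young", "Mature"))

# One summary row per group; .groups = "drop" returns an ungrouped tibble
group_stats <- df_groups %>%
  group_by(age_group) %>%
  summarize(
    count = n(),
    mean_score = mean(score),
    .groups = "drop"
  )

# Grouped mutate: rank scores within each age group, then remove the grouping
ranked <- df_groups %>%
  group_by(age_group) %>%
  mutate(rank_in_group = dense_rank(desc(score))) %>%
  ungroup()

# Number of distinct groups
n_groups <- n_distinct(df_groups$age_group)
```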
Summarizing Quiz
What does the n() function return?
5. Joining Data
Combine data from multiple sources:
# Create another data frame
df2 <- tibble(
  id = 3:6,
  department = c("Math", "Science", "Arts", "History")
)
# Inner join (keep only matching rows)
df %>% inner_join(df2, by = "id")
# Left join (keep all left rows)
df %>% left_join(df2, by = "id")
# Full join (keep all rows from both tables)
df %>% full_join(df2, by = "id")
# Anti join (keep rows of df that have no match in df2)
df %>% anti_join(df2, by = "id")
# Binding rows/columns
bind_rows(df, df) # stack vertically
bind_cols(df, df) # combine horizontally (duplicate column names are repaired to be unique)
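One way to see how the joins above differ is to compare which ids survive each one. This sketch assumes the same df and df2 as above (df is reduced to two columns for brevity); the result variable names are illustrative.

```r
library(dplyr)

# Rebuild both tables so this block is self-contained
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve")
)
df2 <- tibble(
  id = 3:6,
  department = c("Math", "Science", "Arts", "History")
)

inner <- inner_join(df, df2, by = "id") # ids 3, 4, 5 (present in both)
left  <- left_join(df, df2, by = "id")  # ids 1 to 5; department is NA for ids 1 and 2
full  <- full_join(df, df2, by = "id")  # ids 1 to 6; name is NA for id 6
anti  <- anti_join(df, df2, by = "id")  # ids 1 and 2 (no match in df2)

# Row counts for each join type
join_sizes <- c(
  inner = nrow(inner),
  left = nrow(left),
  full = nrow(full),
  anti = nrow(anti)
)
```

Unmatched rows kept by left_join() and full_join() are filled with NA in the columns that come from the other table.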
Joining Quiz
Which join keeps all rows from the first table?
6. Piping with %>%
The pipe operator for readable data workflows:
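# Without pipes (nested); age_group is created first because df does not contain it
arrange(
  summarize(
    group_by(
      mutate(
        filter(df, age > 30),
        age_group = ifelse(age < 35, "Young", "Mature")
      ),
      age_group
    ),
    avg_score = mean(score)
  ),
  avg_score
)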
# With pipes (sequential)
df %>%
  filter(age > 30) %>%
  mutate(age_group = ifelse(age < 35, "Young", "Mature")) %>%
  group_by(age_group) %>%
  summarize(avg_score = mean(score)) %>%
  arrange(avg_score)
# Native pipe |> (R 4.1+)
df |>
  filter(age > 30) |>
  mutate(age_group = ifelse(age < 35, "Young", "Mature")) |>
  group_by(age_group) |>
  summarize(avg_score = mean(score)) |>
  arrange(avg_score)
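The equivalence between the nested and piped forms can be checked directly. The sketch below assumes the sample data from section 1; the variable names are illustrative. It also shows the magrittr dot placeholder, which places the left-hand side somewhere other than the first argument; lm() from base R's stats package is used only as a convenient example of a function whose data argument is not first.

```r
library(dplyr)

# Rebuild the sample data so this block is self-contained
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# x %>% f(y) is evaluated as f(x, y)
piped  <- df %>% filter(age > 30) %>% select(name, score)
nested <- select(filter(df, age > 30), name, score)
same_result <- identical(piped, nested)

# The dot placeholder sends the left-hand side to a named argument instead
fit <- df %>% lm(score ~ age, data = .)
slope <- unname(coef(fit)["age"])
```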
Piping Quiz
What does %>% pass to the next function?