R Data Manipulation with dplyr

The dplyr package provides a grammar of data manipulation: a small set of consistent, verb-named functions that make data wrangling code quick to write and easy to read.

1. dplyr Basics

Core dplyr functions for data manipulation:

# Load dplyr
library(dplyr)

# Create a sample data frame
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# View the data
glimpse(df)

# Basic verbs:
# - filter(): subset rows
# - select(): select columns
# - mutate(): create new columns
# - arrange(): sort rows
# - summarize(): aggregate data
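
The five verbs listed above can be chained into one short pipeline. This is a minimal sketch that rebuilds the sample data so it runs on its own; the names top_scores, score_pct, and overall are illustrative and not part of the tutorial.

```r
library(dplyr)

# Same sample data as above, rebuilt so the snippet is self-contained
df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# One pipeline that touches filter, select, mutate, and arrange
top_scores <- df %>%
  filter(score >= 85) %>%              # rows: drops Alice (score 80)
  select(name, age, score) %>%         # columns: drops id
  mutate(score_pct = score / 100) %>%  # new column (illustrative name)
  arrange(desc(score))                 # sort: highest score first

top_scores

# summarize() collapses the remaining rows into a single row
overall <- top_scores %>% summarize(avg_score = mean(score))
overall
```

Each verb takes a data frame as its first argument and returns a new one, which is what makes the steps composable.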

dplyr Basics Quiz

Which dplyr function is used to select columns?

  • filter()
  • select()
  • mutate()

2. Filtering and Selecting

Subset your data by rows and columns:

# Filter rows (like WHERE in SQL)
df %>% filter(age > 30)

# Multiple conditions
df %>% filter(age > 30, score >= 90)

# Select columns
df %>% select(name, score)

# Select helpers:
df %>% select(starts_with("s"))  # score
df %>% select(ends_with("e"))    # name, age, score
df %>% select(contains("a"))     # name, age

# Rename columns
df %>% rename(student_name = name)
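
Comma-separated conditions in filter() are combined with AND. The sketch below, which rebuilds the sample data so it runs alone, shows the equivalent `&` form plus two common alternatives, `|` for OR and `%in%` for matching a set of values. The result names (both, same_as_and, either, picked) are illustrative.

```r
library(dplyr)

df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# Comma-separated conditions mean AND; these two calls are equivalent
both <- df %>% filter(age > 30, score >= 90)
same_as_and <- df %>% filter(age > 30 & score >= 90)

# Use | when either condition may hold
either <- df %>% filter(age < 30 | score >= 95)

# Use %in% to keep rows matching any value in a set
picked <- df %>% filter(name %in% c("Alice", "Eve"))

either
picked
```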

Filtering Quiz

How do you select columns whose names start with "date"?

  • select(contains("date"))
  • select(starts_with("date"))
  • filter(colnames == "date")

3. Creating and Modifying Columns

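Transform existing variables and create new ones with mutate(). Each call returns a new data frame; df itself is unchanged unless you assign the result back: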

# Create new column
df %>% mutate(score_per_age = score / age)

# Modify existing column
df %>% mutate(age = age + 1)

# Multiple mutations
df %>% mutate(
  age_group = ifelse(age < 35, "Young", "Mature"),
  score_centered = score - mean(score)
)

# Conditional mutation with case_when
df %>% mutate(
  grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    TRUE ~ "C"
  )
)
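
case_when() checks its conditions from top to bottom and uses the first one that is TRUE, so the order of the conditions matters. A small sketch of that behavior follows; the scores, graded, and misgraded objects are made up for illustration.

```r
library(dplyr)

# Illustrative data: one score per grade band
scores <- tibble(score = c(95, 85, 70))

# Most specific condition first: each score lands in the intended band
graded <- scores %>% mutate(
  grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    TRUE ~ "C"
  )
)

# Reversed order: score >= 80 already matches 95, so "A" is never reached
misgraded <- scores %>% mutate(
  grade = case_when(
    score >= 80 ~ "B",
    score >= 90 ~ "A",
    TRUE ~ "C"
  )
)

graded
misgraded
```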

Mutation Quiz

Which function is best for complex conditional column creation?

  • ifelse()
  • case_when()
  • when()

4. Summarizing and Grouping

Aggregate data with group_by and summarize:

# Basic summary
df %>% summarize(
  avg_score = mean(score),
  max_age = max(age)
)

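# Grouped operations
# (age_group is not stored in df, so create it in the pipeline first)
df %>%
  mutate(age_group = ifelse(age < 35, "Young", "Mature")) %>%
  group_by(age_group) %>%
  summarize(
    count = n(),
    mean_score = mean(score),
    sd_score = sd(score)
  )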

# Count rows for each distinct value of a column
df %>% count(name)

# Window functions
df %>% 
  mutate(rank = dense_rank(desc(score)))
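
In the sample data every name is unique, so the difference between counting rows per value and counting distinct values is hard to see. The sketch below uses a hypothetical staff table with repeated dept values to contrast count() with n_distinct(), and to show a window function applied within groups.

```r
library(dplyr)

# Hypothetical data with repeated group values
staff <- tibble(
  dept = c("Math", "Math", "Arts", "Arts", "Arts"),
  score = c(80, 90, 70, 85, 100)
)

# count(): one row per distinct dept, with the row count in column n
per_dept <- staff %>% count(dept)

# n_distinct(): how many different dept values exist
n_depts <- staff %>% summarize(n_depts = n_distinct(dept))

# group_by() + mutate() keeps every row and ranks within each group
ranked <- staff %>%
  group_by(dept) %>%
  mutate(rank_in_dept = dense_rank(desc(score))) %>%
  ungroup()

per_dept
n_depts
ranked
```

Unlike summarize(), a grouped mutate() does not reduce the number of rows; it only changes the scope over which the function is evaluated.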

Summarizing Quiz

What does the n() function return?

  • The number of columns
  • The number of rows/observations
  • The number of NA values

5. Joining Data

Combine data from multiple sources:

# Create another data frame
df2 <- tibble(
  id = 3:6,
  department = c("Math", "Science", "Arts", "History")
)

# Inner join (keep only matching rows)
df %>% inner_join(df2, by = "id")

# Left join (keep all left rows)
df %>% left_join(df2, by = "id")

# Full join (keep all rows from both tables)
df %>% full_join(df2, by = "id")

# Anti join (keep only left rows with no match in the right table)
df %>% anti_join(df2, by = "id")

# Binding rows/columns
bind_rows(df, df)  # stack vertically (row counts add up)
bind_cols(df, df)  # combine horizontally; duplicated column names are repaired with numeric suffixes
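
Comparing row counts is a quick way to see how the join types differ. The sketch below rebuilds both tables so it runs on its own (ids 1 to 5 on the left, 3 to 6 on the right); the result names inner, left, full, and anti are illustrative.

```r
library(dplyr)

df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

df2 <- tibble(
  id = 3:6,
  department = c("Math", "Science", "Arts", "History")
)

inner <- inner_join(df, df2, by = "id")  # ids present in both tables
left  <- left_join(df, df2, by = "id")   # every id from df
full  <- full_join(df, df2, by = "id")   # every id from either table
anti  <- anti_join(df, df2, by = "id")   # ids in df that are missing from df2

# Row count per join type
c(inner = nrow(inner), left = nrow(left), full = nrow(full), anti = nrow(anti))

# Unmatched rows are filled with NA in the columns from the other table
left %>% filter(is.na(department))
```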

Joining Quiz

Which join keeps all rows from the first table?

  • inner_join()
  • left_join()
  • full_join()

6. Piping with %>%

The pipe operator passes the result of one step to the next function, so a workflow reads top to bottom instead of inside out:

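# The examples below group by age_group, which df does not contain yet,
# so add it first
df <- df %>% mutate(age_group = ifelse(age < 35, "Young", "Mature"))

# Without pipes (nested)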
arrange(
  summarize(
    group_by(
      filter(df, age > 30),
      age_group
    ),
    avg_score = mean(score)
  ),
  avg_score
)

# With pipes (sequential)
df %>%
  filter(age > 30) %>%
  group_by(age_group) %>%
  summarize(avg_score = mean(score)) %>%
  arrange(avg_score)

# Native pipe |> (R 4.1+)
df |>
  filter(age > 30) |>
  group_by(age_group) |>
  summarize(avg_score = mean(score)) |>
  arrange(avg_score)
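
To make the mechanics concrete: `x %>% f(y)` is evaluated as `f(x, y)`, meaning the left-hand side becomes the first argument of the call on the right. The sketch below checks that equivalence and shows the magrittr dot placeholder for functions whose data argument is not first. It rebuilds the sample data so it runs alone; piped, direct, and fit are illustrative names.

```r
library(dplyr)

df <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 40, 45),
  score = c(80, 85, 90, 95, 100)
)

# The piped call and the direct call produce the same result
piped  <- df %>% filter(age > 30)
direct <- filter(df, age > 30)
isTRUE(all.equal(piped, direct))

# With %>%, a dot places the left-hand side in another argument position,
# here the data argument of lm()
fit <- df %>% lm(score ~ age, data = .)
coef(fit)
```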

Piping Quiz

What does %>% pass to the next function?

  • Only the first argument
  • The result of the previous operation
  • The entire data frame