Statistical Analysis in R: From Basic to Advanced
R is a powerful tool for statistical analysis. This guide walks you through essential techniques, from summarizing data to advanced modeling, with practical examples.
1. Descriptive Statistics: Summarizing Data
What it is: Basic calculations to describe your dataset (e.g., averages, spread).
Use case: Understand the distribution of customer ages in a survey.
# Example: Summarize car fuel efficiency (mpg) in mtcars
mean(mtcars$mpg) # Average miles per gallon
sd(mtcars$mpg) # Standard deviation (how spread out values are)
summary(mtcars$mpg) # Five-number summary (min, Q1, median, Q3, max)
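Grouped summaries and robust measures of spread are common next steps; a minimal sketch using base R (the grouping by cyl is just an illustration):

```r
# Group-wise summary: mean mpg by number of cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Spread measures that are robust to outliers
median(mtcars$mpg)                  # Middle value
IQR(mtcars$mpg)                     # Interquartile range (Q3 - Q1)
quantile(mtcars$mpg, c(0.1, 0.9))   # 10th and 90th percentiles
```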
2. Statistical Testing: Answering Questions with Data
What it is: Methods to test hypotheses (e.g., "Does Drug A work better than Drug B?").
Use case: Compare exam scores between two student groups.
# Two-sample t-test: Compare mpg of automatic (am=0) vs. manual (am=1) cars
t.test(mpg ~ am, data = mtcars) # Welch's t-test by default (no equal-variance assumption)
# Interpretation:
# If p-value < 0.05, mean mpg differs significantly between the two transmission types.
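When the normality assumption behind the t-test is doubtful, a rank-based test is a common fallback; a minimal sketch:

```r
# Nonparametric alternative: Wilcoxon rank-sum (Mann-Whitney) test
wilcox.test(mpg ~ am, data = mtcars)

# The group means behind the comparison: am=0 (automatic) vs. am=1 (manual)
tapply(mtcars$mpg, mtcars$am, mean)
```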
3. Linear Regression: Predicting Relationships
What it is: Models how a variable changes with others (e.g., "How does weight affect fuel efficiency?").
Use case: Predict house prices based on size and location.
# Predict mpg using car weight (wt)
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
# Key output:
# Estimate (wt): -5.34 → For every 1000-lb increase, mpg drops by ~5.34.
# R-squared: 0.75 → 75% of mpg variation is explained by weight.
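A fitted model is mostly useful for prediction and diagnostics; a short sketch that refits the same model (the 3000-lb car is a hypothetical input):

```r
model <- lm(mpg ~ wt, data = mtcars)

# Predict mpg for a hypothetical 3000-lb car (wt is in units of 1000 lbs)
predict(model, newdata = data.frame(wt = 3), interval = "prediction")

# Diagnostic plots: residuals vs. fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(model)
```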
4. GLMs: Beyond Normal Distributions
What it is: Extends regression to binary/count data (e.g., "Does smoking affect cancer risk?").
Use case: Model yes/no outcomes like customer churn.
# Logistic regression: Predict Titanic survival
# Note: the built-in Titanic dataset is a contingency table, so convert it
# to a data frame and weight each row by its cell count (Freq)
titanic_df <- as.data.frame(Titanic)
model <- glm(Survived ~ Age + Sex,
             data = titanic_df,
             family = binomial,
             weights = Freq)
summary(model)
# Interpretation:
# SexFemale estimate is positive → females had higher survival odds
# (holding age group constant; Male is the reference level).
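Logistic coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are easier to read. A minimal sketch (refitting on the built-in Titanic table, converted to a data frame and weighted by cell counts):

```r
titanic_df <- as.data.frame(Titanic)  # built-in Titanic is a contingency table
model <- glm(Survived ~ Age + Sex, data = titanic_df,
             family = binomial, weights = Freq)

exp(coef(model))     # Odds ratios: values > 1 raise the odds of survival
exp(confint(model))  # 95% confidence intervals on the odds-ratio scale
```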
5. Mixed Models: Handling Grouped Data
What it is: Accounts for nested data (e.g., students in schools, repeated measurements).
Use case: Analyze test scores across different schools.
# Model reaction times across days (accounting for subject differences)
library(lme4)
model <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
summary(model)
# Interpretation:
# Days estimate: 10.47 → Reaction time increases ~10ms/day on average.
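The random intercept above lets each subject's baseline reaction time vary; a natural extension is to let the Days slope vary per subject too. A sketch, with a likelihood-ratio comparison of the two models:

```r
library(lme4)

# Random intercept only: subjects differ in baseline reaction time
model1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Random intercept AND slope: subjects also differ in the Days effect
model2 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

anova(model1, model2)  # Likelihood-ratio test (refits with ML)
```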
6. Time Series: Analyzing Trends Over Time
What it is: Methods for data collected sequentially (e.g., stock prices, temperature).
Use case: Forecast monthly sales.
# AirPassengers dataset: Monthly airline passengers (1949-1960)
plot(AirPassengers) # Clear upward trend and seasonality
# Forecast next 2 years
library(forecast)
model <- auto.arima(AirPassengers)
forecast(model, h = 24) |> plot()
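Before fitting a forecasting model, it often helps to split a series into trend, seasonal, and remainder components; a minimal sketch using base R only:

```r
# Classical decomposition: multiplicative, since seasonal swings grow
# with the trend in AirPassengers
plot(decompose(AirPassengers, type = "multiplicative"))

# STL decomposition on the log scale is a more flexible alternative
plot(stl(log(AirPassengers), s.window = "periodic"))
```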
7. Survival Analysis: Time-to-Event Data
What it is: Models how long until an event (e.g., machine failure, patient recovery).
Use case: Compare drug efficacy in clinical trials.
# Kaplan-Meier plot: Compare survival by sex
library(survival)
fit <- survfit(Surv(time, status) ~ sex, data = lung)
plot(fit, col = c("red", "blue"))
legend("topright", c("male (sex=1)", "female (sex=2)"), col = c("red", "blue"), lty = 1)
# Cox model: Quantify risk factors
coxph(Surv(time, status) ~ age + sex, data = lung)
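Cox coefficients are log hazard ratios; exponentiating them gives hazard ratios, which are the usual way to report risk. A short sketch refitting the model above:

```r
library(survival)
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)

exp(coef(cox))  # Hazard ratios: sex HR < 1 → females (sex=2) have lower hazard
summary(cox)    # Includes hazard ratios with 95% confidence intervals
```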
8. Machine Learning: Predictive Modeling
What it is: Algorithms that learn patterns to make predictions.
Use case: Classify iris flowers based on measurements.
# Random Forest: Predict species using flower traits
library(randomForest)
set.seed(42) # forests are random; fix the seed for reproducible results
model <- randomForest(Species ~ ., data = iris)
table(iris$Species, predict(model)) # Out-of-bag confusion matrix (~95% accuracy)
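Beyond raw accuracy, a random forest reports which predictors matter most; a minimal sketch (the seed is added for reproducibility):

```r
library(randomForest)
set.seed(42)
model <- randomForest(Species ~ ., data = iris)

importance(model)  # Mean decrease in Gini impurity per predictor
varImpPlot(model)  # For iris, the petal measurements dominate
```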
9. Bayesian Methods: Probabilistic Reasoning
What it is: Updates beliefs about parameters using data (vs. fixed "true" values).
Use case: Estimate uncertainty in A/B test results.
# Bayesian linear regression
library(rstanarm)
model <- stan_glm(mpg ~ wt, data = mtcars)
summary(model)
# Interpretation:
# wt posterior mean: -5.34 (95% credible interval: [-6.5, -4.2])
# → Given the model, there is a 95% posterior probability that the effect
#   of weight on mpg lies between -6.5 and -4.2.
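Posterior draws can answer questions a p-value cannot, such as the probability that an effect is negative; a sketch assuming rstanarm's default priors (refresh = 0 just silences the sampler's progress output):

```r
library(rstanarm)
model <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)

posterior_interval(model, prob = 0.95)  # Credible intervals per parameter

# Posterior probability that the wt effect is negative
draws <- as.matrix(model)
mean(draws[, "wt"] < 0)
```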
10. Multivariate Techniques: High-Dimensional Data
What it is: Methods for datasets with many variables (e.g., reducing dimensions, clustering).
Use case: Group customers by purchasing behavior.
# PCA: Reduce 4 flower measurements to 2 key dimensions
pca <- prcomp(iris[,1:4], scale. = TRUE)
biplot(pca) # Shows how variables contribute to components
# K-means clustering: Group irises into 3 clusters
kmeans(iris[,1:4], centers = 3)$cluster
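Two common follow-ups: check how much variance the leading components capture, and compare the clusters to known labels. A minimal sketch (nstart and the seed are added for stability, since k-means starting points are random):

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)  # Cumulative proportion row: first two PCs cover ~96% of variance

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(iris$Species, km$cluster)  # How well clusters recover the species
```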