
Statistical Analysis in R: From Basic to Advanced

R is a powerful tool for statistical analysis. This guide walks you through essential techniques—from summarizing data to advanced modeling—with practical examples.

1. Descriptive Statistics: Summarizing Data

What it is: Basic calculations to describe your dataset (e.g., averages, spread).

Use case: Understand the distribution of customer ages in a survey.

# Example: Summarize car fuel efficiency (mpg) in mtcars
mean(mtcars$mpg)      # Average miles per gallon
sd(mtcars$mpg)        # Standard deviation (how spread out values are)
summary(mtcars$mpg)   # Min, Q1, median, mean, Q3, max
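
The same idea extends to robust and grouped summaries. A minimal sketch using only base R; grouping by cyl (number of cylinders) is an illustrative choice, not something the example above prescribes:

```r
# Robust measures of center and spread for mpg
median(mtcars$mpg)                  # Middle value, less sensitive to outliers
IQR(mtcars$mpg)                     # Interquartile range (Q3 - Q1)
quantile(mtcars$mpg, c(0.1, 0.9))   # 10th and 90th percentiles

# Mean mpg within each cylinder group
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
group_means
```

Comparing the mean with the median is a quick check for skew: a large gap suggests a few extreme values are pulling the average.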

2. Statistical Testing: Answering Questions with Data

What it is: Methods to test hypotheses (e.g., "Does Drug A work better than Drug B?").

Use case: Compare exam scores between two student groups.

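# Two-sample t-test: Compare mpg of automatic vs. manual cars
# (t.test() runs Welch's test by default, which does not assume equal variances)
t.test(mpg ~ am, data = mtcars)

# Interpretation:
# If the p-value is below 0.05, the mean mpg of automatic (am=0) and manual (am=1) cars differs significantly at the 5% level.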
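
Rather than reading the printed output, you can store the test result and pull out the pieces you need. A short sketch, assuming the same mtcars comparison; the variable names res and significant are illustrative:

```r
# Store the test result instead of only printing it
res <- t.test(mpg ~ am, data = mtcars)   # Welch two-sample t-test (R's default)

res$estimate   # Mean mpg in each group (am = 0 automatic, am = 1 manual)
res$conf.int   # 95% confidence interval for the difference in means
res$p.value    # Compare against your significance level, e.g. 0.05

significant <- res$p.value < 0.05
significant
```

The confidence interval is often more informative than the p-value alone, since it shows how large the difference plausibly is.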

3. Linear Regression: Predicting Relationships

What it is: Models how a variable changes with others (e.g., "How does weight affect fuel efficiency?").

Use case: Predict house prices based on size and location.

# Predict mpg using car weight (wt)
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

# Key output:
# Estimate (wt): -5.34 → For every 1000-lb increase, mpg drops by ~5.34.
# R-squared: 0.75 → 75% of mpg variation is explained by weight.
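
Once fitted, the model can quantify uncertainty and make predictions. A minimal sketch in base R; the weights in new_cars are hypothetical values chosen for illustration:

```r
model <- lm(mpg ~ wt, data = mtcars)

# 95% confidence intervals for the intercept and slope
confint(model)

# Predicted mpg for hypothetical cars weighing 2,500 and 3,500 lb
# (wt is recorded in units of 1000 lb)
new_cars <- data.frame(wt = c(2.5, 3.5))
pred <- predict(model, newdata = new_cars, interval = "prediction")
pred
```

The "prediction" interval covers where an individual new car is likely to fall, so it is wider than a confidence interval for the average mpg at that weight.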

4. GLMs: Beyond Normal Distributions

What it is: Extends regression to binary/count data (e.g., "Does smoking affect cancer risk?").

Use case: Model yes/no outcomes like customer churn.

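# Logistic regression: Predict Titanic survival
# Note: R's built-in Titanic object is a table of counts, not passenger-level data,
# so this example uses titanic_train from the titanic package (one row per passenger).
library(titanic)
model <- glm(Survived ~ Age + Sex, 
            data = titanic_train, 
            family = binomial)
summary(model)

# Interpretation: 
# Sexmale estimate: about -2.5 on the log-odds scale → Males had lower survival odds (odds ratio exp(-2.5) ≈ 0.08), holding age constant.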
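
GLMs also cover count outcomes through the Poisson family. A minimal sketch using the built-in warpbreaks dataset (number of yarn breaks per loom); the dataset and the name count_model are chosen here for illustration:

```r
# Poisson regression: model the number of yarn breaks by wool type and tension
count_model <- glm(breaks ~ wool + tension,
                   data = warpbreaks,
                   family = poisson)
summary(count_model)

# Coefficients are on the log scale; exponentiate to get rate ratios
rate_ratios <- exp(coef(count_model))
rate_ratios
```

A rate ratio below 1 means fewer expected breaks than in the reference group. The Poisson family assumes the variance equals the mean, so if the counts are more variable than that, family = quasipoisson may be a better fit.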

5. Mixed Models: Handling Grouped Data

What it is: Accounts for nested data (e.g., students in schools, repeated measurements).

Use case: Analyze test scores across different schools.

# Model reaction times across days (accounting for subject differences)
library(lme4)
model <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
summary(model)

# Interpretation:
# Days estimate: 10.47 → Reaction time increases ~10ms/day on average.

6. Time Series: Analyzing Trends Over Time

What it is: Methods for data collected sequentially (e.g., stock prices, temperature).

Use case: Forecast monthly sales.

# AirPassengers dataset: Monthly airline passengers (1949-1960)
plot(AirPassengers)  # Clear upward trend and seasonality

# Forecast next 2 years
library(forecast)
model <- auto.arima(AirPassengers)
forecast(model, h = 24) |> plot()  # The native pipe |> requires R 4.1 or later
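
To see the trend and seasonality separately, the series can be decomposed before forecasting. A sketch using stl() from base R; taking logs first is a modeling assumption made here because the seasonal swings grow with the level of the series:

```r
# Decompose the series into seasonal, trend, and remainder components
fit <- stl(log(AirPassengers), s.window = "periodic")
plot(fit)

# Components are stored as columns of a multivariate time series
components <- fit$time.series
head(components)
```

A large, structured remainder would suggest that the trend and seasonal components are missing something.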

7. Survival Analysis: Time-to-Event Data

What it is: Models how long until an event (e.g., machine failure, patient recovery).

Use case: Compare drug efficacy in clinical trials.

# Kaplan-Meier plot: Compare survival by sex
library(survival)
fit <- survfit(Surv(time, status) ~ sex, data = lung)
plot(fit, col = c("red", "blue"))  # red = male (sex = 1), blue = female (sex = 2)

# Cox model: Quantify risk factors
coxph(Surv(time, status) ~ age + sex, data = lung)
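
The plot and the Cox model can be backed up with a formal test and hazard ratios. A minimal sketch with the survival package; the names lr, cox, and hr are illustrative:

```r
library(survival)

# Log-rank test: is the difference between the two survival curves significant?
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value

# Hazard ratios with 95% confidence intervals from the Cox model
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
hr <- exp(cbind(HR = coef(cox), confint(cox)))
hr
```

A hazard ratio below 1 indicates lower risk of the event per unit increase in that variable, and above 1 indicates higher risk.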

8. Machine Learning: Predictive Modeling

What it is: Algorithms that learn patterns to make predictions.

Use case: Classify iris flowers based on measurements.

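# Random Forest: Predict species using flower traits
library(randomForest)
set.seed(42)  # Results vary slightly between runs without a fixed seed
model <- randomForest(Species ~ ., data = iris)
table(iris$Species, predict(model))  # Out-of-bag confusion matrix; accuracy is typically around 95%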

9. Bayesian Methods: Probabilistic Reasoning

What it is: Updates beliefs about parameters using data (vs. fixed "true" values).

Use case: Estimate uncertainty in A/B test results.

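# Bayesian linear regression
library(rstanarm)
model <- stan_glm(mpg ~ wt, data = mtcars)
summary(model)
posterior_interval(model, prob = 0.95)  # 95% credible intervals

# Interpretation:
# wt posterior mean: about -5.34 (95% credible interval: roughly [-6.5, -4.2]; exact values vary slightly between MCMC runs)
# → Given the model and priors, there is a 95% probability that the effect of weight on mpg lies between -6.5 and -4.2.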
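
For the A/B testing use case, a simple Bayesian analysis needs nothing beyond base R. A minimal sketch with a Beta-Binomial model; the conversion counts and the uniform Beta(1, 1) prior are hypothetical assumptions chosen for illustration:

```r
set.seed(123)  # Make the Monte Carlo draws reproducible

# Hypothetical A/B test results (illustrative numbers, not real data)
conv_a <- 120; n_a <- 1000   # Variant A: conversions, visitors
conv_b <- 145; n_b <- 1000   # Variant B: conversions, visitors

# With a uniform Beta(1, 1) prior, the posterior for each conversion rate
# is Beta(1 + conversions, 1 + non-conversions)
draws_a <- rbeta(10000, 1 + conv_a, 1 + n_a - conv_a)
draws_b <- rbeta(10000, 1 + conv_b, 1 + n_b - conv_b)

# Posterior probability that B converts better than A
prob_b_better <- mean(draws_b > draws_a)
prob_b_better

# 95% credible interval for the difference in conversion rates
diff_interval <- quantile(draws_b - draws_a, c(0.025, 0.975))
diff_interval
```

The result is a direct probability statement ("B is better than A with probability p"), which is often easier to communicate than a p-value.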

10. Multivariate Techniques: High-Dimensional Data

What it is: Methods for datasets with many variables (e.g., reducing dimensions, clustering).

Use case: Group customers by purchasing behavior.

# PCA: Reduce 4 flower measurements to 2 key dimensions
pca <- prcomp(iris[,1:4], scale. = TRUE)
biplot(pca)  # Shows how variables contribute to components

# K-means clustering: Group irises into 3 clusters
set.seed(42)  # Cluster labels depend on the random starting centers
kmeans(iris[,1:4], centers = 3)$cluster
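
Two follow-up checks help judge these results: how much variance the first components capture, and how well the clusters line up with the known species. A sketch in base R; standardizing the data before clustering and using nstart = 25 are choices made here, not part of the example above:

```r
# Proportion of variance captured by each principal component
pca <- prcomp(iris[, 1:4], scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Compare k-means clusters with the known species labels
set.seed(42)  # k-means starts from random centers
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(Species = iris$Species, Cluster = km$cluster)
```

Cluster numbers are arbitrary labels, so read the table by looking for one dominant cluster per species rather than matching numbers.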