Machine Learning with R: A Complete Guide
This tutorial covers machine learning in R from foundational concepts to advanced techniques, with practical examples and clear explanations.
1. Introduction to Machine Learning
What it is: Algorithms that learn patterns from data to make predictions or decisions without explicit programming.
Key Concepts:
- Supervised Learning: Predict outcomes (classification/regression)
- Unsupervised Learning: Find patterns (clustering/dimensionality reduction)
- Model Evaluation: Metrics to assess performance
# Essential packages
install.packages(c("caret", "randomForest", "e1071", "xgboost", "keras",
                   "gbm", "ranger", "kernlab", "Rtsne", "pROC",
                   "caretEnsemble", "recipes", "plumber"))
library(caret) # Unified ML interface
2. Data Preprocessing
Why it matters: Better data means better models. Data preparation typically consumes the majority of an ML project's time (the oft-cited figure is 80%).
Key Steps:
- Handling missing values
- Feature scaling
- Categorical encoding
# Using caret's preProcess() (knnImpute fills missing values; iris has none, so it is a no-op here)
data(iris)
preproc <- preProcess(iris[,1:4],
method = c("center", "scale", "knnImpute"))
processed_data <- predict(preproc, iris[,1:4])
# One-hot encoding
dummy_vars <- dummyVars(~ Species, data = iris)
encoded_data <- predict(dummy_vars, iris)
3. Supervised Learning
3.1 Classification
Goal: Predict categorical outcomes (e.g., spam/not spam)
# Train/test split
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]
# Random Forest
model_rf <- train(Species ~ ., data = train, method = "rf")
predictions <- predict(model_rf, test)
confusionMatrix(predictions, test$Species)
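Beyond hard class labels, caret models can also return class probabilities, which the ROC analysis in Section 5 builds on:
# Predicted class probabilities (one column per class)
head(predict(model_rf, test, type = "prob"))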
3.2 Regression
Goal: Predict continuous values (e.g., house prices)
# Linear regression
model_lm <- train(mpg ~ ., data = mtcars, method = "lm")
summary(model_lm)
# Gradient Boosting
model_gbm <- train(mpg ~ ., data = mtcars, method = "gbm", verbose = FALSE)
predict(model_gbm, newdata = mtcars[1:3, ])
4. Unsupervised Learning
4.1 Clustering
Goal: Group similar data points (e.g., customer segmentation)
# K-means clustering (nstart > 1 restarts guard against poor local optima)
set.seed(123)
kmeans_result <- kmeans(iris[,1:4], centers = 3, nstart = 25)
table(iris$Species, kmeans_result$cluster)
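In practice the number of clusters is rarely known in advance. One common heuristic, sketched below, is to plot the total within-cluster sum of squares against k and look for an "elbow":
# Elbow heuristic for choosing k
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(iris[,1:4], centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")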
# Hierarchical clustering
dist_matrix <- dist(iris[,1:4])
hclust_result <- hclust(dist_matrix, method = "ward.D2")
plot(hclust_result)
4.2 Dimensionality Reduction
Goal: Reduce features while preserving information
# PCA
pca_result <- prcomp(iris[,1:4], scale. = TRUE)
summary(pca_result)
biplot(pca_result)
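The scores in pca_result$x are the reduced features; the first two components can stand in for all four measurements:
# Plot observations in the space of the first two principal components
plot(pca_result$x[,1:2], col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2")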
# t-SNE (for visualization; iris contains duplicate rows, which Rtsne rejects by default)
library(Rtsne)
tsne_result <- Rtsne(iris[,1:4], perplexity = 30, check_duplicates = FALSE)
plot(tsne_result$Y, col = iris$Species)
5. Model Evaluation
Key Metrics:
- Classification: Accuracy, Precision, Recall, F1, ROC-AUC
- Regression: RMSE, R², MAE (see the postResample() example below)
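The regression metrics can all be computed in one call with caret's postResample(); a minimal sketch, assuming the model_lm fit from Section 3.2:
# RMSE, R-squared, and MAE in one call
preds <- predict(model_lm, mtcars)
postResample(pred = preds, obs = mtcars$mpg)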
# Cross-validation
ctrl <- trainControl(method = "cv", number = 10)
model <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
model
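Cross-validated fits can also be compared with caret's resamples(); for strictly comparable folds, set the same seed before each train() call (or pass shared indices via trainControl's index argument):
# Compare two models on their CV performance
set.seed(123)
model_svm <- train(Species ~ ., data = iris, method = "svmRadial",
                   trControl = ctrl)
summary(resamples(list(rf = model, svm = model_svm)))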
# ROC curve (pROC needs a two-class outcome, so score one class against the rest)
library(pROC)
probs <- predict(model_rf, test, type = "prob")
roc_curve <- roc(response = test$Species == "virginica",
                 predictor = probs$virginica)
plot(roc_curve)
6. Advanced Techniques
6.1 Ensemble Methods
Why: Combine models to improve performance
# XGBoost (Gradient Boosting)
library(xgboost)
model_xgb <- train(Species ~ ., data = iris, method = "xgbTree")
varImp(model_xgb)
# Stacking models (classic caretEnsemble stacks with a binomial GLM,
# so recode iris as a two-class problem; "glm" cannot fit 3 classes)
library(caretEnsemble)
iris_bin <- data.frame(iris[,1:4],
                       is_virginica = factor(ifelse(iris$Species == "virginica",
                                                    "yes", "no")))
models <- caretList(is_virginica ~ ., data = iris_bin,
                    trControl = trainControl(method = "cv",
                                             classProbs = TRUE,
                                             savePredictions = "final"),
                    methodList = c("rf", "glm", "svmRadial"))
ensemble <- caretEnsemble(models)
summary(ensemble)
6.2 Deep Learning
When: Complex patterns in unstructured data (images, text)
# Neural Networks with Keras
library(keras)
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = 'relu', input_shape = c(4)) %>%
layer_dense(units = 3, activation = 'softmax')
model %>% compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics = c('accuracy')
)
history <- model %>% fit(
x = as.matrix(iris[,1:4]),
y = to_categorical(as.numeric(iris$Species)-1),
epochs = 50,
batch_size = 5
)
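After training, evaluate() reports the loss and accuracy; here it is run on the training data for brevity, though a held-out set is preferable:
# Evaluate loss and accuracy
model %>% evaluate(
  as.matrix(iris[,1:4]),
  to_categorical(as.numeric(iris$Species) - 1)
)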
7. Hyperparameter Optimization
Goal: Find the best model settings
# Grid search
tuneGrid <- expand.grid(
mtry = c(2, 3, 4),
splitrule = c("gini", "extratrees"),
min.node.size = c(1, 5, 10)
)
model <- train(Species ~ .,
data = iris,
method = "ranger",
tuneGrid = tuneGrid,
trControl = trainControl(method = "cv", number = 5))
plot(model)
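When the grid grows large, random search samples hyperparameter combinations instead of enumerating them all; a sketch using the same ranger model:
# Random search: tuneLength sets how many random combinations to try
ctrl_rand <- trainControl(method = "cv", number = 5, search = "random")
model_rand <- train(Species ~ ., data = iris, method = "ranger",
                    tuneLength = 10, trControl = ctrl_rand)
model_rand$bestTune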
8. Building ML Pipelines
Why: Automate preprocessing + modeling workflows
# ML Pipeline with recipes
library(recipes)
rec <- recipe(Species ~ ., data = iris) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)
model <- train(rec,
               data = iris,
               method = "rf",
               trControl = trainControl(method = "cv"))
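To inspect what the recipe actually produces, prep() estimates each step from the training data and bake() applies them:
# Inspect the transformed training set
prepped <- prep(rec, training = iris)
head(bake(prepped, new_data = iris))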
9. Model Deployment
Options:
- R Shiny apps
- Plumber APIs
- Export to PMML
# Save/load models
saveRDS(model_rf, "rf_model.rds")
loaded_model <- readRDS("rf_model.rds")
# Simple prediction API with plumber
# plumber.R:
loaded_model <- readRDS("rf_model.rds")  # load the model once at startup
#* @post /predict
#* @param Sepal.Length numeric
#* @param Sepal.Width numeric
#* @param Petal.Length numeric
#* @param Petal.Width numeric
function(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) {
new_data <- data.frame(
Sepal.Length = as.numeric(Sepal.Length),
Sepal.Width = as.numeric(Sepal.Width),
Petal.Length = as.numeric(Petal.Length),
Petal.Width = as.numeric(Petal.Width)
)
predict(loaded_model, new_data)
}
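To serve the API locally, a minimal sketch assuming the code above is saved as plumber.R in the working directory:
# Launch the API on port 8000
library(plumber)
pr("plumber.R") %>% pr_run(port = 8000)
# Then: curl -X POST "http://localhost:8000/predict?Sepal.Length=5.1&Sepal.Width=3.5&Petal.Length=1.4&Petal.Width=0.2"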