
Machine Learning with R: A Complete Guide

This tutorial covers machine learning in R from foundational concepts to advanced techniques, with practical examples and clear explanations.

1. Introduction to Machine Learning

What it is: Algorithms that learn patterns from data to make predictions or decisions without being explicitly programmed for the task.

Key Concepts:

  • Supervised Learning: Predict outcomes (classification/regression)
  • Unsupervised Learning: Find patterns (clustering/dimensionality reduction)
  • Model Evaluation: Metrics to assess performance
# Essential packages
install.packages(c("caret", "randomForest", "e1071", "xgboost", "keras"))
library(caret)  # Unified ML interface

2. Data Preprocessing

Why it matters: model quality is capped by data quality, and practitioners commonly report that preprocessing consumes the majority of a project's time.

Key Steps:

  • Handling missing values
  • Feature scaling
  • Categorical encoding
# Using caret's preProcess()
data(iris)
# Note: iris has no missing values, so knnImpute is a no-op here
# (see the imputation sketch below for it actually filling NAs)
preproc <- preProcess(iris[,1:4],
                      method = c("center", "scale", "knnImpute"))
processed_data <- predict(preproc, iris[,1:4])
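
Since iris is complete, that knnImpute step has nothing to do. The sketch below punches a few artificial NAs into a copy of the data so the imputation is visible (the 10-hole setup is purely illustrative):

# Demonstrate imputation by introducing NAs into a copy of iris
iris_na <- iris[,1:4]
set.seed(42)
iris_na[sample(nrow(iris_na), 10), "Sepal.Length"] <- NA
imputer <- preProcess(iris_na, method = "knnImpute")  # also centers/scales
iris_imputed <- predict(imputer, iris_na)
sum(is.na(iris_imputed))  # 0: every hole was filled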

# One-hot encoding
dummy_vars <- dummyVars(~ Species, data = iris)
encoded_data <- predict(dummy_vars, iris)

3. Supervised Learning

3.1 Classification

Goal: Predict categorical outcomes (e.g., spam/not spam)

# Train/test split
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]

# Random Forest
model_rf <- train(Species ~ ., data = train, method = "rf")
predictions <- predict(model_rf, test)
confusionMatrix(predictions, test$Species)

3.2 Regression

Goal: Predict continuous values (e.g., house prices)

# Linear regression
model_lm <- train(mpg ~ ., data = mtcars, method = "lm")
summary(model_lm)

# Gradient Boosting
model_gbm <- train(mpg ~ ., data = mtcars, method = "gbm", verbose = FALSE)
predict(model_gbm, newdata = mtcars[1:3, ])

4. Unsupervised Learning

4.1 Clustering

Goal: Group similar data points (e.g., customer segmentation)

# K-means clustering (nstart > 1 avoids poor random starts)
set.seed(123)
kmeans_result <- kmeans(iris[,1:4], centers = 3, nstart = 25)
table(iris$Species, kmeans_result$cluster)
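
Choosing centers = 3 here uses prior knowledge of the three species. When the right k is unknown, a common heuristic is the elbow method, sketched below: plot the total within-cluster sum of squares across candidate k values and look for the bend.

# Elbow method: run k-means for k = 1..10 and track compactness
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(iris[,1:4], centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "k (clusters)",
     ylab = "Total within-cluster SS")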

# Hierarchical clustering
dist_matrix <- dist(iris[,1:4])
hclust_result <- hclust(dist_matrix, method = "ward.D2")
plot(hclust_result)
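
To turn the dendrogram into concrete group assignments, cut it at a chosen number of clusters with cutree():

# Cut the tree into 3 clusters and compare with the species labels
clusters <- cutree(hclust_result, k = 3)
table(iris$Species, clusters)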

4.2 Dimensionality Reduction

Goal: Reduce features while preserving information

# PCA
pca_result <- prcomp(iris[,1:4], scale. = TRUE)  # scale. = TRUE standardizes each feature
summary(pca_result)
biplot(pca_result)

# t-SNE (for visualization)
library(Rtsne)
# iris contains a duplicated row, which Rtsne rejects by default
tsne_result <- Rtsne(as.matrix(iris[,1:4]), perplexity = 30,
                     check_duplicates = FALSE)
plot(tsne_result$Y, col = iris$Species)

5. Model Evaluation

Key Metrics:

  • Classification: Accuracy, Precision, Recall, F1, ROC-AUC
  • Regression: RMSE, R², MAE (see the postResample() sketch after the cross-validation example below)
# Cross-validation
ctrl <- trainControl(method = "cv", number = 10)
model <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
model
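
For the regression metrics in the list above, caret's postResample() reports RMSE, R², and MAE in one call; a minimal sketch reusing model_lm from Section 3.2:

# RMSE, R-squared and MAE for the linear model from Section 3.2
lm_pred <- predict(model_lm, newdata = mtcars)
postResample(pred = lm_pred, obs = mtcars$mpg)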

# ROC curve (requires a binary outcome and class probabilities)
library(pROC)
probs <- predict(model_rf, test, type = "prob")
# One-vs-rest: "virginica" (1) against the other two species (0)
roc_curve <- roc(response = as.numeric(test$Species == "virginica"),
                 predictor = probs$virginica)
plot(roc_curve)
auc(roc_curve)

6. Advanced Techniques

6.1 Ensemble Methods

Why: Combine models to improve performance

# XGBoost (Gradient Boosting)
library(xgboost)
model_xgb <- train(Species ~ ., data = iris, method = "xgbTree")
varImp(model_xgb)

# Stacking models (caretEnsemble handles binary classification or
# regression, so we reduce iris to a two-class problem first)
library(caretEnsemble)
iris2 <- droplevels(subset(iris, Species != "setosa"))
ctrl <- trainControl(method = "cv", classProbs = TRUE,
                     savePredictions = "final")
models <- caretList(Species ~ ., data = iris2,
                    trControl = ctrl,
                    methodList = c("rf", "glm", "svmRadial"))
ensemble <- caretEnsemble(models)
summary(ensemble)

6.2 Deep Learning

When: Complex patterns in unstructured data (images, text)

# Neural Networks with Keras
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = 'relu', input_shape = c(4)) %>%
  layer_dense(units = 3, activation = 'softmax')

model %>% compile(
  optimizer = 'adam',
  loss = 'categorical_crossentropy',
  metrics = c('accuracy')
)

history <- model %>% fit(
  x = as.matrix(iris[,1:4]),
  y = to_categorical(as.numeric(iris$Species)-1),
  epochs = 50,
  batch_size = 5
)
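
keras's evaluate() returns the compiled loss and metrics. For brevity this sketch scores the training data; a real workflow would evaluate a held-out test set:

# Loss and accuracy (use held-out data in practice)
model %>% evaluate(
  as.matrix(iris[,1:4]),
  to_categorical(as.numeric(iris$Species) - 1)
)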

7. Hyperparameter Optimization

Goal: Find the best model settings

# Grid search
tuneGrid <- expand.grid(
  mtry = c(2, 3, 4),
  splitrule = c("gini", "extratrees"),
  min.node.size = c(1, 5, 10)
)

model <- train(Species ~ .,
              data = iris,
              method = "ranger",
              tuneGrid = tuneGrid,
              trControl = trainControl(method = "cv", number = 5))
plot(model)
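
The winning combination is stored on the fitted object:

model$bestTune  # best mtry / splitrule / min.node.size found by the search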

8. Building ML Pipelines

Why: Automate preprocessing + modeling workflows

# ML Pipeline with recipes
library(recipes)
# Named "rec" so it doesn't mask the recipe() function
rec <- recipe(Species ~ ., data = iris) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)

model <- train(rec,
               data = iris,
               method = "rf",
               trControl = trainControl(method = "cv"))
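
Outside of train(), recipes can be inspected directly: prep() estimates the preprocessing parameters and bake() applies them. A quick sketch using rec from above:

# Estimate the recipe on iris and look at the transformed output
prepped <- prep(rec, training = iris)
head(bake(prepped, new_data = iris))  # two PCA components plus Species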

9. Model Deployment

Options:

  • R Shiny apps (a minimal sketch follows the plumber example below)
  • Plumber APIs
  • Export to PMML
# Save/load models
saveRDS(model_rf, "rf_model.rds")
loaded_model <- readRDS("rf_model.rds")

# Simple prediction API with plumber
# plumber.R (the file must load the model itself):
loaded_model <- readRDS("rf_model.rds")

#* @post /predict
#* @param Sepal.Length numeric
#* @param Sepal.Width numeric
#* @param Petal.Length numeric
#* @param Petal.Width numeric
function(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) {
  new_data <- data.frame(
    Sepal.Length = as.numeric(Sepal.Length),
    Sepal.Width = as.numeric(Sepal.Width),
    Petal.Length = as.numeric(Petal.Length),
    Petal.Width = as.numeric(Petal.Width)
  )
  predict(loaded_model, new_data)
}

# Launch the API: plumber::plumb("plumber.R")$run(port = 8000)
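
The Shiny route from the options list can be similarly compact. A minimal sketch (hypothetical single-file app, assuming the rf_model.rds saved above):

# app.R: interactive predictions from the saved random forest
library(shiny)
model <- readRDS("rf_model.rds")

ui <- fluidPage(
  numericInput("sl", "Sepal.Length", 5.1),
  numericInput("sw", "Sepal.Width", 3.5),
  numericInput("pl", "Petal.Length", 1.4),
  numericInput("pw", "Petal.Width", 0.2),
  textOutput("pred")
)

server <- function(input, output) {
  output$pred <- renderText({
    new_data <- data.frame(Sepal.Length = input$sl, Sepal.Width = input$sw,
                           Petal.Length = input$pl, Petal.Width = input$pw)
    paste("Predicted species:", predict(model, new_data))
  })
}

shinyApp(ui, server)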