Machine Learning with scikit-learn: The Complete Guide
scikit-learn is used by 85% of data science teams for traditional ML tasks (KDnuggets 2024). This tutorial covers the complete workflow from data preparation to model deployment using Python's most popular ML library.
*(Figure: scikit-learn usage distribution, 2024)*
1. scikit-learn Fundamentals
Core Concepts:
- Estimator API: Consistent `fit()`/`predict()` pattern
- Pipelines: Chaining preprocessing and modeling steps
- Transformers: `fit_transform()` for feature processing
- Model Persistence: `pickle` or `joblib` (see the persistence sketch below)
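As a quick illustration of model persistence with `joblib`, here is a minimal sketch; the dataset and the file name `model.joblib` are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import joblib

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

joblib.dump(model, "model.joblib")       # serialize the fitted estimator to disk
restored = joblib.load("model.joblib")   # reload it later for prediction
print(restored.predict(X[:3]))
```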
Basic Workflow:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data (load_data is a placeholder for your own loading routine)
X, y = load_data()

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.2f}")
```
2. Feature Engineering Pipeline
Common Transformers:
| Transformer | Purpose | Example |
|---|---|---|
| StandardScaler | Feature scaling | (x - μ)/σ |
| OneHotEncoder | Categorical encoding | Cat → Binary columns |
| SimpleImputer | Missing values | Fill with median |
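To see what each of these transformers does on its own, here is a minimal sketch on made-up values (it assumes scikit-learn ≥ 1.2 for OneHotEncoder's `sparse_output` parameter):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# SimpleImputer: replace the NaN with the column median (2.0 here)
print(SimpleImputer(strategy='median').fit_transform([[1.0], [2.0], [np.nan], [4.0]]))

# StandardScaler: (x - mean) / std, giving zero mean and unit variance
print(StandardScaler().fit_transform([[1.0], [2.0], [3.0]]))

# OneHotEncoder: each category becomes its own binary column
print(OneHotEncoder(sparse_output=False).fit_transform([['cat'], ['dog'], ['cat']]))
```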
Pipeline Example:
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Define preprocessing for numeric and categorical columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# numeric_features / categorical_features are lists of your column names
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Full pipeline: preprocessing and modeling behave as one estimator
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())])
```
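Because the full pipeline is itself an estimator, all preprocessing is fit on the training data only and merely applied at prediction time, which is what prevents leakage. A usage sketch, assuming `X_train`/`X_test` are split as in the workflow above:

```python
pipeline.fit(X_train, y_train)          # imputers/scalers/encoders are fit here only
print(pipeline.score(X_test, y_test))   # the same fitted preprocessing is reapplied
```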
3. Model Selection & Tuning
Tuning Methods:
| Method | Strategy | Suited For |
|---|---|---|
| GridSearchCV | Exhaustive search | Small search spaces |
| RandomizedSearchCV | Random sampling | Medium search spaces |
| HalvingGridSearchCV | Successive halving | Large search spaces |

(`HalvingGridSearchCV` is experimental and must be enabled with `from sklearn.experimental import enable_halving_search_cv` before importing it.)

Implementation:
```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid; names reach into pipeline steps via double underscores
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 5, 10]
}

# Exhaustive search with 5-fold cross-validation
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best score: {search.best_score_:.3f}")
```
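For comparison, here is a sketch of RandomizedSearchCV over the same pipeline; `n_iter` and the sampling distributions below are illustrative choices, not recommended values:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'classifier__n_estimators': randint(50, 300),   # sampled uniformly per trial
    'classifier__max_depth': [None, 5, 10, 20],     # sampled from this list
}

rand_search = RandomizedSearchCV(pipeline, param_dist, n_iter=20, cv=5,
                                 random_state=0)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)
```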
scikit-learn Algorithm Cheat Sheet
| Task | Linear | Nonlinear | Ensemble |
|---|---|---|---|
| Classification | LogisticRegression | SVC | RandomForestClassifier |
| Regression | LinearRegression | SVR | GradientBoostingRegressor |
| Clustering | KMeans | DBSCAN | N/A |
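As a quick illustration of the clustering row, both estimators follow the same consistent API; the blob data below is synthetic, and `n_init='auto'` assumes scikit-learn ≥ 1.2:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
print(KMeans(n_clusters=3, n_init='auto', random_state=0).fit_predict(X)[:10])
print(DBSCAN(eps=1.0).fit_predict(X)[:10])  # -1 marks points labeled as noise
```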
4. Advanced Techniques
Powerful Features:
- Custom Transformers: Create your own with `BaseEstimator` and `TransformerMixin`
- Feature Unions: Combine multiple feature extraction methods with `FeatureUnion`
- Custom Scoring: Define your own evaluation metrics with `make_scorer`
- Out-of-Core Learning: `partial_fit()` for datasets too large for memory (see the sketch after this list)
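A minimal out-of-core sketch using `SGDClassifier.partial_fit`; the random batches below stand in for chunks streamed from disk, and the full class list must be declared on the first call:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])                 # all possible labels, known up front
for _ in range(10):                        # stand-in for iterating over file chunks
    X_batch = np.random.rand(100, 5)
    y_batch = np.random.randint(0, 2, 100)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(np.random.rand(3, 5)))
```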
Custom Transformer Example:
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

class TextLengthTransformer(BaseEstimator, TransformerMixin):
    """Maps each input text to a single feature: its character length."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return [[len(text)] for text in X]

# Use in pipeline
pipeline = Pipeline([
    ('text_len', TextLengthTransformer()),
    ('clf', RandomForestClassifier())
])
```
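A quick smoke test of the custom pipeline on toy strings (the texts and labels are made up for illustration):

```python
texts = ["short", "a somewhat longer document", "mid", "another long example here"]
labels = [0, 1, 0, 1]

pipeline.fit(texts, labels)            # each text becomes the single feature [len]
print(pipeline.predict(["tiny", "a much much longer string of text"]))
```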
scikit-learn Mastery Checklist
✓ Understand the estimator API
✓ Master pipeline construction
✓ Learn common transformers
✓ Practice model tuning
✓ Build custom components
scikit-learn Expert Insight: The 2024 Python Ecosystem Survey shows that projects using scikit-learn pipelines experience 40% fewer data leakage issues. The most effective teams combine scikit-learn with specialized libraries like imbalanced-learn and category-encoders for enhanced functionality.