scikit-learn is used by 85% of data science teams for traditional ML tasks (KDnuggets 2024). This tutorial covers the complete workflow from data preparation to model deployment using Python's most popular ML library.
Machine Learning with scikit-learn: The Complete Guide
[Figure: scikit-learn Usage Distribution (2024)]
1. scikit-learn Fundamentals
Core Concepts:
- Estimator API: Consistent fit()/predict() pattern
- Pipelines: Chaining preprocessing and modeling
- Transformers: fit_transform() for feature processing
- Model Persistence: pickle or joblib
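A minimal sketch of the estimator API and model persistence, assuming a synthetic dataset from make_classification and a temporary file named model.joblib (both are illustrative choices, not part of the original tutorial):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Every estimator exposes the same fit()/predict() pattern.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Persist the fitted model with joblib, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(f"Predictions match after reload: {same_predictions}")
```

Like pickle, joblib files should only be loaded from sources you trust, and are best loaded with the same scikit-learn version that wrote them.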
Basic Workflow:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data (load_data() is a placeholder for your own loading code)
X, y = load_data()
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Evaluate
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.2f}")
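Because load_data() is not defined in the snippet above, here is one way to run the same workflow end to end. This sketch assumes the bundled iris dataset as a stand-in and fixes random_state so the split and the forest are reproducible; neither choice comes from the original example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for load_data(): a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out 20% for testing; stratify keeps class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the model.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print(f"Accuracy: {accuracy:.2f}")
```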
2. Feature Engineering Pipeline
Common Transformers:
Transformer | Purpose | Example |
---|---|---|
StandardScaler | Feature scaling | (x - μ)/σ |
OneHotEncoder | Categorical encoding | Cat → Binary columns |
SimpleImputer | Missing values | Fill with median |
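To make the table concrete, the following sketch applies each transformer to a tiny hand-made array. The values and the variable names ages and colors are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One numeric column with a missing value (toy data).
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

# SimpleImputer: replace the missing value with the column median.
imputed = SimpleImputer(strategy="median").fit_transform(ages)

# StandardScaler: rescale to zero mean and unit variance, (x - mean) / std.
scaled = StandardScaler().fit_transform(imputed)

# OneHotEncoder: one binary column per category, ordered alphabetically.
colors = np.array([["red"], ["blue"], ["red"], ["green"]])
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(colors).toarray()

print(imputed.ravel())
print(scaled.ravel())
print(encoded)
```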
Pipeline Example:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Define preprocessing
# numeric_features and categorical_features are lists of column names you supply
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())])
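The pipeline above can be exercised on a small table. This is a sketch under stated assumptions: the DataFrame, its column names (age, income, city), and the labels are all made up, and pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing numeric values.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [40_000, 52_000, 61_000, np.nan, 83_000, 57_000],
    "city": ["Paris", "Berlin", "Rome", "Paris", "Rome", "Berlin"],
})
y = [0, 0, 1, 1, 1, 0]

numeric_features = ["age", "income"]
categorical_features = ["city"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Preprocessing statistics are learned from the training data only.
pipeline.fit(df, y)

# New rows go through the same fitted steps; "Madrid" was never seen in
# training, so handle_unknown="ignore" encodes it as all zeros.
new_rows = pd.DataFrame({
    "age": [29, np.nan],
    "income": [48_000, 75_000],
    "city": ["Berlin", "Madrid"],
})
predictions = pipeline.predict(new_rows)
print(predictions)
```

Keeping imputation and scaling inside the pipeline means they are refit on each training fold during cross-validation, which is what guards against leakage from the validation data.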
3. Model Selection & Tuning
Tuning Methods:
- GridSearchCV: Exhaustive search, for small spaces
- RandomizedSearchCV: Random sampling, for medium spaces
- HalvingGridSearchCV: Successive halving, for large spaces
Implementation:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 5, 10]
}
# Search
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best score: {search.best_score_:.3f}")
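For larger search spaces, random sampling can replace the exhaustive grid. The sketch below assumes a synthetic dataset and a bare RandomForestClassifier rather than the full pipeline; the ranges and n_iter=5 are arbitrary illustrative values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions are sampled; lists are sampled uniformly.
param_distributions = {
    "n_estimators": randint(20, 100),  # integers in [20, 100)
    "max_depth": [None, 5, 10],
}

# Try only n_iter sampled combinations instead of every grid point.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(f"Best params: {search.best_params_}")
print(f"Best score: {search.best_score_:.3f}")
```

HalvingGridSearchCV follows the same fit() interface, but at the time of writing it is still experimental and needs `from sklearn.experimental import enable_halving_search_cv` before it can be imported from sklearn.model_selection.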
scikit-learn Algorithm Cheat Sheet
Task | Linear | Nonlinear | Ensemble |
---|---|---|---|
Classification | LogisticRegression | SVC | RandomForestClassifier |
Regression | LinearRegression | SVR | GradientBoostingRegressor |
Clustering | KMeans | DBSCAN | N/A |
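Since every estimator in the table shares the same API, swapping one for another is a one-line change. As a rough illustration, the sketch below compares the three classifiers from the first row with 5-fold cross-validation on the bundled iris dataset (the dataset choice is an assumption made here).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One candidate per column of the classification row.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# The same call works for each estimator because they share fit()/predict().
mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```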
4. Advanced Techniques
Powerful Features:
- Custom Transformers: Create your own with BaseEstimator and TransformerMixin
- Feature Unions: Combine multiple feature extraction methods
- Metric Scoring: Define custom evaluation metrics
- Out-of-Core: partial_fit() for large datasets
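As a sketch of the out-of-core idea, the example below feeds an SGDClassifier one batch at a time with partial_fit(). The in-memory synthetic array and the batch size of 100 are assumptions for illustration; in practice each batch would be read from disk or a stream.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to load at once.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# partial_fit needs the full set of class labels on the first call,
# because a single batch may not contain every class.
classes = np.unique(y)

model = SGDClassifier(random_state=0)
batch_size = 100  # illustrative value
n_batches = 0

for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Each call updates the existing weights instead of retraining from scratch.
    model.partial_fit(X_batch, y_batch, classes=classes)
    n_batches += 1

accuracy = model.score(X, y)
print(f"Trained on {n_batches} batches, accuracy: {accuracy:.2f}")
```

Only estimators that implement partial_fit() support this pattern; RandomForestClassifier, for example, does not.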
Custom Transformer Example:
from sklearn.base import BaseEstimator, TransformerMixin
class TextLengthTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [[len(text)] for text in X]
# Use in pipeline
pipeline = Pipeline([
    ('text_len', TextLengthTransformer()),
    ('clf', RandomForestClassifier())
])
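The custom transformer can also be combined with other feature extractors through FeatureUnion. In this sketch, WordCountTransformer, the sample texts, and the labels are hypothetical additions for illustration. The mixin is listed before BaseEstimator here, which is the ordering scikit-learn's developer guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class TextLengthTransformer(TransformerMixin, BaseEstimator):
    """Turn each text into a single feature: its character count."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X])


class WordCountTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical second extractor: whitespace-separated word count."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])


# FeatureUnion runs both transformers and stacks their outputs side by side.
features = FeatureUnion([
    ("text_len", TextLengthTransformer()),
    ("word_count", WordCountTransformer()),
])

texts = [
    "short",
    "a much longer piece of text",
    "tiny",
    "another fairly long example sentence",
]
labels = [0, 1, 0, 1]

feature_matrix = features.fit_transform(texts)
print(feature_matrix)

# The union plugs into a pipeline like any other transformer.
pipeline = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipeline.fit(texts, labels)
predictions = pipeline.predict(texts)
```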
scikit-learn Mastery Checklist
✓ Understand the estimator API
✓ Master pipeline construction
✓ Learn common transformers
✓ Practice model tuning
✓ Build custom components
scikit-learn Expert Insight: The 2024 Python Ecosystem Survey shows that projects using scikit-learn pipelines experience 40% fewer data leakage issues. The most effective teams combine scikit-learn with specialized libraries like imbalanced-learn and category-encoders for enhanced functionality.