Scikit-learn Reference
About Scikit-learn Reference
This Scikit-learn Reference is a searchable cheat sheet covering the most important classes and functions in the scikit-learn machine learning library. It is organized into eight categories — Classification, Regression, Clustering, Preprocessing, Feature Selection, Model Evaluation, Pipeline, and Hyperparameter Tuning — giving you instant access to the API patterns that form the backbone of any ML workflow in Python.
Scikit-learn provides a consistent estimator interface: every model exposes fit(), predict(), and score() methods, making it straightforward to swap algorithms without restructuring your code. This reference captures that uniformity by showing side-by-side examples of RandomForestClassifier, LogisticRegression, SVC, GradientBoostingClassifier, and more, along with their regression counterparts. You can compare API signatures and default hyperparameters at a glance, reducing the time spent jumping between documentation pages.
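A minimal sketch of that uniformity, using synthetic data: any classifier in the list can be dropped in without changing the surrounding loop, because all of them expose the same fit/predict/score methods.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Swapping algorithms means swapping entries in this list -- nothing else changes.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(),
]
scores = {}
for model in models:
    model.fit(X_train, y_train)                       # identical call for every estimator
    scores[type(model).__name__] = model.score(X_test, y_test)
```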
Beyond individual estimators, the reference covers the preprocessing and evaluation infrastructure that surrounds model training. Entries for StandardScaler, MinMaxScaler, OneHotEncoder, and SimpleImputer address data preparation. Metrics like accuracy_score, classification_report, confusion_matrix, and r2_score evaluate model performance. Pipeline and ColumnTransformer prevent data leakage by encapsulating preprocessing and modeling into a single reproducible unit, while GridSearchCV and RandomizedSearchCV automate hyperparameter optimization with cross-validation.
Key Features
- Classification algorithms — RandomForest, LogisticRegression, SVC, KNN, GradientBoosting, and DecisionTree with fit/predict API
- Regression models — LinearRegression, Ridge, Lasso, ElasticNet, RandomForestRegressor, and SVR
- Clustering methods — KMeans, DBSCAN, AgglomerativeClustering, and silhouette_score for cluster quality evaluation
- Preprocessing tools — StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, train_test_split, and SimpleImputer
- Feature selection techniques — SelectKBest, PCA for dimensionality reduction, RFE, and feature_importances_ inspection
- Model evaluation metrics — accuracy_score, classification_report, confusion_matrix, cross_val_score, mean_squared_error, and r2_score
- Pipeline construction — Pipeline, make_pipeline, ColumnTransformer for mixed-type data, and FeatureUnion
- Hyperparameter tuning — GridSearchCV, RandomizedSearchCV, best_params_, learning_curve, validation_curve, and model persistence with joblib
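As one sketch of the clustering entries above (the blob data here is synthetic, chosen only so the clusters are easy to find): KMeans assigns labels, and silhouette_score summarizes cluster quality.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs, for illustration only.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

# Silhouette ranges from -1 (poor separation) to +1 (well-separated clusters).
score = silhouette_score(X, labels)
```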
Frequently Asked Questions
What version of scikit-learn does this reference cover?
The examples use the scikit-learn 1.x API. Specific details like sparse_output=False in OneHotEncoder (replacing sparse in older versions) reflect the current API. All examples are compatible with scikit-learn 1.2+ and Python 3.9+.
How do I choose between RandomForest and GradientBoosting?
RandomForest trains trees independently in parallel, making it faster to fit and less prone to overfitting on noisy data. GradientBoosting trains trees sequentially, each correcting the errors of the previous one, which often yields higher accuracy but is slower and more sensitive to hyperparameters. Start with RandomForest for a quick baseline, then try GradientBoosting (or XGBoost/LightGBM) when you need to squeeze out more performance.
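The baseline-first workflow described above can be sketched like this (synthetic data; the hyperparameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RandomForest: independent trees, averaged -- a quick, robust baseline.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# GradientBoosting: sequential trees, each fit to the previous ensemble's errors;
# learning_rate and n_estimators interact, so it typically needs more tuning.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=0).fit(X_train, y_train)

rf_acc = rf.score(X_test, y_test)
gb_acc = gb.score(X_test, y_test)
```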
When should I use StandardScaler vs MinMaxScaler?
StandardScaler centers data to zero mean and unit variance, which works well when features are approximately normally distributed and is required by algorithms sensitive to feature magnitudes (SVM, logistic regression, PCA). MinMaxScaler rescales features to a [0, 1] range and is useful when you want bounded values or when the data has no strong outliers.
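The two transformations side by side, on a small made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])

# StandardScaler: each column ends up with zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is rescaled to the [0, 1] range.
X_mm = MinMaxScaler().fit_transform(X)
```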
What is the purpose of a Pipeline in scikit-learn?
A Pipeline chains preprocessing steps and a final estimator into a single object that exposes fit() and predict(). This prevents data leakage by ensuring that transformations like scaling or encoding are fit only on training data and applied consistently to test data. It also simplifies cross-validation and hyperparameter search because the entire workflow is treated as one estimator.
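A minimal sketch of that pattern: the scaler is fit only on the training split inside pipe.fit(), and the same fitted transformation is reused at score time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),          # fit on X_train only, inside pipe.fit()
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)                # one fit() covers the whole workflow
acc = pipe.score(X_test, y_test)          # transform + predict applied consistently
```

Because the pipeline behaves as a single estimator, it can be passed directly to cross_val_score or GridSearchCV without leaking test-fold statistics into the scaler.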
How does cross_val_score differ from a manual train/test split?
cross_val_score performs k-fold cross-validation, splitting the data into k folds and training/evaluating k times, each time using a different fold as the test set. This provides a more robust estimate of model performance than a single train/test split because every data point appears in the test set exactly once, reducing variance in the evaluation metric.
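For example, with k=5 the call below performs five fits, each scored on a different held-out fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Five scores, one per fold; report their mean and spread rather than a single number.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score, score_std = scores.mean(), scores.std()
```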
What is the difference between GridSearchCV and RandomizedSearchCV?
GridSearchCV exhaustively tries every combination of hyperparameters in the specified grid, which guarantees finding the best combination but becomes computationally expensive as the grid grows. RandomizedSearchCV samples a fixed number of random combinations from the parameter distributions, making it much faster for large search spaces while still finding good configurations.
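The contrast in a small sketch (grid and distributions are illustrative): the grid search fits every combination per fold, while the randomized search fits exactly n_iter sampled candidates regardless of how large the search space is.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Exhaustive: 2 x 2 = 4 candidates, each cross-validated.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
).fit(X, y)

# Randomized: only n_iter=4 candidates sampled from the distributions,
# even though the space of combinations is far larger.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": randint(50, 300), "max_depth": [3, 5, None]},
    n_iter=4, cv=3, random_state=0,
).fit(X, y)

best_grid = grid.best_params_
best_rand = rand.best_params_
```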
How do I handle categorical features in scikit-learn?
Use LabelEncoder for target variables (it maps class labels to integer codes) and OneHotEncoder for input features with no ordinal relationship. For mixed datasets with both numeric and categorical columns, wrap transformers in a ColumnTransformer to apply StandardScaler to numeric columns and OneHotEncoder to categorical columns within a single Pipeline.
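That mixed-column pattern can be sketched as follows (the DataFrame and its column names are invented for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative mixed-type data.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40_000, 55_000, 80_000, 62_000, 48_000, 52_000],
    "city": ["NY", "SF", "NY", "LA", "SF", "LA"],
})
y = [0, 1, 1, 0, 1, 0]

# Route each column group to the right transformer.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())]).fit(df, y)
preds = model.predict(df)
```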
How do I save and load a trained model?
Use joblib.dump(model, "model.pkl") to serialize the trained model to disk and joblib.load("model.pkl") to restore it. Joblib is more efficient than pickle for objects containing large NumPy arrays (like tree-based models). Always save the full Pipeline (including preprocessing) to ensure consistent predictions on new data.
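A sketch of the round trip, saving a full pipeline (a temporary directory stands in for wherever you actually keep model artifacts):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# Persist the whole pipeline so the fitted scaler travels with the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipe, path)

restored = joblib.load(path)
predictions_match = bool((restored.predict(X) == pipe.predict(X)).all())
```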