scikit-learn Cheat Sheet: Functions for Machine Learning
Mastering Machine Learning with Python and scikit-learn: A Comprehensive Guide for Data Scientists and AI Enthusiasts

Introduction
It is no secret that data science and machine learning have become essential components of the modern business landscape. With the rise of artificial intelligence and the increasing demand for data-driven insights, more and more companies are turning to these tools to gain a competitive edge. Fortunately, Python has emerged as the language of choice for many data scientists, and the scikit-learn library provides a comprehensive set of tools for building and deploying machine learning models.
In this article, we will explore nearly 50 of the most useful functions and classes provided by scikit-learn for machine learning tasks. From data preprocessing to model selection and evaluation, they cover a wide range of techniques and methodologies for solving real-world problems. As if that is not enough, we will use the built-in iris dataset to illustrate each one, making it easier for you to follow along and apply them in your own projects.
Sounds fantastic? Now, for the surprise: many of these tools are easy to use and require only a few lines of code. Whether you are a seasoned data scientist or just starting out, this cheat sheet will help you become more familiar with the tools available in scikit-learn and accelerate your data science and machine learning projects.
So, grab your favorite beverage, sit back, and let’s dive into the world of scikit-learn!
train_test_split
This function is used to split a dataset into training and testing sets. It takes in the dataset, the target variable, and the size of the testing set as parameters.
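As a slightly fuller sketch of the same call, the version below also passes random_state, so the shuffle is reproducible, and stratify, so both sets keep the same class proportions. Both are existing train_test_split parameters; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.3,
    random_state=42,       # arbitrary seed so the split is reproducible
    stratify=iris.target,  # keep the three classes balanced in both sets
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # number of test samples per class
```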
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
StandardScaler
This transformer standardizes each feature by subtracting its mean and dividing by its standard deviation. It is often used to prepare data for algorithms that are sensitive to feature scale. Fit it on the training set only, then reuse the fitted scaler on the test set.
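To make the arithmetic concrete, here is a small sketch that reproduces the scaler's output by hand with NumPy. It is fitted on the full iris feature matrix purely to keep the comparison short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
scaled = StandardScaler().fit_transform(X)

# Same computation by hand: per-column mean and population standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```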
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
MinMaxScaler
This transformer rescales each feature to a given range (0 to 1 by default). It is often used to prepare data for algorithms that expect inputs within a bounded range.
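The sketch below spells out the formula the scaler applies. Because the minimum and maximum come from the training set, test values can land slightly outside the 0 to 1 range. The seed value is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test = train_test_split(load_iris().data, test_size=0.3, random_state=0)

scaler = MinMaxScaler().fit(X_train)
train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)

# Same computation by hand, using the training minimum and maximum
col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)
manual = (X_test - col_min) / (col_max - col_min)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(np.allclose(test_scaled, manual))
```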
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
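X_test_scaled = scaler.transform(X_test)
LabelEncoder
This class encodes target labels as integers from 0 to n_classes - 1. It is meant for the target variable y; for categorical input features, use OneHotEncoder or OrdinalEncoder instead. The iris targets are already integers, so here the encoding leaves them unchanged.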
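Since the iris targets are already integers, a sketch with string labels shows the effect more clearly. The species list below is a made-up example, not data from the article.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels for illustration
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(encoder.classes_)                   # classes are stored in sorted order
print(codes)                              # integer code for each label
print(encoder.inverse_transform(codes))   # back to the original strings
```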
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
OneHotEncoder
This class encodes categorical variables as one-hot (binary indicator) vectors, with one column per category. It is often used to prepare data for algorithms that cannot handle categorical values directly, and it returns a sparse matrix by default.
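A minimal sketch on a made-up color column shows what the encoded matrix looks like. The handle_unknown="ignore" option is an existing parameter that encodes unseen categories as all zeros instead of raising an error; the data itself is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature column (one column, four samples)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(colors).toarray()  # convert the sparse result to a dense array

print(encoder.categories_[0])  # categories in sorted order
print(onehot)

# A category not seen during fit becomes a row of zeros
unknown = encoder.transform([["purple"]]).toarray()
print(unknown)
```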
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# Use separate names so the integer labels in y_train_encoded stay available for the classifiers below
y_train_onehot = encoder.fit_transform(y_train.reshape(-1, 1))
y_test_onehot = encoder.transform(y_test.reshape(-1, 1))
DecisionTreeClassifier
This class creates a decision tree classifier. Its fit method takes the training data and labels.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train_scaled, y_train_encoded)
RandomForestClassifier
This class creates a random forest classifier, an ensemble of decision trees. Its fit method takes the training data and labels.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train_scaled, y_train_encoded)
KMeans
This class creates a K-means clustering model. It takes the number of clusters as a parameter and is fitted on the features alone, without labels.
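After fitting, the model exposes the cluster assignments and centers. The sketch below sets n_init and random_state explicitly (both existing KMeans parameters, with arbitrary values) so that results do not depend on version defaults or on the random initialization.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # cluster index for every sample

print(kmeans.cluster_centers_.shape)  # one center per cluster, one column per feature
print(kmeans.inertia_)                # sum of squared distances to the nearest center
print(kmeans.predict(X[:5]))          # assign samples to the learned clusters
```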
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
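kmeans.fit(X_train_scaled)
LinearRegression
This class creates an ordinary least squares linear regression model. Its fit method takes the training data and target values. The iris target is a class label, so in the regression examples it serves only to demonstrate the API.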
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train_scaled, y_train)
LogisticRegression
This class creates a logistic regression classifier. Its fit method takes the training data and labels.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train_encoded)
SVM
The SVC class creates a support vector machine classifier. Its fit method takes the training data and labels.
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train_scaled, y_train_encoded)
NaiveBayes
The GaussianNB class creates a Gaussian Naive Bayes classifier, which models each feature as normally distributed within each class. Its fit method takes the training data and labels.
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train_scaled, y_train_encoded)
GridSearchCV
This class performs an exhaustive grid search with cross-validation to find the best hyperparameters for a model. It takes the model, the hyperparameter grid, and the cross-validation strategy as parameters.
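Once the search has been fitted, the useful results live in a few attributes. The sketch below uses a trimmed grid and fixed seeds (both arbitrary choices) to keep the run short and reproducible, then reads out the best parameters and scores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Trimmed grid so the example runs quickly
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)           # best combination found
print(grid_search.best_score_)            # mean cross-validated accuracy of that combination
test_accuracy = grid_search.score(X_test, y_test)  # best model, refitted on all training data
print(test_accuracy)
```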
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [2, 4, 8]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train_encoded)
Pipeline
This class chains data preprocessing and modeling steps into a single estimator. It takes a list of tuples, where each tuple contains a name for the step and the corresponding transformer or estimator.
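One practical benefit of a pipeline is that it can be cross-validated as a unit, so the scaler is refitted on each training fold and never sees the validation fold. The sketch below illustrates this with cross_val_score; the seed is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Each fold fits the scaler and the classifier on that fold's training part only
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# Individual steps remain accessible by name after fitting
pipe.fit(X, y)
print(pipe.named_steps["scaler"].mean_)
```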
from sklearn.pipeline import Pipeline
pipe = Pipeline([ ('scaler', StandardScaler()), ('clf', RandomForestClassifier())])
pipe.fit(X_train, y_train)
PCA
This class performs principal component analysis on the dataset. It takes the number of components to keep as a parameter.
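To judge how many components are worth keeping, it helps to look at the share of variance each one explains. A minimal sketch on the standardized iris features:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                          # samples projected onto two components
print(pca.explained_variance_ratio_)        # share of variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total share kept by the projection
```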
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
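X_train_pca = pca.fit_transform(X_train_scaled)
TSNE
This class performs t-distributed stochastic neighbor embedding (t-SNE), mainly for visualizing high-dimensional data. It takes the number of dimensions to embed the data into as a parameter. Unlike PCA, it has no transform method for new data, so the embedding is computed with fit_transform.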
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_train_tsne = tsne.fit_transform(X_train_scaled)
GradientBoostingClassifier
This class creates a gradient boosting classifier, which builds trees sequentially so that each one corrects the errors of the previous ones. Its fit method takes the training data and labels.
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf.fit(X_train_scaled, y_train_encoded)
AdaBoostClassifier
This class creates an AdaBoost classifier, which reweights the training samples so that later learners focus on the examples earlier ones misclassified. Its fit method takes the training data and labels.
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
clf.fit(X_train_scaled, y_train_encoded)
Lasso
This class fits a linear regression with L1 regularization (Lasso), which can shrink some coefficients to exactly zero. Its fit method takes the training data and target values.
from sklearn.linear_model import Lasso
reg = Lasso()
reg.fit(X_train_scaled, y_train)
Ridge
This class fits a linear regression with L2 regularization (Ridge), which shrinks coefficients toward zero without eliminating them. Its fit method takes the training data and target values.
from sklearn.linear_model import Ridge
reg = Ridge()
reg.fit(X_train_scaled, y_train)
ElasticNet
This class fits a linear regression that combines L1 and L2 regularization (Elastic Net). Its fit method takes the training data and target values.
from sklearn.linear_model import ElasticNet
reg = ElasticNet()
reg.fit(X_train_scaled, y_train)
SGDClassifier
This class trains a linear classifier (a linear SVM by default) with stochastic gradient descent. Its fit method takes the training data and labels.
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
clf.fit(X_train_scaled, y_train_encoded)
KernelPCA
This class performs kernel principal component analysis, a nonlinear extension of PCA. It takes the kernel function and the number of components to keep as parameters.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(kernel='rbf', n_components=2)
X_train_kpca = kpca.fit_transform(X_train_scaled)
IsolationForest
This class creates an isolation forest model for anomaly detection. It takes the expected proportion of outliers (contamination) and the random seed as parameters.
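The fitted model labels each sample as an inlier (1) or an outlier (-1). The sketch below, fitted on the full iris features for brevity, shows how the predictions relate to the decision scores and to the contamination setting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X = load_iris().data

clf = IsolationForest(contamination=0.1, random_state=42)
pred = clf.fit_predict(X)           # 1 = inlier, -1 = outlier
scores = clf.decision_function(X)   # negative scores correspond to outliers

n_outliers = int(np.sum(pred == -1))
print(n_outliers, "of", len(X), "samples flagged as outliers")
```

With contamination=0.1, roughly 10% of the training samples end up flagged.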
from sklearn.ensemble import IsolationForest
clf = IsolationForest(contamination=0.1, random_state=42)
clf.fit(X_train_scaled)
DBSCAN
This class performs density-based spatial clustering of applications with noise (DBSCAN). It takes the minimum number of samples needed to form a dense region (min_samples) and the neighborhood radius (eps) as parameters. Points that belong to no cluster are labeled -1.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(min_samples=5, eps=0.5)
dbscan.fit(X_train_scaled)
AgglomerativeClustering
This class performs hierarchical (agglomerative) clustering, which repeatedly merges the closest pair of clusters. It takes the number of clusters and the linkage method as parameters.
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg.fit(X_train_scaled)
KernelDensity
This class estimates the probability density function of the dataset using a kernel density estimator. It takes the kernel function and the bandwidth as parameters.
from sklearn.neighbors import KernelDensity
kde = KernelDensity(kernel='gaussian', bandwidth=0.1)
kde.fit(X_train_scaled)
GaussianMixture
This class fits a Gaussian mixture model to the dataset. It takes the number of components and the covariance type as parameters.
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, covariance_type='full')
gmm.fit(X_train_scaled)
NearestNeighbors
This class performs unsupervised nearest neighbor searches on the dataset. It takes the number of neighbors and the distance metric as parameters.
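Fitting only builds the search index; the actual lookup happens through kneighbors. A minimal sketch, querying with the first three iris samples:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(X)

# For each query row: distances to, and indices of, its 5 nearest training samples
distances, indices = nn.kneighbors(X[:3])

print(distances.shape, indices.shape)
print(distances[:, 0])  # the queries are training points, so the closest distance is 0
```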
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=5, metric='euclidean')
nn.fit(X_train_scaled)
KNNClassifier
The KNeighborsClassifier class creates a K-nearest neighbors classifier. Its fit method takes the training data and labels.
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
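clf.fit(X_train_scaled, y_train_encoded)
LDA
The LinearDiscriminantAnalysis class performs linear discriminant analysis, which can act as a classifier or as a supervised dimensionality reduction. It takes the number of components to keep as a parameter, which can be at most min(n_classes - 1, n_features), so 2 for iris.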
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
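X_train_lda = lda.fit_transform(X_train_scaled, y_train_encoded)
QDA
The QuadraticDiscriminantAnalysis class fits a quadratic discriminant analysis classifier, which gives each class its own covariance matrix. Its fit method takes the training data and labels.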
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
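qda.fit(X_train_scaled, y_train_encoded)
RANSACRegressor
This class performs RANSAC (random sample consensus) regression, which fits a model on random subsets of the data and keeps the fit with the most inliers, making it robust to outliers. It takes the base estimator and the maximum number of random trials (max_trials) as parameters.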
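RANSAC is easiest to appreciate on data that actually contains outliers. The sketch below uses synthetic data (a made-up line y = 3x + 2 with ten corrupted points) and passes the base estimator positionally, which sidesteps the keyword name having changed between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data for illustration: a noisy line with ten gross outliers
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.1, size=100)
y[:10] += 50.0  # corrupt the first ten targets

ransac = RANSACRegressor(LinearRegression(), max_trials=100, random_state=0)
ransac.fit(X, y)

slope = ransac.estimator_.coef_[0]
intercept = ransac.estimator_.intercept_
print(slope, intercept)          # fitted on the inliers only
print(ransac.inlier_mask_[:10])  # mask entries for the corrupted points
```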
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
# The keyword is `estimator` in current releases (it was renamed from `base_estimator` in scikit-learn 1.1)
ransac = RANSACRegressor(estimator=LinearRegression(), max_trials=100)
ransac.fit(X_train_scaled, y_train)
GradientBoostingRegressor
This class creates a gradient boosting regression model. Its fit method takes the training data and target values.
from sklearn.ensemble import GradientBoostingRegressor
reg = GradientBoostingRegressor()
reg.fit(X_train_scaled, y_train)
AdaBoostRegressor
This class creates an AdaBoost regression model. Its fit method takes the training data and target values.
from sklearn.ensemble import AdaBoostRegressor
reg = AdaBoostRegressor()
reg.fit(X_train_scaled, y_train)
SVR
This class creates a support vector regression model. Its fit method takes the training data and target values.
from sklearn.svm import SVR
reg = SVR()
reg.fit(X_train_scaled, y_train)
DecisionTreeRegressor
This class creates a decision tree regression model. Its fit method takes the training data and target values.
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor()
reg.fit(X_train_scaled, y_train)
RandomForestRegressor
This class creates a random forest regression model. Its fit method takes the training data and target values.
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
reg.fit(X_train_scaled, y_train)
PolynomialFeatures
This transformer generates polynomial and interaction features from the dataset. It takes the degree of the polynomial as a parameter.
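A tiny worked example makes the generated columns easy to see. For a single made-up sample with features a = 2 and b = 3, degree 2 produces the bias term, the original features, and all products of two features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one hypothetical sample: a = 2, b = 3

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly)  # columns: 1, a, b, a^2, a*b, b^2

# The four iris features expand to 15 columns at degree 2
iris_poly = PolynomialFeatures(degree=2).fit_transform(load_iris().data)
print(iris_poly.shape)
```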
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_scaled)
TruncatedSVD
This class performs truncated singular value decomposition on the dataset. Unlike PCA, it does not center the data first, so it also works on sparse matrices. It takes the number of components to keep as a parameter.
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2)
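X_train_svd = svd.fit_transform(X_train_scaled)
NMF
This class performs non-negative matrix factorization on the dataset. It takes the number of components to extract as a parameter. The input must be non-negative, so use the MinMaxScaler output (or the raw features) rather than standardized data.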
from sklearn.decomposition import NMF
nmf = NMF(n_components=2)
X_train_nmf = nmf.fit_transform(X_train_scaled)
Binarizer
This transformer binarizes the features based on a threshold: values above the threshold become 1 and all others become 0. It takes the threshold value as a parameter.
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.5)
X_train_binarized = binarizer.fit_transform(X_train_scaled)
LabelBinarizer
This class encodes target labels in a one-vs-all fashion, producing one binary column per class. It is often used to prepare labels for algorithms that require binary targets.
from sklearn.preprocessing import LabelBinarizer
binarizer = LabelBinarizer()
y_train_binarized = binarizer.fit_transform(y_train)
MultiLabelBinarizer
This class encodes multi-label targets, where each sample carries a set of labels, as a binary indicator matrix with one column per label. Its input must be an iterable of iterables, one collection of labels per sample.
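Because iris has exactly one label per sample, a genuinely multi-label example is more telling. The tag sets below are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical samples, each tagged with a set of labels
tags = [{"python", "ml"}, {"ml"}, {"stats", "python"}]

binarizer = MultiLabelBinarizer()
indicator = binarizer.fit_transform(tags)

print(binarizer.classes_)  # labels in sorted order, one per column
print(indicator)           # 1 where the sample carries the label
print(binarizer.inverse_transform(indicator))  # back to tuples of labels
```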
from sklearn.preprocessing import MultiLabelBinarizer
binarizer = MultiLabelBinarizer()
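# Each sample must be an iterable of labels, so wrap every iris label in a list
y_train_multilabel = binarizer.fit_transform([[label] for label in y_train])
LabelPropagation
This class performs label propagation, a semi-supervised method that spreads known labels to unlabeled samples (marked with -1 in the target) through a similarity graph. It takes the kernel function and the maximum number of iterations as parameters.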
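The method only shows its value when some labels are missing. The sketch below hides roughly 70% of the iris labels by setting them to -1 and then checks how well the propagated labels match the hidden ones. The hiding fraction, the seed, and the n_neighbors and max_iter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Hide about 70% of the labels; -1 marks a sample as unlabeled
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.7
y_partial = y.copy()
y_partial[unlabeled] = -1

propagation = LabelPropagation(kernel="knn", n_neighbors=7, max_iter=1000)
propagation.fit(X, y_partial)

# transduction_ holds the label assigned to every training sample
accuracy = (propagation.transduction_[unlabeled] == y[unlabeled]).mean()
print(accuracy)
```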
from sklearn.semi_supervised import LabelPropagation
propagation = LabelPropagation(kernel='knn', max_iter=100)
propagation.fit(X_train_scaled, y_train)
LabelSpreading
This class performs label spreading, a variant of label propagation that uses a normalized graph Laplacian and is more robust to label noise. It takes the kernel function and the maximum number of iterations as parameters.
from sklearn.semi_supervised import LabelSpreading
spreading = LabelSpreading(kernel='knn', max_iter=100)
spreading.fit(X_train_scaled, y_train)
CalibratedClassifierCV
This class calibrates the predicted probabilities of a classifier using cross-validation. It takes the base classifier, the number of folds, and the calibration method as parameters.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
calibrated_clf = CalibratedClassifierCV(clf, cv=5, method='sigmoid')
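calibrated_clf.fit(X_train_scaled, y_train_encoded)
DummyClassifier
This class creates a dummy classifier that ignores the input features and predicts using a simple strategy, such as always choosing the most frequent class. It is useful as a baseline for comparison with real models and takes the strategy as a parameter.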
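A baseline is only useful next to a real model. The sketch below compares the dummy classifier with a decision tree on a stratified split (seed chosen arbitrarily); with three equally sized classes, always predicting one class scores one third.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

baseline = dummy.score(X_test, y_test)   # accuracy of always predicting one class
tree_score = tree.score(X_test, y_test)  # accuracy of a real model
print(baseline, tree_score)
```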
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_scaled, y_train_encoded)
Conclusion
In conclusion, we have covered nearly 50 of the most useful functions and classes provided by scikit-learn for machine learning tasks. They cover a wide range of techniques and methodologies, making it easier for you to solve real-world problems and accelerate your data science projects.
Thanks for reading!
“Machine learning is the last invention that humanity will ever need to make.” Nick Bostrom
