Data Science with Python — Optimizing Scikit-Learn Models
This article follows my previous one; if you're new to scikit-learn, be sure to check that one out before reading further.
In the field of data science, building accurate and reliable predictive models is crucial for making informed decisions. However, simply building a model is not enough; it is equally important to optimize and fine-tune the model to ensure it performs well on unseen data.
I will try to explain how you can optimize your scikit-learn models to solve complex data science problems.
Testing/Validation
To know whether your model needs to be optimized or not, you need to test and validate it.
Testing is the process of evaluating the performance of a model on a dataset that the model has not seen during training. This allows us to estimate the model’s performance on unseen data.
Validation is the process of evaluating a model’s performance on a separate dataset, called the validation set. This allows us to tune the model’s hyperparameters and ensure that it is not overfitting to the training data.
One method we can use to test our models is the train-test split method. It allows us to randomly split the data into a training set and a testing set. We can also create a third validation set to tune the model’s hyperparameters.
Here is an example of a train-test split with 20% of the data used for the test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X and y represent the feature and target variables respectively, and the test_size parameter specifies the proportion of the data to be used for testing.
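If you also want a dedicated validation set for hyperparameter tuning, one common approach is to call train_test_split twice. Here is a minimal sketch, assuming the same X and y as above (the ratios are an arbitrary choice that yields a 60/20/20 split):
from sklearn.model_selection import train_test_split
# first split: hold out 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2)
# second split: carve a validation set out of the remaining 80%
# (0.25 of 80% = 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25)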
We can also perform k-fold cross-validation. This method involves splitting the data into k folds, where k-1 folds are used for training and the remaining fold is used for testing. This process is repeated k times, with a different fold being used as the test set each time. The average performance across all k iterations is used to evaluate the model’s performance.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # train and evaluate the model on the current fold
Let’s take the ice cream example from the previous article to understand the purpose of testing:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df)
And here is the output:
Actual Predicted
2 130 136.841645
8 200 190.183727
As you can see, splitting the data into two sets lets us hold out part of our dataset to test the model and see how far the predictions are from the real values.
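To quantify that gap instead of just eyeballing it, we can use scikit-learn's regression metrics. Here is a minimal sketch, reusing the y_test and y_pred variables from the code above:
from sklearn.metrics import mean_absolute_error, r2_score
# average absolute gap between predictions and actual values
print("MAE:", mean_absolute_error(y_test, y_pred))
# proportion of the target's variance explained by the model (1.0 is perfect)
print("R2:", r2_score(y_test, y_pred))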
Hyperparameter Tuning
Hyperparameters are parameters that are not learned during the training process, but are set before training the model. These parameters can have a significant impact on the performance of a model, so it’s important to tune them to achieve the best performance possible.
We can use the GridSearchCV and RandomizedSearchCV classes to perform hyperparameter tuning:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'C': [1, 10, 100, 1000], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
Here, the param_grid dictionary specifies the range of values for each hyperparameter (you can find each model's hyperparameters in the scikit-learn documentation), and the cv parameter specifies the number of folds to use in k-fold cross-validation. The fit method trains the model and performs the grid search.
GridSearchCV and RandomizedSearchCV also have a best_params_ attribute that returns the best set of hyperparameters and a best_score_ attribute that returns the best score achieved during the search. So, once you finish tuning your parameters, you can access these attributes to get the best result.
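RandomizedSearchCV works the same way, but samples a fixed number of parameter combinations instead of trying them all, which helps when the grid is large. Here is a minimal sketch (the distribution and n_iter value are arbitrary choices for illustration; loguniform requires SciPy):
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
# sample C from a log-uniform distribution instead of a fixed list of values
param_distributions = {'C': loguniform(1, 1000), 'kernel': ['linear', 'rbf']}
random_search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10, cv=5)
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)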
Model Selection
Model selection is the process of choosing the most appropriate model for a given dataset and problem.
Scikit-learn provides a wide range of models for various types of problems such as classification, regression and clustering. Some of the most commonly used models in scikit-learn include:
- Linear models: Linear Regression, Logistic Regression, and Ridge Regression.
- Tree-based models: Decision Trees, Random Forest, and Gradient Boosting.
- Support Vector Machines (SVMs)
- K-Nearest Neighbors (KNN)
- Neural Networks
- and many more.
When selecting a model, it’s important to consider the problem type, the size and structure of the dataset, and the computational resources available. In some cases, it’s also a good idea to try multiple models and compare their performance using metrics such as accuracy, precision, recall, and F1 score.
In scikit-learn, training and evaluating a model is a simple process. For example, to train and evaluate a decision tree model, you can use the following code:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# train the model
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
# evaluate the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The DecisionTreeClassifier class is used to train a decision tree model with a maximum depth of 5. The fit method is used to train the model on the training data, and the predict method is used to make predictions on the test data. The accuracy_score function is then used to evaluate the model's performance.
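Accuracy alone can be misleading, especially on imbalanced classes, and the precision, recall, and F1 metrics mentioned earlier are just as easy to compute. Here is a minimal sketch, reusing y_test and y_pred from the code above (average='weighted' is one possible choice for multi-class problems):
from sklearn.metrics import precision_score, recall_score, f1_score
# weighted averages take class imbalance into account
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1 score:", f1_score(y_test, y_pred, average='weighted'))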
Another way to compare different models is by using the cross_val_score function from the model_selection module. This function allows you to evaluate a model using k-fold cross-validation, which can be useful for getting a more robust estimate of a model's performance. For example, to evaluate a decision tree model and a random forest model, you can use the following code:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
dt_scores = cross_val_score(dt, X, y, cv=5)
rf_scores = cross_val_score(rf, X, y, cv=5)
print("Decision tree:", dt_scores.mean())
print("Random forest:", rf_scores.mean())
Ensemble Models
Ensemble models are a type of model that combine the predictions of multiple base models to improve performance. This can be beneficial when the base models have different strengths and weaknesses, and the ensemble model can exploit these strengths to make more accurate predictions.
There are several types of ensemble models in scikit-learn, including:
- Bagging: Bagging stands for Bootstrap Aggregating. It generates multiple models by training each one on a different random subset of the data, and the final prediction is made by averaging the predictions of all base models. Random Forest is an example of a bagging algorithm.
- Boosting: Boosting is an iterative method that adjusts the weight of incorrectly predicted examples in the dataset and trains new models on the updated dataset. The final prediction is made by combining the predictions of all base models. Gradient Boosting is an example of a boosting algorithm.
- Stacking: Stacking combines the predictions of multiple base models by training a meta-model to make the final prediction.
In scikit-learn, ensemble models can be easily implemented using the BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, and VotingClassifier classes. For example, to train a random forest model, you can use the following code:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
Here, the RandomForestClassifier class is used to train a random forest model with 100 decision trees. The fit method is used to train the model on the training data.
Similarly, to train an AdaBoost model, you can use the following code:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# note: the 'base_estimator' parameter was renamed to 'estimator' in scikit-learn 1.2
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100)
ada.fit(X_train, y_train)
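Stacking, described above, has a dedicated class as well. Here is a minimal sketch using StackingClassifier with two arbitrary base models and a logistic regression meta-model:
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# base models whose predictions feed the meta-model
estimators = [('rf', RandomForestClassifier(n_estimators=100)), ('svc', SVC())]
# the meta-model learns how to combine the base models' predictions
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
VotingClassifier works similarly, but instead of training a meta-model it simply majority-votes (or averages) the base models' predictions.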
Final Note
As you can see, scikit-learn implements a lot of features, giving us a large arsenal for solving data science problems. That's a good thing, because the best approach always depends on the specific dataset and problem, so it's crucial to try different techniques and evaluate their performance.
In the next article, we’ll cover some advanced scikit-learn applications. Be sure to follow me if you don’t want to miss this article!