Data Science with Python — K-Fold Cross Validation

This article is part of the “Data Science with Python” series.
Once you’ve built a machine learning model, you need to evaluate it, and there are many ways to do so. One of them is k-fold cross-validation, a powerful technique because it gives you an idea of how your model performs on unseen data.
Today, we’ll discover k-fold cross-validation and how to use it in Python.
What Is K-Fold Cross-Validation?
K-fold cross-validation is a technique used in machine learning to assess the performance and generalization ability of a model. It helps us understand how well our model will perform on unseen data.
Let’s say you want to train a model to classify images of animals into different categories: cats, dogs, and birds. You have a dataset of 1000 images, but you want to evaluate your model’s performance accurately.
Here’s how K-fold cross-validation works:
- Splitting the dataset: First, you divide your dataset into K equal-sized subsets or folds. For example, let’s use K=5, so you’ll have 5 subsets, each containing 200 images.
- Training and testing: Now, you iterate through each fold, treating it as a testing set, while the remaining K-1 folds serve as the training set. In the first iteration, the first fold will be the testing set, and the other four folds will be the training set.
- Model training and evaluation: Train your model using the training set, and then evaluate its performance on the testing set. You can measure metrics like accuracy, precision, recall, or F1 score to assess how well your model performs on the current fold.
- Iteration: Repeat the previous two steps for each fold, changing the testing set each time, until every fold has been used as the testing set.
- Average performance: Once you’ve completed all iterations, you calculate the average performance of your model across all folds. This gives you a more reliable estimate of how well your model is likely to perform on unseen data.
The main advantage of K-fold cross-validation is that it makes better use of a limited dataset: every sample ends up being used for both training and evaluation.
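To make these steps concrete, here is a minimal sketch of the procedure written by hand with NumPy and scikit-learn. The synthetic dataset, the logistic regression model, and the accuracy metric are illustrative placeholders; the next section shows the much shorter, built-in way to do the same thing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Illustrative dataset: 1000 samples, like the animal-images example above
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
k = 5
indices = np.arange(len(X))
folds = np.array_split(indices, k)  # 5 folds of 200 samples each
scores = []
for i in range(k):
    test_idx = folds[i]  # the current fold is the testing set
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # the other folds form the training set
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(f"Accuracy per fold: {scores}")
print(f"Average accuracy: {np.mean(scores)}")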

Python Implementation
Using k-fold cross-validation in Python is straightforward because it’s already implemented in scikit-learn. You just have to combine the KFold class with the cross_val_score function:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np
# Assume you have your feature data 'X' and target labels 'y' ready
# Create the model to evaluate (cross_val_score will fit it on each fold)
model = LogisticRegression()
# Define the number of folds (K)
k = 5
# Create a KFold object
kfold = KFold(n_splits=k)
# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kfold)
# Print the accuracy for each fold
for i, score in enumerate(scores):
    print(f"Fold {i+1}: {score}")
# Calculate and print the average accuracy across all folds
print(f"Average accuracy: {np.mean(scores)}")
The cross_val_score function performs cross-validation by fitting the model on each fold’s training data and evaluating its performance on the corresponding test data. It returns an array of scores, where each score represents the model’s accuracy for a particular fold.
The KFold constructor takes two other arguments:
- shuffle (optional): By default, shuffle is set to False. If you want to shuffle the data before splitting it into folds, set shuffle=True.
- random_state (optional): This argument specifies the random seed used for shuffling when shuffle=True. It ensures the reproducibility of results.
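For example, to shuffle the data before splitting while keeping the result reproducible (the seed value here is arbitrary):
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)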
And for cross_val_score, here are the main arguments:
- estimator: The machine learning model or estimator you want to evaluate. It can be an instance of a classifier or regressor from scikit-learn.
- X: The feature data or independent variables.
- y: The target labels or dependent variable.
- cv: The cross-validation splitting strategy. It can accept a KFold object or an integer specifying the number of folds (similar to the n_splits argument of KFold). You can also use other cross-validation strategies from scikit-learn, such as StratifiedKFold or GroupKFold.
- scoring (optional): The scoring metric used to evaluate the model's performance. It can be a string naming a predefined metric or a callable object. For classification tasks, common metrics include accuracy, precision, recall, and F1 score.
- n_jobs (optional): The number of parallel jobs to run during cross-validation. Setting it to -1 uses all available processors.
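As a quick sketch of how these arguments fit together, reusing the model, X, and y from the earlier example (choosing F1 as the metric and 5 folds is purely illustrative):
scores = cross_val_score(model, X, y, cv=5, scoring="f1", n_jobs=-1)
print(f"Average F1 score: {scores.mean()}")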
How to Find the Ideal Number of Folds for K-Fold Cross-Validation
It’s important to find a good number of folds to make sure your evaluation of the model is reliable. I have a few tips for you:
- Consider the size of your dataset: If you have a small dataset, using a higher number of folds may be beneficial so that each model is still trained on most of the data. For example, you might consider using 5 or 10 folds.
- Evaluate the stability of your model: Cross-validation provides an estimate of the model’s performance. If you notice that the model’s performance (e.g., evaluation metrics) varies significantly across folds, it may indicate that the dataset is not representative enough or that the model is sensitive to the particular training samples it sees. In such cases, using a larger number of folds can help improve the stability of the evaluation (see the sketch after this list).
- Consider the trade-off between bias and variance: With a higher number of folds, each training set contains more of the data, which reduces bias, but the performance estimate tends to have higher variance because the test folds are smaller and the training sets overlap heavily. Conversely, a lower number of folds trains each model on less data, which can make the estimate more pessimistic (higher bias), though it is cheaper to compute. It’s important to strike a balance based on your specific scenario.
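A quick way to apply these tips in practice is to compare the mean and spread of the scores for a few candidate values of K, reusing the model, X, and y from the earlier example (the candidate values below are arbitrary):
for k in (3, 5, 10):
    kfold = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=kfold)
    print(f"K={k}: mean={scores.mean():.3f}, std={scores.std():.3f}")
A large standard deviation relative to the mean suggests the evaluation is unstable and may benefit from more folds or more data.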
Combining K-Fold Cross-Validation with Other Validation Techniques
To further improve the robustness and reliability of the evaluation process, you can combine k-fold cross-validation with other techniques.
First, you can use hold-out validation. In addition to k-fold cross-validation, you can set aside a small portion of your dataset as a hold-out validation set. This hold-out set is not used during the k-fold cross-validation process but can be used as a final evaluation of your model’s performance. It provides an additional unbiased estimate of the model’s generalization ability on unseen data.
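Here is a minimal sketch of that combination, again reusing the model, X, and y from earlier; the 20% hold-out fraction is an arbitrary choice:
from sklearn.model_selection import train_test_split
# Keep 20% of the data aside; it is never touched during cross-validation
X_dev, X_holdout, y_dev, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)
# K-fold cross-validation on the development portion only
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)
print(f"Cross-validation accuracy: {cv_scores.mean():.3f}")
# Final, unbiased check on the hold-out set
model.fit(X_dev, y_dev)
print(f"Hold-out accuracy: {model.score(X_holdout, y_holdout):.3f}")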
You can also use stratified sampling if your dataset is imbalanced, meaning that the classes are not evenly represented. Stratified sampling ensures that each fold contains a representative distribution of the classes, helping to prevent any individual class from being underrepresented or overrepresented in a particular fold.
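In scikit-learn, StratifiedKFold implements exactly this idea; a minimal sketch, reusing the model, X, and y from earlier:
from sklearn.model_selection import StratifiedKFold
# Each fold keeps roughly the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=skf)
print(f"Stratified CV accuracy: {scores.mean():.3f}")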
Finally, you can perform nested cross-validation. It involves an outer loop of k-fold cross-validation and an inner loop of another k-fold cross-validation. The inner loop is used for hyperparameter tuning and model selection, while the outer loop evaluates the performance of the selected model on folds it has never seen. This approach provides a more unbiased estimate of the model’s performance by using separate data for model selection and evaluation.
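Here is a hedged sketch of nested cross-validation using GridSearchCV as the inner loop; the grid over LogisticRegression’s C parameter is only an example:
from sklearn.model_selection import GridSearchCV
# Inner loop: hyperparameter search, with its own K-fold split on the training data
inner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
)
# Outer loop: evaluate the tuned model on folds it has never seen
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f}")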
Final Note
K-fold cross-validation is a powerful tool to evaluate a model. I can’t think of many situations where it isn’t beneficial, so you can confidently use it for almost any problem you have to solve.