How to split data into three sets (train, validation, and test), and why
Sklearn's train_test_split is not enough. We need something better and faster.

INTRODUCTION
Why do you need to split data?
You don’t want your model to over-learn from the training data and then perform poorly once it is deployed in production. You need a mechanism to assess how well your model generalizes. Hence, you need to separate your input data into training, validation, and testing subsets so that you can detect overfitting and evaluate your model effectively.
In this post, we will cover the following:
- A brief definition of training, validation, and testing datasets
- Ready-to-use code for creating these datasets (2 methods)
- The reasoning behind dataset split ratios
Definition of Train-Valid-Test Split
Train-Valid-Test split is a technique for evaluating the performance of your machine learning model, whether it is a classification or a regression model. You take a given dataset and divide it into three subsets. A brief description of the role of each of these datasets is below.
Train Dataset
- Set of data used for learning (by the model), that is, to fit the parameters of the machine learning model.
Valid Dataset
- Set of data used to provide an unbiased evaluation of a model fitted on the training dataset while tuning model hyperparameters.
- Also plays a role in other forms of model preparation, such as feature selection and threshold cut-off selection.
Test Dataset
- Set of data used to provide an unbiased evaluation of a final model fitted on the training dataset.
Read this article by Jason Brownlee if you want to know more about how experts in machine learning define train, test, and validation datasets (reference #1 in the references section below).
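To make the three roles concrete, here is a minimal sketch rather than a definitive recipe. The toy data, the 60/20/20 slicing, the choice of `LogisticRegression`, and the candidate values of `C` are all assumptions made for illustration. The model is fitted on the train set, the hyperparameter is chosen on the valid set, and the test set is touched only once at the end.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (assumed for illustration): 600 rows, 4 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
noise = rng.normal(scale=0.5, size=600)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Shuffle the row indices and slice them 60% / 20% / 20%.
idx = rng.permutation(len(X))
train_idx, valid_idx, test_idx = idx[:360], idx[360:480], idx[480:]

# Train set: fit the model parameters.
# Valid set: compare hyperparameter candidates (here, the regularization strength C).
candidates = [0.01, 0.1, 1.0, 10.0]
best_C, best_score = None, -1.0
for C in candidates:
    model = LogisticRegression(C=C).fit(X[train_idx], y[train_idx])
    score = model.score(X[valid_idx], y[valid_idx])
    if score > best_score:
        best_C, best_score = C, score

# Test set: a single, final evaluation of the chosen model.
final_model = LogisticRegression(C=best_C).fit(X[train_idx], y[train_idx])
test_accuracy = final_model.score(X[test_idx], y[test_idx])
print(f"best C={best_C}, valid accuracy={best_score:.3f}, test accuracy={test_accuracy:.3f}")
```

Because the valid set was used to pick `C`, its score is slightly optimistic. The test score is the one to report.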
Ready-to-use code snippets
In this post, we will see two ways of splitting the data into train, valid, and test sets:
- Splitting Randomly
- Splitting using the temporal component
1. Splitting Randomly
You can’t evaluate the predictive performance of a model with the same data you used for training. You should evaluate the model on new data that it has not seen before. Randomly splitting the data is the most commonly used way to get that unbiased evaluation.

i. Using Sklearn → ‘train_test_split’
In the code snippet below, you will learn how to use train_test_split twice to create the train | valid | test datasets in our desired proportions.
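A minimal sketch of the two-step split, under stated assumptions: the toy arrays, the 80/10/10 ratios, and the `random_state` value are illustrative choices, not values taken from this post. The first call to `train_test_split` carves off the train set. The second call divides the remainder into valid and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 1000 rows, 5 features, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Desired proportions (assumed here): 80% train, 10% valid, 10% test.
train_ratio, valid_ratio, test_ratio = 0.8, 0.1, 0.1

# First split: separate the train set; the remainder holds valid + test.
X_train, X_rem, y_train, y_rem = train_test_split(
    X, y, train_size=train_ratio, random_state=42
)

# Second split: divide the remainder between valid and test.
# test_size is a fraction of the remainder, not of the full dataset,
# so the test ratio has to be rescaled.
test_share = test_ratio / (valid_ratio + test_ratio)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=test_share, random_state=42
)

print(X_train.shape, X_valid.shape, X_test.shape)
```

The rescaling step is the part that is easy to get wrong: passing `test_size=0.1` to the second call would take 10% of the remaining 20%, which is only 2% of the original data. For classification problems with imbalanced classes, `train_test_split` also accepts a `stratify` argument to keep class proportions similar across the subsets.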







