Machine Learning — Look-ahead Bias
Understand the bias with a concrete example

The concept
In data science and machine learning, look-ahead bias is a problem that can degrade the performance of a model and the quality of its predictions.
This bias is easy to understand but difficult to avoid, because model performance often looks good during training and evaluation. So you may see no visible issue.
In this article, I illustrate the bias with a simple example (building a supervised model to predict a product price). Afterward, we dive into explanations and tips to avoid it.
Example
To illustrate this bias, imagine you want to build a model to predict the price of a product. To train the model, you use a bunch of data such as:
- Product characteristics (brand, weight, size, product category)
- Historical data (previous price, number of sales)
- Geographical data (the country and region in which the product is sold)
- Customer reviews
- Consumer characteristics (customer age, preferences)
- Etc.
During the training phase, you gather all this data, do some feature engineering, and start training your machine learning models.
Afterwards, you evaluate your model. For that, you use the test dataset, which contains products the model has never seen. The test dataset also contains the price, the variable we want to predict.
To evaluate the model, you give it the variables of the products (size, weight, country, etc.), then check with different metrics (R2, MSE, etc.) that its predictions are accurate. At this point you are very happy: you obtain great results.
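As a reminder of what those two metrics measure, here is a minimal sketch in plain Python. The prices and predictions are made-up numbers, and the function names `mse` and `r_squared` are only illustrative; in practice you would probably rely on the metrics shipped with your ML library.

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared gap between actual and predicted prices."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R2: 1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical test-set prices and model predictions (made-up numbers).
actual_prices = [10.0, 20.0, 30.0]
predicted_prices = [12.0, 18.0, 30.0]

mse_value = mse(actual_prices, predicted_prices)
r2_value = r_squared(actual_prices, predicted_prices)
print(f"MSE = {mse_value:.3f}, R2 = {r2_value:.3f}")  # MSE = 2.667, R2 = 0.960
```

A low MSE and an R2 close to 1 only say that the predictions match the test prices. They say nothing about whether the input variables will exist when the model is used for real.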
Those great results make you feel the model is perfect. But it may not be the case at all.
I dive into explanations of the bias below.
Explanations
After training the model, we evaluated it by giving it data it had never seen before (brand, product size, weight, etc.).
The bias can happen when we train the model with data that will not be available when we make predictions.
At prediction time, we assume we have data such as weight, size, and brand. But for a new product (not even marketed yet), customer reviews are not available.
The model has been trained with customer reviews, while this data is never available for new products. And because the test dataset contained reviews too, the evaluation could not reveal the problem. This is the bias.
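The mismatch can be made visible with a small check. The sketch below is only an illustration: the feature names, the two example products, and the helper `missing_features` are all hypothetical.

```python
# Features the (hypothetical) model was trained on.
TRAINING_FEATURES = ["brand", "weight", "size", "country", "avg_review_score"]


def missing_features(product, expected=TRAINING_FEATURES):
    """Return the expected features that have no value for this product."""
    return [name for name in expected if product.get(name) is None]


# A product from the historical data: it has been on sale, so reviews exist.
historical_product = {
    "brand": "Acme", "weight": 1.2, "size": "M",
    "country": "FR", "avg_review_score": 4.5,
}

# A new product that is not marketed yet: nobody has reviewed it.
new_product = {"brand": "Acme", "weight": 0.9, "size": "S", "country": "FR"}

print(missing_features(historical_product))  # []
print(missing_features(new_product))         # ['avg_review_score']
```

Every product in the training and test datasets looks like `historical_product`, which is why the evaluation stays silent. The gap only appears when the model meets something like `new_product`.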
Prevent the bias
Ensure the features (variables) you use contain only information that is available at the time of prediction. Also document your data preprocessing and feature engineering steps, so that other team members or stakeholders can review them and spot potential look-ahead bias.
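One possible way to combine both tips is to keep a small catalogue that records when each feature becomes known, and to select training features from it. This is a sketch under assumptions: the stage names, the catalogue content, and the helper `usable_features` are invented for the example, and the right stage for each feature depends on your own data.

```python
# Ordered stages of a product's life (assumed for this sketch).
STAGES = ["before_launch", "after_launch"]

# Hypothetical feature catalogue: each feature is documented with the
# earliest stage at which its value is known.
FEATURE_CATALOGUE = {
    "brand": "before_launch",
    "weight": "before_launch",
    "size": "before_launch",
    "category": "before_launch",
    "country": "before_launch",
    "customer_reviews": "after_launch",
    "number_of_sales": "after_launch",
}


def usable_features(catalogue, prediction_stage="before_launch"):
    """Keep only the features already known at the given prediction stage."""
    limit = STAGES.index(prediction_stage)
    return sorted(
        name for name, stage in catalogue.items()
        if STAGES.index(stage) <= limit
    )


print(usable_features(FEATURE_CATALOGUE))
# ['brand', 'category', 'country', 'size', 'weight']
```

The catalogue doubles as documentation: a reviewer can challenge a single line ("are reviews really unknown before launch?") without reading the whole pipeline.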
Notes
This bias may look obvious in a simple example, but it becomes much harder to spot when dealing with hundreds of features (variables) and large amounts of data.
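With many features, a manual review quickly becomes impractical, so part of the check can be automated. The sketch below assumes you can find out, for each feature, the date on which its value first became available. The dates, the feature names, and the helper `flag_lookahead` are hypothetical.

```python
from datetime import date

# Hypothetical audit table for one product: the date on which each
# feature first had a value.
first_available = {
    "brand": date(2024, 1, 10),
    "weight": date(2024, 1, 10),
    "customer_reviews": date(2024, 3, 2),
    "number_of_sales": date(2024, 2, 16),
}

# Assumed moment at which the price has to be predicted (before launch).
prediction_date = date(2024, 2, 1)


def flag_lookahead(first_available, prediction_date):
    """Return the features that only became available after the prediction date."""
    return sorted(
        name for name, available_on in first_available.items()
        if available_on > prediction_date
    )


print(flag_lookahead(first_available, prediction_date))
# ['customer_reviews', 'number_of_sales']
```

Any feature flagged this way deserves a closer look before it goes into the training set.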
Here we assumed the bias arises when building a supervised model on a static dataset. Other techniques, such as unsupervised and reinforcement learning, may be less exposed to it when they can incorporate previously unknown data into the model without retraining.





