Summary

The web content discusses strategies for continuous learning in machine learning models to handle data drift, focusing on online machine learning techniques and comparing stateful and continuous learning approaches.

Abstract

The article "Retrain, or not Retrain? Online Machine Learning with Gradient Boosting" delves into the challenges faced when deploying machine learning models in streaming data contexts where data drift is a concern. It outlines different refitting strategies, including stateful learning, weighted stateful learning, and continuous learning, the last of which updates a model with new data without retraining from scratch. Through experiments with simulated concept drift scenarios using SGDRegressor and LGBMRegressor, the post demonstrates that continuous learning, or online machine learning, can lead to better performance and efficiency compared to traditional retraining methods. The results suggest that online machine learning can be effectively applied to various algorithms, including powerful tree-based gradient boosting methods, provided that careful model validation is in place to prevent catastrophic forgetting.

Opinions

The author emphasizes the importance of updating predictive models with new data due to the potential for performance degradation caused by data drift.
Retraining from scratch is seen as less efficient and more costly compared to online machine learning approaches that incrementally update the model.
The article conveys that continuous learning can outperform stateful learning, and that stateful learning itself improves when recent data is given higher importance through sample weights.
Online machine learning is presented as not only feasible but also superior in handling dynamic patterns over time, particularly when using algorithms like SGDRegressor and LGBMRegressor.
The author advocates for a cautious approach to online machine learning, highlighting the risk of catastrophic forgetting and the necessity for robust validation strategies to ensure model robustness.

Retrain, or not Retrain? Online Machine Learning with Gradient Boosting

Comparing Refit Strategies for Continuous Learning in Scikit-Learn

Training a machine learning model requires energy, time, and patience. Smart data scientists organize experiments and track trials on historical data to deploy the best solution. Problems may arise when we pass newly available samples to our pre-built machine learning pipeline. In the case of predictive algorithms, the observed performance may diverge from the expected one.

The causes behind these discrepancies are varied. Excluding technical mistakes, the most common and feared culprit is data drift. From the standard distribution shift to the sneakier multivariate and concept drift, we must be prepared to handle all these situations.

In this post, we don’t focus on how to detect data drift. Instead, we try to outline how to react once it is present. Numerous interesting tools and fancy techniques have been introduced recently to facilitate data drift detection. That’s cool, but what can we do afterward? “Refit is all you need” is the best-known slogan for handling the situation. In other words, when new labeled data becomes available, we should make our model continuously learn new insights from it.

By online machine learning, we mean a multi-step training process that allows our algorithms to adapt dynamically to new patterns. If done properly, it may provide great benefits (in terms of both speed and performance) over retraining from scratch. That’s exactly what we want to test in this post.

EXPERIMENT SETUP

We imagine operating in a streaming context where, at some regular time intervals, we can access new labeled data, calculate the metrics of interest, and retrain our predictive model.

We simulate a concept drift scenario. We have some features whose distributions remain stationary over time. Our target is a linear combination of the features. The contribution of each feature to the target is dynamic, not constant over time.

Left: simulated features. Right: simulated coefficients (image by the author)
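The setup described above could be simulated along the following lines. This is a minimal sketch with NumPy: the sample size, the random-walk dynamics of the coefficients, the noise level, and the function name simulate_concept_drift are illustrative assumptions, not the settings behind the figures in this post.

```python
import numpy as np

def simulate_concept_drift(n_samples=2000, n_features=3, drift_scale=0.02,
                           noise_scale=0.1, seed=0):
    """Stationary features; target is a linear combination with drifting coefficients."""
    rng = np.random.default_rng(seed)
    # Features: i.i.d. standard normal, so their distribution never changes
    X = rng.normal(size=(n_samples, n_features))
    # Coefficients: one random walk per feature (assumed drift dynamics)
    coefs = 1.0 + np.cumsum(
        rng.normal(scale=drift_scale, size=(n_samples, n_features)), axis=0
    )
    # Target: row-wise dot product of features and current coefficients, plus noise
    y = (X * coefs).sum(axis=1) + rng.normal(scale=noise_scale, size=n_samples)
    return X, y, coefs

X, y, coefs = simulate_concept_drift()
print(X.shape, y.shape, coefs.shape)
```

Because the features stay stationary while the coefficients move, a drift detector looking only at the feature distributions would see nothing unusual here, even though the relationship to the target keeps changing.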

In this situation, using a predictive model trained some time ago may be useless. Let’s investigate the options at our disposal.

REFIT IS ALL YOU NEED

To learn the relationships between target and features we need to frequently update our model. In this sense, we have different strategies at our disposal.

We may adopt stateful learning where we initialize, at some predefined intervals, training from scratch with the data at our disposal. All we have to do is merge the new samples with the historical ones. We don’t need to store the previously fitted model since we recreate it.

A variation of stateful learning is weighted stateful learning. It consists of giving the latest observations a higher weight, so that the new model focuses on the most recent patterns. Carrying out weighted training is straightforward. Many recent machine learning algorithm implementations provide the built-in possibility to give each sample a different weight.

weighted stateful learning (image by the author)
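As a rough illustration of weighted stateful learning, the sketch below refits a fresh SGDRegressor on all the data while passing recency-based weights through the sample_weight argument of fit. The exponential decay, the half_life value, the helper name recency_weights, and the toy data are assumptions made for the example; any weighting scheme that favors recent rows would fit the same pattern.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def recency_weights(n_samples, half_life):
    """Exponentially decaying weights: the newest sample gets weight 1."""
    age = np.arange(n_samples)[::-1]  # 0 for the newest observation
    return 0.5 ** (age / half_life)

# Toy data standing in for "historical + new" samples (illustrative only)
rng = np.random.default_rng(0)
X_all = rng.normal(size=(500, 3))
y_all = X_all @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

weights = recency_weights(len(X_all), half_life=100)

# Stateful learning: a brand-new model fitted on all data, recent rows weighted more
model = SGDRegressor(random_state=0).fit(X_all, y_all, sample_weight=weights)
```

Setting all weights to 1 recovers plain stateful learning, so the two strategies differ only in the sample_weight vector.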

On the other hand, we may consider continuous learning, aka online machine learning. In continuous learning, we use the previous model knowledge to initialize a new training step. We take the new set of available samples and make the previously fitted model learn new patterns from them. By updating (instead of reinitializing) the model knowledge, we hope to get better performance while reducing the cost of training from scratch.

continuous learning (image by the author)

ONLINE MACHINE LEARNING IN PRACTICE

Online machine learning is natively supported by all neural network-based algorithms. We can continue the training process at any time by passing new data and updating the model parameters sample-wise (or batch-wise) from the resulting loss.

Practically speaking, in the scikit-learn ecosystem, all the algorithms that support the partial_fit method can carry out continuous learning. The code snippet below shows how to do it in a few lines.

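from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import TimeSeriesSplit

# X, y, n_splits, test_size, and fit_params are assumed to be defined beforehand
cv = TimeSeriesSplit(n_splits, test_size=test_size)

for i, (id_train, id_test) in enumerate(cv.split(X)):
    if i > 0:
        # later splits: update the fitted model with the newest block only
        new = id_train[-test_size:]
        model = model.partial_fit(X[new], y[new])
    else:
        # first split: fit on all the available history
        model = SGDRegressor(**fit_params).fit(X[id_train], y[id_train])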

Coming back to our experiment, we test the three mentioned training strategies (stateful learning, weighted stateful learning, and continuous learning) on our simulated data using an SGDRegressor. We don’t do this only once: we repeat it over multiple simulated scenarios to account for the variability in the simulation process. We regularly evaluate our models for 20 periods and store the prediction errors (calculated as SMAPE) for all the simulated scenarios.
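For reference, SMAPE (symmetric mean absolute percentage error) can be computed as sketched below. Several variants of the metric exist; this one averages 2|y - ŷ| / (|y| + |ŷ|) and therefore ranges from 0 to 2. Which variant and scaling the experiments in this post used is not stated, so treat the exact formula as an assumption.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in the range [0, 2]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Define the term as 0 where both actual and prediction are exactly 0
    safe_denom = np.where(denom == 0, 1.0, denom)
    terms = np.where(denom == 0, 0.0, 2.0 * np.abs(y_true - y_pred) / safe_denom)
    return terms.mean()

print(smape([100, 200], [110, 180]))
```

In the streaming setting, the metric would be evaluated on each test block (X[id_test], y[id_test]) before that block is used to update the model.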

SGDRegressor online machine learning performances (image by the author)

We can see that continuous learning achieves the best performance among the three strategies. By giving more weight to recent observations, weighted stateful learning also does better than standard stateful learning.

These results sound promising. Is it possible to do online machine learning with other algorithms? We know the great power of tree-based gradient boosting. Lots of machine learning projects use it thanks to its adaptability to a wide range of situations. It would be great to be able to do online machine learning with it as well.

Fortunately, we can! It is as easy as in the previous case. The snippet below shows how to do it with LGBMRegressor.

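from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit

# X, y, n_splits, test_size, and fit_params are assumed to be defined beforehand
cv = TimeSeriesSplit(n_splits, test_size=test_size)

for i, (id_train, id_test) in enumerate(cv.split(X)):
    if i > 0:
        # later splits: continue boosting from the previous booster on the newest block
        new = id_train[-test_size:]
        model = LGBMRegressor(**fit_params).fit(
            X[new], y[new], init_model=model.booster_
        )
    else:
        # first split: train from scratch on all the available history
        model = LGBMRegressor(**fit_params).fit(X[id_train], y[id_train])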

Let’s see it in action in our simulated scenario.

LGBMRegressor online machine learning performances (image by the author)

We achieve the same satisfactory results as before. If properly handled, online machine learning appears to be effective and available with different algorithms.

SUMMARY

In this post, we introduced the concept of online machine learning. We explored different stateful refitting strategies, comparing them with a continuous learning approach. Online machine learning proved to be a good approach to test in some applications. Doing online machine learning is a bit of an art. There is no guarantee that it leads to a performance improvement. There is a high risk of making the model forget what it has learned (catastrophic forgetting). In this sense, having a solid and adequate validation strategy is more important than ever.
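One possible guard against catastrophic forgetting, sketched under assumptions: update a copy of the model and keep it only if its error on a held-out validation set does not grow beyond a tolerance. The helper name guarded_partial_fit, the tolerance value, the use of mean absolute error, and the toy data are hypothetical choices for illustration, not the validation scheme used in this post.

```python
import copy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error

def guarded_partial_fit(model, X_new, y_new, X_val, y_val, tolerance=0.05):
    """Return the updated model if it validates well, else the current one."""
    candidate = copy.deepcopy(model)       # keep the current model untouched
    candidate.partial_fit(X_new, y_new)    # incremental update on new data only
    old_err = mean_absolute_error(y_val, model.predict(X_val))
    new_err = mean_absolute_error(y_val, candidate.predict(X_val))
    # Accept the update only if validation error grows by at most `tolerance`
    accepted = new_err <= old_err * (1 + tolerance)
    return (candidate if accepted else model), accepted

# Toy usage with a fixed linear relationship (illustrative data only)
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
X_old, X_new, X_val = (rng.normal(size=(300, 3)) for _ in range(3))
model = SGDRegressor(random_state=0).fit(X_old, X_old @ w)
model, accepted = guarded_partial_fit(model, X_new, X_new @ w, X_val, X_val @ w)
print(accepted)
```

The choice of validation set matters: under drift it should represent the patterns the model is still expected to get right, which typically means recent data rather than the oldest history.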

CHECK MY GITHUB REPO