MACHINE LEARNING IN PYTHON
The powerful feature extraction method you’ve never heard of
Extracting time series features using autoregressive models.

Time series feature engineering is hard because you are trying to take something long, and possibly of variable length, and transform it into a short fixed-length vector that can be compared with other time series. Many creative methods have been proposed to tackle this challenge. One that is rarely mentioned, yet very powerful, is extracting the coefficients of an autoregressive (AR) model.
The AR model is a classical time series model. It is simple compared to modern neural architectures such as recurrent neural networks, convolutional neural networks or transformers, but it remains very competitive, especially on smaller datasets. It forecasts future values of a time series as a linear combination of its past values (lags).
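In standard textbook notation (the symbols below are assumed here, not taken from the text above), an AR model with p lags can be written as:

```latex
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t
```

Here c is a constant, the \varphi_i are the lag coefficients, and \varepsilon_t is white noise. The p coefficients, plus the constant if desired, are the quantities used as features.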

Okay, so it is a time series model. What does this have to do with feature extraction? Here comes the ingenuity: when you fit a model to a time series, it learns a fixed set of parameters that not only help forecast the future of the series but also characterize it. These parameters (coefficients) can therefore be used as a fixed-length feature vector! All you have to do is fit one AR model to each time series, using the same number of lags for all of them, and then extract the parameters.
But why the AR model? Why not fit a neural network or some other model? The AR model works well for this because it is fast to train and has low variance. In comparison, neural networks are slow to train and have high variance: simply running the training procedure multiple times can lead to widely different results. The autoregressive integrated moving average (ARIMA) model, or the vector autoregression (VAR) model for multivariate data, would be alternatives, but both are slower to train.
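The idea can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes NumPy, fits each AR model by ordinary least squares, and uses hypothetical helper names (`ar_features`, `simulate_ar`) and simulated toy data that are not part of the original text.

```python
import numpy as np


def ar_features(series, n_lags):
    """Fit an AR(n_lags) model by ordinary least squares.

    Returns [intercept, phi_1, ..., phi_n_lags] as a fixed-length vector.
    (Hypothetical helper for illustration.)
    """
    y = np.asarray(series, dtype=float)
    if y.size <= 2 * n_lags:
        raise ValueError("series is too short for the requested number of lags")
    # Column k holds the lag-(k + 1) values, aligned with the targets y[n_lags:].
    lagged = np.column_stack(
        [y[n_lags - k - 1 : y.size - k - 1] for k in range(n_lags)]
    )
    design = np.column_stack([np.ones(lagged.shape[0]), lagged])
    target = y[n_lags:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs


def simulate_ar(phi, n, rng):
    """Generate n points of a zero-mean AR process (toy data only)."""
    p = len(phi)
    y = np.zeros(n + p)
    noise = rng.normal(size=n + p)
    for t in range(p, n + p):
        # y[t - p : t][::-1] is (y[t-1], ..., y[t-p]).
        y[t] = np.dot(phi, y[t - p : t][::-1]) + noise[t]
    return y[p:]


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Three series of different lengths; the first two share the same dynamics.
collection = [
    simulate_ar([0.6, -0.3], 500, rng),
    simulate_ar([0.6, -0.3], 800, rng),
    simulate_ar([-0.5, 0.2], 650, rng),
]

# Same number of lags for every series, so every feature vector has the same length.
features = np.vstack([ar_features(s, n_lags=2) for s in collection])
print(features.shape)  # (3, 3): one row per series, intercept plus two lag coefficients
```

Each row of `features` can then be passed to any standard classifier or clustering algorithm, regardless of how long the original series was. A library estimator could replace the least-squares fit; it is written out here only to keep the sketch self-contained.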
Back to the assumptions of the AR model. The important one is that the time series is weakly stationary. This is not a necessary condition for the feature extraction method, but the method will likely perform much better when the assumption is fulfilled. What does this mean? A time series is said to be weakly stationary if its mean and variance are the same for all t, and the covariance (correlation) between two observations depends only on the lag between them, not on t. In plain English, it is a time series whose behavior stays the same over time, without large changes such as trends or seasonal cycles. Why do we want this? Because if the properties of the time series remain the same, the model can treat every observation the same way.
When this assumption is not fulfilled, an operation called differencing can sometimes make the time series stationary. Instead of using the raw data, we use the changes from point t to point t+1.
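Written out, with X_t denoting the raw series (notation assumed here for illustration), the first difference is:

```latex
\Delta X_t = X_{t+1} - X_t
```

Many references write this as X_t - X_{t-1} instead; the two forms differ only by a shift of the index. Either way, the differenced series is one observation shorter than the original.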

A simple way to see how this works is to simulate some data. If we generate data points from a normal distribution and use their cumulative sum as a time series, we get a random walk, a classic example of a non-stationary series.
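A minimal sketch of that simulation, assuming NumPy; the seed, series length and variable names are arbitrary choices for illustration, not values from the original text:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility

# Independent draws from a standard normal distribution.
steps = rng.normal(loc=0.0, scale=1.0, size=1000)

# The cumulative sum turns the draws into a random walk.
walk = np.cumsum(steps)

# First difference: the change from point t to point t+1.
diffed = np.diff(walk)

# Differencing recovers the original draws (all but the first one).
print(np.allclose(diffed, steps[1:]))  # True
print(len(walk), len(diffed))  # 1000 999
```

The random walk wanders without settling around a fixed level, while the differenced series is just the original normal draws, which are stationary by construction.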







