This article discusses various time domain features that can be extracted from a time series using pandas in Python, including basic aggregations, kurtosis, skew, the quantile/percentile function, autocorrelation, correlation, differencing, and segmentation.
Abstract
The article begins by introducing the concept of time domain features in time series data and their extraction using pandas in Python. It then explains various types of time domain features, starting with basic aggregations such as mean, median, max, min, variance, mode, and standard deviation, before moving on to more complex features such as kurtosis, skew, the quantile/percentile function, autocorrelation, and correlation. It also discusses the importance of preprocessing steps such as differencing and segmentation in enhancing the extracted information, and concludes by demonstrating the feature extraction techniques in action on a dataset and comparing the results with previous methods.
Bullet points
Time domain features are extracted from the time domain of a time series.
Basic aggregations such as mean, median, max, min, variance, mode, and standard deviation can be calculated using pandas.
Kurtosis measures the "fat-tailed" nature of a distribution and can be calculated using the .kurt() function in pandas.
Skew measures the asymmetry of a distribution around its mean and can be calculated using the .skew() function in pandas.
The quantile/percentile function returns the value below which a certain fraction of points lies.
Autocorrelation is the correlation of a time series with itself at a certain lag and can be calculated using the .autocorr() function in pandas.
Correlation between variables in a multivariate time series can be calculated using the .corr() function in pandas.
Differencing is a preprocessing step that can turn a nonstationary time series into a stationary one.
Segmentation is another preprocessing step that can enhance the extracted information by splitting the time series into different segments.
The article demonstrates the feature extraction techniques in action on a dataset and compares the results with previous methods.
There are several different types of features that can be extracted from a time series, and the most common type is called time domain features. By time domain we refer to the usual way we see time series, i.e. how the signal changes over time. This is to contrast it with the frequency domain, where we instead look at which frequencies are present in the signal and with what amplitude. Thus, time domain features are simply features extracted from the time domain of the time series.
Luckily for us (since we use pandas), there are several directly usable functions to extract time domain features. Let’s talk about them first and then later we will try them out in action.
Basic aggregations
As I’m sure many of you are aware, with pandas you can easily calculate .mean(), .median(), .max(), .min(), .var(), .mode() and .std(). These are simple, yet can be quite capable in classification or regression tasks and should always be considered.
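For example, on a pandas Series (ts here is a hypothetical series standing in for your data):

```python
import pandas as pd

# A hypothetical time series standing in for real data.
ts = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 4.0])

basic_features = {
    "mean": ts.mean(),
    "median": ts.median(),
    "max": ts.max(),
    "min": ts.min(),
    "var": ts.var(),
    "mode": ts.mode().iloc[0],  # .mode() returns a Series; take the first value
    "std": ts.std(),
}
```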
Kurtosis
Kurtosis is a measure of how “fat-tailed” a distribution is. A distribution with high kurtosis has heavy tails, i.e. extreme values are common, while a distribution with low kurtosis has light tails, i.e. extreme values are rare. The normal distribution has a kurtosis of 3; for this reason, a measure known as excess kurtosis is often used. This is simply kurtosis - 3 and tells us how fat the tails are compared to the normal distribution. A distribution with fatter tails than the normal distribution (positive excess kurtosis) is called leptokurtic and a distribution with thinner tails than the normal distribution (negative excess kurtosis) is called platykurtic. The formula for kurtosis is:
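$$\mathrm{Kurt}[X] = \frac{\mathbb{E}\big[(X - \mu)^4\big]}{\sigma^4}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.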
With pandas, you can simply write .kurt() to calculate the excess kurtosis.
Kurtosis for different distributions
Skew
Skew or skewness measures the asymmetry of a distribution around its mean. A negative skew indicates that the distribution has a longer tail on its left side, while a positive skew indicates a longer tail on its right side. Symmetric distributions have a skew of 0. The formula for skew is:
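$$\mathrm{Skew}[X] = \frac{\mathbb{E}\big[(X - \mu)^3\big]}{\sigma^3}$$

with $\mu$ and $\sigma$ as before.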
With pandas, skew is easily calculated with .skew().
Quantile/Percentile-function
The quantile function and the percentile function are the same, except that the percentile function takes values between 0 and 100 as input while the quantile function takes values between 0 and 1. I will use quantile for the rest of the article. The function takes a value between 0 and 1 and returns the value below which that fraction of points lies. For example, quantile(0.8) is the point above 80% of the points and quantile(0.1) is the point above 10% of the points. Extracting several such points can make great features, and this is easily done with .quantile(...) in pandas.
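For example, several quantiles at once (again on the hypothetical series ts):

```python
# Passing a list of fractions returns a Series with one value per quantile.
quantile_features = ts.quantile([0.1, 0.25, 0.5, 0.75, 0.9])
```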
Autocorrelation
Autocorrelation is the correlation of a time series with itself at a certain lag and is a helpful measure to characterize time series. Just like the previous measures, it is simple in pandas: .autocorr(lag=...).
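For instance, the autocorrelation at a few different lags (on the hypothetical ts), where each lag becomes one feature:

```python
# Autocorrelation at a few lags; the lags chosen here are illustrative.
autocorr_features = {f"autocorr_lag_{lag}": ts.autocorr(lag=lag) for lag in (1, 2, 3)}
```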
Correlation
If you have a multivariate time series you can calculate correlations between the variables. This is also trivial in pandas: .corr(). Alternatively, you can calculate the covariance with .cov(), where you obtain the variance of each variable at the same time. Note that both of these methods return a symmetric matrix. Since duplicate values are of no use in feature extraction, and a flat vector is more convenient anyway, you can do the following:
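Here is one way to do it, a sketch assuming the multivariate series is a hypothetical DataFrame df with one column per variable:

```python
import numpy as np

# df is a hypothetical DataFrame with one column per variable.
corr = df.corr().to_numpy()

# Keep only the values strictly above the diagonal (k=1 also drops the diagonal
# of 1s), turning the symmetric matrix into a flat vector without duplicates.
rows, cols = np.triu_indices_from(corr, k=1)
corr_features = corr[rows, cols]
```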
Another consideration is whether the time series variables are “aligned”. What if the correlation is very small only because the variables are lagged versions of each other? I wrote a short article about this and how you can synchronize the time series, linked below:
Now, all these features can be calculated for any time series, but there are many ways you could preprocess the time series to amplify the extracted information. For instance, stationarity is an important concept in time series forecasting, but it can also be useful for information extraction. Differencing is one way to turn a nonstationary time series into a stationary one (in pandas it is .diff()). To learn more, you can read my article below about another time series feature extraction method, where I go into more detail on this topic.
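Differencing in code, on the hypothetical ts from before:

```python
# First difference; .diff() leaves a NaN in the first position, so drop it.
ts_diff = ts.diff().dropna()
```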
One way to get the best of both worlds is to apply the feature extraction to both the non-differenced time series and then again after differencing it. You can always try different options and see how the results change and what works best.
Another method for enhancing the extracted information is to segment the time series. Splitting the time series into different segments and performing the feature extractions mentioned above on each segment separately can be a considerable advantage, as sketched below.
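A minimal segmentation sketch (the number of segments and the per-segment features are illustrative choices):

```python
import numpy as np

# Split the series into equal-length segments and compute features per segment.
n_segments = 4
segment_features = []
for segment in np.array_split(ts.to_numpy(), n_segments):
    segment_features.extend([segment.mean(), segment.std()])
```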
What’s important is that all preprocessing steps are systematic, i.e. applied the same way to each and every time series. Otherwise, if each individual time series is handled differently, the features won’t be comparable. For instance, if you take the first difference of one time series but the second difference of another, the comparison won’t make sense.
Feature extraction in action
We’ve gone over the feature extraction techniques and it’s time to put it all together in action. To test the methods, I will use the same dataset I used in the article I referenced above (The powerful feature extraction method you’ve never heard of). Another article had previously applied a CNN to this same dataset and that was my comparison for that article. Thus, this time I have two results to compare with. The CNN achieved an accuracy of ~96.8% and in my later article, I obtained ~97.7% with AR-coefficients and a logistic regression model. Let’s see what we can do this time.
First, we load the data:
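A minimal loading sketch, assuming the train and test splits are CSV files with one time series per row and the label in the first column (the file names are placeholders):

```python
import pandas as pd

# Placeholder file names; assumed layout: label in the first column,
# the time series values in the remaining columns.
train = pd.read_csv("train.csv", header=None)
test = pd.read_csv("test.csv", header=None)

y_train, X_train = train.iloc[:, 0], train.iloc[:, 1:]
y_test, X_test = test.iloc[:, 0], test.iloc[:, 1:]
```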
Below the first 3 time series are visualized:
3 time series from the dataset
They look stationary, and in a unit-root test done in the referenced article the null hypothesis of a unit root (nonstationarity) was rejected at the 0.05 significance level.
Now the feature extraction:
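Here is a minimal sketch of one way to implement it, combining the features covered above into a single function; the exact feature set, quantiles and lags are illustrative choices:

```python
import pandas as pd

def extract_features(ts: pd.Series, n_diffs: int = 0) -> pd.Series:
    """Compute time domain features for one series, optionally differencing first."""
    for _ in range(n_diffs):
        ts = ts.diff().dropna()
    features = {
        "mean": ts.mean(), "median": ts.median(),
        "max": ts.max(), "min": ts.min(),
        "var": ts.var(), "std": ts.std(),
        "mode": ts.mode().iloc[0],
        "kurt": ts.kurt(), "skew": ts.skew(),
    }
    for q in (0.1, 0.25, 0.75, 0.9):
        features[f"quantile_{q}"] = ts.quantile(q)
    for lag in (1, 2, 3):
        features[f"autocorr_{lag}"] = ts.autocorr(lag=lag)
    return pd.Series(features)

# Each row of X_train/X_test is one time series.
X_train_feats = X_train.apply(extract_features, axis=1)
X_test_feats = X_test.apply(extract_features, axis=1)
```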
And that’s it! Let’s split off another validation set from the training data to perform some hyperparameter tuning. The only hyperparameter I will tune is the order of differencing. As mentioned before, it is always helpful to see how the performance changes with or without differencing and/or other preprocessing steps. I will keep the model fixed with its default parameters (a RandomForestClassifier), although to maximize performance it should also be tuned.
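Here is a sketch of the setup, using the hypothetical extract_features from above (split size and random seeds are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out a validation set from the training data for the tuning below.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)

def evaluate(n_diffs: int) -> float:
    """Extract features at the given differencing order, then fit and score."""
    feats_tr = X_tr.apply(lambda r: extract_features(r, n_diffs=n_diffs), axis=1)
    feats_val = X_val.apply(lambda r: extract_features(r, n_diffs=n_diffs), axis=1)
    model = RandomForestClassifier(random_state=0)
    model.fit(feats_tr, y_tr)
    return accuracy_score(y_val, model.predict(feats_val))
```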
First test: no differencing.
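```python
# Using the hypothetical helper from above, on the raw (undifferenced) series:
evaluate(n_diffs=0)
```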
With ~95.6% accuracy the result is not very impressive, beaten by both previous methods. Is differencing once better, then?
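```python
# Same evaluation, now differencing once:
evaluate(n_diffs=1)
```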
Wow, ~98.5%! That’s a huge difference. An increase of almost 3 percentage points and a better result than both of the previous methods (although still on the validation set). And this even though the time series were already stationary! This illustrates the impact that preprocessing the time series can have. Now we need to test another difference:
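```python
# Differencing twice:
evaluate(n_diffs=2)
```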
Alright, ~97.9%, slightly worse. Let’s test one more differencing and then call it a day.
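```python
# And a third difference:
evaluate(n_diffs=3)
```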
No way… ~99.4%?! That’s ridiculous, completely crushing all previous results. What theory states that you should even try differencing 3 times? Classification is just something completely different from forecasting. I didn’t expect these results; a learning experience for me as well. To see if this metric holds, I will now train on all training time series and then calculate a final performance score on the test set:
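Continuing the sketch, with three differences:

```python
# Train on all training series with three differences; score on the test set.
feats_train = X_train.apply(lambda r: extract_features(r, n_diffs=3), axis=1)
feats_test = X_test.apply(lambda r: extract_features(r, n_diffs=3), axis=1)

model = RandomForestClassifier(random_state=0)
model.fit(feats_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(feats_test))
```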
Once again, ~99.4%. Way better than both previous methods. Though, I would expect that the AR-coefficients method would perform better if the time series were also preprocessed with the same number of differencing steps. Perhaps the CNN too. To finalize, let’s look at the feature importances:
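One way to inspect them, continuing the sketch:

```python
import pandas as pd

# Rank features by the random forest's impurity-based importances.
importances = pd.Series(model.feature_importances_, index=feats_train.columns)
importances.sort_values(ascending=False).plot.barh()
```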
We see that the mode, unexpectedly (since the data is continuous), is one of the most important features. The autocorrelations also had a large impact. The importances will naturally fluctuate between datasets.
Conclusion
Time domain feature extraction should probably be the first thing you experiment with in a time series classification or regression task. It is easy, fast and interpretable. In the experiments, differencing proved to be an impactful preprocessing step, completely changing the results. Next time, try differencing one more time, just to see. Sometimes miracles happen. Thanks for reading.
If you’re interested in reading more articles about data science, check out my reading list below: