Text Data Pre-processing for Time-Series Models
Have you ever thought about how sentiment from text data can be used as a regressor in time-series models?
Introduction
Text data offer qualitative information that can be quantified, aggregated, and used as a variable in time-series models. Simple methods of text representation, such as one-hot encoding of categorical variables and word n-grams, have been used since NLP’s early beginnings. Over time, more complex methods, including the bag-of-words model, found their way into representing text data for machine learning algorithms. Building on the distributional hypothesis formulated by Harris [1] and Firth [2], modern models such as Word2Vec [3], [4], GloVe [5], and ELMo [6] use vector representations of words in their neural network architectures. Once text is encoded as numerical vectors, it can be used as a variable in time-series econometric models.
In this way, we can extract qualitative information from text and use it to extend the possibilities of quantitative time-series models.
In this article, you’ll learn more about:
- How to use qualitative information from text for quantitative modeling
- How to clean and represent text data for time-series models
- How to work efficiently with 1 million rows of text data
- An end-to-end coding example in Python
In our recent conference paper [7], we developed a structural plan for text-data pre-processing that can be used in areas such as (1) predicting exchange rates with sentiment from social networks, (2) predicting agricultural prices using public news data, and (3) demand prediction in various domains.
1. Structural plan of text data representation
Let’s start with a plan. At the beginning, there is qualitative raw text data collected over time. At the end, we have empirical estimates based on time-varying numerical vectors (= quantitative data). The diagram below outlines how we will proceed:
2. Empirical example in Python
Let’s illustrate the coding on the News Category Dataset compiled by Rishabh Misra [8], [9] and released under the Attribution 4.0 International license. The data contains news headlines published between 2012 and 2022 on huffpost.com. It was replicated to reach a 1-million-row dataset.
The primary aim is to construct a time series in monthly frequency from news headlines reflecting public sentiment.
The dataset contains 1 million headlines. Because of its size, I used the Polars library, which makes dataframe operations much faster; compared to the mainstream Pandas, it handles large data files far more efficiently. On top of that, the code was run in Google Colab with a GPU hardware accelerator.
The Python code is here, and the data looks like this:
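For completeness, here is a minimal sketch of loading the data with Polars. The file name and column names are assumptions based on the public dataset, which ships as newline-delimited JSON:

import polars as pl

# read the newline-delimited JSON file (one news record per line)
data = pl.read_ndjson("News_Category_Dataset_v3.json")

# keep only the columns needed for the sentiment time series
data = data.select(["headline", "date"])
print(data.head())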
2.1. Text data pre-processing
The purpose of text data pre-processing is to remove all redundant information that might bias the analysis or lead to an incorrect interpretation of the results. We’ll remove punctuation, numbers, extra spaces, English stopwords (most common words with low or zero information value), and lowercase the text.
Probably the simplest and most efficient way of cleaning text data in Python is with the cleantext library.
First, define a function that performs the cleaning operations:
from cleantext import clean

def preprocess(text):
    # remove punctuation, extra spaces, stopwords, and numbers; lowercase the text
    output = clean(str(text),
                   punct=True,
                   extra_spaces=True,
                   stopwords=True,
                   lowercase=True,
                   numbers=True)
    return output
Next, we clean the 1-million-row dataset with Polars:
data_clean = data.with_columns([
    # clean each headline (in recent Polars versions, .apply is renamed .map_elements)
    pl.col("headline").apply(preprocess)
])
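As a quick sanity check, a call on a made-up headline:

# illustrative example (made-up headline)
print(preprocess("The 5 Best Movies of 2022!"))
# expected output along the lines of: "best movies"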
The cleaned dataset now carries the maximum informational value for the further steps; unnecessary strings and digits would only add noise to the final empirical modeling.
2.2. Text data representation
Data representation refers to the methods used to encode text in a form a computer can work with. Since computers work with numbers, we select an appropriate model to vectorize the text dataset.
In our project, we are constructing a time series of sentiment. For this use case, the pre-trained sentiment classifier VADER (Valence Aware Dictionary and Sentiment Reasoner) is a good choice. Read my previous article to learn more about this classifier, along with some other alternatives.
The classification with the vaderSentiment library looks as follows in code. First, create the function for classification:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# create a SentimentIntensityAnalyzer object once, not once per row
sid_obj = SentimentIntensityAnalyzer()

# calculate the compound score
def sentiment_vader(sentence):
    sentiment_dict = sid_obj.polarity_scores(sentence)
    # overall (compound) indicator in [-1, 1]
    compound = sentiment_dict['compound']
    return compound
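A quick check with made-up sentences (exact values depend on the VADER lexicon):

# illustrative examples (made-up sentences)
print(sentiment_vader("what a great day for the economy"))  # positive compound score
print(sentiment_vader("terrible news about the crisis"))    # negative compound score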
Next, apply the function to the dataset:
# apply the function with Polars (replaces each headline with its score)
sentiment = data_clean.with_columns([
    pl.col("headline").apply(sentiment_vader)
])
Here is what the result looks like:
The headline column now contains the sentiment score on the scale [-1, 1], reflecting the prevalent emotional content of the headline in each row.
2.3. Time-series representation
The next step in time-series text data representation involves extending the data matrix with a time dimension. This can be achieved by (a) aggregating data along a time axis and (b) selecting a method implementing time-series text data representation. In the case of our data, we’ll do the former and aggregate the sentiment scores at a monthly frequency.
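The aggregation below assumes a date_monthly column marking the month of each headline. A minimal sketch of deriving it from the raw date column (assuming dates are strings in %Y-%m-%d format):

sentiment = sentiment.with_columns([
    # parse the date string and truncate it to the first day of its month
    pl.col("date").str.strptime(pl.Date, "%Y-%m-%d")
    .dt.truncate("1mo")
    .alias("date_monthly")
])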
This code averages the sentiment by month and prepares the monthly time series:
# aggregate over months
timeseries = (
    sentiment.lazy()
    # group by month (renamed .group_by in recent Polars versions)
    .groupby("date_monthly")
    .agg([
        # monthly average of the sentiment scores
        pl.mean("headline")
    ])
    .sort("date_monthly")
    .collect()
)
2.4. Quantitative modeling
The final step is to use the time series for modeling. As an example, in our recent conference paper [7], we similarly extracted the sentiment from headlines of research articles published in the top five economics journals. Then, we used rolling time-varying correlations with a 5-year window and looked at how sentiment relates to GDP and other global economic indicators (see Figure 2).
We hypothesized that sentiment correlates with the macroeconomic environment during periods of sharp recessions and inflation shocks. Except for one specific journal, the results support these considerations for the Oil Shocks of the 1970s, which led to a steep recession accompanied by a massive inflation spike.
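To illustrate this last step, here is a minimal sketch of a 5-year (60-month) rolling correlation between the sentiment series and a macroeconomic indicator, using pandas; the gdp_growth column is a hypothetical placeholder, not part of the original data:

import pandas as pd
import numpy as np

# convert the Polars result to pandas for the rolling correlation
df = timeseries.to_pandas().set_index("date_monthly")

# hypothetical macro indicator aligned to the same monthly index
# (placeholder random walk; in practice, join real GDP data here)
rng = np.random.default_rng(0)
df["gdp_growth"] = rng.normal(size=len(df)).cumsum()

# 5-year (60-month) rolling correlation between sentiment and the indicator
rolling_corr = df["headline"].rolling(window=60).corr(df["gdp_growth"])
print(rolling_corr.dropna().tail())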
Conclusions
In this article, we have constructed a monthly time series of sentiment from 1 million rows of text data. The key points are:
- Qualitative information can extend the capacities of quantitative time-series models
- The Polars library makes large-scale text data pre-processing feasible in Python
- Cloud services such as Google Colab make the processing of extensive text datasets even faster
The complete code in this tutorial is on my GitHub. The recommended reading is The Most Favorable Pre-trained Sentiment Classifiers in Python.
References
[1] Z. Harris. 1954. Distributional structure. Word, vol. 10, no. 2–3, pp. 146–162.
[2] J. R. Firth. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, pp. 1–32. Oxford: Philological Society. Reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952–1959, London: Longman 1968.
[3] T. Mikolov, K. Chen, G. S. Corrado and J. Dean. 2013. Efficient estimation of word representations in vector space. International Conference on Learning Representations (ICLR 2013).
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, vol. 26 (NIPS 2013).
[5] J. Pennington, R. Socher and C. D. Manning. 2014. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[6] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee and L. Zettlemoyer. 2018. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1.
[7] J. Poměnková, P. Koráb and D. Štrba. 2023. Text Data Pre-processing for Time-series Modelling. Submitted to MAREW 2023.
[8] R. Misra. 2022. News Category Dataset. arXiv preprint arXiv:2209.11429.
[9] R. Misra and J. Grover. 2021. Sculpting Data for ML: The First Act of Machine Learning. ISBN 9798585463570.