Petr Korab

Summary

The article outlines a methodological approach for incorporating sentiment from text data into time-series econometric models to enhance their predictive capabilities.

Abstract

The article discusses the integration of qualitative text data into quantitative time-series models, detailing a structural plan for text data pre-processing. It emphasizes the transformation of raw text into quantitative data, suitable for time-series analysis, through various NLP techniques and sentiment analysis using tools like VADER. The process involves cleaning text data to enhance informational value, vectorizing text for sentiment classification, and aggregating sentiment scores into a monthly time series for econometric modeling. The article also presents an empirical example using a large dataset of news headlines, demonstrating the feasibility of processing large text datasets with the Polars library in Python, and highlights the potential of sentiment analysis in forecasting economic indicators such as GDP.

Opinions

  • The author advocates for the use of qualitative information from text data to extend the possibilities of quantitative time-series models.
  • The Polars library is recommended for efficiently handling large datasets in Python, with a notable performance advantage over Pandas.
  • Sentiment analysis, particularly using the VADER classifier, is presented as a valuable tool for constructing time series that reflect public sentiment.
  • The article suggests that sentiment derived from text data correlates with macroeconomic trends, especially during significant economic events like recessions and inflation shocks.
  • The author encourages readers to engage with their work by inviting them to support their writing and subscribe to their email list for updates on future articles.

Text Data Pre-processing for Time-Series Models

Have you ever thought about how sentiment from text data can be used as a regressor in time-series models?

Photo by Kaleidico on Unsplash

Introduction

Text data offer qualitative information that can be quantified, aggregated, and used as a variable in time-series models. Simple methods of text representation, such as one-hot encoding of categorical variables and word n-grams, have been used since NLP’s early beginnings. Over time, more complex methods, including the bag-of-words model, found their way into representing text data for machine learning algorithms. Building on the distributional hypothesis formulated by Harris [1] and Firth [2], modern models such as Word2Vec [3], [4], GloVe [6], and ELMo [5] use vector representations of words in their neural network architectures. Since computers process text as vectors, text can serve as a variable in time-series econometric models.

In this way, we can extract qualitative information from text and use it to extend the possibilities of quantitative time-series models.

In this article, you’ll learn more about:

  • How to use qualitative information from text for quantitative modeling
  • How to clean and represent text data for time-series models
  • How to work efficiently with 1 million rows of text data
  • End-to-end coding example in Python.

In our recent conference paper [7], we developed a structural plan for text-data pre-processing that can be applied to areas such as: (1) predicting exchange rates with sentiment from social networks, (2) predicting agricultural prices using public news data, and (3) demand prediction in various areas.

1. Structural plan of text data representation

Let’s start with a plan. In the beginning, there is qualitative raw text data collected over time. In the end, we have empirical estimates based on time-varying numerical vectors (= quantitative data). The diagram below shows how we will proceed:
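As a minimal end-to-end illustration of this plan, the sketch below (with made-up example headlines and a placeholder scoring rule standing in for a real sentiment classifier) turns dated raw text into a monthly quantitative series:

```python
from datetime import date
from collections import defaultdict

# Hypothetical raw corpus: qualitative text observations collected over time
corpus = [
    (date(2022, 1, 5), "markets rally on strong earnings"),
    (date(2022, 1, 20), "recession fears weigh on stocks"),
    (date(2022, 2, 3), "inflation eases optimism returns"),
]

def score(text):
    # Placeholder scoring rule; a real pipeline would use a trained classifier
    positive = {"rally", "strong", "optimism", "eases"}
    negative = {"recession", "fears"}
    words = text.split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

# Aggregate scores into a monthly time series (the quantitative end product)
monthly = defaultdict(list)
for d, text in corpus:
    monthly[(d.year, d.month)].append(score(text))
series = {ym: sum(v) / len(v) for ym, v in sorted(monthly.items())}
print(series)  # one average sentiment value per month
```

The resulting dictionary plays the role of the time-varying numerical vector that later enters the econometric model.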

Figure 1. Structural plan of text data representation. Source: Poměnková et al., submitted to MAREW 2023.

2. Empirical example in Python

Let’s illustrate the coding on the News Category Dataset compiled by Rishabh Misra [8], [9] and released under the Attribution 4.0 International license. The data contains news headlines published between 2012 and 2022 on huffpost.com. It was replicated to reach a 1-million-row dataset.

The primary aim is to construct a time series in monthly frequency from news headlines reflecting public sentiment.

The dataset contains 1 million headlines. Because of its size, I used the Polars library, which makes dataframe operations much faster. Compared with mainstream Pandas, it handles large data files far more efficiently. On top of that, the code was run in Google Colab with a GPU hardware accelerator.

The Python code is here, and the data looks like this:

Figure 2. News Category Dataset

2.1. Text data pre-processing

The purpose of text data pre-processing is to remove all redundant information that might bias the analysis or lead to an incorrect interpretation of the results. We’ll remove punctuation, numbers, extra spaces, English stopwords (most common words with low or zero information value), and lowercase the text.

Probably the simplest and most efficient way of cleaning text data in Python is with the cleantext library.

First, define a cleaning function to perform the cleaning operations:

from cleantext import clean

def preprocess(text):
    # clean() lowercases the text and removes punctuation,
    # extra spaces, stopwords, and numbers in one pass
    output = clean(str(text), punct=True,
                              extra_spaces=True,
                              stopwords=True,
                              lowercase=True,
                              numbers=True)
    return output
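For readers who prefer not to add a dependency, the same cleaning steps can be approximated with the standard library. This is a rough equivalent, not cleantext's actual implementation, and the stopword list here is a small illustrative subset rather than a full English stopword list:

```python
import re
import string

# Illustrative subset of English stopwords; use a full list in practice
STOPWORDS = {"the", "a", "an", "is", "in", "on", "of", "and", "to"}

def preprocess_stdlib(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", "", text)                                   # drop numbers
    tokens = [w for w in text.split() if w not in STOPWORDS]          # drop stopwords
    return " ".join(tokens)                                           # collapses extra spaces

print(preprocess_stdlib("The Dow rose 2% on Monday, defying gloomy forecasts!"))
# dow rose monday defying gloomy forecasts
```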

Next, we clean the 1-million-row dataset with Polars:

import polars as pl

data_clean = data.with_columns([
    pl.col("headline").apply(preprocess)
])

The clean dataset contains text with maximum informational value for further steps. Any unnecessary strings and digits reduce the accuracy of the final empirical modeling.

2.2. Text data representation

Data representation involves methods used to represent data in a computer. Since computers work with numbers, we select an appropriate model to vectorize the text dataset.
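For instance, a minimal bag-of-words representation maps each document to a vector of word counts over a shared vocabulary. The hand-rolled sketch below shows the idea; in practice a tool such as scikit-learn's CountVectorizer does this:

```python
from collections import Counter

docs = ["stocks fall on rate fears", "stocks rise as fears fade"]

# Build a shared, sorted vocabulary over all documents
vocab = sorted({w for doc in docs for w in doc.split()})

# Each document becomes a count vector aligned with the vocabulary
vectors = [[Counter(doc.split())[w] for w in vocab] for doc in docs]
print(vocab)
print(vectors)
```

Each row of `vectors` is a numeric representation of one document, which is exactly the form a downstream model can consume.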

In our project, we are constructing a time series of sentiment. For this use case, the pre-trained sentiment classifier VADER (Valence Aware Dictionary and Sentiment Reasoner) is a good choice. Read my previous article to learn more about this classifier, along with some other alternatives.

The classification with vaderSentiment library looks in the code as follows. First, create the function for classification:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# create a single SentimentIntensityAnalyzer object, reused across rows
sid_obj = SentimentIntensityAnalyzer()

# calculate the compound score
def sentiment_vader(sentence):

    sentiment_dict = sid_obj.polarity_scores(sentence)

    # the compound score summarizes overall sentiment in [-1, 1]
    compound = sentiment_dict['compound']

    return compound
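Conceptually, VADER looks up word valences in its lexicon, applies its heuristic rules, and squashes the resulting sum into [-1, 1]. The toy sketch below illustrates that VADER-style normalization with a made-up mini-lexicon (the real VADER lexicon holds several thousand human-rated entries, and the rule adjustments are omitted here):

```python
import math

# Made-up mini-lexicon of word valences; not VADER's actual lexicon
LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5, "terrible": -3.4}

def compound(sentence, alpha=15):
    # Sum word valences, then normalize into [-1, 1] VADER-style:
    # x / sqrt(x^2 + alpha)
    x = sum(LEXICON.get(w, 0.0) for w in sentence.lower().split())
    return x / math.sqrt(x * x + alpha)

print(round(compound("great good"), 3))  # 0.791
print(round(compound("terrible"), 3))    # negative score
```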

Next, apply the function for the time-series dataset:

# apply the function with Polars

sentiment = data_clean.with_columns([
    pl.col("headline").apply(sentiment_vader)
])

Here is what the result looks like:

Figure 3. Sentiment evaluation

The headline column now holds a compound sentiment score on the scale [-1, 1], reflecting the prevalent emotional content of each row’s headline.

2.3. Time-series representation

The next step in time-series text data representation involves extending the data matrix with a time dimension. It can be achieved by (a) aggregating data along a time axis and (b) selecting a method implementing time-series text data representation. In the case of our data, we’ll do the former and aggregate the sentiment in each row to a monthly frequency.

This code makes the average aggregation of sentiment and prepares monthly time series:

# aggregate over months

timeseries = (sentiment.lazy()
    .groupby("date_monthly")
    .agg(
        [
            pl.col("headline").mean().alias("sentiment")
        ]
    ).sort("date_monthly")
).collect()
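The same average aggregation can be written without Polars, which makes the operation explicit. The sketch below uses hypothetical (month, score) pairs rather than the actual dataset:

```python
from collections import defaultdict

# Hypothetical per-headline sentiment scores keyed by month
rows = [("2022-01", 0.4), ("2022-01", -0.2), ("2022-02", 0.6), ("2022-02", 0.0)]

totals = defaultdict(lambda: [0.0, 0])
for month, score in rows:
    totals[month][0] += score   # running sum of scores per month
    totals[month][1] += 1       # count of headlines per month

# Mean sentiment per month, sorted chronologically
monthly_sentiment = {m: s / n for m, (s, n) in sorted(totals.items())}
print(monthly_sentiment)
```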

2.4. Quantitative modeling

The final step is to use the time series for modeling. As an example, in our recent conference paper [7], we similarly extracted the sentiment from the headlines of research articles published in the top 5 economics journals. Then, we used rolling time-varying correlations with a 5-year window to examine how sentiment relates to GDP and other global economic indicators (see Figure 4).

We hypothesized that sentiment correlates with the macroeconomic environment during periods of sharp recessions and inflation shocks. The results support these considerations, with the exception of one specific journal, for the Oil Shocks of the 1970s, which led to a steep recession accompanied by a massive inflation spike.

Figure 4. Rolling correlations of sentiment and GDP. Source: Poměnková et al., submitted to MAREW 2023.
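A rolling time-varying correlation like the one above can be sketched as a plain Pearson correlation computed over a sliding window. The series below are illustrative, not the paper's data, and the window is shortened for readability:

```python
import math

def pearson(x, y):
    # Standard Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rolling_corr(x, y, window):
    # One correlation per window position, giving a time-varying estimate
    return [pearson(x[i:i + window], y[i:i + window])
            for i in range(len(x) - window + 1)]

sentiment = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6]    # illustrative monthly sentiment
gdp_growth = [1.0, 1.2, 1.1, 1.5, 1.4, 1.7]   # illustrative GDP growth
print([round(c, 2) for c in rolling_corr(sentiment, gdp_growth, 4)])
```

In the paper, the window spans 5 years of observations; the mechanics are identical.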

Conclusions

In this article, we have constructed monthly time series of sentiment from 1 million rows of text data. The key points are:

  • Qualitative information can extend the capacities of quantitative time-series models
  • The Polars library makes pre-processing of large text data feasible even in Python
  • Cloud services such as Google Colab make the processing of extensive text datasets even faster.

The complete code in this tutorial is on my GitHub. The recommended reading is The Most Favorable Pre-trained Sentiment Classifiers in Python.

Did you like the article? You can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!

References

[1] Z. Harris. 1954. Distributional structure. Word, vol. 10, no. 23, pp. 146–162.

[2] J. R. Firth. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, pp. 1–32. Oxford: Philological Society. Reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952–1959, London: Longman 1968.

[3] T. Mikolov, K. Chen, G. S. Corrado and J. Dean. 2013. Efficient estimation of word representations in vector space. International Conference on Learning Representations (ICLR 2013).

[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, vol. 26 (NIPS 2013).

[5] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee and L. Zettlemoyer. 2018. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1.

[6] J. Pennington, R. Socher and C. D. Manning. 2014. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[7] Poměnková, J., Koráb, P., Štrba, D. Text Data Pre-processing for Time-series Modelling. Submitted to MAREW 2023.

[8] Misra, Rishabh. “News Category Dataset.” arXiv preprint arXiv:2209.11429 (2022).

[9] Misra, Rishabh and Jigyasa Grover. “Sculpting Data for ML: The first act of Machine Learning.” ISBN 9798585463570 (2021).

Text Mining
Time Series Analysis
Data Processing
Python