Lan Chu

Summary

The provided content discusses advanced text visualization techniques for uncovering insights and patterns in text data, utilizing tools like ScatterText, Latent Dirichlet Allocation (LDA), BERTopic, and n-gram word clouds.

Abstract

The article "Beyond Words: Unleashing the Power of Text Visualization" delves into the transformative potential of visualizing text data to reveal latent insights and patterns. It emphasizes the importance of moving beyond traditional text analysis by introducing interactive scatter plots through ScatterText, which explores term frequencies and dispersion. The article further elaborates on the application of two topic modeling techniques—Latent Dirichlet Allocation (LDA) and BERTopic—to identify and visualize word and topic distributions within a corpus. Additionally, it demonstrates the creation of n-gram word clouds and frequency analysis to highlight the most significant words in a dataset. The visualizations produced not only facilitate exploratory data analysis but also provide a more engaging and insightful way to understand the content and context of large text corpora, such as the UN General Debate.

Opinions

  • The author advocates for the use of ScatterText as an interactive, scalable tool for visualizing text data, suggesting that it makes exploratory data analysis more enjoyable.
  • The article suggests that term dispersion, as visualized by ScatterText, is crucial for understanding how terms are distributed across different documents, which is not captured by term frequency alone.
  • The author expresses that LDA is a popular and effective model for analyzing large text datasets, capable of revealing topics within documents and their distribution across the corpus.
  • BERTopic is presented as a modern alternative to LDA, leveraging transformers and c-TF-IDF to produce interpretable topics and allowing for a more nuanced understanding of text data.
  • The use of n-gram word clouds and frequency analysis is recommended for their ability to provide a quick and intuitive understanding of the most discussed topics in a corpus.
  • The author implies a preference for visual tools that not only display data but also facilitate the interpretation of complex textual information.

Beyond Words: Unleashing the Power of Text Visualization

The written word is a powerful thing. It started with the invention of the first written language by the ancient Sumerians, and later the Gutenberg press enabled the written word to spread knowledge. Discovering latent insights and patterns in text data can be quite challenging, which is why visualizing it is an important step. In this article, we will explore stunning ways to visualize text data. First, we’ll use ScatterText to present the text data in sexy, interactive scatter plots to explore term frequencies and dispersion. Then, we’ll use two powerful topic modeling techniques, Latent Dirichlet Allocation (LDA) and BERTopic, to uncover and present word and topic distributions within the corpus. Last but not least, we’ll create n-gram word clouds and frequency charts, and rank terms by TF-IDF, to showcase the most frequent and important words in a corpus. Let’s get visualizing!

1. Data extraction and pre-processing

But first things first, let us acquire the data and do some pre-processing ;)

Photo by Mathias Reding on Unsplash

This article uses the UN General Debate corpus. It contains all the statements made by each country’s representative at the UN General Debate from 1970 to 2020. You can gain extra insight into its contents by reading this paper. This is open data and is available online here. You will need to request access to the data. The data is provided as text files, and the following block of code can be used to extract the texts and save them into a data frame.

import pandas as pd
import os
import re
import nltk
from tqdm import tqdm

nltk.download("averaged_perceptron_tagger")

dir_path = os.path.dirname(os.path.abspath("__file__"))
main_data_dir = os.path.join(dir_path, "TXT")


def open_speech(file_path):
    """
    This function opens a file with the correct formatting
    :param file_path:
    :return:
    """

    # use a context manager so that the file is always closed again
    with open(file_path, encoding="utf-8-sig") as file:
        data = file.read()

    return data


def remove_line_number(speech):
    """
    removes the line number at the beginning of speech

    Parameters
    ---------
    speech : str
        piece of text
    """

    # raw strings avoid invalid-escape warnings for \d, \w and \g
    pattern = r"\n|^\d+.*?(\w)"
    speech = re.sub(pattern, r"\n\g<1>", speech)
    pattern = r"\t"
    speech = re.sub(pattern, "", speech)
    pattern = r"\n\n"
    speech = re.sub(pattern, "\n", speech)
    pattern = r"^\n *"
    speech = re.sub(pattern, "", speech)

    return speech


if __name__ == "__main__":
    # True --> run preprocessing and save the results, False --> just do the data analysis with your previously saved
    # dataframe file (always have to do a preprocessing run to save the dataframe of course)
    do_preprocessing = True

    if do_preprocessing:
        # collect one dict per speech; DataFrame.append was removed in pandas 2.0
        rows = []

        num_directories = len(next(os.walk(main_data_dir))[1])

        # loop through all directories of the data
        for root, subdirectories, files in tqdm(
            os.walk(main_data_dir), total=num_directories, desc="directory: "
        ):
            # skip all the files starting with '.' (e.g. files created by opening a mac directory
            # on a windows PC)
            files_without_dot = [file for file in files if not file.startswith(".")]

            # loop through files and extract data
            for file in tqdm(files_without_dot, desc="files: ", leave=False):
                country, session_nr, year = file.replace(".txt", "").split("_")

                # open a speech with the correct formatting
                speech_data = open_speech(os.path.join(root, file))
                speech_data = remove_line_number(speech_data)

                # collect the features of this speech
                rows.append(
                    {
                        "session_nr": int(session_nr),
                        "year": int(year),
                        "country": country,
                        "speech": speech_data,
                    }
                )

        # build the data frame once, after all the files have been read
        speeches_df = pd.DataFrame(
            rows, columns=["session_nr", "year", "country", "speech"]
        )
        os.makedirs("Data/Raw", exist_ok=True)
        speeches_df.to_csv("Data/Raw/raw_speeches.csv")

Data pre-processing pipeline

Technically, any text document is just a sequence of characters. To analyze the content, we first need to transform them into meaningful sequences of words or tokens and remove the noise such as frequent words carrying little meaning. As part of preprocessing, we will use the following configuration to build a simple data processing pipeline:

  • Remove the first sentences of the speech (welcoming sentences which are nearly identical across documents)
  • Lower case
  • Tokenize (split the documents into tokens).
  • Remove stop words
  • Lemmatize
  • Create a second filter — an (additional) list of stop words. Sometimes it is helpful to filter out specific frequent but uninteresting words for the visualization.

The following block of code can be used to process the data. Feel free to use the processed data set at this link.

import re
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import pos_tag
import scattertext as st

nltk.download("averaged_perceptron_tagger")
nltk.download("omw-1.4")
nltk.download("wordnet")

# nltk.data.find expects the resource path including its category folder
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")

try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    nltk.download("stopwords")

try:
    nltk.data.find("sentiment/vader_lexicon")
except LookupError:
    nltk.download("vader_lexicon")

def stem_token(token):
    """
    Stems the given token using the PorterStemmer from the nltk library
    Input: a single token
    Output: the stem of the token
    """
    ps = PorterStemmer()
    stemmed_word = ps.stem(token)
    return stemmed_word

def penn2morphy(penntag):
    """Converts Penn Treebank tags to WordNet."""
    morphy_tag = {"NN": "n", "JJ": "a", "VB": "v", "RB": "r"}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        # treat every other tag as a noun
        return "n"


def lemmatize_token(token):
    """
    Lemmatize the token using nltk library
    Input: a single token
    Output: the lemmatization of the token
    """
    wordnet = WordNetLemmatizer()
    token_tagged = pos_tag([token])
    tag = token_tagged[0][1]
    morphy_tag = penn2morphy(tag)
    lemmatized_word = wordnet.lemmatize(token, pos=morphy_tag)
    return lemmatized_word


def remove_line_number(speech):
    """
    removes the line number at the beginning of speech
    Input: str
    Output: str
    """

    # raw strings avoid invalid-escape warnings for \d, \w and \g
    pattern = r"\n|^\d+.*?(\w)"
    speech = re.sub(pattern, r"\n\g<1>", speech)
    pattern = r"\t"
    speech = re.sub(pattern, "", speech)
    pattern = r"\n\n"
    speech = re.sub(pattern, "\n", speech)
    pattern = r"^\n *"
    speech = re.sub(pattern, "", speech)

    return speech


def filter_common_words(words):
    common_words = [
        "first",
        "like",
        "welcome",
        "pleased",
        "let",
        "good",
        "afternoon",
        "press",
        "conference",
        "meeting",
        "would",
        "outcome",
        "going",
        "know",
        "said",
        "along",
        "together",
        "also",
        "formally",
        "meetings",
        "evening",
        "annual",
        "one",
        "two",
        "second",
        "third",
        "last",
        "next",
        "point",
        "per",
        "answer",
        "ask",
        "say",
        "said",
        "mention",
        "talk",
        "tell",
        "told",
        "suggest",
        "think",
        "wonder",
        "mean",
        "understand",
        "know",
        "maybe",
        "perhaps",
        "remain",
        "generally",
        "thus",
        "member",
        "seem",
        "see",
        "look",
        "consider",
        "regard",
        "include",
        "hear",
        "going",
        "go",
        "goes",
        "come",
        "came",
        "give",
        "use",
        "using",
        "get",
        "can",
        "could",
        "should",
        "may",
        "might",
        "way",
        "yes",
        "no",
        "lot",
        "bit",
        "also",
        "case",
        "fact",
        "like",
        "want",
        "believe",
        "feel",
        "actual",
        "well",
        "kin",
        "moment",
        "time",
        "now"
    ]
    return [word for word in words if word not in common_words]


def remove_first_sentence(speech):
    """
    remove the first sentence
    """
    pattern = r"^.*?\."
    speech = re.sub(pattern, "", speech)

    return speech


def preprocess_speech(speech):
    """
    This function does the preprocessing
    """
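    # the raw CSV stores the text in the "speech" column: drop the opening
    # sentence, then put all characters in lower case
    speech["Text"] = speech["speech"].astype(str).apply(remove_first_sentence).str.lower()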
    speech["Tokens"] = speech["Text"].apply(lambda x: nltk.word_tokenize(str(x)))
    # remove stop words and non-alphabetic from all the text
    stop_word = nltk.corpus.stopwords.words("english")
    speech["Tokens"] = speech["Tokens"].apply(
        lambda x: [word for word in x if (word not in stop_word) and word.isalpha()]
    )
    # lemmatize
    speech["Tokens"] = speech["Tokens"].apply(
        lambda x: [lemmatize_token(token) for token in x]
    )
    # additional filter
    speech["Tokens"] = speech["Tokens"].apply(filter_common_words)
    speech["Joined_Tokens"] = speech["Tokens"].apply(lambda x: " ".join(x))
    speech = speech.sort_values(by="year").reset_index(drop=True)
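    # NOTE: country_code_cleanup is not defined in this article; it is assumed to be a
    # helper that adds the country metadata columns used in the visualizations below
    speech = country_code_cleanup(speech)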
    # create a scattertext object for visualization
    speech['parse'] = speech.Joined_Tokens.apply(st.whitespace_nlp_with_sentences)
    return speech


speech = pd.read_csv("Data/Raw/raw_speeches.csv", index_col=0)
speech = preprocess_speech(speech)

speech.to_csv("Data/Processed/preprocessed_speech.csv")

The final data frame will have the following attributes:

UN general debate corpus.

2. Text Visualization

2.1 ScatterText for Term Frequency and Dispersion

ScatterText is an interactive, scalable Python tool that visualizes text data in a scatter plot; it can display a large number of the words and phrases used in a corpus on an HTML page. Exploratory data analysis just gets so much more fun with this 😊 The official GitHub repo can be found here. I will only focus on term frequency and dispersion; feel free to look into the repo for other types of text visualizations.

Term Frequency vs Dispersion without distinguishing document category

One insightful thing to do with text data is to plot term dispersion against term frequency and identify the terms that are the most and least dispersed given their frequencies. Term dispersion tells us how a term is distributed among the documents. A term that appears multiple times in one document but not in any other has a low dispersion, whereas a term that appears with similar frequency across all documents has a high dispersion.

First, we need to prepare the data to be used in the ScatterText visualizations. The st.whitespace_nlp_with_sentences function is perfect for that. It is a lightweight pre-processing step that tokenizes the text on whitespace and performs sentence segmentation. The output for each text is a spaCy-like document object (it mimics the interface of a spaCy Doc without running spaCy), which represents the tokenized and parsed document.

import scattertext as st

# "speech" is the preprocessed data frame created in the previous section
speech["parse"] = speech.Joined_Tokens.apply(st.whitespace_nlp_with_sentences)
speeches_2020 = speech[speech["year"] == 2020]

At this point, we are not trying to distinguish between document categories, therefore we will use the st.CorpusWithoutCategoriesFromParsedDocuments class to create a ScatterText corpus object using the “parse” column created in the previous step as the input. This will return a version of the corpus where each document is represented by a bag of words using unigrams. Next, we remove infrequent words from the corpus using a minimum term count threshold of 6 and rank the terms in the corpus using their absolute frequency. This processed corpus can be used for further analysis and visualization with ScatterText.

import scattertext as st
from scattertext.termranking import AbsoluteFrequencyRanker

corpus = (
    st.CorpusWithoutCategoriesFromParsedDocuments(speeches_2020, parsed_col="parse")
    .build()
    .get_unigram_corpus()
)
# remove_infrequent_words returns a new corpus, so keep the result
corpus = corpus.remove_infrequent_words(
    minimum_term_count=6, term_ranker=AbsoluteFrequencyRanker
)
corpus.get_categories()

Next, to plot the frequency and dispersion of all the terms, we need a data frame with 2 columns: frequency and dispersion score. We will create this data frame that captures the frequency of each term and scores of various dispersion measures. These will be shown after a term is activated in the plot.

dispersion = st.Dispersion(corpus)
dispersion_df = dispersion.get_df()
dispersion_df.head(5)

The above code should return:

For this example, we will use Rosengren’s S (Gries 2021) to measure the dispersion of each term across the documents. To start plotting, we need to add the coordinates of each term to the data frame. We will create the Xpos and Ypos columns by scaling the original frequency (X) and dispersion score (Y) values with the log_scale() and scale() methods of st.Scalers:

dispersion_df = dispersion_df.assign(
    X=lambda df: df.Frequency,
    Xpos=lambda df: st.Scalers.log_scale(df.X),
    Y=lambda df: df["Rosengren's S"],
    Ypos=lambda df: st.Scalers.scale(df.Y),
)

Finally, we can now plot the scatter graph using the dataframe_scattertext function and write the scatter plot in a stand-alone interactive HTML file:

html = st.dataframe_scattertext(
    corpus,
    plot_df=dispersion_df,
    ignore_categories=True,
    color_score_column="ColorScore",
    x_label="Log Frequency",
    y_label="Rosengren's S",
    y_axis_labels=["Less Dispersion", "Medium", "More Dispersion"],
)

open("unga_dispersion.html", "wb").write(html.encode("utf-8"))

The following visualization may seem overwhelming at first, but it is in fact a simple view of the words used in the corpus. Each dot corresponds to a term. The vertical position tells us how the term is distributed among the documents. If a term appears the same number of times in all documents, it has a dispersion of 1, meaning it is spread perfectly evenly. If the term appears many times but only in a single document, we get the opposite effect and a dispersion close to 0. This is related to TF-IDF, but the main difference is that TF-IDF scores the importance of a term in a specific document, whereas dispersion looks at how the term is used across all documents.
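
To make the two extremes concrete, here is a minimal sketch of Rosengren’s S in plain Python. It assumes the commonly cited formulation S = (Σ √(s_i · v_i))² / f, where s_i is the share of the corpus held by document i, v_i is the term’s count in document i, and f is the term’s total count. The function name and the toy numbers are made up for illustration, and Scattertext’s own implementation may differ in details.

```python
import math


def rosengren_s(term_counts, doc_sizes):
    """Rosengren's S for one term (illustrative helper, not part of Scattertext).

    term_counts: occurrences of the term in each document
    doc_sizes:   total number of tokens in each document
    """
    total_size = sum(doc_sizes)
    total_count = sum(term_counts)
    # s_i: share of the whole corpus that document i accounts for
    shares = [size / total_size for size in doc_sizes]
    root_sum = sum(math.sqrt(s * v) for s, v in zip(shares, term_counts))
    return root_sum**2 / total_count


# four documents of equal size; the term occurs 20 times in total
evenly_spread = rosengren_s([5, 5, 5, 5], [100, 100, 100, 100])
one_document = rosengren_s([20, 0, 0, 0], [100, 100, 100, 100])
```

With these toy numbers, the evenly spread term scores 1, while the term confined to one of four equally sized documents scores 0.25, which is the share of that document. Under this formulation the lower bound shrinks toward 0 as the number of documents grows.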

We can see various dispersion statistics under a term’s name, in addition to the standard usage statistics. Use the search bar to search for each term and see all the documents in the corpus where the term appears. Image by author, code adapted from here.

Term frequency vs. dispersion with a document category

Finding words and phrases that discriminate between categories of text is a common application of NLP. Scattertext is also intended for visualizing which words and phrases are more characteristic of one category than of the others. Let’s identify differences in word usage between developed and developing countries in the UN General Debate Corpus. To look for differences in the terms that developed and developing countries choose in their speeches, set the category_col parameter to the ‘Developed / Developing Countries’ indicator.

corpus_2020 = (
    st.CorpusFromParsedDocuments(
        speeches_2020,
        category_col="Developed / Developing Countries",
        parsed_col="parse",
    )
    .build()
    .get_unigram_corpus()
    .compact(st.AssociationCompactor(2000))
)

html = st.produce_scattertext_explorer(
    corpus_2020,
    category="Developing",
    category_name="Developing Countries",
    not_category_name="Developed Countries",
    minimum_term_frequency=5,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    metadata=corpus_2020.get_df()["Country or Area"],
    transform=st.Scalers.dense_rank,
)
open("./un_dispersion_category.html", "w", encoding="utf-8").write(html)

Scatter plot of term frequency and term dispersion using UNGDC 2020. The interesting things are usually observed at the upper-left and lower-right corners. In the (middle) upper-left corner, words like “excellency” and “debt” are frequently used by Developing countries but infrequently by Developed countries. Likewise, terms frequently used by Developed countries and infrequently by Developing ones occupy the bottom-right corner. These include “European” or “nuclear”. The terms used most frequently by both country categories are displayed on the top-right of the visualization, e.g. “pandemic”. Terms associated with Developing countries are in blue, and those more associated with Developed countries are in red. Image by author, code adapted from here.

2.2 Topic Model Visualizations

Text visualization is also useful for topic modeling, a technique used to uncover hidden (“latent”) topics and themes in a collection of documents. A topic model tells us which topics exist in the corpus and how they are distributed across the documents.

A. Topic Model Visualization with Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (Blei et al., 2003) is a popular model for analyzing large amounts of text. It is a generative probabilistic model that returns the topic distribution of each document and the word distribution of each topic. If you are interested in knowing how LDA works, I have an article that is worth having a look at. In this example, we will use the LDA algorithm to discover the topics that appear in the UN General Debate corpus and visualize them. The visualization allows us to quickly see the words that are most relevant to a topic and the distances between topics.

Let’s install and import some relevant libraries:

pip install pyldavis
pip install gensim

from gensim.corpora import Dictionary
from gensim import corpora, models
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

Moving on, we will create a training corpus from our texts. We start by converting each document (a list of tokens) into a bag of words, which is a list of (token_id, token_count) tuples. gensim.corpora.Dictionary is a great tool for this: it maps every distinct token to an integer id, and its doc2bow method does the counting:

#create a Gensim corpus from a list of texts
Texts = list(speeches_2020['Tokens'])
dictionary = corpora.Dictionary(Texts)
corpus = [dictionary.doc2bow(text) for text in Texts]
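
To illustrate what the bag-of-words step produces, here is a minimal plain-Python sketch on two toy documents. The helper names build_vocabulary and to_bow are hypothetical and only mimic the idea; Gensim’s Dictionary may assign different integer ids to the tokens.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


toy_texts = [["peace", "security", "peace"], ["climate", "security"]]
vocab = build_vocabulary(toy_texts)
bows = [to_bow(tokens, vocab) for tokens in toy_texts]
# vocab -> {"peace": 0, "security": 1, "climate": 2}
# bows  -> [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The word order inside each document is lost; only the ids and their counts remain, which is all that LDA needs as input.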

Now let’s build an LDA topic model. We will use models.ldamodel.LdaModel for training the LDA model and pyLDAvis for visualizing the topic. A lot of parameters can be tuned to optimize the training, such as the number of topics, chunk size, eta (a-priori belief on topic-word distribution), and alpha (a-priori belief on document-topic distribution).

lda_sym = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    update_every=1,
    chunksize=100000,
    passes=100,
    alpha="auto",
    eta="auto",
)

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_sym, corpus, dictionary)
vis

In the visualization below, you can see 10 bubbles, since I chose 10 topics for this corpus in the code above. The visualization provides information on the topics and the words that are important for each topic. The most important words change as you hover over each topic. LDA, unfortunately, does not label the topics for us; it returns the word distribution of each topic, from which we, as users, have to infer what the topic actually means. The further apart two bubbles are, the more different the topics. As observed, topic 1 and topic 2 are more related than, for instance, topic 1 and topic 7: topics 1 and 2 are more closely about pandemics and health, while topic 7 is about global social and economic developments in some countries around the Middle East.

Each bubble represents a topic. Blue bars represent the overall term frequency of each word in the corpus. Red bars give the estimated number of times a given term was generated by a given topic. The word with the longest red bar is the word that is used the most by the speeches belonging to that topic. Security, peace, Russia, and Ukraine appear to be among the most commonly used words in this latent topic 1. Image by author.

B. Topic Model Visualization with BERTopic

Another technique for topic modeling is BERTopic, an algorithm that leverages 🤗 transformers and the c-TF-IDF score, a modification of the traditional TF-IDF score that takes into account the distribution of words at the cluster/category/topic level instead of the document level. This results in a score that reflects the importance of a word in a specific topic while also accounting for its overall frequency in the entire corpus. BERTopic embeds the c-TF-IDF representation of the topics in 2D and then visualizes the two dimensions using Plotly, which gives us an interactive view of interpretable topics. As such, we can visualize the generated topics in a way similar to pyLDAvis for LDA.
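
As a rough sketch of the idea, the snippet below computes a c-TF-IDF-style score in plain Python, following the formula given in the BERTopic documentation: the frequency of a term within a class, multiplied by log(1 + A / f), where A is the average number of words per class and f is the term’s frequency across all classes. The function name and the toy classes are hypothetical, and BERTopic’s own implementation differs in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score every term of every class (illustrative helper, not BERTopic's code).

    class_tokens: dict mapping a class label to the tokens of all documents
                  in that class, concatenated into one list
    """
    term_freq = {label: Counter(tokens) for label, tokens in class_tokens.items()}
    # f: frequency of each term across all classes
    total_freq = Counter()
    for counts in term_freq.values():
        total_freq.update(counts)
    # A: average number of words per class
    avg_words = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)
    return {
        label: {
            term: count * math.log(1 + avg_words / total_freq[term])
            for term, count in counts.items()
        }
        for label, counts in term_freq.items()
    }


toy_classes = {
    "health": ["pandemic", "vaccine", "pandemic", "health"],
    "security": ["peace", "security", "peace", "health"],
}
scores = c_tf_idf(toy_classes)
top_health_term = max(scores["health"], key=scores["health"].get)
top_security_term = max(scores["security"], key=scores["security"].get)
```

In this toy example, “pandemic” comes out as the top term of the first class and “peace” as the top term of the second, while “health”, which occurs in both classes, gets the lowest score in each.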

Visualize Topics and Terms

First, we need to train our BERTopic model:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Convert training set to list of documents
docs = speech["Joined_Tokens"].drop_duplicates().to_list()

# Train the BERTopic model
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")
topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    nr_topics=5,
    min_topic_size=2,
    calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(docs)

Moving on, let’s take a look at a few topics that we get out of training this way by running topic_model.get_topic_info(). We can see several interesting topics appearing here. They seem to relate to the topics returned by LDA:

Then, we can simply call .visualize_topics() to create a 2D representation of the topics within the corpus. The result is an interactive Plotly graph that shows general information about each topic, including the size of the topic and its most important words.

topic_model.visualize_topics(width=500, height=500)
We can clearly see that there are two clusters of topics here. Image by author.

A useful tool for understanding the most important words in each topic, for interpreting and labeling the topics, and for comparing topic representations with each other is .visualize_barchart(). Calling it creates bar charts of the top representative terms of each topic and their corresponding c-TF-IDF scores.

In this bar chart visualization, the words are sorted by their c-TF-IDF score, with the highest-scoring words appearing at the top of the bar. The length of each bar represents the c-TF-IDF score of the corresponding word. Image by author.

Visualize documents

If you are looking for a tool to explore the distribution of documents across topics and to gain insights into the structure of a corpus, topic_model.visualize_documents() is a great choice. By visualizing the distribution of documents across topics, it becomes possible to see which documents are most similar to each other, and it also helps to check whether documents were assigned to the right topic.

The first step is converting the documents to embeddings (a fancy way of saying that we represent text as an array of numbers). We will use the pre-trained “all-MiniLM-L6-v2” model from SentenceTransformers, a framework for text and image embeddings.

from sentence_transformers import SentenceTransformer

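# use the same documents the topic model was trained on, so that every
# document lines up with its assigned topic
docs = speech["Joined_Tokens"].drop_duplicates().to_list()
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
topic_model.visualize_documents(docs, embeddings=embeddings, width=600, height=700)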

The topic_model.visualize_documents() function then plots the documents within each topic. Because we pass in the precomputed embeddings, the function does not have to recalculate them; it only reduces them to a 2-dimensional space for easier visualization:

Hover over the points and see the content of the documents. Image by author.
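
Documents that sit close together in this plot have similar embeddings. As a rough illustration of what “similar” means for arrays of numbers, the sketch below computes cosine similarity, a common way to compare embeddings, on made-up 3-dimensional vectors (real “all-MiniLM-L6-v2” embeddings have 384 dimensions). The vectors and variable names are invented for this example, and visualize_documents itself works on a 2D reduction of the embeddings rather than on this measure directly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up "embeddings" of three documents
doc_health_1 = [0.9, 0.1, 0.0]
doc_health_2 = [0.8, 0.2, 0.1]
doc_security = [0.1, 0.0, 0.9]

sim_same_theme = cosine_similarity(doc_health_1, doc_health_2)
sim_other_theme = cosine_similarity(doc_health_1, doc_security)
```

Here the two “health” vectors point in almost the same direction, so their similarity is much higher than the similarity between a “health” vector and the “security” vector.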

2.3. N-gram Word Cloud

Word clouds visualize the term frequencies by different font sizes, which are much easier to comprehend and compare. The output is an image that depicts different words in different sizes and opacities relative to their frequency in the document.

With collocations=False, the WordCloud module counts single words (unigrams) only. We can also explore a slightly more advanced version of the graph that includes frequent bigrams, i.e., two consecutive words, by setting collocations=True and lowering collocation_threshold to 2, which tells WordCloud to display bigrams in the generated word cloud:

import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

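# build a circular mask so that the words are drawn inside a circle
x, y = np.ogrid[:300, :300]
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130**2
mask = 255 * mask.astype(int)

# join all the documents in the "Joined_Tokens" column into a single string,
# separated by spaces so that words of adjacent documents are not glued together
all_tokens = " ".join(speech["Joined_Tokens"])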

# generate unigram wordcloud
unigram_wordcloud = WordCloud(
    collocations=False, background_color="white", mask=mask
).generate(all_tokens)

# Generate a bi-gram word cloud
bigram_wordcloud = WordCloud(
    collocation_threshold=2, collocations=True, background_color="white", mask=mask
).generate(all_tokens)

# plot the wordclouds side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axes[0].imshow(unigram_wordcloud)
axes[0].set_title("Unigram Wordcloud")
axes[0].axis("off")

axes[1].imshow(bigram_wordcloud)
axes[1].set_title("Bigram Wordcloud")
axes[1].axis("off")

plt.show()
Wordcloud using term frequency. Image by author.

Both word clouds give a quick impression of the most talked-about topics in the United Nations General Assembly debates. While the unigram clearly talks about the pandemic and shows some words that on their own don’t mean much (e.g., right, support, global), the bigram word cloud tells a slightly different story about climate change and peace. Meanwhile, the pandemic is no longer present.

2.4. N-gram Frequencies

Frequently used words and phrases give us a basic understanding of the topics discussed. An n-gram is a sequence of n consecutive words: a unigram is a single word, a bigram is a two-word phrase, and a trigram is a three-word phrase.
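
As a minimal illustration of the definition, the following plain-Python sketch slides a window of n tokens over a toy token list and counts the resulting n-grams. The helper name ngrams and the toy sentence are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined into one string."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "peace security sustainable development peace security".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
most_common_bigram = bigram_counts.most_common(1)[0]
# most_common_bigram -> ("peace security", 2)
```

A list of six tokens yields five bigrams and four trigrams; only “peace security” occurs more than once here.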

There are dozens of ways to produce n-gram frequencies in Python. We can make use of the CountVectorizer class from scikit-learn.

Let’s first create a function “get_top_ngrams” that tokenizes the input corpus, counts the occurrences of each n-gram using CountVectorizer, and returns a data frame containing the n most frequent n-grams along with their frequencies. We then apply the function to the “Joined_Tokens” column of the 2020 speeches to produce the top 20 most frequent unigrams, bigrams, and trigrams, respectively.

def get_top_ngrams(corpus, ngram_range, stop_words=None, n=None):
    vec = CountVectorizer(stop_words=stop_words, ngram_range=ngram_range).fit(corpus)
    bag_of_words = vec.transform(corpus)

    sum_words = bag_of_words.sum(axis=0)

    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    common_words = words_freq[:n]
    words = []
    freqs = []
    for word, freq in common_words:
        words.append(word)
        freqs.append(freq)

    df = pd.DataFrame({"Word": words, "Freq": freqs})
    return df


stop_words = "english"
n = 20
unigrams_st = get_top_ngrams(speeches_2020["Joined_Tokens"], (1, 1), stop_words, n)
bigrams_st = get_top_ngrams(speeches_2020["Joined_Tokens"], (2, 2), stop_words, n)
trigrams_st = get_top_ngrams(speeches_2020["Joined_Tokens"], (3, 3), stop_words, n)

Next, using the following code, we will create the bar charts, with the x-axis representing the frequency of the n-grams, the y-axis representing the n-grams themselves, and the color of the bars representing the frequency of the n-grams mapped to a color scheme.

import seaborn as sns

# Now plot
# Map bar length (frequency) to a color
cmap = plt.get_cmap("RdPu")


def map_color(freqs):
    """Return one color per bar, scaled by the largest frequency in the chart."""
    return [cmap(x / max(freqs)) for x in freqs]


# Plot bigram and trigram distribution side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axes[0].set_title("Top 20 Bigrams", size=10)
sns.barplot(
    x="Freq",
    y="Word",
    data=bigrams_st,
    ax=axes[0],
    palette=sns.color_palette(map_color(bigrams_st["Freq"])),
)
axes[1].set_title("Top 20 Trigrams", size=10)
sns.barplot(
    x="Freq",
    y="Word",
    data=trigrams_st,
    ax=axes[1],
    palette=sns.color_palette(map_color(trigrams_st["Freq"])),
)
fig.tight_layout()
plt.show()

The above code should return the following graph:

Bi-grams and trigrams word frequency at UN’s General Debate Corpus in 2020. Image by author.

Comparing the top 20 bigrams with the trigram frequencies also gives an additional view of what is talked about during the United Nations General Assembly debates. Focusing on bigrams, we would assume that the General Assembly talks in detail about human rights, peace, security, and sustainable development. At the same time, the trigrams add more context to the picture, such as peace and security in the Middle East or stories on small island developing states, which are not present in the top 20 bigrams.

2.5. Top TF-IDF terms

While word count refers to the number of times a word appears in a document and tells us something about the topic of that document, TF-IDF is a more sophisticated method of determining the importance of a word in a document. If you are curious about how TF-IDF works, I have an article that might be useful. Let’s check which words are the most important according to TF-IDF and whether they tell a different story from the term frequencies. The following block of code can be used to create the top TF-IDF words:

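import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


# TfidfVectorizer that returns a dense data frame (documents x terms) instead of a sparse matrix
class DenseTfIdf(TfidfVectorizer):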

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for k, v in kwargs.items():
            setattr(self, k, v)

    def transform(self, x, y=None) -> pd.DataFrame:
        res = super().transform(x)
        df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out() )
        return df

    def fit_transform(self, x, y=None) -> pd.DataFrame:
        # run sklearn's fit_transform
        res = super().fit_transform(x, y=y)
        df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
        return df
    
df_docs_terms_corpus = DenseTfIdf(
    sublinear_tf=True, lowercase=True, stop_words="english"
).fit_transform(speech['Joined_Tokens'])

df_docs_terms_corpus.sum(axis=0).nlargest(n=30).reset_index().rename(
    {0: "score"}, axis=1
).plot.bar(x="index")
The plot of top 30 most important words in the UN Debates Corpus using TF-IDF. Image by author.
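
To see roughly what the vectorizer computes for each document, here is a minimal plain-Python sketch on three toy documents. It assumes the formulas that scikit-learn documents for TfidfVectorizer with sublinear_tf=True and the default smooth_idf=True and norm="l2": a term frequency of 1 + ln(count), an inverse document frequency of ln((1 + n) / (1 + df)) + 1, and an L2 normalization per document. The function name and the toy data are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Sublinear, smoothed, L2-normalised TF-IDF for a list of token lists."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        weights = {
            term: (1 + math.log(count))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # scale every document vector to unit length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_docs = [["peace", "pandemic"], ["peace", "climate"], ["peace", "security"]]
tfidf_rows = tf_idf(toy_docs)
first_doc = tfidf_rows[0]
```

In the first toy document, “pandemic” receives a higher weight than “peace”: both occur once, but “peace” appears in every document and is therefore less distinctive.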

Conclusion

In this article, we have explored several powerful methods for visualizing text data: ScatterText for term frequency and dispersion, LDA and BERTopic for word and topic distributions, and n-gram word clouds, n-gram frequencies, and TF-IDF rankings for the most frequent and important words. By using these visualization techniques, we can gain deeper insights into the structure and meaning of text data, enabling us to uncover hidden patterns and trends.

Reference

  1. https://github.com/JasonKessler/scattertext
  2. https://pyvis.readthedocs.io/en/latest/documentation.html
  3. https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-documents