Topic Modeling with BERTopic: A Cookbook with an End-to-end Example (Part 1)

You may already be familiar with BERTopic, but if not, it is a highly useful tool for topic modeling within the field of natural language processing (NLP). As described on BERTopic’s GitHub page:

BERTopic supports guided, supervised, semi-supervised, manual, long-document, hierarchical, class-based, dynamic, and online topic modeling. It even supports visualizations similar to LDAvis!

What Is BERTopic and Why Use It?

BERTopic is a topic modeling technique that combines transformer-based embeddings (the “BERT” in its name stands for Bidirectional Encoder Representations from Transformers) with a class-based TF-IDF (c-TF-IDF) procedure to identify topics in large text collections. There are several benefits of using BERTopic for topic modeling:

Improved topic quality: BERTopic often produces more coherent and interpretable topics than traditional topic modeling techniques such as Latent Dirichlet Allocation (LDA).
Better handling of large text collections: BERTopic can handle large text collections effectively, which matters for modern text data.
Ability to capture semantic relationships between words: BERTopic leverages the representation power of transformers to capture the semantic relationships between words in the text, which results in more accurate and meaningful topics.
Control over the number of topics: parameters such as nr_topics and min_topic_size let you influence how many topics are extracted, which can be useful in applications where the number of topics is critical.
Easy to adapt: BERTopic builds on embedding models that are pre-trained on large text corpora, and its embedding, dimensionality reduction, clustering, and vectorizer components can each be swapped or tuned for a specific text collection.

The Scenario

In this article, we are going to learn by example. The example also serves as a cookbook that we can use as a template for future use cases. We are going to use a dataset of “restaurant reviews” from Kaggle to categorize the reviews from the patrons into different meaningful categories: “atmosphere”, “food”, “staff”, “service” and so on.

Setting Up Notebook and Importing the Data

Before that, let’s import the packages we need.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from umap import UMAP

import os
import pandas as pd

df = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')

Import and Explore the nature of the dataset

These are a few samples from the dataset.

This is a very well-balanced dataset, probably purposely sampled to be so, with 500 negative reviews (value 0) and 500 positive reviews (value 1).
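A quick way to check the balance is to count the labels. The sketch below uses a tiny made-up frame in place of the real file so that it runs on its own. The label column name `Liked` is an assumption about the Kaggle file; only the `Review` column is confirmed by the code in this article.

```python
import pandas as pd

# Tiny stand-in for the real dataset; the rows are made up for illustration.
# "Liked" is an assumed name for the label column (0 = negative, 1 = positive).
sample_df = pd.DataFrame(
    {
        "Review": [
            "Loved the food.",
            "The service was slow.",
            "Great atmosphere!",
            "Would not come back.",
        ],
        "Liked": [1, 0, 1, 0],
    }
)

# Peek at the first few rows, then count how many reviews fall in each class.
print(sample_df.head())
label_counts = sample_df["Liked"].value_counts().sort_index()
print(label_counts)
```

On the real data, the same two calls would be applied to `df` instead of `sample_df`.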

Checking the length of the documents (or texts)

Next, we calculate how many characters there are in each record. The maximum is 149 characters, so there should be no major concerns with using a Sentence Transformer as the embedding model, since the reviews are mostly short paragraphs.
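The length check can be sketched with pandas string methods. The three reviews below are made up so the example is self-contained; on the real data, `df["Review"]` would take the place of `reviews`.

```python
import pandas as pd

# Made-up reviews standing in for df["Review"].
reviews = pd.Series(
    ["Loved the food.", "The service was slow.", "Great atmosphere!"],
    name="Review",
)

# Number of characters in each review, plus summary statistics.
review_lengths = reviews.str.len()
print(review_lengths.describe())

max_length = int(review_lengths.max())
print(f"Longest review: {max_length} characters")
```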

Before we proceed to the next step, let’s convert the “Review” column into a list (you can also store the documents as a NumPy array).

docs = df.Review.to_list()

Pre-processing

Unlike many traditional NLP methods, where we need to remove stop words (commonly used words in a language, such as “the”, “is”, and “and” in English), BERTopic’s documentation advises against removing stop words before the documents are used to generate the embeddings.

⚠️ DO NOT REMOVE STOP WORDS BEFORE EMBEDDINGS

At times, stop words might end up in our topic representations. This is something we typically want to avoid as they contribute little to the interpretation of the topics. However, removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context in order to create accurate embeddings.

Instead, we can use the CountVectorizer to preprocess our documents after having generated embeddings and clustered our documents. There is almost no disadvantage to using the CountVectorizer to remove stop words.

In BERTopic, this is done by passing a CountVectorizer configured with a stop-word list as the vectorizer_model argument, so the embeddings are still generated from the full texts.
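A minimal sketch of such a vectorizer follows. The variable name `vectorizer_model` matches the one used later in this article, while the two sample reviews are made up purely to show that stop words drop out of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stop words are removed only when building the topic representations,
# not before the documents are embedded.
vectorizer_model = CountVectorizer(stop_words="english")

# Quick check on two made-up reviews: stop words drop out of the vocabulary.
sample_reviews = [
    "The food is great and the staff is friendly",
    "The service was slow",
]
vectorizer_model.fit(sample_reviews)
vocabulary = sorted(vectorizer_model.vocabulary_)
print(vocabulary)
```

The same object can then be handed to BERTopic through its `vectorizer_model` argument.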

Next, we pre-calculate the embeddings so they can be reused without needing to recalculate them. This is handy, especially if your corpus is large and the embeddings take a long time to compute (we will reuse them in a few places below). Take note that:

model_embedding : embedding model based on Sentence Transformer
corpus_embeddings : embeddings generated from the documents (i.e. reviews from the customers)
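One way to make the embeddings reusable across sessions is to cache them on disk. The helper below is a hypothetical sketch, not part of BERTopic or Sentence Transformers: `load_or_compute_embeddings` and the cache file name are made up, and the model name in the usage comment is an assumption.

```python
from pathlib import Path

import numpy as np


def load_or_compute_embeddings(docs, model_embedding, cache_path):
    """Return embeddings for docs, reusing a cached .npy file when present.

    cache_path should end in ".npy", since np.save appends that suffix otherwise.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return np.load(cache_path)
    # SentenceTransformer.encode returns a NumPy array by default.
    corpus_embeddings = np.asarray(
        model_embedding.encode(docs, show_progress_bar=False)
    )
    np.save(cache_path, corpus_embeddings)
    return corpus_embeddings


# Typical use (assumed model name; the model is downloaded on first run):
# from sentence_transformers import SentenceTransformer
# model_embedding = SentenceTransformer("all-MiniLM-L6-v2")
# corpus_embeddings = load_or_compute_embeddings(docs, model_embedding, "embeddings.npy")
```

Note that this simple cache does not detect changes in `docs`, so the file should be deleted whenever the documents or the embedding model change.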

Next, we instantiate and train the BERTopic model. Here are explanations of some of the parameters used for this model:

n_gram_range refers to the CountVectorizer used when creating the topic representation. It relates to the number of words you want in your topic representation. For example, “fast food” is an n-gram of size 2.
nr_topics can be a tricky parameter. It specifies the number of topics to reduce to after the topic model has been trained. For example, if your topic model results in 50 topics but you have set nr_topics to 10, then the topic model will try to reduce the number of topics to the specified integer. In this case, we use “auto” to automatically reduce topics using HDBSCAN.
min_topic_size is an important parameter. It specifies the minimum size a topic can have. The lower this value, the more topics are created. If you set this value too high, it is possible that no topics will be created at all! Set it too low and you will get many microclusters. I find this is a trial-and-error process: I first generate an initial model, review it, and then come back here to try different min_topic_size values.
seed_topic_list is a list of keyword lists, one per seeded topic. By defining seed topics, BERTopic is more likely to model them. However, BERTopic is merely nudged towards creating those topics. In practice, if the seeded topics do not exist in the data or are divided into smaller topics, they will not be modeled. Thus, the seed topics need to be accurate for the model to converge towards them. Read more about “guided topic modeling” in the BERTopic documentation.

For more details on the parameters of the BERTopic model, please refer to the official BERTopic documentation.

After the model is trained, we can generate the predicted topic for each document, along with a probability that indicates how confident the assignment is.

topics, probabilities = model.transform(docs, corpus_embeddings)

Understand the Topics Generated

Up to this point, we have generated the predicted topics for each document (i.e. review) in our dataset. Is the work done?

Hold your horses! There is more to be done. First of all, we need to understand the topics (i.e. clusters) generated by the model. Remember, this is an unsupervised learning model, so it is up to the human users to interpret and make sense of the topics generated.

The BERTopic model provides a useful function, get_topic_freq, to generate the topic frequency table. A few key observations from the frequency table:

Topic -1 holds the outliers: the 329 reviews (~33% of 1000) that could not be assigned to any specific cluster.
The table shows 9 topics, but more were generated than are visible here. One quick way to find out how many topics have been generated is len(df_topic_freq) - 1, which subtracts the outlier row.
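The two observations above can be reproduced on a toy example. The topic assignments below are made up; the frame is built by hand with the same `Topic` and `Count` columns that get_topic_freq returns, so the sketch runs without a trained model.

```python
import pandas as pd

# Made-up topic assignments for ten documents; -1 marks an outlier.
toy_topics = [-1, 0, 0, 1, -1, 2, 0, 1, -1, 0]

# Build a frequency table with "Topic" and "Count" columns.
counts = pd.Series(toy_topics).value_counts()
df_topic_freq = pd.DataFrame({"Topic": counts.index, "Count": counts.values})

# Share of documents that landed in the outlier topic.
outlier_count = int(df_topic_freq.loc[df_topic_freq["Topic"] == -1, "Count"].sum())
outlier_share = outlier_count / int(df_topic_freq["Count"].sum())

# Number of real topics: every row except the outlier row.
n_topics = len(df_topic_freq) - 1

print(df_topic_freq)
print(f"{n_topics} topics, {outlier_share:.0%} outliers")
```

The `- 1` only applies while an outlier row exists; once there is no -1 topic, the row count is already the number of topics.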

After training our BERTopic model on the documents (in this case, the 1000 reviews), we could go through the topics one by one (about 20 here, but hundreds in other cases) to get a good understanding of what was extracted. However, that takes quite some time and lacks a global representation. Instead, BERTopic provides a way to visualize the generated topics that is very similar to LDAvis.

In the previous step, we used an LDA-like visualization to understand each topic. Alternatively, we can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. The bar charts give us the top N most representative keywords for each topic. For example, if a topic has the keywords service, slow, and server, it is quite reasonable to say that it contains service-related reviews. From here, we can get a good understanding of what each topic is about. This visualization is also useful for comparing topic representations with each other. To visualize this, call the model’s visualize_barchart function.
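Both views can be produced from a trained model in a couple of calls. The helper name `topic_overview_figures` is hypothetical and only wraps BERTopic’s visualize_topics and visualize_barchart; the defaults of 8 topics and 5 words are assumptions you can change.

```python
def topic_overview_figures(topic_model, top_n_topics=8, n_words=5):
    """Return the intertopic distance map and the keyword bar charts."""
    # LDAvis-style map of the topics.
    distance_map = topic_model.visualize_topics()
    # Bar charts of the top c-TF-IDF keywords for the largest topics.
    keyword_bars = topic_model.visualize_barchart(
        top_n_topics=top_n_topics, n_words=n_words
    )
    return distance_map, keyword_bars


# In a notebook, with the trained model from this article:
# distance_map, keyword_bars = topic_overview_figures(model)
# keyword_bars.show()
```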

Need more detailed insights? Going a step further, we can use a fine-grained approach and visualize the documents inside the topics to see whether they were assigned correctly and whether they make sense. This is a very powerful way to understand a particular topic, because we can read the exact documents clustered under it by hovering over the dots (each dot is a document).

To do so, we can use the model.visualize_documents() function. By default, this function recalculates the document embeddings and reduces them to 2-dimensional space for easier visualization; passing the precomputed embeddings through its embeddings argument avoids the recalculation.
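A sketch of that call, reusing the precomputed embeddings, is shown below. The wrapper name `document_map` is hypothetical; `embeddings` and `sample` are arguments of visualize_documents, with `sample` being an optional fraction of documents to plot per topic.

```python
def document_map(topic_model, docs, corpus_embeddings, sample=None):
    """Plot the documents in 2D, reusing the precomputed embeddings."""
    return topic_model.visualize_documents(
        docs,
        embeddings=corpus_embeddings,  # skips re-embedding the documents
        sample=sample,  # e.g. 0.5 plots roughly half of the documents per topic
    )


# In a notebook, with the trained model from this article:
# document_map(model, docs, corpus_embeddings).show()
```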

Visualize Topics per Class

You might want to extract and visualize the topic representation per class. In this restaurant dataset, there are two classes: 0 (negative review) and 1 (positive review). This visualization allows us to see, for a selected topic, how its documents (i.e. reviews) are distributed across the two classes. In other words, this simply creates a topic representation for each of the classes that you might have in your data.
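With a trained model, the per-class view takes two calls: topics_per_class builds the table and visualize_topics_per_class plots it. The wrapper name `class_breakdown` is hypothetical, and the `Liked` column in the usage comment is an assumed name for the 0/1 label column.

```python
def class_breakdown(topic_model, docs, classes, top_n_topics=10):
    """Return the per-class topic table and its figure."""
    # One row per (topic, class) pair, with the words and frequency for that pair.
    per_class_table = topic_model.topics_per_class(docs, classes=classes)
    figure = topic_model.visualize_topics_per_class(
        per_class_table, top_n_topics=top_n_topics
    )
    return per_class_table, figure


# In a notebook, assuming the label column is called "Liked":
# classes = df["Liked"].to_list()
# per_class_table, figure = class_breakdown(model, docs, classes)
# figure.show()
```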

Outliers Reduction

Looking at the topic frequency table, there are (almost) always some outlier documents that do not fall within any of the created topics. These are labeled as -1. In our case, about 30% of the reviews are outliers.

Depending on your use case, there are two main options:

Decrease the number of documents that are labeled as outliers. For this, we will look at what BERTopic has to offer.
Ignore or remove the documents that are labeled as outliers, in which case no reduction is needed. Why would we choose this? It could be that, after reviewing the result of the “outliers reduction” process, we find that the topics assigned to the outliers are of rather low accuracy, with many of them largely off. In such cases, we might want to use only those records (i.e. documents) that are more confidently assigned to the right topic.
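For the second option, dropping the outliers is a simple filter on the topic assignments. The sketch below uses made-up reviews and topic numbers; with the real data, the `Topic` column would hold the `topics` list returned by the model.

```python
import pandas as pd

# Made-up reviews with made-up topic assignments; -1 marks an outlier.
assigned = pd.DataFrame(
    {
        "Review": [
            "Loved the food.",
            "It was fine.",
            "The service was slow.",
            "Hmm.",
        ],
        "Topic": [0, -1, 1, -1],
    }
)

# Keep only the documents that were assigned to a real topic.
confident = assigned[assigned["Topic"] != -1].reset_index(drop=True)
print(confident)
```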

The main way to reduce your outliers in BERTopic is by using the .reduce_outliers function. To make it work without too much tweaking, you will only need to pass the docs and their corresponding topics.

# Reduce outliers using the `c-tf-idf` strategy
# (comment out this line if you decide to use the `probabilities` strategy instead)
new_topics = model.reduce_outliers(docs, topics, strategy="c-tf-idf")

# Reduce outliers using the `probabilities` strategy (uncomment to use this);
# it requires a model trained with calculate_probabilities=True
#new_topics = model.reduce_outliers(docs, topics, probabilities=probabilities, strategy="probabilities")

# Update the model with the latest topic assignments
model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model)

As you can see, at the end of this stage, all the outliers have been assigned to the various topics. There is no longer a -1 cluster when we generate the topic frequency table.

🕡 We are halfway through the process. In Part 2 of this series, we will look into how we can optimize the topics. In simpler terms, how we can group similar topics together and rename them into a final list of topics that makes sense for the business use case.

🆕 Updated 18 Feb 2023: Part 2 of this series can be found here:

Topic Modeling with BERTopic: A Cookbook with an End-to-end Example (Part 2)


⚠️ Disclaimer: This write-up is not intended to provide a comprehensive overview of all topics and methods related to BERTopic. For the most up-to-date and accurate information, it is recommended to refer to the official BERTopic documentation as the single source of truth.