Himanshu Sharma

Summary

The website content provides a guide on using pyLDAvis, a Python library, to visualize topic models generated by Latent Dirichlet Allocation (LDA), illustrating how to install the necessary libraries, preprocess text data, create LDA models, and produce interactive visualizations to analyze topic clusters.

Abstract

The article titled "Topic Model Visualization using pyLDAvis" delves into the process of creating interactive visualizations for topic models using the pyLDAvis library in Python. It begins by explaining the concept of topic modeling and its application in clustering documents based on relevant topics, such as grouping job listings and candidates in the recruitment industry. The focus then shifts to LDA, a popular topic modeling technique, and the role of pyLDAvis in visualizing the topics derived from LDA. The article outlines the steps to install pyLDAvis via pip, import required libraries, load a dataset suitable for topic modeling, preprocess the text data using vectorizers, and finally, create the LDA model. It also demonstrates how to use pyLDAvis to prepare and display highly interactive and visually appealing topic cluster visualizations with just a single line of code. The author encourages readers to experiment with the code and discuss their experiences, while also acknowledging a collaboration with Piyush Ingale.

Opinions

  • The author emphasizes the utility of topic modeling in uncovering hidden patterns within text data and its practical applications in various industries.
  • pyLDAvis is highlighted as an essential tool for analyzing and visualizing LDA topic models due to its ease of use and the quality of its visualizations.
  • The article suggests that the combination of LDA and pyLDAvis can significantly enhance the interpretability of topic models.
  • The author provides a hands-on approach, encouraging readers to engage with the provided code examples and to reach out for further discussion or assistance.
  • The inclusion of a YouTube video demonstrates the author's commitment to providing interactive and engaging content to complement the written tutorial.

Topic Model Visualization using pyLDAvis

Creating Interactive Topic Model Visualizations

Source: By Author

Topic modeling is a branch of machine learning in which a model automatically analyzes text data and groups the words of a document or a collection of documents into clusters. It works by finding the topics in the text and uncovering the hidden patterns between the words related to those topics.

By using topic modeling we can create clusters of related documents. For example, it can be used in the recruitment industry to create clusters of jobs and job seekers that have similar skill sets. There are several ways of obtaining the topics from a model, but in this article we will talk about Latent Dirichlet Allocation (LDA).

LDA works as a matrix factorization technique: it assumes that each document is a mixture of topics and works backward to figure out which topics would have generated the documents. The important part of this article is the visualizations, which let us analyze the clusters created by LDA.
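The idea that a document is a mixture of topics can be illustrated by running the assumed generative story forward. The sketch below is only a toy illustration, not how LDA is fitted: the topic names, word probabilities, and helper functions (`sample_dirichlet`, `generate_document`) are hypothetical and chosen for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a toy document: pick a topic per word, then a word from it."""
    topic_mix = sample_dirichlet(alpha, rng)  # this document's topic proportions
    names = list(topics)
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=topic_mix)[0]
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return topic_mix, words

# Hypothetical toy topics: each maps words to probabilities.
TOPICS = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "space": {"orbit": 0.4, "launch": 0.4, "moon": 0.2},
}

rng = random.Random(0)
mix, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
print(mix)  # topic proportions for this document, summing to 1
print(doc)  # eight words drawn from the two topics
```

Fitting LDA is the reverse problem: given only the words, estimate the topic proportions per document and the word probabilities per topic.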

pyLDAvis is an open-source Python library for analyzing the clusters created by LDA through highly interactive visualizations. In this article, we will see how to use LDA and pyLDAvis to create topic cluster visualizations.

Let’s get started…

Installing Required Libraries

This article mainly focuses on pyLDAvis for visualization. We will install it with pip using the command given below.

pip install pyldavis

Importing Required Libraries

We will start by creating the model using a predefined dataset from sklearn. To carry out these steps, we first need to import the required libraries.

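from __future__ import print_function  # only has an effect on Python 2
import pyLDAvis
# Note: recent pyLDAvis releases reportedly expose this helper as
# pyLDAvis.lda_model; adjust the import if pyLDAvis.sklearn is not found.
import pyLDAvis.sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Render pyLDAvis output inline in a Jupyter notebook
pyLDAvis.enable_notebook()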

As the main focus of this article is creating visualizations, you can check this link for a better understanding of how to create a topic model.

Loading the Dataset

Now we will load the dataset using the fetch_20newsgroups function that we have already imported. The 20 newsgroups dataset is a text dataset, which makes it a good fit for this article and for understanding how LDA forms clusters.

newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
docs_raw = newsgroups.data
print(len(docs_raw))

Here we can see that the dataset contains 11314 documents. Next, we will apply CountVectorizer and TfidfVectorizer to the text and create the model that we will visualize.

Preprocessing the Data

tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
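The vectorizer settings above can be read as a tokenization rule plus a document-frequency filter. The following is a minimal pure-Python sketch of that reading, not scikit-learn's implementation: the helper names (`tokenize`, `build_vocabulary`) and the toy documents are hypothetical, and stop-word removal and accent stripping are left out for brevity.

```python
import re
from collections import Counter

# Same token pattern as the article: alphabetic tokens of three or more letters.
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")

def tokenize(text):
    """Lowercase the text, then keep only tokens matching the pattern."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency is at least min_df (a count)
    and at most max_df (a proportion of all documents)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

docs = [
    "The rocket launch was delayed by 2 days.",
    "A rocket needs fuel; the launch pad is ready.",
    "The team won the game.",
    "The game went to overtime.",
]
print(tokenize(docs[0]))
# ['the', 'rocket', 'launch', 'was', 'delayed', 'days']
print(build_vocabulary(docs, min_df=2, max_df=0.5))
# ['game', 'launch', 'rocket']
```

In this toy corpus, "the" is dropped because it appears in more than half of the documents, and terms that appear in only one document are dropped by the minimum count. The article's settings (min_df = 10, max_df = 0.5) apply the same kind of filter to the newsgroups data.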

Creating the Model

In this step, we will create the topic model for the current dataset so that we can visualize it using pyLDAvis.

# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)
Model(Source: By Author)
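Before moving on to the visualization, it can help to look at the highest-weighted words in each topic. The helper below is a small sketch under stated assumptions: `top_words_per_topic` is a hypothetical name, and the weights and vocabulary are toy values standing in for a fitted model's topic-word matrix.

```python
def top_words_per_topic(components, vocabulary, n_words=3):
    """Return the n_words highest-weighted terms for each topic row."""
    result = []
    for weights in components:
        # Rank term indices by weight, largest first.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_words]])
    return result

# Hypothetical toy weights: 2 topics x 5 terms.
vocabulary = ["game", "launch", "moon", "score", "team"]
components = [
    [9.1, 0.2, 0.1, 4.0, 6.3],
    [0.3, 7.5, 5.2, 0.1, 0.4],
]
for topic_id, words in enumerate(top_words_per_topic(components, vocabulary)):
    print(topic_id, words)
# 0 ['game', 'team', 'score']
# 1 ['launch', 'moon', 'team']
```

With the model from this article, the same function could be called as top_words_per_topic(lda_tf.components_, tf_vectorizer.get_feature_names_out(), n_words=10), assuming a scikit-learn version that provides get_feature_names_out.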

Creating Visualization

This is the final step, where we will create the visualization of the topic clusters. The best thing about pyLDAvis is that it is easy to use and creates the visualization in a single line of code.

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
Clusters Visualization(Source: By Author)

Check out the video below to see how interactive and visually appealing the visualization created by pyLDAvis is.

Similarly, you can create the visualization for the TF-IDF model. Go ahead and try it, and let me know your feedback or any difficulty you face in the comments section.
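The TF-IDF variant follows the same pattern, passing the TF-IDF model, matrix, and vectorizer instead. The sketch below wraps that call so it also works when the scikit-learn helper lives under a different module name. It assumes that newer pyLDAvis releases expose the helper as pyLDAvis.lda_model while older ones use pyLDAvis.sklearn; the function names `resolve_prepare` and `visualize` are hypothetical.

```python
import importlib

def resolve_prepare():
    """Return pyLDAvis's scikit-learn `prepare` function, or None if unavailable."""
    # Assumption: the helper module is named pyLDAvis.lda_model in newer
    # releases and pyLDAvis.sklearn in older ones (the name used in this article).
    for module_name in ("pyLDAvis.lda_model", "pyLDAvis.sklearn"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        return module.prepare
    return None

def visualize(model, dtm, vectorizer, html_path=None):
    """Prepare a visualization and optionally save it as a standalone HTML file."""
    prepare = resolve_prepare()
    if prepare is None:
        raise RuntimeError("pyLDAvis with scikit-learn support is not installed")
    prepared = prepare(model, dtm, vectorizer)
    if html_path is not None:
        import pyLDAvis
        pyLDAvis.save_html(prepared, html_path)  # shareable outside the notebook
    return prepared
```

With the objects created earlier, visualize(lda_tfidf, dtm_tfidf, tfidf_vectorizer, "lda_tfidf.html") would build the TF-IDF visualization and also write it to an HTML file; the file name here is just an example.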

This post is in collaboration with Piyush Ingale.

Before You Go

Thanks for reading! If you want to get in touch with me, feel free to reach me at [email protected] or my LinkedIn Profile. You can view my Github profile for different data science projects and packages tutorials. Also, feel free to explore my profile and read different articles I have written related to Data Science.
