Are Topics Also Communities of Words?
Leveraging Community Detection in Graph Analytics for Topic Analysis

[Spoiler Alert] When Daenerys Targaryen died, murdered by her nephew and lover Jon Snow, the Game of Thrones community went ape.
“So upset right now that I’m literally crying. Ugh. F#%k (sic)” — Angry GOT Twitter Fan
“…she had felt betrayed all season and she died betrayed…” — Another Angry GOT Twitter Fan
Okay, full stop. I have never actually watched the GOT HBO series.
*pause for collective gasp*
Yeah, I only know the gravity of the sentences I wrote at the top because I have a brother who was an avid watcher and would always qualify his explanations with:
“You just gotta watch the series bro. Like, give it time and get through the first season and then you’ll be hooked.”
But just because I never watched the show doesn’t mean it didn’t have an impact on me. I first learned about the cult following of the show when I watched a talk by a young grad student, Milan Janosov, who created a network graph out of the relationships among GOT characters and used it to predict who was most likely to die in the next episode. What caught my attention was the accuracy of the model and the novel application of graph theory.
It goes without saying that words help to form relationships. And not just human relationships: words also form relationships with each other in a sentence, a paragraph, and on a page.
What better way to represent those relationships than in a network graph? Janosov helped to convey that message to a large popular fan base and has since inspired others to build networks from stories as diverse as Harry Potter and the Goblet of Fire, Lord of the Rings, and even the Marvel universe.
Many of the existing projects linking words and relationships have focused on the people in those words and their relationships with other people. What about words themselves? Do they have relationships that can be usefully understood from a graphical perspective? Herein I look at an application of a graph algorithm for finding topics in documents. Let’s get started.
Topic Analysis, The Words-as-Data Version of Clustering
In data science, we use unsupervised algorithms to help us find natural (data-driven) groupings of data. Probably the most widely applied clustering algorithm is K-Means. When those data are words, however, other algorithms like Latent Dirichlet Allocation (LDA) are more popular. LDA is favored over K-Means because it can assign multiple topics to a single document, whereas K-Means optimizes for mutually exclusive groups (aka hard clustering).
The drawback with both approaches is that each requires the user to input a specific number of clusters/topics for the model to then attempt to “find” in the data. Having to input the number of topics a priori can be a challenge because we often don’t know what the optimal number of groupings should be.
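To make that drawback concrete, here is a minimal sketch using scikit-learn. The toy documents and the variable names (`docs`, `n_topics`, `hard_labels`, `topic_weights`) are made up for illustration and are not the data used in this article. Notice that both models refuse to start until we hand them a number of groups.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "apples and pears are fruit we eat",
    "we eat fruit like apples every day",
    "graphs have nodes and edges",
    "edges connect nodes in network graphs",
]

# Bag-of-words counts: one row per document, one column per word.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

n_topics = 2  # we have to choose this up front; neither model infers it

# K-Means: hard clustering, exactly one label per document.
kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
hard_labels = kmeans.fit_predict(counts)

# LDA: soft assignment, a weight for every topic in every document.
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
topic_weights = lda.fit_transform(counts)

print(hard_labels.shape)    # one label per document
print(topic_weights.shape)  # one row of topic weights per document
```

K-Means returns a single label per document, while LDA returns a row of weights across all topics, which is the hard versus soft distinction described above.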
*feeling like we’ve been a bit robbed of the “unsupervised” part of the description? Me too 😊*
Anyway, graph algorithms may provide one possible answer to a more truly unsupervised approach to topic modeling. In the remaining sections, I demonstrate how we can create a topic model using the Louvain community detection algorithm, visualize the network, and provide a few hints for future directions.
Building a Network Graph of Words
Network graphs are simply a series of nodes that are connected via edges. In the context of words on a page, we may connect two words based on their shared mentions in a sentence or paragraph. For example, if both the words “apple” and “eat” show up in the same paragraph we could graph the relationship like this:

In this simple example, we may conclude that the given text passage has a topic about eating apples.
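As a minimal sketch, the same two-word graph can be built with networkx. The `weight` attribute is my own assumption here, used to record how many paragraphs the two words share.

```python
import networkx as nx

# An undirected graph: co-occurrence has no direction.
G = nx.Graph()

# Adding an edge creates both nodes automatically.
# weight=1 means the words appeared together in one paragraph.
G.add_edge("apple", "eat", weight=1)

print(list(G.nodes()))
print(list(G.edges(data=True)))
```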
Let’s examine how to create a word network using Python.
To achieve this first goal, we need to identify a source of data. For this example, I use a handful of Word documents from students in a social psychology course. Hopefully our topic model will help us to learn what they were writing about. Here is the environment I am working in for this example:
Windows 10 OS
Python 3.6
The packages we will need are:
networkx==2.5
nltk==3.5
python_louvain==0.14
pandas==0.25.0
community==1.0.0b1
scikit_learn==0.24.2
To build our word network, I assume that each paragraph may represent a separate topic and so word relationships are built at the paragraph level. Let’s load the data, subset to just the top 6 paragraphs, and clean the data:
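A rough sketch of this step might look like the following. It assumes the paragraphs have already been read into a list of strings, so the sample `paragraphs`, the tiny `STOPWORDS` set, and the helper names `clean_paragraph` and `build_word_network` are all hypothetical stand-ins rather than the student documents or the exact cleaning used for this article. A fuller pipeline would likely swap in the nltk stopword list.

```python
import re
from itertools import combinations

import networkx as nx

# Stand-in paragraphs, invented for illustration only.
paragraphs = [
    "People conform to group norms under social pressure.",
    "Group norms shape how people behave in public.",
    "Attitudes change when people feel cognitive dissonance.",
]

# A deliberately tiny stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "and", "to", "in", "of", "a", "we",
             "how", "when", "under", "are", "is"}


def clean_paragraph(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and short words.

    Returns the unique words of the paragraph in sorted order.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w not in STOPWORDS and len(w) > 2})


def build_word_network(texts):
    """Connect every pair of words that share a paragraph.

    The edge weight counts how many paragraphs contain both words.
    """
    graph = nx.Graph()
    for text in texts:
        for a, b in combinations(clean_paragraph(text), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph


# Subset to the first 6 paragraphs (here there are only 3).
word_graph = build_word_network(paragraphs[:6])
print(word_graph.number_of_nodes(), word_graph.number_of_edges())
```

Building edges at the paragraph level follows the assumption above that each paragraph may carry its own topic. Words that keep showing up together across paragraphs end up with heavier edges.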








