s="gist-iframe" src="/gist/marvinlanhenke/0ce2557d4196cc86ee477baf5fcde037.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="b055">First of all, we import the necessary libraries and create our corpus, containing four documents. For demonstration purposes, we work with a simple nursery rhyme.</p><p id="ffdc">Once we have created our corpus, we’re ready to proceed.</p><p id="5628">We will use the TreebankWordTokenizer to build our lexicon, containing all unique tokens except punctuation.</p>
<figure id="981a">
<div>
<div>
<iframe class="gist-iframe" src="/gist/marvinlanhenke/f729c527f24b8301be5afc2a4a3967a4.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="41a8">Now, we can compute the normalized term frequency for every token in each document. We store the results in a data frame which makes it easier to visualize the results.</p>
<figure id="f4e2">
<div>
<div>
<iframe class="gist-iframe" src="/gist/marvinlanhenke/307918ba52cf7bdea493b1dae96ed916.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="8b0b">We create an empty data frame with an index for every document and a separate column for each unique token in our lexicon.</p><p id="1519">Next, we iterate over all documents and create a Bag-Of-Words using the <code>collections.Counter</code> object. Dividing the word count for a specific term by the number of unique tokens, we obtain the normalized term frequency.</p><p id="7892">The results, stored in our data frame, look like the following.</p><figure id="45aa"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*tbIlg-tbdgZR224TTTc9ig.png"><figcaption>TF data frame [Screenshot by Author]</figcaption></figure><p id="2883">By looking at the results, we can already tell that the term frequency alone doesn’t provide enough information to distinguish or determine the topic.</p><p id="8410">Thus, we need to compute the inverse document frequency in order to improve our representation.</p>
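<p>The term-frequency step described above can be sketched in plain Python. This is a minimal sketch under assumptions, not the code from the gist: the four-line corpus is a hypothetical stand-in, a simple regex tokenizer replaces the TreebankWordTokenizer, nested dictionaries replace the data frame, and "the number of unique tokens" is read as the size of the lexicon.</p>

```python
import re
from collections import Counter

# Hypothetical stand-in corpus (assumption): the article's own corpus is in the gist.
corpus = [
    "Hey diddle diddle, the cat and the fiddle.",
    "The cow jumped over the moon.",
    "The little dog laughed to see such fun.",
    "And the dish ran away with the spoon.",
]


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase and keep alphabetic runs,
    # which drops punctuation.
    return re.findall(r"[a-z]+", text.lower())


docs = [tokenize(doc) for doc in corpus]

# Lexicon: every unique token across the whole corpus.
lexicon = sorted({token for tokens in docs for token in tokens})

# One row per document, one entry per lexicon term.
tf = []
for tokens in docs:
    counts = Counter(tokens)  # Bag-Of-Words for this document
    # Normalize by the lexicon size (assumed reading of the text).
    tf.append({term: counts[term] / len(lexicon) for term in lexicon})

print(tf[1]["the"], tf[1]["cow"])
```

<p>In this toy corpus, "the" gets the highest term frequency in every document, which illustrates why term frequency alone says little about the topic.</p>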
<figure id="a58d">
<div>
<div>
<iframe class="gist-iframe" src="/gist/marvinlanhenke/ad4f50baea695fc03bf277b35a762d33.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="efe7">We start by creating an empty data frame and obtaining the number of documents in our corpus.</p><p id="0707">Next, we iterate through each document and term in our lexicon. If the term appears in a document, we store the value <code>1</code> in our data frame. By taking the sum for each column over all documents, we obtain the number of documents containing a specific term.</p><p id="4039">Finally, we calculate the inverse document frequency.</p><p id="0ade">We simply divide the number of documents by the number of documents containing a term. After applying the <code>numpy.log()</code> function, we store the values in our data frame.</p><p id="c7da">It’s time for the last step: tying everything together.</p>
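<p>As a rough illustration of the inverse document frequency step, here is a plain-Python sketch. The corpus and the regex tokenizer are hypothetical stand-ins, dictionaries replace the data frame, and <code>math.log</code> (natural log, like <code>numpy.log()</code>) is applied to one value at a time.</p>

```python
import math
import re

# Hypothetical stand-in corpus (assumption): the article's own corpus is in the gist.
corpus = [
    "Hey diddle diddle, the cat and the fiddle.",
    "The cow jumped over the moon.",
    "The little dog laughed to see such fun.",
    "And the dish ran away with the spoon.",
]


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase, alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


# Only presence matters for IDF, so each document becomes a set of tokens.
docs = [set(tokenize(doc)) for doc in corpus]
lexicon = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: the number of documents containing each term.
df = {term: sum(1 for doc in docs if term in doc) for term in lexicon}

# Inverse document frequency as described in the text: log(N / df).
idf = {term: math.log(n_docs / df[term]) for term in lexicon}

print(idf["the"], idf["and"], idf["cow"])
```

<p>A term that appears in every document, like "the" in this toy corpus, gets an IDF of 0 and therefore vanishes once TF and IDF are multiplied.</p>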
<figure id="6507">
<div>
<div>
<iframe class="gist-iframe" src="/gist/marvinlanhenke/a04390651a50af89ac33701c1441baf9.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="e5df">Pretty straightforward.</p><p id="5222">We basically just multiply both data frames (TF, IDF) and obtain a new one with the term frequency-inverse document frequency. Normalizing the data with <code>sklearn.preprocessing.normalize()</code> scales each row to a unit vector (length 1).</p><blockquote id="b16c"><p><b>Note</b>: We normalize the rows so that we can later compare our solution with sklearn’s implementation.</p></blockquote><p id="5cfe">Now, we can visualize the results.</p><figure id="c4b1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*2Y8AhG2RIEa7pss-mCZJpg.png"><figcaption>TF-IDF data frame [Screenshot by Author]</figcaption></figure><p id="f91e">This looks pretty nice.</p><p id="d36b">Just by looking at the TF-IDF scores in the second row, we can tell the document covers something about a “jumping cow”. Sounds reasonable enough.</p><p id="583b">TF-IDF seems to work very well. However, we had to do a lot of work to get there. Luckily, there are several tools and libraries that make our lives a lot easier.</p><h1 id="0ae2">Libraries: TfidfVectorizer</h1><p id="06bf">In the last section, we computed the TF-IDF scores from scratch.</p><p id="d6e9">This was a lot of work. And a lot of code.</p><p id="37a1">Fortunately for us, sklearn comes with the <code>TfidfVectorizer</code>, which does basically the same job in a few lines of code.</p><p id="fa46">Let’s take a look and compare the results with our implementation.</p>
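<p>When comparing a from-scratch result with <code>TfidfVectorizer</code>, it helps to know what the vectorizer computes with its default settings: raw term counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The following plain-Python sketch mirrors that default weighting; it is not sklearn's code. The corpus and the regex tokenizer are hypothetical stand-ins, and sklearn's default tokenizer additionally skips single-character tokens, which this sketch does not emulate.</p>

```python
import math
import re
from collections import Counter

# Hypothetical stand-in corpus (assumption): the article's own corpus is in the gist.
corpus = [
    "Hey diddle diddle, the cat and the fiddle.",
    "The cow jumped over the moon.",
    "The little dog laughed to see such fun.",
    "And the dish ran away with the spoon.",
]


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase, alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


docs = [Counter(tokenize(doc)) for doc in corpus]
lexicon = sorted(set().union(*docs))
n_docs = len(docs)

# Number of documents containing each term.
df = {term: sum(1 for doc in docs if term in doc) for term in lexicon}

# Smoothed IDF, matching the vectorizer's documented default formula.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in lexicon}

tfidf = []
for counts in docs:
    # Multiply raw counts (TF) by IDF, term by term.
    row = [counts[term] * idf[term] for term in lexicon]
    # L2-normalize so that every document vector has length 1.
    norm = math.sqrt(sum(value * value for value in row))
    tfidf.append([value / norm for value in row])

cow = lexicon.index("cow")
print([round(row[cow], 3) for row in tfidf])
```

<p>Because of the smoothing and the added 1, a term that occurs in every document keeps a weight of 1 instead of dropping to 0, so scores computed with the plain log(N / df) formula can differ from the vectorizer's output.</p>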
<figure id="8830">
<div>
<div>
<iframe class="gist-iframe" src="/gist/marvinlanhenke/fad92940eece82132311720901ec002b.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="ea4f">And that’s it. Much less code. Much faster.</p><p id="13ef">Now, we can visualize the data frame and compare the results.</p><figure id="2b46"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*-fHTXAGmfcX161i5chDlrw.png"><figcaption>TF-IDF sklearn’s implementation [Screenshot by Author]</figcaption></figure><p id="661d">Looks like we nailed it.</p><p id="27cd">These are the same results as in our implementation from scratch.</p><h1 id="25a0">Conclusion</h1><p id="e311">In this article, we learned what TF-IDF means, why it’s useful, and even how to compute it from scratch. We also compared our solution to sklearn’s much easier-to-use implementation.</p><p id="2965">We improved our numerical representation of a given text a lot. However, we still consider only the number of times a word appears.</p><p id="b02d">In the following episodes, we’re going to take the next step: entering the world of semantic analysis, which provides us with new ways to model the meaning, the topic, of a document.</p><p id="05ca">So take a seat, dust off your notebooks, make sure to follow, and never miss a single day of the ongoing series <b>#30DaysOfNLP.</b></p><div id="1c3f" class="link-block">
<a href="https://medium.com/@marvinlanhenke/list/3974a0c731d6">
<div>
<div>
<h2>#30DaysOfNLP</h2>
<div><h3> </h3></div>
<div><p>medium.com</p></div>
</div>
<div>
<div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*0c359a988423c5ec038328541ff037163fbe2902.jpeg)"></div>
</div>
</div>
</a>
</div><p id="9a47"><b>References / Further Material:</b></p><ul><li>Hobson Lane, Cole Howard, Hannes Max Hapke. Natural Language Processing in Action. New York: Manning, 2019.</li></ul></article></body>
#30DaysOfNLP
NLP-Day 7: Topic Modeling With TF-IDF
Implementing a term frequency-inverse document frequency vectorizer from scratch.
Topic Modelling #30DaysOfNLP [Image by Author]
Yesterday, we improved our numerical representation of a text by creating a Bag-Of-Words containing the normalized term frequencies. We also vectorized our frequency dictionary, allowing us to perform some more interesting mathematical operations.
However, we still have no way to distinguish between documents across the whole corpus. We have no way of determining the topic by simply relying on the term frequency.
In the following sections, we’re going to tackle that problem.
We will learn about the concept of the term frequency-inverse document frequency (TF-IDF). As complicated as it may sound, it’s not. And to understand things even better, we will compute TF-IDF by hand in a step-by-step fashion.
So take a seat, don’t go anywhere, and make sure to follow #30DaysOfNLP: Topic Modeling With TF-IDF.
Term frequency-inverse document frequency
Up until now, we exclusively relied on the word count or term frequency to determine the importance of certain tokens to a particular document.
This approach, however, is flawed.
In the previous episodes, we saw that the term frequency of common stop words can skew the overall picture quite heavily.
Just because the token "the" appears multiple times in a document doesn’t mean its story is revolving around a commonly used grammatical article in the English language. What kind of story would that be, anyway?
So we need a way to measure the importance of a word to a particular document with respect to the total occurrences across the whole corpus.
Let’s assume we possess a corpus of every book ever written about artificial intelligence. “Artificial intelligence” would almost always appear multiple times in every book or document. But that doesn’t provide any new information to determine the topic or to distinguish the documents.
Terms like “Decision Trees” or “Support Vector Machines”, on the other hand, might not occur in every document of the corpus. But in the documents where they do occur, they give us new information about what the text might be about.
TF-IDF provides a “rarity” measure.
The score increases the more frequently a term occurs within a document. However, it decreases the more documents across the corpus contain that term. We can also think of TF-IDF in this way: “How strange is it that this token is in this document?”
TF-IDF is our first step into the realm of topic analysis.
Before implementing an example by hand, we need to know how to calculate TF-IDF.
Pretty straightforward calculations.
We already know how to compute the term frequency (TF) by simply dividing the word count for a certain term by the total word count of the document.
The inverse document frequency (IDF) can be computed by dividing the total number of documents of a corpus by the number of documents containing our specific term.
Note: We take the logarithm of this ratio so that very rare terms don’t receive scores that are orders of magnitude larger than those of common terms.
TF-IDF is simply the product of TF and IDF.
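Written out as formulas, the three definitions above might look as follows. The symbols are not part of the original text and are chosen here for illustration: f(t, d) is the number of times term t occurs in document d, N is the number of documents in the corpus, and n_t is the number of documents containing t. The natural logarithm is assumed.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},
\qquad
\mathrm{idf}(t) = \log \frac{N}{n_t},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

As a made-up worked example: a term that occurs 3 times in a 100-word document has tf = 3 / 100 = 0.03. If it appears in 10 of 1,000 documents, idf = log(1000 / 10) = log(100) ≈ 4.605, so tf-idf ≈ 0.03 · 4.605 ≈ 0.138. A term that appears in all 1,000 documents has idf = log(1) = 0, and its tf-idf is 0 no matter how often it occurs.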
TF-IDF from scratch
We know how to calculate TF-IDF, what it means, and why it’s a useful measure.
Now, let’s apply our knowledge and compute the term frequency-inverse document frequency from scratch.