Mikhail Salnikov

Summary

This article explains the application of TF-IDF (Term Frequency-Inverse Document Frequency) in text clustering using K-means, providing Python code examples and emphasizing the importance of TF-IDF in determining word relevance within a corpus for improved clustering outcomes.

Abstract

The article delves into the concept of text clustering with a focus on the significance of TF-IDF as a weighting factor in determining the importance of words in text strings for clustering purposes. It begins with an explanation of TF-IDF, detailing how term frequency and inverse document frequency are calculated and why their product is crucial for text analysis. The author provides a practical Python example using a simple corpus of text strings, illustrating the process of transforming a bag of words representation into a TF-IDF weighted one, which highlights important words while downplaying common but less informative words like "and." The article then transitions to demonstrating how these TF-IDF weights can be utilized in K-means clustering, a popular unsupervised learning algorithm. By leveraging the scikit-learn library, the author showcases the implementation of TF-IDF Vectorizer and KMeans to cluster text documents, concluding with an example that reveals the effectiveness of the method in distinguishing between different clusters of text.

Opinions

  • The author advocates for the use of TF-IDF in text clustering tasks, suggesting it as a means to understand the nature of a document.
  • It is implied that using raw counts of terms (simple term frequency) is insufficient for effective text analysis, and that TF-IDF provides a more nuanced approach.
  • The author recommends using established libraries like scikit-learn for implementing TF-IDF and KMeans due to their reliability and ease of use, which can reduce potential errors in custom code.
  • There is an emphasis on the practicality of the examples provided, with the expectation that readers can apply similar methods to real-world tasks.
  • The author values reader feedback and encourages engagement with the content, indicating a commitment to improving the quality of future posts.

Text clustering with K-means and tf-idf

In this post, I’ll describe how to cluster text using knowledge of how important each word is to a string. Words that appear in many different strings can hurt clustering, because they carry little information for deciding which cluster a string belongs to. The first part of this post gives general information about TF-IDF with examples in Python. In the second part, I’ll give an example that shows how this approach can be applied to real tasks.

TF-IDF is useful for clustering tasks such as document clustering. In other words, tf-idf can help you understand what kind of document you are looking at.

TF-IDF

Term Frequency-Inverse Document Frequency is a numerical statistic that reflects how important a word is to a document in a corpus.

Term Frequency is just the ratio of the number of occurrences of the current word to the number of all words in the document/string/etc.

The frequency of term t_i is tf(t_i, d) = n_t / Σ_k n_k, where n_t is the number of occurrences of t_i in the current document/string and the sum of n_k is the number of all terms in the current document/string.

Inverse Document Frequency is the log of the ratio of the number of all documents/strings in the corpus to the number of documents that contain term t_i: idf(t_i, D) = log(|D| / |{d in D : t_i in d}|).

tf-idf(t, d, D) is the product of tf(t, d) and idf(t, D).
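The three definitions above can be sketched directly in Python. This is a minimal illustration, not the author's code: the function names, the tiny tokenized corpus, and the choice of the natural logarithm are all assumptions made for the example.

```python
import math


def tf(term, doc):
    """Share of the tokens in `doc` that are equal to `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (number of documents / number of documents containing `term`).

    Assumes `term` occurs in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)


# Hypothetical, already tokenized corpus: one list of words per document.
docs = [
    ["cats", "and", "dogs"],
    ["cats", "and", "mouse"],
    ["dogs", "dogs", "and", "mouse"],
]

print(tf_idf("dogs", docs[2], docs))  # (2 / 4) * ln(3 / 2)
print(tf_idf("and", docs[0], docs))   # (1 / 3) * ln(3 / 3), which is 0.0
```

A word that occurs in every document gets idf = log(1) = 0, so its tf-idf weight is zero no matter how often it appears.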

If you want more theory about TF-IDF, I advise you to read the Wikipedia article about it or the Stanford NLP post.

Well, now it’s time for a real example in Python.

TF-IDF example in Python

For all the code below you need Python 3.5 or newer and the scikit-learn and pandas packages.

First, let’s talk about the data set. For this really simple example, I just define a simple corpus of 3 strings. In this example, the strings play the role of documents.

After that, let’s make bags of words for the whole corpus and for every string. But first we have to clean the data.
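The original corpus and cleaning code are not shown here, so the following is only a sketch under assumptions: a hypothetical three-string corpus, and a hypothetical `clean` helper that lowercases the text and strips everything except letters and whitespace.

```python
import re

# Hypothetical three-string corpus; each string plays the role of a document.
corpus = [
    "Cats and dogs!",
    "Cats and mouse.",
    "Dogs, dogs and mouse?",
]


def clean(text):
    """Lowercase the text and drop everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


# One list of tokens per string, plus a sorted bag of words for the whole corpus.
docs = [clean(text).split() for text in corpus]
vocabulary = sorted({word for doc in docs for word in doc})

print(docs)
print(vocabulary)
```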

That gives us a cleaned bag of words for the corpus and for each string.

In the case of term frequency, the simplest choice is to use the raw count of a term in a string. To calculate tf for all terms, we fill a dictionary as follows.
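A minimal sketch of such a dictionary, assuming a hypothetical cleaned corpus stored as lists of tokens (the names `docs`, `vocabulary` and `tf` are illustrative, not taken from the original code):

```python
from collections import Counter

# Hypothetical, already cleaned corpus: one list of tokens per string.
docs = [
    ["cats", "and", "dogs"],
    ["cats", "and", "mouse"],
    ["dogs", "dogs", "and", "mouse"],
]
vocabulary = sorted({word for doc in docs for word in doc})

# tf[i][word] is the raw count of `word` in the i-th string (0 if it is absent).
tf = []
for doc in docs:
    counts = Counter(doc)
    tf.append({word: counts[word] for word in vocabulary})

print(tf)
```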

idf is a measure of how much information a token, or a word in our case, provides. To calculate idf we need to fill a dict too.
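One possible way to fill that dict is sketched below, again on a hypothetical cleaned corpus. The natural logarithm is an assumption; the text only says "a log", and a different base would rescale every weight by the same factor.

```python
import math

# Hypothetical, already cleaned corpus: one list of tokens per string.
docs = [
    ["cats", "and", "dogs"],
    ["cats", "and", "mouse"],
    ["dogs", "dogs", "and", "mouse"],
]
vocabulary = sorted({word for doc in docs for word in doc})

# idf[word] = log(number of strings / number of strings that contain `word`).
n_docs = len(docs)
idf = {}
for word in vocabulary:
    docs_with_word = sum(1 for doc in docs if word in doc)
    idf[word] = math.log(n_docs / docs_with_word)

print(idf)
```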

Now, let me remind you that tf-idf is the product of tf and idf. In our Python example, tf-idf is a dict with the corresponding products.
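Putting the two pieces together could look like the sketch below. It assumes the same hypothetical corpus, raw counts for tf, and a natural-log idf; none of these names come from the original code.

```python
import math
from collections import Counter

# Hypothetical, already cleaned corpus: one list of tokens per string.
docs = [
    ["cats", "and", "dogs"],
    ["cats", "and", "mouse"],
    ["dogs", "dogs", "and", "mouse"],
]
vocabulary = sorted({word for doc in docs for word in doc})

# idf per word: log(number of strings / number of strings containing the word).
idf = {
    word: math.log(len(docs) / sum(1 for doc in docs if word in doc))
    for word in vocabulary
}

# tfidf[i][word] = (raw count of `word` in the i-th string) * idf[word].
tfidf = []
for doc in docs:
    counts = Counter(doc)
    tfidf.append({word: counts[word] * idf[word] for word in vocabulary})

print(tfidf)
```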

OK, now we have tf-idf weights for each word in our corpus. Below you can clearly see the difference between the original bag of words and the new bag of words with tf-idf weights. For example, ‘dogs’, ‘cats’ and ‘mouse’ are important words, but the word ‘and’ is not important, because it appears in all the strings, so we can’t tell what a string is about from the word ‘and’.

[Figures: the TF-IDF bag of words and the original bag of words]
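Since the original tables are not available here, the comparison can be approximated with pandas. This is a sketch on a hypothetical corpus, with raw counts for tf and a natural-log idf as assumptions; the values will differ from the ones in the original post.

```python
import math
from collections import Counter

import pandas as pd

# Hypothetical, already cleaned corpus: one list of tokens per string.
docs = [
    ["cats", "and", "dogs"],
    ["cats", "and", "mouse"],
    ["dogs", "dogs", "and", "mouse"],
]
vocabulary = sorted({word for doc in docs for word in doc})

counts = [Counter(doc) for doc in docs]
idf = {
    word: math.log(len(docs) / sum(1 for doc in docs if word in doc))
    for word in vocabulary
}

# One row per string, one column per word.
bag_of_words = pd.DataFrame([{w: c[w] for w in vocabulary} for c in counts])
tfidf_bag = pd.DataFrame([{w: c[w] * idf[w] for w in vocabulary} for c in counts])

print(bag_of_words)
print(tfidf_bag.round(3))
```

In the second table the whole ‘and’ column is zero, while the columns for ‘cats’, ‘dogs’ and ‘mouse’ keep non-zero weights wherever those words occur.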

KMeans clustering with TF-IDF weights

Now that we understand how TF-IDF works, the time has come for an almost real example of clustering with TF-IDF weights. In real life we can use the scikit-learn implementations of TF-IDF and KMeans. I suggest you use the implementations from scikit-learn or from other popular libraries or frameworks, because this reduces the number of potential errors in your code.

For this example, we must import TF-IDF and KMeans, add a corpus of text for clustering, and preprocess this corpus.

After that, let’s fit Tfidf and then KMeans. With scikit-learn it’s really easy.
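A minimal sketch of these steps with scikit-learn’s `TfidfVectorizer` and `KMeans` follows. The corpus is hypothetical (the one from the original post is not shown here), and `n_init` and `random_state` are assumptions added only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus with two topics; not the corpus from the original post.
corpus = [
    "tf idf weights words",
    "tf idf ranks words",
    "tf idf weights documents",
    "tf idf ranks documents",
    "androids walk like humans",
    "androids talk like humans",
    "androids walk like robots",
    "androids talk like robots",
]

# Learn the vocabulary and idf weights, and turn each string into a tf-idf vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Fit K-means with k = 2 on the tf-idf vectors.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)  # one cluster id (0 or 1) per string
```

Note that `TfidfVectorizer` uses a smoothed idf and normalizes each vector by default, so its weights will not match the hand-made dictionary version number for number.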

Now we have a trained KMeans model with k = 2 for clustering strings. It’s easy, right? For predicting, just use the predict method as follows.

Here we can see that the strings ‘tf and idf is awesome!’ and ‘some androids is there’ end up in different clusters, and that’s right.
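Prediction could be sketched as below. The training corpus is again hypothetical; only the two test strings come from this post. New strings must go through the same fitted vectorizer (`transform`, not `fit_transform`) before they are passed to `predict`.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus with two topics; not the corpus from the original post.
corpus = [
    "tf idf weights words",
    "tf idf ranks words",
    "tf idf weights documents",
    "tf idf ranks documents",
    "androids walk like humans",
    "androids talk like humans",
    "androids walk like robots",
    "androids talk like robots",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Reuse the fitted vectorizer so new strings are mapped into the same feature space.
new_strings = ["tf and idf is awesome!", "some androids is there"]
predicted = kmeans.predict(vectorizer.transform(new_strings))

print(predicted)  # two cluster ids; which id means which topic is arbitrary
```

Words that the vectorizer never saw during fitting, such as ‘awesome’ here, are simply ignored at prediction time.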

In addition, you can read the Jupyter notebook with these examples.

Thanks for reading, and please leave feedback. It can help me improve the quality of my future posts.

And don’t forget to follow me on Twitter.
