
Text clustering with K-means and tf-idf
In this post, I’ll describe how to cluster text using knowledge of how important a word is to a string. Words that appear in many different strings can hurt clustering, because such words carry little information for deciding which cluster a string belongs to. The first part of this post covers general information about TF-IDF with examples in Python. In the second part, I’ll provide an example showing how this approach can be applied to real tasks.
TF-IDF is useful for clustering tasks such as document clustering; in other words, tf-idf can help you understand what kind of document you are looking at.
TF-IDF
Term Frequency-Inverse Document Frequency is a numerical statistic that reflects how important a word is to a document in a corpus.
Term Frequency is just the ratio of the number of occurrences of the current word to the number of all words in the document/string/etc.

The frequency of term t_i is tf(t_i, d) = n_t / Σ_k n_k, where n_t is the number of occurrences of t_i in the current document/string d, and the sum of n_k is the number of all terms in that document/string.
Inverse Document Frequency is the log of the ratio of the number of all documents/strings in the corpus to the number of documents containing term t_i: idf(t_i, D) = log(|D| / |{d in D : t_i in d}|).

tf-idf(t, d, D) is the product of tf(t, d) and idf(t, D): tf-idf(t, d, D) = tf(t, d) · idf(t, D).
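To make the three definitions above concrete, here is a minimal pure-Python sketch. The function names, the whitespace tokenization, and the toy corpus are my own assumptions for illustration; returning 0.0 for a term that appears in no document is also an assumption, since the formula itself is undefined in that case.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    # Occurrences of the term divided by the number of all terms in the document.
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus_tokens):
    # Log of (number of documents / number of documents containing the term).
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    if n_containing == 0:
        # Assumption: treat a term unseen in the corpus as carrying no weight.
        return 0.0
    return math.log(len(corpus_tokens) / n_containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    # Product of the two statistics above.
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Hypothetical toy corpus; each string plays the role of a document.
corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]
tokens = [doc.split() for doc in corpus]

print(tf_idf("blue", tokens[0], tokens))  # rare word: positive weight
print(tf_idf("the", tokens[0], tokens))   # word in every document: zero weight
```

A word that occurs in every document gets idf = log(1) = 0, so it contributes nothing, which is exactly the behaviour we want for clustering.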

If you want more theoretical background on TF-IDF, I recommend reading the Wikipedia article about it or the Stanford NLP post.
Well, now it’s time for a real example in Python.
TF-IDF example in Python
For all the code below you need Python 3.5 or newer and the scikit-learn and pandas packages.
First, let’s talk about the data set. For this really simple example, I just define a small corpus of 3 strings. In this example, the strings play the role of documents.
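As a rough sketch of what such a setup could look like, the snippet below builds a 3-string corpus and computes tf-idf weights with scikit-learn's TfidfVectorizer. The strings themselves and the variable names are hypothetical, chosen only for illustration. Note also that TfidfVectorizer with default settings uses a smoothed idf and L2-normalizes each row, so its numbers will differ from the plain formula described earlier.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: three short strings standing in for documents.
corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

# Default settings: lowercasing, tokens of 2+ characters,
# smoothed idf, and L2 normalization of every row.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sort terms by that index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per string, one column per term.
weights = pd.DataFrame(matrix.toarray(), columns=terms)
print(weights.round(3))
```

In the resulting table, a word that is unique to one string (such as "blue") gets a higher weight in that string than a word shared by all strings (such as "the").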







