Rashida Nasrin Sucky



Photo by Mohamed Nohassi on Unsplash

Converting Texts to Numeric Form with TfidfVectorizer: A Step-by-Step Guide

How to calculate Tfidf values manually and using sklearn

TFIDF is a method to convert texts to numeric form for machine learning or AI models. In other words, TFIDF is a method to extract features from texts. This is a more sophisticated method than the CountVectorizer() method I discussed in my last article.

The TFIDF method gives each word a score that represents how useful or relevant that word is to a document. It weighs how often the word appears in the document against how many documents in the collection contain it.

This article will calculate the TFIDF scores manually so that you understand the concept of TFIDF clearly. Toward the end, we will see how to use the TFIDF vectorizer from the sklearn library as well.

There are two parts to it: TF and IDF. Let’s see how each part works.

TF

TF stands for ‘Term Frequency’. TF can be calculated as:

TF = # of occurrences of a word in a document

OR

TF = (# of occurrences in a document) / (# of words in a document)

Let’s work on an example. We will find the TF for each word for this document:

My name is Lilly

Let’s see an example for each of the formulas.

TF = # of occurrence of a word in a Document

If we take the first formula here, which is simply the number of occurrences of a word in a document, the TF for the word ‘My’ is 1 because it appears only once.

In the same way, the TF for the word

‘name’ = 1, ‘is’ = 1, ‘Lilly’ = 1

Now, let’s use the second formula.

TF = (# of occurrence in a document) / (# of words in a document)

If we take the second formula, the first part of the formula (# of occurrences in a document) is 1, and the second part (# of words in a document) is 4.

So, the TF for the word ‘My’ is 1/4 or 0.25.

In the same way, the TF for the words

name = ¼ = 0.25, is = ¼ = 0.25, Lilly = ¼ = 0.25.
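
Both TF formulas are easy to check in code. The following is a minimal sketch, assuming simple lowercase, whitespace-based tokenization; the helper name `term_frequencies` is made up for this illustration and is not part of any library.

```python
from collections import Counter


def term_frequencies(document, relative=False):
    """Return the TF of every word in `document`.

    relative=False -> raw count (first formula)
    relative=True  -> count / number of words (second formula)
    """
    # Lowercase and split on whitespace (an assumption of this sketch)
    words = document.lower().split()
    counts = Counter(words)
    if relative:
        return {word: count / len(words) for word, count in counts.items()}
    return dict(counts)


doc = "My name is Lilly"
raw_tf = term_frequencies(doc)
relative_tf = term_frequencies(doc, relative=True)

print(raw_tf)       # {'my': 1, 'name': 1, 'is': 1, 'lilly': 1}
print(relative_tf)  # {'my': 0.25, 'name': 0.25, 'is': 0.25, 'lilly': 0.25}
```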

IDF

IDF stands for Inverse Document Frequency.

Here is the formula:

idf = 1 + LN[n/df(t)]

or

idf = LN[n/df(t)]

Where n = number of documents available, and

df(t) = number of documents where the term t appears

As per the sklearn library’s documentation:

idf = LN[(1+n) / (1+df(t))] + 1 (default setting, smooth_idf = True)

or

idf = LN[n / df(t)] + 1 (when smooth_idf = False)

We won’t work through all four formulas here. Let’s work through just two of them, and you will get the idea.

To demonstrate the IDF, one document is not enough. I will use these three documents:

My name is Lilly

Lilly is my mom’s favorite flower

My mom loves flowers

Let’s use this formula to practice this time:

IDF = LN[n/df(t)]

If we take the word ‘My’ first, n is 3 because we have 3 documents here, and df(t) is also 3 because ‘My’ appeared in all three documents.

IDF(My) = LN(3/3) = 0 (as ln(1) is 0)

We will work on one more word to understand it clearly. Take the word ‘name’.

For the word ‘name’, n is the same as before because the number of documents is still 3, but df(t) is 1 because the word ‘name’ is present in only one document.

IDF(name) = ln(3/1) = 1.1 (I used Excel’s LN function for this)
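
The same IDF calculation can be sketched in Python instead of Excel. The helper names (`tokenize`, `document_frequency`, `idf`) are illustrative assumptions, and the tokenizer keeps only lowercase words of two or more characters so that ‘mom’s’ is counted as ‘mom’.

```python
import math
import re

documents = [
    "My name is Lilly",
    "Lilly is my mom’s favorite flower",
    "My mom loves flowers",
]


def tokenize(text):
    # Lowercase and keep words of two or more characters (an assumption of this sketch)
    return re.findall(r"\b\w\w+\b", text.lower())


def document_frequency(term, docs):
    # df(t): number of documents in which the term appears at least once
    return sum(1 for doc in docs if term in tokenize(doc))


def idf(term, docs):
    # idf = LN[n / df(t)]; raises ZeroDivisionError if the term appears in no document
    return math.log(len(docs) / document_frequency(term, docs))


idf_my = idf("my", documents)
idf_name = idf("name", documents)

print(idf_my)              # 0.0
print(round(idf_name, 2))  # 1.1
```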

How does the sklearn library calculate TFIDF?

By default, the sklearn library uses these two formulas for TF and IDF:

TF = # of occurrence of a word in a Document

idf = LN[(1+n) / (1+df(t))] + 1

If I use the same three documents, for the word ‘My’:

TF(My) = 1

IDF(My) = LN((1+3)/(1+3)) + 1 = 1

The formula for TFIDF is:

TFIDF = TF * IDF

So, the TFIDF for ‘My’ is:

TFIDF(My) = 1 * 1 = 1

For the word ‘name’:

TF(name) = 1

IDF(name) = LN((1+3)/(1+1)) + 1 = 1.69

TFIDF(name) = 1 * 1.69 = 1.69

In the same way, the TFIDF values for all the words are:

Image By Author
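
The unnormalized TFIDF scores for every word can also be computed with a few lines of Python. This is a sketch under stated assumptions: raw counts for TF, the smoothed IDF formula shown above, and a tokenizer that keeps lowercase words of two or more characters. The variable names are illustrative only.

```python
import math
import re
from collections import Counter

documents = [
    "My name is Lilly",
    "Lilly is my mom’s favorite flower",
    "My mom loves flowers",
]


def tokenize(text):
    # Lowercase and keep words of two or more characters (an assumption of this sketch)
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# df(t): number of documents that contain each word
df = Counter(word for doc in tokenized for word in set(doc))

# idf = LN[(1+n) / (1+df(t))] + 1
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

# TFIDF = TF * IDF, with TF as the raw count of the word in the document
tfidf = [
    {word: count * idf[word] for word, count in Counter(doc).items()}
    for doc in tokenized
]

for number, scores in enumerate(tfidf, start=1):
    print(f"document {number}:", {word: round(score, 2) for word, score in scores.items()})
# document 1: {'my': 1.0, 'name': 1.69, 'is': 1.29, 'lilly': 1.29}
```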

sklearn’s TfidfVectorizer normalizes each document’s values so that they fall on a 0 to 1 scale (by default it uses L2 normalization). For that, we need the SS (sum of squares) of the tfidf values in each document:

The normalized tfidf is:

The tfidf value for the word / the square root of the SS of the document

If we take the word ‘My’, the normalized tfidf for ‘My’ in document 1 is:

tfidf_normalized(My) = 1.00 / sqrt(7.871) = 0.356

The normalized tfidf for the word ‘mom’ in document 3 is:

tfidf_normalized(mom) = 1.42 / sqrt(9.005) ≈ 0.472

Again, the normalized tfidf for the word ‘mom’ in document 2 is:

tfidf_normalized(mom) = 1.42 / sqrt(13.009) ≈ 0.392

It looks like the word ‘mom’ has a bit more relevance in document 3 than in document 2.

The normalized tfidf values for all the words are here:

Image By Author
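
The normalization step can be sketched in plain Python as well. The code below assumes raw-count TF, the smoothed IDF formula, and division by the square root of each document’s sum of squares (L2 normalization); the function names are illustrative. It works in full precision, so its results need not match hand-rounded figures exactly.

```python
import math
import re
from collections import Counter

documents = [
    "My name is Lilly",
    "Lilly is my mom’s favorite flower",
    "My mom loves flowers",
]


def tokenize(text):
    # Lowercase and keep words of two or more characters (an assumption of this sketch)
    return re.findall(r"\b\w\w+\b", text.lower())


def smoothed_tfidf(docs):
    # Unnormalized scores: raw count * (LN[(1+n) / (1+df(t))] + 1)
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    df = Counter(word for doc in tokenized for word in set(doc))
    idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}
    return [{word: tf * idf[word] for word, tf in Counter(doc).items()} for doc in tokenized]


def l2_normalize(scores):
    # Divide every score by the square root of the document's sum of squares
    norm = math.sqrt(sum(value ** 2 for value in scores.values()))
    return {word: value / norm for word, value in scores.items()}


normalized = [l2_normalize(scores) for scores in smoothed_tfidf(documents)]

print(round(normalized[0]["my"], 4))   # 0.3731
print(round(normalized[1]["mom"], 4))  # 0.3763
print(round(normalized[2]["mom"], 4))  # 0.4445
```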

Now, let’s check how the tfidf vectorizer in the sklearn library works.

First, import the Tfidf vectorizer from the sklearn library and define the text to be used for feature extraction:

from sklearn.feature_extraction.text import TfidfVectorizer 
text = ["My name is Lilly",
       "Lilly is my mom’s favorite flower",
       "My mom loves flowers"]

In the next code block:

The first line creates a TfidfVectorizer object and saves it in a variable named vectorizer.

The second line fits the vectorizer to the text and transforms the text into a matrix of tfidf values.

The third line converts that matrix to an array for display.

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text)
X.toarray()

Output:

array([[0.        , 0.        , 0.        , 0.4804584 , 0.4804584 ,
        0.        , 0.        , 0.37311881, 0.63174505],
       [0.49482971, 0.49482971, 0.        , 0.37633075, 0.37633075,
        0.        , 0.37633075, 0.2922544 , 0.        ],
       [0.        , 0.        , 0.5844829 , 0.        , 0.        ,
        0.5844829 , 0.44451431, 0.34520502, 0.        ]])

It will be helpful to convert this array into a DataFrame and use the words as the column names.

import pandas as pd
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

Image By Author

You can use the same parameters as I explained in my tutorial on CountVectorizer to refine or limit the number of features. Please feel free to check that.

Conclusion

This tutorial explained in detail how Tfidf Vectorizer works. Though it is very simple to use the Tfidf Vectorizer from sklearn library, it is important to understand the concept behind it. When you know how a vectorizer works, it becomes easier to make the decision on what kind of vectorizer is suitable.

Feel free to follow me on Twitter and like my Facebook page.

The video version of this tutorial: https://www.youtube.com/watch?v=nB8RnKwl3KI
