Uniqtech

Summary

This webpage provides an in-depth explanation of the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, a key concept in information retrieval and text analysis, along with its formula, intuition, and usage in Natural Language Processing (NLP) and sklearn.

Abstract

The webpage titled "tf-idf basics of information retrieval" offers a comprehensive guide to the TF-IDF algorithm, a crucial component of information retrieval and text analysis. It explains the algorithm in detail, including its formula, intuition, and usage in NLP and sklearn, and covers the importance factor, term frequency, inverse document frequency, and tokenization. It also provides examples using sklearn's TfidfVectorizer, a brief overview of Langchain's TF-IDF utility, and a section on data cleaning and preprocessing that emphasizes removing stop words and stemming for accurate results.

Bullet points

  • The TF-IDF algorithm is a key concept in information retrieval and text analysis.
  • The algorithm measures the importance of a word in a document within a collection of documents (corpus).
  • The importance factor is proportional to the frequency of the keyword in the document, normalized by the length of the document, and inversely proportional to the frequency of the word in other documents in the corpus.
  • The TF-IDF formula includes term frequency (TF) and inverse document frequency (IDF).
  • Term frequency (TF) is the count of a keyword/term in the current document, normalized by the number of words in the document.
  • Inverse document frequency (IDF) is the logarithm of the total number of documents in the collection divided by the number of documents containing the term.
  • The TF-IDF algorithm is used in NLP, and sklearn provides a feature extractor called TfidfVectorizer.
  • Data cleaning and preprocessing, including removing stop words and stemming, are important for accurate TF-IDF results.
  • Langchain also offers a TF-IDF utility for retrieving relevant documents.

tf-idf basics of information retrieval

Title: TF-IDF (definition) tf–idf, tfidf, information retrieval, term frequency–inverse document frequency. Understanding the TF-IDF formula in minutes. Uniqtech Guide to TF-IDF.

Introduction

TF-IDF models how important keywords are within a document, and also in the context of a collection of documents and texts known as a corpus. TF-IDF is a key algorithm in information retrieval and is widely used in document retrieval. For the definition of Information Retrieval, read our flash card on Information Retrieval (IR) Definition.

Importance factor explained in plain English: the importance factor is proportional to the frequency of the keyword in the document, normalized by the length of the document (long docs don't get an advantage over short docs), and inversely proportional to the frequency of the word in other documents in the corpus (the importance factor is offset by how frequently the word appears elsewhere). For the math formula, see the Wikipedia example below; for the tf-idf function, scroll down to the tf-idf function section. Update, September 2023: thank you Langchain for linking to us. LOVE 🦜⛓️. We are honored. We added a section on using Langchain for TF-IDF, made the article flow better, and added clarifications.

Intuition: let's explain the intuition behind the offset calculation. With this discounting, words that appear across most of the corpus are naturally down-weighted, such as "economics" or "economy" in a collection of Economist magazine articles. Nearly every document in that collection mentions econ-related words, so those words shouldn't affect the importance factor greatly. Commonly appearing words are effectively "ignored", while topic-specific words, unique keywords, and specialized terms are highlighted.

tf-idf versus state-of-the-art NLP:

It's important to understand that tf-idf is no longer a state-of-the-art NLP algorithm, though it is still used to prototype search, appears in research papers and NLP tasks, and is often taught in class as a classic NLP algorithm. It is an important concept to understand in information retrieval (IR). In the scikit-learn machine learning library, TfidfVectorizer is a feature extractor used in feature engineering. More modern, popular NLP models include word2vec, BERT, and GPT-3: high-dimensional deep learning models with large parameter counts.

tf-idf is still a popular term weighting scheme, used in parts of Google Search, SEO and search-result ranking, NYTimes article text summarization, and countless websites. One can definitely develop fancier, more advanced algorithms on top of this elegant and powerful concept.

TF-IDF Formula, Math Notation

Term frequency (TF): It's intuitive. The more often a word appears in a document, the more likely it is part of the document's main topic. Caveat 1: keyword spamming. Caveat 2: what if document_1 is much longer than document_2? You can normalize the term frequency by document length.
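To make caveat 2 concrete, here is a tiny sketch (with made-up documents) showing how raw counts reward long documents while length-normalized term frequency does not:

def tf(term, doc_tokens):
    # count of the term, normalized by the number of words in the document
    return doc_tokens.count(term) / len(doc_tokens)

short_doc = "good movie".split()  # 2 words, 1 occurrence of "good"
long_doc = ("good plot good cast but slow pacing "
            "overall a good long movie").split()  # 12 words, 3 occurrences

print(short_doc.count("good"), long_doc.count("good"))  # raw counts: 1 vs 3
print(tf("good", short_doc), tf("good", long_doc))      # normalized: 0.5 vs 0.25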

Data cleaning and preprocessing can change TF-IDF results. TF-IDF cannot group related words without developers' help; it is sensitive to casing, stemming, and capitalization. As a best practice, and for more accurate results, it is also important to remove stop words (e.g. a, and, however) during the data cleaning phase. Text data and NLP libraries usually come with stop word functions and utilities, and stop words can differ by language. It's also common practice to convert all text to lower case using .lower(). Be aware there may be information loss and caveats: lower casing can remove the nuanced meaning of words such as Fossil. Capitalized "Fossil" could mean the clothing brand, while lower-case "fossil" refers to the remains of a prehistoric organism, so after .lower() we may not know a social media post is talking about the brand. Stemming (cutting words down to a prefix, suffix, or partial word) can lead to information loss too.
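Here is a minimal preprocessing sketch using the nltk library mentioned further below; the sample sentence is made up, and exact stemmer output may vary:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stop word lists; stop words differ by language

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

text = "The Economist covers economics and the economy, however briefly."
tokens = word_tokenize(text.lower())  # beware: lower casing loses Fossil vs fossil
cleaned = [stemmer.stem(t) for t in tokens
           if t.isalpha() and t not in stop_words]
print(cleaned)  # stemmed, lower-cased tokens with stop words removed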

Inverse document frequency (IDF): Stop words (the, and, a) appear very frequently in English texts, so regardless of whether they are useful in determining the actual meaning of a document, they will score high in term frequency. Remember our Economist magazine? The word "Economist" may appear in the margin of every page spread. It's not helpful for distinguishing article_1 from article_2. We may want to discount words that stem to econ.

How to calculate TF-IDF by hand? See the worked example on Wikipedia: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[Figure: tf-idf calculated by hand, from the Wikipedia article]

Note the very interesting case where "the" appears in every document, so its inverse document frequency = log(number of docs in the corpus / number of docs containing the word "the") = log(2/2) = log(1) = 0! So this stop word does not matter at all in our text analysis task.
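The same arithmetic in a couple of lines of Python:

import math

docs_in_corpus = 2
docs_containing_the = 2  # "the" appears in every document
print(math.log(docs_in_corpus / docs_containing_the))  # log(2/2) = log(1) = 0.0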

tf-idf function

tfidf(t, d, D) = tf(t, d) • idf(t, D)

Term frequency can be defined using

tf(t, d) = count(t, d) / |d|

the count of a keyword/term in the current document, normalized by the number of words in the document. This makes it robust to keyword spamming in lengthy documents: lengthy documents don't automatically get higher term frequency.

inverse document frequency (idf)

idf(t, D) = log(|D| / |document(s) with t|)

The weight is discounted by the number of documents that also contain the term.

another version of the formula

the count of a term, normalized by the length of the current document, multiplied by the logarithm of the total number of documents in the collection divided by the number of documents containing the term.
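Putting the pieces together, here is a minimal from-scratch sketch of the formulas above; the toy corpus is made up for illustration, and the idf function assumes the term appears in at least one document:

import math

def tf(t, d):
    # count of term t in document d, normalized by document length
    return d.count(t) / len(d)

def idf(t, D):
    # log of the total number of documents over documents containing t
    return math.log(len(D) / sum(1 for d in D if t in d))

def tfidf(t, d, D):
    # tfidf(t, d, D) = tf(t, d) * idf(t, D)
    return tf(t, d) * idf(t, D)

# toy corpus: each document is a list of tokens
D = [
    "the economy is growing".split(),
    "the fossil was found".split(),
]
print(tfidf("economy", D[0], D))  # positive: "economy" distinguishes document 1
print(tfidf("the", D[0], D))      # 0.0: "the" appears in every document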

A brief paragraph about us: like what you read so far? You can support us by clapping for us on Medium. We write beginner-friendly, bootcamp-graduate-friendly machine learning, deep learning, and data science articles on Medium. Follow our profile (1200+ followers) and our top publication, Data Science Bootcamp. You can also find our paid newsletter on Substack.com, where we post machine learning resources, paid-subscriber easter eggs for the best internet resources for ML, DL, and data, trends, and summaries of conferences and seminars. Read more about our offering here. We are developing a machine learning course as we speak. Thank you for your support. Claps and follows are always appreciated. New articles from all sites are tweeted out @siliconlikes.

Natural Language Processing (NLP) in general and with sklearn:

Tokenization: breaking sentences into words, often followed by taking counts of the words. In sklearn: sklearn.feature_extraction.text.CountVectorizer, sketched below.
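A short sketch of tokenizing and counting with CountVectorizer (the two sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The economy is growing.", "The economy slowed down."]
vectorizer = CountVectorizer()  # tokenizes and counts words
counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(counts.toarray())  # one row of word counts per document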

Here's a nice tutorial series on how to tokenize, stem, and remove stop words using the nltk library, a popular Python natural language processing library: https://www2.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html

It also shows how to marry tokenization and stemming with sklearn's tf-idf (term frequency-inverse document frequency). For tf-idf with sklearn (scikit-learn), check out TfidfVectorizer.

sklearn.feature_extraction.text.TfidfVectorizer

The scikit-learn documentation on TfidfVectorizer is also helpful.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
 'This is the first document.',
 'This document is the second document.',
 'And this is the third one.',
 'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()
# array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
#        'this'], ...)
print(X.shape)
# (4, 9)
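As a small follow-up (not from the sklearn docs), one can densify the sparse matrix and pair each vocabulary term with its tf-idf weight in a given document:

# tf-idf weights of each vocabulary term in the first document
weights = X.toarray()[0]
for term, weight in zip(vectorizer.get_feature_names_out(), weights):
    print(f"{term}: {weight:.3f}")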

There are different implementations of term frequency-inverse document frequency (TF-IDF). The notation used above is one of them: t for the term, d for the current document, and D for all documents in the corpus.

TF-IDF and Langchain 🦜⛓️

We are thrilled that the Langchain documentation thought we did a good job explaining TF-IDF and linked to our blog post. Thank you! Langchain also offers a TF-IDF utility; read about it in their docs: "This notebook goes over how to use a retriever that under the hood uses TF-IDF using scikit-learn package."

# import libs
from langchain.retrievers import TFIDFRetriever  # in newer versions: langchain_community.retrievers
from langchain.schema import Document

# init a TFIDF retriever
retriever = TFIDFRetriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        ...
    ]
)

# retrieve results
result = retriever.get_relevant_documents("foo")

result
