NLP: Term Frequency-Inverse Document Frequency
Getting Started with Natural Language Processing

In Natural Language Processing (NLP), we are often interested in analyzing a document for various metrics, such as how often a word occurs in a document, how rare some words are, or the overall importance of some words. One way to do this is the Term Frequency-Inverse Document Frequency (TF-IDF) technique.
Term Frequency
Term frequency (TF) measures how often a word occurs in a document, relative to the document's length. It gives a rough estimate of how important a word is within that document: the higher a word's frequency, the more important it is assumed to be.
We can calculate Term Frequency in Python as follows:
#!/usr/bin/env python
# coding: utf-8
def calculate_tf(paragraph):
    word_list = paragraph.lower().split()  # Convert paragraph to lowercase and split into words
    total_words = len(word_list)  # Total number of words in the paragraph
    word_count = {}
    for word in word_list:
        if word not in word_count:
            word_count[word] = 0
        word_count[word] += 1
    term_frequency = {}
    for word, count in word_count.items():
        term_frequency[word] = count / total_words
    return term_frequency
# Example usage
paragraph = "A thinker sees his own actions as experiments and questions--as attempts to find out something. Success and failure are for him answers above all."
tf = calculate_tf(paragraph)
# Print the term frequency for each word
for word, frequency in tf.items():
    print(f"Word: {word} | TF: {frequency}")
Its output would look as follows. Because the text is split on whitespace only, punctuation stays attached to tokens such as "questions--as" and "something.":
Word: a | TF: 0.041666666666666664
Word: thinker | TF: 0.041666666666666664
Word: sees | TF: 0.041666666666666664
Word: his | TF: 0.041666666666666664
Word: own | TF: 0.041666666666666664
Word: actions | TF: 0.041666666666666664
Word: as | TF: 0.041666666666666664
Word: experiments | TF: 0.041666666666666664
Word: and | TF: 0.08333333333333333
Word: questions--as | TF: 0.041666666666666664
Word: attempts | TF: 0.041666666666666664
Word: to | TF: 0.041666666666666664
Word: find | TF: 0.041666666666666664
Word: out | TF: 0.041666666666666664
Word: something. | TF: 0.041666666666666664
Word: success | TF: 0.041666666666666664
Word: failure | TF: 0.041666666666666664
Word: are | TF: 0.041666666666666664
Word: for | TF: 0.041666666666666664
Word: him | TF: 0.041666666666666664
Word: answers | TF: 0.041666666666666664
Word: above | TF: 0.041666666666666664
Word: all. | TF: 0.041666666666666664
Inverse Document Frequency (IDF)
IDF measures how rare a word is across a collection of documents. Words that occur in many documents are assigned lower scores, while words that occur in only a few documents are assigned higher scores.
We can calculate IDF in Python as follows:
import math
def calculate_idf(paragraphs):
    total_documents = len(paragraphs)  # Total number of documents
    word_count = {}  # Number of paragraphs that contain each word
    for paragraph in paragraphs:
        # Lowercase, split into words, and keep each word once per paragraph
        words_in_paragraph = set(paragraph.lower().split())
        for word in words_in_paragraph:
            if word not in word_count:
                word_count[word] = 0
            word_count[word] += 1
    inverse_document_frequency = {}
    for word, count in word_count.items():
        inverse_document_frequency[word] = math.log(total_documents / (count + 1))
    return inverse_document_frequency
# Example usage
paragraphs = [
    "Bill short, depressed, very wide at base, commissure straight. Nostrils basal, oval, partly closed by membrane.",
    "Bill with notch in upper mandible; nostrils placed well in front of base of bill and quite bare.",
    "Bill broad, flattened horizontally depressed, slightly toothed and adapted for catching small.",
    "Bill compressed towards tip, with scarcely perceptible notch at point; nostrils basal."
]
idf = calculate_idf(paragraphs)
# Print the IDF for each word
for word, idf_value in idf.items():
    print(f"Word: {word} | IDF: {idf_value}")
Here, we use the formula math.log(total_documents / (count + 1)) to calculate the IDF score, where total_documents is the number of paragraphs in the above code snippet and count is the number of those paragraphs that contain the specific word. Other definitions of the IDF score are possible as well.
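To make the choice of formula concrete, here is a minimal sketch comparing the formula above with one smoothed alternative, log((1 + N) / (1 + count)) + 1, which is similar to the smoothed form documented for scikit-learn. The helper names (document_frequency, idf_article, idf_smooth) and the toy corpus are assumptions for illustration and are not part of the code above.

```python
import math

def document_frequency(paragraphs):
    """Count how many paragraphs contain each lowercase, whitespace-split word."""
    df = {}
    for paragraph in paragraphs:
        for word in set(paragraph.lower().split()):
            df[word] = df.get(word, 0) + 1
    return df

def idf_article(total_documents, count):
    # Formula used in this article: zero or negative for very common words
    return math.log(total_documents / (count + 1))

def idf_smooth(total_documents, count):
    # Smoothed alternative (an assumption, not the article's formula):
    # never drops below 1, even for a word found in every document
    return math.log((1 + total_documents) / (1 + count)) + 1

# Hypothetical toy corpus: "the" is in every document, "cat" in only one
docs = ["the cat sat", "the dog ran", "the bird flew", "the fish swam"]
df = document_frequency(docs)
n = len(docs)
for word in ("the", "cat"):
    print(f"Word: {word} | article IDF: {idf_article(n, df[word])} | smoothed IDF: {idf_smooth(n, df[word])}")
```

With the article's formula, a word that appears in every document gets a slightly negative score, and a word that appears in all but one document gets exactly zero. Whether that matters depends on how the scores are used downstream.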
Finally, the TF-IDF score of a word in a document is calculated by multiplying its TF in that document by its IDF: TF * IDF. Such scores can be used as a numerical representation of the words in a text and can help in text classification, clustering, and information retrieval.
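The multiplication step can be sketched as follows. This is a rough illustration rather than a definitive implementation: the tokenize and tf_idf helpers and the toy documents are hypothetical, and the regex-based tokenizer (which strips punctuation) is an assumption that differs from the plain split() used earlier.

```python
import math
import re

def tokenize(text):
    # Assumption: lowercase, then keep runs of letters, digits, and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def tf_idf(documents):
    """Return one {word: TF * IDF} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    total_documents = len(tokenized)

    # Number of documents that contain each word
    doc_count = {}
    for words in tokenized:
        for word in set(words):
            doc_count[word] = doc_count.get(word, 0) + 1

    # IDF with the same formula as in this article
    idf = {w: math.log(total_documents / (c + 1)) for w, c in doc_count.items()}

    scores = []
    for words in tokenized:
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        total_words = len(words)
        # TF is the count divided by document length; multiply by IDF
        scores.append({w: (c / total_words) * idf[w] for w, c in counts.items()})
    return scores

# Hypothetical toy documents
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sang.",
    "Fish swim in the sea.",
]
scores = tf_idf(documents)
for i, doc_scores in enumerate(scores):
    top = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:3]
    print(f"Document {i}: {top}")
```

Words that are frequent in one document but rare across the others end up with the highest scores, while a common word such as "the" is pushed towards zero.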
