Day 2: 30 Days of Natural Language Processing Series with Projects

Most important topics in NLP…

Welcome back peeps. As we move forward in 30 days of NLP with Projects series, in this post we will focus on basic terminologies/techniques to build a good foundation.

Some of the other best Series —

30 Days of Natural Language Processing ( NLP) Series

30 days of Data Engineering with projects Series

60 days of Data Science and ML Series with projects

100 days : Your Data Science and Machine Learning Degree Series with projects

23 Data Science Techniques You Should Know

Tech Interview Series — Curated List of coding questions

Complete System Design with most popular Questions Series

Complete Data Visualization and Pre-processing Series with projects

Complete Python Series with Projects

Complete Advanced Python Series with Projects

Kaggle Best Notebooks that will teach you the most

Complete Developers Guide to Git

All the Data Science and Machine Learning Resources

210 Machine Learning Projects

30 days of Machine Learning Ops

Projects Videos —

Subscribe today!

Ignito

Excited to share that we have launched our Youtube channel — Ignito to cover all the projects and coding exercise for …

www.youtube.com

Tech Newsletter —

If you are interested, you can join my newsletter through which I send tech interview tips, techniques, patterns, hacks — Software Development, ML, Data Science, Startups and Technology projects to more than 30K readers. You can subscribe to Tech Brew :

Ignito

Data Science, ML, AI and more… Click to read Ignito, by Naina Chaturvedi, a Substack publication. Launched 7 months…

naina0405.substack.com

Day 1 post which covers the pre-requisites before you dive in further —

Day 1–30 Days of Natural Language Processing Series with Projects

Get set go…

medium.com

So let’s get started!

Bag of Words —

Bag of Words is the NLP technique which is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of most frequently used words.

The process of creating a bag-of-words representation of text data typically involves the following steps:

Tokenization: The text is split into individual words or tokens. This step is also known as text segmentation or word segmentation.
Removing Stop words: Stop words are common words that do not add much meaning to the text such as “the”, “is”, “and” etc. These are typically removed from the text as they do not contribute much to the meaning of the text.
Stemming or Lemmatization: The process of reducing inflected or derived words to their word stem, base or root form. It is done to reduce the dimensionality of the data and to group similar words together.
Building the vocabulary: The set of all unique words in the text is built, also known as the vocabulary.
Counting the words: The number of occurrences of each word in the vocabulary is counted.
Building the BoW representation: A vector is created where each element represents a word in the vocabulary, and the value of the element represents the number of occurrences of that word in the text.

Once the bag-of-words representation of the text is created, it can be used as input for various NLP tasks such as text classification, sentiment analysis, topic modeling, and information retrieval. Since the order of the words is not considered, the BoW model is not suitable for tasks that require understanding the context or meaning of the text, such as language translation or text generation.

In summary, the bag-of-words model is a method of representing text data where the order of the words is not considered. It is commonly used in text classification and information retrieval tasks and it is done by Tokenization, Removing Stop words, Stemming or Lemmatization, Building the vocabulary, Counting the words, Building the BoW representation.

Bag of Words Vectors

Real world data is messy and unstructured and in order to use ML algorithms that prefer structured, well defined fixed-length inputs we use the bag of word technique using which we can convert variable-length texts into a fixed-length vector.

It works by creating a vocabulary of all the unique words in the text data, and then for each document or sentence, creating a vector where each dimension represents the count of each word in the vocabulary.

For example, let’s say we have a vocabulary of three words: [“dog”, “cat”, “mouse”]. If a sentence is “The dog chased the cat”, the BoW vector would be [1, 1, 0], representing one occurrence of “dog” and one occurrence of “cat” in the sentence, and no occurrences of “mouse”.

BoW is a simple and efficient representation of text data, but it does not take into account the order of words in a sentence or any context beyond the occurrence of individual words.

Term Frequency-Inverse Document Frequency (TF-IDF) —

Term frequency–inverse document frequency is used to identify the importance of a word in in a document in a collection or corpus. TF is the number of times a word appears in a document divided by the total number of words in the document. IDF is the log of the numbers of documents divided by the number of documents that contain the word w.

T.F = No of rep of words in sentence/No of words in sentence

IDF = No of sentences / No of sentences containing words

The TF part of the method calculates the frequency of a word in a given document, with the idea that a word that appears more often in a document is more important to the meaning of the document. This frequency is typically normalized by the total number of words in the document.
The IDF part of the method calculates the inverse of the document frequency of a word, with the idea that a word that appears in many documents is less important to the meaning of a specific document. This is typically calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing the word.
The final TF-IDF score for a word in a document is the product of its TF and IDF scores. These scores can then be used to rank the importance of words in a document or to compare the similarity of documents.

It’s widely used in information retrieval and text mining because of its ability to give more weight to terms that are more specific to the given document and less weight to terms that are common across all documents, allowing for more relevant results.

Code —

from sklearn.feature_extraction.text import TfidfVectorizer 
cv = TfidfVectorizer() 
X = cv.fit_transform(counters).toarray()

N-grams is a sequence of N tokens

1 gram is one word sequence

2 gram is a two word sequence of words

3 gram is three word sequence of words

Stop Words

These are most common words which do not add much value to the meaning of the document.

The idea behind this is that these words are considered “unimportant” for these tasks and are unlikely to provide useful information. Examples of stop words include “the,” “an,” “and,” “is,” “are,” “was,” etc.
The process of removing stop words from text data typically involves first creating a list of stop words, and then iterating through the text data, comparing each word to the list of stop words, and removing any that match. This is often done as a pre-processing step to improve the efficiency and accuracy of NLP tasks, as removing stop words can reduce the size of the text data and potentially remove “noise” that could negatively impact the performance of the task.

Code —

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

stopwords = set(stopwords.words('english'))

Word embedding

It’s a dense representation of words in the form of numeric vectors.

The process of creating word embeddings typically involves training a neural network on a large corpus of text data. The neural network, such as a skip-gram or a continuous bag-of-words model, is trained to predict a word given the context of surrounding words. During training, the weights of the network’s layers are adjusted so that the embeddings for semantically similar words are close to each other in the space.
Once the embeddings are learned, they can be used as input to other NLP models, such as language models or sentiment analysis models. By representing words as vectors, the embeddings can capture the meaning of the words and the relationships between them. This makes it possible to use the embeddings to improve the performance of other NLP models by providing them with a more rich and semantically meaningful representation of the text data.

Word embeddings can also be used for tasks such as word similarity comparison, word analogy, and clustering of similar words. These embeddings can be pre-trained on large corpora and fine-tuned for specific NLP tasks for better performance.

Tokenization

Tokenization is the text preprocessing technique used to break text into smaller components of text which are known as tokens.

There are several ways to tokenize words in NLP, including:

Using whitespace and punctuation as delimiters: This method splits the text into tokens based on spaces and punctuation marks.
Using regular expressions: This method uses a regular expression to define a pattern for tokenizing text.
Using pre-trained tokenizers: There are many pre-trained tokenizers available in different NLP libraries such as NLTK, spaCy and nltk word_tokenize,spacy tokenizer
Using the Treebank tokenizer: This tokenizer is based on the Penn Treebank Project, which uses a set of rules to tokenize text in a way that is similar to how humans would tokenize it.
Using BPE tokenizer : Byte pair encoding tokenizer used for sub-word tokenization.

Code —

from nltk.tokenize import word_tokenize
 
text = "Today is a beautiful day"
t = word_tokenize(text)
 
print(t)

Output —

["Today", "is", "a", "beautiful", "day"]

Stemming

It’s a technique to get to the root form of a word by removing the prefix and suffix of a word.

There are different algorithms for stemming, the most common of which are the Porter stemmer and the Snowball stemmer. The Porter stemmer uses a set of heuristic rules to remove common suffixes from words, such as “ing,” “ed,” and “s.” The Snowball stemmer is based on the Porter stemmer but also supports other languages and has additional features.
Stemming can be useful in NLP tasks such as text classification, information retrieval, and language modeling. By reducing words to their base form, stemming can help to reduce the dimensionality of the text data and make it more manageable. However, it is important to note that stemming can also lead to information loss and errors, as different words may have the same stem but different meanings.
In addition to stemming, there is another technique called Lemmatization which aims to obtain the root word of the token but unlike stemming it uses a dictionary or a morphological analyzer to obtain the root word, this way it is more accurate and less aggressive than stemming.

Code —

from nltk.stem import PorterStemmer
 
tokenized = ["booking", "studying", "jumping"]
 
stemmer = PorterStemmer()
s = [stemmer.stem(token) for token in tokenized]

Output —

['book','studi','jump]

Lemmatization

Another text preprocessing technique using which we reduce the words down to their root forms.

The process of lemmatization, similar to stemming, reduces the inflected words to their word stem, base or root form, which is called the root word, but unlike stemming, it makes sure that the root word is a real word in the language, known as a lemma.
The process of lemmatization typically uses a dictionary-based approach, where words are looked up in a dictionary and replaced with their corresponding lemma. The lemma of a word is the base form of the word, which can be found by removing inflectional endings and by bringing the word to its normal form. The lemma of the word “running” would be “run”.
Lemmatization can be useful in NLP tasks such as text classification, information retrieval, and machine translation, where it is important to reduce the dimensionality of the data by mapping different forms of a word to a common base form. It can also improve the performance of text-based models such as language models, by reducing the number of unique words in the vocabulary and making the model more robust to inflected forms of words.

Lemmatization can be done by using pre-trained libraries such as NLTK, spaCy and WordNetLemmatizer.

from nltk.stem import WordNetLemmatizer
 
tokenized = ["booking", "studying", "jumping"]
 
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

Output —

['book','study','jump]

Stemming vs Lemmatization

Stemming and lemmatization are both techniques used to reduce words to their base or root form in natural language processing (NLP). However, they are slightly different in their approach and the results they produce.
Stemming is a process of removing the suffixes or prefixes from a word to reduce it to its root form. It uses a set of rules or heuristics to chop off the ends of words to arrive at the root form. It is a simple, fast and crude method. For example, the stem of the word “running” would be “run”.
Lemmatization, on the other hand, is a more sophisticated method. It uses a dictionary-based approach, where words are looked up in a dictionary and replaced with their corresponding lemma. The lemma of a word is the base form of the word, which can be found by removing inflectional endings and by bringing the word to its normal form. The lemma of the word “running” would be “run”.

In summary, the key difference between stemming and lemmatization is that stemming is a crude heuristic method while lemmatization uses a dictionary to map words to their base form, which is more accurate but also more computationally expensive. While stemming may produce non-real words, lemmatization will produce real words.

Part of Speech Tagging

It’s a technique of assigning a part of speech to every word in a string. It tells us the way in which a word is used in a sentence and relationship between the words.

This is an important step in natural language processing (NLP) because it helps disambiguate the meaning of words and can be used as a feature in many NLP tasks, such as syntactic parsing and named entity recognition.
There are two main approaches to POS tagging: rule-based and statistical. Rule-based approaches use a set of manually defined rules to assign POS tags to words, while statistical approaches use machine learning algorithms to learn the most likely POS tag for a given word based on its context. The most common algorithm used in statistical POS tagging is the Hidden Markov Model (HMM).
HMM algorithm takes into account the sequence of words and the probability of a word given its previous words, and uses this information to assign the most likely POS tag. This approach is more accurate than rule-based methods and can handle the ambiguity and variability of natural language.

Additionally, there are pre-trained models for POS tagging like BERT, ELMO, and GPT-2, which can be fine-tuned on specific tasks, providing high accuracy.

Day 3 : Coming soon!

Follow for more updates, stay tuned and of-course let me end this post with a quote by Steve Jobs ;)

“Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle. As with all matters of the heart, you’ll know when you find it.”

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Build Machine Learning Pipelines( With Code) — Part 1

Complete implementation…

medium.datadriveninvestor.com

Recurrent Neural Network with Keras

Recurrent Neural Network with Keras

Project Implementation and cheatsheet…

medium.datadriveninvestor.com

Clustering Geolocation Data in Python using DBSCAN and K-Means

Clustering Geolocation Data in Python using DBSCAN and K-Means

Project Implementation…

medium.datadriveninvestor.com

Facial Expression Recognition using Keras

Facial Expression Recognition using Keras

Project Implementation…

medium.datadriveninvestor.com

Hyperparameter Tuning with Keras Tuner

Hyperparameter Tuning with Keras Tuner

Project Implementation….

medium.datadriveninvestor.com

Custom Layers in Keras

Custom Layers in Keras

Code implementation …

medium.datadriveninvestor.com