Data Science with Python — Natural Language Processing
The basics of the technology behind ChatGPT
This article is part of the “Data Science with Python” series.
Today, with the constant stream of social media posts, news articles, customer reviews, and more, natural language processing (NLP) has become a crucial tool in the data scientist’s toolkit, especially now that technologies like ChatGPT are available to everyone (ChatGPT is, of course, an application of NLP).
NLP is a subfield of artificial intelligence that focuses on the interaction between computers and human language. It involves using algorithms and statistical models to understand and generate human language. With the help of NLP, we can extract valuable insights from text data, such as sentiment analysis, text classification, and topic modeling.
Today, we’ll see how to perform NLP with Python.
Basics of NLP
Python has emerged as a popular language for NLP tasks due to its simplicity, ease of use, and the availability of powerful libraries such as Natural Language Toolkit (NLTK), spaCy, and Gensim. These libraries provide a range of tools for performing NLP tasks, such as tokenization, stemming, lemmatization, and stop word removal.
- Tokenization is the process of breaking down a text into individual words or phrases, which are called tokens. It helps to standardize the input and create a manageable structure for analysis. Tokens are often created by splitting a text on whitespace, punctuation, or other delimiters.
- Stemming is the process of reducing a word to its base or root form, known as a stem. This is done by removing the suffixes from the end of a word. For example, the word “jumping” might be stemmed to “jump”. The purpose of stemming is to reduce the dimensionality of the data and group together words that have the same root.
- Lemmatization is similar to stemming in that it reduces words to their base form, but it does so using a dictionary or morphological analysis instead of just removing suffixes. For example, the word “went” might be lemmatized to “go”. The advantage of lemmatization over stemming is that it produces a more meaningful and accurate base form.
- Stop word removal is the process of removing words that are meaningless on their own (“the”, “and”, “a”, etc…) from a text before analysis. The purpose of this is to reduce noise in the data and focus on the more meaningful words that are important for analysis.
Before performing any NLP task, it’s important to preprocess the text data to remove any noise and make it suitable for analysis. Preprocessing techniques for text data can include converting text to lowercase, removing punctuations, and removing stop words. Once the text data is preprocessed, it can be further processed and analyzed using the appropriate NLP techniques.
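For example, here is a minimal preprocessing sketch with NLTK (a hypothetical example sentence; it assumes the punkt, stopwords, and wordnet NLTK data packages have already been downloaded):
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
text = "The cats were jumping over the fences."
# lowercase, strip punctuation, then tokenize
tokens = word_tokenize(text.lower().translate(str.maketrans('', '', string.punctuation)))
# remove stop words
tokens = [t for t in tokens if t not in stopwords.words('english')]
# stemming and lemmatization (in practice you usually pick one of the two)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # ['cat', 'jump', 'fenc']
print([lemmatizer.lemmatize(t) for t in tokens])  # ['cat', 'jumping', 'fence']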
In Python, the NLTK library provides a range of tools for performing NLP tasks. For example, to tokenize a sentence, we can use the word_tokenize() function as follows:
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and human language."
tokens = word_tokenize(text)
print(tokens)
This will output the list of tokens:
['Natural', 'Language', 'Processing', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', '.']
Note: when using NLTK, you will sometimes have to download additional data packages. For example, here is how to install punkt:
import nltk
nltk.download('punkt')
Exploring Text Data
Before performing any NLP task, it’s important to explore the text data to gain insights into its characteristics and properties. This can help in selecting appropriate preprocessing and analysis techniques, as well as identifying potential issues such as data imbalance or bias.
There are several techniques for exploring text data, such as visualizations or frequency distributions.
Visualizing text data can provide a quick overview of its characteristics and patterns. One common visualization technique is the word cloud, which shows the most frequent words in the text data in a visually appealing way. Another technique is to use scatterplots or heatmaps to visualize the relationships between words or phrases.
On the other hand, a frequency distribution shows the number of occurrences of each word or phrase in the text data. This can help in identifying the most common words, as well as outliers or rare words. Frequency distributions can also be visualized using histograms or bar charts.
To visualize text data in Python, we can use libraries such as Matplotlib, Seaborn, or WordCloud. For example, to create a word cloud of the most frequent words in the text data, we can use the WordCloud library as follows:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and human language."
wordcloud = WordCloud(width=800, height=800, background_color='white', stopwords=set()).generate(text)
plt.figure(figsize=(8, 8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Note: if you can’t install wordcloud with pip, it’s probably because there is no release compatible with your Python version, or because you don’t have a C compiler installed.
To create a frequency distribution in Python, we can use the NLTK library’s FreqDist class as follows:
from nltk import FreqDist
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and human language."
tokens = word_tokenize(text)
fdist = FreqDist(tokens)
print(fdist.most_common(10))
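Since frequency distributions can also be visualized, here is a short sketch that turns the same fdist object into a bar chart with Matplotlib:
import matplotlib.pyplot as plt
# bar chart of the 10 most common tokens
words, counts = zip(*fdist.most_common(10))
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.show()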
Sentiment Analysis
Sentiment analysis is a common NLP task that involves determining the sentiment or emotional tone of a piece of text. Sentiment analysis can be used in various applications such as social media monitoring, customer feedback analysis, and political opinion mining.
There are two main approaches to sentiment analysis: lexicon-based and machine learning-based. Lexicon-based approaches use pre-built sentiment dictionaries to assign sentiment scores to words or phrases in the text data. Machine learning-based approaches, on the other hand, use supervised or unsupervised learning algorithms to train models that can predict the sentiment of text data.
In Python, there are several libraries that provide tools for performing sentiment analysis, such as TextBlob, VADER, and Scikit-learn.
Let’s take a look at how to perform sentiment analysis using TextBlob.
TextBlob is a Python library that provides a simple API for performing common NLP tasks such as sentiment analysis, part-of-speech tagging, and noun phrase extraction. To perform sentiment analysis using TextBlob, we can use its sentiment analysis feature as follows:
from textblob import TextBlob
text = "I really enjoyed this product. It exceeded my expectations."
blob = TextBlob(text)
print(blob.sentiment)
This will output the sentiment polarity and subjectivity scores:
Sentiment(polarity=0.5, subjectivity=0.7)
The sentiment polarity score ranges from -1 (most negative) to 1 (most positive), while the subjectivity score ranges from 0 (most objective) to 1 (most subjective).
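As a quick sanity check, a clearly negative sentence should come out with a negative polarity (the exact scores depend on TextBlob’s internal lexicon):
# a negative example: the polarity will be below 0
print(TextBlob("This product is terrible. It broke after one day.").sentiment)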
Text Classification
Text classification is the process of categorizing text data into predefined classes or categories. Text classification can be used in various applications such as spam filtering, sentiment analysis, topic modeling, and language detection.
There are two main approaches to text classification: rule-based and machine learning-based. Rule-based approaches use a set of handcrafted rules to classify text data based on certain patterns or keywords. Machine learning-based approaches, on the other hand, use supervised or unsupervised learning algorithms to train models that can predict the class of text data.
In Python, you can use Scikit-learn, NLTK, and Keras to perform text classification.
Here, I will use Scikit-learn.
Scikit-learn is a popular Python library for machine learning that provides tools for data preprocessing, feature extraction, and model selection. To perform text classification using Scikit-learn, we first need to preprocess the text data and convert it into numerical features using techniques such as bag-of-words or TF-IDF.
Let’s say we have a dataset of movie reviews and we want to classify them into positive or negative sentiment classes. We can use Scikit-learn to preprocess the text data and train a logistic regression model for classification as follows:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import pandas as pd
df = pd.read_csv('movie_reviews.csv')
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)
# create a pipeline for text classification
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression())])
# train the model on the training set
text_clf.fit(X_train, y_train)
# evaluate the model on the testing set
accuracy = text_clf.score(X_test, y_test)
print('Accuracy:', accuracy)
Pipeline is a tool that allows you to chain together several text processing steps into a single object that can be trained on a dataset and used to make predictions on new data.
Here, the pipeline is composed of the following elements:
- CountVectorizer: This component transforms the text data into a numerical representation by counting the frequency of each word in the text. It converts the text into a matrix of word counts, which can then be used as input to a machine learning algorithm.
- TfidfTransformer: This component applies a technique called term frequency-inverse document frequency (TF-IDF) to the output of the CountVectorizer. This technique adjusts the word counts to reflect the importance of each word in the text relative to the rest of the corpus. It is a way of weighting the importance of each word in the text.
- LogisticRegression: This component is a classification algorithm that can be trained on the transformed data to predict the target labels for new text data.
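Once the pipeline has been fitted, it can be applied directly to raw text. For example (a hypothetical usage sketch, assuming the same positive/negative labels as in the training data):
# predict the sentiment class of new, unseen reviews
new_reviews = ["An absolute masterpiece, I loved every minute of it.",
               "Boring plot and terrible acting."]
print(text_clf.predict(new_reviews))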
Topic Modeling
Topic modeling is a technique for discovering latent topics or themes in a collection of text documents. The goal of topic modeling is to identify the underlying topics or concepts that are discussed in the text data without prior knowledge of the topics.
There are several algorithms and methods for topic modeling, including Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Hierarchical Dirichlet Process (HDP).
In Python, the Gensim library provides tools for performing topic modeling using LDA and other algorithms.
To perform topic modeling with Gensim, we first need to preprocess the text data and convert it into a bag-of-words or TF-IDF representation. Then, we can train an LDA model to extract the topics from the text data.
Here’s an example code for performing topic modeling with Gensim:
import gensim
from gensim import corpora
text_data = ['text document 1', 'text document 2', 'text document 3']
# preprocess the text data
processed_data = [doc.lower().split() for doc in text_data]
# create a dictionary of the text data
dictionary = corpora.Dictionary(processed_data)
# create a bag-of-words representation of the text data
corpus = [dictionary.doc2bow(doc) for doc in processed_data]
# train an LDA model on the text data
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=dictionary,
                                            num_topics=5,
                                            passes=10)
# print the topics learned by the LDA model
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
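Once trained, the model can also infer the topic distribution of a new, unseen document (a minimal usage sketch reusing the dictionary created above):
# infer the topic mixture of an unseen document
new_doc = "another text document"
new_bow = dictionary.doc2bow(new_doc.lower().split())
print(lda_model.get_document_topics(new_bow))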
Fine-Tuning
Fine-tuning is the process of further improving the performance of a pre-trained language model on a specific task or domain. Fine-tuning can be useful when the pre-trained model is not specialized enough for the specific task or domain, or when there is a limited amount of labeled data available for training a task-specific model from scratch.
Fine-Tuning with Hugging Face Transformers
Hugging Face Transformers is a popular Python library for natural language processing that provides pre-trained language models and tools for fine-tuning them on specific tasks. With Hugging Face Transformers, you can fine-tune a pre-trained language model on tasks such as sentiment analysis, named entity recognition, and question answering.
Here’s an example code for fine-tuning a pre-trained BERT model on a sentiment analysis task with Hugging Face Transformers:
from transformers import BertForSequenceClassification, BertTokenizer
import torch
# load the pre-trained BERT model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# load the sentiment analysis dataset
train_dataset = ... # load the training dataset
dev_dataset = ... # load the development dataset
# fine tune the BERT model on the sentiment analysis task
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
for epoch in range(3):
    # train the model on the training dataset
    train_loss = ...
    # evaluate the model on the development dataset
    dev_loss = ...
    print('Epoch {}: Train Loss = {:.4f}, Dev Loss = {:.4f}'.format(epoch+1, train_loss, dev_loss))
This code loads a pre-trained BERT model and tokenizer from Hugging Face Transformers, and fine-tunes the model on a sentiment analysis task using an optimizer and training and development datasets.
Note: torch is a Python library used to develop deep-learning models.
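The '...' placeholders depend on how the dataset is loaded. To give an idea of what happens inside the loop, here is a minimal, hypothetical sketch of a single training step, assuming each batch is a list of raw texts together with a tensor of labels (not a complete training script):
def training_step(batch_texts, batch_labels):
    # tokenize the raw texts into tensors the model understands
    inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors='pt')
    # forward pass: the model returns the loss when labels are provided
    outputs = model(**inputs, labels=batch_labels)
    loss = outputs.loss
    # backward pass and parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()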
Evaluating a Model
To evaluate the performance of a model, we should use appropriate metrics for the specific task. For example, for a sentiment analysis task, metrics such as accuracy, precision, recall, and F1 score can be used.
Also, it’s important to evaluate the performance of the model on a held-out test set to ensure that it generalizes well to unseen data, which lets you check whether the model is overfitting.
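For example, with the scikit-learn pipeline trained earlier, these metrics can be computed on the held-out test set (a short sketch reusing text_clf, X_test, and y_test from the text classification example):
from sklearn.metrics import classification_report
# precision, recall, and F1 score per class, plus overall accuracy
y_pred = text_clf.predict(X_test)
print(classification_report(y_test, y_pred))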
Final Note
This article is just a starting point for NLP. With the knowledge and tools provided here, you can start exploring natural language processing and building your own NLP applications.
Whether you are interested in analyzing social media data, building chatbots, or improving search engines, natural language processing provides a lot of opportunities to explore and innovate.
In an upcoming article, we’ll study a concrete example of NLP, so be sure to follow me if you don’t want to miss it!