Esteban Thilliez

Summary

This article presents an overview of Natural Language Processing (NLP) with Python, exploring its basics, essential techniques, applications such as sentiment analysis, text classification, and topic modeling, and how to fine-tune pre-trained models for better performance, emphasizing the use of libraries like NLTK, spaCy, Gensim, and Hugging Face's Transformers to perform various NLP tasks.

Abstract

The article begins by highlighting the prominence of NLP in data science due to the proliferation of textual data from various sources such as social media and customer reviews. The author emphasizes the increasing relevance of NLP with the advent of technologies like ChatGPT and presents Python as an ideal language for NLP due to its simplicity, ease of use, and powerful libraries such as NLTK, spaCy, and Gensim. The essential NLP processes of tokenization, stemming, lemmatization, and stop word removal are explained, alongside the preprocessing steps critical for achieving meaningful insights from text. The article then delves into practical examples of applying these techniques for text analysis, visualization, frequency distribution, sentiment analysis (using TextBlob), text classification (with Scikit-learn), and topic modeling with Gensim's LDA implementation. To conclude, the author underscores the importance of fine-tuning NLP models to enhance accuracy for specific tasks or datasets, and demonstrates how to do this with Hugging Face's Transformers library for deep learning applications. Model evaluation metrics like accuracy and F1-score, and the use of train-test data splits, are mentioned as measures to ensure the robustness of NLP models. Overall, the article serves both as a guide for practicing data scientists and as inspiration for new projects in NLP.

Opinions

  • The author of the article maintains the opinion that NLP, particularly with the use of Python, has become crucial for data scientists dealing with textual data.
  • There is a positive view of the Python programming language, praising it for its libraries and capabilities that facilitate NLP-related tasks.
  • The author appears to consider visualizations such as word clouds as a valuable step in text data exploration despite their limitations.
  • The sentiment towards preprocessing text data is that it is a critical step which should not be overlooked, as it directly contributes to the quality and accuracy of the analysis results.
  • The sentiment analysis approach with TextBlob is presented in a straightforward manner, suggesting how easily sentiment analysis can be applied.
  • The author presents machine learning-based approaches for both text classification and topic modeling as superior, or more effective, when compared to rule-based methods due to their reliance on data-driven learning methods.
  • There's an apparent preference or endorsement of the BERT model, provided through Hugging Face's Transformers library, for fine-tuning in NLP tasks which require high-level, context-based understanding and accuracy.
  • Throughout the article, there seems to be a recurrent theme of leveraging powerful libraries and frameworks readily available to researchers and professionals, to avoid reinventing the wheel and for the practical advantage of expanding the scope and efficiency of NLP work.

Data Science with Python — Natural Language Processing

The basics of the technology behind ChatGPT

Photo by nadi borodina on Unsplash

This article is part of the “Data Science with Python” series. You can find the other stories of this series below:

Today, with the flood of social media posts, news articles, customer reviews, and more, natural language processing (NLP) has become a crucial tool in the data scientist’s toolkit. This is especially true now that technologies like ChatGPT are available to everyone (ChatGPT is, after all, an application of NLP).

NLP is a subfield of artificial intelligence that focuses on the interaction between computers and human language. It involves using algorithms and statistical models to understand and generate human language. With the help of NLP, we can extract valuable insights from text data, such as sentiment analysis, text classification, and topic modeling.

We’ll see today how to perform NLP with Python.

Basics of NLP

Python has emerged as a popular language for NLP tasks due to its simplicity, ease of use, and the availability of powerful libraries such as Natural Language Toolkit (NLTK), spaCy, and Gensim. These libraries provide a range of tools for performing NLP tasks, such as tokenization, stemming, lemmatization, and stop word removal.

  • Tokenization is the process of breaking down a text into individual words or phrases, which are called tokens. It helps to standardize the input and create a manageable structure for analysis. Tokens are often created by splitting a text on whitespace, punctuation, or other delimiters.
  • Stemming is the process of reducing a word to its base or root form, known as a stem. This is done by removing the suffixes from the end of a word. For example, the word “jumping” might be stemmed to “jump”. The purpose of stemming is to reduce the dimensionality of the data and group together words that have the same root.
  • Lemmatization is similar to stemming in that it reduces words to their base form, but it does so using a dictionary or morphological analysis instead of just removing suffixes. For example, the word “went” might be lemmatized to “go”. The advantage of lemmatization over stemming is that it produces a more meaningful and accurate base form.
  • Stop word removal is the process of removing words that are meaningless on their own (“the”, “and”, “a”, etc…) from a text before analysis. The purpose of this is to reduce noise in the data and focus on the more meaningful words that are important for analysis.
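
To make these concepts concrete, here is a minimal sketch of tokenization, stemming, and lemmatization with NLTK (stop word removal appears in the preprocessing example just below). It assumes the relevant NLTK resources, such as punkt and wordnet, have already been downloaded with nltk.download(), as shown in the note further down.

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats were jumping over the fence"

# tokenization: split the sentence into a list of tokens
tokens = word_tokenize(text)

# stemming: chop suffixes off each token (e.g. "jumping" -> "jump")
stemmer = PorterStemmer()
print([stemmer.stem(token) for token in tokens])

# lemmatization: reduce each token to its dictionary form (e.g. "cats" -> "cat")
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(token) for token in tokens])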

Before performing any NLP task, it’s important to preprocess the text data to remove any noise and make it suitable for analysis. Preprocessing techniques for text data can include converting text to lowercase, removing punctuations, and removing stop words. Once the text data is preprocessed, it can be further processed and analyzed using the appropriate NLP techniques.
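
As a rough sketch, a simple preprocessing function combining these steps might look like this (the exact steps always depend on your data and the task at hand, and the stopwords corpus must first be downloaded with nltk.download('stopwords')):

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess(text):
    # lowercase the text and strip punctuation characters
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    # tokenize and remove English stop words
    return [token for token in word_tokenize(text) if token not in stopwords.words('english')]

print(preprocess("NLP, with Python, is a crucial tool!"))
# ['nlp', 'python', 'crucial', 'tool']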

In Python, the NLTK library provides a range of tools for performing NLP tasks. For example, to tokenize a sentence, we can use the word_tokenize() function as follows:

from nltk.tokenize import word_tokenize

text = "Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and human language."

tokens = word_tokenize(text)
print(tokens)

This will output the list of tokens:

['Natural', 'Language', 'Processing', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', '.']

Note: when using NLTK, you will sometimes have to download additional resources. For example, here is how to install punkt:

import nltk
nltk.download('punkt')

Exploring Text Data

Before performing any NLP task, it’s important to explore the text data to gain insights into its characteristics and properties. This can help in selecting appropriate preprocessing and analysis techniques, as well as identifying potential issues such as data imbalance or bias.

There are several techniques for exploring text data, such as visualizations or frequency distributions.

Visualizing text data can provide a quick overview of its characteristics and patterns. One common visualization technique is the word cloud, which shows the most frequent words in the text data in a visually appealing way. Another technique is to use scatterplots or heatmaps to visualize the relationships between words or phrases.

On the other hand, a frequency distribution shows the number of occurrences of each word or phrase in the text data. This can help in identifying the most common words, as well as outliers or rare words. Frequency distributions can also be visualized using histograms or bar charts.

To visualize text data in Python, we can use libraries such as Matplotlib, Seaborn, or WordCloud. For example, to create a word cloud of the most frequent words in the text data, we can use the WordCloud library as follows:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and human language."

wordcloud = WordCloud(width=800, height=800, background_color='white', stopwords=set()).generate(text)

plt.figure(figsize=(8, 8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Note: if you can’t install wordcloud with pip, it’s probably because there is no release compatible with your Python version, or because you don’t have a C compiler installed.

To create a frequency distribution in Python, we can use the NLTK library’s FreqDist class as follows:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is a subfield of artificial intelligence that focuses on the interaction between computers and human language."

tokens = word_tokenize(text)
fdist = FreqDist(tokens)

print(fdist.most_common(10))
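
This prints the 10 most common tokens with their counts, for example [('Natural', 1), ('Language', 1), ('Processing', 1), ...] for the sentence above (every token appears only once here). FreqDist can also plot the distribution directly, which gives a quick visual overview (this assumes Matplotlib is installed):

fdist.plot(10)  # plot the frequencies of the 10 most common tokens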

Sentiment Analysis

Sentiment analysis is a common NLP task that involves determining the sentiment or emotional tone of a piece of text. Sentiment analysis can be used in various applications such as social media monitoring, customer feedback analysis, and political opinion mining.

There are two main approaches to sentiment analysis: lexicon-based and machine learning-based. Lexicon-based approaches use pre-built sentiment dictionaries to assign sentiment scores to words or phrases in the text data. Machine learning-based approaches, on the other hand, use supervised or unsupervised learning algorithms to train models that can predict the sentiment of text data.

In Python, there are several libraries that provide tools for performing sentiment analysis, such as TextBlob, VADER, and Scikit-learn.

Let’s take a look at how to perform sentiment analysis using TextBlob.

TextBlob is a Python library that provides a simple API for performing common NLP tasks such as sentiment analysis, part-of-speech tagging, and noun phrase extraction. To perform sentiment analysis using TextBlob, we can use its sentiment analysis feature as follows:

from textblob import TextBlob

text = "I really enjoyed this product. It exceeded my expectations."

blob = TextBlob(text)

print(blob.sentiment)

This will output the sentiment polarity and subjectivity scores:

Sentiment(polarity=0.5, subjectivity=0.7)

The sentiment polarity score ranges from -1 (most negative) to 1 (most positive), while the subjectivity score ranges from 0 (most objective) to 1 (most subjective).
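
For comparison, a clearly negative review yields a negative polarity score (the exact value depends on TextBlob's lexicon):

print(TextBlob("This product was a terrible disappointment.").sentiment)
# expect a negative polarity for this sentence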

Text Classification

Text classification is the process of categorizing text data into predefined classes or categories. Text classification can be used in various applications such as spam filtering, sentiment analysis, topic modeling, and language detection.

There are two main approaches to text classification: rule-based and machine learning-based. Rule-based approaches use a set of handcrafted rules to classify text data based on certain patterns or keywords. Machine learning-based approaches, on the other hand, use supervised or unsupervised learning algorithms to train models that can predict the class of text data.

In Python, you can use Scikit-learn, NLTK, and Keras to perform text classification.

Here, I will use Scikit-learn.

Scikit-learn is a popular Python library for machine learning that provides tools for data preprocessing, feature extraction, and model selection. To perform text classification using Scikit-learn, we first need to preprocess the text data and convert it into numerical features using techniques such as bag-of-words or TF-IDF.

Let’s say we have a dataset of movie reviews and we want to classify them into positive or negative sentiment classes. We can use Scikit-learn to preprocess the text data and train a logistic regression model for classification as follows:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

import pandas as pd

df = pd.read_csv('movie_reviews.csv')

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

# create a pipeline for text classification
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression())])

# train the model on the training set
text_clf.fit(X_train, y_train)

# evaluate the model on the testing set
accuracy = text_clf.score(X_test, y_test)
print('Accuracy:', accuracy)

Pipeline is a tool that allows you to chain together several text processing steps into a single object that can be trained on a dataset and used to make predictions on new data.

Here, the pipeline is composed of the following elements:

  1. CountVectorizer: This component transforms the text data into a numerical representation by counting the frequency of each word in the text. It converts the text into a matrix of word counts, which can then be used as input to a machine learning algorithm.
  2. TfidfTransformer: This component applies a technique called term frequency-inverse document frequency (TF-IDF) to the output of the CountVectorizer. This technique adjusts the word counts to reflect the importance of each word in the text relative to the rest of the corpus. It is a way of weighting the importance of each word in the text.
  3. LogisticRegression: This component is a classification algorithm that can be trained on the transformed data to predict the target labels for new text data.
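
Once trained, the same pipeline object can be used to classify new, unseen reviews. For example (the review texts here are made up):

new_reviews = ["This movie was an absolute masterpiece.",
               "I want my two hours back."]
print(text_clf.predict(new_reviews))  # predicted sentiment label for each review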

Topic Modeling

Topic modeling is a technique for discovering latent topics or themes in a collection of text documents. The goal of topic modeling is to identify the underlying topics or concepts that are discussed in the text data without prior knowledge of the topics.

There are several algorithms and methods for topic modeling, including Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Hierarchical Dirichlet Process (HDP).

In Python, the Gensim library provides tools for performing topic modeling using LDA and other algorithms.

To perform topic modeling with Gensim, we first need to preprocess the text data and convert it into a bag-of-words or TF-IDF representation. Then, we can train an LDA model to extract the topics from the text data.

Here’s an example code for performing topic modeling with Gensim:

import gensim
from gensim import corpora

text_data = ['text document 1', 'text document 2', 'text document 3']

# preprocess the text data
processed_data = [doc.lower().split() for doc in text_data]

# create a dictionary of the text data
dictionary = corpora.Dictionary(processed_data)

# create a bag-of-words representation of the text data
corpus = [dictionary.doc2bow(doc) for doc in processed_data]

# train an LDA model on the text data
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=dictionary,
                                            num_topics=5,
                                            passes=10)

# print the topics learned by the LDA model
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
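
Once trained, the model can also infer the topic mixture of a new, unseen document (a small sketch; the document text here is just a placeholder):

# preprocess the new document the same way as the training data
new_doc = 'another text document'.lower().split()
new_bow = dictionary.doc2bow(new_doc)

# list of (topic_id, probability) pairs for the new document
print(lda_model.get_document_topics(new_bow))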

Fine-Tuning

Fine-tuning is the process of further improving the performance of a pre-trained language model on a specific task or domain. Fine-tuning can be useful when the pre-trained model is not specialized enough for the specific task or domain, or when there is a limited amount of labeled data available for training a task-specific model from scratch.

Fine-Tuning with Hugging Face Transformers

Hugging Face Transformers is a popular Python library for natural language processing that provides pre-trained language models and tools for fine-tuning them on specific tasks. With Hugging Face Transformers, you can fine-tune a pre-trained language model on tasks such as sentiment analysis, named entity recognition, and question answering.

Here’s an example code for fine-tuning a pre-trained BERT model on a sentiment analysis task with Hugging Face Transformers:

from transformers import BertForSequenceClassification, BertTokenizer
import torch

# load the pre-trained BERT model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# load the sentiment analysis dataset
train_dataset = ... # load the training dataset
dev_dataset = ... # load the development dataset

# fine tune the BERT model on the sentiment analysis task
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
for epoch in range(3):
    # train the model on the training dataset
    train_loss = ...
    # evaluate the model on the development dataset
    dev_loss = ...
    print('Epoch {}: Train Loss = {:.4f}, Dev Loss = {:.4f}'.format(epoch+1, train_loss, dev_loss))

This code loads a pre-trained BERT model and tokenizer from Hugging Face Transformers, and fine-tunes the model on a sentiment analysis task using an optimizer and training and development datasets.

Note: torch is a Python library used to develop deep-learning models.
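
The training loop itself is elided in the example above. As a rough sketch, assuming a recent version of Transformers and a small in-memory dataset (the texts and labels below are made up), one pass of manual fine-tuning could look like this:

from torch.utils.data import DataLoader, TensorDataset

# hypothetical labeled examples (1 = positive, 0 = negative)
train_texts = ["I really enjoyed this product.", "This was a waste of money."]
train_labels = [1, 0]

# tokenize the texts into input IDs and attention masks
encodings = tokenizer(train_texts, padding=True, truncation=True, return_tensors='pt')
dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

model.train()
for input_ids, attention_mask, labels in loader:
    optimizer.zero_grad()
    # the model returns the classification loss when labels are provided
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    outputs.loss.backward()
    optimizer.step()

In practice, the Trainer class from Transformers can handle this loop (batching, evaluation, checkpointing) for you.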

Evaluating a Model

To evaluate the performance of a model, we should use appropriate metrics for the specific task. For example, for a sentiment analysis task, metrics such as accuracy, precision, recall, and F1 score can be used.

Also, it’s important to evaluate the performance of the model on a held-out test set to ensure that the model generalizes well to unseen data, allowing to check is the model is over-fitted or not.

Final Note

This article is just a starting point for NLP. With the knowledge and tools provided here, you can start exploring natural language processing and building your own NLP applications.

Whether you are interested in analyzing social media data, building chatbots, or improving search engines, natural language processing provides a lot of opportunities to explore and innovate.

In a future article, we’ll study a concrete example of NLP, so be sure to follow me if you don’t want to miss it!

To explore the other stories of this series, click below!

To explore more of my Python stories, click here! You can also access all my content by checking this page.

If you want to be notified every time I publish a new story, subscribe to me via email by clicking here!

If you’re not subscribed to medium yet and wish to support me or get access to all my stories, you can use my link:
