Introduction to Natural Language Processing (NLP) Part II: Text Preprocessing

In our previous introduction to NLP, we defined NLP as a branch of AI that pushes the boundaries of how humans can interact with computers through natural language.

We also talked about the academic foundations of NLP and introduced concepts like Natural Language Understanding (NLU) and Natural Language Generation (NLG).

In this article, we’ll discuss a critical step in translating human language into computer code: Text Preprocessing.

What is Text Preprocessing?

Text preprocessing is preparing and cleaning text data for analysis, involving steps like removing unwanted elements, breaking text into words, and standardizing word forms.

Text preprocessing is the first step in NLP.

Without preprocessing, a computer would interpret the text “house”, “House”, and “House” as distinct words due to case sensitivity and the presence of HTML tags, despite them having a similar meaning.

Text Preprocessing Tasks

Noise Removal

In this case, noise would be characters that don’t contribute to the message.

Punctuation Marks: Characters like commas, periods, exclamation points, and question marks can be considered noise depending on the NLP task.
Special Characters: Symbols like &, %, $, @, or ^ often don’t contribute to the overall meaning and can be removed.
Numerical Values: Numbers may be irrelevant in some text analysis tasks and can be classified as noise.
Email Headers and Footers: In processing email data, headers (like “To:”, “From:”, “Subject:”) and footers or signatures often don’t contribute to the main content.
Whitespace: Excessive spacing, tab spaces, or newline characters can be considered noise.
Emojis and Emoticons: In certain analyses, emojis and emoticons might be irrelevant and need to be removed.
Formatting Codes: Similar to HTML tags, other formatting elements like Markdown or BBCode tags used in forums can be noise.
Stop Words: Commonly used words like “is”, “are”, “and”, “the” are often filtered out in NLP tasks.
Metadata: Information like timestamps, author names, or URLs that may be part of the text but not relevant to the analysis.

Before Noise Removal:

“Welcome to Our Newsletter!!! 🎉 Date: 01/01/2023 To: [email protected] From: [email protected] Subject: Exciting News & Updates! Dear Subscriber, We’re thrilled to share our latest updates & news with you! Our new products are 20% off — visit our website: www.ourcompany.com. Best regards, The Our Company Team Follow us on Twitter: @OurCompany 😊”

After Noise Removal:

“Welcome to Our Newsletter Dear Subscriber, We’re thrilled to share our latest updates news with you Our new products are off visit our website www.ourcompany.com Best regards, The Our Company Team”

Tokenization

Tokenization involves dividing text into smaller pieces, known as tokens, which can be words, numbers, or punctuation marks.

How tokenization is done depends on the requirements of the specific NLP task. Generally, it follows these steps:

Choosing the Unit of Tokenization: Decide whether tokens should be words, characters, or subwords (parts of words). For most tasks, words serve as the primary unit.
Splitting the Text: The text is split into the chosen units. This is often done using simple rules like splitting by spaces and punctuation for word tokens. For instance, the sentence “Hello, world!” would be split into [“Hello”, “,”, “world”, “!”].
Handling Special Cases: Adapt the process for special cases like contractions (e.g., “don’t” might be split into “do” and “n’t”) or complex punctuation patterns.

Before Tokenization:

“Welcome to Our Newsletter Dear Subscriber, We’re thrilled to share our latest updates news with you Our new products are off visit our website www.ourcompany.com Best regards, The Our Company Team”

After Tokenization:

[“Welcome”, “to”, “Our”, “Newsletter”, “Dear”, “Subscriber”, “,”, “We’re”, “thrilled”, “to”, “share”, “our”, “latest”, “updates”, “news”, “with”, “you”, “Our”, “new”, “products”, “are”, “off”, “visit”, “our”, “website”, “www.ourcompany.com", “Best”, “regards”, “,”, “The”, “Our”, “Company”, “Team”]

In the ‘after’ version:

The entire text is broken down into individual tokens.
Each word is separated, including contractions like “We’re” which are kept as single tokens.
Punctuation marks like commas are treated as separate tokens.
The URL is recognized as one token.

Normalization

Normalization refers to the process of converting different variations of words into a standard form. This process includes:

Lowercasing

Converting all characters in the text to lowercase to treat words like “House,” “house,” and “HOUSE” as the same.

Stemming

Reducing words to their root form.

“Writing”, “writes”, “written”, “wrote” → Stemmed to “write”.

“Caring”, “careful”, “carefully”, “cared” → Stemmed to “care”.

“Connection”, “connective”, “connected”, “connecting” → Stemmed to “connect”.

“Running”, “runs”, and “runner” → Stemmed to “run”.

“Happier”, “happiest”, “happiness”, “happily” → Stemmed to “happi”.

In this case, stemming results in “happi”, which isn’t a standard English word but serves as the root for all forms of “happy”. This highlights a limitation of stemming: it can sometimes produce stems that are not actual dictionary words but effectively group-related words.

Stemming is a quick and dirty way to group similar words, but it’s not always perfect because the stemmed word might not be a real word.

Lemmatization

In linguistics, a “lemma” is the canonical, dictionary form of a set of words.

For example, “run,” “runs,” “ran,” and “running” are different forms of the same lemma: “run.” Thus, “lemmatization” refers to the process of reducing different forms of a word to this base or dictionary form, taking into consideration the context of the word in the sentence.

The process ensures that words are analyzed in their simplest form, regardless of their tense, number, or case.

Consider the words “better,” “best,” and “good.”

At first glance, they might seem unrelated. However, in lemmatization, these words are recognized as different forms of the same base word.

In this case, “better” and “best” would be lemmatized to “good.” This is because lemmatization understands the comparative (ex. better, smaller, bigger) and superlative (ex. bests, smallest, biggest) forms of adjectives and relates them to their base or dictionary form.

“Am”, “are”, “is” → Lemmatized to “be”.

“Better”, “best” → Lemmatized to “good”.

“Saw”, “seeing”, “seen” → Lemmatized to “see”.

“Mice”, “mouse” → Lemmatized to “mouse”.

“Feet”, “foot” → Lemmatized to “foot”.

Lemmatization is more accurate than stemming but slower as understanding the context of the word in the sentence is required.

Before Normalization:

[“Welcome”, “to”, “Our”, “Newsletter”, “Dear”, “Subscriber”, “,”, “We’re”, “thrilled”, “to”, “share”, “our”, “latest”, “updates”, “news”, “with”, “you”, “Our”, “new”, “products”, “are”, “off”, “visit”, “our”, “website”, “www.ourcompany.com", “Best”, “regards”, “,”, “The”, “Our”, “Company”, “Team”]

After Normalization:

[“welcome”, “to”, “our”, “newsletter”, “dear”, “subscriber”, “,”, “we”, “be”, “thrill”, “to”, “share”, “our”, “latest”, “update”, “news”, “with”, “you”, “our”, “new”, “product”, “be”, “off”, “visit”, “our”, “website”, “www.ourcompany.com", “best”, “regard”, “,”, “the”, “our”, “company”, “team”]

In the ‘after’ version:

All tokens are converted to lowercase.
Words like “We’re” and “are” are lemmatized to “we” and “be” respectively.
Words such as “thrilled”, “updates”, “regards” are stemmed to their root forms “thrill”, “update”, “regard”.

Conclusion

We’ve now covered the basics of text preprocessing, which is getting text ready for computers to understand.

We looked at important steps like getting rid of unnecessary parts of the text (noise removal), breaking the text into smaller pieces (tokenization), and making the text more uniform (normalization) using methods like changing all letters to lowercase, simplifying words to their basic form (stemming), and converting words to their dictionary form (lemmatization).

Removing noise helps us focus on the important parts of the text. Tokenization is like chopping the text into bite-sized pieces. Normalization is about making these pieces neat and consistent.

It’s important to remember, though, that this process has its challenges. When we deal with everyday language — which includes slang, different ways of speaking in different places (dialects), and the subtle meanings in how we say things — it can be tough for the methods we use today in NLP to fully grasp and handle this variety and depth in language. This shows us we still have plenty of room for improvement.

What do you think?

What did you think about this piece?

Seriously!

Let me know in the content section what you think and what you wish to see more of.

If you enjoyed this article, found it helpful, or think it might benefit others, give it some 👏 claps.

Your claps will help more people discover this piece and expand their understanding of NLP.

If you want to stay updated with more insights and discussions about AI, NLP, and Machine Learning, don’t forget to follow me.

I love this stuff!

(And was like…into it before it became a thing.)

Thank you for reading, and I look forward to responding to seeing your name in my notification box!