Getting Started with Natural Language Processing NLP for Beginners
Natural Language Processing (NLP) is a subfield of Artificial Intelligence that analyzes human language, taking text as input. The entire text dataset used as input is called the corpus. For example, we can count how many times a word appears in the corpus; this count is called the term frequency. NLP is not supposed to be easy, but let’s try to simplify it for beginners. Follow us for more beginner-friendly articles like this. Updated October 2022.
Liberal arts and humanities graduates may not think programming, AI, and machine learning are for them. NLP is actually very interdisciplinary: it draws on analytical, writing, and research skills in linguistics, social science, English language and literature, and philosophy (representation, morality, transparency, justice, and so on). It’s a great field in AI where everyone can shine and contribute.
“Hi there! It’s good to see you. I just wanted to say hi.” # This text is the corpus. The term frequency of ‘hi’ is 2 if our analysis is case insensitive (‘Hi’ equals ‘hi’), because the word then appears twice in the corpus. If the analysis is case sensitive, the term frequency of ‘Hi’ is 1, and the term frequency of ‘hi’ is also 1.
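The counting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the helper name term_frequency and its deliberately simple tokenizer are our own assumptions.

```python
import re
from collections import Counter

corpus = "Hi there! It's good to see you. I just wanted to say hi."

def term_frequency(text, case_sensitive=False):
    """Count how many times each word appears in the text."""
    if not case_sensitive:
        text = text.lower()
    # A deliberately simple tokenizer: runs of letters and apostrophes
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)

tf_insensitive = term_frequency(corpus)
tf_sensitive = term_frequency(corpus, case_sensitive=True)

print(tf_insensitive["hi"])                    # 2
print(tf_sensitive["Hi"], tf_sensitive["hi"])  # 1 1
```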

We will elaborate on term frequency later.
Practical tip: Sometimes it is important to be case sensitive. For example, Trump may refer to Donald Trump, while trump is a verb often used in card games to describe one card outranking another. When case doesn’t matter, a common preprocessing (data cleaning) technique is to change all text in the corpus to lower case.
Lowering the case:
lower_case_corpus = corpus.lower()
The function .lower() is a Python string method. For example “Hello there!” will become “hello there!”. Treating upper and lower case as equivalent often makes the text easier to search and process, and results in a smaller, more robust vocabulary. It’s nice that the method handles punctuation and other characters that have no lower case gracefully: it returns them unchanged, with no error.
Common NLP tasks: sentence segmentation, tokenization, stop word removal, part-of-speech tagging, named entity recognition.
Bag of Words — a common, introductory model for Natural Language Processing NLP
Codecademy.com explains bag-of-words model: “A bag-of-words model totals the frequencies of each word in a document, with each unique word being its own feature and its frequency being the value.”
Pro tip: BoW or bow stands for bag-of-words in NLP context.
If you haven’t studied machine learning, the word feature may make no sense. A few tricks may help. We can imagine the output of a bag-of-words model as a Python dictionary (hashmap) of key-value pairs, or as an Excel sheet. The features are the keys in the dictionary or the column headers in the Excel sheet. Features are meaningful representations of the data. A machine learning model learns from features and predicts outcomes called labels.
For example useful features of Person data — information that describes people — may include: height, gender, name, government issued ID number etc.
Pro Tip: what is the feature dimension, that is, the number of features? It equals the size of the vocabulary found in the corpus.
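To make the dictionary picture concrete, here is a minimal pure-Python sketch of a bag of words. The function name bag_of_words and the short sample document are our own illustrative choices; the keys play the role of features and the counts are the values.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Map each unique word (feature) to its count (value)."""
    words = re.findall(r"[a-z0-9]+", document.lower())
    return Counter(words)

document = "Learn more about NLP. Learn more about Machine Learning today!"
bow_dict = bag_of_words(document)

features = sorted(bow_dict)        # the keys, like column headers in a spreadsheet
feature_dimension = len(features)  # equals the vocabulary size

print(dict(bow_dict))
print(features)
print(feature_dimension)  # 7
```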
corpus = ["You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"]
# if we pass a plain string instead (corpus = "...") we receive an error:
# ValueError: Iterable over raw text documents expected, string object received.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bow = count_vect.fit_transform(corpus)
bow.shape # (1, 22)
count_vect.get_feature_names_out() # scikit-learn 1.0 and later; older versions use get_feature_names()
# ['about', 'aka', 'are', 'by', 'language', 'learn', 'learning', 'like', 'machine', 'more', 'natural', 'nlp', 'processing', 'reading', 'talking', 'to', 'today', 'tutorial', 'uniqtech', 'we', 'would', 'you']
bow.toarray()
# array([[2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]])
Pro tip: what does CountVectorizer do per the Sklearn documentation? “Convert a collection of text documents to a matrix of token counts”. It returns a sparse matrix (scipy.sparse.csr_matrix). Just an FYI. Don’t think too hard about it now.
The feature names are returned by count_vect.get_feature_names_out() and bow.toarray() gives us the frequency of the corresponding features. For example, the first word ‘about’ appears twice in the corpus so its frequency is 2. The last word ‘you’ also appears twice. Note that the one-letter word ‘a’ is missing from the features: by default CountVectorizer lowercases the text and only keeps tokens of two or more characters.
How is it useful? This common model is surprisingly powerful. There is some criticism on the internet of the author of a popular pop culture novel: the claim is that the author is not sophisticated because the books use only a limited English vocabulary, lack variation, and overuse certain simple, plain adjectives and nouns with limited descriptive power. Apparently people have found that the author uses some simple, non-descriptive words too often, such as love. How did people know the author uses love a lot? Word count, or word frequency, of course! If we read through a word frequency analysis of the book (which we purposely don’t name here), we indeed have to scroll down quite far to see a more complex word such as murmur.
Some argue, however, that precisely because the author uses an easy-to-read colloquial style, the series has gained wide readership and popularity. Surprisingly, this simple model, word frequency, is quite insightful and already generates a good discussion.
Another use of word frequency is plotting a word cloud, a simple visualization of word frequency. The more frequently a word is used, the bigger it appears in this NLP visualization.
More on bag of words
Stop word removal, or not: Not all words in the corpus are considered important enough to be features. Some, such as “a”, “the”, and “and”, are called stop words, and are sometimes removed from the feature dataset to improve machine learning model results. The word “the” appeared nearly 5000 times in the book, but it does not mean anything in particular, so it’s okay to remove it from our dataset. Whether to remove stop words depends on our use case and domain needs.
Data preprocessing steps like stop word removal and stemming are standard techniques in NLP.
In the bag of words model, grammar does not matter so much, nor does word order.
Pro Tip: the bag-of-words result is often stored in a variable called bow, which can be confusing because you may be thinking of bow and arrow, but it is an abbreviation of bag of words!
Bag of words translates to making a histogram of word counts for each class label; normalizing a histogram gives a probability distribution of words given that class. One may then use a Bayesian model to calculate the posterior probability of the class label (for example, 0 or 1) given that a word or several words are seen.
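As a rough sketch of that idea, the snippet below builds one word histogram per class label and applies naive Bayes with add-one smoothing. The four training sentences are made up for illustration, and the function name posterior is our own; a real project would more likely reach for a library implementation.

```python
import math
from collections import Counter

# Tiny hand-made training set (illustrative only): label 1 = positive, 0 = negative
training_data = [
    ("i like this product", 1),
    ("really good match", 1),
    ("not a good match", 0),
    ("i would not buy this", 0),
]

# One word histogram (bag of words) per class label
histograms = {0: Counter(), 1: Counter()}
label_counts = Counter()
for text, label in training_data:
    histograms[label].update(text.split())
    label_counts[label] += 1

vocabulary = set(histograms[0]) | set(histograms[1])

def posterior(words):
    """Return P(label | words) for each label, using naive Bayes with add-one smoothing."""
    log_scores = {}
    for label, histogram in histograms.items():
        total_words = sum(histogram.values())
        # log prior: P(label)
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in words:
            # log likelihood P(word | label), smoothed so unseen words do not give zero
            score += math.log((histogram[word] + 1) / (total_words + len(vocabulary)))
        log_scores[label] = score
    # normalize so the probabilities sum to 1
    normalizer = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / normalizer for label, s in log_scores.items()}

probabilities = posterior(["not"])
print(probabilities)  # label 0 (negative) gets roughly 0.73
```

Seeing the single word "not" already shifts the posterior toward the negative label, because "not" only appears in the negative histogram.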
Sample natural language processing workflow and NLP pipeline:
Data cleaning pipeline for text data
- cleaning (regular expressions)
- sentence splitting
- change to lower case
- stopword removal (most frequent words in a language)
- stemming (demo: Porter stemmer)
- POS tagging (part of speech; demo)
- noun chunking
- NER (named entity recognition; demo: OpenCalais)
- deep parsing: try to “understand” the text
Note there’s a subtle yet important difference between stemming and lemmatization. Stemming cuts off the suffix of a word, leaving only its beginning; the resulting stem may not be a meaningful word. Lemmatization instead maps a word to its dictionary form (its lemma).
We can think of some of these tasks as feature selection or feature engineering, including identifying language features such as nouns, verbs, and entities.
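The first few steps of the pipeline above (cleaning, sentence splitting, lower casing, stop word removal) can be sketched with the standard library alone. The stop word list here is a tiny hand-picked set and the sentence splitter is naive; both are assumptions for illustration, not what a production library does.

```python
import re

# A tiny hand-picked stop word list, for illustration only
STOP_WORDS = {"a", "about", "are", "by", "to", "we", "you"}

def split_sentences(text):
    """Naive sentence splitting on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def clean_sentence(sentence):
    """Lowercase, strip non-alphanumeric characters, tokenize, drop stop words."""
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z0-9\s]+", "", sentence)
    return [token for token in sentence.split() if token not in STOP_WORDS]

text = "You are reading a tutorial by Uniqtech. We are talking about NLP!"
pipeline_output = [clean_sentence(s) for s in split_sentences(text)]
print(pipeline_output)
# [['reading', 'tutorial', 'uniqtech'], ['talking', 'nlp']]
```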
Important Natural Language Processing Concepts
Stop Words Removal / Stopwords Removal
Stop words are words that may not carry valuable information. An example stopword is “the”.
In some cases stop words matter. For example researchers found that stop words are useful in identifying negative reviews or recommendations. People use sentences such as “This is not what I want.” “This may not be a good match.” People may use stop words more in negative reviews. Researchers found this out by keeping the stop words and achieving better prediction results.
While it is common practice to remove stop words and only return clean text, removing stop words does not always give better prediction results. For example, “not” is considered a stop word in some NLP libraries, but “not” is a very significant word in negative reviews or recommendations in sentiment analysis. For example, if a customer states “I would not buy this product again, and would not accept any refund. Really not a good match at all.”, the word “not” is a strong signal that this review is negative. A positive review may sound, well, positive! “I really like the product! I enjoyed it very much. Not what I expected at all.” In this case, the negative review uses the word “not” three times as often as the positive one.
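We can check that count with a short script. The helper count_word is our own name, and the tokenizer is a simple regular expression, so treat this as a sketch rather than a full sentiment analysis.

```python
import re
from collections import Counter

negative_review = ("I would not buy this product again, and would not accept "
                   "any refund. Really not a good match at all.")
positive_review = ("I really like the product! I enjoyed it very much. "
                   "Not what I expected at all.")

def count_word(text, word):
    """Case-insensitive count of one word in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)[word]

not_in_negative = count_word(negative_review, "not")
not_in_positive = count_word(positive_review, "not")
print(not_in_negative, not_in_positive)  # 3 1
```

If "not" were removed as a stop word, both counts would drop to zero and this signal would be lost.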
Removing punctuation may also yield better results in some situations.
Stop word removal can also make the input dataset smaller, which makes it easier and faster to process.
It is possible to use a pre-existing library to identify known, common stop words. It is also possible to add your own stop words to the stop word list. Most modern NLP libraries support this functionality.
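Since a stop word list is just a collection of words, customizing it can be as simple as set operations. In this sketch the base list and the domain-specific additions are hypothetical; a real library ships a much longer list.

```python
# A small starting list; real libraries ship much longer ones
base_stop_words = {"a", "an", "and", "the", "is", "to", "not"}

# Domain-specific additions (hypothetical): words that appear in nearly every review
custom_stop_words = base_stop_words | {"product", "review"}

# For sentiment analysis we may want to keep "not", so take it out of the list
custom_stop_words.discard("not")

tokens = "this product is not a good match".split()
filtered_tokens = [t for t in tokens if t not in custom_stop_words]
print(filtered_tokens)  # ['this', 'not', 'good', 'match']
```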
NLP Techniques — Removing punctuations with Regex
Punctuation is not always useful in predicting the meaning of a text, and is often removed along with stop words. What does removing punctuation mean? It means keeping only the alphanumeric characters. Regex programming lessons can fill books! Just use the nifty function below for short texts. For longer texts that require more processing power, use generators to iterate through each line of text and keep only alphanumeric characters. For big data, use parallel processing to process multiple lines of text at once.
This process of removing punctuation (and sometimes numbers) is sometimes called pruning.
Regex removes punctuation
# import Python's built-in regular expression module
import re
corpus = "You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"
# keep only letters and digits; everything else is removed
no_punctuation_no_space = re.sub(r"[^a-zA-Z0-9]+", "", corpus)
no_punctuation_no_space
# 'YouarereadingatutorialbyUniqtechWearetalkingaboutNaturalLanguageProcessingakaNLPWouldyouliketolearnmoreLearnmoreaboutMachineLearningtoday'
# note the space is also removed
# adding \s inside [^...] means whitespace is NOT matched, so spaces are kept
no_punctuation = re.sub(r"[^a-zA-Z0-9\s]+", "", corpus)
no_punctuation
# returns 'You are reading a tutorial by Uniqtech We are talking about Natural Language Processing aka NLP Would you like to learn more Learn more about Machine Learning today'
Go ahead, just use the above method and avoid reinventing the wheel.
Pro Tip: Python also has a built-in alphanumeric checker, the string method .isalnum(). There is also .isalpha(), which returns True only for alphabetic characters; a digit will not evaluate to True.
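For longer texts, the line-by-line generator approach mentioned earlier can be combined with str.isalnum(). The function name clean_lines is our own, and the list of lines stands in for something like an open file object; this is a sketch under those assumptions.

```python
def clean_lines(lines):
    """Lazily yield each line with only alphanumeric characters and spaces kept."""
    for line in lines:
        yield "".join(ch for ch in line if ch.isalnum() or ch.isspace()).strip()

# Any iterable of strings works here, for example an open text file
lines = ["Would you like to learn more?\n", "Learn more about Machine Learning today!\n"]
cleaned = list(clean_lines(lines))
print(cleaned)
# ['Would you like to learn more', 'Learn more about Machine Learning today']

print("abc123".isalnum(), "abc123".isalpha(), "abc".isalpha())  # True False True
```

One design difference from the regex above: str.isalnum() also accepts non-English letters and digits, while [a-zA-Z0-9] keeps only ASCII characters.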
There are always hackers coming up with fancy regex code! It keeps getting fancier.
from nltk.tokenize import RegexpTokenizer # a regex tokenizer
tokenizer = RegexpTokenizer(r'\w+') # each token is a run of one or more word characters (letters, digits, underscore),
# effectively removing all punctuation
tokens = tokenizer.tokenize(corpus)
Tokenization
Tokenization: breaking text into tokens. Example: breaking sentences into words, or grouping words depending on the scenario. There are also the n-gram model and the skip-gram model.
Basic tokenization is 1-gram. An n-gram (multi-gram) is useful when a phrase yields a better result than a single word. For example, for “I do not like Banana.” the 1-grams are: I, do, not, like, banana. A 3-gram model may yield a better result: I do not, do not like, not like banana. (If the end of the sentence is padded, we also get shorter trailing groups such as like banana _pad_.)
n-gram: n is the number of words we want in each token. Frequently, n = 1.
Did you know that Google digitized many books and analyzed the literature using n-grams (see the Google Books Ngram Viewer)? Nice work Google!
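Generating n-grams from a list of tokens takes only a sliding window. The function name ngrams below is our own; this minimal sketch does not pad the sentence ends.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I do not like banana".split()
unigrams = ngrams(tokens, 1)
trigrams = ngrams(tokens, 3)
print(unigrams)  # [('I',), ('do',), ('not',), ('like',), ('banana',)]
print(trigrams)  # [('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'banana')]
```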
Lemmatization
Lemmatization: transforming words into their root forms. Example: economics, micro-economics, macro-economists, economists, economist, economy, economical, economic forum can all be traced back to the root econ, which can indicate that the text or article is largely about economics, finance, or economic issues. (Strictly speaking, a truncated root like econ is closer to what a stemmer produces; a lemmatizer returns dictionary forms such as economy or economist.) Useful in situations such as topic labeling. Common tools in NLTK: WordNetLemmatizer (lemmatization) and PorterStemmer (stemming). Two key terms to pay attention to are prefix and suffix.
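To see the difference between cutting off suffixes and looking up dictionary forms, here is a toy sketch. The suffix list is a crude rule of our own (it is NOT the Porter algorithm), and the lemma dictionary is a tiny hand-written stand-in for the full lexicon a real lemmatizer uses.

```python
def toy_stem(word):
    """Crude suffix stripping (not the Porter algorithm): cut off a few common endings."""
    for suffix in ("ists", "ist", "ics", "ical", "ic", "y", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# A tiny hand-written lemma dictionary (hypothetical; real lemmatizers use a full lexicon)
TOY_LEMMAS = {"economists": "economist", "economies": "economy"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

words = ["economics", "economists", "economist", "economy", "economical", "economic"]
stems = [toy_stem(w) for w in words]
print(stems)                        # every word is cut down to 'econom'
print(toy_lemmatize("economists"))  # economist
```

Note that the stem 'econom' is not a real word, while the lemma 'economist' is.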
Sentence Tagging
Sentence tagging is like the part of speech exercises your grammar teacher made you do in high school.

Part of Speech (PoS) tagging can help us understand words, meanings, context and much more. We can also use PoS tags to identify particular tokens. For example, if we want to remove all verbs in a document, we can use PoS tags to identify the verbs. We can potentially remove all the verbs in a named entity recognition task: because we only need the nouns anyway, we can save a lot of space and improve compute speed by removing the unnecessary data, the verbs.
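Once tokens carry PoS tags, filtering by tag is a one-line list comprehension. In this sketch the tags are assigned by hand in the common Penn Treebank style; in practice a tagger from an NLP library would produce them.

```python
# Hand-assigned (token, tag) pairs; a real PoS tagger would produce these.
# Penn Treebank style tags: NN/NNP = nouns, VBZ/VBG = verbs, DT = determiner, IN = preposition
tagged_tokens = [
    ("Uniqtech", "NNP"),
    ("is", "VBZ"),
    ("writing", "VBG"),
    ("a", "DT"),
    ("tutorial", "NN"),
    ("about", "IN"),
    ("NLP", "NNP"),
]

def drop_verbs(tagged):
    """Remove every token whose tag starts with 'VB' (the verb tags)."""
    return [(token, tag) for token, tag in tagged if not tag.startswith("VB")]

def keep_nouns(tagged):
    """Keep only the tokens whose tag starts with 'NN' (the noun tags)."""
    return [token for token, tag in tagged if tag.startswith("NN")]

without_verbs = drop_verbs(tagged_tokens)
nouns = keep_nouns(tagged_tokens)
print(nouns)  # ['Uniqtech', 'tutorial', 'NLP']
```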
Sections Coming soon…
- Information Retrieval Basics : Term Frequency Inverse Document Frequency TFIDF
One hot encoding
Previously the most popular way to encode categorical data.
The output of a one hot encoder has as many rows as there are data samples and as many columns as there are unique values. In natural language processing (NLP) with bag of words, the number of columns can be the number of unique words, often called the vocabulary.
One hot encoding assumes the labels are mutually exclusive. For example, a sample cannot be both a cat and a dog; it has to be one or the other. A familiar example is that a coin is either heads or tails (in a non-quantum world). There is no overlap between categories. The result is a sparse matrix: each row has all entries as 0, except for a 1 in the column corresponding to the correct label. For example, if cat is in the zeroth position, its encoding is [1, 0, 0, 0, …], with as many entries as there are unique values to encode. In the cat, dog, bird, horse example, the encoding for cat is [1,0,0,0] and for dog is [0,1,0,0]. The dot product of two rows with different labels will always be 0. The dot product of two rows with the same label will always be 1.
“one hot encoding each row should only have 1 label while the others are 0”
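The cat, dog, bird, horse example can be reproduced in a few lines. The helper names one_hot and dot are our own; this is a plain-Python sketch rather than a library encoder.

```python
labels = ["cat", "dog", "bird", "horse"]
label_to_index = {label: index for index, label in enumerate(labels)}

def one_hot(label):
    """Return a list of zeros with a single 1 at the label's position."""
    vector = [0] * len(labels)
    vector[label_to_index[label]] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat, dog = one_hot("cat"), one_hot("dog")
print(cat)            # [1, 0, 0, 0]
print(dog)            # [0, 1, 0, 0]
print(dot(cat, dog))  # 0
print(dot(cat, cat))  # 1
```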
Encoding data using one hot encoding can turn one original column into many new columns, which takes up more space, enlarges the search space, increases computing time and cost, and requires more training data.
See embedding section to understand how big the one hot encoding matrix can get.
Embedding
Currently the most popular way to encode categorical data, especially in the field of natural language processing (NLP).
Word embedding: “words or phrases from the vocabulary are mapped to vectors of real numbers” (wikipedia) for example [23, 55, 72]
“Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base method, and explicit representation in terms of the context in which words appear.”
To understand how word embedding conserves space, especially when the NLP vocabulary is large, think about the one hot encoding scenario above: each row has all zeroes except for one column entry, resulting in large sparse matrices.
Imagine we have a vocabulary of 3: [“Ginger Cat”, “Russian Blue Cat”, “Black Cat”]. To represent it in one hot encoding:
[[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]
To represent the same vocabulary using word embedding, a single slot is enough: [1], [2], [3]. In fact, each vector element in a word embedding can account for anywhere from 2 to 10, 16, or more values, depending on the number system we use. For example, in binary each slot can be either 0 or 1, so it accounts for 2 values. In decimal, each slot can be a digit 0 through 9, so it accommodates 10 values. (Real embeddings use real numbers in each slot, so a slot can take far more values still.)
If we have two slots [__, __] in binary, they can account for 2x2=4 values. In the decimal number system, they can account for 10x10=100 values!
In contrast, a one hot encoded matrix always has as many columns as the size of the vocabulary. Each unique token takes up one column. There are as many rows as there are data samples. Most of the elements will be zero.
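The counting argument above is simple arithmetic, sketched below. The function names are our own, and this only illustrates the capacity comparison; real word embeddings are learned real-valued vectors, not digit codes.

```python
def distinct_values(base, slots):
    """How many different items `slots` positions can tell apart if each holds `base` values."""
    return base ** slots

def slots_needed(vocabulary_size, base):
    """Smallest number of slots that can distinguish every item in the vocabulary."""
    slots = 1
    while base ** slots < vocabulary_size:
        slots += 1
    return slots

print(distinct_values(2, 2))    # 4
print(distinct_values(10, 2))   # 100

# One hot encoding needs one column per vocabulary item,
# while a dense code needs far fewer slots.
vocabulary_size = 10000
print(vocabulary_size)                    # one hot columns: 10000
print(slots_needed(vocabulary_size, 10))  # decimal slots: 4
```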
NLP Use Cases
- Sentiment analysis of tweets and Amazon reviews: classifying whether a short text is positive or negative.
- Writing style analysis: authors’ favorite vocabulary choices, singers’ lyrics style. For example, style analysis has identified JK Rowling as the author of a book even though she used a male pen name, after passionate readers analyzed the text and found parallels and similarities in the writing styles.
- Entity tagging: find organizations or people’s names in articles
- Text summarization: summarize main points of news articles
Getting Started with NLP Now!
You can use the Python nltk library to analyze texts. It’s a popular and powerful library. It includes lists of stop words in several languages.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # requires a one-time nltk.download('stopwords')
# tokens is assumed to be a list of lowercase word strings, for example the output of a tokenizer
clean_tokens = [token for token in tokens if token not in stop_words] # important pattern for removing stop words iteratively
# source: Towards Data Science, Emma Grimaldi, How Machines understand our language
Sklearn conveniently has a built-in text dataset for you to experiment with (for example, the 20 newsgroups dataset, loaded with sklearn.datasets.fetch_20newsgroups)! These news articles can be classified into different topics. Sklearn provides cleaned training data for this classification task.
Glossary
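- SOS: start of sentence
- EOS: end of sentence
- padding: filler added to make sequences the same length, usually index 0
- word2index: dictionary mapping each word to its integer index
- index2word: dictionary mapping each index back to its word
- word2count: dictionary mapping each word to its number of occurrences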
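These glossary terms often appear together in a small vocabulary helper class. The sketch below is one possible layout, not a standard API: the class name Vocabulary, the encode method, and the index choices for SOS and EOS are our own assumptions (padding as 0 follows the glossary).

```python
PAD_TOKEN, SOS_TOKEN, EOS_TOKEN = 0, 1, 2  # SOS/EOS indices are an assumption; padding is usually 0

class Vocabulary:
    def __init__(self):
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_TOKEN: "PAD", SOS_TOKEN: "SOS", EOS_TOKEN: "EOS"}
        self.n_words = 3  # the three special tokens are already counted

    def add_sentence(self, sentence):
        for word in sentence.split():
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

    def encode(self, sentence, length):
        """Turn a sentence into indices: SOS, the words, EOS, then padding up to `length`."""
        indices = [SOS_TOKEN] + [self.word2index[w] for w in sentence.split()] + [EOS_TOKEN]
        return indices + [PAD_TOKEN] * (length - len(indices))

vocab = Vocabulary()
vocab.add_sentence("learn more about nlp")
vocab.add_sentence("learn more today")
encoded = vocab.encode("learn more today", length=8)
print(encoded)  # [1, 3, 4, 7, 2, 0, 0, 0]
```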
Additional NLP Tools
- Python word cloud using the wordcloud library
Further Reading
- Sklearn documentation for CountVectorizer, a useful function to calculate term frequency
- A long list of Natural Language Processing (NLP) and Machine Learning papers
- Term Frequency Inverse Document Frequency blog post
- Towards Data Science, Emma Grimaldi: How Machines understand our language: an introduction to Natural Language processing
- Book: Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning
- Word model, letter model
- Using NLP to identify and redact spoilers
- A Primer on Neural Network Models for Natural Language Processing, a white paper on NLP by Y Goldberg
- Deep Learning textbook by Ian Goodfellow
- Amazing HTML textbook on NLP by Stanford
- What is Gensim? Related flash cards on our website: What’s Gensim [public] https://ml.learn-to-code.co/skillView.html?skill=Auu5TaqxaXK43gYAC06J






