Getting Started with Natural Language Processing NLP for Beginners

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that analyzes human language and takes text as input. The entire input text dataset is called the corpus. For example, we can count how many times a word appears in the corpus; this count is called the term frequency. NLP is not supposed to be easy, but let’s try to simplify it for beginners. Follow us for more beginner friendly articles like this. Updated October 2022.

Liberal arts and humanities graduates may not think programming, AI and machine learning are for them. NLP is actually very interdisciplinary: it requires analytical, writing, and research skills in linguistics, social science, English language and literature, and the philosophy of representation, morality, transparency, justice, etc. It’s a great field in AI where everyone can shine and contribute.

“Hi there! It’s good to see you. I just wanted to say hi.” # The sentence is the corpus. If our analysis is case insensitive (‘Hi’ equals ‘hi’), the term frequency of ‘hi’ is 2, because it appears twice in the corpus. If it is case sensitive, the term frequency (TF) of ‘Hi’ is one, and the TF of ‘hi’ is also one.
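
As a rough sketch of the example above, term frequency can be counted with Python's standard library. The helper name term_frequency and the simple regex tokenizer are assumptions made for illustration, not part of any NLP library.

```python
import re
from collections import Counter

corpus = "Hi there! It's good to see you. I just wanted to say hi."

def term_frequency(text, case_sensitive=False):
    """Count how often each word appears in the text (hypothetical helper)."""
    if not case_sensitive:
        text = text.lower()
    # Very simple tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[A-Za-z']+", text)
    return Counter(tokens)

tf_insensitive = term_frequency(corpus)
tf_sensitive = term_frequency(corpus, case_sensitive=True)

print(tf_insensitive["hi"])                    # case insensitive count of 'hi'
print(tf_sensitive["Hi"], tf_sensitive["hi"])  # case sensitive counts
```

A Counter behaves like a dictionary, so a word that never appears simply has a count of 0.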

We will elaborate on term frequency later.

Practical tip: Sometimes it is important to be case sensitive. For example, Trump may refer to Donald Trump, while trump is a verb often used in card games to describe one card outranking another. When case doesn’t matter, a common preprocessing and data cleaning technique is to change all text in the corpus to lower case: lower_case_corpus = corpus.lower(). The function .lower() is a Python string method. For example, “Hello there!” becomes “hello there!”. Treating upper and lower case as equivalent often makes the text easier to search and process, and results in a smaller, more robust vocabulary. Conveniently, the method handles punctuation and other characters that have no lower case gracefully: it returns them unchanged, with no error.

Common natural language processing (NLP) tasks: sentence segmentation, tokenization, stop word removal, part-of-speech tagging, named entity recognition.

Bag of Words — a common, introductory model for Natural Language Processing NLP

Codecademy.com explains the bag-of-words model: “A bag-of-words model totals the frequencies of each word in a document, with each unique word being its own feature and its frequency being the value.”

Pro tip: BoW or bow stands for bag-of-words in NLP context.

If you haven’t studied Machine Learning, the word feature may make no sense. A few tricks may help. We can imagine the output of a bag-of-words model as a Python dictionary (hashmap) of key-value pairs, or as an Excel sheet. The features are the keys in the dictionary or the column headers in the Excel sheet. Features are meaningful representations of the data. A machine learning model learns from features and predicts outcomes called labels.

For example, useful features of Person data (information that describes people) may include: height, gender, name, government issued ID number, etc.

Pro Tip: what is the feature dimension, that is, the number of features? It equals the size of the vocabulary found in the corpus.
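
To make the dictionary and Excel sheet analogy concrete, here is a minimal sketch using only the standard library. The two toy documents are made up for illustration. Each row plays the role of one spreadsheet row, and each vocabulary word is a column header (a feature).

```python
from collections import Counter

# Two hypothetical toy documents.
docs = ["the cat sat", "the dog sat on the cat"]
token_lists = [doc.split() for doc in docs]

# The vocabulary (sorted unique words) defines the feature columns.
vocabulary = sorted({token for tokens in token_lists for token in tokens})

# One row of counts per document, one column per vocabulary word.
rows = []
for tokens in token_lists:
    counts = Counter(tokens)          # dictionary view: word -> frequency
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
print("feature dimension:", len(vocabulary))
```

The feature dimension printed at the end is simply the vocabulary size, which matches the Pro Tip above.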

corpus = ["You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"]
# CountVectorizer expects a list (iterable) of documents.
# If we pass a bare string instead, corpus = "...", we receive the error
# ValueError: Iterable over raw text documents expected, string object received.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bow = count_vect.fit_transform(corpus)
bow.shape  # (1, 22)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
count_vect.get_feature_names_out()
# array(['about', 'aka', 'are', 'by', 'language', 'learn', 'learning', 'like', 'machine', 'more', 'natural', 'nlp', 'processing', 'reading', 'talking', 'to', 'today', 'tutorial', 'uniqtech', 'we', 'would', 'you'], dtype=object)

bow.toarray()
# array([[2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]])

Pro tip: what does CountVectorizer do? Per the Sklearn documentation, it will “Convert a collection of text documents to a matrix of token counts”, and it returns a sparse matrix of type scipy.sparse.csr_matrix. Just an FYI. Don’t think too hard about it now.

The feature names are returned by count_vect.get_feature_names_out() (called get_feature_names() in older scikit-learn versions), and bow.toarray() gives us the frequency of the corresponding features. For example, the first word ‘about’ appears twice in the corpus, so its frequency is 2. The last word ‘you’ also appears twice.

How is it useful? This common model is surprisingly powerful. A popular pop culture novel has drawn criticism on the internet: the claim is that its author is not sophisticated because the books use only a limited English vocabulary, lack variation, and overuse certain simple, plain adjectives and nouns with limited descriptive power. Apparently people have found that the author uses some simple, non-descriptive words too often, such as love. How did people know the author uses love a lot? Word count, or word frequency, of course! If we read through the Word Frequency Analysis of the book (which we didn’t name here on purpose), we indeed have to scroll down quite far to see a complex word such as murmur.

Some argue, however, that the series has gained wide readership and popularity precisely because the author uses an easy-to-read colloquial style. Surprisingly, this simple model, word frequency, is quite insightful and already generates a good discussion.

Another use of word frequency is plotting a word cloud, a simple visualization of word frequency. If a word is used frequently, it appears bigger in this NLP visualization.

More on bag of words

Stop word removal, or not: Not all words in the corpus are considered important enough to be features. Some, such as a, the, and and, are called stop words, and are sometimes removed from the feature dataset to improve machine learning model results. The word the appeared nearly 5000 times in the book, but it does not mean anything in particular, so it’s okay to remove it from our dataset. Whether to remove stop words depends on our use case and domain needs.

Data preprocessing steps such as stop word removal and stemming are standard techniques in NLP.

In the bag of words model, grammar does not matter so much, nor does word order.

Pro Tip: the bag-of-words model instance is often stored in a variable called bow, which can be confusing because you may be thinking of bow and arrow, but it is the abbreviation for bag of words!

Bag of words translates to making a histogram (which, once normalized, is a probability distribution) of the words given each class label. One may then use a Bayesian model to calculate the posterior probability of the class label, for example being 0 or 1, given that one or more words are seen.
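
The idea above can be sketched as a tiny naive Bayes classifier written with the standard library only. The training sentences, the label meaning (1 = negative review, 0 = positive review) and the add-one smoothing are assumptions chosen for illustration; real projects would normally use a library implementation.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, label), label 1 = negative review.
train = [
    ("not good not happy", 1),
    ("do not buy", 1),
    ("really good product", 0),
    ("happy with product", 0),
]

# One word histogram per class label, plus a count of documents per label.
histograms = {0: Counter(), 1: Counter()}
label_counts = Counter()
for text, label in train:
    histograms[label].update(text.split())
    label_counts[label] += 1

vocabulary = {word for hist in histograms.values() for word in hist}

def posterior(text):
    """Return P(label | words) for each label via Bayes' rule (add-one smoothing)."""
    log_scores = {}
    for label, hist in histograms.items():
        total = sum(hist.values())
        score = math.log(label_counts[label] / len(train))  # prior P(label)
        for word in text.split():
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((hist[word] + 1) / (total + len(vocabulary)))
        log_scores[label] = score
    # Normalize the scores so the posteriors sum to 1.
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {label: value / norm for label, value in unnormalized.items()}

probs = posterior("not good")
print(probs)
```

With this toy data the word not pushes the posterior toward label 1, which is the behaviour the histogram-plus-Bayes description predicts.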

Sample natural language processing workflow and NLP pipeline:

Data cleaning pipeline for text data

cleaning (regular expressions)
sentence splitting
change to lower case
stop word removal (most frequent words in a language)
stemming (demo: Porter stemmer)
POS (part-of-speech) tagging (demo)
noun chunking
NER (named entity recognition) (demo: OpenCalais)
deep parsing: try to “understand” the text.
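
The first few steps of the pipeline above can be sketched with the standard library. The function name preprocess and the tiny stop word set are hypothetical. Note that this sketch splits sentences before stripping punctuation, because the sentence splitter relies on the end-of-sentence marks.

```python
import re

# Tiny hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "by", "we", "you"}

def preprocess(text):
    """Split into sentences, clean, lower-case and drop stop words."""
    # Sentence splitting: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        # Cleaning: keep only letters, digits and whitespace, then lower-case.
        cleaned = re.sub(r"[^a-zA-Z0-9\s]+", "", sentence).lower()
        # Stop word removal on whitespace-separated tokens.
        tokens = [tok for tok in cleaned.split() if tok not in STOP_WORDS]
        result.append(tokens)
    return result

result = preprocess("You are reading a tutorial. We are talking about NLP!")
print(result)
```

Later steps such as stemming, POS tagging and NER usually rely on a library and are not covered by this sketch.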

Note there’s a subtle yet important difference between stemming and lemmatization. Stemming cuts off the suffix of a word, leaving only the beginning of the word, so the resulting stem may not be a meaningful word. Lemmatization instead maps a word to its dictionary form.

We can think of some of these tasks as feature selection or feature engineering tasks, including identifying language features such as nouns, verbs, and entities.
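
To see why a stem may not be a meaningful word, here is a deliberately crude suffix stripper. It is a toy written for this illustration, not the Porter algorithm, and the suffix list is an assumption.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ("ing", "ies", "es", "ed", "ly", "s")

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["studies", "running", "cats", "is"]:
    print(word, "->", crude_stem(word))
```

Stems such as stud and runn are not dictionary words, which is exactly the trade-off described above; a lemmatizer would aim for study and run instead.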

Important Natural Language Processing Concepts

Stop Words Removal / Stopwords Removal

Stop words are words that may not carry valuable information. An example stopword is “the”.

In some cases stop words matter. For example, researchers found that stop words are useful in identifying negative reviews or recommendations. People use sentences such as “This is not what I want.” or “This may not be a good match.” People may use stop words more in negative reviews. Researchers found this out by keeping the stop words and achieving better prediction results.

While it is common practice to remove stop words and return only the cleaned text, removing stop words does not always give better prediction results. For example, not is considered a stop word in some NLP libraries, but not is a very significant word in negative reviews or recommendations in sentiment analysis. If a customer states “I would not buy this product again, and would not accept any refund. Really not a good match at all.”, the word “not” is a strong signal that this review is negative. A positive review may sound, well, positive! “I really like the product! I enjoyed it very much. Not what I expected at all.” In this case, the negative review uses the word “not” three times, versus once in the positive review.

Removing punctuation may also yield better results in some situations.

Stop word removal can also make the input dataset smaller, making it easier and faster to process.

Removing stop words helps shorten the corpus, making the dataset smaller and easier to compute.

It is possible to use a pre-existing library to identify known, common stop words. It is also possible to add your own stop words to the stop word list. Most modern NLP libraries support this functionality.
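
As a sketch of customizing a stop word list, the snippet below treats a small hand-written set as a stand-in for a library-provided list (that stand-in set is an assumption). It adds a domain-specific stop word and deliberately keeps not, since not matters for sentiment as discussed above.

```python
# Stand-in for a library-provided stop word list (hypothetical, deliberately tiny).
base_stop_words = {"the", "a", "is", "not", "this", "i", "what"}

# Add our own stop word ("product") and keep "not" by removing it from the list.
custom_stop_words = (base_stop_words | {"product"}) - {"not"}

tokens = "this is not the product i want".split()
kept = [token for token in tokens if token not in custom_stop_words]
print(kept)
```

Using a set rather than a list keeps each membership test fast, which matters when filtering a large corpus.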

NLP Techniques — Removing punctuations with Regex

Punctuation is not always useful in predicting the meaning of texts. Often it is removed along with stop words. What does removing punctuation mean? It means keeping only the alphanumeric characters. Regex programming lessons can fill books! Just use the nifty snippet below for short texts. For longer texts that require more processing power, use generators to iterate through each line of text and keep only alphanumeric characters. For big data, use parallel processing to process multiple lines of text at once.

This process of removing punctuation (and sometimes numbers) is called pruning. Note that the regex below keeps digits.

Regex removes punctuation

# import Python's built-in regular expression module
import re

corpus = "You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"

# [^a-zA-Z0-9] matches every character that is NOT a letter or a digit
no_punct_no_space = re.sub(r"[^a-zA-Z0-9]+", "", corpus)
no_punct_no_space
# 'YouarereadingatutorialbyUniqtechWearetalkingaboutNaturalLanguageProcessingakaNLPWouldyouliketolearnmoreLearnmoreaboutMachineLearningtoday'
# note the spaces are also removed

# adding \s inside the negated set [^...] means whitespace is NOT matched, so spaces are kept
# (a new variable is used so that the original corpus is not overwritten)
no_punct = re.sub(r"[^a-zA-Z0-9\s]+", "", corpus)
no_punct
# returns 'You are reading a tutorial by Uniqtech We are talking about Natural Language Processing aka NLP Would you like to learn more Learn more about Machine Learning today'

Go ahead, just use the above method and avoid reinventing the wheel.
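
For longer texts, the same regex can be applied lazily, one line at a time, with a generator. This is a minimal sketch; the function name clean_lines is an assumption, and the in-memory list stands in for a large file.

```python
import re

# Compile once and reuse for every line.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")

def clean_lines(lines):
    """Yield each line with punctuation removed, without loading everything at once."""
    for line in lines:
        yield NON_ALNUM.sub("", line)

# Any iterable of strings works, including an open file object.
lines = ["Hello, world!\n", "NLP is fun; isn't it?\n"]
cleaned = list(clean_lines(lines))
print(cleaned)
```

Because a file object is itself an iterable of lines, passing an open file to clean_lines processes it line by line instead of reading the whole file into memory.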

Pro Tip: Python strings also have a built-in alphanumeric checker method, .isalnum(). Another method, .isalpha(), returns True only for alphabetic characters; a number will not evaluate to True.
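
A short sketch of the two string methods in action, plus a regex-free way to drop punctuation. The helper name keep_alnum_and_spaces is an assumption for illustration.

```python
print("abc123".isalnum())  # letters and digits only
print("abc123".isalpha())  # contains digits, so not purely alphabetic
print("abc!".isalnum())    # the exclamation mark is not alphanumeric

def keep_alnum_and_spaces(text):
    """Keep characters that are alphanumeric or whitespace, drop the rest."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

cleaned = keep_alnum_and_spaces("Hello, world! 123")
print(cleaned)
```

One difference from the ASCII-only regex shown earlier: .isalnum() also returns True for non-English letters, so accented characters are kept.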

There are always hackers coming up with fancy regex code! It keeps getting fancier.

from nltk.tokenize import RegexpTokenizer  # a regex-based tokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(corpus)

# \w+ matches runs of one or more letters, digits or underscores,
# effectively removing all punctuation from the tokens

Tokenization

Tokenization: breaking texts into tokens. Example: breaking sentences into words or, depending on the scenario, into groups of words. There’s also the n-gram model and the skip-gram model.

Basic tokenization is 1-gram (unigram). An n-gram, or multi-gram, is useful when a phrase yields better results than a single word. For example, for “I do not like Banana.” the 1-gram tokens are I, do, not, like, banana. A 3-gram model may yield better results: I do not, do not like, not like banana. If the sentence is padded at the end, tokens such as like banana _space_ also appear.

n-gram: n is the number of words we want in each token. Frequently, n = 1.
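
The n-gram idea can be sketched in a few lines of plain Python. The function name ngrams is an assumption here, and no padding is added at the sentence boundaries.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I do not like banana".split()
unigrams = ngrams(tokens, 1)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(trigrams)
```

A sentence with k words yields k - n + 1 n-grams, so the five-word example gives three trigrams.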

Did you know that Google digitized many books and built an n-gram dataset from them to analyze literature? Nice work, Google!

Lemmatization

Lemmatization: transforming words into their base (dictionary) forms. Example: economics, micro-economics, macro-economists, economists, economist, economy, economical, economic forum can all be traced back to the shared root econ, which can mean this text or article is largely about economics, finance or economic issues. Strictly speaking, a lemmatizer returns dictionary forms (economists becomes economist), while cutting words down to a shared stem is stemming. Useful in situations such as topic labeling. Common tools: WordNetLemmatizer (lemmatization) and PorterStemmer (stemming), both in NLTK. Two key terms to pay attention to are prefix and suffix.

Sentence Tagging

Sentence tagging is like the part of speech exercises your grammar teacher made you do in high school. Here’s an illustration of that:

[Figure: part-of-speech tagging illustration, also called a sentence structure graph. Source: todo insert citation]

Part of Speech (PoS) tagging can help us understand words, meanings, context and much more. We can also use PoS tags to identify particular tokens. For example, if we want to remove all verbs in a doc, we can use PoS tags to identify the verbs. We can potentially remove all the verbs in a named entity recognition task: because we only need the nouns anyway, we can save a lot of space and improve compute speed by removing unnecessary data, the verbs.
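
Here is a minimal sketch of filtering tokens by PoS tag. The (word, tag) pairs are tagged by hand for illustration, using Penn Treebank style tags (NN* for nouns, VB* for verbs), which is the tag format returned by taggers such as NLTK's pos_tag.

```python
# Hand-tagged example sentence: "Google digitized many books".
tagged = [
    ("Google", "NNP"),     # proper noun
    ("digitized", "VBD"),  # verb, past tense
    ("many", "JJ"),        # adjective
    ("books", "NNS"),      # plural noun
]

# Drop every verb: all Penn Treebank verb tags start with "VB".
without_verbs = [word for word, tag in tagged if not tag.startswith("VB")]

# Keep only the nouns: all noun tags start with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(without_verbs)
print(nouns)
```

Filtering on the tag prefix covers every verb or noun variant (VBD, VBZ, NNS, NNP and so on) without listing each tag.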

One hot encoding

Previously the most popular method for encoding categorical data.

The output of a one hot encoder has as many rows as there are data samples and as many columns as there are unique values. In natural language processing (NLP) and bag of words, the number of columns can be the number of unique words, often called the vocabulary.

One hot encoding assumes the labels are mutually exclusive. For example, a sample cannot be both a cat and a dog; it has to be one or the other. A familiar example is that a coin is either heads or tails (in a non-quantum world). There is no overlap of categories. The result is a sparse matrix: each row has all entries as 0, except for a 1 in the column of the correct label. For example, if cat is the zeroth position, its encoding is [1,0,0,0…], with as many entries as there are unique values to encode. In the cat, dog, bird, horse example, the encoding for cat is [1,0,0,0] and for dog is [0,1,0,0]. The dot product of two rows with different labels will always be zero. The dot product of two rows with the same label will always be 1.

“In one hot encoding, each row should have only one label set to 1 while the others are 0.”
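
The cat, dog, bird, horse example can be checked with a few lines of plain Python. The helper names one_hot and dot are assumptions made for this sketch.

```python
labels = ["cat", "dog", "bird", "horse"]
index = {label: i for i, label in enumerate(labels)}

def one_hot(label):
    """Return a vector of zeroes with a single 1 at the label's position."""
    vector = [0] * len(labels)
    vector[index[label]] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat, dog = one_hot("cat"), one_hot("dog")
print(cat, dog)
print(dot(cat, dog))  # different labels
print(dot(cat, cat))  # same label
```

The two dot products confirm the property stated above: rows with different labels never overlap, and a row always matches itself.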

Encoding data using one hot encoding can turn one original column into many new columns, which takes up more space and increases the search space, computing time and cost, and the amount of training data required.

See our paid subscriber-only blog for One Hot Encoding in Pandas and an example of CountVectorizer, which is a related concept in NLP: https://data-bootcamp.blogspot.com/2020/06/numeric-representation-of-data-numeric.html. Note you must accept the Blogger invite to access the private blog.

See the Embedding section to understand how big the one hot encoding matrix can get.

Embedding

Currently the most popular method for encoding categorical data, especially in the field of natural language processing (NLP).

Word embedding: “words or phrases from the vocabulary are mapped to vectors of real numbers” (Wikipedia), for example [23, 55, 72].

“Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base method, and explicit representation in terms of the context in which words appear.”

To understand how word embedding conserves space, especially when the NLP vocabulary is large, think about the one hot encoding scenario above: each row has all zeroes except for one column entry, resulting in large sparse matrices.

Imagine we have a vocabulary of 3: [“Ginger Cat”, “Russian Blue Cat”, “Black Cat”]. To represent it in one hot encoding:

[[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]

Then, to represent that using word embedding, we could use [1], [2], [3]. In fact, each vector element can account for anywhere from 2 to 10, 16 or more values, depending on the number system we use. For example, in binary, each slot can be either 0 or 1, so it accounts for 2 values. In decimal, each slot can be a number 0 through 9, so it accommodates 10 values. Real word embeddings store real numbers in each slot, so a short vector can distinguish a very large vocabulary.

If we have two slots [__, __] in binary, they can account for 2x2=4 values. In the decimal number system, they can account for 10x10=100 values!

In contrast, a one hot encoded matrix always has as many columns as the size of the vocabulary. Each unique token takes up one column. There are as many rows as there are data samples. Most of the elements will be zero.
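
To make the contrast concrete, here is a small sketch with the three-cat vocabulary. The 2-dimensional embedding values are made up for illustration (in practice they are learned), and the helper names are assumptions. It also shows that looking up an embedding row gives the same result as multiplying the one hot vector by the embedding table.

```python
vocabulary = ["Ginger Cat", "Russian Blue Cat", "Black Cat"]
word2index = {word: i for i, word in enumerate(vocabulary)}

# Hypothetical 2-dimensional embedding table, one row per vocabulary entry.
embedding_table = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.1, 0.3],
]

def one_hot(word):
    """Sparse representation: as long as the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[word2index[word]] = 1
    return vector

def embed(word):
    """Dense representation: a direct row lookup in the table."""
    return embedding_table[word2index[word]]

def vector_times_table(vector, table):
    """Multiply a row vector by the table (plain-Python matrix product)."""
    columns = len(table[0])
    return [sum(vector[r] * table[r][c] for r in range(len(table)))
            for c in range(columns)]

word = "Black Cat"
print(one_hot(word), "length", len(one_hot(word)))
print(embed(word), "length", len(embed(word)))
print(vector_times_table(one_hot(word), embedding_table))
```

With three words the saving is small, but the one hot length grows with the vocabulary while the embedding length stays fixed at whatever dimension we choose.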

NLP Use Cases

Sentiment analysis of tweets and Amazon reviews: classifying whether a short text is positive or negative.
Writing style analysis: authors’ favorite vocabulary choices, singers’ lyrics style. For example, style analysis identified JK Rowling as the author of a book even though she used a male pen name, after passionate readers analyzed the text and found parallels and similarities in style.
Entity tagging: find organizations or people’s names in articles
Text summarization: summarize main points of news articles

Getting Started with NLP Now!

You can use the Python nltk library to analyze texts. It’s a popular and powerful library. It includes lists of stop words in several languages.

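import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word lists
stop_words = set(stopwords.words('english'))
tokens = "this is not what i want".split()
clean_tokens = [token for token in tokens if token not in stop_words]

# important pattern for removing stop words iteratively
# source: Towards Data Science, Emma Grimaldi, How Machines understand our language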

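Sklearn conveniently has a built-in text dataset for you to experiment with: the 20 newsgroups dataset. These news articles can be classified into different topics. Sklearn provides cleaned training data for this classification task.

Link: http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html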

Glossary

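SOS: start of sentence
EOS: end of sentence
padding: filler token, usually 0
word2index: dictionary mapping each word to its integer index
index2word: dictionary mapping each integer index back to its word
word2count: dictionary mapping each word to its frequency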
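
The glossary terms fit together as sketched below. The specific index values (padding 0, SOS 1, EOS 2), the placeholder strings and the helper names add_sentence and encode are assumptions for illustration; tutorials and libraries choose their own conventions.

```python
# Assumed special token indices; only "padding is usually 0" comes from the glossary.
PAD, SOS, EOS = 0, 1, 2

word2index = {}
index2word = {PAD: "<pad>", SOS: "<sos>", EOS: "<eos>"}
word2count = {}

def add_sentence(sentence):
    """Register every word of the sentence in the three dictionaries."""
    for word in sentence.lower().split():
        if word not in word2index:
            new_index = len(index2word)  # next free integer index
            word2index[word] = new_index
            index2word[new_index] = word
            word2count[word] = 0
        word2count[word] += 1

def encode(sentence, max_len):
    """Turn a sentence into indices: SOS, words, EOS, then padding up to max_len."""
    ids = [SOS] + [word2index[w] for w in sentence.lower().split()] + [EOS]
    return ids + [PAD] * (max_len - len(ids))

add_sentence("hi there")
add_sentence("hi again")
encoded = encode("hi again", 6)
print(word2index)
print(word2count)
print(encoded)
```

Padding every encoded sentence to the same length lets sentences of different sizes be stacked into one rectangular batch.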

Additional NLP Tools

Python word cloud using the wordcloud library

Further Reading

Sklearn documentation for CountVectorizer, a useful function to calculate term frequency: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
A long list of Natural Language Processing (NLP) and Machine Learning papers: http://www.siliconvanity.com/2018/07/list-of-natural-language-processing-nlp.html
Term Frequency Inverse Document Frequency blog post: http://www.siliconvanity.com/2018/03/tfidf-time-frequency-inverse-document.html
Towards Data Science, Emma Grimaldi, How Machines understand our language: an introduction to Natural Language processing
Book: Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning
Word model, letter model
Using NLP to identify and redact spoilers. A Primer on Neural Network Models for Natural Language Processing, a white paper on NLP by Y Goldberg
Deep Learning textbook by Ian Goodfellow
Amazing HTML textbook on NLP by Stanford: https://nlp.stanford.edu/IR-book/html/htmledition/contents-1.html
What is Gensim? Related flash cards on our website: What’s Gensim [public] https://ml.learn-to-code.co/skillView.html?skill=Auu5TaqxaXK43gYAC06J