Abhay Parashar

Summary

The provided web content offers a concise introduction to the basics of Natural Language Processing (NLP), covering key concepts and techniques such as tokenization, stopwords, stemming, lemmatization, WordNet, part of speech tagging, and bag of words.

Abstract

The article "Basics Of Natural Language Processing in 10 Minutes" is designed to quickly familiarize readers with the foundational aspects of NLP. It begins by outlining the necessary Python environment setup, including the installation of Python, an IDE like Jupyter Notebook, and essential libraries such as NLTK. The author then delves into various NLP techniques: breaking down text into tokens, filtering out irrelevant words known as stopwords, simplifying words to their root form through stemming and lemmatization, utilizing WordNet for understanding word relationships, tagging parts of speech, and transforming text into numerical vectors with the bag of words model. The article aims to equip readers with the knowledge to process and analyze text data effectively, emphasizing the importance of these techniques in handling the vast amounts of textual information available.

Opinions

  • The author believes that understanding NLP is crucial for enabling computers to communicate with humans in their own language.
  • There is an emphasis on the practical application of NLP techniques, suggesting their importance in real-world text analysis tasks.
  • The article suggests that stemming is best applied to single words and may not always result in a meaningful word, whereas lemmatization is more sophisticated and returns dictionary forms of words.
  • The author expresses that the bag of words model is a key step in text processing, converting cleaned and tokenized text into numerical vectors that can be used for machine learning.
  • By providing code examples and outputs, the author conveys a hands-on approach to learning NLP, encouraging readers to actively engage with the material.
  • The inclusion of a "Future Readings" section indicates the author's view that continuous learning is essential, and they provide resources for further study in related fields like OpenCV and Google search optimization.
  • The author's suggestion to follow them on Medium, connect on LinkedIn, become a Medium member, or subscribe to their email list reflects a desire to build a community and share knowledge beyond the scope of the article.

Basics Of Natural Language Processing in 10 Minutes

Photo by Alex Knight on Unsplash

Hello, there

You are here because you also want to learn natural language processing as quickly as possible, like me.

Let’s start

The first thing we need is to install some dependencies

1. Python >3.7

2. Download an IDE or install Jupyter Notebook

To install Jupyter Notebook, open your terminal (cmd) and type pip install notebook. After that, type jupyter notebook to run it. The notebook then opens in your browser at http://127.0.0.1:8888/ (the full URL printed in the terminal includes an access token).

3. Install packages

pip install nltk

NLTK: It is a Python library that can be used to perform many NLP tasks (stemming, lemmatization, etc.)

In this blog, we are going to learn about

  1. Tokenization
  2. Stopwords
  3. Stemming
  4. Lemmatizing
  5. WordNet
  6. Part of speech tagging
  7. Bag of Words

Before learning anything let’s first understand NLP.

Natural language refers to the way we humans communicate with each other, and processing basically means converting the data into an understandable form. So we can say that NLP (Natural Language Processing) is a way of helping computers communicate with humans in their own language.

It is one of the broadest fields of research because there is a huge amount of data out there, and a large share of that data is text. With so much data available, we need techniques through which we can process it and retrieve useful information from it.

Now that we have an understanding of what NLP is, let’s go through each topic one by one.

1. Tokenization

Tokenization is the process of dividing the whole text into tokens.

It is mainly of two types:

  • Word Tokenizer (separated by words)
  • Sentence Tokenizer (separated by sentence)

import nltk
nltk.download('punkt')  ## Tokenizer data, needed once (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk.tokenize import sent_tokenize, word_tokenize
example_text = "Hello there, how are you doing today? The weather is great today. The sky is blue. python is awsome"
print(sent_tokenize(example_text))
print(word_tokenize(example_text))

In the above code, we first import nltk and then import our tokenizers, sent_tokenize and word_tokenize, from the library nltk.tokenize. To use a tokenizer on a text, we just pass the text as an argument to the tokenizer.

The output will look something like this

##sent_tokenize (Separated by sentence)
['Hello there, how are you doing today?', 'The weather is great today.', 'The sky is blue.', 'python is awsome']
##word_tokenize (Separated by words)
['Hello', 'there', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'today', '.', 'The', 'sky', 'is', 'blue', '.', 'python', 'is', 'awsome']
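
To get a feel for what a tokenizer does under the hood, here is a minimal sketch that uses only Python's built-in re module. The function names simple_sent_tokenize and simple_word_tokenize are made up for this illustration, and the rules are deliberately naive (for example, they would wrongly split after an abbreviation such as "Mr."). NLTK's tokenizers are considerably more careful.

```python
import re

def simple_sent_tokenize(text):
    # Split wherever '.', '!' or '?' is followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

example_text = "Hello there, how are you doing today? The weather is great today."
print(simple_sent_tokenize(example_text))
print(simple_word_tokenize(example_text))
```

The sentence splitter keeps the closing punctuation with its sentence, and the word splitter returns punctuation marks as separate tokens, which mirrors the shape of the NLTK output shown above.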

2. Stopwords

In general, stopwords are the words in a language that do not add much meaning to a sentence. In NLP, stopwords are words that are not important for analyzing the data. Examples: he, she, the, and, etc. Our main task is to remove all the stopwords from the text before doing any further processing.

NLTK's English stopword list contains 179 words (the exact number can differ between NLTK versions), and we can print all of them. We just need to import stopwords from the library nltk.corpus. If the corpus is not yet on your machine, download it once with nltk.download('stopwords').

from nltk.corpus import stopwords
print(stopwords.words('english'))
######################
######OUTPUT##########
######################
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

To remove stopwords from a particular text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))  ## Build the set once for fast lookups
text = 'he is a good boy. he is very good in coding'
text = word_tokenize(text)
text_with_no_stopwords = [word for word in text if word not in stop_words]
text_with_no_stopwords
##########OUTPUT##########
['good', 'boy', '.', 'good', 'coding']
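
One detail worth knowing: the NLTK stopword list is all lowercase, so a capitalized token such as 'He' would slip through a plain membership test. The sketch below shows one possible way to handle that. It uses a tiny hand-picked stopword set and a helper called remove_stopwords, both of which are assumptions made for this illustration only. In practice you would pass set(stopwords.words('english')) instead.

```python
# A tiny hand-picked stopword set, used here only for illustration.
STOP_WORDS = {'he', 'is', 'a', 'very', 'in'}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that 'He' is treated like 'he',
    # and keep only alphabetic tokens so punctuation is dropped too.
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

tokens = ['He', 'is', 'a', 'good', 'boy', '.',
          'He', 'is', 'very', 'good', 'in', 'coding']
print(remove_stopwords(tokens))
```

Whether to drop punctuation at this stage is a design choice. It is done here because a bag-of-words model usually has no use for it.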

3. Stemming

Stemming is the process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes. In simple words, stemming removes plural endings and other inflections from a word. Example: loved → love, learning → learn

In Python, we can implement stemming using PorterStemmer. We can import it from the library nltk.stem.

One thing to remember about stemming is that it works best with single words.

from nltk.stem import PorterStemmer
ps = PorterStemmer()    ## Creating an object for porterstemmer
example_words = ['earn',"earning","earned","earns"]  ##Example words
for w in example_words:
    print(ps.stem(w))    ##Using ps object stemming the word
##########OUTPUT##########
earn
earn
earn
earn

Here we can see that earning, earned and earns are all stemmed to their root word, earn.
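
To make the idea of suffix stripping concrete, here is a toy stemmer written in plain Python. It is not the Porter algorithm, which applies a much longer series of rules. The function naive_stem and its suffix list are assumptions made purely for illustration.

```python
def naive_stem(word):
    # Try the suffixes in order, so 'es' is checked before the shorter 's'.
    for suffix in ('ing', 'ed', 'es', 's'):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['earn', 'earning', 'earned', 'earns', 'changes']:
    print(w, '->', naive_stem(w))
```

The last word comes out as chang, which is not a real English word. That is typical of stemming in general: it chops characters by rule and does not check the result against a dictionary.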

4. Lemmatizing

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. In simple words, lemmatization does the same work as stemming. The difference is that lemmatization returns a meaningful word.

Example:

  • Stemming: history → histori
  • Lemmatizing: history → history

It is mostly used when designing chatbots, Q&A bots, text prediction, etc.

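from nltk.stem import WordNetLemmatizer, PorterStemmer  ## Needs nltk.download('wordnet') once
lemmatizer = WordNetLemmatizer() ## Create object for lemmatizer
ps = PorterStemmer()             ## Stemmer, for comparison
example_words = ['history','formality','changes']
print('----Lemmatizer-----')
for w in example_words:
    print(lemmatizer.lemmatize(w))
print('-----Stemming------')
for w in example_words:
    print(ps.stem(w))
#########OUTPUT############
----Lemmatizer-----
history
formality
change
-----Stemming------
histori
formal
chang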
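
The key difference from stemming is that a lemmatizer looks words up instead of chopping them, and the answer can depend on the part of speech. The sketch below imitates that with a made-up miniature dictionary. LEMMAS and toy_lemmatize are hypothetical names for this illustration and are not part of NLTK. NLTK's WordNetLemmatizer.lemmatize accepts a similar optional pos argument, which defaults to noun.

```python
# A made-up miniature lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ('changes', 'n'): 'change',
    ('running', 'v'): 'run',
    ('better', 'a'): 'good',
}

def toy_lemmatize(word, pos='n'):
    # Fall back to the word itself when the dictionary has no entry.
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize('changes'))       # noun lookup
print(toy_lemmatize('better', 'a'))   # adjective lookup
print(toy_lemmatize('better'))        # no noun entry, so the word is unchanged
```

This is why passing the right part of speech matters: the same word can map to different lemmas, or to none, depending on how it is used.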

5. WordNet

WordNet is a lexical database, i.e. a dictionary, for the English language, specifically designed for natural language processing. We can use WordNet to find synonyms and antonyms.

In Python, we can import wordnet from nltk.corpus. Code for finding the synonyms and antonyms of a given word:

from nltk.corpus import wordnet
synonyms = []   ## Creating an empty list for all the synonyms
antonyms = []   ## Creating an empty list for all the antonyms
for syn in wordnet.synsets("happy"): ## Every synset (sense) of the given word
    for i in syn.lemmas():        ## Every lemma in that synset
        synonyms.append(i.name())  ## Appending all the synonyms
        if i.antonyms():
            antonyms.append(i.antonyms()[0].name()) ## First antonym, if any
print(set(synonyms)) ## Converting them into a set for unique values
print(set(antonyms))
#########OUTPUT##########
{'felicitous', 'well-chosen', 'happy', 'glad'}
{'unhappy'}

6. Part of Speech Tagging

Part-of-speech tagging is the process of converting a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

Part of Speech Tag List

 CC coordinating conjunction
 CD cardinal number
 DT determiner
 EX existential there (like: “there is” … think of it like “there”)
 FW foreign word
 IN preposition/subordinating conjunction
 JJ adjective ‘big’
 JJR adjective, comparative ‘bigger’
 JJS adjective, superlative ‘biggest’
 LS list marker 1)
 MD modal could, will
 NN noun, singular ‘desk’
 NNS noun plural ‘desks’
 NNP proper noun, singular ‘Harrison’
 NNPS proper noun, plural ‘Americans’
 PDT predeterminer ‘all the kids’
 POS possessive ending parent’s
 PRP personal pronoun I, he, she
 PRP$ possessive pronoun my, his, hers
 RB adverb very, silently,
 RBR adverb, comparative better
 RBS adverb, superlative best
 RP particle give up
 TO to go ‘to’ the store.
 UH interjection errrrrrrrm
 VB verb, base form take
 VBD verb, past tense took
 VBG verb, gerund/present participle taking
 VBN verb, past participle taken
 VBP verb, sing. present, non-3rd person take
 VBZ verb, 3rd person sing. present takes
 WDT wh-determiner which
 WP wh-pronoun who, what
 WP$ possessive wh-pronoun whose
 WRB wh-adverb where, when

In Python, we can do POS tagging using nltk.pos_tag.

import nltk
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')  ## Tagger model, needed once
sample_text = '''
An sincerity so extremity he additions. Her yet there truth merit. Mrs all projecting favourable now unpleasing. Son law garden chatty temper. Oh children provided to mr elegance marriage strongly. Off can admiration prosperous now devonshire diminution law.
'''
words = word_tokenize(sample_text)
print(nltk.pos_tag(words))
################OUTPUT############
[('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'), ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.'), ('Her', 'PRP$'), ('yet', 'RB'), ('there', 'EX'), ('truth', 'NN'), ('merit', 'NN'), ('.', '.'), ('Mrs', 'NNP'), ('all', 'DT'), ('projecting', 'VBG'), ('favourable', 'JJ'), ('now', 'RB'), ('unpleasing', 'VBG'), ('.', '.'), ('Son', 'NNP'), ('law', 'NN'), ('garden', 'NN'), ('chatty', 'JJ'), ('temper', 'NN'), ('.', '.'), ('Oh', 'UH'), ('children', 'NNS'), ('provided', 'VBD'), ('to', 'TO'), ('mr', 'VB'), ('elegance', 'NN'), ('marriage', 'NN'), ('strongly', 'RB'), ('.', '.'), ('Off', 'CC'), ('can', 'MD'), ('admiration', 'VB'), ('prosperous', 'JJ'), ('now', 'RB'), ('devonshire', 'VBP'), ('diminution', 'NN'), ('law', 'NN'), ('.', '.')]
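
Once a text is tagged, a common next step is to pick out words of one kind. Here is a small sketch in plain Python that does this by tag prefix. The helper name words_with_tag_prefix is made up for this example, and the tagged list is a handful of pairs copied from the output above.

```python
# A few (word, tag) pairs copied from the tagger output above.
tagged = [('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'),
          ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.'),
          ('Oh', 'UH'), ('children', 'NNS'), ('provided', 'VBD')]

def words_with_tag_prefix(tagged_words, prefix):
    # 'NN' matches NN, NNS, NNP and NNPS; 'VB' matches every verb tag.
    return [word for word, tag in tagged_words if tag.startswith(prefix)]

print(words_with_tag_prefix(tagged, 'NN'))  # nouns
print(words_with_tag_prefix(tagged, 'VB'))  # verbs
```

Matching on the prefix works because related tags in the list above share their first two letters.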

7. Bag of words

So far we have learned about tokenizing, stemming, and lemmatizing. All of these are part of text cleaning. After cleaning the text, we need to convert it into a numerical representation called vectors, so that we can feed the data to a machine learning model for further processing.

To convert the data into vectors, we make use of some predefined libraries in Python.

Let’s see how vector representation works

sent1 = he is a good boy
sent2 = she is a good girl
sent3 = boy and girl are good
        |
        |
  After removal of stopwords, lemmatization or stemming
sent1 = good boy
sent2 = good girl
sent3 = boy girl good
        | ### Now we calculate the frequency of each word by
        |     counting its occurrences across all sentences
word  frequency
good     3
boy      2
girl     2
         | ## Then we assign 0 or 1 to each word
         |    according to its occurrence in the sentence
         | ## 1 for present and 0 for not present
         f1  f2   f3
        girl good boy
sent1    0    1    1
sent2    1    1    0
sent3    1    1    1
### After this we pass the vector form to a machine learning model

The above process can be done using CountVectorizer in Python. We can import it from sklearn.feature_extraction.text.

Code to implement CountVectorizer in Python:

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
sent = pd.DataFrame(['he is a good boy', 'she is a good girl', 'boy and girl are good'],columns=['text'])
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  ## Build the stopword set once
corpus = []
for sentence in sent['text']:
    words = word_tokenize(sentence)
    texts = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    text = ' '.join(texts)
    corpus.append(text)
print(corpus)   #### Cleaned Data
cv = CountVectorizer() ## Creating Object for CountVectorizer
X = cv.fit_transform(corpus).toarray()
X  ## Vectorized form, columns in alphabetical order: boy, girl, good
############OUTPUT##############
['good boy', 'good girl', 'boy girl good']
array([[1, 0, 1],
       [0, 1, 1],
       [1, 1, 1]], dtype=int64)
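
To see what happens inside, the same vectors can be built by hand in a few lines of plain Python. This is only a sketch of the idea and not how scikit-learn implements it. The helper to_count_vector is a made-up name, and splitting on spaces is a simplification (by default, CountVectorizer also lowercases the text and ignores single-character tokens).

```python
from collections import Counter

corpus = ['good boy', 'good girl', 'boy girl good']

# Vocabulary: every distinct word in the corpus, sorted alphabetically.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc, vocab):
    # Count each word in the document, then read the counts in vocabulary order.
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc, vocabulary) for doc in corpus]
print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one entry per vocabulary word, so every sentence becomes a vector of the same length no matter how many words it contains. The entries are counts, so a word that appears twice in a sentence would get a 2 instead of a 1.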

Congratulations, now you know the basics of NLP.

Like👋 & Don’t Forget to share your views below 👇

Future Readings

Thanks for reading this far. If you like my content and want to support me, the best way is to:

  1. Follow me on Medium.
  2. Connect with me on LinkedIn.
  3. Become a Medium member for the cost of one pizza using my referral link. A small part of your membership fee will go to me.
  4. Subscribe to my email list to never miss an article from me.