Angela and Kezhan Shi


Exploring Natural Language Processing with NLTK

A Practical Guide to Tokenization, Regular Expressions, and Semantic Analysis for ESILV NLP Assignment 1

First of all, install NLTK 3.0, downloadable for free from the NLTK website (nltk.org). Follow the instructions there to download the version required for your platform.

  1. Go to Google News and select 3 press articles (2 about the same topic and 1 really different).
  2. Copy/paste the text content of each article into 3 separate files.
  3. The goal is to find the two nearest sentences (in a meaning/semantic way) in the two articles on the same topic. The method used should also show a difference with the article on the different topic. The possible tools to achieve this result were presented in the last session (tokenization, normalization, regular expressions, string distances); a simple starting point is sketched at the end of this post.
  4. Verify whether your 3 articles respect Zipf’s law (a minimal check is sketched just below).
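
For step 4, here is a minimal sketch of a Zipf’s law check. It assumes the raw text of one article has already been read into a hypothetical string article_text, uses matplotlib for the plot, and simply plots word frequency against rank on a log-log scale, where Zipf’s law predicts a roughly straight line.

import nltk
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt

# article_text is assumed to hold the raw text of one article
words = [w.lower() for w in word_tokenize(article_text) if w.isalpha()]
freqs = nltk.FreqDist(words)

# sort frequencies in decreasing order: rank 1 = most frequent word
counts = sorted(freqs.values(), reverse=True)
ranks = range(1, len(counts) + 1)

plt.loglog(ranks, counts, marker='.')
plt.xlabel('rank (log scale)')
plt.ylabel('frequency (log scale)')
plt.show()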

Below, you will find some examples of these tools with NLTK.

Tokenization

To split text into sentences, we can import the sent_tokenize function and pass it the text that needs to be tokenized. Internally, sent_tokenize uses an instance of PunktSentenceTokenizer, which has been pretrained on several European languages to recognize the letters and punctuation that mark the beginning and end of sentences.

Tokenization of text into sentences:

import nltk
# the pretrained Punkt models must be downloaded once: nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = "Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
tokenizer.tokenize(text)
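
Equivalently, the higher-level sent_tokenize function loads this pretrained English tokenizer for us:

from nltk.tokenize import sent_tokenize
sent_tokenize("Hello everyone. Hope all are fine and doing well. Hope you find the book interesting")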

Tokenization of text in other languages:

To tokenize text in languages other than English, we can load the corresponding language pickle file found in tokenizers/punkt and tokenize the text with it. For French text, we use the french.pickle file as follows:

import nltk
french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
french_tokenizer.tokenize("Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage au collège franco-britannique de Levallois-Perret. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi, d'un professeur d'histoire.")

Tokenization of sentences into words:

Now we’ll process individual sentences: each sentence is tokenized into words using the word_tokenize() function, which relies on an instance of TreebankWordTokenizer to perform word tokenization.

import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")

TreebankWordTokenizer follows the conventions of the Penn Treebank corpus. In particular, it separates contractions, as shown here:

import nltk
text = nltk.word_tokenize("Don't hesitate to ask questions")
print(text)

Another word tokenizer is PunktWordTokenizer. It splits on punctuation but keeps the punctuation attached to the word instead of creating an entirely new token. WordPunctTokenizer, by contrast, splits punctuation off into separate tokens, which is usually what we want. For example, on the contraction Don't, the Treebank tokenizer above produces Do and n't, whereas WordPunctTokenizer produces Don, ' and t:

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Don't hesitate to ask questions")

Tokenization using regular expressions (regex)

The tokenization of words can be performed by constructing regular expressions in these two ways:

  • By matching with words
  • By matching spaces or gaps

We can import RegexpTokenizer from NLTK. We can create a Regular Expression that can match the tokens present in the text:

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w]+")
tokenizer.tokenize("Don't hesitate to ask questions")

Instead of instantiating the class, an alternative is to use the regexp_tokenize function directly:

import nltk
from nltk.tokenize import regexp_tokenize
sent = "Don't hesitate to ask questions"
print(regexp_tokenize(sent, pattern=r'\w+|\$[\d\.]+|\S+'))
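
The second bullet above, matching the spaces or gaps between tokens rather than the tokens themselves, can be sketched by passing gaps=True, so that the pattern describes the separators:

from nltk.tokenize import RegexpTokenizer
# with gaps=True the pattern matches the separators (here whitespace), not the tokens
gap_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
gap_tokenizer.tokenize("Don't hesitate to ask questions")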

For a list of regular expression symbols, please read: Regular Expression Quick Start (https://www.rexegg.com/regex-quickstart.html)

Conversion into lowercase and uppercase:

text = 'HARdWork IS KEy to SUCCESS'
print(text.lower())
print(text.upper())

Dealing with stop words:

NLTK ships a list of stop words for many languages. The stopwords corpus must first be downloaded (for example with nltk.download('stopwords')) so that the lists can be accessed from nltk_data/corpora/stopwords/:

import nltk
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
words = ["Don't", 'hesitate', 'to', 'ask', 'questions']
[word for word in words if word not in stops]

The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. Its words() function takes a fileid argument; here it is 'english', which refers to all the stop words present in the English file. If words() is called with no argument, it returns the stop words of all languages. The languages for which a stop word file is available in NLTK can be listed with the fileids() function:

stopwords.fileids()

Example of replacing one piece of text with another:

import nltk
from replacers import RegexpReplacer
replacer = RegexpReplacer()
replacer.replace("Don't hesitate to ask questions")

RegexpReplacer.replace() substitutes every occurrence of a replacement pattern with its corresponding substitution. For example, must've becomes must have and didn't becomes did not, because the replacement patterns in replacers.py are defined as tuple pairs such as (r'(\w+)\'ve', '\g<1> have') and (r'(\w+)n\'t', '\g<1> not').
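
Note that replacers is not an NLTK module; it is a small helper file you write yourself. A minimal sketch of such a replacers.py, using only the two patterns quoted above, could look like this:

import re

# (pattern, substitution) pairs; \g<1> re-inserts the captured word stem
replacement_patterns = [
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)n't", r"\g<1> not"),
]

class RegexpReplacer:
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for regex, repl in patterns]

    def replace(self, text):
        # apply every pattern in turn
        for pattern, repl in self.patterns:
            text = pattern.sub(repl, text)
        return text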

We can not only perform the replacement of contractions; we can also substitute a token with any other token.

Performing substitution before tokenization:

import nltk
from nltk.tokenize import word_tokenize
from replacers import RegexpReplacer
replacer = RegexpReplacer()
word_tokenize("Don't hesitate to ask questions")
word_tokenize(replacer.replace("Don't hesitate to ask questions"))

Lemmatization:

Lemmatization is the process of reducing a word to its base dictionary form, its lemma; unlike a stem, the lemma is always a real word. The built-in morphy() function is used for lemmatization in WordNetLemmatizer. The input word is left unchanged if it is not found in WordNet. The pos argument refers to the part-of-speech category of the input word. Consider an example of lemmatization in NLTK:

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer_output = WordNetLemmatizer()
lemmatizer_output.lemmatize('working')
lemmatizer_output.lemmatize('working', pos='v')
lemmatizer_output.lemmatize('works')

The WordNetLemmatizer class may be seen as a wrapper around the WordNet corpus; it uses the morphy() function of WordNetCorpusReader to extract a lemma. If no lemma can be extracted, the word is returned in its original form. For example, for works, the lemma returned is the singular form, work.

Difference between stemming and lemmatization:

import nltk
from nltk.stem import PorterStemmer
stemmer_output = PorterStemmer()
stemmer_output.stem('happiness')
from nltk.stem import WordNetLemmatizer
lemmatizer_output = WordNetLemmatizer()
lemmatizer_output.lemmatize('happiness')

In the preceding code, happiness is converted to happi by stemming. Lemmatization doesn't find the root word for happiness, so it returns the word happiness.

Similarity Measure:

The edit_distance() function computes the Levenshtein distance between two strings, i.e. the minimum number of insertions, deletions and substitutions needed to transform one string into the other:

import nltk
from nltk.metrics import *
edit_distance("relate", "relation")
edit_distance("suggestion", "calculation")

We can also apply a similarity measure based on Jaccard’s coefficient. Jaccard’s coefficient, or Tanimoto coefficient, measures the overlap of two sets X and Y and is defined as |X ∩ Y| / |X ∪ Y|; NLTK’s jaccard_distance() returns 1 minus this coefficient.

import nltk
from nltk.metrics import *
X = set([10,20,30,40])
Y = set([20,30,60])
print(jaccard_distance(X,Y))
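
Bringing this back to step 3 of the assignment, one simple, purely lexical approach (a sketch, not the required method) is to compute the Jaccard distance between the normalized token sets of every pair of sentences across two articles and keep the closest pair. The variables article1 and article2 below are hypothetical strings holding the article texts.

from nltk.corpus import stopwords
from nltk.metrics import jaccard_distance
from nltk.tokenize import sent_tokenize, word_tokenize

stops = set(stopwords.words('english'))

def sentence_token_sets(text):
    # one normalized token set per sentence: lowercased alphabetic words, stop words removed
    return [(sent, {w.lower() for w in word_tokenize(sent)
                    if w.isalpha() and w.lower() not in stops})
            for sent in sent_tokenize(text)]

def nearest_sentences(text_a, text_b):
    # return (distance, sentence_a, sentence_b) for the closest pair of sentences
    best = None
    for sent_a, tokens_a in sentence_token_sets(text_a):
        for sent_b, tokens_b in sentence_token_sets(text_b):
            if tokens_a and tokens_b:
                d = jaccard_distance(tokens_a, tokens_b)
                if best is None or d < best[0]:
                    best = (d, sent_a, sent_b)
    return best

# article1 and article2 are assumed to hold the raw text of two articles
# print(nearest_sentences(article1, article2))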

Good to know: for further experiments, more than a hundred corpora are provided by NLTK at http://www.nltk.org/nltk_data/.
