4 More Little-Known NLP Libraries That Are Hidden Gems

With code examples and explanations

Image generated by the author using DALL·E 2 (Prompt: “a huge blue gem being dug from the ground”)

Discovering new Python libraries can often spark new ideas. Here are 4 hidden-gem libraries that are well worth knowing about.

Let’s get into it.

1) Presidio Analyzer and Anonymizer

Developed by Microsoft, Presidio offers an automatic way to anonymize sensitive text data. First, the Analyzer detects the locations of private entities within the unstructured text, using a combination of named entity recognition (NER) and rule-based pattern matching with regular expressions. In the following example, we look for names, emails, and phone numbers, but there are many other predefined recognizers to choose from. The Analyzer results are then passed to the Anonymizer, which replaces the private entities with de-sensitized text.
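To make the detect-then-replace idea concrete, here is a minimal sketch that uses only Python’s built-in re module. It is not Presidio’s implementation: the two patterns, the sample sentence, and the function names detect and anonymize are simplified assumptions for illustration, and name detection (which needs NER) is left out entirely.

```python
import re

# Simplified, hypothetical patterns for illustration only. Presidio's
# built-in recognizers are more thorough and also use NER for names.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def detect(text):
    """Step 1 (analyze): return sorted (start, end, entity_type) spans."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity_type))
    return sorted(spans)


def anonymize(text, spans):
    """Step 2 (anonymize): replace each span with a <TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


# Made-up sample text (not taken from the article's example).
sample = "Call 215-555-8678 or write to jane.doe@example.com."
print(anonymize(sample, detect(sample)))
```

Keeping detection and replacement as two separate steps mirrors the Analyzer/Anonymizer split: the spans can be inspected or filtered before any text is changed.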

Installation

!pip install presidio-anonymizer
!pip install presidio_analyzer
!python -m spacy download en_core_web_lg

Example

from presidio_analyzer import AnalyzerEngine 
from presidio_anonymizer import AnonymizerEngine 
from presidio_anonymizer.entities import OperatorConfig

# identify spans of private entities
text_to_anonymize = "Reached out to Bob Warner at 215-555-8678. Sent invoice to [email protected]" 
analyzer = AnalyzerEngine() 
analyzer_results = analyzer.analyze(text=text_to_anonymize,  
                                    entities=["EMAIL_ADDRESS", "PERSON", "PHONE_NUMBER"],  
                                    language='en')

# pass Analyzer results into the anonymizer
anonymizer = AnonymizerEngine() 
anonymized_results = anonymizer.anonymize( 
    text=text_to_anonymize, 
    analyzer_results=analyzer_results     
) 
print(anonymized_results.text)

Result

ORIGINAL: Reached out to Bob Warner at 215-555-8678. Sent invoice to [email protected]

OUTPUT: Reached out to <PERSON> at <PHONE_NUMBER>. Sent invoice to <EMAIL_ADDRESS>

Use Case

Anonymization is a critical step toward safeguarding personal information. It is especially important if you are collecting or sharing sensitive data in the workplace.

Documentation

2) SymSpell

SymSpell is a go-to Python library for automatic spelling correction. It is fast and covers a wide variety of common mistakes, including misspellings and missing or extra spaces. Although SymSpell will not fix grammatical issues or consider the context of words, its quick execution speed is helpful when working with large datasets. SymSpell suggests corrections based on word frequency (e.g., the appears more frequently than therapy) and on edit distance, i.e., the number of single-character edits needed to turn the input into a dictionary term.
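The ranking idea, closest edit distance first and higher word frequency as the tie-breaker, can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions: the word counts are made up, the distance is plain Levenshtein, and every dictionary term is compared in turn, which is far slower than the precomputed lookups that give SymSpell its speed.

```python
# Toy frequency dictionary (hypothetical counts, for illustration only).
WORD_COUNTS = {"the": 5000, "restaurant": 120, "rest": 300, "great": 900, "food": 700}


def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def correct(word, max_edit_distance=2):
    """Return the closest dictionary term; ties go to the more frequent term."""
    candidates = [
        (edit_distance(word, term), -count, term)
        for term, count in WORD_COUNTS.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_edit_distance]
    if not candidates:
        return word  # nothing close enough, so leave the word unchanged
    return min(candidates)[2]


print(correct("resturant"))
```

Here "resturant" is one insertion away from "restaurant", so it is the only candidate within the distance limit.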

Installation

!pip install symspellpy

Example

from symspellpy import SymSpell, Verbosity
import pkg_resources

# load a dictionary (this one consists of 82,765 English words)
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)

# term_index: column of the term 
# count_index: column of the term's frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

def symspell_corrector(input_term):

  # look up suggestions for multi-word input strings 
  suggestions = sym_spell.lookup_compound( 
      phrase=input_term,  
      max_edit_distance=2,  
      transfer_casing=True,  
      ignore_term_with_digits=True, 
      ignore_non_words=True, 
      split_by_space=True 
  ) 
  # return the first (best-ranked) correction
  for suggestion in suggestions: 
      return f"OUTPUT: {suggestion.term}"

text = "the resturant had greatfood."
print(symspell_corrector(text))

Result

ORIGINAL: the resturant had greatfood.

OUTPUT: the restaurant had great food

Use Case

Whether you are working with customer reviews or social media posts, your text data is likely to contain spelling errors. SymSpell could be used as another step during NLP preprocessing. For instance, a Bag-of-Words or TF-IDF model will view restaurant and the misspelled word resturant differently even though we know they both have the same meaning. Running spelling correction fixes this issue and may help reduce dimensionality.
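A small sketch of the dimensionality point, using only collections.Counter as a stand-in for a Bag-of-Words model. The three reviews and the CORRECTIONS table are made up for illustration; in practice the corrections would come from a spell checker rather than a hand-written dictionary.

```python
from collections import Counter

# Made-up reviews; two of them contain the same misspelling.
reviews = [
    "the restaurant had great food",
    "the resturant had great service",
    "great resturant",
]

# Hypothetical correction table standing in for a spell checker's output.
CORRECTIONS = {"resturant": "restaurant"}


def bag_of_words(docs):
    """Count how often each whitespace-separated token occurs across docs."""
    return Counter(token for doc in docs for token in doc.split())


def apply_corrections(doc):
    """Replace every token that has an entry in the correction table."""
    return " ".join(CORRECTIONS.get(token, token) for token in doc.split())


raw = bag_of_words(reviews)
fixed = bag_of_words(apply_corrections(doc) for doc in reviews)

print(len(raw), len(fixed))                      # vocabulary size before and after
print(raw["restaurant"], fixed["restaurant"])    # counts for the correct spelling
```

After correction, the two spellings share a single vocabulary entry, so the feature space shrinks by one column and all three mentions count toward the same term.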

Documentation: https://pypi.org/project/symspellpy/

3) pySBD (Python Sentence Boundary Disambiguation)

Finally, a smart, simple Python library that splits text into sentences! The task seems straightforward, but human language is complex and noisy, so splitting on punctuation alone only works up to a point. What’s great about pySBD is its ability to handle a large variety of edge cases, such as abbreviations, decimal values, and other complex constructions often found in legal, financial, and biomedical corpora. Unlike libraries that rely on neural networks for this task, pySBD identifies sentence boundaries using a rule-based approach. In their paper (https://aclanthology.org/2020.nlposs-1.15.pdf), the authors demonstrate that pySBD achieves higher accuracy than the alternatives on benchmark tests.
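To see why punctuation alone falls short, here is a quick sketch of the naive approach using only Python’s re module. The splitting rule (break after ".", "!" or "?" when followed by whitespace) is an assumption chosen for illustration, not something taken from pySBD.

```python
import re

text = (
    "My name is Mr. Robert H. Jones. Please read up to p. 45. "
    "At 3 P.M. we will talk about U.S. history."
)

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule cuts the text into eight fragments, breaking after "Mr.", "H.", "p.", "P.M." and "U.S.", even though the passage contains only three sentences.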

Installation

!pip install pysbd

Example

from pysbd import Segmenter

segmenter = Segmenter(language='en', clean=True)

text = "My name is Mr. Robert H. Jones. Please read up to p. 45. At 3 P.M. we will talk about U.S. history."

print(segmenter.segment(text))

Result

ORIGINAL:
My name is Mr. Robert H. Jones. Please read up to p. 45. At 3 P.M. we will talk about U.S. history.

OUTPUT:
['My name is Mr. Robert H. Jones.',
 'Please read up to p. 45.',
 'At 3 P.M. we will talk about U.S. history.']

Use Case

There have been many times when I needed to treat or analyze text at the sentence level. A recent Aspect-Based Sentiment Analysis (ABSA) project is a good example. In this work, it was important to determine the polarity of specific relevant sentences within customer clothing reviews. This could only be done by first breaking up the text into individual sentences. So instead of spending time writing complex regular expressions to cover dozens of edge cases, let pySBD do the heavy lifting for you.

Documentation: https://github.com/nipunsadvilkar/pySBD

4) TextAttack

TextAttack is a fantastic Python framework for developing adversarial attacks on NLP models.

An adversarial attack in NLP is the process of creating small perturbations (or edits) to text data in order to fool the NLP model into making the wrong prediction. Perturbations include swapping words with synonyms, inserting new words, or deleting random characters from the text. These edits are applied to randomly selected observations from your model’s input dataset.
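The three perturbation types mentioned above can be sketched with plain Python. This is only an illustration of the edits themselves, not TextAttack’s API: the synonym table, the sample sentence, and the function names are made up, and a real attack would also query the model to check whether its prediction flips.

```python
import random

# Tiny hand-made synonym table (hypothetical; real attacks draw on
# embeddings or lexical resources instead).
SYNONYMS = {"great": "terrific", "bad": "poor", "movie": "film"}


def swap_synonyms(text):
    """Replace every word that has an entry in the synonym table."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())


def insert_word(text, word, position):
    """Insert a new word at the given word index."""
    words = text.split()
    words.insert(position, word)
    return " ".join(words)


def delete_random_char(text, rng):
    """Delete one randomly chosen character (assumes non-empty text)."""
    index = rng.randrange(len(text))
    return text[:index] + text[index + 1:]


rng = random.Random(0)  # seeded so repeated runs give the same result
original = "a great movie with a bad ending"

print(swap_synonyms(original))
print(insert_word(original, "truly", 1))
print(delete_random_char(original, rng))
```

Each edit leaves the sentence readable to a human, which is exactly what makes a changed model prediction a sign of brittleness.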

Image by the author. A single successful attack. The adversarial perturbations fooled the NLP classification model into predicting the incorrect label.

TextAttack provides a seamless, low-code way of generating these adversarial examples to form an attack. Once an attack is run, a summary will be shown of how well the NLP model performed. This will provide an evaluation of the robustness of your model — or in other words, how susceptible it is to certain perturbations. Robustness is an important factor to consider when launching NLP models into the real world.

Installation

!pip install textattack[tensorflow]

Example

TextAttack is far too versatile to cover briefly, so I highly recommend checking out its well-written documentation page.

Here, I will be running an attack via the command-line API (within Google Colab) on a BERT-based sentiment classification model from Hugging Face. This pre-trained model was fine-tuned to predict Positive or Negative using the Rotten Tomatoes Movie Review dataset.

The attack contains a word-swap-embedding transformation, which will transform selected observations from the Rotten Tomatoes dataset by replacing random words with synonyms in the word embedding space.
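As a rough picture of what “synonyms in the word embedding space” means, the sketch below picks a replacement by cosine similarity between word vectors. The three-dimensional vectors are invented numbers and the helper names are hypothetical; real embeddings have hundreds of dimensions, and this is not how TextAttack implements the transformation internally.

```python
import math

# Toy 3-dimensional "embeddings" with made-up values, for illustration only.
EMBEDDINGS = {
    "good": (0.9, 0.1, 0.0),
    "great": (0.8, 0.2, 0.1),
    "bad": (-0.9, 0.1, 0.0),
    "film": (0.0, 0.9, 0.3),
    "movie": (0.1, 0.8, 0.4),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(word):
    """Return the other vocabulary word whose vector is most similar."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))


def swap_word(sentence, target):
    """Replace each occurrence of target with its nearest neighbor."""
    replacement = nearest_neighbor(target)
    return " ".join(replacement if w == target else w for w in sentence.split())


print(nearest_neighbor("good"))
print(swap_word("a good film", "film"))
```

With these toy vectors, "good" maps to "great" and "film" maps to "movie", so the swapped sentence keeps its meaning while the exact tokens seen by the model change.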

Let’s see how this NLP model holds up against 20 adversarial examples.

!textattack attack \
    --model-from-huggingface RJZauner/distilbert_rotten_tomatoes_sentiment_classifier \
    --dataset-from-huggingface rotten_tomatoes \
    --transformation word-swap-embedding \
    --goal-function untargeted-classification \
    --shuffle True \
    --num-examples 20

Result

Image by the author. One of 19 successful attacks on the fine-tuned BERT model. It can be argued that the meaning of the original negative review stayed intact after the perturbations. Ideally, the model should NOT have misclassified this adversarial example.

Interesting! Without any perturbations, this model achieves an impressive 100% accuracy on the 20 sampled reviews. However, out of 20 total attacks, in which only 18% of the words were altered on average, the NLP model was fooled into misclassifying 19 times!
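The headline numbers can be reproduced with a little arithmetic. The sketch below assumes that no examples were skipped (consistent with the 100% original accuracy) and that the single non-successful attack counts as failed; the formulas follow the usual definitions of these summary metrics and are a reconstruction, not output copied from the tool.

```python
total_attacks = 20
successful = 19
skipped = 0  # assumption: every original prediction was correct, so none were skipped
failed = total_attacks - successful - skipped

# Share of examples classified correctly before any perturbation.
original_accuracy = 100 * (total_attacks - skipped) / total_attacks

# Share of examples still classified correctly after the attack.
accuracy_under_attack = 100 * failed / total_attacks

# Share of attempted attacks that flipped the model's prediction.
attack_success_rate = 100 * successful / (successful + failed)

print(f"Original accuracy:     {original_accuracy:.1f}%")
print(f"Accuracy under attack: {accuracy_under_attack:.1f}%")
print(f"Attack success rate:   {attack_success_rate:.1f}%")
```

In other words, accuracy drops from 100% to 5% under attack, which corresponds to a 95% attack success rate.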

Use Case

By testing an NLP model against adversarial attacks, we can better understand the model’s weaknesses. The next step can then be to improve model accuracy and/or robustness by further training the NLP model on augmented data.

For a full project example of how I put this library to use to evaluate a custom LSTM classification model, check out this article. It also includes a full code script.

NLP Project With Augmentation, Attacks, & Aspect-Based Sentiment Analysis

Do you know about these 3 advanced NLP concepts?

towardsdatascience.com

Documentation: https://textattack.readthedocs.io/en/latest/index.html

Conclusion

I hope that these libraries come in handy in your future NLP endeavors!

This was a continuation of a similar article I wrote recently. So if you haven’t heard of useful Python libraries like contractions, distilbert-punctuator, or textstat, then check that out too!

5 Lesser-Known Python Libraries for Your Next NLP Project

With code examples and explanations.

towardsdatascience.com

Thanks for reading!