Summary

The web content discusses the development of natural language processing (NLP) pipelines for chatbots, emphasizing the use of machine learning and hardcoded rules to understand user intent and the importance of preprocessing steps to improve accuracy.

Abstract

The article "Natural Language Pipeline for Chatbots" outlines the two primary technologies employed by chatbot developers for interpreting user messages: machine learning algorithms and hardcoded rules. It underscores the necessity of having a substantial amount of labeled data for effective machine learning, suggesting at least 1000 examples per class for classification tasks. In scenarios where data is scarce, the article suggests crafting specific rules to identify message intent. It also highlights common errors in intent classification, such as false positives and false negatives, and proposes a preprocessing NLP pipeline to mitigate these issues. This pipeline includes spellchecking, sentence splitting, word tokenization, part-of-speech tagging, lemmatization, entity recognition, and concept identification to enhance pattern matching for intent recognition. The article concludes by inviting readers to share their experiences with chatbot engines and suggests that developers may create internal or external domain-specific languages (DSLs) for defining intent identification patterns.

Opinions

The author suggests that machine learning requires a significant amount of data and may not be feasible for all projects, hinting at a potential barrier for smaller-scale operations.
There is an emphasis on the simplicity and potential inaccuracy of hardcoded rules without proper preprocessing, implying that they should be used with caution.
The article conveys the idea that preprocessing is crucial for reducing errors in chatbot interpretation, suggesting that a well-designed NLP pipeline is essential for robust chatbot functionality.
By mentioning the use of internal and external DSLs, the author implies that there is room for innovation in how developers define and apply rules for intent recognition.
The author seems to advocate for the practical use of readily available NLP libraries like NLTK, StanfordNLP, and SpaCy, indicating their effectiveness in the preprocessing steps.
The invitation for readers to share their experiences suggests that the author values community input and sees it as a source of valuable insights and collective problem-solving.

Natural Language Pipeline for Chatbots

Image credit: https://thefilmgeekfiles.files.wordpress.com/2011/06/wall_e_and_eve-wide.jpg

Chatbot developers usually use two technologies to make the bot understand the meaning of user messages: machine learning and hardcoded rules. See more details on chatbot architecture in my previous article.

Machine learning can help you identify the intent of a message and extract named entities. It is quite powerful but requires lots of data to train the model. A rule of thumb is to have around 1000 examples per class for classification problems.

If you don’t have enough labeled data, you can handcraft rules that identify the intent of a message. Rules can be as simple as “if a sentence contains the words ‘pay’ and ‘order’ then the user is asking to pay for an order”. And the simplest implementation in your favorite programming language could look like this:
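A minimal sketch in Python, assuming plain substring matching on the lowercased message. The function name `is_pay_order` is made up for illustration and is not taken from any chatbot library:

```python
def is_pay_order(message: str) -> bool:
    """Return True if the message mentions both 'pay' and 'order'."""
    text = message.lower()
    # Naive substring test: no tokenization, no spell checking, no lemmas.
    return "pay" in text and "order" in text


print(is_pay_order("I want to pay for my order"))  # True
print(is_pay_order("Where is my parcel?"))         # False
```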

Any intent classification code can make errors of two types. False positives: the user doesn’t express an intent, but the chatbot identifies an intent. False negatives: the user expresses an intent, but the chatbot doesn’t find it. This simple solution will make lots of errors:

The user can use the words “pay” and “order” in different sentences: “I made an order by mistake. I won’t pay.”
A keyword is a substring of another word: “Can I use paypal for order #123?”
Spelling errors: “My orrder number is #123. How can I pay?”
Different forms of words: “How can I pay for my orders?”
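These failure modes can be reproduced with a substring-based keyword rule. The sketch below assumes that kind of rule (the name `naive_rule` is hypothetical) and runs it on messages like the ones above. Note that with substring matching the plural “orders” happens to match, while an irregular form such as “paid” does not. A rule that compares whole words would miss “orders” as well.

```python
def naive_rule(message: str) -> bool:
    """Substring-based rule: 'pay' and 'order' both appear somewhere."""
    text = message.lower()
    return "pay" in text and "order" in text


examples = [
    "I made an order by mistake. I won't pay.",  # keywords in different sentences
    "Can I use paypal for order #123?",          # 'pay' is only part of 'paypal'
    "My orrder number is #123. How can I pay?",  # misspelled 'order'
    "How can I pay for my orders?",              # plural form
    "I paid for my order twice",                 # irregular form of 'pay'
]

results = [naive_rule(m) for m in examples]
print(results)  # [True, True, False, True, False]
```

The first two results are false positives by the definition above, and the third is a false negative caused by the typo.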

Your chatbot needs a preprocessing NLP pipeline to handle typical errors. It may include these steps:

1. Spellcheck

Get the raw input and fix spelling errors. You can do something very simple or build a spell checker using deep learning.
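As one example of “something very simple”, the sketch below replaces each unknown word with its closest match from a known vocabulary, using `difflib.get_close_matches` from the Python standard library. The tiny vocabulary and the 0.8 cutoff are assumptions chosen for illustration. A real bot would load a full word list and tune the threshold.

```python
import difflib

# Tiny illustrative vocabulary (an assumption, not a real dictionary).
VOCABULARY = ["my", "order", "number", "is", "how", "can", "i", "pay"]


def correct_word(word: str, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in VOCABULARY:
        return word
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text: str) -> str:
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in text.lower().split())


print(correct_text("my orrder number"))  # my order number
```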

2. Split into sentences

It is very helpful to analyze every sentence separately. Splitting the text into sentences is easy: you can use one of the NLP libraries, e.g. NLTK, StanfordNLP, or SpaCy.
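To show the idea without any dependencies, here is a deliberately naive splitter based on a regular expression. It is only a sketch: it breaks on abbreviations such as “e.g.”, which is exactly the kind of case the libraries above handle better.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("I made an order by mistake. I won't pay."))
# ['I made an order by mistake.', "I won't pay."]
```

With the text split this way, a rule applied per sentence no longer sees “order” and “pay” together in this message.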

3. Split into words

This is also very important because hardcoded rules typically operate on words. The same NLP libraries can do it.
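A rough sketch of word tokenization with a regular expression, assuming lowercase ASCII words, digits, and apostrophes are all that matter. Library tokenizers deal with many more cases (hyphens, URLs, emoji).

```python
import re
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and extract runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


tokens = tokenize("Can I use paypal for order #123?")
print(tokens)           # ['can', 'i', 'use', 'paypal', 'for', 'order', '123']
print("pay" in tokens)  # False
```

Once the rule checks membership in the token list rather than a substring of the raw text, “paypal” no longer triggers the “pay” keyword.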

4. POS tagging

Some words have multiple meanings, for example “charge” as a noun and “charge” as a verb. Knowing the part of speech can help to disambiguate the meaning. You can use the same NLP libraries, or Google SyntaxNet, which is a little more accurate and supports multiple languages.
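The sketch below shows how a rule might use part-of-speech tags once a tagger has produced them. The tagging itself is not implemented here: the `(token, tag)` pairs are hand-written stand-ins for tagger output, using Penn Treebank style tags, and the helper name `has_word_with_pos` is hypothetical.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def has_word_with_pos(tagged: TaggedSentence, word: str, pos_prefix: str) -> bool:
    """True if `word` occurs with a tag that starts with `pos_prefix` (e.g. 'NN', 'VB')."""
    return any(tok.lower() == word and tag.startswith(pos_prefix)
               for tok, tag in tagged)


# Hand-tagged examples standing in for real tagger output (an assumption).
noun_use = [("There", "EX"), ("is", "VBZ"), ("a", "DT"), ("charge", "NN"),
            ("on", "IN"), ("my", "PRP$"), ("card", "NN")]
verb_use = [("Please", "UH"), ("charge", "VB"), ("my", "PRP$"), ("card", "NN")]

print(has_word_with_pos(noun_use, "charge", "NN"))  # True
print(has_word_with_pos(verb_use, "charge", "NN"))  # False
```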

5. Lemmatize words

One word can have many forms: “pay”, “paying”, “paid”. In many cases, an exact form of the word is not important for writing a hardcoded rule. If preprocessing code can identify a lemma, a canonical form of the word, it helps to simplify the rule. Lemmatization, identifying lemmas, is based on dictionaries which list all forms of every word. The most popular dictionary for English is WordNet. NLTK and some other libraries allow using it for lemmatization.
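A toy version of dictionary-based lemmatization, assuming a hand-written table of word forms. The table is illustrative only. In practice you would rely on WordNet through a library rather than maintain such a list yourself.

```python
# Tiny form-to-lemma table, for illustration only (not a real dictionary).
LEMMAS = {
    "pays": "pay", "paying": "pay", "paid": "pay",
    "orders": "order", "ordered": "order", "ordering": "order",
}


def lemmatize(tokens):
    """Map each token to its lemma; unknown tokens are returned unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


print(lemmatize(["how", "can", "i", "pay", "for", "my", "orders"]))
# ['how', 'can', 'i', 'pay', 'for', 'my', 'order']
```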

6. Entity recognition: dates, numbers, proper nouns

Dates and numbers can be expressed in different formats: “3/1/2016”, “1st of March”, “next Wednesday”, “2016-03-01”, “123”, “one hundred”, etc. It may be helpful to convert them to a unified format before doing pattern matching. Other entities that require special treatment include locations (countries, regions, cities, street addresses, places), people, and phone numbers.
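As a small illustration of normalizing dates, the sketch below tries a fixed list of formats and returns an ISO 8601 string. It assumes the US month/day order for slash-separated dates and does not attempt relative expressions such as “next Wednesday”, which need a reference date and a dedicated parser.

```python
from datetime import datetime
from typing import Optional

# Assumption: slash-separated dates use month/day/year order.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]


def normalize_date(text: str) -> Optional[str]:
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("3/1/2016"))    # 2016-03-01
print(normalize_date("2016-03-01"))  # 2016-03-01
```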

7. Find concepts/synonyms

If you want to search for a dog breed, you don’t want to list all the dog breeds in the rule, because there are hundreds of them. It would be nice if the preprocessing code identified a dog breed in the message and marked the word with a special tag. Then you can just look for that tag when applying the rule.

WordNet can be used to identify common concepts. You may need to add domain-specific concept libraries, e.g. a list of drug names if you are building a healthcare bot.
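One simple way to do this is a lookup against concept word lists. The sketch below assumes two tiny hand-written lists. The tag names `DOG_BREED` and `DRUG` and the helper functions are made up for illustration.

```python
# Illustrative concept lists; real ones would come from WordNet or domain data.
CONCEPTS = {
    "DOG_BREED": {"beagle", "poodle", "labrador", "husky"},
    "DRUG": {"aspirin", "ibuprofen"},
}


def tag_concepts(tokens):
    """Return a list of (token, [concept tags]) pairs."""
    return [
        (t, sorted(name for name, words in CONCEPTS.items() if t in words))
        for t in tokens
    ]


def mentions(tagged, concept):
    """True if any token carries the given concept tag."""
    return any(concept in tags for _, tags in tagged)


tagged = tag_concepts(["i", "want", "a", "beagle"])
print(tagged[-1])                     # ('beagle', ['DOG_BREED'])
print(mentions(tagged, "DOG_BREED"))  # True
```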

After preprocessing is done you have a nice clean list of sentences and lists of words inside each sentence. Each word is marked with a part of speech and concepts, and you have a lemma for every word. The next step is to define patterns for intent identification.

You can invent your own pattern language using common logical operators AND, OR, NOT. The rule can look like this if you create an internal DSL (domain-specific language) based on Python:
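A minimal sketch of such an internal DSL, assuming rules are matched against the set of lemmas of one sentence. The class names (`Word`, `And`, `Or`, `Not`) are hypothetical. Python's `&`, `|` and `~` operators are overloaded to stand for AND, OR and NOT.

```python
class Pattern:
    """Base class: subclasses implement matches(lemmas)."""

    def matches(self, lemmas):
        raise NotImplementedError

    def __and__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)

    def __invert__(self):
        return Not(self)


class Word(Pattern):
    def __init__(self, lemma):
        self.lemma = lemma

    def matches(self, lemmas):
        return self.lemma in lemmas


class And(Pattern):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def matches(self, lemmas):
        return self.left.matches(lemmas) and self.right.matches(lemmas)


class Or(Pattern):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def matches(self, lemmas):
        return self.left.matches(lemmas) or self.right.matches(lemmas)


class Not(Pattern):
    def __init__(self, inner):
        self.inner = inner

    def matches(self, lemmas):
        return not self.inner.matches(lemmas)


# "pay" AND "order" AND NOT "cancel"
pay_for_order = Word("pay") & Word("order") & ~Word("cancel")

print(pay_for_order.matches({"how", "can", "i", "pay", "for", "my", "order"}))  # True
print(pay_for_order.matches({"cancel", "order", "i", "will", "not", "pay"}))   # False
```

Because the rules are ordinary Python objects, no separate parser is needed, at the cost of rules that read like code rather than like plain text.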

Alternatively, you can invent an external DSL, which can be more readable, but you will need extra work to create a compiler or an interpreter for that language. If you use the ChatScript language, it can look like this:
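To give a feel for the extra work an external DSL requires, here is a minimal sketch in Python (not ChatScript) of an interpreter for a made-up rule syntax with `AND`, `OR`, `NOT` and parentheses. The syntax and the function name `evaluate` are assumptions for illustration. Rules are evaluated against the set of lemmas of one sentence.

```python
import re

KEYWORDS = {"AND", "OR", "NOT"}


def evaluate(rule: str, lemmas: set) -> bool:
    """Evaluate a rule such as 'pay AND order AND NOT cancel' against a set of lemmas."""
    tokens = re.findall(r"\(|\)|[A-Za-z_]+", rule)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of rule")
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "OR":
            take()
            right = parse_and()  # always parse the operand before combining
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            right = parse_not()
            value = value and right
        return value

    def parse_not():  # highest precedence: NOT, parentheses, single words
        tok = take()
        if tok == "NOT":
            return not parse_not()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok in KEYWORDS or tok == ")":
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok in lemmas

    result = parse_or()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(evaluate("pay AND order AND NOT cancel", {"pay", "order"}))    # True
print(evaluate("(pay OR refund) AND order", {"refund", "order"}))    # True
```

Even this small grammar needs a tokenizer, a recursive-descent parser and error handling, which is the trade-off against the internal DSL approach.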

Do you use a chatbot engine with hardcoded rules? Have you developed your own? What issues have you encountered when building or using a chatbot engine? Please share in the comments!