Simple intent recognition and question answering with DeepPavlov

This article is part of an ongoing series on the DeepPavlov framework. You can also check out the full list of articles.

Let’s get straight to the point. Suppose your clients tend to ask questions from a very limited pool. You can address these questions either via a customer support call center or via a chat widget on the webpage. In both cases, for each user input you need to decide whether it is semantically similar to one of the predefined questions and, if so, return the corresponding answer. Formally speaking, this is a text classification problem. Text classification is one of the most widely used tasks in the field of natural language processing (NLP).

Text classification can solve the following problems:

Recognize a user’s intent in any chatbot platform.
Distinguish between spam and nonspam messages.
Identify the sentiment of client reviews.
Classify a product item into one or more product types from a catalog.

If you are faced with one of these problems, there are several solutions at your disposal. You can hire an NLP engineer to collect the training data for you, train, fine-tune, and retrain the model until its performance satisfies you. Alternatively, you can employ one of the NLP API services to do the job for you. Both solutions have advantages and disadvantages. An NLP engineer is costly, and a black box API service might not be able to provide you with the flexibility to test several models and fine-tune all the parameters you want.

Fortunately, the open-source conversational AI framework DeepPavlov offers a free and easy-to-use solution. DeepPavlov comes with a bunch of predefined components powered by TensorFlow and Keras for solving NLP-related problems, including text classification. The framework allows you to fine-tune hyperparameters and test several models.

Next, I would like to show how you can do text classification with the DeepPavlov framework. In particular, I will focus on the case when the training data is limited. A popular usage scenario for these models is to match a user utterance to one of the FAQ questions and retrieve the corresponding answer (autoFAQ models). The code used in this article can be accessed on Colaboratory via the link.

Text classification with DeepPavlov

We use the student FAQ as a dataset for demonstration. The FAQ consists of student questions with corresponding answers. Here is a sample data file.

Question,Answer

How to get a bank card?,Visit the social service on the second floor of the building housing the dining hall. The social service is next to local internal affairs office.

The DeepPavlov framework already contains pretrained models for classifying these questions. You can interact with a pretrained model either by running Python code or via the command line. Interaction via the command line is illustrated by the snippet below. Before interacting with the model, install all of its requirements.

python -m deeppavlov install tfidf_logreg_en_faq
python -m deeppavlov interact tfidf_logreg_en_faq -d

q::I need help
>> If you have any further inquiries, you can address them to the International Students Office, which is located in the Auditorium Building, Room 315. The phone number is (7-495) 408-7043.

Here, tfidf_logreg_en_faq is the name of the model’s configuration file, and -d indicates that all model-related data is to be downloaded.

Alternatively, you can interact with the model by running the following Python code. In addition, you can navigate the configuration files by using Autocomplete (Tab key) on configs.
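A minimal sketch of such an interaction, assuming a DeepPavlov release that ships the tfidf_logreg_en_faq config and exposes build_model and configs at the package top level. The wrapper name ask_faq is hypothetical and introduced only for illustration.

```python
def ask_faq(questions):
    """Return the model's output for a list of question strings.

    Assumes DeepPavlov is installed and that configs.faq.tfidf_logreg_en_faq
    exists in the installed version.
    """
    # Imported lazily so this module still loads where DeepPavlov is absent.
    from deeppavlov import build_model, configs

    # download=True fetches the model files, like the -d flag on the command line.
    faq = build_model(configs.faq.tfidf_logreg_en_faq, download=True)
    return faq(questions)
```

Calling ask_faq(["I need help"]) should then mirror the command-line session above. The exact shape of the returned value depends on the out field of the config.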

Model description

DeepPavlov contains several text classification models that work well with only a few training pairs. The models are described in separate configuration files under the config/faq folder.

The config file consists of four main sections: dataset_reader, dataset_iterator, chainer, and train. The dataset_reader defines the dataset’s location along with the dataset format (x_col_name, y_col_name). After loading, the data is split into the train, validation, and test sets according to the dataset_iterator settings.

"dataset_reader": {
    "class_name": "faq_reader",
    "x_col_name": "Question",
    "y_col_name": "Answer",
    "data_url": "http://files.deeppavlov.ai/faq/school/faq_school_en.csv"},

"dataset_iterator": {
    "class_name": "data_learning_iterator",
    "field_to_split": "train",
    "split_fields": ["train","test"],
    "split_proportions": [0.8,0.2]}

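To make the split_proportions setting concrete, here is a rough sketch of what an 80/20 split amounts to. It is not DeepPavlov's implementation: the function split_train_test, the fixed seed, and the toy question-answer pairs are all assumptions made for illustration.

```python
import random


def split_train_test(samples, proportions=(0.8, 0.2), seed=42):
    """Shuffle the samples and cut them into train and test parts."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * proportions[0]))
    return {"train": shuffled[:cut], "test": shuffled[cut:]}


# Toy data standing in for the (question, answer) rows of the FAQ file.
pairs = [(f"question {i}", f"answer {i}") for i in range(10)]
split = split_train_test(pairs)
print(len(split["train"]), len(split["test"]))  # 8 2
```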
The chainer section of the configuration file consists of three subsections. The in and out subsections define the input and the output of the chainer, whereas the pipe subsection defines the pipeline of components required to interact with the model, i.e., the tokenizer, the tf-idf vectorizer, and others. The tokenizer splits a string into tokens. The tf-idf vectorizer transforms the tokens into tf-idf vectors. In the configuration below, the tokenizer with the lemmatizer enabled (lemmas: true) splits an input question into tokens, converts the tokens into lemmas, and stores the output in q_token_lemmas. The next component, fasttext, loads fastText embeddings (from the load_path file) and converts each lemma in q_token_lemmas into a word vector. Because mean is set to true, it outputs the mean of all word vectors and stores it in question_vector. Finally, the cos_sim_classifier component returns the top_n closest candidates from the training data (the fit_on fields) for the input question_vector, using the cosine similarity measure.

"chainer": {
   "in": "question",
   "pipe": [{
             "class_name": "tokenizer",
             "in": "question",
             "lemmas": true,
             "out": "q_token_lemmas"
             },
             {
              "class_name": "fasttext",
              "in": "q_token_lemmas",
              "load_path": "embeddings/100.bin", 
              "mean": true,
              "out": "question_vector"
             }, 
             {
              "class_name": "cos_sim_classifier",
              "in": "question_vector",
              "fit_on": ["question_vector","y"],
              "top_n": 1,
              "save_path": "faq/ft_cos_classifier.pkl",            
              "load_path": "faq/ft_cos_classifier.pkl",
              "out": ["answer", "score"]
            }],
   "out": ["answer"]
}
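The pipeline above can be sketched in plain Python to show the idea behind it: average the word vectors of a question, then rank the stored answers by cosine similarity. This is only an illustration, not the framework's code. The three-dimensional TOY_VECTORS, the helper names, and the placeholder answers are all made up, and lemmatization is skipped.

```python
import math

# Toy 3-dimensional "embeddings"; real fastText vectors are loaded from a file.
TOY_VECTORS = {
    "bank": [1.0, 0.0, 0.0],
    "card": [0.9, 0.1, 0.0],
    "dormitory": [0.0, 1.0, 0.0],
    "room": [0.0, 0.9, 0.1],
    "get": [0.1, 0.1, 0.1],
}


def mean_vector(tokens, vectors=TOY_VECTORS, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def top_n_answers(question_tokens, train_pairs, top_n=1):
    """Return the top_n (answer, score) pairs, best match first."""
    query = mean_vector(question_tokens)
    scored = [(cosine(query, mean_vector(tokens)), answer)
              for tokens, answer in train_pairs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(answer, score) for score, answer in scored[:top_n]]


# Placeholder training data: (question tokens, answer).
train_pairs = [
    (["get", "bank", "card"], "answer about bank cards"),
    (["get", "dormitory", "room"], "answer about dormitory rooms"),
]
result = top_n_answers(["bank", "card"], train_pairs)
print(result[0][0])  # answer about bank cards
```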

You can train a model by running it with the train parameter; the model will be trained on the dataset defined in the dataset_reader section of the configuration file. The DeepPavlov framework also allows you to test all the available models on your data in order to identify the best-performing one. To test a model, specify the dataset split along with the split fields in the dataset_iterator section of the configuration file. In addition, define the metrics to be measured in the train section as follows:

"train": {
    "metrics": ["accuracy"],
    "validate_best": false,
    "test_best": true
}

Then, train the model by running

python -m deeppavlov train tfidf_logreg_en_faq
...
{"test": {"eval_examples_count": 9, "metrics": {"accuracy": 0.7778}, "time_spent": "0:00:01"}}
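As a quick sanity check on the reported number, accuracy is simply the share of correctly classified examples. The sketch below uses a hypothetical accuracy helper and made-up labels to show that 7 correct predictions out of 9 test examples give the 0.7778 seen in the log. The count of 7 is inferred from the log, which does not state it.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 9 test examples, the last two predicted incorrectly.
y_true = list(range(9))
y_pred = list(range(7)) + [-1, -1]
print(round(accuracy(y_true, y_pred), 4))  # 0.7778
```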

Alternatively, you can train the model by running the following Python code.
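A minimal sketch of such a training call, assuming a DeepPavlov release that exposes train_model and configs at the package top level and ships the tfidf_logreg_en_faq config. The wrapper name train_faq_model is hypothetical.

```python
def train_faq_model():
    """Train the FAQ model described by the config and return it.

    Assumes DeepPavlov is installed and that configs.faq.tfidf_logreg_en_faq
    exists in the installed version.
    """
    # Imported lazily so this module still loads where DeepPavlov is absent.
    from deeppavlov import configs, train_model

    # download=True fetches the data referenced by the config before training.
    return train_model(configs.faq.tfidf_logreg_en_faq, download=True)
```

The returned object can be called on a list of questions in the same way as a model built for interaction.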

Model evaluation

Model performance was measured on the FAQ dataset (with manually added paraphrases for each question). Due to the limited number of the question-answer pairs, we measure performance by using leave-one-out cross-validation (LOOCV).
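Leave-one-out cross-validation holds out each example once, predicts its label from all remaining examples, and averages the hits. The sketch below illustrates the procedure with a deliberately simple stand-in classifier based on token overlap. The classifier, the helper names, and the four toy utterances are assumptions for illustration, not the setup used for the reported results.

```python
def jaccard(a, b):
    """Token-set overlap between two token lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def nearest_label(query, train):
    """Stand-in classifier: label of the training utterance with most overlap."""
    best = max(train, key=lambda item: jaccard(query.split(), item[0].split()))
    return best[1]


def loocv_accuracy(data, predict=nearest_label):
    """Hold out each example once, predict it from the rest, average the hits."""
    hits = 0
    for i, (text, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(text, rest) == label
    return hits / len(data)


# Toy data: two paraphrases per label, so each held-out item has a neighbor.
data = [
    ("how to get a bank card", "card"),
    ("where do i get a bank card", "card"),
    ("how to find the library", "library"),
    ("where is the library located", "library"),
]
print(loocv_accuracy(data))  # 1.0
```

Note that LOOCV only makes sense here because every question has at least one paraphrase; with a single utterance per label, the held-out example could never be matched to its own class.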

All the models are based on two major text representations: fastText word embeddings and tf-idf representation.

The fastText model (fasttext_avg_autofaq.json) is a popular approach that averages fastText word embeddings and assigns the label of the closest utterance from the training set according to cosine distance. The tf-idf model (tfidf_autofaq.json) uses the tf-idf representation of the utterances; then, like the previous model, it uses cosine distance to assign a label. A hybrid fastText tf-idf weighting model (fasttext_tfidf_autofaq.json) weights the fastText word embeddings by tf-idf values and also uses the cosine similarity approach. Finally, the tf-idf logistic regression model (tfidf_logreg_autofaq.json) trains a logistic regression on the tf-idf representation of the input.
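The tf-idf plus cosine approach can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the framework's code: the helper names and toy utterances are made up, and the idf formula is one common smoothed variant.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    return {token: math.log((1 + n) / (1 + count)) + 1
            for token, count in df.items()}


def tfidf(doc, idf):
    """Sparse tf-idf vector as a dict; tokens unseen during fitting are dropped."""
    counts = Counter(token for token in doc if token in idf)
    return {token: count * idf[token] for token, count in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Toy training set: (tokens, label).
train = [
    ("how to get a bank card".split(), "card"),
    ("where is the library located".split(), "library"),
]
idf = fit_idf([tokens for tokens, _ in train])
query = tfidf("i lost my bank card".split(), idf)
best = max(train, key=lambda item: cosine(query, tfidf(item[0], idf)))
print(best[1])  # card
```

Because the match relies on shared tokens, a paraphrase that uses entirely different words scores zero against every training utterance, which is where embedding-based representations have an advantage.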

The results, in sorted order, are presented in Table 1.

The fastText mean-based model outperforms all tf-idf-based models by a large margin; this result may be caused by the rich lexical variability of the dataset. The tf-idf logistic regression, which learns to assign weights to the words, outperforms the other tf-idf-based models.

Conclusion

In this article, I described the text classification models of the DeepPavlov framework. The relevant code can be found in the Colab notebook. These models were specifically developed to be effective for a small training dataset. However, if a large enough dataset is available, more sophisticated deep learning models can be applied.

I would like to thank Mikhail Burtsev, Luiza Sayfullina, Olga Kairova, and the entire iPavlov team for their insightful comments.