Two minutes NLP: Effective intent identification in short texts with unsupervised learning
LDA, USE, Sentence-BERT, PCA, UMAP, and HDBSCAN
There are two main unsupervised learning approaches to understanding what short texts are about: topic modeling and clustering of embeddings.
Topic Modeling
Topic modeling is used to discover latent topics in a collection of documents. A very common topic modeling algorithm is LDA (Latent Dirichlet Allocation). The number of topics to find is a hyperparameter of LDA, and it can be tuned by optimizing a suitable metric, for example by maximizing topic coherence. Airbnb, for example, uses LDA for this purpose.
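To make the "number of topics" tuning concrete, here is a minimal sketch that fits LDA for a few candidate topic counts and keeps the one with the highest average UMass coherence. It assumes scikit-learn for LDA and a hand-written UMass score. The toy messages, the candidate list and the helper names (`umass_coherence`, `mean_coherence`) are made up for illustration, and a corpus this small is far too tiny to give a meaningful ranking.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical messages, only to make the sketch runnable).
docs = [
    "my card was declined at the shop",
    "the card payment was declined again",
    "how do I activate my new card",
    "I want to transfer money to another account",
    "the transfer to my savings account is pending",
    "money transfer failed with an error",
    "I forgot my password and cannot log in",
    "please reset my password for the app",
    "the app does not accept my password",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Document co-occurrence counts: cooc[i, j] = number of docs containing words i and j.
present = (X > 0).astype(int)
cooc = (present.T @ present).toarray()
doc_freq = np.diag(cooc)  # doc_freq[i] = number of docs containing word i


def umass_coherence(top_word_ids):
    """UMass coherence of one topic, given its top words ordered by weight."""
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            wm, wl = top_word_ids[m], top_word_ids[l]
            score += np.log((cooc[wm, wl] + 1) / doc_freq[wl])
    return score


def mean_coherence(n_topics, top_n=5):
    """Fit LDA with n_topics and return the average coherence over its topics."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    # Indices of the top_n highest-weight words of each topic.
    top_words = np.argsort(lda.components_, axis=1)[:, ::-1][:, :top_n]
    return float(np.mean([umass_coherence(t) for t in top_words]))


candidates = [2, 3, 4]
scores = {k: mean_coherence(k) for k in candidates}
best_k = max(scores, key=scores.get)  # higher UMass coherence is better
print(scores, best_k)
```

Other coherence variants exist and may rank the candidates differently, so treat the selected value as a starting point and inspect the topics by hand.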
However, intents are often more specific than topics, so clustering of embeddings can be a useful alternative.
Clustering of embeddings
Intents can be identified by finding precise and narrow clusters. This is typically done in three steps:
- Obtain an embedding for each document. Google’s Universal Sentence Encoder (USE) and Sentence-BERT are popular sentence encoders for this purpose.
- Reduce the dimensionality of the embeddings, using techniques such as PCA or UMAP. This step has been observed to improve the results of the clustering step that follows.
- Cluster the reduced embeddings. Density-based clustering algorithms such as HDBSCAN are typically used.
Datasets
The PolyAI team published a banking dataset that contains 10,000+ messages spanning 77 intents, which you can use to test your algorithms: https://github.com/PolyAI-LDN/task-specific-datasets
Keep in mind that in a real-world setting you would face additional challenges, such as identifying which message of each conversation contains the intent.
Other public datasets are hard to find, because real-world data needs to be anonymized before it can be released.