Fabio Chiusano



Two minutes NLP — Effective intents identification in short texts with unsupervised learning

LDA, USE, Sentence-BERT, PCA, UMAP, and HDBSCAN

Photo by Volodymyr Hryshchenko on Unsplash

There are two main unsupervised learning approaches for understanding what short texts are about: topic modeling and clustering of embeddings.

Topic Modeling

Topic modeling is used to discover latent topics in a collection of documents. A very common topic modeling algorithm is LDA (Latent Dirichlet Allocation). Note that the number of topics is a hyperparameter of LDA: you choose it up front, and you can tune it by fitting models with different values and comparing them on a suitable metric, such as topic coherence. LDA is used by Airbnb for this purpose.
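To make the coherence idea concrete, here is a minimal pure-Python sketch of one common variant, UMass coherence. It scores a topic by how often its top words appear together in the same documents. The function names, the toy documents, and the choice of UMass rather than another coherence measure are assumptions made for illustration, not something the article prescribes.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic; higher means more coherent.

    top_words: the topic's top words, ordered from most to least probable.
    documents: tokenized documents (each an iterable of words).
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # In each pair, the earlier (more probable) word is the conditioning word.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
    return score


def mean_coherence(topics, documents):
    # Average coherence over all topics of one fitted model.
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


# Toy tokenized messages (hypothetical data).
docs = [
    ["card", "lost", "replace"],
    ["card", "lost", "block"],
    ["transfer", "pending", "delay"],
    ["transfer", "pending", "fee"],
]
coherent = umass_coherence(["card", "lost"], docs)    # words that co-occur
mixed = umass_coherence(["card", "pending"], docs)    # words that never co-occur
print(coherent > mixed)
```

To pick the number of topics, you would fit one model per candidate value, compute `mean_coherence` on each model's top words, and keep the value with the highest score.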

However, intents are often more specific than topics, so clustering of embeddings can be a useful alternative.

Clustering of embeddings

Intents can be identified by finding precise and narrow clusters. This is typically done in three steps:

  1. Obtain an embedding for each document. Google’s Universal Sentence Encoder (USE) and Sentence-BERT are popular sentence encoders for this purpose.
  2. Reduce the dimensionality of the embeddings. You can use techniques like PCA and UMAP. This step has been observed to improve the clustering results in the next step.
  3. Cluster the reduced embeddings. Density-based clustering algorithms, such as HDBSCAN, are typically used.
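The clustering step can be illustrated with a minimal sketch of density-based clustering. Note the assumptions: this is plain DBSCAN written in pure Python, not HDBSCAN, and the 2-D points are made-up stand-ins for sentence embeddings that have already gone through steps 1 and 2. It only shows why density-based methods suit intent discovery: you do not fix the number of clusters in advance, and isolated messages are labeled as noise instead of being forced into a cluster.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point, with -1 marking noise."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Hypothetical reduced embeddings: two tight groups and one isolated message.
reduced = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),  # e.g. "lost card" messages
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),  # e.g. "pending transfer" messages
    (9.0, 0.0),                          # a message unlike any other
]
labels = dbscan(reduced, eps=0.5, min_samples=3)
print(labels)
```

HDBSCAN builds on the same idea but explores a hierarchy of density levels instead of relying on a single fixed `eps`, which tends to be easier to tune on real embeddings.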

Datasets

The PolyAI team published a banking dataset (https://github.com/PolyAI-LDN/task-specific-datasets) that contains more than 10,000 messages spanning 77 intents, which you can use to test your algorithms. Keep in mind that in a real-world setting you would face additional challenges, such as identifying which message in each conversation contains the intent.

Other public datasets are hard to find because real-world data needs to be anonymized before it can be released.

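Code examples

  - Topic Modeling with LDA: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
  - Identify intents using clustering of embeddings: https://towardsdatascience.com/clustering-sentence-embeddings-to-identify-intents-in-short-text-48d22d3bf02e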


Stay up to date with the latest stories about applied Natural Language Processing and join the NLPlanet community on LinkedIn, Twitter, Facebook, and Telegram.
