Two minutes NLP — Basic taxonomy of Topic Tagging models and elementary use cases



LDA, NMF, Top2Vec, and WikiData

Comparison of extractive and predictive topic tagging. Image by the author.

Topic Tagging is the process of assigning topics to content of various forms, the most common being text.

Topic Modeling

Topic Modeling is a set of unsupervised techniques for extracting these topics, such as LDA (Latent Dirichlet Allocation), NMF (Non-negative Matrix Factorization), and Top2Vec. These techniques automatically detect topics specific to the corpus of documents analyzed, without the need for external taxonomies.

Extractive Topic Tagging

Extractive Topic Tagging works by detecting the keywords contained in a text and using their normalized forms as topics. These topics are often enriched with categories from open knowledge bases such as WikiData.
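The flow can be sketched with the Python standard library alone. Everything here is a simplifying assumption: keywords are picked by raw frequency, normalization is a crude plural-stripping rule, and `TOY_KNOWLEDGE_BASE` is a hypothetical stand-in for a real knowledge-base lookup.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are",
             "for", "on", "with", "by", "can"}

# Hypothetical stand-in for category lookups in a knowledge base.
TOY_KNOWLEDGE_BASE = {
    "python": "programming language",
    "model": "machine learning",
    "library": "software",
}

def normalize(token):
    """Lowercase a token and strip simple plurals (crude lemmatization)."""
    token = token.lower()
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def extract_topics(text, top_k=3):
    """Return the top_k most frequent keywords with their category, if known."""
    tokens = [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]
    keywords = [t for t in tokens if t not in STOPWORDS]
    counts = Counter(keywords)
    return [
        (keyword, TOY_KNOWLEDGE_BASE.get(keyword))
        for keyword, _ in counts.most_common(top_k)
    ]

text = ("Python libraries for topic models. A Python library can train a model, "
        "and Python models are easy to share in Python.")
topics = extract_topics(text)
print(topics)
# [('python', 'programming language'), ('model', 'machine learning'), ('library', 'software')]
```

Keywords with no entry in the lookup table come back with `None` as their category, so the enrichment step is optional by construction.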

Predictive Topic Tagging

Predictive Topic Tagging works with a predefined set of topics: a classification model is trained on several example texts for each topic. Since a text can carry more than one topic, this is a multilabel classification problem. The training set can be scraped from the web, since each published article with tags is a potential training sample.
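A minimal multilabel sketch with scikit-learn might look like the following. The six training texts, their tags, and the choice of TF-IDF features with one logistic regression per topic are all assumptions for illustration. A real model would need far more tagged articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training set (illustrative only): each text carries one or more tags.
train_texts = [
    "the team won the football match",
    "the striker scored a goal in the football final",
    "parliament passed the new tax law",
    "the president won the election vote",
    "the company reported record profit and revenue",
    "the football club signed a sponsorship deal with the company",
]
train_tags = [
    ["sports"],
    ["sports"],
    ["politics", "business"],
    ["politics"],
    ["business"],
    ["sports", "business"],
]

# Encode the tag lists as a binary indicator matrix, one column per topic.
binarizer = MultiLabelBinarizer()
y_train = binarizer.fit_transform(train_tags)

# One binary classifier per topic on top of TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(train_texts, y_train)

# Score a new text against every predefined topic.
new_text = "the football team won the final match"
probabilities = model.predict_proba([new_text])[0]
ranked_topics = sorted(
    zip(binarizer.classes_, probabilities),
    key=lambda pair: pair[1],
    reverse=True,
)
for topic, probability in ranked_topics:
    print(f"{topic}: {probability:.2f}")
```

Each topic gets its own independent probability, so a text can end up with several tags or none, depending on the threshold applied to the scores.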

Datasets

WikiData: a collaboratively edited multilingual knowledge graph hosted by the Wikimedia Foundation. It is a common source of open data that Wikimedia projects such as Wikipedia can use.

Media Topics: a constantly updated IPTC taxonomy of over 1,200 terms with a focus on categorizing text about media. It was originally based on the IPTC Subject Codes taxonomy.

20 newsgroups: a dataset comprising around 18,000 newsgroup posts on 20 topics.

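Use cases

Analysis of blog articles of companies to deduce their content marketing strategy (https://readmedium.com/content-marketing-analysis-with-nlp-github-vs-gitlab-b9ee114d5fb7).

Analysis of news data.

Quick organization of a corpus of documents by topics.

Code examples

Train a predictive topic tagging model: https://towardsai.net/p/l/how-to-train-a-topic-tagging-model-to-assign-high-quality-topics-to-articles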


Stay up to date with the latest stories about applied Natural Language Processing and join the NLPlanet community on LinkedIn, Twitter, Facebook, and Telegram.