Summary

KeyBERT is a keyword extraction tool that leverages BERT's semantic capabilities to identify relevant keywords from text documents.

Abstract

KeyBERT is a keyword extraction technique that utilizes the BERT language model to capture the semantic meaning of documents, addressing the limitations of traditional statistical methods. It embeds both the document and potential keywords into a shared space, allowing for the extraction of keywords based on their cosine similarity to the document embedding. KeyBERT offers methods like Max Sum Similarity (MSS) and Maximal Marginal Relevance (MMR) to ensure diversity among the extracted keywords. While KeyBERT is a powerful tool for NLP projects, its reliance on large BERT models may lead to longer processing times, which could be a limitation for real-time applications or environments without dedicated GPUs.

Opinions

The author acknowledges the effectiveness of traditional keyword extraction techniques but emphasizes the semantic advantage provided by KeyBERT.
KeyBERT is praised for its ease of use and its ability to integrate with other NLP tools and libraries, such as transformers, flair, gensim, spacy, and USE (Universal Sentence Encoder).
The author suggests that while BERT models are resource-intensive, there are potential workarounds to improve inference time, such as using smaller models like DistilBERT or converting models to ONNX format.
The author encourages readers to explore KeyBERT for their NLP projects, implying that it could be a valuable addition to their toolkit.

How to Extract Relevant Keywords with KeyBERT

Yet another application of BERT

There are many powerful techniques that perform keyword extraction (e.g. RAKE, YAKE!, TF-IDF). However, they are mainly based on the statistical properties of the text and don’t necessarily take into account the semantic aspects of the full document.

KeyBERT is a minimal and easy-to-use keyword extraction technique that aims at solving this issue. It leverages the BERT language model and relies on the 🤗transformers library.

source: https://github.com/MaartenGr/KeyBERT

KeyBERT is developed and maintained by Maarten Grootendorst. So go check his repo (and clone it) if you’re interested in using it.

In this post, I’ll briefly present KeyBERT: how it works and how you can use it.

PS: If you want to see a video tutorial on how to use KeyBERT and how to embed it in a Streamlit app, you can have a look at my video:

KeyBERT: a BERT-powered keyword extraction technique

You can install KeyBERT with pip.

pip install keybert

If you need embeddings from sources other than 🤗transformers, you can install them as well:

pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]

Calling KeyBERT is straightforward: you initialize a keyword extraction model based on a 🤗transformers model and apply the extract_keywords method on it.
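Here is a minimal sketch of that call. The helper names build_model and top_keywords are mine (they are not part of KeyBERT), and the model name is only an assumption about a reasonable default, so swap in whichever model you prefer:

```python
def build_model(model_name="all-MiniLM-L6-v2"):
    """Create a KeyBERT extractor (needs `pip install keybert`)."""
    # Imported lazily because loading KeyBERT pulls in heavy ML dependencies.
    from keybert import KeyBERT

    return KeyBERT(model=model_name)


def top_keywords(kw_model, doc, top_n=5):
    """Return a list of (keyword, score) tuples for `doc`."""
    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 1),  # single words only
        stop_words="english",          # drop common English stop words
        top_n=top_n,
    )
```

A typical use would be top_keywords(build_model(), doc). Expect the first call to be slow, since the model weights have to be downloaded.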

How does KeyBERT extract keywords?

KeyBERT extracts keywords by performing the following steps:

1 — The input document is embedded using a pre-trained BERT model. You can pick any BERT model you want from 🤗transformers. This turns a chunk of text into a fixed-size vector that is meant to represent the semantic aspect of the document.

2 — Keywords and expressions (n-grams) are extracted from the same document using bag-of-words techniques (such as a TfidfVectorizer or CountVectorizer). This is a classical step that you may be familiar with if you’ve performed keyword extraction in the past.

3 — Each keyword is then embedded into a fixed-size vector with the same model used to embed the document.

4 — Now that the keywords and the document are represented in the same space, KeyBERT computes a cosine similarity between the keyword embeddings and the document embedding. Then, the most similar keywords (with the highest cosine similarity score) are extracted.

The idea is pretty simple: you can think of it as an enhanced version of a classical keyword extraction technique in which the BERT language model comes in to add its semantic capability.
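To make step 4 concrete, here is a small pure-Python sketch of the ranking. The vectors are made up by hand and only stand in for real BERT embeddings, and the function names are hypothetical; this is not KeyBERT's own code:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def rank_candidates(doc_vec, cand_vecs, top_n=5):
    """Return the top_n (keyword, score) pairs, most similar to the document first."""
    scored = [(kw, cosine(vec, doc_vec)) for kw, vec in cand_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up 3-d vectors; a real run would get these from the embedding model.
doc_vec = [0.9, 0.1, 0.3]
cand_vecs = {
    "supervised learning": [0.8, 0.2, 0.3],
    "training data": [0.7, 0.0, 0.6],
    "weather": [0.0, 1.0, 0.1],
}

top = rank_candidates(doc_vec, cand_vecs, top_n=2)
print(top)  # "supervised learning" first, then "training data"
```

The candidate pointing in almost the same direction as the document vector wins, while "weather", which points elsewhere, is left out.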

It doesn’t stop there: KeyBERT includes two methods to introduce diversity in the resulting keywords.

1 — Max Sum Similarity (MSS)

To use this method, you start by setting the top_n argument to a value, say 20. The 2 x top_n keywords most similar to the document are then extracted, and pairwise similarities are computed between them. Finally, the method returns the combination of top_n keywords that are the least similar to each other.
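The selection logic can be sketched in a few lines of pure Python. This is my own simplified reading of the idea, with hand-made vectors and hypothetical names (max_sum_select, and a nr_candidates argument for the size of the candidate pool), not the library's implementation:

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def max_sum_select(doc_vec, cand_vecs, top_n, nr_candidates):
    """Pick the top_n keywords from the candidate pool that overlap the least."""
    # 1) Keep only the nr_candidates keywords closest to the document.
    ranked = sorted(
        cand_vecs, key=lambda kw: cosine(cand_vecs[kw], doc_vec), reverse=True
    )
    pool = ranked[:nr_candidates]

    # 2) Among all top_n-sized subsets of the pool, keep the one whose
    #    members are the least similar to each other.
    def redundancy(subset):
        return sum(
            cosine(cand_vecs[a], cand_vecs[b]) for a, b in combinations(subset, 2)
        )

    return list(min(combinations(pool, top_n), key=redundancy))


# Made-up 2-d vectors standing in for real embeddings.
doc_vec = [1.0, 1.0]
cand_vecs = {
    "deep learning": [1.0, 0.8],
    "deep learning model": [1.0, 0.7],
    "labelled data": [0.3, 1.0],
    "cooking": [-1.0, 0.2],
}

picked = max_sum_select(doc_vec, cand_vecs, top_n=2, nr_candidates=3)
print(picked)
```

Note that the search over all combinations is brute force, so the cost grows quickly with the size of the pool.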

Here’s an example from the KeyBERT repository:
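The call below is a sketch modelled on the repository's README rather than a verbatim copy. The wrapper name extract_with_max_sum is hypothetical, and the parameter values are only illustrative. In the KeyBERT versions I am aware of, the size of the candidate pool is set through nr_candidates:

```python
def extract_with_max_sum(kw_model, doc, top_n=5, nr_candidates=20):
    """Ask a KeyBERT model for diverse keyphrases using the max-sum strategy."""
    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(3, 3),  # three-word keyphrases
        stop_words="english",
        use_maxsum=True,               # enable the max-sum diversification
        nr_candidates=nr_candidates,   # size of the pool to pick from
        top_n=top_n,                   # number of keyphrases returned
    )
```

Here kw_model is expected to be a KeyBERT instance, for example one created with KeyBERT() after from keybert import KeyBERT.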

2 — Maximal Marginal Relevance (MMR)

This method is similar to the previous one, except that it adds a diversity argument.

MMR tries to minimize redundancy and maximize the diversity of results in text summarization tasks.

It starts by selecting the keyword that is the most similar to the document. Then, it iteratively selects new candidates that are both similar to the document and not similar to the already selected keywords.
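As a rough pure-Python sketch of that loop (hand-made vectors, hypothetical function names, and a scoring rule of (1 - diversity) x relevance minus diversity x redundancy, which is the usual way MMR is written and is an assumption about the details here):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def mmr_select(doc_vec, cand_vecs, top_n, diversity):
    """Greedy MMR: trade relevance to the document against redundancy."""
    if not cand_vecs:
        return []
    relevance = {kw: cosine(vec, doc_vec) for kw, vec in cand_vecs.items()}

    # Start with the keyword closest to the document.
    selected = [max(relevance, key=relevance.get)]
    remaining = [kw for kw in cand_vecs if kw != selected[0]]

    while remaining and len(selected) < top_n:

        def mmr_score(kw):
            # Redundancy = similarity to the closest already selected keyword.
            redundancy = max(cosine(cand_vecs[kw], cand_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[kw] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up 2-d vectors standing in for real embeddings.
doc_vec = [1.0, 1.0]
cand_vecs = {
    "deep learning": [1.0, 0.8],
    "deep learning model": [1.0, 0.7],
    "labelled data": [0.3, 1.0],
}

low = mmr_select(doc_vec, cand_vecs, top_n=2, diversity=0.2)
high = mmr_select(doc_vec, cand_vecs, top_n=2, diversity=0.7)
print(low)   # the two near-duplicate "deep learning" phrases
print(high)  # "deep learning" plus the less redundant "labelled data"
```

With a low diversity value the second pick is almost a duplicate of the first. Raising it pushes the method toward a keyword that covers a different part of the document.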

You can choose a low-diversity threshold:

or a high one:
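Both settings can be tried with one small helper. This is a sketch: the wrapper name extract_with_mmr is hypothetical, and the diversity values in the comments are only illustrative choices for "low" and "high":

```python
def extract_with_mmr(kw_model, doc, diversity, top_n=5):
    """Ask a KeyBERT model for keyphrases diversified with MMR."""
    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(3, 3),  # three-word keyphrases
        stop_words="english",
        use_mmr=True,                  # enable Maximal Marginal Relevance
        diversity=diversity,           # between 0 (relevance only) and 1
        top_n=top_n,
    )


# Illustrative calls, assuming kw_model is a KeyBERT instance:
#   extract_with_mmr(kw_model, doc, diversity=0.2)  # low diversity
#   extract_with_mmr(kw_model, doc, diversity=0.7)  # high diversity
```

Comparing the two outputs on your own documents is probably the quickest way to find a value that suits your use case.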

So far so good, but…

One limitation KeyBERT may suffer from, though, is execution time: if you have large documents and need real-time results, KeyBERT may not be the best solution (unless you have dedicated GPUs in your production environment). The reason is that BERT models are notoriously huge and consume a lot of resources, especially when they have to process large documents.

You can probably find some hacks to speed up inference: picking a smaller model (DistilBERT), using mixed precision, or even converting your model to ONNX format.

If this still doesn’t work out for you, check other classical methods: you’d be surprised by their efficiency despite their relative simplicity.

Thanks for reading!

That’s it for today. I hope you’ll find this small method useful for your NLP projects if you’re performing keyword extraction.

You can learn more about KeyBERT here:

MaartenGr/KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and…

github.com

Keyword Extraction with BERT

A minimal method for extracting keywords and keyphrases

towardsdatascience.com

and here:

Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling

In this paper we propose a novel self-supervised approach of keywords and keyphrases retrieval and extraction by an…

www.preprints.org

Take care,
