KeyBERT is a keyword extraction tool that leverages BERT's semantic capabilities to identify relevant keywords from text documents.
Abstract
KeyBERT is a keyword extraction technique that utilizes the BERT language model to capture the semantic meaning of documents, addressing the limitations of traditional statistical methods. It embeds both the document and potential keywords into a shared space, allowing for the extraction of keywords based on their cosine similarity to the document embedding. KeyBERT offers methods like Max Sum Similarity (MSS) and Maximal Marginal Relevance (MMR) to ensure diversity among the extracted keywords. While KeyBERT is a powerful tool for NLP projects, its reliance on large BERT models may lead to longer processing times, which could be a limitation for real-time applications or environments without dedicated GPUs.
Opinions
The author acknowledges the effectiveness of traditional keyword extraction techniques but emphasizes the semantic advantage provided by KeyBERT.
KeyBERT is praised for its ease of use and its ability to integrate with other NLP tools and libraries, such as transformers, flair, gensim, spaCy, and USE (Universal Sentence Encoder).
The author suggests that while BERT models are resource-intensive, there are potential workarounds to improve inference time, such as using smaller models like DistilBERT or converting models to ONNX format.
The author encourages readers to explore KeyBERT for their NLP projects, implying that it could be a valuable addition to their toolkit.
How to Extract Relevant Keywords with KeyBERT
Yet another application of BERT
Image by the author
There are many powerful techniques for keyword extraction (e.g. RAKE, YAKE!, TF-IDF). However, they rely mainly on the statistical properties of the text and don’t necessarily take into account the semantics of the full document.
KeyBERT is a minimal and easy-to-use keyword extraction technique that aims to solve this issue. It leverages the BERT language model and relies on the 🤗transformers library (by default through the sentence-transformers package, which is built on top of it).
Calling KeyBERT is straightforward: you initialize a keyword extraction model based on a 🤗transformers model and apply the extract_keywords method on it.
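As a rough illustration of that call, here is a minimal sketch. It assumes the keybert package is installed; the helper name top_keywords and the model name "all-MiniLM-L6-v2" are choices made for this example, not something prescribed by the article.

```python
def top_keywords(doc, top_n=5):
    """Return the top_n (keyword, score) pairs that KeyBERT finds in doc."""
    # Imported lazily so this file can be loaded even if keybert is not installed.
    from keybert import KeyBERT

    # Assumption: a small sentence-transformers model; any supported model name works.
    kw_model = KeyBERT(model="all-MiniLM-L6-v2")

    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 2),  # consider single words and two-word phrases
        stop_words="english",          # drop English stop words from candidates
        top_n=top_n,                   # number of keywords to return
    )
```

Calling top_keywords("some long document ...") should return a list of (keyword, score) tuples sorted by decreasing similarity to the document. The model is downloaded the first time it is used, so the first call can be slow.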
KeyBERT extracts keywords by performing the following steps:
1 — The input document is embedded using a pre-trained BERT model. You can pick any BERT model you want from 🤗transformers. This turns a chunk of text into a fixed-size vector that is meant to represent the semantic aspect of the document.
2 — Keywords and expressions (n-grams) are extracted from the same document using bag-of-words techniques (such as a TfidfVectorizer or CountVectorizer). This is a classical step that you may be familiar with if you’ve performed keyword extraction in the past.
3 — Each keyword is then embedded into a fixed-size vector with the same model used to embed the document.
4 — Now that the keywords and the document are represented in the same space, KeyBERT computes a cosine similarity between the keyword embeddings and the document embedding. Then, the most similar keywords (with the highest cosine similarity score) are extracted.
The idea is pretty simple: you can think of it as an enhanced version of a classical keyword extraction technique in which the BERT language model comes in to add its semantic capability.
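The four steps above can be sketched in plain Python. This is only an illustration of the pipeline, not KeyBERT's actual code: toy_embed is a hypothetical stand-in that uses word counts instead of a BERT model, and the stop-word list and helper names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not KeyBERT's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "that", "it", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def candidates(text, max_n=2):
    """Step 2: unique n-grams (1..max_n words) that do not start or end with a stop word."""
    tokens = tokenize(text)
    found = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            phrase = " ".join(gram)
            if phrase not in found:
                found.append(phrase)
    return found

def toy_embed(text):
    """Steps 1 and 3: stand-in for BERT, a sparse word-count vector."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_keywords(doc, top_n=3):
    """Step 4: score every candidate against the document and keep the best top_n."""
    doc_vec = toy_embed(doc)
    scored = [(c, cosine(toy_embed(c), doc_vec)) for c in candidates(doc)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

doc = "Keyword extraction finds the keyword phrases that summarize a document."
ranked = rank_keywords(doc)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

With a real BERT model in place of toy_embed, candidates that share no words with the document can still score highly, which is where the semantic advantage comes from.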
It doesn’t stop there: KeyBERT includes two methods to introduce diversity in the resulting keywords.
1 — Max Sum Similarity (MSS)
To use this method, you start by setting the top_n argument to a value, say 20. Then the 2 x top_n keywords most similar to the document are extracted, and pairwise similarities are computed between them. Finally, the method returns the combination of top_n keywords that are the least similar to each other.
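A minimal sketch of this selection, assuming the similarities have already been computed. The function name max_sum_selection, the nr_candidates parameter of this sketch, and all similarity values are made up for illustration; the search is a brute-force pass over combinations, which is only practical for small candidate pools.

```python
from itertools import combinations

def max_sum_selection(doc_sims, pair_sims, top_n, nr_candidates):
    """Pick top_n keywords from the nr_candidates most document-similar ones,
    choosing the combination whose pairwise similarities sum to the smallest value."""
    # Keep only the candidates closest to the document.
    pool = sorted(doc_sims, key=doc_sims.get, reverse=True)[:nr_candidates]

    def redundancy(combo):
        # Sum of similarities over every pair of keywords in the combination.
        return sum(pair_sims[frozenset(pair)] for pair in combinations(combo, 2))

    return list(min(combinations(pool, top_n), key=redundancy))

# Hypothetical keyword-to-document similarities.
doc_sims = {
    "machine learning": 0.80,
    "learning algorithm": 0.78,
    "supervised learning": 0.75,
    "training data": 0.60,
    "labeled examples": 0.40,
}

# Hypothetical keyword-to-keyword similarities for the four best candidates.
pair_sims = {
    frozenset(("machine learning", "learning algorithm")): 0.90,
    frozenset(("machine learning", "supervised learning")): 0.85,
    frozenset(("machine learning", "training data")): 0.30,
    frozenset(("learning algorithm", "supervised learning")): 0.80,
    frozenset(("learning algorithm", "training data")): 0.35,
    frozenset(("supervised learning", "training data")): 0.40,
}

picked = max_sum_selection(doc_sims, pair_sims, top_n=2, nr_candidates=4)
print(picked)
```

Here the pool is 2 x top_n = 4 candidates, and the pair with the lowest mutual similarity wins even though "learning algorithm" is closer to the document than "training data".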
2 — Maximal Marginal Relevance (MMR)
This method is similar to the previous one, but it adds a diversity argument.
MMR tries to minimize redundancy and maximize the diversity of results in text summarization tasks.
It starts by selecting the keyword that is the most similar to the document. Then, it iteratively selects new candidates that are both similar to the document and dissimilar to the already selected keywords.
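The iterative selection can be sketched as follows. This is an illustrative implementation of the usual MMR scoring rule, (1 - diversity) x similarity to the document minus diversity x highest similarity to an already selected keyword. The function name mmr_selection and all similarity values are assumptions made for the example.

```python
def mmr_selection(doc_sims, pair_sims, top_n, diversity):
    """Greedy MMR: balance relevance to the document against redundancy
    with the keywords that were already selected."""
    # Start with the single keyword closest to the document.
    selected = [max(doc_sims, key=doc_sims.get)]
    remaining = [k for k in doc_sims if k != selected[0]]

    while remaining and len(selected) < top_n:
        def score(cand):
            # Redundancy is the highest similarity to any selected keyword.
            redundancy = max(pair_sims[frozenset((cand, s))] for s in selected)
            return (1 - diversity) * doc_sims[cand] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical keyword-to-document similarities.
doc_sims = {
    "machine learning": 0.80,
    "learning algorithm": 0.78,
    "supervised learning": 0.75,
    "training data": 0.60,
}

# Hypothetical keyword-to-keyword similarities.
pair_sims = {
    frozenset(("machine learning", "learning algorithm")): 0.90,
    frozenset(("machine learning", "supervised learning")): 0.85,
    frozenset(("machine learning", "training data")): 0.30,
    frozenset(("learning algorithm", "supervised learning")): 0.80,
    frozenset(("learning algorithm", "training data")): 0.35,
    frozenset(("supervised learning", "training data")): 0.40,
}

diverse = mmr_selection(doc_sims, pair_sims, top_n=2, diversity=0.7)
relevant = mmr_selection(doc_sims, pair_sims, top_n=2, diversity=0.0)
print(diverse)
print(relevant)
```

With a high diversity value the second pick is the least redundant keyword; with diversity set to 0 the method falls back to plain ranking by similarity to the document.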
One limitation that KeyBERT may suffer from, though, is execution time: if you have large documents and need real-time results, KeyBERT may not be the best solution (unless you have dedicated GPUs in your production environment). The reason is that BERT models are notoriously large and consume a lot of resources, especially when they have to process long documents.
You can probably speed up inference by picking a smaller model (such as DistilBERT), using mixed precision, or even converting your model to ONNX format.
If this still doesn’t work out for you, check other classical methods: you’d be surprised by their efficiency despite their relative simplicity.
Thanks for reading!
That’s it for today. I hope you’ll find this small method useful for your NLP projects if you’re performing keyword extraction.