Christian Bernecker

NLP SIMILARITY: Use pretrained word embeddings for semantic similarity search with BERT Transformers

This article describes how to use pretrained word embeddings to measure document similarity and to perform a semantic similarity search. First you get an introduction to the advantages and the different use cases of embeddings. Then I’ll present a framework that you can use for semantic similarity search. At the end you will find a snippet to build your own similar-search queries like in the example below.

Photo by visuals on Unsplash

What are embeddings?

Embeddings are an advanced NLP technique that outperforms traditional methods like TF-IDF. Word embeddings represent words as dense vectors of numbers in a continuous vector space. These vectors capture the meaning and context of a word in a way that allows them to be used as input to machine learning models. The following picture demonstrates a simple representation of embeddings:

Word Embeddings
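As a quick illustration, here is a minimal sketch (using the sentence-transformers library, which also powers the examples later in this article) that encodes a short piece of text and shows the resulting vector:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode a single sentence into one dense vector
vector = model.encode('The cat sits outside')

print(vector.shape)  # (384,) -> this model produces 384-dimensional embeddings
print(vector[:5])    # the first few components of the vector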

Advantages over traditional methods like TF-IDF:

  1. Dimensionality reduction: Word embeddings represent words as dense, low-dimensional vectors, while TF-IDF results in a high-dimensional sparse representation. This makes it easier to work with word embeddings in downstream machine learning tasks.
  2. Semantic meaning: Word embeddings capture the semantic meaning of words, which means that similar words will have similar vectors. This is not the case with TF-IDF, which only considers the frequency of words in a document.
  3. Handling Out-of-vocabulary words: Subword-based embedding models (such as fastText or BERT’s WordPiece tokenizer) can handle new, unseen words, whereas a TF-IDF model has to be refitted before it can represent them.
  4. Handling Synonyms: Word embeddings handle synonyms in a more elegant way. Words that are semantically similar will have similar embeddings, whereas with TF-IDF, synonyms end up with completely different feature representations (see the sketch after this list).
  5. Handling Polysemy: Contextual embeddings such as ELMo or BERT handle polysemy (words with multiple meanings) better than TF-IDF, because each occurrence of a word receives an embedding that depends on its surrounding context.
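To make points 2 and 4 concrete, here is a minimal comparison sketch (the sentence pair is an illustrative assumption, and scikit-learn is used for the TF-IDF side). The two sentences share almost no content words but mean roughly the same thing:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util

docs = ['The film was fantastic', 'The movie was great']

# TF-IDF only sees the overlapping tokens ('the', 'was'), so the score stays comparatively low
tfidf = TfidfVectorizer().fit_transform(docs)
print('TF-IDF cosine:', cosine_similarity(tfidf[0], tfidf[1])[0][0])

# Embeddings place the two sentences close together because their meaning is similar
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(docs, convert_to_tensor=True)
print('Embedding cosine:', util.cos_sim(embeddings[0], embeddings[1]).item())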

What can you do with embeddings?

  1. Text classification: You can use the embeddings as features for a machine learning model to classify the input text into different categories or labels.
  2. Text similarity: You can use the embeddings to measure the similarity between two or more input texts, allowing you to identify duplicate or near-duplicate content.
  3. Text clustering: You can use the embeddings to group similar input texts together, allowing you to explore and understand patterns and themes in large collections of text data (a short sketch follows this list).
  4. Information retrieval: You can use the embeddings to index and retrieve relevant input texts based on their similarity to a given query.
  5. Language modeling: You can use the embeddings as input to a language model to generate new text that is similar in style and content to the input text.
  6. Language translation: You can use the embeddings to train a neural machine translation model to translate text from one language to another.
  7. Sentiment Analysis: You can use the embeddings as input to a model that can classify the sentiment of a text as positive, negative or neutral.
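As an example of point 3, here is a minimal clustering sketch (the sentences and the number of clusters are illustrative assumptions; scikit-learn provides the k-means implementation):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = ['The cat sits outside',
             'The dog plays in the garden',
             'The new movie is awesome',
             'The new movie is so great']

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# Group the embeddings into two clusters; semantically similar sentences should end up together
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)

for label, sentence in zip(labels, sentences):
    print(label, sentence)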

How to use pretrained embeddings:

There are several sources of pre-trained word embeddings that you can use in your projects. Here are some of the most popular ones:

  1. GloVe: The Global Vectors for Word Representation is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
  2. Word2Vec: Developed by Google, Word2Vec is another popular method for learning word embeddings. It uses a shallow neural network architecture to learn the embeddings and comes in two flavors: CBOW and Skip-gram.
  3. FastText: Developed by Facebook, FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
  4. ELMo: Developed by the Allen Institute for AI, ELMo provides deep contextualized word representations, pre-trained on a large text corpus.
  5. BERT: Developed by Google AI and introduced by Devlin et al. in 2018. The paper describes the model and its pre-training method and shows how it can be fine-tuned for a wide range of natural language understanding tasks.

It’s important to note that these pre-trained embeddings are trained on different corpora and with different architectures, so it’s best to use the one that fits your task the most.
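For the static embeddings above (GloVe, Word2Vec, fastText), a common way to experiment is gensim’s downloader API. Here is a minimal sketch, assuming you have gensim installed; the chosen model name is just one of the pretrained options it offers:

import gensim.downloader as api

# Downloads and loads 100-dimensional GloVe vectors (Wikipedia + Gigaword) on first use
vectors = api.load('glove-wiki-gigaword-100')

print(vectors['king'][:5])                   # raw embedding vector for a single word
print(vectors.most_similar('king', topn=3))  # nearest neighbours in the embedding space
print(vectors.similarity('movie', 'film'))   # cosine similarity between two words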

An example from SBERT for Semantic Textual Similarity

https://www.sbert.net/docs/usage/semantic_textual_similarity.html
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the aligned pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
Output:

The cat sits outside       The dog plays in the garden  Score: 0.2838 
A man is playing guitar    A woman watches TV           Score: -0.0327 
The new movie is awesome   The new movie is so great    Score: 0.8939
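Note that util.cos_sim returns the full 3x3 similarity matrix, not just the aligned pairs. Continuing the snippet above, a small optional extension ranks every cross pair:

# Rank all sentence pairs across the two lists by their cosine score
pairs = []
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        pairs.append((cosine_scores[i][j].item(), sentences1[i], sentences2[j]))

for score, s1, s2 in sorted(pairs, reverse=True):
    print("{:.4f} \t {} \t {}".format(score, s1, s2))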

Semantic search for finding similar search queries

from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['is tiktok getting banned.',
          'is tiktok shutting down.',
          'is tiktok banned in us.',
          'is tiktok getting deleted.',
          'is tiktok a competitor for instagram.',
          'is instagram a useful app',
          'something unrelevant'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['is tiktok']


# Find the closest sentences of the corpus for each query sentence based on cosine similarity
top_k = min(7, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine similarity and torch.topk to find the highest scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop {} most similar sentences in corpus:".format(top_k))

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
Output:

Query: is tiktok

Top 7 most similar sentences in corpus:
is tiktok banned in us.                 (Score: 0.7735)
is tiktok getting banned.               (Score: 0.7716)
is tiktok getting deleted.              (Score: 0.7651)
is tiktok a competitor for instagram.   (Score: 0.7560)
is tiktok shutting down.                (Score: 0.7308)
is instagram a useful app               (Score: 0.1481)
something unrelevant                    (Score: 0.0925)
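sentence-transformers also ships a helper, util.semantic_search, that performs the cosine scoring and top-k ranking in one call. Reusing embedder, corpus, corpus_embeddings, queries and top_k from the snippet above, the loop can be condensed roughly like this:

from sentence_transformers import util

query_embeddings = embedder.encode(queries, convert_to_tensor=True)

# Returns one list of {'corpus_id': ..., 'score': ...} dicts per query, sorted by score
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=top_k)

for query, query_hits in zip(queries, hits):
    print("Query:", query)
    for hit in query_hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))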

Leave a comment if you have any questions or recommendations, or if something is not clear, and I’ll try to answer as soon as possible.

If you want to scale this solution, you need at least a vector database. In the following article I describe how to do that.
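To give a rough idea of what that looks like, here is a minimal sketch using FAISS as a simple in-memory vector index (an illustrative stand-in for a full vector database, reusing the model and a few corpus entries from the example above):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = ['is tiktok getting banned.',
          'is tiktok shutting down.',
          'is tiktok banned in us.']

# Normalized embeddings + inner-product index = cosine similarity search
corpus_embeddings = model.encode(corpus, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(int(corpus_embeddings.shape[1]))
index.add(corpus_embeddings)

query_embedding = model.encode(['is tiktok'], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query_embedding, 3)

for score, idx in zip(scores[0], ids[0]):
    print(corpus[idx], "(Score: {:.4f})".format(score))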

Don’t forget to clap if you find this helpful:

Like — Share — Comment — Follow
Bert
NLP
Semantic Search
Similarity Search
Transformers