The Problem With Semantic Search

Semantic search is an approach to retrieving information on the web that involves understanding the contextual meaning of search terms. Traditional search engines, like Google in its early days, relied on keyword-based search, where the results were based on the exact match of the search terms. However, semantic search goes a step further by understanding the intent behind the search query and the contextual meaning of the terms, providing more relevant and accurate results.

For example, if you search for “Apple”, a keyword-based search engine might return results about the fruit, the tech company, and possibly even the record company. However, a semantic search engine would analyze the context in which you’re searching to provide more relevant results. If your previous searches were about smartphones, it would prioritize results about Apple Inc., the tech company.
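The difference can be sketched with a toy example. It assumes that "context" is nothing more than the set of words from the user's previous searches. The documents, the `keyword_search` and `contextual_search` helpers, and the overlap-based scoring are all hypothetical and only illustrate the idea; real engines rely on far richer signals.

```python
import re

DOCS = [
    "Apple pie recipe with fresh fruit and cinnamon",
    "Apple unveils a new smartphone with a faster chip",
    "Apple Records released albums by the Beatles",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(query, docs):
    """Return every document containing all query words, in original order."""
    query_words = tokenize(query)
    return [doc for doc in docs if query_words <= tokenize(doc)]

def contextual_search(query, docs, history):
    """Re-rank keyword matches by word overlap with previous searches."""
    context = set()
    for past_query in history:
        context |= tokenize(past_query)
    matches = keyword_search(query, docs)
    return sorted(matches, key=lambda doc: len(tokenize(doc) & context), reverse=True)

history = ["best smartphone 2023", "smartphone chip benchmarks"]
print(keyword_search("apple", DOCS))                 # all three documents match
print(contextual_search("apple", DOCS, history)[0])  # the smartphone document comes first
```

The keyword search cannot tell the three senses apart, while the context-aware version promotes the document that shares vocabulary with the earlier smartphone searches.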

Semantic search uses various techniques including natural language processing (NLP), machine learning, and semantic understanding of text. It can involve understanding synonyms, homonyms, context, natural language queries, and even user behavior analysis.

Let’s look at a simple example of how semantic search can be implemented using Python. We’ll use a Python library called NLTK (Natural Language Toolkit) for this purpose.

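import nltk
from nltk.corpus import wordnet

# One-time download of the WordNet data (a no-op if it is already installed)
nltk.download("wordnet", quiet=True)

def semantic_search(query, text):
    """Return the best WordNet path similarity between two single words."""
    query_synsets = wordnet.synsets(query)
    text_synsets = wordnet.synsets(text)

    # path_similarity returns None when two senses are not connected
    # (for example, a noun and a verb), so drop those before calling max()
    scores = [
        score
        for s1 in query_synsets
        for s2 in text_synsets
        if (score := s1.path_similarity(s2)) is not None
    ]

    # default=0.0 covers words that WordNet does not know at all
    return max(scores, default=0.0)

# Example usage
print(semantic_search("apple", "fruit"))
print(semantic_search("apple", "company"))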

In this example, we’re using WordNet, a lexical database of English words that ships with NLTK. WordNet groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synsets or their members.

The semantic_search function takes two single words as input (WordNet looks up individual words, not free text) and returns the highest similarity found between any pair of their senses. The path_similarity method returns a score of at most 1 denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.
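To make the score concrete: path similarity is computed as 1 / (d + 1), where d is the number of edges on the shortest path between two senses in the taxonomy. The following is a minimal sketch on a made-up, five-entry taxonomy. The `HYPERNYMS` table and the helper names are illustrative assumptions, not part of WordNet or NLTK.

```python
from collections import deque

# Hypothetical mini-taxonomy: child -> parent (is-a) links.
HYPERNYMS = {
    "apple": "fruit",
    "banana": "fruit",
    "fruit": "food",
    "bread": "food",
    "food": "entity",
}

def neighbors(node):
    """Nodes exactly one is-a edge away, in either direction."""
    linked = {child for child, parent in HYPERNYMS.items() if parent == node}
    if node in HYPERNYMS:
        linked.add(HYPERNYMS[node])
    return linked

def shortest_path_length(start, goal):
    """Breadth-first search over the taxonomy; None if the nodes are unconnected."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(node) - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

def path_similarity(a, b):
    """1 / (shortest path length + 1), mirroring the path-based measure."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)

print(path_similarity("apple", "apple"))  # 1.0  (same node, distance 0)
print(path_similarity("apple", "fruit"))  # 0.5  (one edge apart)
print(path_similarity("apple", "bread"))  # 0.25 (apple -> fruit -> food -> bread)
```

Note that the score depends only on the distance in the taxonomy, so two words can be close in everyday usage and still score low if their senses sit on distant branches.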

This is a very basic example and real-world semantic search systems are much more complex, involving understanding of natural language queries, user behavior analysis, and use of machine learning algorithms to improve the search results over time.

A more complex illustration

Let’s create a more advanced example using the Gensim library in Python. Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy, and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and incremental online algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.

In this example, we’ll use Gensim’s Word2Vec model. Word2Vec, introduced by researchers at Google in 2013, embeds words in a dense, relatively low-dimensional vector space using a shallow neural network. The result is a set of word vectors where vectors close together in the space belong to words that appear in similar contexts and therefore tend to have similar meanings, while vectors far apart belong to words with differing meanings. For example, strong and powerful would be close together, and strong and Paris would be relatively far apart.

Here’s a simple example of how to train a Word2Vec model with Gensim:

import nltk
from gensim.models import Word2Vec
from nltk.corpus import brown

# One-time download of the Brown corpus (a no-op if it is already installed)
nltk.download("brown", quiet=True)

# Training the model using the Brown corpus
sentences = brown.sents()
model = Word2Vec(sentences, min_count=1)

# Getting the vector for a word (in Gensim 4.x, vectors live on model.wv)
print(model.wv['human'])

# Getting the most common words (index_to_key is ordered by descending frequency;
# Gensim 3.x called this attribute index2word)
print(model.wv.index_to_key[0], model.wv.index_to_key[1], model.wv.index_to_key[2])

# Getting the least common words (key_to_index replaces the old wv.vocab)
vocab_size = len(model.wv.key_to_index)
print(model.wv.index_to_key[vocab_size - 1], model.wv.index_to_key[vocab_size - 2], model.wv.index_to_key[vocab_size - 3])

# Finding most similar words
print(model.wv.most_similar('human'))

# Finding the word that doesn't belong
print(model.wv.doesnt_match("human computer interface tree".split()))

In this example, we’re training the Word2Vec model using the Brown corpus (a roughly one-million-word corpus of American English) from the NLTK library. The min_count parameter tells the model to ignore all words whose total frequency is lower than the given value; with min_count=1, as here, every word in the corpus is kept.

After training the model, we can get the vector for a word using the model’s dictionary. We can also get the most common and least common words in the corpus.

The most_similar function finds the top-N most similar words, which can be used to find words with similar context or meaning. The doesnt_match function finds the word in a list of words that is furthest away from the others in the vector space, which can be used to detect words that don't belong in a context.
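Conceptually, both operations reduce to comparing directions of vectors. The sketch below assumes that most_similar ranks words by cosine similarity and that doesnt_match picks the word least similar to the mean of the unit-length vectors. It uses hand-made 2-D vectors rather than trained embeddings, and the `VECTORS` table and helper functions are illustrative stand-ins, not Gensim code.

```python
import math

# Made-up 2-D "embeddings"; trained Word2Vec vectors typically have 100+ dimensions.
VECTORS = {
    "strong": (0.9, 0.1),
    "powerful": (0.8, 0.2),
    "mighty": (0.85, 0.15),
    "paris": (0.1, 0.9),
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def cosine(a, b):
    """Cosine similarity: dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    others = [(w, cosine(VECTORS[word], v)) for w, v in VECTORS.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:topn]

def doesnt_match(words):
    """Return the word whose vector is least similar to the group's mean direction."""
    units = [unit(VECTORS[w]) for w in words]
    mean = tuple(sum(column) / len(units) for column in zip(*units))
    return min(words, key=lambda w: cosine(VECTORS[w], mean))

print(most_similar("strong"))
print(doesnt_match(["strong", "powerful", "mighty", "paris"]))  # paris
```

With these toy vectors, "mighty" and "powerful" point in almost the same direction as "strong", while "paris" points elsewhere and is therefore flagged as the odd one out.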

How Google Search works: the power of semantic search

Google’s search engine is incredibly complex and uses a multitude of algorithms and methods to deliver search results. Here’s a simplified explanation of how it works:

Crawling and Indexing: Google discovers new pages by “crawling” the web. Google’s bots start with a list of web page URLs generated from previous crawls and then augment those pages with sitemap data provided by webmasters. As Google’s bots visit these web pages, they use links on those pages to discover other pages. The data gathered from these pages is then used to build an index of the web.
Ranking and Returning Results: When a user enters a query, Google’s algorithms look for clues to better understand what the user is seeking. The search algorithms analyze the user’s query and then search the index for matching pages. The algorithms rank the results based on hundreds of ranking factors, including things like the user’s location, language, device (desktop or phone), and previous queries.
Evaluating Usefulness: Google uses both automated systems and human evaluation to ensure that their search algorithms are returning useful and relevant results. They conduct live tests, collect user feedback, and have thousands of external Search Quality Raters from around the world who evaluate the quality of search results.
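The indexing and lookup steps can be illustrated with a toy inverted index. This is only a sketch of the general idea, not a description of Google's actual systems: the page URLs and texts are invented, and ranking by the number of matched query words stands in for the hundreds of real ranking factors.

```python
import re
from collections import defaultdict

# Hypothetical "crawled" pages: URL -> page text.
PAGES = {
    "a.example/pizza": "best pizza places in town",
    "b.example/pasta": "fresh pasta and pizza recipes",
    "c.example/bikes": "mountain bikes for sale",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Rank URLs by how many distinct query words they contain."""
    hits = defaultdict(int)
    for word in set(re.findall(r"[a-z]+", query.lower())):
        for url in index.get(word, ()):
            hits[url] += 1
    # Most matched words first; ties broken alphabetically by URL.
    return sorted(hits, key=lambda url: (-hits[url], url))

index = build_index(PAGES)
print(search(index, "pizza places"))  # ['a.example/pizza', 'b.example/pasta']
```

Because the index is built ahead of time, answering a query only requires looking up a few words rather than rescanning every page.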

Now, let’s talk about how Google uses semantic search. Google’s semantic search was significantly enhanced with the introduction of the Knowledge Graph and the Hummingbird algorithm update.

The Knowledge Graph is a knowledge base used by Google to enhance its search engine’s results with information gathered from a variety of sources. The information is presented to users in an infobox next to the search results. Knowledge Graph infoboxes were added to Google’s search engine in May 2012, starting in the United States, and have since been rolled out worldwide.

The Hummingbird update, announced in September 2013, was a complete overhaul of the core algorithm. With Hummingbird, Google moved beyond keyword matching to better understand context and user intent. For example, instead of just finding pages with matching words, Google can understand complex queries and provide results that match the overall intent, rather than individual keywords.

For instance, if you search for “places to eat pizza near me,” Google understands that you’re not just looking for any place that mentions “pizza,” but specifically restaurants that serve pizza in your current location. It’s this understanding of the meaning behind your search, rather than just the words themselves, that’s at the heart of semantic search.

Google continues to refine and enhance its search algorithms, and semantic understanding plays a big role in providing more useful and relevant search results.

The premises of semantic search

Language model-based applications, particularly those using transformer-based models like BERT, GPT-3, and others, have revolutionized the field of natural language processing. These models are trained to understand the semantic meaning of text, which makes them incredibly powerful tools for a variety of applications, including document similarity search, question answering, text generation, and more.

One of the key applications is the ability to “chat with your documents”. This essentially means that you can ask questions or make queries in natural language, and the model can understand the context of your question and provide relevant responses based on the content of the documents. This is similar to having a conversation with a human who has read and understood the documents.

The way this works is by using similarity search. When you make a query, the model converts your query into a high-dimensional vector using the language model. It does the same for all the documents or parts of the documents in your database. It then calculates the similarity between your query vector and all the document vectors, and returns the documents that are most similar to your query.

This is done using techniques like cosine similarity or dot product in the high-dimensional vector space. The idea is that if two vectors are close together in this space, the corresponding pieces of text have similar semantic meaning.
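The two measures differ in one respect: the dot product grows with the length of the vectors, while cosine similarity divides that length out and only compares direction. Here is a minimal sketch in plain Python; the three-dimensional vectors are invented for illustration, whereas real embeddings have hundreds of dimensions.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 2.0, 0.0]
doc_same_direction = [2.0, 4.0, 0.0]   # same direction as the query, twice as long
doc_orthogonal = [0.0, 0.0, 3.0]       # shares no direction with the query

print(dot(query, doc_same_direction))                # 10.0
print(cosine_similarity(query, doc_same_direction))  # approximately 1.0
print(cosine_similarity(query, doc_orthogonal))      # 0.0
```

When embeddings are normalized to unit length, the two measures produce the same ranking, which is why many systems normalize once and then use the cheaper dot product.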

For example, if you have a database of scientific papers and you ask a question like “What is the latest research on climate change?”, the model can return a list of papers that are most relevant to your question. It does this not just by looking for papers that contain the words “latest”, “research”, and “climate change”, but by understanding the semantic meaning of your question and finding papers that are semantically related to it.

This ability to understand the semantic meaning of text and perform similarity search has opened up a whole new range of possibilities for interacting with text data, making it much easier to find relevant information in large databases of documents.

Let’s illustrate this with a simple example using the sentence-transformers library in Python, which allows us to generate sentence embeddings for our text. We'll use these embeddings to calculate similarity and find the most relevant document for a given query.

First, let’s install the library:

pip install sentence-transformers

Now, let’s say we have a list of documents and a query. We can use the sentence-transformers library to find the document that is most similar to the query:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Our documents
documents = [
    "The sky is blue and beautiful.",
    "Love this blue and beautiful sky!",
    "The quick brown fox jumps over the lazy dog.",
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
    "I love green eggs, ham, sausages and bacon!",
    "The brown fox is quick and the blue dog is lazy!",
    "The sky is very blue and the sky is very beautiful today",
    "The dog is lazy but the brown fox is quick!"
]

# Our query
query = "The fox jumps over the dog"

# Generate embeddings for documents and query
document_embeddings = model.encode(documents)
query_embedding = model.encode([query])

# Calculate similarity scores
similarity_scores = cosine_similarity(query_embedding, document_embeddings)

# Find the most similar document
most_similar_idx = similarity_scores.argmax()
most_similar_document = documents[most_similar_idx]

print(f"The most similar document to the query is: '{most_similar_document}'")

In this example, we’re using the all-MiniLM-L6-v2 model from the sentence-transformers library to generate our embeddings. This model is trained to generate embeddings that are semantically meaningful, so texts that are semantically similar should have similar embeddings.

We then calculate the cosine similarity between the query embedding and each document embedding. The document with the highest similarity score is considered the most similar to the query.

If each document has multiple sentences, we can modify the code to generate embeddings for each sentence in the document. We can then calculate the similarity score for each sentence in the document and return the document that has the highest average similarity score. Here’s how you can do it:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
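import nltk

# sent_tokenize needs NLTK's Punkt sentence tokenizer data; recent NLTK
# releases look for "punkt_tab", older ones for "punkt"
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)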

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Our documents
documents = [
    "The sky is blue and beautiful. Love this blue and beautiful sky!",
    "The quick brown fox jumps over the lazy dog. The dog is lazy but the brown fox is quick!",
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans. I love green eggs, ham, sausages and bacon!",
    "The sky is very blue and the sky is very beautiful today. The dog is lazy but the brown fox is quick!"
]

# Our query
query = "The fox jumps over the dog"

# Generate embedding for the query
query_embedding = model.encode([query])

# Initialize an empty list to store average similarity scores for each document
average_similarity_scores = []

# For each document
for document in documents:
    # Split the document into sentences
    sentences = nltk.tokenize.sent_tokenize(document)
    
    # Generate embeddings for each sentence in the document
    sentence_embeddings = model.encode(sentences)
    
    # Calculate similarity scores for each sentence in the document
    similarity_scores = cosine_similarity(query_embedding, sentence_embeddings)
    
    # Calculate the average similarity score for the document
    average_similarity_score = np.mean(similarity_scores)
    
    # Add the average similarity score to our list
    average_similarity_scores.append(average_similarity_score)

# Find the document with the highest average similarity score
most_similar_idx = np.argmax(average_similarity_scores)
most_similar_document = documents[most_similar_idx]

print(f"The most similar document to the query is: '{most_similar_document}'")
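Averaging is only one way to turn per-sentence scores into a document score, and the choice matters. The sketch below uses invented similarity scores (not output from the model above) to show how the mean can favor a uniformly lukewarm document over one that contains a single highly relevant sentence, whereas taking the maximum does the opposite. The `best_document` helper is hypothetical.

```python
from statistics import mean

# Hypothetical per-sentence cosine scores against one query.
sentence_scores = {
    "doc_a": [0.92, 0.10, 0.08],  # one highly relevant sentence, the rest off-topic
    "doc_b": [0.45, 0.40, 0.42],  # every sentence only loosely related
}

def best_document(scores, aggregate):
    """Return the document whose aggregated sentence score is highest."""
    return max(scores, key=lambda doc: aggregate(scores[doc]))

print(best_document(sentence_scores, mean))  # doc_b
print(best_document(sentence_scores, max))   # doc_a
```

Which aggregation is preferable depends on the application: the mean rewards documents that stay on topic throughout, while the maximum rewards documents that contain at least one closely matching passage.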

Semantic search is a powerful tool that can greatly improve the relevance and quality of search results by understanding the intent and contextual meaning of search terms. However, it’s not without its limitations. One of the key challenges with semantic search is the assumption that the answer to a query is semantically similar to the query itself. This is not always the case, and it can lead to less than optimal results in certain situations.

Let’s consider an example. Suppose a user asks the question, “Who won the Nobel Prize in Literature in 2020?” The answer to this question is “Louise Glück”. However, the terms “Louise Glück” and “Who won the Nobel Prize in Literature in 2020” are not semantically similar. They belong to completely different semantic fields — one is a person’s name, and the other is a question about a specific event. A semantic search model might struggle to match these two pieces of information because they don’t share similar semantic features.

Another example could be a question like “What is the capital of Australia?”. The answer is “Canberra”. Again, the semantic relationship between the question and the answer is not based on similarity. The question is about geography and the answer is a specific city name.

In these cases, the semantic search model might fail to retrieve the correct answer because it’s looking for documents or sentences that are semantically similar to the query, not necessarily ones that contain the answer to the query.

This limitation of semantic search highlights the importance of using a combination of techniques in information retrieval and natural language understanding. While semantic search can greatly improve the relevance of search results, it’s also important to consider other factors, such as the structure and format of the information, the specific details being asked for in a query, and the context in which the query is being made.
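One way to combine techniques is to run a keyword-based ranker and an embedding-based ranker separately and then merge their rankings. The sketch below uses reciprocal rank fusion (RRF), a technique not named in this article and brought in here purely as an illustration. The two input rankings are hard-coded, hypothetical outputs, and k=60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from two independent retrievers for the same query.
keyword_ranking = ["glueck_bio", "nobel_history", "poetry_review"]
semantic_ranking = ["nobel_history", "nobel_faq", "glueck_bio"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)  # ['nobel_history', 'glueck_bio', 'nobel_faq', 'poetry_review']
```

Documents that appear in both lists rise to the top, so a document the semantic ranker placed low can still surface if the keyword ranker found it highly relevant.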

In the context of Large Language Models (LLMs) like GPT-3, BERT, or RoBERTa, semantic search plays a crucial role in information retrieval and question-answering tasks. These models are trained to understand the semantic meaning of text, and they use this understanding to generate responses or retrieve relevant information.

However, the assumption that the answer to a user’s question is semantically similar to the question itself can lead to challenges. For instance, when a user asks a question, the LLM might retrieve a set of documents or sentences that are semantically similar to the question. But these documents might not necessarily contain the answer to the user’s question.

Let’s consider a practical example. Suppose we have an LLM-based application for searching through scientific research papers. A user might ask the question, “What is the impact of climate change on polar bear populations?” The LLM could return several documents that discuss climate change and polar bears, because these documents are semantically similar to the question. However, these documents might not specifically address the impact of climate change on polar bear populations. The actual answer to the user’s question might be contained in a document that discusses various impacts of climate change on different animal species, and this document might not be considered semantically similar to the question because it covers a broader range of topics.

This illustrates a key challenge with semantic search in LLM-based applications: the assumption of semantic similarity between a question and its answer can lead to the retrieval of documents that are relevant to the question, but don’t necessarily contain the answer.

In conclusion, while semantic search is a powerful tool for understanding and retrieving information, it’s not a silver bullet. It’s important to understand its limitations and to use it in combination with other techniques to ensure the best results.
