Combining Elasticsearch And Semantic Search: A Case Study (Part 1)
Search engines are becoming smarter, more intuitive, and more responsive to user needs. Among the variety of tools and techniques available, Elasticsearch has established itself as a powerful platform for lightning-fast search operations. But what if we could further amplify its capabilities?
In this article, we embark on an exploratory journey to meld the precision of Elasticsearch with the nuanced understanding of semantic search.
Through a hands-on case study, we’ll scrutinize the outcomes of this blend and evaluate its efficacy. From initial successes to baffling challenges, our experiments offer insights into the promises and pitfalls of integrating traditional keyword search with cutting-edge semantic algorithms.
Dive in as we decode the intricacies of modern search and lay the groundwork for the next wave of innovation in information retrieval.
Basic Keyword Search
Let’s begin by exploring the foundation of most search systems: basic keyword search.
This conventional method, which relies heavily on exact keyword matches, has been the bedrock of information retrieval for years. But how does it fare in real-world scenarios? And more importantly, what potential lies beyond it?
Here is the code to do that using Elasticsearch.
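Below is a minimal sketch of that setup. It assumes the official `elasticsearch` Python client talking to a local node at `http://localhost:9200`, and Hugging Face `transformers` with `bert-base-uncased` for the embeddings; the index name `wikipedia` and the field names are illustrative choices, not prescriptive:

```python
from elasticsearch import Elasticsearch
from transformers import BertTokenizer, BertModel
import torch

es = Elasticsearch("http://localhost:9200")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str):
    """Mean-pool BERT token embeddings into one document-level vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state.mean(dim=1).squeeze().numpy()

def index_article(doc_id: str, title: str, text: str):
    """Store an article together with its embedding for later semantic use."""
    es.index(
        index="wikipedia",  # illustrative index name
        id=doc_id,
        document={
            "title": title,
            "text": text,
            "embedding": embed(text).tolist(),
        },
    )

def keyword_search(query: str, size: int = 5):
    """Plain keyword (BM25) search over the article text."""
    resp = es.search(index="wikipedia", query={"match": {"text": query}}, size=size)
    return [(hit["_source"]["title"], hit["_score"]) for hit in resp["hits"]["hits"]]
```

Storing the embedding alongside each article at indexing time means the semantic experiment later in this post can reuse the vectors without re-encoding the whole corpus.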
In this first experiment, I created an index to store all the Wikipedia articles I had downloaded, then queried that index with user queries. This traditional keyword-based approach is already quite powerful. If the user asks who Obama was, for example, here are the results they would get:
As you can see, the article about Obama comes in first, followed by the one about Biden, which makes sense: Biden was Obama’s vice president.
We could already use these results to power a RAG system, and keyword search has real advantages in that setting: it is cheaper (no vector embeddings to compute, no semantic similarity to evaluate) and it is faster. But for some questions, keyword search might also underperform.
So, what if we added semantic search on top of the traditional keyword-based approach? If you read the lines of code shared above, you might have noticed that I was already generating text embeddings with the BERT model. Those embeddings will be useful for the second part of this blog post.
Blending Keyword Search And Semantic Search
My first idea here was to implement a two-stage system: first, retrieve the most relevant documents with keyword search; then, re-rank those candidates using document-level embeddings.
Here is the code to do that:
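Here is a rough sketch of that two-stage logic, reusing the `es` client and `embed` helper from the earlier snippet and assuming each indexed article carries the `embedding` field stored above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_stage_search(query: str, candidates: int = 20, top_n: int = 5):
    # Stage 1: keyword (BM25) retrieval of candidate documents.
    resp = es.search(index="wikipedia", query={"match": {"text": query}}, size=candidates)

    # Stage 2: re-rank candidates by cosine similarity between the
    # query embedding and each stored document-level embedding.
    query_vec = embed(query)
    scored = [
        (
            hit["_source"]["title"],
            cosine_similarity(query_vec, hit["_source"]["embedding"]),
            hit["_source"]["text"][:200],
        )
        for hit in resp["hits"]["hits"]
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Simple interactive loop mirroring the session shown below.
while True:
    question = input("Enter your question (or type 'exit' to quit): ")
    if question.lower() == "exit":
        break
    print("Results:")
    for rank, (title, score, snippet) in enumerate(two_stage_search(question), start=1):
        print(f"{rank}. **Title**: {title}")
        print(f"   - **Similarity Score**: {score:.4f}")
        print(f"   - **Snippet**: {snippet}...")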
The results were surprisingly bad. I tried several times and got semantic scores that were very counterintuitive. Look at the example below:
Enter your question (or type 'exit' to quit): who was obama?
Results:
1. **Title**: Trump
   - **Similarity Score**: 0.4779
   - **Snippet**: The trumpet is a brass instrument commonly used in classical and jazz ensembles. The trumpet group ranges from the piccolo trumpet—with the highest register in the brass family—to the bass trumpet, pi...
2. **Title**: Donald Trump
   - **Similarity Score**: -0.2824
   - **Snippet**: Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021. Trump received a BS in eco...
3. **Title**: President Trump
   - **Similarity Score**: -0.2824
   - **Snippet**: Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021. Trump received a BS in eco...
4. **Title**: George Bush
   - **Similarity Score**: -0.3696
   - **Snippet**: George Walker Bush (born July 6, 1946) is an American politician who served as the 43rd president of the United States from 2001 to 2009. A member of the Republican Party, he previously served as the...
When querying the system with “who was obama?”, one would expect results related to Barack Obama, the 44th president of the United States. Instead, the results are bafflingly off the mark: the top results relate to Trump, another U.S. president, and, even more strangely, to the trumpet, a musical instrument.
The similarity scores further add to the confusion. The highest similarity score is associated with the “trumpet” result, which is counterintuitive given the query. Moreover, results related to Donald Trump have negative similarity scores, suggesting they are not similar to the query, yet they rank higher than many other potential documents.
Here is why it doesn’t work:
When embeddings are computed on entire documents, they encapsulate the general theme or essence of the document as a whole. These embeddings can be high-dimensional, and the nuances or specific details within the document might get diluted, especially in long documents.
When a user query, which is typically much shorter and more specific, is converted into an embedding, it captures the essence of that specific query. Comparing this query embedding with a document-level embedding is challenging for three reasons (a small sketch after the list below makes the effect concrete):
- **Length Disparity**: The user query is much shorter than a full document. This disparity in length means the query’s embedding is focused on the specific topic of the query, while the document’s embedding represents a broader range of information.
- **Dilution of Information**: Important details or topics within a long document might be underrepresented in the document’s overall embedding, especially if they are not the main theme of the document.
- **Noise**: Longer documents may contain a variety of topics, not all of which are relevant to a given query. The document’s embedding might be influenced by these other topics, making it less similar to the query embedding even if the document contains relevant information.
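To see the dilution effect in isolation, here is a small, self-contained illustration reusing the `embed` and `cosine_similarity` helpers from the sketches above. The text samples are invented for the example and the exact scores will vary by model, but the pattern holds: a long, mostly off-topic document tends to score lower against a short query than a short, focused passage does.

```python
query = "who was obama?"

focused_passage = (
    "Barack Obama is an American politician who served as the "
    "44th president of the United States from 2009 to 2017."
)

# The same relevant sentence, buried under repeated off-topic filler
# (standing in for a long, multi-topic article).
long_document = focused_passage + " " + (
    "The city hosts an annual jazz festival along the lakefront. " * 80
)

query_vec = embed(query)
print(cosine_similarity(query_vec, embed(focused_passage)))  # typically noticeably higher
print(cosine_similarity(query_vec, embed(long_document)))    # pulled toward the filler topic
```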
In part 2 of this tutorial series, I will show you how to build a more performant two-stage search over documents.