Enhancing Document Retrieval with Hypothetical Document Embedding (HyDE) and Retrieval-Augmented Generation (RAG)
Unlock Advanced Search Relevance: Bridging the Query-Document Gap with Innovative HyDE Techniques
👨🏾💻 GitHub ⭐️ | 👔LinkedIn |📝 Medium

Introduction
In the field of document retrieval, traditional methods often struggle to capture the full context of user queries, particularly when these queries are brief or complex. Despite significant advancements in retrieval systems through sophisticated models and algorithms, a persistent issue is the semantic gap between the concise nature of user queries and the extensive detail present in documents. This semantic gap can impede the relevance of search results, making it challenging for users to locate precisely what they need.
To address this issue, the Hypothetical Document Embedding (HyDE) technique introduces an innovative approach by transforming queries into hypothetical documents designed to encapsulate the query’s answer. This transformation aims to bridge the gap between the query’s representation and the document’s representation in vector space. By aligning the query more closely with the distribution of the actual documents, HyDE seeks to enhance retrieval relevance and accuracy.
This blog explores the implementation of the HyDE technique, focusing on the integration of PDF processing, document chunking, vector storage with FAISS, and the use of a language model to generate hypothetical documents. Through this guide, readers will gain insights into setting up a system that significantly improves retrieval relevance, particularly in specialized domains such as legal research or academic literature.
Overview of the HyDE Technique
The HyDE technique focuses on query expansion, where the original query is transformed into a hypothetical document that contains a detailed answer. This hypothetical document is then used to perform a similarity search against a vector store of preprocessed documents.
Key Components:
- PDF Processing and Text Chunking: Extracts content from PDFs and splits it into manageable chunks for efficient vectorization.
- Vector Store Creation: Uses FAISS and SentenceTransformerEmbeddings for storing document embeddings and performing fast similarity searches.
- Hypothetical Document Generation: Leverages a language model to generate a hypothetical document based on the query.
- HyDERetriever Class: Implements the core retrieval logic, generating hypothetical documents and retrieving similar ones from the vector store.
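The flow described above can be sketched without any external libraries. This is a toy illustration under stated assumptions, not the implementation built later in this post: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a hard-coded stand-in for the language model, and `hyde_search` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(query: str, chunks: List[str],
                generate: Callable[[str], str], k: int = 1) -> List[str]:
    """Rank chunks by similarity to a generated hypothetical answer, not the raw query."""
    hypothetical_doc = generate(query)
    hyde_vector = embed(hypothetical_doc)
    ranked = sorted(chunks, key=lambda chunk: cosine(hyde_vector, embed(chunk)), reverse=True)
    return ranked[:k]


def fake_llm(query: str) -> str:
    # Hard-coded stand-in for the language model call (assumption for the demo)
    return "the rebranding signals a strategic shift towards building the metaverse"


chunks = [
    "the company rebranded to signal a strategic shift towards the metaverse",
    "the ipo raised 16 billion dollars in 2012",
]
query = "why rename?"

# The raw query shares no words with the relevant chunk...
print(cosine(embed(query), embed(chunks[0])))
# ...but the hypothetical answer does, so the right chunk ranks first.
print(hyde_search(query, chunks, fake_llm))
```

The short query has no vocabulary in common with the chunk that answers it, while the longer hypothetical answer overlaps heavily. Real embedding models are far more forgiving than word counts, but the same intuition motivates HyDE.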
Benefits of the HyDE Approach
- Improved Relevance: By generating a detailed hypothetical document, the retrieval system captures more relevant matches, especially for complex or multi-faceted queries.
- Handling Complex Queries: This technique excels at handling queries that would be difficult to match directly due to the semantic gap between query and document distributions.
- Adaptability: The hypothetical document generation adapts to various query types and domains, making it useful for different applications like legal research, academic literature review, or any domain requiring nuanced retrieval.
Step 1: Install Necessary Packages
Before diving into the code, you’ll need to install the necessary packages. These libraries include tools for document processing, embeddings, and the retrieval system itself.
!pip install python-dotenv
!pip install langchain -U langchain-community
!pip install PyMuPDF
!pip install rank-bm25
!pip install deepeval
!pip install langchain_ollama
!pip install pypdf
!pip install sentence-transformers
!pip install faiss-gpu
This installs the core libraries needed to process PDFs, generate embeddings, and retrieve documents based on queries. If faiss-gpu fails to install or you have no GPU, faiss-cpu is a drop-in alternative for this guide.
If you’re working in Google Colab, use the complete set of instructions provided below. On a local machine, just run curl -fsSL https://ollama.com/install.sh | sh, then start the server with ollama serve, and finally download the model with ollama pull llama3.1.
Install and Load Colab-XTerm
Colab-XTerm is a handy package that enables terminal access within a Colab notebook. This can be useful for running shell commands directly within the notebook environment. To install it, run the following command:
!pip install colab-xterm
%load_ext colabxterm
You can then open a terminal session by running:
%xterm
Installing Ollama
In the terminal, run the following commands to install Ollama and start its server:
curl -fsSL https://ollama.com/install.sh | sh
ollama serve
Pulling the Models
Once Ollama is installed, you can pull the models you need. Because ollama serve keeps running in the foreground, pull models from a second terminal (or start the server in the background with ollama serve &). Ollama provides several LLMs, including Llama 3.1 and Gemma 2. This guide uses Llama 3.1, which you can pull as follows:
ollama pull llama3.1
The above command downloads and prepares the model for use in your Colab environment.
Alternatively, pull any other LLM that is available in Ollama. The full model list and details are available at https://ollama.com/library
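Before moving on, it can save some debugging time to confirm that the server is up and the model is actually present. The sketch below is one way to do that, assuming Ollama's default address (http://localhost:11434) and its endpoint for listing local models (/api/tags); `list_ollama_models` is a hypothetical helper name, not something this guide or Ollama ships.

```python
import json
import urllib.request
from typing import List, Optional


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> Optional[List[str]]:
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running"
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama server not reachable; is `ollama serve` running?")
    elif not any(name.startswith("llama3.1") for name in models):
        print("Server is up, but llama3.1 is missing; run `ollama pull llama3.1`.")
    else:
        print("Ready:", models)
```

If you changed the host or port of your Ollama server, pass the matching `base_url` instead of relying on the default.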
Step 2: Import Required Modules
We will be using a variety of modules for document processing, text splitting, embeddings, and similarity search.
from concurrent.futures import ThreadPoolExecutor
# Recent LangChain releases expose these classes from langchain_community,
# langchain_text_splitters, and langchain_core rather than the top-level langchain package
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import Ollama
from langchain_core.documents import Document
import textwrap
import fitz # PyMuPDF
import asyncio
import random
import numpy as np
import json
from typing import List
from rank_bm25 import BM25Okapi
Step 3: Process and Vectorize PDF Documents
Define a function to process and convert PDF documents into a vector store for efficient similarity search.
- Replace Tabs with Spaces: Clean the document text by replacing tab characters with spaces.
- PDF to Vector Store: Load the PDF, split it into chunks, and create a vector store using FAISS and SentenceTransformerEmbeddings.
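To make the chunk_size and chunk_overlap parameters concrete, here is a simplified sketch of overlapping chunking in plain Python. It is an assumption-laden stand-in: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and sentences, whereas this hypothetical `chunk_text` helper simply slides a fixed-size window over the text.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxy"
for piece in chunk_text(sample, chunk_size=10, chunk_overlap=4):
    print(piece)
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks, at the cost of storing some text twice.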
def replace_tabs_with_spaces(docs: List[Document]) -> List[Document]:
    # Replace tab characters in place so they do not add noise to the chunks
    for doc in docs:
        doc.page_content = doc.page_content.replace('\t', ' ')
    return docs

def pdf_to_vectorstore(path: str, chunk_size=1000, chunk_overlap=200) -> FAISS:
    # Load the PDF; PyPDFLoader returns one Document per page
    loader = PyPDFLoader(path)
    docs = loader.load()
    # Split the pages into overlapping chunks, spreading pages across worker threads
    with ThreadPoolExecutor() as executor:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len)
        chunks = list(executor.map(lambda d: text_splitter.split_documents([d]), docs))
    # Flatten the per-page lists into a single list of chunks
    chunks = [item for sublist in chunks for item in sublist]
    cleaned_chunks = replace_tabs_with_spaces(chunks)
    # Embed the chunks and index them in FAISS
    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(cleaned_chunks, embeddings)
    return vectorstore
Step 4: Generate Hypothetical Documents
Define the function to generate hypothetical documents that represent the query in a more detailed manner.
- Initialize LLM Model: Create an instance of the Ollama model.
- Create QA Chain: Define the prompt template and chain it with the language model to answer questions based on the context.
- Generate Hypothetical Document: Use the language model to create a hypothetical document based on the query.
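It helps to see the text the model actually receives. The sketch below fills the same HyDE prompt wording used in this step with plain `str.format`, so you can inspect it without loading a model; `HYDE_TEMPLATE` and `build_hyde_prompt` are hypothetical names introduced only for this illustration.

```python
HYDE_TEMPLATE = (
    "Given the question '{query}', generate a hypothetical document that directly "
    "answers this question. The document should be detailed and in-depth.\n"
    "The document size should be exactly {chunk_size} characters."
)


def build_hyde_prompt(query: str, chunk_size: int = 500) -> str:
    """Fill the HyDE prompt with the user's query and the target document length."""
    return HYDE_TEMPLATE.format(query=query, chunk_size=chunk_size)


prompt = build_hyde_prompt("What is HyDE?", chunk_size=500)
print(prompt)
```

Asking for a length equal to the chunk size nudges the hypothetical document towards the same shape as the stored chunks. Language models rarely hit an exact character count, so treat the number as a rough target rather than a guarantee.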
def initialize_llm_model():
    # temperature=0 keeps generations as deterministic as possible
    return Ollama(model="llama3.1", temperature=0)

def create_qa_chain(llm):
    question_answer_prompt_template = """
For the question below, provide a concise but sufficient answer based ONLY on the provided context:
{context}
Question:
{question}
"""
    question_answer_prompt = PromptTemplate(template=question_answer_prompt_template, input_variables=["context", "question"])
    # Pipe the prompt into the model to form a runnable chain
    question_answer_chain = question_answer_prompt | llm
    return question_answer_chain

class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        self.llm = initialize_llm_model()
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vectorstore = pdf_to_vectorstore(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)
        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template="""Given the question '{query}', generate a hypothetical document that directly answers this question. The document should be detailed and in-depth.
The document size should be exactly {chunk_size} characters.""",
        )
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        result = self.hyde_chain.invoke(input_variables)
        # The Ollama LLM returns a plain string; chat models return a message with .content
        return result if isinstance(result, str) else result.content
Step 5: Perform Retrieval
Define the retrieval process using the generated hypothetical document to search the vector store for relevant documents.
- Retrieve Context: Generate the hypothetical document from the query and use it to search for similar documents in the vector store.
# Cache retrieved contexts so repeated questions skip the LLM call and the search
context_cache = {}

def retrieve_context_for_question(question: str, retriever) -> List[str]:
    if question in context_cache:
        return context_cache[question]
    # HyDERetriever.retrieve returns (documents, hypothetical_document)
    docs, _ = retriever.retrieve(question)
    context = [doc.page_content for doc in docs]
    context_cache[question] = context
    return context

class HyDERetriever:
    # __init__ and generate_hypothetical_document are unchanged from the definition above;
    # add the retrieve method below to that same class.
    def retrieve(self, query, k=3):
        # Search with the hypothetical document instead of the raw query
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc
Step 6: Evaluate Results
Implement functions to evaluate the correctness, faithfulness, and relevance of the generated answers and perform a comprehensive evaluation.
- Evaluation Functions: Define metrics to assess how well the generated answers match the ground truth, the faithfulness to the retrieved documents, and the relevance to the query.
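One caveat before reading the scoring code: the metrics in this step are strict string checks, so a correct but reworded answer scores 0.0. As a softer alternative (an assumption on my part, not part of the evaluation used in this post), a token-level F1 score gives partial credit for overlapping words. `token_f1` is a hypothetical helper name.

```python
from collections import Counter


def token_f1(generated_answer: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall between answer and reference."""
    generated_tokens = generated_answer.lower().split()
    reference_tokens = ground_truth.lower().split()
    if not generated_tokens or not reference_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(generated_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Meta Platforms", "meta platforms"))
print(token_f1("the metaverse", "metaverse"))
```

Whitespace tokenization keeps the sketch short; punctuation stays attached to words, so you may want to normalize text first if you adopt something like this.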
def correctness_eval(generated_answer: str, ground_truth: str) -> float:
    # Exact match after trimming whitespace and lowercasing
    return 1.0 if generated_answer.strip().lower() == ground_truth.strip().lower() else 0.0

def faithfulness_eval(generated_answer: str, retrieved_documents: List[str]) -> float:
    # 1.0 only if the answer appears verbatim in the retrieved text
    context_string = " ".join(retrieved_documents).lower()
    return 1.0 if generated_answer.strip().lower() in context_string else 0.0

def relevance_eval(question: str, generated_answer: str) -> float:
    # 1.0 only if the answer contains the full question text
    return 1.0 if question.lower() in generated_answer.lower() else 0.0

def run_evaluation(retriever, num_questions: int = 5) -> None:
    llm = initialize_llm_model()
    qa_chain = create_qa_chain(llm)
    # Expects a JSON list of objects with "question" and "answer" keys
    with open("../data/q_a.json", "r", encoding="utf-8") as file:
        q_a_data = json.load(file)
    questions = [qa["question"] for qa in q_a_data][:num_questions]
    ground_truth_answers = [qa["answer"] for qa in q_a_data][:num_questions]
    # batch_retrieve_context and batch_generate_answers are helpers that must be defined separately
    retrieved_docs_batch = batch_retrieve_context(questions, retriever)
    retrieved_contexts = [" ".join(doc) for doc in retrieved_docs_batch]
    # asyncio.run starts a fresh event loop; in a notebook, where a loop is
    # already running, await batch_generate_answers(...) directly instead
    generated_answers = asyncio.run(batch_generate_answers(questions, retrieved_contexts, qa_chain))
    correctness_scores = np.array([correctness_eval(g["answer"], gt) for g, gt in zip(generated_answers, ground_truth_answers)])
    faithfulness_scores = np.array([faithfulness_eval(g["answer"], r) for g, r in zip(generated_answers, retrieved_docs_batch)])
    relevance_scores = np.array([relevance_eval(q, g["answer"]) for q, g in zip(questions, generated_answers)])
    print(f"Avg Correctness: {correctness_scores.mean()}\nAvg Faithfulness: {faithfulness_scores.mean()}\nAvg Relevance: {relevance_scores.mean()}")
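run_evaluation calls two helpers, batch_retrieve_context and batch_generate_answers, that this post never defines. The sketch below is one plausible shape for them, inferred only from how they are called: the retriever is assumed to expose the `retrieve(query, k)` method from Step 5, the chain is assumed to support LangChain's async `ainvoke`, and each generated answer is assumed to be a dict with an `"answer"` key.

```python
import asyncio
from typing import Any, Dict, List


def batch_retrieve_context(questions: List[str], retriever: Any, k: int = 3) -> List[List[str]]:
    """Retrieve the text of the top-k chunks for each question."""
    contexts = []
    for question in questions:
        # retrieve returns (documents, hypothetical_document); only the documents are needed here
        docs, _hypothetical_doc = retriever.retrieve(question, k=k)
        contexts.append([doc.page_content for doc in docs])
    return contexts


async def batch_generate_answers(questions: List[str], contexts: List[str],
                                 qa_chain: Any) -> List[Dict[str, str]]:
    """Answer all questions concurrently, one chain call per (question, context) pair."""

    async def answer_one(question: str, context: str) -> Dict[str, str]:
        result = await qa_chain.ainvoke({"context": context, "question": question})
        # Plain LLMs return a string; chat models return a message with .content
        text = result if isinstance(result, str) else result.content
        return {"question": question, "answer": text}

    tasks = [answer_one(q, c) for q, c in zip(questions, contexts)]
    return list(await asyncio.gather(*tasks))
```

Note that this version calls the retriever directly rather than going through the cached lookup from Step 5; swap in the cached function if you expect repeated questions.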
Step 7: Main Execution Code
This section runs the retrieval process end to end. First, set the path to your PDF document by replacing "/path/to/your/Facebook.pdf" with the actual file location. Next, initialize the HyDERetriever class with this path to build the vector store and prepare the system for querying. Define a test query, such as the question about Facebook's rebranding used in the code, to exercise the retrieval process. Call the retrieve method to get both the relevant documents and the hypothetical document generated from your query. Finally, extract and print the content of the retrieved documents and the hypothetical document to review how well the system matches and expands the query.
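The main block prints its results through wrap_text_with_width, a helper this post does not define. Since textwrap is already imported in Step 2, a minimal sketch could look like the following. The 120-character default and the word-boundary wrapping are assumptions, so your output may break lines in different places than the sample output in this post.

```python
import textwrap


def wrap_text_with_width(text: str, width: int = 120) -> str:
    """Wrap text to the given width, one source line at a time, keeping blank lines."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill returns an empty string for blank input, so handle it explicitly
        wrapped_lines.append(textwrap.fill(line, width=width) if line.strip() else "")
    return "\n".join(wrapped_lines)


print(wrap_text_with_width("aaa bbb ccc", width=7))
```

Wrapping each source line separately keeps the headings and paragraph breaks produced by the model instead of merging everything into one block.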
if __name__ == "__main__":
    # Define the path to the PDF document
    path = "/path/to/your/Facebook.pdf"
    # Initialize the HyDERetriever with the path to the PDF document
    retriever = HyDERetriever(path)
    # Define a test query
    test_query = "What were the primary reasons for Facebook's rebranding to Meta Platforms, Inc. in 2021?"
    # Retrieve similar documents and generate a hypothetical document based on the test query
    results, hypothetical_doc = retriever.retrieve(test_query)
    # Extract the content from the retrieved documents
    docs_content = [doc.page_content for doc in results]
    # Print the generated hypothetical document
    print("Hypothetical Document:\n")
    print(wrap_text_with_width(hypothetical_doc) + "\n")
    # Print the content of the retrieved documents
    for i, doc in enumerate(docs_content):
        print(f"Context {i+1}:\n{wrap_text_with_width(doc)}\n")
The output below shows the hypothetical document followed by the contexts retrieved from the original document:
Hypothetical Document:
**Meta Platforms, Inc. Rebranding Report**
**Executive Summary:**
In October 2021, Facebook, Inc. rebranded to Meta Pla
tforms, Inc., marking a significant shift in the company's identity. This report outlines the primary reasons behind thi
s transformation.
**Reasons for Rebranding:**
1. **Expansion of Services:** The rebranding reflects Facebook's evoluti
on into a comprehensive metaverse platform, encompassing virtual reality (VR), augmented reality (AR), and online social
interactions.
2. **Diversification of Business:** By separating the company name from its primary product, Meta Platfor
ms, Inc. aims to distance itself from controversies surrounding Facebook, while emphasizing its broader technological am
bitions.
3. **Preparation for Future Growth:** The rebranding positions the company for future growth, as it prepares to
expand into new markets and industries, such as virtual reality and online commerce.
**Conclusion:**
The rebranding of
Facebook, Inc. to Meta Platforms, Inc. represents a strategic move towards a more comprehensive and forward-thinking id
entity, reflecting the company's commitment to innovation and expansion.
Context 1:
and advertising drove its growth, culminating in a highly anticipated IPO on May 18, 2012. The
IPO raised $16 billion,
making it one of the largest tech IPOs in history and valuing Facebook at
$104 billion.
1.4 Evolution and Re branding
I
n October 2021, Facebook announced its rebranding to Meta Platforms, Inc., signaling a strategic
shift towards building
the “metaverse” — a collective virtual shared space created by the
Context 2:
Facebook: A Comprehensive
Introduction
Facebook, now known as Meta Platforms, Inc., has dramatically transformed the way
people
interact and engage with digital content. Since its creation in 2004, the platform has evolved from a
college
project into a leading global technology company. This detailed overview covers
Facebook’s history, services, leadershi
p, location, workforce, and more.
1. History and Foundation
1.1 Origins and Early Development
Context 3:
operations and business strategy helped Facebook become one of the most profitable tech
companies. Sandberg’s leadershi
p extended to advocating for gender equality in the workplace
through her book “Lean In” and various public speaking en
gagements.
3.3 Recent Leadership Changes
In recent years, there have been several changes in Facebook’s leadership struc
ture. In 2021, the
rebranding to Meta Platforms, Inc. marked a shift in focus towards developing the metaverse. The
The output consists of a detailed hypothetical document that aims to answer the query about Facebook’s rebranding, followed by several contexts from the actual document that were retrieved based on their relevance to this hypothetical document. The hypothetical document helps narrow down the search space by providing a detailed representation of the query, while the retrieved contexts give specific pieces of information that match this representation.
This method improves retrieval by transforming the query into a form that is more similar to the document contents, thus enhancing the relevance and usefulness of the retrieved information.
Conclusion
The Hypothetical Document Embedding (HyDE) technique represents a significant advancement in document retrieval. By transforming queries into hypothetical documents, HyDE improves the relevance of retrieved documents, especially for complex queries. This approach is particularly useful in domains where understanding query intent and context is crucial, such as legal research, academic literature review, or advanced information retrieval systems.
Implementing HyDE allows you to bridge the semantic gap between queries and documents, leading to more accurate and context-aware retrieval results. Whether you’re dealing with legal documents, academic papers, or any other form of complex content, HyDE offers a powerful solution for achieving more relevant search results.
Happy coding! 🎉
Thank you for your time in reading this post!
Make sure to leave your feedback and comments. See you in the next blog, stay tuned 📢