Enhancing Document Retrieval with Hypothetical Document Embedding (HyDE) and Retrieval-Augmented Generation (RAG)

Unlock Advanced Search Relevance: Bridging the Query-Document Gap with Innovative HyDE Techniques

👨🏾‍💻 GitHub ⭐️ | 👔LinkedIn |📝 Medium

Introduction

In the field of document retrieval, traditional methods often struggle to capture the full context of user queries, particularly when these queries are brief or complex. Despite significant advancements in retrieval systems through sophisticated models and algorithms, a persistent issue is the semantic gap between the concise nature of user queries and the extensive detail present in documents. This semantic gap can impede the relevance of search results, making it challenging for users to locate precisely what they need.

To address this issue, the Hypothetical Document Embedding (HyDE) technique introduces an innovative approach by transforming queries into hypothetical documents designed to encapsulate the query’s answer. This transformation aims to bridge the gap between the query’s representation and the document’s representation in vector space. By aligning the query more closely with the distribution of the actual documents, HyDE seeks to enhance retrieval relevance and accuracy.

This blog explores the implementation of the HyDE technique, focusing on the integration of PDF processing, document chunking, vector storage with FAISS, and the use of a language model to generate hypothetical documents. Through this guide, readers will gain insights into setting up a system that significantly improves retrieval relevance, particularly in specialized domains such as legal research or academic literature.

Overview of the HyDE Technique

The HyDE technique focuses on query expansion, where the original query is transformed into a hypothetical document that contains a detailed answer. This hypothetical document is then used to perform a similarity search against a vector store of preprocessed documents.

Key Components:

PDF Processing and Text Chunking: Extracts content from PDFs and splits it into manageable chunks for efficient vectorization.
Vector Store Creation: Uses FAISS and SentenceTransformerEmbeddings for storing document embeddings and performing fast similarity searches.
Hypothetical Document Generation: Leverages a language model to generate a hypothetical document based on the query.
HyDERetriever Class: Implements the core retrieval logic, generating hypothetical documents and retrieving similar ones from the vector store.
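To make the flow concrete before bringing in any libraries, here is a minimal, dependency-free sketch of the idea. It assumes a toy bag-of-words "embedding" and a hard-coded `fake_llm` in place of a real embedding model and language model; `embed`, `cosine`, `hyde_search`, and `fake_llm` are hypothetical names used only for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    query: str,
    chunks: List[str],
    generate: Callable[[str], str],
    k: int = 1,
) -> List[Tuple[float, str]]:
    """Embed a generated hypothetical answer instead of the raw query, then rank chunks."""
    hypothetical_doc = generate(query)
    doc_vec = embed(hypothetical_doc)
    scored = [(cosine(doc_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def fake_llm(query: str) -> str:
    # Hypothetical stand-in for the language model call.
    return "the company rebranded to signal a strategic shift towards building the metaverse"


chunks = [
    "the ipo raised 16 billion and valued the company at 104 billion",
    "the rebranding signalled a strategic shift towards building the metaverse",
]
top = hyde_search("why the new name?", chunks, fake_llm, k=1)
print(top[0][1])
```

The short query shares almost no vocabulary with either chunk, while the generated answer overlaps heavily with the relevant one. The real pipeline in this post relies on the same effect, with dense embeddings and FAISS in place of word counts.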

Benefits of the HyDE Approach

Improved Relevance: By generating a detailed hypothetical document, the retrieval system captures more relevant matches, especially for complex or multi-faceted queries.
Handling Complex Queries: This technique excels at handling queries that would be difficult to match directly due to the semantic gap between query and document distributions.
Adaptability: The hypothetical document generation adapts to various query types and domains, making it useful for different applications like legal research, academic literature review, or any domain requiring nuanced retrieval.

Step 1: Install Necessary Packages

Before diving into the code, you’ll need to install the necessary packages. These libraries include tools for document processing, embeddings, and the retrieval system itself.

!pip install python-dotenv 
!pip install langchain -U langchain-community 
!pip install PyMuPDF 
!pip install rank-bm25 
!pip install deepeval 
!pip install langchain_ollama 
!pip install pypdf 
!pip install sentence-transformers 
!pip install faiss-gpu  # or faiss-cpu if no GPU build is available for your environment

This will install the core libraries needed to process PDFs, generate embeddings, and retrieve documents based on queries.

If you’re working in Google Colab, use the complete set of instructions provided below. For local machine users, just run the command curl -fsSL https://ollama.com/install.sh | sh, then start the server with ollama serve, and finally, download the model with ollama pull llama3.1.

Install and Load Colab-XTerm

Colab-XTerm is a handy package that enables terminal access within a Colab notebook. This can be useful for running shell commands directly within the notebook environment. To install it, run the following command:

!pip install colab-xterm
%load_ext colabxterm

You can then open a terminal session by running:

%xterm

Installing Ollama

In the terminal, run the following commands to install Ollama and start the server:

curl -fsSL https://ollama.com/install.sh | sh

ollama serve

Pulling the Models

Once Ollama is installed and the server is running, you can pull the models you need. Ollama provides several LLMs, including Llama 3.1 and Gemma 2. This guide uses Llama 3.1:

ollama pull llama3.1

The above command will download and prepare the model for use in your Colab environment.

Alternatively, pull any other model available in Ollama. The full model list and details are available at https://ollama.com/library
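Before moving on, it can help to confirm from Python that the server is reachable and the model was pulled. The sketch below assumes Ollama is listening on its default address, http://localhost:11434, and uses its /api/tags endpoint to list local models; `list_ollama_models` is a hypothetical helper name, not part of any library.

```python
import http.client
import json
import urllib.error
import urllib.request
from typing import List, Optional


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[List[str]]:
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama server not reachable; is `ollama serve` running?")
else:
    print("Available models:", models)
```

If the list comes back without a llama3.1 entry, run the pull command again in the terminal.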

Step 2: Import Required Modules

We will be using a variety of modules for document processing, text splitting, embeddings, and similarity search.

from concurrent.futures import ThreadPoolExecutor
# Integrations live in langchain_community, core types in langchain_core
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import Ollama
from langchain_core.documents import Document
import textwrap
import fitz  # PyMuPDF
import asyncio
import random
import numpy as np
import json
from typing import List
from rank_bm25 import BM25Okapi

Step 3: Process and Vectorize PDF Documents

Define a function to process and convert PDF documents into a vector store for efficient similarity search.

Replace Tabs with Spaces: Clean the document text by replacing tab characters with spaces.
PDF to Vector Store: Load the PDF, split it into chunks, and create a vector store using FAISS and SentenceTransformerEmbeddings.

def replace_tabs_with_spaces(docs: List[Document]) -> List[Document]:
    for doc in docs:
        doc.page_content = doc.page_content.replace('\t', ' ')
    return docs

def pdf_to_vectorstore(path: str, chunk_size=1000, chunk_overlap=200) -> FAISS:
    loader = PyPDFLoader(path)
    docs = loader.load()

    with ThreadPoolExecutor() as executor:
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len)
        chunks = list(executor.map(lambda d: text_splitter.split_documents([d]), docs))

    chunks = [item for sublist in chunks for item in sublist]
    cleaned_chunks = replace_tabs_with_spaces(chunks)

    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(cleaned_chunks, embeddings)
    
    return vectorstore
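To see what chunk_size and chunk_overlap control, here is a simplified fixed-window splitter. This is only an illustration of the two parameters, not the algorithm RecursiveCharacterTextSplitter uses (which also tries to split on separators such as paragraphs and sentences); `sliding_window_chunks` is a hypothetical name.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghij"
chunks = sliding_window_chunks(sample, chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.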

Step 4: Generate Hypothetical Documents

Define the function to generate hypothetical documents that represent the query in a more detailed manner.

Initialize LLM Model: Create an instance of the Ollama model.
Create QA Chain: Define the prompt template and chain it with the language model to answer questions based on the context.
Generate Hypothetical Document: Use the language model to create a hypothetical document based on the query.

def initialize_llm_model():
    return Ollama(model="llama3.1", temperature=0)

def create_qa_chain(llm):
    question_answer_prompt_template = """ 
    For the question below, provide a concise but sufficient answer based ONLY on the provided context:
    {context}
    Question:
    {question}
    """

    question_answer_prompt = PromptTemplate(template=question_answer_prompt_template, input_variables=["context", "question"])
    question_answer_chain = question_answer_prompt | llm
    return question_answer_chain

class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        self.llm = initialize_llm_model()
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vectorstore = pdf_to_vectorstore(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)

        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template="""Given the question '{query}', generate a hypothetical document that directly answers this question. The document should be detailed and in-depth.
                        The document size should be exactly {chunk_size} characters.""",
        )
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        result = self.hyde_chain.invoke(input_variables)
        return result if isinstance(result, str) else result.content

Step 5: Perform Retrieval

Define the retrieval process using the generated hypothetical document to search the vector store for relevant documents.

Retrieve Context: Generate the hypothetical document from the query and use it to search for similar documents in the vector store.

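# Simple in-memory cache so repeated questions skip retrieval
context_cache = {}

def retrieve_context_for_question(question: str, retriever) -> List[str]:
    if question in context_cache:
        return context_cache[question]

    # Assumes a LangChain-style retriever (e.g. vectorstore.as_retriever());
    # newer LangChain versions prefer retriever.invoke(question)
    docs = retriever.get_relevant_documents(question)
    context = [doc.page_content for doc in docs]
    context_cache[question] = context
    return context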

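# Add the retrieve method to the HyDERetriever class defined above;
# __init__ and generate_hypothetical_document stay unchanged.
class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        ...  # (Initialization code as above)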

    def retrieve(self, query, k=3):
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc

Step 6: Evaluate Results

Implement functions to evaluate the correctness, faithfulness, and relevance of the generated answers and perform a comprehensive evaluation.

Evaluation Functions: Define metrics to assess how well the generated answers match the ground truth, how faithful they are to the retrieved documents, and how relevant they are to the query. The versions below are deliberately simple string-matching checks (exact match and substring containment), so they give only a rough signal.

def correctness_eval(generated_answer: str, ground_truth: str) -> float:
    return 1.0 if generated_answer.strip().lower() == ground_truth.strip().lower() else 0.0

def faithfulness_eval(generated_answer: str, retrieved_documents: List[str]) -> float:
    context_string = " ".join(retrieved_documents).lower()
    return 1.0 if generated_answer.strip().lower() in context_string else 0.0

def relevance_eval(question: str, generated_answer: str) -> float:
    return 1.0 if question.lower() in generated_answer.lower() else 0.0

def run_evaluation(retriever, num_questions: int = 5) -> None:
    llm = initialize_llm_model()
    qa_chain = create_qa_chain(llm)

    with open("../data/q_a.json", "r", encoding="utf-8") as file:
        q_a_data = json.load(file)

    questions = [qa["question"] for qa in q_a_data][:num_questions]
    ground_truth_answers = [qa["answer"] for qa in q_a_data][:num_questions]

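    # batch_retrieve_context and batch_generate_answers are helper functions
    # that must be defined before run_evaluation is called
    retrieved_docs_batch = batch_retrieve_context(questions, retriever)
    retrieved_contexts = [" ".join(doc) for doc in retrieved_docs_batch]

    # run_until_complete raises RuntimeError if an event loop is already
    # running (as in Jupyter or Colab)
    loop = asyncio.get_event_loop()
    generated_answers = loop.run_until_complete(batch_generate_answers(questions, retrieved_contexts, qa_chain))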

    correctness_scores = np.array([correctness_eval(g["answer"], gt) for g, gt in zip(generated_answers, ground_truth_answers)])
    faithfulness_scores = np.array([faithfulness_eval(g["answer"], r) for g, r in zip(generated_answers, retrieved_docs_batch)])
    relevance_scores = np.array([relevance_eval(q, g["answer"]) for q, g in zip(questions, generated_answers)])

    print(f"Avg Correctness: {correctness_scores.mean()}\nAvg Faithfulness: {faithfulness_scores.mean()}\nAvg Relevance: {relevance_scores.mean()}")
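run_evaluation calls two helpers, batch_retrieve_context and batch_generate_answers, that need to be defined separately. Here is a minimal sketch of what they might look like, assuming the retriever exposes the retrieve(query, k) method from HyDERetriever and the chain exposes invoke. The stub classes are hypothetical and exist only so the example runs without a model; asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
from typing import Any, Dict, List


def batch_retrieve_context(questions: List[str], retriever: Any, k: int = 3) -> List[List[str]]:
    """Retrieve the text of the top-k chunks for each question."""
    contexts = []
    for question in questions:
        docs, _hypothetical_doc = retriever.retrieve(question, k=k)
        contexts.append([doc.page_content for doc in docs])
    return contexts


async def batch_generate_answers(
    questions: List[str], contexts: List[str], qa_chain: Any
) -> List[Dict[str, str]]:
    """Run the QA chain for every (question, context) pair concurrently."""

    async def answer(question: str, context: str) -> Dict[str, str]:
        # The chain call is blocking, so run it in a worker thread.
        result = await asyncio.to_thread(
            qa_chain.invoke, {"context": context, "question": question}
        )
        text = result if isinstance(result, str) else result.content
        return {"question": question, "answer": text}

    return list(await asyncio.gather(*(answer(q, c) for q, c in zip(questions, contexts))))


# Hypothetical stubs so the sketch runs without a vector store or an LLM.
class _StubDoc:
    def __init__(self, text: str) -> None:
        self.page_content = text


class _StubRetriever:
    def retrieve(self, query: str, k: int = 3):
        return [_StubDoc(f"chunk about {query}")][:k], "hypothetical"


class _StubChain:
    def invoke(self, inputs: Dict[str, str]) -> str:
        return f"answer to {inputs['question']}"


demo_questions = ["q1", "q2"]
demo_contexts = batch_retrieve_context(demo_questions, _StubRetriever())
demo_answers = asyncio.run(
    batch_generate_answers(demo_questions, [" ".join(c) for c in demo_contexts], _StubChain())
)
print(demo_answers)
```

The returned dictionaries carry an "answer" key because that is what run_evaluation reads when computing the scores.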

Step 7: Main Execution Code

This section is where the actual execution of the document retrieval process takes place. First, set the path to your PDF document by replacing "/path/to/your/Facebook.pdf" with the actual file location. Next, initialize the HyDERetriever class with this path to prepare the system for querying. Define a test query to exercise the retrieval process. Call the retrieve method to get both the relevant documents and a generated hypothetical document based on your query. Finally, extract and print the content of the retrieved documents and the hypothetical document to review how well the system matches and expands the query.
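The main block relies on a helper named wrap_text_with_width that is not defined elsewhere in this post. Here is a minimal sketch, assuming it only wraps text to a fixed width for readable printing; the default width of 120 is an assumption, and the exact line breaks in the sample output further down may differ from what this version produces.

```python
import textwrap


def wrap_text_with_width(text: str, width: int = 120) -> str:
    """Wrap each line of text to at most `width` characters, keeping existing line breaks."""
    wrapped_lines = [textwrap.fill(line, width=width) for line in text.splitlines()]
    return "\n".join(wrapped_lines)


print(wrap_text_with_width("word " * 50, width=40))
```

Define it before the main block so the print calls can find it.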

if __name__ == "__main__":
    # Define the path to the PDF document
    path = "/path/to/your/Facebook.pdf"
    
    # Initialize the HyDERetriever with the path to the PDF document
    retriever = HyDERetriever(path)

    # Define a test query
    test_query = "What were the primary reasons for Facebook's rebranding to Meta Platforms, Inc. in 2021?"
    
    # Retrieve similar documents and generate a hypothetical document based on the test query
    results, hypothetical_doc = retriever.retrieve(test_query)

    # Extract the content from the retrieved documents
    docs_content = [doc.page_content for doc in results]

    # Print the generated hypothetical document
    print("Hypothetical Document:\n")
    print(wrap_text_with_width(hypothetical_doc) + "\n")

    # Print the content of the retrieved documents
    for i, doc in enumerate(docs_content):
        print(f"Context {i+1}:\n{wrap_text_with_width(doc)}\n")

Below is the output: the generated hypothetical document, followed by the contexts retrieved from the original document:

Hypothetical Document:

**Meta Platforms, Inc. Rebranding Report**

**Executive Summary:**
In October 2021, Facebook, Inc. rebranded to Meta Pla
tforms, Inc., marking a significant shift in the company's identity. This report outlines the primary reasons behind thi
s transformation.

**Reasons for Rebranding:**

1. **Expansion of Services:** The rebranding reflects Facebook's evoluti
on into a comprehensive metaverse platform, encompassing virtual reality (VR), augmented reality (AR), and online social
 interactions.
2. **Diversification of Business:** By separating the company name from its primary product, Meta Platfor
ms, Inc. aims to distance itself from controversies surrounding Facebook, while emphasizing its broader technological am
bitions.
3. **Preparation for Future Growth:** The rebranding positions the company for future growth, as it prepares to
 expand into new markets and industries, such as virtual reality and online commerce.

**Conclusion:**
The rebranding of
 Facebook, Inc. to Meta Platforms, Inc. represents a strategic move towards a more comprehensive and forward-thinking id
entity, reflecting the company's commitment to innovation and expansion.

Context 1:
and advertising drove its growth, culminating in a highly anticipated IPO on May 18, 2012. The 
IPO raised $16 billion, 
making it one of the largest tech IPOs in history and valuing Facebook at 
$104 billion.
1.4 Evolution and Re branding
I
n October 2021, Facebook announced its rebranding to Meta Platforms, Inc., signaling a strategic 
shift towards building
 the “metaverse” — a collective virtual shared space created by the

Context 2:
Facebook: A Comprehensive
Introduction
Facebook, now known as Meta Platforms, Inc., has dramatically transformed the way
 people 
interact and engage with digital content. Since its creation in 2004, the platform has evolved from a 
college 
project into a leading global technology company. This detailed overview covers 
Facebook’s history, services, leadershi
p, location, workforce, and more.
1. History and Foundation
1.1 Origins and Early Development

Context 3:
operations and business strategy helped Facebook become one of the most profitable tech 
companies. Sandberg’s leadershi
p extended to advocating for gender equality in the workplace 
through her book “Lean In” and various public speaking en
gagements.
3.3 Recent Leadership Changes
In recent years, there have been several changes in Facebook’s leadership struc
ture. In 2021, the 
rebranding to Meta Platforms, Inc. marked a shift in focus towards developing the metaverse. The

The output consists of a detailed hypothetical document that aims to answer the query about Facebook’s rebranding, followed by several contexts from the actual document that were retrieved based on their relevance to this hypothetical document. The hypothetical document helps narrow down the search space by providing a detailed representation of the query, while the retrieved contexts give specific pieces of information that match this representation.

This method improves retrieval by transforming the query into a form that is more similar to the document contents, thus enhancing the relevance and usefulness of the retrieved information.

Github code:

Large-Language-Model-LLM-/RAG …

Welcome to the LLM Tutorials and RAG Implementations repository! This repository provides tutorials, guides, and…

github.com

Conclusion

The Hypothetical Document Embedding (HyDE) technique represents a significant advancement in document retrieval. By transforming queries into hypothetical documents, HyDE improves the relevance of retrieved documents, especially for complex queries. This approach is particularly useful in domains where understanding query intent and context is crucial, such as legal research, academic literature review, or advanced information retrieval systems.

Implementing HyDE allows you to bridge the semantic gap between queries and documents, leading to more accurate and context-aware retrieval results. Whether you’re dealing with legal documents, academic papers, or any other form of complex content, HyDE offers a powerful solution for achieving more relevant search results.

References: https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/HyDe_Hypothetical_Document_Embedding.ipynb

Happy coding! 🎉

Thank you for your time in reading this post!

Make sure to leave your feedback and comments. See you in the next blog, stay tuned 📢
