Ahmed Besbes

Summary

This text discusses three advanced techniques to enhance document retrieval in RAG-based applications: query expansion, cross-encoder re-ranking, and embedding adaptors.

Abstract

The article begins by addressing the issue of off-the-shelf RAG implementations retrieving irrelevant documents for user queries. To tackle this problem, the author introduces three techniques: query expansion, cross-encoder re-ranking, and embedding adaptors. Query expansion involves rephrasing the original query, which can be done by generating a hypothetical answer or creating related questions. Cross-encoder re-ranking uses a deep neural network to compare and contrast inputs, allowing for better relevancy scoring of retrieved documents. Lastly, embedding adaptors are lightweight alternatives to fully fine-tuning a pre-trained model, which can be used to alter the query embedding for better retrieval results.

Opinions

  • The author believes that query expansion, specifically generating a hypothetical answer, works surprisingly well to improve document retrieval.
  • The author suggests that cross-encoder re-ranking can be used with query expansion to reduce context size while selecting the most important pieces.
  • The author emphasizes the importance of training an embedding adaptor using user feedback on the relevancy of retrieved documents.
  • The author mentions ongoing research in the field, including fine-tuning the embedding model using real feedback data and exploring more complex embedding adaptors using deep neural networks.
  • The author encourages readers to follow their newsletter for more machine learning content and practical tips from the industry.
  • The author invites readers to share the article and follow them on Medium.

3 Advanced Document Retrieval Techniques To Improve RAG Systems

Query expansion, cross-encoder re-ranking, and embedding adaptors

Image created by the author using DALL-E 3

Have you ever observed that documents retrieved by RAG systems may not always align with the user’s query?

This is a common occurrence, particularly with off-the-shelf RAG implementations. Documents may lack complete answers to the query, contain redundant information, or include irrelevant details. Furthermore, the order in which these documents are presented may not consistently match the user’s intent.

In this post, we will explore three effective techniques to enhance document retrieval in RAG-based applications:

  1. Query expansion
  2. Cross-encoder re-ranking
  3. Embedding adaptors

By incorporating these techniques, you can retrieve more pertinent documents that closely match the user’s query, thereby increasing the impact of the generated answer.

Let’s have a look 👇.

If you’re interested in ML content, detailed tutorials and practical tips from the industry, follow my newsletter. It’s called The Tech Buffet.

1 — Query expansion 💥

Query expansion refers to a set of techniques that rephrase the original query.

We'll cover two popular methods that are easy to implement.

👉 Query expansion with a generated answer

Given an input query, this method first instructs an LLM to provide a hypothetical answer, regardless of its correctness.

Then, the query and the generated answer are combined in a prompt and sent to the retrieval system.

Image by the author

This technique works surprisingly well. Check the findings of this paper to learn more about it.

The rationale behind this method is that we want to retrieve documents that look more like an answer. The correctness of the hypothetical answer doesn’t matter much because what we’re interested in is its structure and formulation.

You could consider the hypothetical answer as a template that helps identify a relevant neighborhood in the embedding space.

Here’s an example of a prompt I used to augment the query sent to a RAG that answers questions about financial reports.

You are a helpful expert financial research assistant.

Provide an example answer to the given question, that might 
be found in a document like an annual report.
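Here's a minimal sketch of the full flow (assuming the same OpenAI client and Chroma collection used later in this post; the function and variable names are illustrative):

def expand_with_hypothetical_answer(query, model="gpt-3.5-turbo"):
    # Ask the LLM for a plausible answer; its correctness doesn't matter much
    messages = [
        {"role": "system", "content": "You are a helpful expert financial research assistant. Provide an example answer to the given question, that might be found in a document like an annual report."},
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    hypothetical_answer = response.choices[0].message.content
    # Combine the original query with the hypothetical answer
    return f"{query} {hypothetical_answer}"

joint_query = expand_with_hypothetical_answer("What was the total revenue this year?")
results = chroma_collection.query(query_texts=[joint_query], n_results=5)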

👉 Query expansion with multiple related questions

This second method instructs an LLM to generate N questions related to the original query and then sends them all (+ the original query) to the retrieval system.

By doing this, more documents will be retrieved from the vectorstore. However, some of them will be duplicates, which is why you need a post-processing step to remove them.

Image by the author

The idea behind this method is to extend the initial query, which may be incomplete or ambiguous, with related aspects that may ultimately prove relevant and complementary.

Here’s a prompt I used to generate the related questions:

You are a helpful expert financial research assistant. 
Your users are asking questions about an annual report.
Suggest up to five additional related questions to help them 
find the information they need, for the provided question.
Suggest only short questions without compound sentences. 
Suggest a variety of questions that cover different aspects of the topic.
Make sure they are complete questions, and that they are related to 
the original question.
Output one question per line. Do not number the questions.
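Here's a sketch of how you could wire this up (again assuming the OpenAI client and Chroma collection introduced later in this post; PROMPT_RELATED_QUESTIONS holds the prompt above and original_query is the user's question):

def generate_related_questions(query, model="gpt-3.5-turbo"):
    messages = [
        {"role": "system", "content": PROMPT_RELATED_QUESTIONS},
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    # One question per line, as requested in the prompt
    return response.choices[0].message.content.split("\n")

augmented_queries = [original_query] + generate_related_questions(original_query)
results = chroma_collection.query(query_texts=augmented_queries, n_results=5, include=["documents"])

# Flatten and deduplicate the retrieved documents (order is preserved)
retrieved_documents = list(dict.fromkeys(doc for docs in results["documents"] for doc in docs))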

The downside of this method is that we end up with a lot more documents that may distract the LLM from generating a useful answer.

That’s where re-ranking comes into play 👇.

To learn more about different query expansion techniques, check this paper from Google.

2 — Cross-encoder re-ranking 📊

This method re-ranks the retrieved documents according to a score that quantifies their relevancy with the input query.

Image by the author

To compute this score, we will use a cross-encoder.

A cross-encoder is a deep neural network that processes two input sequences together as a single input. This allows the model to directly compare and contrast the inputs, understanding their relationship in a more integrated and nuanced way.

Image by the author

Cross-encoders can be used for information retrieval: given a query, score it against each retrieved document, then sort the documents by decreasing score. The highest-scored documents are the most relevant ones.

See SBERT.net Retrieve & Re-rank for more details.

Image by the author

Here’s how to quickly get started with re-ranking using cross-encoders:

  • Install sentence-transformers:
pip install -U sentence-transformers
  • Import the cross-encoder and load it:
from sentence_transformers import CrossEncoder 
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
  • Score each pair of (query, document):
pairs = [[query, doc] for doc in retrieved_documents] 
scores = cross_encoder.predict(pairs) 

print("Scores:") for score in scores:     
print(score)  

# Scores: 
# 0.98693466 
# 2.644579 
# -0.26802942 
# -10.73159 
# -7.7066045 
# -5.6469955 
# -4.297035 
# -10.933233 
# -7.0384283 
# -7.3246956
  • Reorder the documents:
print("New Ordering:") 
for o in np.argsort(scores)[::-1]:
    print(o+1)

Cross-encoder re-ranking can be used with query expansion: after you generate multiple related questions and retrieve the corresponding documents (say you end up with M documents), you re-rank them and pick the top K (K < M). That way, you reduce the context size while selecting the most important pieces.
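As a quick sketch (reusing the cross_encoder loaded above; retrieved_documents is the deduplicated list of M documents obtained from query expansion, original_query is the user's question, and K=5 is an arbitrary choice):

import numpy as np

# Score every (query, document) pair with the cross-encoder
pairs = [[original_query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)

# Keep only the top K documents as context for the LLM
K = 5
top_k_documents = [retrieved_documents[i] for i in np.argsort(scores)[::-1][:K]]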

In the next section, we will dive into adaptors, a powerful yet simple-to-implement technique to scale embeddings to better align with the user’s task.

3 — Embedding adaptors 🧩

This method leverages user feedback on the relevancy of the retrieved documents to train an adapter.

An adapter is a lightweight alternative to fully fine-tuning a pre-trained model. Adapters are typically implemented as small feedforward neural networks inserted between the layers of a pre-trained model.

The underlying goal of training an adapter is to alter the query embedding to produce better retrieval results for a specific task.

An embedding adapter is a stage that can be inserted after the embedding phase and before the retrieval. Think about it as a matrix (with trained weights) that takes the original embedding and scales it.

Image by the author

To train an adapter, we need to go through the following steps.

Prepare the training data

To train an embedding adapter, we need some training data on the relevancy of the documents. This data can be manually labeled or generated by an LLM.

This data must include tuples of (query, document) as well as their corresponding labels (1 if the document is relevant to the query, -1 otherwise).

For the sake of simplicity, we're going to create a synthetic dataset, but in real-world settings, you need to find a way to collect user feedback (e.g. ask users to rate the relevancy of each document from the interface with 👍 and 👎).

To create some training data, we first generate sample questions that a financial analyst may ask when analyzing a financial report.

Let’s use an LLM for this:

import os
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# Instantiate the client used by the functions below
openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

PROMPT_DATASET = """
You are a helpful expert financial research assistant. 
You help users analyze financial statements to better understand companies.
Suggest 10 to 15 short questions that are important to ask when analyzing 
an annual report.
Do not output any compound questions (questions with multiple sentences 
or conjunctions).
Output each question on a separate line divided by a newline.
"""

def generate_queries(model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": PROMPT_DATASET,
        },
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    content = content.split("\n")
    return content


generated_queries = generate_queries()
for query in generated_queries:
    print(query)

# 1. What is the company's revenue growth rate over the past three years?
# 2. What are the company's total assets and total liabilities?
# 3. How much debt does the company have? Is it increasing or decreasing?
# 4. What is the company's profit margin? Is it improving or declining?
# 5. What are the company's cash flow from operations, investing, and financing activities?
# 6. What are the company's major sources of revenue?
# 7. Does the company have any pending litigation or legal issues?
# 8. What is the company's market share compared to its competitors?
# 9. How much cash does the company have on hand?
# 10. Are there any major changes in the company's executive team or board of directors?
# 11. What is the company's dividend history and policy?
# 12. Are there any related party transactions?
# 13. What are the company's major risks and uncertainties?
# 14. What is the company's current ratio and quick ratio?
# 15. How has the company's stock price performed over the past year?

Then, we retrieve documents for each generated question. To do this, we’ll query a Chroma collection where we’ve previously indexed a financial report.

results = chroma_collection.query(query_texts=generated_queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']

We evaluate the relevance of each question to its corresponding documents. Once again, we’ll use an LLM for this task:

PROMPT_EVALUATION = """
You are a helpful expert financial research assistant. 
You help users analyze financial statements to better understand companies.
For the given query, evaluate whether the following statement is relevant.
Output only 'yes' or 'no'.
"""

def evaluate_results(query, statement, model="gpt-3.5-turbo"):
    messages = [
    {
        "role": "system",
        "content": PROMPT_EVALUATION,
    },
    {
        "role": "user",
        "content": f"Query: {query}, Statement: {statement}"
    }
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1
    )
    content = response.choices[0].message.content
    if content == "yes":
        return 1
    return -1

Now we structure the training data into tuples.

Each tuple will contain the embedding of the query, the embedding of a document, and the evaluation label (1, -1).

import numpy as np
import torch
from tqdm import tqdm

retrieved_embeddings = results['embeddings']
query_embeddings = embedding_function(generated_queries)

adapter_query_embeddings = []
adapter_doc_embeddings = []
adapter_labels = []

for q, query in enumerate(tqdm(generated_queries)):
    for d, document in enumerate(retrieved_documents[q]):
        adapter_query_embeddings.append(query_embeddings[q])
        adapter_doc_embeddings.append(retrieved_embeddings[q][d])
        adapter_labels.append(evaluate_results(query, document))

Once the tuples are created, we put them in a Torch Dataset to prepare for training.

adapter_query_embeddings = torch.Tensor(np.array(adapter_query_embeddings))
adapter_doc_embeddings = torch.Tensor(np.array(adapter_doc_embeddings))
adapter_labels = torch.Tensor(np.expand_dims(np.array(adapter_labels),1))
dataset = torch.utils.data.TensorDataset(adapter_query_embeddings, adapter_doc_embeddings, adapter_labels)

Define a model

We define a function that takes the query embedding, the document embedding, and the adaptor matrix as input. This function first multiplies the query embedding with the adaptor matrix and computes a cosine similarity between this result and the document embedding.

def model(query_embedding, document_embedding, adaptor_matrix):
    updated_query_embedding = torch.matmul(adaptor_matrix, query_embedding)
    return torch.cosine_similarity(updated_query_embedding, document_embedding, dim=0)

Define the loss

Our goal is to make the cosine similarity computed by the previous function match the relevancy label (close to 1 for relevant pairs, -1 for irrelevant ones). To do this, we'll use a Mean Squared Error (MSE) loss to optimize the weights of the adaptor matrix.

def mse_loss(query_embedding, document_embedding, adaptor_matrix, label):
    return torch.nn.MSELoss()(model(query_embedding, document_embedding, adaptor_matrix), label)

Run backpropagation

In this step, we first initialize the adaptor matrix and train it over 100 epochs.

# Initialize the adaptor matrix with random weights
mat_size = len(adapter_query_embeddings[0])
adapter_matrix = torch.randn(mat_size, mat_size, requires_grad=True)

min_loss = float('inf')
best_matrix = None
for epoch in tqdm(range(100)):
    for query_embedding, document_embedding, label in dataset:
        loss = mse_loss(query_embedding, document_embedding, adapter_matrix, label)
        # Keep the matrix that achieved the lowest loss so far
        if loss < min_loss:
            min_loss = loss
            best_matrix = adapter_matrix.clone().detach().numpy()
        # Manual gradient descent step on the adaptor matrix
        loss.backward()
        with torch.no_grad():
            adapter_matrix -= 0.01 * adapter_matrix.grad
            adapter_matrix.grad.zero_()

Once the training is complete, the adapter can be used to scale the original embedding and adapt to the user task.

All you need now is to take the original embedding output and multiply it with the adaptor matrix before feeding it to the retrieval system.

test_vector = torch.ones((mat_size, 1))
scaled_vector = np.matmul(best_matrix, test_vector.numpy())

test_vector.shape
# torch.Size([384, 1])

scaled_vector.shape
# (384, 1)

best_matrix.shape
# (384, 384)
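At query time, you could for instance project the query embedding through the trained matrix before handing it to Chroma (a sketch, assuming the same embedding_function and chroma_collection as above; the query string is just an example):

query = "What is the company's profit margin?"
query_embedding = np.array(embedding_function([query])[0])   # shape: (384,)
adapted_embedding = np.matmul(best_matrix, query_embedding)  # scaled by the adapter

results = chroma_collection.query(
    query_embeddings=[adapted_embedding.tolist()],
    n_results=10,
    include=["documents"],
)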

Thanks for reading

The retrieval techniques we covered help improve the relevancy of the retrieved documents.

There is, however, ongoing research in this area, and other methods are currently being assessed and explored. For example:

  • Fine-tuning the embedding model using real feedback data
  • Fine-tuning the LLM directly to maximize its retrieval power (RA-DIT: Retrieval Augmented Dual Instruction Tuning)
  • Exploring more complex embedding adaptors using deep neural networks instead of matrices
  • Deep and intelligent chunking techniques

More on them later.

If you enjoyed this article, don’t forget to share it and follow me on Medium.

Until next time 👋.

Programming
Machine Learning
Artificial Intelligence
Data Science
Editors Pick