3 Advanced Document Retrieval Techniques To Improve RAG Systems
Query expansion, cross-encoder re-ranking, and embedding adaptors

Have you ever observed that documents retrieved by RAG systems may not always align with the user’s query?
This is a common occurrence, particularly with off-the-shelf RAG implementations. Documents may lack complete answers to the query, contain redundant information, or include irrelevant details. Furthermore, the order in which these documents are presented may not consistently match the user’s intent.
In this post, we will explore three effective techniques to enhance document retrieval in RAG-based applications:
- Query expansion
- Cross-encoder re-ranking
- Embedding adaptors
By incorporating these techniques, you can retrieve more pertinent documents that closely match the user’s query, thereby increasing the impact of the generated answer.
Let’s have a look 👇.
If you’re interested in ML content, detailed tutorials and practical tips from the industry, follow my newsletter. It’s called The Tech Buffet.
1 — Query expansion 💥
Query expansion refers to a set of techniques that rephrase the original query.
In this article, we'll look at two popular methods that are easy to implement.
👉 Query expansion with a generated answer
Given an input query, this method first instructs an LLM to provide a hypothetical answer, regardless of its correctness.
Then, the query and the generated answer are combined in a prompt and sent to the retrieval system.

This technique works surprisingly well. Check the findings of this paper to learn more about it.
The rationale behind this method is that we want to retrieve documents that look more like an answer. The correctness of the hypothetical answer doesn’t matter much because what we’re interested in is its structure and formulation.
You could consider the hypothetical answer as a template that helps identify a relevant neighborhood in the embedding space.
Here’s an example of a prompt I used to augment the query sent to a RAG that answers questions about financial reports.
You are a helpful expert financial research assistant.
Provide an example answer to the given question, that might
be found in a document like an annual report.
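To make this concrete, here's a minimal sketch of the technique. It assumes an OpenAI client (openai_client) and a Chroma collection (chroma_collection) like the ones set up later in this post; the prompt variable, helper name, and sample query are illustrative.
def augment_query_generated(query, model="gpt-3.5-turbo"):
    # Ask the LLM for a hypothetical answer, using the prompt above as the system message
    messages = [
        {"role": "system", "content": PROMPT_HYPOTHETICAL_ANSWER},  # illustrative name for the prompt shown above
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

query = "Was there significant turnover in the executive team?"
hypothetical_answer = augment_query_generated(query)

# Combine the original query with the hypothetical answer and send it to the retriever
joint_query = f"{query} {hypothetical_answer}"
results = chroma_collection.query(query_texts=[joint_query], n_results=5, include=["documents"])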
👉 Query expansion with multiple related questions
This second method instructs an LLM to generate N questions related to the original query and then sends them all (+ the original query) to the retrieval system.
By doing this, more documents will be retrieved from the vectorstore. However, some of them will be duplicates, which is why you need a post-processing step to remove them.

The idea behind this method is to extend an initial query that may be incomplete or ambiguous by incorporating related aspects that may turn out to be relevant and complementary.
Here’s a prompt I used to generate the related questions:
You are a helpful expert financial research assistant.
Your users are asking questions about an annual report.
Suggest up to five additional related questions to help them
find the information they need, for the provided question.
Suggest only short questions without compound sentences.
Suggest a variety of questions that cover different aspects of the topic.
Make sure they are complete questions, and that they are related to
the original question.
Output one question per line. Do not number the questions.
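Here's a minimal sketch of how this could look in code, again assuming an OpenAI client (openai_client) and a Chroma collection (chroma_collection) as set up later in this post; the prompt variable and sample query are illustrative.
def augment_multiple_queries(query, model="gpt-3.5-turbo"):
    # Ask the LLM for related questions, one per line, using the prompt above as the system message
    messages = [
        {"role": "system", "content": PROMPT_RELATED_QUESTIONS},  # illustrative name for the prompt shown above
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content.split("\n")

query = "What were the most important factors that contributed to increases in revenue?"
related_questions = augment_multiple_queries(query)

# Query the vectorstore with the original query plus its related questions
results = chroma_collection.query(query_texts=[query] + related_questions, n_results=5, include=["documents"])

# Flatten the results and remove duplicate documents
retrieved_documents = list({doc for docs in results["documents"] for doc in docs})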
The downside of this method is that we end up with a lot more documents that may distract the LLM from generating a useful answer.
That’s where re-ranking comes into play 👇.
To learn more about different query expansion techniques, check this paper from Google.
2 — Cross encoder re-ranking 📊
This method re-ranks the retrieved documents according to a score that quantifies their relevancy to the input query.

To compute this score, we will use a cross-encoder.
A cross-encoder is a deep neural network that processes two input sequences together as a single input. This allows the model to directly compare and contrast the inputs, understanding their relationship in a more integrated and nuanced way.

Cross-encoders can be used for information retrieval: given a query, score it against every retrieved document, then sort the documents in decreasing order of score. The highest-scored documents are the most relevant ones.
See SBERT.net Retrieve & Re-rank for more details.

Here’s how to quickly get started with re-ranking using cross-encoders:
- Install sentence-transformers:
pip install -U sentence-transformers
- Import the cross-encoder and load it:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
- Score each pair of (query, document):
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)
print("Scores:") for score in scores:
print(score)
# Scores:
# 0.98693466
# 2.644579
# -0.26802942
# -10.73159
# -7.7066045
# -5.6469955
# -4.297035
# -10.933233
# -7.0384283
# -7.3246956
- Reorder the documents:
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
print(o+1)
Cross-encoder re-ranking can be used with query expansion: after you generate multiple related questions and retrieve the corresponding documents (say you end up with M documents), you re-rank them and pick the top K (K < M). That way, you reduce the context size while selecting the most important pieces.
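As an illustration, here's a rough sketch that reuses the cross_encoder loaded above and the deduplicated retrieved_documents obtained from the expanded queries; the value of K is arbitrary.
# Score each deduplicated document against the original query
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)

# Keep only the top K highest-scored documents to pass to the LLM
K = 5
top_k_documents = [retrieved_documents[i] for i in np.argsort(scores)[::-1][:K]]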
In the next section, we will dive into adaptors, a powerful yet simple-to-implement technique to scale embeddings to better align with the user’s task.
3 — Embedding adaptors 🧩
This method leverages user feedback on the relevancy of the retrieved documents to train an adapter.
An adapter is a lightweight alternative to fully fine-tuning a pre-trained model. Adapters are typically implemented as small feed-forward neural networks inserted between the layers of a pre-trained model.
The underlying goal of training an adapter is to alter the query embedding so that it produces better retrieval results for a specific task.
An embedding adapter is a stage that can be inserted after the embedding phase and before retrieval. Think of it as a matrix (with trained weights) that takes the original embedding and scales it.

To train an adapter, we need to go through the following steps.
Prepare the training data
To train an embedding adapter, we need some training data on the relevancy of the documents. This data can be manually labeled or generated by an LLM.
This data must include tuples of (query, document) as well as their corresponding labels (1 if the document is relevant to the query, -1 otherwise).
For the sake of simplicity, we're going to create a synthetic dataset, but in real-world settings, you need to find a way to collect user feedback (e.g. ask users to rate the relevancy of each document from the interface with 👍 and 👎).
To create some training data, we first generate sample questions that a financial analyst may ask when analyzing a financial report.
Let’s use an LLM for this:
import os
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv

# Load the OpenAI API key from a .env file and create a client
_ = load_dotenv(find_dotenv())
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

PROMPT_DATASET = """
You are a helpful expert financial research assistant.
You help users analyze financial statements to better understand companies.
Suggest 10 to 15 short questions that are important to ask when analyzing
an annual report.
Do not output any compound questions (questions with multiple sentences
or conjunctions).
Output each question on a separate line divided by a newline.
"""

def generate_queries(model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": PROMPT_DATASET,
        },
    ]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content.split("\n")

generated_queries = generate_queries()
for query in generated_queries:
    print(query)
# 1. What is the company's revenue growth rate over the past three years?
# 2. What are the company's total assets and total liabilities?
# 3. How much debt does the company have? Is it increasing or decreasing?
# 4. What is the company's profit margin? Is it improving or declining?
# 5. What are the company's cash flow from operations, investing, and financing activities?
# 6. What are the company's major sources of revenue?
# 7. Does the company have any pending litigation or legal issues?
# 8. What is the company's market share compared to its competitors?
# 9. How much cash does the company have on hand?
# 10. Are there any major changes in the company's executive team or board of directors?
# 11. What is the company's dividend history and policy?
# 12. Are there any related party transactions?
# 13. What are the company's major risks and uncertainties?
# 14. What is the company's current ratio and quick ratio?
# 15. How has the company's stock price performed over the past year?
Then, we retrieve documents for each generated question. To do this, we’ll query a Chroma collection where we’ve previously indexed a financial report.
results = chroma_collection.query(query_texts=generated_queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']
Next, we evaluate the relevance of each retrieved document to its corresponding question. Once again, we'll use an LLM for this task:
PROMPT_EVALUATION = """
You are a helpful expert financial research assistant.
You help users analyze financial statements to better understand companies.
For the given query, evaluate whether the following statement is relevant.
Output only 'yes' or 'no'.
"""
def evaluate_results(query, statement, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": PROMPT_EVALUATION,
        },
        {
            "role": "user",
            "content": f"Query: {query}, Statement: {statement}"
        },
    ]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1,
    )
    content = response.choices[0].message.content
    if content == "yes":
        return 1
    return -1
Now we structure the training data into tuples.
Each tuple will contain the embedding of the query, the embedding of a document, and the evaluation label (1 or -1).
import torch
from tqdm import tqdm

retrieved_embeddings = results['embeddings']
query_embeddings = embedding_function(generated_queries)

adapter_query_embeddings = []
adapter_doc_embeddings = []
adapter_labels = []

for q, query in enumerate(tqdm(generated_queries)):
    for d, document in enumerate(retrieved_documents[q]):
        adapter_query_embeddings.append(query_embeddings[q])
        adapter_doc_embeddings.append(retrieved_embeddings[q][d])
        adapter_labels.append(evaluate_results(query, document))
Once the tuples are created, we put them in a Torch Dataset to prepare for training.
adapter_query_embeddings = torch.Tensor(np.array(adapter_query_embeddings))
adapter_doc_embeddings = torch.Tensor(np.array(adapter_doc_embeddings))
adapter_labels = torch.Tensor(np.expand_dims(np.array(adapter_labels),1))
dataset = torch.utils.data.TensorDataset(adapter_query_embeddings, adapter_doc_embeddings, adapter_labels)
Define a model
We define a function that takes the query embedding, the document embedding, and the adaptor matrix as input. This function first multiplies the query embedding by the adaptor matrix, then computes the cosine similarity between the result and the document embedding.
def model(query_embedding, document_embedding, adaptor_matrix):
    # Scale the query embedding with the adaptor matrix, then compare it to the document embedding
    updated_query_embedding = torch.matmul(adaptor_matrix, query_embedding)
    return torch.cosine_similarity(updated_query_embedding, document_embedding, dim=0)
Define the loss
Our goal is for the cosine similarity computed by the previous function to match the relevance label (1 or -1). To do this, we'll use a Mean Squared Error (MSE) loss between the two and optimize the weights of the adaptor matrix.
def mse_loss(query_embedding, document_embedding, adaptor_matrix, label):
    return torch.nn.MSELoss()(model(query_embedding, document_embedding, adaptor_matrix), label)
Run backpropagation
In this step, we first initialize the adaptor matrix and train it over 100 epochs.
# Initialize the adaptor matrix with random weights
mat_size = len(adapter_query_embeddings[0])
adapter_matrix = torch.randn(mat_size, mat_size, requires_grad=True)

min_loss = float('inf')
best_matrix = None

for epoch in tqdm(range(100)):
    for query_embedding, document_embedding, label in dataset:
        loss = mse_loss(query_embedding, document_embedding, adapter_matrix, label)
        # Keep track of the best adaptor matrix seen so far
        if loss < min_loss:
            min_loss = loss
            best_matrix = adapter_matrix.clone().detach().numpy()
        loss.backward()
        # Manual gradient descent step on the adaptor matrix
        with torch.no_grad():
            adapter_matrix -= 0.01 * adapter_matrix.grad
            adapter_matrix.grad.zero_()
Once the training is complete, the adapter can be used to scale the original embeddings and adapt them to the user's task.
All you need to do now is multiply the original embedding output by the adaptor matrix before feeding it to the retrieval system.
test_vector = torch.ones((mat_size, 1))
scaled_vector = np.matmul(best_matrix, test_vector.numpy())
test_vector.shape
# torch.Size([384, 1])
scaled_vector.shape
# (384, 1)
best_matrix.shape
# (384, 384)
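As a hypothetical query-time example (reusing the embedding_function and chroma_collection from above), scaling the query embedding before retrieval could look like this:
# Embed the user query, scale it with the trained adaptor matrix,
# then query the vectorstore with the adapted embedding
query = "What is the company's dividend policy?"
query_embedding = np.array(embedding_function([query])[0])
adapted_query_embedding = np.matmul(best_matrix, query_embedding)

results = chroma_collection.query(
    query_embeddings=[adapted_query_embedding.tolist()],
    n_results=10,
    include=["documents"],
)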
Thanks for reading
The retrieval techniques we covered help improve the relevancy of the retrieved documents.
There is, however, ongoing research in this area, and other methods are being assessed and explored. For example:
- Fine-tuning the embedding model using real feedback data
- Fine-tuning the LLM directly to maximize its retrieval power (RA-DIT: Retrieval Augmented Dual Instruction Tuning)
- Exploring more complex embedding adaptors using deep neural networks instead of matrices
- Deep and intelligent chunking techniques
More on them later.
If you enjoyed this article, don’t forget to share it and follow me on Medium.
Until next time 👋.