How Self-RAG Could Revolutionize Industrial LLMs
Let’s face it: vanilla RAG is pretty dumb. There’s no guarantee that the responses it returns are relevant. Learn how Self-RAG can help.

Large language models (LLMs) are all set to revolutionize various industries. Take the financial sector, where LLMs can pore over troves of documents and find trends in a fraction of the time, and at a fraction of the cost, of analysts doing the same task. But here’s the catch: the answers you get are often partial or incomplete. Take, for example, a document that contains company X’s annual revenue over the past 15 years, spread across different sections. In the standard Retrieval Augmented Generation (RAG) architecture as pictured below, you typically retrieve the top-k documents, or choose documents that fit within a fixed context length.

However, this has several issues. One is that the top-k documents may not contain the whole answer; they might, for example, cover only the last 5 or 10 years. The other is that computing similarity between document chunks and the prompt does not always yield relevant context, in which case you could get a wrong answer.
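To make both failure modes concrete, here is a minimal sketch of top-k retrieval. It assumes a toy bag-of-words similarity in place of a real embedding model, and the chunks are made up for illustration around the company X example.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" (assumption: a stand-in for a real embedding model).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k):
    # Rank chunks by similarity to the query and keep only the k best.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Hypothetical chunks from different sections of one document.
chunks = [
    "Company X annual revenue for 2019 to 2023 grew steadily.",
    "Company X annual revenue for 2014 to 2018 was flat.",
    "Sales in 2009 to 2013 totalled far less, per the appendix.",
    "Company X opened a new office.",
]

retrieved = top_k("Company X annual revenue over the past 15 years", chunks, k=2)
for chunk in retrieved:
    print(chunk)
```

With k=2, the chunk covering the oldest five years never reaches the LLM because it is worded differently from the query, and the irrelevant "new office" chunk actually scores higher than it does.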
The practical danger is that your vanilla RAG app works well on the simple cases you test, then fails when you present the prototype to stakeholders and they ask some out-of-the-box questions.
This is where Self-RAG comes to the rescue! The authors develop a clever way for a fine-tuned LM (Llama2-7B and 13B) to output special tokens such as [Retrieval], [No Retrieval], [Relevant], [Irrelevant], [No support / Contradictory], [Partially supported], and [Utility] alongside its generations. These tokens indicate whether retrieval is needed, whether a retrieved context is relevant, whether the text generated from that context is supported by it, and how useful the generation is.
Training Self-RAG
Self-RAG was trained in a two-step process. In step 1, a simple LM was trained to classify a generated output (either just the prompt or the prompt plus RAG-augmented output) and append the relevant special token at the end. This “critic model” was trained on GPT-4 annotations. Specifically, GPT-4 was prompted with a type-specific instruction, for example: “Given an instruction, make a judgment on whether finding some external documents from the web helps to generate a better response.”
In step 2, the generator model, using a standard next-token prediction objective, learns to generate continuations as well as the special tokens used to retrieve and critique generations. Unlike other fine-tuning or RLHF methods, where downstream training can shift model outputs and bias future generations, this simple approach trains the model only to generate special tokens where appropriate and otherwise leaves the underlying LM unchanged. Which is brilliant!
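To picture what the generator sees in step 2, here is a rough sketch of how a critic-annotated training sequence might be assembled. The function name `build_training_example` is hypothetical, and the exact token layout is an assumption inferred from the prompt format and model outputs in the inference section of this post, not taken from the authors' training code.

```python
def build_training_example(instruction, answer, paragraph=None,
                           relevance="[Relevant]",
                           support="[Fully supported]",
                           utility=5):
    # Hypothetical helper: interleaves critic labels with the answer text so that
    # plain next-token prediction also learns to emit the special tokens.
    prompt = "### Instruction:\n{0}\n\n### Response:\n".format(instruction)
    if paragraph is None:
        # Critic judged that no retrieval was needed.
        target = "[No Retrieval]{0}[Utility:{1}]".format(answer, utility)
    else:
        # Critic judged retrieval useful, then labelled relevance and support.
        target = "[Retrieval]<paragraph>{0}</paragraph>{1}{2}{3}[Utility:{4}]".format(
            paragraph, relevance, answer, support, utility
        )
    return prompt, target

prompt, target = build_training_example(
    "Can you tell me the difference between llamas and alpacas?",
    "Llamas are larger than alpacas.",
    paragraph="Llamas range from 200 to 350 lbs., while alpacas weigh in at 100 to 175 lbs.",
)
print(prompt + target)
```

The point of the sketch is that the special tokens are just part of the target text, so no separate loss or reward model is needed for the generator.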
Evaluating Self-RAG
The authors ran a range of evaluations covering public health fact verification, multiple-choice reasoning, Q&A, and more. There were three types of tasks. Closed-set tasks included fact verification and multiple-choice reasoning, with accuracy as the evaluation metric. Short-form generation tasks included open-domain Q&A datasets; here the authors checked whether gold answers are included in the model generations instead of strictly requiring exact matching.
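The difference between the two short-form metrics is easy to see in code. This is a minimal sketch, assuming simple lowercase string comparison; the function names and the toy data are mine, not the authors' evaluation scripts.

```python
def exact_match(prediction, gold_answers):
    # Strict metric: the whole prediction must equal one of the gold answers.
    norm = prediction.strip().lower()
    return any(norm == gold.strip().lower() for gold in gold_answers)

def answer_included(prediction, gold_answers):
    # Lenient metric: a gold answer only has to appear somewhere in the prediction.
    norm = prediction.lower()
    return any(gold.lower() in norm for gold in gold_answers)

def accuracy(predictions, gold_lists, metric):
    # Fraction of predictions that the chosen metric counts as correct.
    hits = sum(metric(p, g) for p, g in zip(predictions, gold_lists))
    return hits / len(predictions)

# Toy data for illustration only.
predictions = ["The capital of France is Paris.", "Berlin"]
gold_lists = [["Paris"], ["Berlin"]]

print(accuracy(predictions, gold_lists, exact_match))
print(accuracy(predictions, gold_lists, answer_included))
```

A chatty but correct answer like the first prediction fails exact match, which is why the inclusion-based check suits generative models better.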
Long-form generation included biography generation and long-form QA. Biographies were evaluated with FactScore, which is basically a measure of the various pieces of information generated and their factual correctness. For long-form QA, citation precision and recall were used.

Self-RAG performs the best among non-proprietary models, and in most cases the larger 13B-parameter model outperforms the 7B model. It even outperforms ChatGPT in some cases.
Inference
For inference, the Self-RAG repository suggests using vLLM, a library for LLM inference.
After pip installing vllm, you can load the libraries and query the model as follows:

from vllm import LLM, SamplingParams

# download_dir is the cache path from the original example; point it at your own directory
model = LLM("selfrag/selfrag_llama2_7b", download_dir="/gscratch/h2lab/akari/model_cache", dtype="half")
# skip_special_tokens=False keeps the special tokens visible in the decoded text
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=100, skip_special_tokens=False)

def format_prompt(input, paragraph=None):
    prompt = "### Instruction:\n{0}\n\n### Response:\n".format(input)
    if paragraph is not None:
        prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
    return prompt

query_1 = "Leave odd one out: twitter, instagram, whatsapp."
query_2 = "Can you tell me the difference between llamas and alpacas?"
queries = [query_1, query_2]

# for a query that doesn't require retrieval
preds = model.generate([format_prompt(query) for query in queries], sampling_params)
for pred in preds:
    print("Model prediction: {0}".format(pred.outputs[0].text))

For a query that requires retrieval, you can supply the necessary information as a string, as in the example below.

paragraph = """Llamas range from 200 to 350 lbs., while alpacas weigh in at 100 to 175 lbs."""

def format_prompt_p(input, paragraph=paragraph):
    prompt = "### Instruction:\n{0}\n\n### Response:\n".format(input)
    if paragraph is not None:
        prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
    return prompt

query_1 = "Leave odd one out: twitter, instagram, whatsapp."
query_2 = "Can you tell me the differences between llamas and alpacas?"
queries = [query_1, query_2]

# both queries are formatted with the same supplied paragraph
preds = model.generate([format_prompt_p(query) for query in queries], sampling_params)
for pred in preds:
    print("Model prediction: {0}".format(pred.outputs[0].text))

[Irrelevant]Whatsapp is the odd one out.
[No Retrieval]Twitter and Instagram are both social media platforms,
while Whatsapp is a messaging app.[Utility:5]
[Relevant]Llamas are larger than alpacas, with males weighing up to 350 pounds.
[Partially supported][Utility:5]

In the above example, the paragraph is irrelevant to the first query (about social media platforms), as reflected by the [Irrelevant] token at the beginning of the generation. The paragraph is, however, relevant to the second query (about llamas and alpacas). As you can see, the model uses this information in its generated response, which is marked with the [Relevant] token.
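If you want to act on these tokens in your own app, you first need to pull them out of the raw text. Here is a minimal sketch, assuming every special token is a square-bracketed span like the ones shown in this post; `parse_generation` is a hypothetical helper, not part of the Self-RAG repository or vLLM.

```python
import re

# Matches any [bracketed] span, e.g. [Relevant], [Utility:5],
# or [No support / Contradictory].
TOKEN_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def parse_generation(text):
    # Split a raw generation into its special tokens and the plain answer text.
    tokens = TOKEN_PATTERN.findall(text)
    plain = TOKEN_PATTERN.sub("", text).strip()
    utility = None
    for token in tokens:
        if token.startswith("Utility:"):
            utility = int(token.split(":", 1)[1])
    return {"tokens": tokens, "text": plain, "utility": utility}

output = ("[Relevant]Llamas are larger than alpacas, with males weighing "
          "up to 350 pounds.\n[Partially supported][Utility:5]")
parsed = parse_generation(output)
print(parsed["tokens"])
print(parsed["text"])
print(parsed["utility"])
```

One caveat with this simple approach: it would also strip legitimate bracketed text in an answer, such as citation markers like [1], so a production version should match against the known token list instead.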
In the next example, the context “I like Avocado.” is unrelated to either prompt. As you can see below, the model prediction starts off with [Irrelevant] for both queries, and the model just uses its internal knowledge to answer the prompt.

paragraph = """I like Avocado."""

def format_prompt_p(input, paragraph=paragraph):
    prompt = "### Instruction:\n{0}\n\n### Response:\n".format(input)
    if paragraph is not None:
        prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
    return prompt

query_1 = "Leave odd one out: twitter, instagram, whatsapp."
query_2 = "Can you tell me the differences between llamas and alpacas?"
queries = [query_1, query_2]

# both queries are formatted with the same (unrelated) paragraph
preds = model.generate([format_prompt_p(query) for query in queries], sampling_params)
for pred in preds:
    print("Model prediction: {0}".format(pred.outputs[0].text))

Model prediction: [Irrelevant]Twitter is the odd one out.[Utility:5]
[Irrelevant]Sure![Continue to Use Evidence]
Alpacas are a much smaller than llamas.
They are also bred specifically for their fiber.[Utility:5]

Takeaways
Self-RAG has several advantages over the vanilla LLM.
- Adaptive passage retrieval: The LLM can keep retrieving context until all the relevant context is found (within the context window, of course).
- More relevant retrieval: A lot of times, embedding models are not the best at retrieving relevant context. Self-RAG potentially solves this with the relevant/irrelevant special token.
- Beats other similar models: Self-RAG beats other similar models, and surprisingly also beats ChatGPT in some tasks. It would be interesting to run a comparison on data that ChatGPT has not been trained on, such as more proprietary, industrial data.
- Doesn’t change underlying LM: For me this is a huge selling point, as we know how easily fine-tuning and RLHF can lead to biased models. Self-RAG seems to solve this by adding special tokens and otherwise keeping text generation the same.
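To show how the first two advantages could be put to work, here is a rough sketch that generates one answer per retrieved passage and keeps the one the model itself rates best. Everything here is an assumption for illustration: the scoring weights are made up, [Fully supported] is assumed by analogy with the tokens listed earlier, `generate_fn` is a hypothetical callable wrapping your model, and this is not the authors' decoding procedure.

```python
# Illustrative weights only (assumption), not values from the Self-RAG authors.
RELEVANCE = {"[Relevant]": 1.0, "[Irrelevant]": 0.0}
SUPPORT = {
    "[Fully supported]": 1.0,  # assumed token name
    "[Partially supported]": 0.5,
    "[No support / Contradictory]": 0.0,
}

def critique_score(generation):
    # Add up what the model's own special tokens say about this generation.
    score = 0.0
    for table in (RELEVANCE, SUPPORT):
        for token, value in table.items():
            if token in generation:
                score += value
                break
    for level in range(1, 6):
        if "[Utility:{0}]".format(level) in generation:
            score += (level - 1) / 4  # map utility 1..5 onto 0..1
            break
    return score

def best_generation(query, passages, generate_fn):
    # generate_fn(query, passage) -> raw text with special tokens (hypothetical).
    candidates = [(p, generate_fn(query, p)) for p in passages]
    return max(candidates, key=lambda c: critique_score(c[1]))

# Canned outputs stand in for real model calls so the sketch runs anywhere.
passages = [
    "I like Avocado.",
    "Llamas range from 200 to 350 lbs., while alpacas weigh in at 100 to 175 lbs.",
]
canned = {
    passages[0]: "[Irrelevant]Alpacas are much smaller than llamas.[Utility:5]",
    passages[1]: "[Relevant]Llamas are larger than alpacas.[Partially supported][Utility:5]",
}
best_passage, best_text = best_generation(
    "Can you tell me the differences between llamas and alpacas?",
    passages,
    lambda query, passage: canned[passage],
)
print(best_passage)
```

In a real app you would replace the lambda with a call to the model using the prompt format from the inference section.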
There is still room for improvement in dealing with fixed context lengths. This might be achieved by adding a summarization component to Self-RAG; in fact, there has been some previous work on this (see: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation). Another exciting direction is the increase in context window length that just came out from OpenAI with the GPT-4 128k context window update. However, as mentioned in forums, this context window represents the input length, while the output limit is still 4k tokens.
RAG represents one of the most exciting ways industries can incorporate LLMs on their data to generate real business impact. However, there has not been much RAG-specific tuning of language models. I’m excited for future improvements in this space.
The inference code is in this GitHub repo: