Prompt Engineering Is Dead: DSPy Is the New Paradigm for Prompting
I remember very clearly that until a few months ago, prompt engineering was all the hype and the job market was filled with prompt-engineer roles, but not so anymore. Prompt engineering was never really an art or a science; it was mostly a Clever Hans phenomenon, with humans supplying the context the system needs to answer better. People even wrote books and blogs like "Top 50 prompts to get the best out of GPT", and so on. But large-scale experiments have clearly shown that there is no single prompt or strategy that works for all kinds of problems; some prompts merely look better in isolation and turn out to be hit-and-miss when analyzed comprehensively. So, today we are going to talk about DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, a framework developed at Stanford for building self-improving pipelines, in which each LLM call is treated as a module that is optimized by a compiler, similar to the abstractions found in PyTorch.

Introduction
As I mentioned above, the internet is filled with prompting books and blogs, and most of them are just selling you a ton of crap. A few of them might actually work, but this is not a good way to build our apps. Knowing when something does not work is just as important as knowing when it does: we need to define a hypothesis space within which the system works and outside of which it does not.

There are even papers showing that with certain emotional prompts, LLM performance increases. I still have my reservations about the authenticity of such papers. How long does the effect hold? Is it true for every topic? Are there topics where this kind of emotional prompting leads to worse results? There are many papers like this that inadvertently put out half-baked research. Another paper like this was Embers of Autoregression, where a lot of the claims were shown to be wrong later on.

But the bigger question is: what kind of scientific or systematic approach is it if I have to tell a system "I might get fired if you don't give me the answer right away" or "my grandma is sick", and so on? This is just people randomly trying to hack the behavior of LLMs.
Understanding the Prompting Problem
For instance, when I say "Add 5-shot CoT with RAG, using hard negative examples", it is pretty clear conceptually, but really hard to implement in practice. LLM behavior is very sensitive to exactly how a prompt is written, so encoding this kind of structure in a prompt does not work most of the time, and that same sensitivity makes the models quite difficult to steer.
So, when we are building a pipeline, the task is not just convincing an LLM to produce output in a certain way; the output also has to be constrained so that it can serve as input to the other modules in the bigger pipeline.
There is already a lot of research aimed at this problem, but it is limited in many ways. Most approaches rely on string templates, which are brittle and unscalable: the language model changes over time and the prompt breaks; if we want to plug our module into a different pipeline, it doesn't work; if we want it to interact with newer tools, a new database, or a different retriever, it doesn't work.
This is the exact problem DSPy aims to solve: treating the LLM as a module and adapting its behavior automatically based on how it interacts with the other components in the pipeline.
DSPy Paradigm: Let’s program — not prompt — LMs
So, the goal of DSPy is to shift focus from tweaking the LLMs to good overarching system design.
But how to do it?
To build a mental model of this, we can think of LLMs as devices: components that execute instructions and are operated through an abstraction, much like the layers of a deep neural network.
For instance, we define a convolution layer in PyTorch and it operates on a set of inputs coming from other layers. Conceptually, we can stack these layers to achieve the desired level of abstraction over our original inputs; we never need to write CUDA kernels or other low-level instructions, because all of that is already abstracted away inside the definition of the convolution layer. This is what we wish to do with LLMs: treat them as abstracted modules, stacked in different combinations to achieve a certain type of behavior, be it CoT, ReAct, or something else.
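To make the analogy concrete, here is a minimal PyTorch sketch (the layer sizes are arbitrary, chosen only for illustration): the layers are declared and composed without ever touching the CUDA kernels underneath.
import torch
import torch.nn as nn

# Layers are declared once and stacked; the low-level kernels are abstracted away.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

features = model(torch.randn(1, 3, 32, 32))  # shape: (1, 32, 32, 32)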
In order to get the desired behavior, we need to change a few things:

NLP Signatures
These are simply declarations of the behavior we want from our LLMs. A signature defines what needs to be achieved, not how it will be achieved: it is a spec that tells DSPy what a transformation does, rather than how to prompt the LLM to do it.

- Signatures handle structured formatting and parsing logic.
- Signatures can be compiled into self-improving and pipeline-adaptive prompts or finetunes.
DSPy infers the roles of these fields using:
- Their names, e.g. DSPy will use in-context learning to interpret questions differently from answers.
- Their traces (input/output examples)
Note: None of this is hard-coded; the system figures it out during the compilation process.
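As a minimal sketch of what this looks like in code, DSPy also accepts inline string signatures, where the field names alone carry the declaration (the example question is made up):
import dspy

# "question -> answer" is a signature: it declares the fields, not the prompt.
qa = dspy.Predict("question -> answer")

# Once a language model is configured (shown later in this article),
# the declared behavior is directly callable:
# qa(question="Which castle did David Gregory inherit?").answer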
Modules
This is where we use signatures to build our modules: for example, if we want a CoT module, we build it on top of a signature. The module then automatically produces high-quality prompts that realize the behavior of the corresponding prompting technique.
A more technical definition: A module is a parameterized layer that expresses a signature by abstracting a prompting technique.

After it is declared, a module behaves like a callable function.
Parameters: To express a particular signature, any LLM call needs to specify:
- The specific LLM to call
- The prompt instructions
- The string prefix of each signature field
- The demonstrations used as few shot prompts and/or as fine-tuning data
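As a small sketch of how this looks in practice (the question is made up, and the exact output field names can vary across DSPy versions), declaring a module and calling it is all it takes:
import dspy

# ChainOfThought abstracts the "think step by step" technique for any signature.
cot_qa = dspy.ChainOfThought("question -> answer")

# The module is now a callable; DSPy assembles the instructions, field
# prefixes, and demonstrations behind the scenes.
# prediction = cot_qa(question="Who wrote 'Right Back At It Again'?")
# print(prediction.answer)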
Optimizers
To make this system work, an optimizer takes the entire pipeline and optimizes it against a chosen metric; in the process, it automatically comes up with the best prompts, and even the weights of the language model can be updated.
The high-level idea is that we use an optimizer to compile our code that makes language-model calls, so that each module in our pipeline ends up with an automatically generated prompt, or a newly fine-tuned set of weights, that fits the task we are trying to solve.
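At the code level, the pattern looks roughly like this (a sketch only: MyQAProgram and trainset are placeholders, and we will build a real version of each in the example below):
from dspy.teleprompt import BootstrapFewShot

# A metric scores an (example, prediction) pair; here, a simple exact match.
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# The optimizer compiles the program: it runs it over training examples and
# keeps the traces that pass the metric as few-shot demonstrations.
# optimizer = BootstrapFewShot(metric=exact_match)
# compiled_program = optimizer.compile(MyQAProgram(), trainset=trainset)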
Practical Example
A single search query is often not enough for complex QA tasks. For instance, an example within HotPotQA includes a question about the birth city of the writer of "Right Back At It Again". A search query often identifies the author correctly as "Jeremy McKinnon", but it cannot compose the intended answer on its own: a second hop is needed to determine where he was born.
The standard approach to this challenge in the retrieval-augmented NLP literature is to build multi-hop search systems, like GoldEn (Qi et al., 2019) and Baleen (Khattab et al., 2021). These systems read the retrieved results and then generate additional queries to gather further information when necessary, before arriving at a final answer. Using DSPy, we can easily simulate such systems in a few lines of code.
Today, achieving this usually means writing very complex prompts and structuring them in a messy way. Worse, as soon as the type of question changes, the whole system design may need to change with it. Not so with DSPy.
Configuring Language Model and Retrieval Model
import dspy
turbo = dspy.OpenAI(model='gpt-3.5-turbo')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)
Loading the Dataset
We make use of the aforementioned HotPotQA dataset, a collection of complex question-answer pairs that are typically answered in a multi-hop fashion. We can load the version of this dataset provided by DSPy through the HotPotQA class:
from dspy.datasets import HotPotQA
# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]
len(trainset), len(devset)
#Output
(20, 50)
Building the Signatures
Now that we have the data loaded, let's start defining the signatures for the sub-tasks of our Baleen pipeline.
We'll start by creating the GenerateAnswer signature, which takes context and question as input and gives answer as output. We'll also define a GenerateSearchQuery signature, which produces a search query from the same context and question.
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")


class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()
Building the Pipeline
So, let's define the program itself, SimplifiedBaleen. There are many possible ways to implement this, but we'll keep this version down to the key elements.
from dsp.utils import deduplicate
class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)
As we can see, the __init__ method defines a few key sub-modules:
- generate_query: For each hop, we will have one dspy.ChainOfThought predictor with the GenerateSearchQuery signature.
- retrieve: This module will conduct the search using the generated queries over our defined ColBERT RM search index via the dspy.Retrieve module.
- generate_answer: This dspy.ChainOfThought module will be used with the GenerateAnswer signature to produce the final answer.
The forward method uses these sub-modules in simple control flow:
- First, we'll loop up to self.max_hops times.
- In each iteration, we'll generate a search query using the predictor at self.generate_query[hop].
- We'll retrieve the top-k passages using that query.
- We'll add the (deduplicated) passages to our context accumulator.
- After the loop, we'll use self.generate_answer to produce an answer.
- We'll return a prediction with the retrieved context and the predicted answer.
Executing the Pipeline
Let’s execute this program in its zero-shot (uncompiled) setting.
This doesn’t necessarily imply the performance will be bad but rather that we’re bottlenecked directly by the reliability of the underlying LM to understand our sub-tasks from minimal instructions. Often, this is perfectly fine when using the most expensive/powerful models (e.g., GPT-4) on the easiest and most standard tasks (e.g., answering simple questions about popular entities).
# Ask any question you like to this simple RAG program.
my_question = "How many storeys are in the castle that David Gregory inherited?"
# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen() # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question)
# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")
#Output
Question: How many storeys are in the castle that David Gregory inherited?
Predicted Answer: five
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'The Boleyn Inheritance | The Boleyn Inheritance is a novel by British author Philippa Gregory which was first published in 2006. It is a direct sequel to her previous novel "The Other Boleyn Girl," an...', 'Gregory of Gaeta | Gregory was the Duke of Gaeta from 963 until his death. He was the second son of Docibilis II of Gaeta and his wife Orania. He succeeded his brother John II, who had left only daugh...', 'Kinnairdy Castle | Kinnairdy Castle is a tower house, having five storeys and a garret, two miles south of Aberchirder, Aberdeenshire, Scotland. The alternative name is Old Kinnairdy....', 'Kinnaird Head | Kinnaird Head (Scottish Gaelic: "An Ceann Àrd" , "high headland") is a headland projecting into the North Sea, within the town of Fraserburgh, Aberdeenshire on the east coast of Scotla...', 'Kinnaird Castle, Brechin | Kinnaird Castle is a 15th-century castle in Angus, Scotland. The castle has been home to the Carnegie family, the Earl of Southesk, for more than 600 years....']
Optimizing the Pipeline
However, a zero-shot approach quickly falls short for more specialized tasks, novel domains/settings, and more efficient (or open) models.
To address this, DSPy offers compilation. Let's compile our multi-hop (SimplifiedBaleen) program.
Let’s first define our validation logic for compilation:
- The predicted answer matches the gold answer.
- The retrieved context contains the gold answer.
- None of the generated queries is rambling (i.e., none exceeds 100 characters in length).
- None of the generated queries is roughly repeated (i.e., none is within 0.8 or higher F1 score of earlier queries).
def validate_context_and_answer_and_hops(example, pred, trace=None):
    if not dspy.evaluate.answer_exact_match(example, pred): return False
    if not dspy.evaluate.answer_passage_match(example, pred): return False

    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    if max([len(h) for h in hops]) > 100: return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True
We'll use one of the most basic teleprompters in DSPy, namely BootstrapFewShot, to optimize the predictors in the pipeline with few-shot examples.
from dspy.teleprompt import BootstrapFewShot
teleprompter = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
compiled_baleen = teleprompter.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=trainset)
Evaluating the Pipeline
Let’s now define our evaluation function and compare the performance of the uncompiled and compiled Baleen pipelines. While this devset does not serve as a completely reliable benchmark, it is instructive to use for this tutorial.
from dspy.evaluate.evaluate import Evaluate
# Define metric to check if we retrieved the correct documents
def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example["gold_titles"]))
    found_titles = set(
        map(dspy.evaluate.normalize_text, [c.split(" | ")[0] for c in pred.context])
    )
    return gold_titles.issubset(found_titles)
# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)
uncompiled_baleen_retrieval_score = evaluate_on_hotpotqa(uncompiled_baleen, metric=gold_passages_retrieved, display=False)
compiled_baleen_retrieval_score = evaluate_on_hotpotqa(compiled_baleen, metric=gold_passages_retrieved)
print(f"## Retrieval Score for uncompiled Baleen: {uncompiled_baleen_retrieval_score}")
print(f"## Retrieval Score for compiled Baleen: {compiled_baleen_retrieval_score}")
#Output
## Retrieval Score for uncompiled Baleen: 36.0
## Retrieval Score for compiled Baleen: 60.0
Conclusion

These results show that compiling a multi-hop program in DSPy clearly outperforms the uncompiled, hand-prompted version, and the paper reports that compiled programs can even surpass pipelines built around expert-written prompts. It also showed that a much smaller model, like T5, becomes competitive with GPT-3.5 when used in a compiled DSPy setting. DSPy is one of the coolest frameworks I've come across since the release of LangChain, and it shows great promise for building better, more systematically designed systems rather than wildly putting pieces together in a big LLM pipeline.
Please check out Solving Production Issues In Modern RAG Systems-I & II, Agentic workflows, RAG 2.0, and AI Agents Are All You need
Writing such articles is very time-consuming; show some love and respect by clapping and sharing the article. Happy learning ❤
Please don’t forget to subscribe to AIGuys Digest Newsletter