
LANGCHAIN — What Is the Spade Tool for Automatically Digging Up Evals Based on Prompt Refinements?
Computer science is no more about computers than astronomy is about telescopes. — Edsger W. Dijkstra.
Spade, short for System for Prompt Analysis and Delta-based Evaluation, automatically recommends evaluation functions for custom LLM chains. Developed in collaboration with LangChain, it uses prompt refinements to suggest Python evaluation functions that can be run on all chain input-output pairs. These functions can be as simple as checking the format of a response or as complex as verifying that the response adheres to specific criteria. This tutorial looks at how Spade works and how you can use it for your LLM chains.
Understanding Spade ♠️
Spade uses prompt version history to identify the key evaluation criteria, guardrails, and constraints that developers encode when they refine or change a prompt. By analyzing these refinements, Spade recommends evaluation functions that can be used to monitor and improve the reliability of an LLM deployment.
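To make "analyzing refinements" concrete, here is a minimal sketch of pulling the newly added instructions out of two prompt versions. It uses Python's standard `difflib` module and is not Spade's actual implementation; the helper name `added_lines` and both prompt versions are hypothetical, chosen only for illustration.

```python
import difflib


def added_lines(old_prompt: str, new_prompt: str) -> list[str]:
    """Return the lines that appear in new_prompt but not in old_prompt."""
    diff = difflib.ndiff(old_prompt.splitlines(), new_prompt.splitlines())
    # ndiff prefixes added lines with "+ "; strip that two-character marker.
    return [line[2:] for line in diff if line.startswith("+ ")]


# Hypothetical prompt versions (assumptions, not taken from a real chain).
prompt_v1 = (
    "You are a fashion stylist.\n"
    "Suggest an outfit for the client's event."
)
prompt_v2 = (
    "You are a fashion stylist.\n"
    "Suggest an outfit for the client's event.\n"
    "Do not suggest white items for weddings unless the client explicitly asks for them."
)

new_instructions = added_lines(prompt_v1, prompt_v2)
for instruction in new_instructions:
    print("Added:", instruction)
```

Each added line is a candidate constraint that an evaluation function could later check. A line-level diff is the simplest possible choice here; a real tool would likely need something more robust to rewording.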
Digging into LLM evals with Spade ♠️
Spade first identifies the differences between prompt versions and categorizes these changes based on a taxonomy developed by the Berkeley research team. For each category identified, Spade prompts an LLM to write a relevant Python evaluation function. These evaluation functions accept prompt and response pairs as arguments and return boolean values, allowing them to be aggregated and visualized across many chain runs.
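Because every evaluation function shares the same shape (prompt and response in, boolean out), aggregating results across runs is straightforward. The sketch below assumes two hypothetical checks, `check_nonempty` and `check_under_50_words`, and a hypothetical helper `pass_rates`; none of these names come from Spade itself.

```python
from typing import Callable

# An evaluation function takes (prompt, response) and returns True on pass.
EvalFn = Callable[[str, str], bool]


def check_nonempty(prompt: str, response: str) -> bool:
    """Hypothetical check: the response contains some non-whitespace text."""
    return bool(response.strip())


def check_under_50_words(prompt: str, response: str) -> bool:
    """Hypothetical check: the response is at most 50 words long."""
    return len(response.split()) <= 50


def pass_rates(evals: list[EvalFn], runs: list[tuple[str, str]]) -> dict[str, float]:
    """Compute the fraction of runs that pass each evaluation function."""
    rates: dict[str, float] = {}
    for fn in evals:
        results = [fn(prompt, response) for prompt, response in runs]
        rates[fn.__name__] = sum(results) / len(results) if results else 0.0
    return rates


# Hypothetical chain runs as (prompt, response) pairs.
runs = [
    ("Suggest an outfit for a beach party.", "A linen shirt with sandals."),
    ("Suggest an outfit for a gala.", ""),
]

rates = pass_rates([check_nonempty, check_under_50_words], runs)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} of runs passed")
```

A per-function pass rate like this is one simple way to visualize reliability over time; a drop in one rate points at the specific constraint the chain has started to violate.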
Let's walk through the process using an example prompt refinement from Alta:
    def check_excludes_white_wedding(prompt: str, response: str) -> bool:
        """
        This function checks if the response does not include white items for wedding-related events,
        unless explicitly stated by the client.
        """
        # Check if event is wedding-related
        if "wedding" in prompt.lower() and "my wedding" not in prompt.lower():
            # Check if the response includes the word "white"
            return "white" not in response.lower()
        else:
            return True

As seen in the example above, Spade generates evaluation functions based on prompt refinements. In this case, the evaluation function checks that the response does not include white items for wedding-related events, unless explicitly stated by the client. This illustrates how Spade automates the creation of evaluation functions based on prompt changes.
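To see how a check like this behaves, it helps to run it against a few prompt and response pairs. The snippet below repeats the function so that it runs on its own; the four test cases are hypothetical and were written only to exercise each branch.

```python
def check_excludes_white_wedding(prompt: str, response: str) -> bool:
    """Pass unless a wedding-related prompt (not the client's own wedding) gets a 'white' response."""
    if "wedding" in prompt.lower() and "my wedding" not in prompt.lower():
        return "white" not in response.lower()
    return True


# Hypothetical (prompt, response) pairs covering each branch of the check.
cases = [
    ("What should I wear to a friend's wedding?", "A navy midi dress."),
    ("What should I wear to a friend's wedding?", "A white lace dress."),
    ("What should I wear to my wedding?", "A white gown."),
    ("What should I wear to a picnic?", "White sneakers and jeans."),
]

results = [check_excludes_white_wedding(prompt, response) for prompt, response in cases]
for (prompt, response), passed in zip(cases, results):
    print("PASS" if passed else "FAIL", "|", prompt, "->", response)
```

Only the second case fails. Note that the check relies on plain substring matching, so it is an approximation: a response containing a word such as "whitening" would also be flagged, which is why reviewing generated functions before relying on them is worthwhile.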
Current prototype and feedback
The current prototype of Spade is a preliminary research tool for suggesting evaluation functions. It identifies all possible prompt refinement categories and generates a Python function for each category that indicates a relevant evaluation. While there is room for improvement, it is already valuable for developers interested in this area of research. Users can give explicit feedback on the generated functions and help refine them further.
Conclusion
Spade offers a promising approach to automatically generate evaluation functions based on prompt refinements, enabling more effective monitoring and improvement of LLM deployment reliability. The collaboration between Berkeley and LangChain has resulted in a tool that has the potential to greatly enhance the practical application of LLM chains in real-world scenarios.
To explore the Spade prototype and provide feedback, visit the Spade Prototype. For those interested in this research area or looking to deploy evaluation functions in an observability tool, there is an opportunity to connect with the Berkeley research team by filling out the interest form. Additionally, the code for Spade is freely available on the linked GitHub repo.
The Spade tool represents an exciting advancement in the field of automated evaluation function generation for LLM chains and holds great potential for the future of language model deployment and management.






