
LANGCHAIN — Benchmarking Question Answering over CSV Data
Computers are good at following instructions, but not at reading your mind. — Donald Knuth
In this tutorial, we take a deep dive into question answering over tabular data, specifically CSV data. We cover six topics: background motivation, the initial application, the initial solution, debugging with LangSmith, evaluation setup, and the improved solution. Throughout, we use LangSmith to collect real user questions over CSV data and to evaluate our question-answering system. We begin with why collecting such a dataset matters.
Background Motivation
Answering natural language questions over tabular (CSV) data is challenging, and so is measuring how well a system does it. Traditional machine learning datasets consist of inputs and expected outputs, which are used to train and evaluate models. Language model applications, by contrast, often start with neither enough example data nor agreed evaluation metrics. LangSmith offers a way to construct datasets for language model-based applications, making it easier to evaluate solutions. To tackle this challenge, we gather real user questions and feedback to construct a dataset, and then use language models to evaluate correctness.
Let’s start by creating a dataset of real-world questions and ground truth answers. We can achieve this by deploying a demo application and gathering user interactions and feedback. LangSmith can help monitor user interactions and feedback, allowing us to manually review and create a dataset of interesting questions.
Initial Application
In our initial application, we decided to use the Titanic dataset — a classic example of tabular data containing a mix of numeric, categorical, and text columns. Using Streamlit, we created a simple application and gathered real user questions and feedback. By logging interactions and feedback using LangSmith, we were able to create a dataset consisting of interesting user questions.
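The step from logged interactions to a dataset can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code behind the demo application: the `Interaction` record, its field names, and the `build_dataset` helper are hypothetical, and the sketch does not call the LangSmith API. It keeps a question only when a trusted answer exists, either because the user gave positive feedback or because a reviewer supplied a correction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """One logged question/answer pair (hypothetical record layout)."""
    question: str
    answer: str
    feedback: Optional[int] = None          # 1 = thumbs up, 0 = thumbs down, None = none given
    corrected_answer: Optional[str] = None  # filled in during manual review, if at all


def build_dataset(interactions):
    """Turn logged interactions into question/ground-truth examples.

    A reviewer's correction takes priority over the model's answer;
    an uncorrected answer is trusted only if it received positive feedback.
    Duplicate questions (ignoring case and surrounding whitespace) are dropped.
    """
    examples = []
    seen = set()
    for item in interactions:
        key = item.question.strip().lower()
        if key in seen:
            continue
        if item.corrected_answer is not None:
            truth = item.corrected_answer
        elif item.feedback == 1:
            truth = item.answer
        else:
            continue  # no trusted answer for this question yet
        seen.add(key)
        examples.append(
            {"inputs": {"question": item.question.strip()},
             "outputs": {"answer": truth}}
        )
    return examples
```

The resulting list of input/output dictionaries could then be reviewed by hand before being uploaded to whatever dataset store is in use.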
Initial Solution
The initial solution had to handle two kinds of questions over text-heavy tabular data: natural language lookups and more complex, query-like questions. We used a retrieval system for the former, and a Python REPL or kork for the latter. The retrieval system used a vector store to find entries similar to the input question by cosine similarity, whereas kork gave the language model access to a predetermined set of functions for answering query language-based questions.
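To make the retrieval step concrete, here is a small sketch of cosine-similarity search. It is an illustration only: `embed` is a toy bag-of-words stand-in for a real embedding model, and `TinyVectorStore` is a hypothetical in-memory class, not the vector store used in the application.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """In-memory store that embeds each document once at construction."""

    def __init__(self, documents):
        self.entries = [(doc, embed(doc)) for doc in documents]

    def search(self, query, k=2):
        """Return the k documents most similar to the query."""
        query_vec = embed(query)
        scored = [(cosine_similarity(query_vec, vec), doc)
                  for doc, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]
```

A real embedding model would also match paraphrases that share no words with the stored text, which this word-count version cannot do; the ranking logic is the same either way.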
Debugging with LangSmith
As users started asking questions, feedback revealed that some areas of the initial solution needed improvement. LangSmith allowed us to inspect traces and identify issues with data formatting and limited functionality of kork. For instance, we discovered that data formatting inconsistencies affected the language model's ability to reason about the data correctly. With LangSmith's help, we fixed formatting issues and gained insights into debugging performance issues.
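The text does not say which formatting inconsistencies were found, so the following is only one plausible illustration of the kind of fix involved: rendering every row in the same "column: value" layout, with stray whitespace trimmed and empty cells marked explicitly. The function names and the `unknown` placeholder are assumptions made for this sketch.

```python
import csv
import io


def format_row(row, columns):
    """Render one CSV row as 'column: value' lines in a fixed column order."""
    parts = []
    for col in columns:
        # Missing trailing cells come back as None from DictReader
        value = (row.get(col) or "").strip()
        parts.append(f"{col}: {value if value else 'unknown'}")
    return "\n".join(parts)


def rows_as_text(csv_text):
    """Parse CSV text and return one consistently formatted string per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    return [format_row(row, columns) for row in reader]
```

Giving every row the same shape means the language model sees missing values stated outright instead of having to infer them from an empty field.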
Evaluation Setup
With the dataset of real-world examples and insights from LangSmith, we are ready to measure our improvements. However, evaluating natural language answers is complex, as there are multiple valid ways to respond to a question. We decided to use language models to evaluate correctness, even though this approach is not perfect. LangSmith facilitated the evaluation process by leveraging language models to compare predicted answers with ground truth answers.
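The idea of using a language model as the grader can be sketched as follows. This is a hedged illustration, not LangSmith's evaluator: the prompt wording, the `grade` and `accuracy` helpers, and the `judge` callable (any function that sends a prompt to a model and returns its text reply) are all assumptions introduced here.

```python
from typing import Callable

# Hypothetical grading prompt; the wording is an assumption for this sketch.
GRADER_TEMPLATE = """You are grading an answer to a question about a CSV file.
Question: {question}
Reference answer: {reference}
Predicted answer: {prediction}
Reply with exactly one word: CORRECT or INCORRECT."""


def grade(question: str, reference: str, prediction: str,
          judge: Callable[[str], str]) -> bool:
    """Ask the judge model whether the prediction matches the reference."""
    prompt = GRADER_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction
    )
    verdict = judge(prompt).strip().upper()
    # "INCORRECT" does not start with "CORRECT", so this test is unambiguous
    return verdict.startswith("CORRECT")


def accuracy(examples, predict: Callable[[str], str],
             judge: Callable[[str], str]) -> float:
    """Fraction of examples the judge marks correct."""
    if not examples:
        return 0.0
    correct = sum(
        grade(ex["question"], ex["answer"], predict(ex["question"]), judge)
        for ex in examples
    )
    return correct / len(examples)
```

Because the judge is itself a language model, its verdicts can be wrong, which is why the approach is described above as imperfect; spot-checking a sample of graded answers by hand is a reasonable safeguard.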
Improved Solution
We arrived at an improved solution that involved an agent powered by OpenAIFunctions, GPT-4, and two tools: a Python REPL and a retriever. This solution allowed for more flexible and accurate responses to user questions. We also included specific instructions in the prompt to guide the system’s decision-making process. LangSmith was instrumental in comparing the performance of our improved solution with other methods, such as the Pandas Agent and PandasAI.
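The two-tool design can be illustrated with a small dispatcher. This is a sketch under assumptions, not the agent from the improved solution: the tool names, the `run_tool_call` helper, and the restricted `eval` are hypothetical, and the call format (a name plus a JSON string of arguments) is only loosely modeled on function-calling messages. In a real agent the model would choose the tool; here the call is passed in directly.

```python
import json

# Small allow-list of builtins for the expression tool (not a real sandbox).
SAFE_BUILTINS = {"len": len, "sum": sum, "min": min, "max": max,
                 "float": float, "int": int, "sorted": sorted}


def python_tool(code, rows):
    """Evaluate one Python expression with `rows` in scope.

    Assumption: code is a single expression. eval() is NOT safe for
    untrusted input; a production system needs proper isolation.
    """
    scope = {"__builtins__": SAFE_BUILTINS, "rows": rows}
    return str(eval(code, scope))


def run_tool_call(call, rows, retriever):
    """Dispatch a {'name': ..., 'arguments': '<json>'} call to a tool."""
    name = call["name"]
    args = json.loads(call["arguments"])
    if name == "python_repl":
        # Aggregations, counts, and filters go through code
        return python_tool(args["code"], rows)
    if name == "retriever":
        # Fuzzy lookups over text go through similarity search
        return "\n".join(retriever(args["query"]))
    raise ValueError(f"unknown tool: {name}")
```

Splitting the work this way lets code handle exact computation while retrieval handles free-text matching, which is the kind of decision the prompt instructions mentioned above are meant to guide.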
In conclusion, the improved solution received positive feedback and performed well. While there is always room for improvement, LangSmith played a crucial role in collecting real-world examples, debugging, and evaluating the effectiveness of our question-answering system over CSV data.
By employing LangSmith’s capabilities, you can efficiently gather, debug, and evaluate language model-based applications over CSV data, making it an essential tool for developing and refining question-answering systems.
