Performing Customer Analytics with LangChain and LLMs
Discover the potentials and constraints of LangChain in calculating statistics, insight generation, visualization, and making conversation for customer analytics — with implementation codes
Many businesses possess a lot of proprietary data stored in their databases. However, the data is complex and unapproachable for users, so they often struggle to identify trends and extract actionable insights. That’s where business intelligence (BI) dashboards play an essential role, which is the starting point for users to interact with the consolidated view of data at a glance.
The bottleneck of the BI dashboards
An effective BI dashboard should be designed to contain only the relevant information for the target audience and avoid picking cluttered visual elements into one. But this does not well address a challenge. Sometimes users suddenly have additional inquiries or wish to explore new analytical perspectives beyond what is displayed in the dashboard. If they do not have any technical background to dynamically tailor the underlying logic of visualization, it may fail to meet their needs.
The recent framework LangChain reduces the technical barrier of interacting with data due to its advanced language processing capabilities, it thus potentially presents new opportunities for businesses. Let’s explore the basics of how it works.
How LangChain works
Large-language models (LLMs), such as ChatGPT and Llama, have high abilities in language comprehension and text generation. As an open-source library, LangChain integrates LLMs into the applications. It provides multiple modules for efficient interaction and streamlining the workflow, such as:
- Document loader: Facilitate the data loading from various sources, including CSV files, SQL databases, and public datasets like Wikipedia.
- Agent: Use the language model as a reasoning engine to determine which actions to take and in which order. It repeats through a continuous cycle of thought-action-observation until the task is completed.
- Chain: Different from agents, they consist of predetermined sequences of actions, which are hard coded. It addresses complex and well-defined tasks by guiding multiple tools with high-level directions.
- Memory: Currently the beta version supports accessing windows of past messages, this provides the application with a conversational interface
Using these modules as the foundation, we will start coding a simple application that utilizes the LLM. In this hands-on journey, we will assume the roles of business users and experiment to see if we can conduct exploratory data analysis simply by inputting natural language inquiries.
Consider a scenario where you plan to conduct customer analytics for a retail shop, so you have collected sales data over the past 12 months. Your goal is to understand better diverse aspects such as customer demographics, spending behaviors, and product categories.
The dataset [with license as CC0: Public domain] obtained from Kaggle contains several fields, including transaction ID, transaction date, customer ID, gender, age, product category, the number of units of the products purchased, the price per unit, and the total amount of transactions. We can begin our analysis now.
The initial setup
We need to set up the environment and configurations properly to use LangChain in Python.
- Prepare the Python environment with LangChain library alongside other necessary dependencies like SQLite and Pandas
- Configure an OpenAI Key to query GPT models
- Import a CSV file ‘retail_sales_dataset.csv’ into a table within the SQLite database
# Import necessary libraries and modules
from langchain.chat_models import ChatOpenAI
import sqlite3
import pandas as pd
# Set the OpenAI API key
OPENAI_API_KEY = "<OpenAI API key>"
# Initialize the Langchain ChatOpenAI model
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-3.5-turbo-1106")
# Connect to the SQLite database
connection = sqlite3.connect("customer.db")
# Convert DataFrame to a SQLite table named "RetailSalesTable"
df.to_sql("RetailSalesTable", connection, if_exists='replace')
Create a LangChain Application
#1 Generate basic statistics
What is the number of transactions per year and month?
To inquire about basic statistics related to the sales SQL table, we use the create_sql_agent
agent helper. Both parameters verbose
and return_intermediate_steps
are set as True so that the internal states and steps during the execution process can be shown. This will be effective for us to iteratively evaluate and refine the approach to communicating with the agent.
# Import necessary libraries and modules
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.agents.agent_types import AgentType
from langchain.sql_database import SQLDatabase
# Create an instance of SQLDatabase using the 'customer.db' SQLite database
db = SQLDatabase.from_uri('sqlite:///customer.db')
# Create an SQL agent executor with specified parameters
agent_executor = create_sql_agent(
llm=llm,
toolkit=SQLDatabaseToolkit(db=db, llm=llm),
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
handle_parsing_errors=True,
verbose=True,
return_intermediate_steps=True
)
# Define user input
user_inquiry = "What is the number of transactions per year and month?"
# Run the agent to generate a response
agent_executor.run(user_inquiry)
The complete content of the outputs is provided below.
The agent writes the query which includes rectifying the formatting of the ‘Date’ column and producing the correct results. It successfully presents the breakdown of transaction counts from Jan 2023 to Jan 2024.
#2 Perform feature engineering
What is the number of transactions across different age groups of customers?
This time, we slightly modified the inquiries to focus on the dimension ‘age groups’. This aims to evaluate the agent’s ability to generate new features that do not directly originate from the dataset. Applied with a similar concept, you can additionally explore statistics across dimensions like transaction dates categorized by weekdays/ weekends, purchasing frequency, etc.
The agent attempts to write the query but unfortunately gives the wrong answer. It seems that the agent oversimplifies the concept of ‘age group’, so it is not classified as a distinct dimension, and considers it as just ‘age’.
Improvement: Use a prompt template with background context
When we find that the model misunderstood our intended meaning, or lacks certain knowledge, we can use class PromptTemplate
to create parametrized prompts for language models. In this case, I have supplemented additional context about examples of extra features in customer analytics as background information, right before showing the user inquiry. This aims to provide the model with clearer guidelines and deliver our human intent for generating appropriate machine-generated responses.
from langchain_core.prompts import PromptTemplate
# Create the prompt template
template = PromptTemplate(
input_variables=["user_inquiry", "background_info"],
template="""{background_info}
Question: {user_inquiry}
"""
)
# Define the background information
background_info = """
As the customer analyst, my role is to analyze the transaction patterns of customers. The feature engineering in table 'RetailSalesTable' is crucial for statistical exploration. For example:
- column 'Age' can be grouped into bins of age ranges, such as 21-25, 26-30, and so on.
Understanding the data in these columns helps us gain insights about our customers, enabling us to offer personalized services and develop effective marketing strategies.
"""
# Define user input
user_inquiry = "What is the number of transactions across different age ranges of customers?"
# Run the agent with the formatted template
agent_executor.run(template.format(background_info=background_info, user_inquiry=user_inquiry))
Here are the key highlights from the corresponding outputs.
With the aid of the prompt, the agent now successfully categorizes the transactions into multiple age groups. It is worth mentioning that there are potentially other ways to achieve the same goal, such as using few-shot examples to demonstrate the question-answer pairs.
#3 Plot chart for multi-dimensional features
We used the create_sql_agent
agent to calculate basic statistics and generate insights. With the art of prompt design, the agent is consequently fit to perform the tasks. To support asking new inquiries that cannot be achieved solely by SQL query, we need to develop our custom tool.
Show a grouped bar graph visualizing the answer to the following question: What are the relationships between product category, average total amount, and gender?
In this example, the inquiry involves data exploration and visualization. We will create our agent based on create_sql_agent
, and will add the tool PythonREPLTool
for executing Python commands, such as using Matplotlib
for visualization.
Let’s see how the tools are implemented in practice.
from langchain import LLMChain
from langchain.agents import (AgentExecutor, Tool, ZeroShotAgent)
from langchain_experimental.tools import PythonREPLTool
# Define a description to suggest how to determine the choice of tool
description = (
"Useful when you require to answer analytical questions about customers. "
"Use this more than the Python REPL tool if the question is about customer analytics,"
"like 'How many customers are there?' or 'count the number of transactions by age group'. "
"Try not to use clause in the SQL."
)
# Create a Tool object for customer data with the previously defined agent executor 'create_sql_agent' and description
customer_data_tool = Tool(
name="Customer",
func=agent_executor.run,
description=description,
)
# Create the whole list of tools
tools = [PythonREPLTool()]
tools.append(customer_data_tool)
# Define the prefix and suffix for the prompt
prefix = "Below are tools that you can access:"
suffix = (
"Pass the relevant part of the request directly to the Customer tool.\n\n"
"Request: {input}\n"
"{agent_scratchpad}"
)
# Create the prompt using ZeroShotAgent
# Use agent_scratchpad to store the actions previously used, guiding the subsequent responses.
agent_prompt = ZeroShotAgent.create_prompt(
tools, prefix=prefix, suffix=suffix, input_variables=["input", "agent_scratchpad"]
)
# Create an instance of ZeroShotAgent with the LLMChain and the allowed tool names
zero_shot_agent = ZeroShotAgent(
llm_chain=LLMChain(llm=llm, prompt=agent_prompt),
allowed_tools=[tool.name for tool in tools]
)
# Create an AgentExecutor which enables verbose mode and handling parsing errors
agent_executor = AgentExecutor.from_agent_and_tools(
agent=zero_shot_agent, tools=tools, verbose=True, handle_parsing_errors=True
)
# Define user input
user_inquiry = "Use a grouped bar graph to visualize the result of the following inquiry: " \
"What are the relationships between product category, average total amount, and gender?"
# Run the agent to generate a response
agent_executor.run(user_inquiry)
The flow of SQL agent output is similar to the previous examples we went through above, so it is omitted here. The subsequent output from the Python REPL tool is presented below.
The customized tool combination successfully translates the natural language inquiry into an SQL query. The summarized query results are then used to generate a grouped bar chart, which clearly and effectively illustrates the relationships through the x-axis, y-axis, and legend.
While the overall design and execution process look smooth, there are indeed some limitations in the current design. For example, suppose we want to generate a scatter plot with most of the transaction data points. In that case, the execution process should generate a long query output that covers all the respective information. However, the agent’s output may not be ideal, due to the occasional use of the LIMIT
clause by the agent (that sets an upper limit on the number of tuples), or query result larger than the maximum token limit (in our case, 4096 tokens). The variety of visualizations that can be generated may thus be limited.
#4 Make coherent conversation
In reality, business users often have follow-up questions once they receive the results of a customer analysis. To address these situations, we need to enhance the existing basic LLM application to become more chat-like. We add memory buffers to retain past interactions, allowing the LLM to generate responses for the ongoing conversation’s context. This works by constantly storing LLM outputs and referencing the memory storage before generating a response.
Initial question: How do customers adapt their shopping habits during different seasons?
Follow-up question: Can you elaborate more?
We supplement and rectify the customized tools combination below:
from langchain.memory import ConversationBufferMemory
# Skipped here - Define your own prefix, suffix, and description with "chat_history" for the prompt
# Keep the original list of tools
# Create the prompt using ZeroShotAgent with additonal "chat_history" as input variables
agent_prompt = ZeroShotAgent.create_prompt(
tools, prefix=prefix, suffix=suffix, input_variables=["input", "chat_history", "agent_scratchpad"],
)
# Create an instance of ZeroShotAgent with the LLMChain and the allowed tool names
zero_shot_agent = ZeroShotAgent(
llm_chain=LLMChain(llm=llm, prompt=agent_prompt),
allowed_tools=[tool.name for tool in tools]
)
# Initiate memory which allows for storing and extracting messages
memory = ConversationBufferMemory(memory_key="chat_history")
# Create an AgentExecutor with memory parameter
agent_chain = AgentExecutor.from_agent_and_tools(
agent=zero_shot_agent, tools=tools, verbose=True, handle_parsing_errors=True, memory=memory
)
# Define initial question as user input
user_inquiry = "How do customers adapt their shopping habits during different seasons?"
# Run the agent to generate a response
agent_executor.run(user_inquiry)
# Define follow-up question as user input
user_inquiry = "Can you elaborate more?"
# Run the agent to generate another response
agent_executor.run(user_inquiry)
Agent’s responses:
In the follow-up question “Can you elaborate more?”, we intentionally did not provide any hints/ keywords about what we were asking, but the agent demonstrated its ability to continue the analysis of shopping habits in various seasons. This indicates the effectiveness of using memory and shows its advantage by providing a more in-depth description across product categories and seasons.
Wrapping it up
We conducted experiments to explore the key functionalities and potential approaches of LangChain in developing a customer analytics application based on LLM:
- Calculating statistics, by using an
create_sql_agent
agent to query the database and retrieve relevant statistical information. - Insight generation, by applying the prompt template to define the key data features.
- Visualization, by using the combination of a custom agent and tool
PythonREPLTool
. - Conversation capabilities, by adding the memory buffer to store and retrieve the chat history.
The choices of wordings in natural language inquiry often do not align precisely with those in the database schema. It was observed that the LangChain executor sometimes does not work as expected, and may even hallucinate, particularly when identifying the data relationships to generate a graph. Therefore, the development of codes requires iterative troubleshooting. While the LangChain framework may only be somewhat reliable and effective in handling diverse customer analytics tasks, it can still offer users with marginal advantage when they have a compelling need to discover insights beyond the traditional analysis dashboards.
The design of this application is only the initial stage, and there are further possibilities to discover. For example, customer data is sometimes available in text format, such as customer reviews or product descriptions. LangChain provides the tagging function so that we can conduct a comprehensive analysis by labeling the sentiments, language, style, and more.
Before you go
If you enjoy this reading, I invite you to follow my Medium page and LinkedIn page. By doing so, you can stay updated with exciting content related to data science side projects, Machine Learning Operations (MLOps) demonstrations, and project management methodologies.