Unlocking the Power of AutoGen, LangChain, and Spark: Transforming LLM Applications

Introduction:
The convergence of AutoGen, LangChain, and Spark opens new options for building Large Language Model (LLM) applications. This article explores how AutoGen’s multi-agent framework, LangChain’s framework for connecting language models to tools and data, and Spark’s distributed data processing can be combined. Together, these technologies let conversational agents query and analyze large datasets through tool calls, pushing the boundaries of what is achievable in LLM applications.
Advantages of Combining AutoGen, Langchain, and Spark:
- Distributed Intelligence: Combining AutoGen, LangChain, and Spark lets applications pair multiple cooperating LLM agents with distributed computing. Agents can divide a problem among themselves while Spark supplies the computational resources to work over large datasets.
- Traceable Collaboration: LangChain’s tool abstractions and Spark’s data processing make interactions between AutoGen agents easy to inspect. Each tool call and its result appear in the conversation log, which provides traceability. This is particularly useful in applications where transparency is paramount.
- Customizable Agents with LLM Backbone: AutoGen’s agents can be tailored to call LangChain tools, backed by an LLM, while leveraging Spark’s data processing abilities. This customization enhances the conversational and problem-solving capabilities of the agents, and their ability to process and analyze data at scale.
- High-Performance Computing: The inclusion of Spark in the integration brings high-performance computing to the table, enabling applications to process large volumes of data efficiently. This is crucial for applications that require real-time data analysis and decision-making.

Auto Generated Agent Chat: Task Solving with Provided Tools as Functions:
AutoGen, at the core of this integration, introduces conversable agents that combine Large Language Models (LLMs), tools, and human inputs. These agents collaborate on tasks through automated chat. The framework combines tool usage and human participation within multi-agent conversations, giving developers a flexible way to approach problem-solving with AI.
The primary features of this innovative approach include:
1. Multi-Agent Collaboration: AutoGen enables multiple agents to communicate, collaborate, and collectively address complex tasks. These agents can bring together distinct areas of expertise, combining the computational power of LLMs, the precision of tools, and the nuanced insights of humans.
2. Automated Chat: The use of chat as the primary medium for interactions between agents offers a natural and intuitive means of communication. It enhances the efficiency and flexibility of the collaboration, making it accessible and user-friendly.
3. Integration of Tools: AutoGen seamlessly integrates tools as a part of the conversation, allowing agents to utilize various resources and utilities to enhance problem-solving. This integration significantly expands the capabilities of the agents by making a wide array of tools available at their virtual fingertips.
4. Human Participation: One of the standout features of AutoGen is the seamless inclusion of human inputs in the conversation. This dynamic interaction between AI agents and humans enables applications to incorporate the valuable aspects of human judgment, context, and expertise.
For in-depth information and a comprehensive guide on implementing this feature, developers are encouraged to explore the detailed AutoGen documentation.
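To make the tool integration concrete: with OpenAI-style function calling, each tool is described to the model as a JSON-schema entry with a name, a description, and its parameters. The sketch below builds such an entry from a stand-in tool object. `FakeTool` and `build_function_schema` are hypothetical names introduced for illustration and are not part of AutoGen or LangChain; treating every argument without a default as required is an assumption of this sketch, not something the libraries do for you.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTool:
    """Hypothetical stand-in for a tool object with a name, description, and args."""
    name: str
    description: str
    args: dict = field(default_factory=dict)


def build_function_schema(tool):
    """Build an OpenAI-style function schema from a tool-like object."""
    properties = dict(tool.args or {})
    # Assumption: an argument with no "default" entry must be supplied by the model
    required = [arg for arg, spec in properties.items() if "default" not in spec]
    return {
        "name": tool.name.lower().replace(" ", "_"),
        "description": tool.description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


schema_tool = FakeTool(
    name="Schema SQL DB",
    description="Return the schema for a comma-separated list of tables.",
    args={"table_names": {"title": "Table Names", "type": "string"}},
)
schema = build_function_schema(schema_tool)
print(schema["name"])                    # schema_sql_db
print(schema["parameters"]["required"])  # ['table_names']
```

The name is normalized to lowercase with underscores so that the same string can serve as the lookup key when the call is later executed.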
In this article, we will provide a practical demonstration of how to leverage AutoGen’s capabilities through the integration of AssistantAgent and UserProxyAgent, making use of the function-calling feature of OpenAI models. This demonstration will walk you through the steps to initiate these agents and engage them in meaningful problem-solving tasks. The key components of this process include:
1. AssistantAgent Initialization: To begin, you must initialize the AssistantAgent by providing a well-defined prompt and the corresponding function configurations. The prompt is crucial for setting the context and instructing the AssistantAgent on how to approach the task.
2. UserProxyAgent Configuration: The UserProxyAgent is configured to receive and execute function calls initiated by the AssistantAgent. This is a pivotal role, as the UserProxyAgent ensures that the specified functions are executed as per the instructions, effectively making the AI-agent’s intent a reality.
3. Consistency in Descriptions and Functions: A critical aspect of this implementation is to ensure that the descriptions provided in the AssistantAgent align precisely with the functions that need to be executed by the UserProxyAgent. This alignment is crucial for the seamless execution of tasks.
4. System Message Verification: It is advisable to verify the system message within the AssistantAgent to confirm that the instructions provided therein are consistent with the descriptions and function calls. This step enhances the reliability and accuracy of the entire process.
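The consistency requirement in the steps above can be checked mechanically. The following is a minimal sketch, not AutoGen's actual implementation: `check_alignment`, `dispatch`, and `describe_table` are hypothetical helpers. It assumes the assistant's suggested call arrives as a name plus a JSON string of arguments, as in OpenAI function calling, and that the proxy side keeps a plain dictionary from function names to callables.

```python
import json


def check_alignment(function_schemas, function_map):
    """Compare declared function names with registered callables.

    Returns (missing, unused): names declared to the assistant that have no
    callable, and callables that were never declared to the assistant.
    """
    declared = {schema["name"] for schema in function_schemas}
    registered = set(function_map)
    return sorted(declared - registered), sorted(registered - declared)


def dispatch(function_call, function_map):
    """Run one suggested call: look up the callable and pass the JSON arguments."""
    func = function_map[function_call["name"]]
    kwargs = json.loads(function_call["arguments"])
    return func(**kwargs)


# Hypothetical stand-in for a real tool such as a table schema lookup
def describe_table(table_names):
    return f"schema for {table_names}"


function_schemas = [
    {
        "name": "schema_sql_db",
        "description": "Return the schema for a comma-separated list of tables.",
        "parameters": {
            "type": "object",
            "properties": {"table_names": {"type": "string"}},
            "required": ["table_names"],
        },
    }
]
function_map = {"schema_sql_db": describe_table}

# Fail early if the descriptions and the executable functions have drifted apart
missing, unused = check_alignment(function_schemas, function_map)
assert not missing and not unused, f"missing={missing}, unused={unused}"

suggested_call = {
    "name": "schema_sql_db",
    "arguments": '{"table_names": "california_housing_train"}',
}
print(dispatch(suggested_call, function_map))  # schema for california_housing_train
```

Running a check like this before starting the chat surfaces a mismatched name immediately, rather than as a failed function call in the middle of a conversation.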
In short, AutoGen, combined with the capabilities of LangChain and Spark, broadens what LLM applications can do. By integrating multi-agent conversation, LangChain’s tool abstractions, and high-performance computing, developers can create innovative and efficient applications that excel in problem-solving, data processing, and multi-agent collaboration.

Code Implementation: Harnessing the Power of AutoGen, Langchain, and Spark
In this section, we’ll walk through an implementation that integrates AutoGen, LangChain, and Spark, demonstrating how to combine them to build Large Language Model (LLM) applications. The following code is a simplified example, but it lays the groundwork for understanding how these technologies fit together.

Step 1: Setting Up The Environment
%pip install "pyautogen[mathchat]~=0.1.0"
%pip install langchain
%pip install pyspark

Step 2: Import the Libraries
import json

# Create a list of OpenAI configuration settings
config_list = [
    {
        "model": "gpt-3.5-turbo",
        "api_key": "",  # fill in your OpenAI API key
    }
]

# Save the configuration list to a file
with open("OAI_CONFIG_LIST.json", "w") as f:
    json.dump(config_list, f)

import autogen

# Load the configuration back through AutoGen's helper
config_list = autogen.config_list_from_json(
    env_or_file="OAI_CONFIG_LIST.json",
    file_location=".",
)
assert len(config_list) > 0
print("models to use: ", [config["model"] for config in config_list])

Output:
models to use: ['gpt-3.5-turbo']

from langchain.agents import create_spark_sql_agent
from langchain.agents.agent_toolkits import SparkSQLToolkit
from langchain.chat_models import ChatOpenAI
from langchain.utilities.spark_sql import SparkSQL
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
schema = "langchain_example"
spark.sql(f"CREATE DATABASE IF NOT EXISTS {schema}")
spark.sql(f"USE {schema}")
csv_file_path = "./sample_data/california_housing_train.csv"
table = "california_housing_train"
(
    spark.read.csv(csv_file_path, header=True, inferSchema=True)
    .write.option("path", "file:/content/spark-warehouse/langchain_example.db/california_housing_train")
    .mode("overwrite")
    .saveAsTable(table)
)
spark.table(table).show()

Output:
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
| -114.31| 34.19| 15.0| 5612.0| 1283.0| 1015.0| 472.0| 1.4936| 66900.0|
| -114.47| 34.4| 19.0| 7650.0| 1901.0| 1129.0| 463.0| 1.82| 80100.0|
| -114.56| 33.69| 17.0| 720.0| 174.0| 333.0| 117.0| 1.6509| 85700.0|
| -114.57| 33.64| 14.0| 1501.0| 337.0| 515.0| 226.0| 3.1917| 73400.0|
| -114.57| 33.57| 20.0| 1454.0| 326.0| 624.0| 262.0| 1.925| 65500.0|
| -114.58| 33.63| 29.0| 1387.0| 236.0| 671.0| 239.0| 3.3438| 74000.0|
| -114.58| 33.61| 25.0| 2907.0| 680.0| 1841.0| 633.0| 2.6768| 82400.0|
| -114.59| 34.83| 41.0| 812.0| 168.0| 375.0| 158.0| 1.7083| 48500.0|
| -114.59| 33.61| 34.0| 4789.0| 1175.0| 3134.0| 1056.0| 2.1782| 58400.0|
| -114.6| 34.83| 46.0| 1497.0| 309.0| 787.0| 271.0| 2.1908| 48100.0|
| -114.6| 33.62| 16.0| 3741.0| 801.0| 2434.0| 824.0| 2.6797| 86500.0|
| -114.6| 33.6| 21.0| 1988.0| 483.0| 1182.0| 437.0| 1.625| 62000.0|
| -114.61| 34.84| 48.0| 1291.0| 248.0| 580.0| 211.0| 2.1571| 48600.0|
| -114.61| 34.83| 31.0| 2478.0| 464.0| 1346.0| 479.0| 3.212| 70400.0|
| -114.63| 32.76| 15.0| 1448.0| 378.0| 949.0| 300.0| 0.8585| 45000.0|
| -114.65| 34.89| 17.0| 2556.0| 587.0| 1005.0| 401.0| 1.6991| 69100.0|
| -114.65| 33.6| 28.0| 1678.0| 322.0| 666.0| 256.0| 2.9653| 94900.0|
| -114.65| 32.79| 21.0| 44.0| 33.0| 64.0| 27.0| 0.8571| 25000.0|
| -114.66| 32.74| 17.0| 1388.0| 386.0| 775.0| 320.0| 1.2049| 44000.0|
| -114.67| 33.92| 17.0| 97.0| 24.0| 29.0| 15.0| 1.2656| 27500.0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
only showing top 20 rows

Step 3: Connect to Spark
spark_sql = SparkSQL(schema=schema)
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")
toolkit = SparkSQLToolkit(db=spark_sql, llm=llm)
agent_executor = create_spark_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent_executor.run("Describe the california_housing_train table")

Output:
> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input: ""
Observation: california_housing_train
Thought:I see that there is a table called "california_housing_train" in the database. I can use the "schema_sql_db" tool to get more information about this table.
Action: schema_sql_db
Action Input: "california_housing_train"
Observation: CREATE TABLE spark_catalog.langchain_example.california_housing_train (
longitude DOUBLE,
latitude DOUBLE,
housing_median_age DOUBLE,
total_rooms DOUBLE,
total_bedrooms DOUBLE,
population DOUBLE,
households DOUBLE,
median_income DOUBLE,
median_house_value DOUBLE)
3 rows from california_housing_train table:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
-114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
-114.47 34.4 19.0 7650.0 1901.0 1129.0 463.0 1.82 80100.0
-114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
*/
Thought:The "california_housing_train" table has the following columns: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, and median_house_value. It contains information about housing in California. I can now answer the question.
Action: None
Final Answer: The "california_housing_train" table has the following columns: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, and median_house_value.
> Finished chain.

Let’s try the same problem using LangChain’s individual Spark SQL tools, this time driven by AutoGen agents.
# Define a function to generate llm_config from a LangChain tool
def generate_llm_config(tool):
    # Define the function schema based on the tool's args_schema
    function_schema = {
        "name": tool.name.lower().replace(" ", "_"),
        "description": tool.description,
        "parameters": {
            "type": "object",
            "properties": {},
            "required": [],
        },
    }
    if tool.args is not None:
        function_schema["parameters"]["properties"] = tool.args
    return function_schema

from langchain.tools.file_management.read import ReadFileTool
import autogen
from langchain.tools.spark_sql.tool import (
    InfoSparkSQLTool,
    ListSparkSQLTool,
    QueryCheckerTool,
    QuerySparkSQLTool,
)

tools = []
function_map = {}
for tool in toolkit.get_tools():
    tool_schema = generate_llm_config(tool)
    print(tool_schema)
    tools.append(tool_schema)
    # Key by the schema name so it matches the name the model will call
    function_map[tool_schema["name"]] = tool._run

# Construct the llm_config
llm_config = {
    # Generate functions config for the tools
    "functions": tools,
    "config_list": config_list,  # loaded from OAI_CONFIG_LIST.json in Step 2
    "request_timeout": 120,
}

user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "coding"},
)
print(function_map)

# Register the tools and start the conversation
user_proxy.register_function(function_map=function_map)

chatbot = autogen.AssistantAgent(
    name="chatbot",
    system_message="For coding tasks, only use the functions you have been provided with. Reply TERMINATE when the task is done.",
    llm_config=llm_config,
)

user_proxy.initiate_chat(
    chatbot,
    message="Describe the table names california_housing_train",
    llm_config=llm_config,
)

Output:
{'name': 'query_sql_db', 'description': '\n Input to this tool is a detailed and correct SQL query, output is a result from the Spark SQL.\n If the query is not correct, an error message will be returned.\n If an error is returned, rewrite the query, check the query, and try again.\n ', 'parameters': {'type': 'object', 'properties': {'query': {'title': 'Query', 'type': 'string'}}, 'required': []}}
{'name': 'schema_sql_db', 'description': '\n Input to this tool is a comma-separated list of tables, output is the schema and sample rows for those tables.\n Be sure that the tables actually exist by calling list_tables_sql_db first!\n\n Example Input: "table1, table2, table3"\n ', 'parameters': {'type': 'object', 'properties': {'table_names': {'title': 'Table Names', 'type': 'string'}}, 'required': []}}
{'name': 'list_tables_sql_db', 'description': 'Input is an empty string, output is a comma separated list of tables in the Spark SQL.', 'parameters': {'type': 'object', 'properties': {'tool_input': {'title': 'Tool Input', 'default': '', 'type': 'string'}}, 'required': []}}
{'name': 'query_checker_sql_db', 'description': '\n Use this tool to double check if your query is correct before executing it.\n Always use this tool before executing a query with query_sql_db!\n ', 'parameters': {'type': 'object', 'properties': {'query': {'title': 'Query', 'type': 'string'}}, 'required': []}}
{'query_sql_db': <bound method QuerySparkSQLTool._run of QuerySparkSQLTool(db=<langchain.utilities.spark_sql.SparkSQL object at 0x793ee7084430>)>, 'schema_sql_db': <bound method InfoSparkSQLTool._run of InfoSparkSQLTool(db=<langchain.utilities.spark_sql.SparkSQL object at 0x793ee7084430>)>, 'list_tables_sql_db': <bound method ListSparkSQLTool._run of ListSparkSQLTool(db=<langchain.utilities.spark_sql.SparkSQL object at 0x793ee7084430>)>, 'query_checker_sql_db': <bound method QueryCheckerTool._run of QueryCheckerTool(db=<langchain.utilities.spark_sql.SparkSQL object at 0x793ee7084430>, llm=ChatOpenAI(client=<class 'openai.api_resources.chat_completion.ChatCompletion'>, model_name='gpt-3.5-turbo-16k', temperature=0.0, openai_api_key='sk-REDACTED', openai_api_base='', openai_organization='', openai_proxy=''), llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['query'], template='\n{query}\nDouble check the Spark SQL query above for common mistakes, including:\n- Using NOT IN with NULL values\n- Using UNION when UNION ALL should have been used\n- Using BETWEEN for exclusive ranges\n- Data type mismatch in predicates\n- Properly quoting identifiers\n- Using the correct number of arguments for functions\n- Casting to the correct data type\n- Using the proper columns for joins\n\nIf there are any of the above mistakes, rewrite the query. If there are no mistakes, just reproduce the original query.'), llm=ChatOpenAI(client=<class 'openai.api_resources.chat_completion.ChatCompletion'>, model_name='gpt-3.5-turbo-16k', temperature=0.0, openai_api_key='sk-REDACTED', openai_api_base='', openai_organization='', openai_proxy='')))>}
user_proxy (to chatbot):
Describe the table names california_housing_train
--------------------------------------------------------------------------------
chatbot (to user_proxy):
***** Suggested function Call: schema_sql_db *****
Arguments:
{
"table_names": "california_housing_train"
}
**************************************************
--------------------------------------------------------------------------------
>>>>>>>> EXECUTING FUNCTION schema_sql_db...
user_proxy (to chatbot):
***** Response from calling function "schema_sql_db" *****
CREATE TABLE spark_catalog.langchain_example.california_housing_train (
longitude DOUBLE,
latitude DOUBLE,
housing_median_age DOUBLE,
total_rooms DOUBLE,
total_bedrooms DOUBLE,
population DOUBLE,
households DOUBLE,
median_income DOUBLE,
median_house_value DOUBLE)
;
/*
3 rows from california_housing_train table:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
-114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
-114.47 34.4 19.0 7650.0 1901.0 1129.0 463.0 1.82 80100.0
-114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
*/
**********************************************************
--------------------------------------------------------------------------------
chatbot (to user_proxy):
The table `california_housing_train` has the following columns:
- longitude (DOUBLE)
- latitude (DOUBLE)
- housing_median_age (DOUBLE)
- total_rooms (DOUBLE)
- total_bedrooms (DOUBLE)
- population (DOUBLE)
- households (DOUBLE)
- median_income (DOUBLE)
- median_house_value (DOUBLE)
Here are a few sample rows from the `california_housing_train` table:
1. longitude: -114.31, latitude: 34.19, housing_median_age: 15.0, total_rooms: 5612.0, total_bedrooms: 1283.0, population: 1015.0, households: 472.0, median_income: 1.4936, median_house_value: 66900.0
2. longitude: -114.47, latitude: 34.4, housing_median_age: 19.0, total_rooms: 7650.0, total_bedrooms: 1901.0, population: 1129.0, households: 463.0, median_income: 1.82, median_house_value: 80100.0
3. longitude: -114.56, latitude: 33.69, housing_median_age: 17.0, total_rooms: 720.0, total_bedrooms: 174.0, population: 333.0, households: 117.0, median_income: 1.6509, median_house_value: 85700.0
--------------------------------------------------------------------------------
user_proxy (to chatbot):
--------------------------------------------------------------------------------
chatbot (to user_proxy):
TERMINATE
--------------------------------------------------------------------------------

Conclusion
In conclusion, the integration of AutoGen, LangChain, and Spark offers a practical way to build Large Language Model (LLM) applications that combine problem-solving, data processing, and multi-agent collaboration. Its main advantages are cooperating agents, traceable tool calls, and high-performance computing.
By combining AutoGen’s multi-agent framework with LangChain’s tools for connecting language models to data, we can create conversable agents that query real datasets. These agents can integrate human inputs and leverage a wide array of tools to solve complex tasks.
The inclusion of Spark in this integration elevates the system’s performance to new heights, enabling real-time data analysis and decision-making, a critical feature in today’s fast-paced world of data-driven applications.
Moreover, AutoGen introduces the concept of collaborative task-solving through automated chat, unifying the efforts of language models, tools, and human expertise. This not only enhances the efficiency of problem-solving but also fosters an environment that is both user-friendly and intuitive.
As we’ve seen, the code implementation underscores the potential of this integration. While our example is simplified, it serves as a foundation for creating applications that excel in multi-agent collaboration, data processing, and problem-solving. Developers can look forward to harnessing the collective intelligence and performance of AutoGen, LangChain, and Spark to pioneer innovative solutions across a range of domains.
This synergy marks a notable shift in how LLM applications are built. The possibilities range from conversational assistants and knowledge-sharing platforms to data-intensive applications that demand high-performance computing. The integration of AutoGen, LangChain, and Spark opens doors to a new range of applications, and we can anticipate exciting developments on the horizon.
Stay connected and support my work through various platforms:
- GitHub: For all my open-source projects and Notebooks, you can visit my GitHub profile at https://github.com/andysingal. If you find my content valuable, don’t hesitate to leave a star.
- Patreon: If you’d like to provide additional support, you can consider becoming a patron on my Patreon page at https://www.patreon.com/AndyShanu.
- Medium: You can read my latest articles and insights on Medium at https://medium.com/@andysingal.
- Kaggle: Check out my Kaggle profile for data science and machine learning projects at https://www.kaggle.com/alphasingal.
- Hugging Face: For natural language processing and AI-related projects, you can explore my Hugging Face profile at https://huggingface.co/Andyrasika.
- YouTube: To watch my video content, visit my YouTube channel at https://www.youtube.com/@andy111007.
- LinkedIn: To stay updated on my latest projects and posts, you can follow me on LinkedIn. Here is the link to my profile: https://www.linkedin.com/in/ankushsingal/.
Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.
Remember, each “Like”, “Share”, and “Star” greatly contributes to my work and motivates me to continue producing more quality content. Thank you for your support!
