Abstract
The article details the development of "Talkie," a voice-activated ChatGPT clone that utilizes the LangChain framework to facilitate interactions with a large language model (LLM) via spoken language. The application employs OpenAI's Whisper for speech recognition and pyttsx3 for text-to-speech conversion. "Talkie" is capable of searching the internet and local files for information, providing responses through a combination of chaining components such as LLMChain, AgentExecutor, and RetrievalQA. The system operates through a Flask web application and uses various technologies including the MediaStream Recording API, Chroma vector database, and Serper API for Google searches. The article emphasizes the significance of LangChain in democratizing the creation of LLM applications, marking a shift from the need for extensive resources to train models to a more accessible application development landscape.
Opinions
The author views LangChain as a pivotal tool in the evolution of LLM applications, facilitating the construction of complex systems without the necessity of training models from scratch.
The introduction of ChatGPT plugins and LangChain is seen as fostering an exciting ecosystem of AI applications based on LLMs.
The author implies that the ability of "Talkie" to access and provide information from recent articles demonstrates the powerful capabilities of LLM applications in real-world scenarios.
There is an opinion that the development of LLM applications is becoming more accessible to a wider range of developers, as evidenced by the creation of "Talkie" using existing libraries and APIs.
The author suggests that the future of LLM applications is bright and rapidly expanding, with "Talkie" serving as an example of what can be achieved with current technology.
Create a Voice-Based ChatGPT Clone That Can Search the Internet and Local Files
Using LangChain, Whisper, and pyttsx3 to build a voice-based ChatGPT clone that can search the Internet and local files
Picture generated with Midjourney
ChatGPT has been all the rage since OpenAI introduced it a few months ago, in November 2022. Since then, large language models (LLMs) from tech companies, startups, and even universities have started popping up. Then, a couple of weeks ago, OpenAI introduced ChatGPT plugins, creating an exciting ecosystem of AI apps based on LLMs.
However, even before ChatGPT was released, there were already attempts to build applications on top of LLMs. One of the more popular libraries for building LLM applications is LangChain, which came out around October 2022.
In this article, I’ll be using LangChain to build a voice-based ChatGPT clone. By this, I mean that the user interacts with the ChatGPT clone (which I call Talkie) by voice.
This is easiest to explain with a video (turn on the volume, please).
The ChatGPT clone, Talkie, was written on 1 April 2023, and the video was made on 2 April. Normally, there is no way an LLM would know such recent information, but using LangChain, I made Talkie search the Internet and respond based on the information it found. If you have watched the ChatGPT browser plugin video, this is a bit like that.
In fact, I used a few of the same files I used in the previous article as well, and they came out nicely.
So how does it work? It’s quite simple, as it turns out.
How it works
Talkie is a Python Flask web application that runs only on your desktop. Once you start it, you can click on the blue record button (with the microphone icon) on the browser. This starts a MediaRecorder on the browser recording your voice message into an audio Blob. When you’re done, click the stop button, and the recording is sent to a web application handler.
The handler then does 3 things:
Use OpenAI Whisper to transcribe the message recording into input text
Pass the input text into a LangChain object to get a response
Use pyttsx3 to play the response output as a voice message
The MediaStream Recording API (also known as the MediaRecorder API) is a Javascript API that makes it possible to capture media data. It consists of a single major interface, the MediaRecorder, which captures data through a series of dataavailable events. In Talkie, I used MediaRecorder to capture voice messages and create a WebM audio recording that is sent to the local server.
Whisper
Whisper is an open-source automatic speech recognition (ASR) AI model trained by OpenAI on 680,000 hours of supervised data collected from the web. I used Whisper to transcribe voice messages to text.
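As a rough sketch (not Talkie’s exact code, and the file name here is just a placeholder), transcription with Whisper looks something like this:
import whisper

# load one of the smaller English-only models
model = whisper.load_model('small.en')

# transcribe a local audio file and print the recognised text
result = model.transcribe('recording.webm')
print(result['text'])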
LangChain
LangChain is a framework for building LLM applications. There are a few ways of doing this, and LangChain (despite its name) doesn’t favour one over the other.
The eponymous way of doing this is by chaining multiple components together to form a single wrapper object called a chain. For example, the most commonly used type of chain is an LLMChain, which combines a PromptTemplate and an LLM. It takes the user input, formats it according to the template, and passes it to the model to get a response. We will be creating an LLMChain in a while.
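Here is a minimal sketch of the idea (the prompt below is made up purely for illustration; the actual chain Talkie uses is shown later in chains.py):
from langchain import OpenAI, LLMChain, PromptTemplate

# a one-variable prompt template, just to show the chaining
prompt = PromptTemplate(
    input_variables=['question'],
    template='Answer the question as helpfully as you can.\nQuestion: {question}\nAnswer:',
)

# chain the prompt template and the LLM into a single object
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run('What is LangChain?'))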
Another way of building an LLM application is using an agent. Agents use an LLM to determine which actions to take and in what order. An action can either be using a tool and observing its output or returning a response to the user. Tools are functions that agents can use to interact with the world. Example tools are Google search, database lookup, calling APIs like Wolfram Alpha etc. Tools can even be other chains or agents. We will also create a chat agent that uses Serper, a Google search API service.
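Here is a minimal sketch of an agent with a single search tool (this uses the simpler zero-shot-react-description agent type for brevity; the agent Talkie actually uses is shown later in chains.py, and it assumes SERPER_API_KEY is set in the environment):
from langchain import OpenAI
from langchain.agents import initialize_agent, Tool
from langchain.utilities import GoogleSerperAPIWrapper

# wrap the Serper API as a tool the agent can choose to use
search = GoogleSerperAPIWrapper()
tools = [Tool(name='Search', func=search.run, description='useful for questions about current events')]

# the LLM decides when to call the tool and when to answer directly
agent = initialize_agent(tools=tools, llm=OpenAI(temperature=0), agent='zero-shot-react-description', verbose=True)
print(agent.run('What is the weather in Singapore today?'))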
To access local files, I used another chain, the RetrievalQA chain. I explained this in a previous article, but basically, to access local files, we load the files into an index. Then we query and pass the results of the query to the LLM as part of the prompt. This technique is also called prompt engineering. In this case, I used Chroma, an open-source vector database (it calls itself an embedding database, but it’s the same thing) to store the index.
pyttsx3
pyttsx3 is a Python text-to-speech library that wraps around different TTS engines, including SAPI5 (Windows), NSSpeechSynthesizer (macOS) and eSpeak (Linux).
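Using it is straightforward; a minimal sketch (using runAndWait here, while the Flask handler later uses startLoop) looks like this:
import pyttsx3

# initialise the TTS engine and speak a line of text
engine = pyttsx3.init()
engine.setProperty('rate', 190)  # speaking rate in words per minute
engine.say('Hello, I am Talkie.')
engine.runAndWait()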
You can install all the required Python libraries using:
$ pip install -r requirements.txt
In particular, Whisper requires ffmpeg.
MediaRecorder Javascript
The MediaRecorder is in a script embedded in an HTML file, which I won’t be showing here (you can check it out from the repository later).
Before we start using MediaRecorder, we need to ensure the browser supports it. If it does, we create a MediaRecorder object and pass the data stream to it.
// set up basic variables for app
const record = document.querySelector('.record');
const border = document.querySelector('.border');
const canvas = document.querySelector('.visualizer');
const mainSection = document.querySelector('.main-controls');
const printout = document.querySelector('.printout');
const canvasCtx = canvas.getContext("2d");
let recording = false;
let audioCtx;
if (navigator.mediaDevices.getUserMedia) {
const constraints = { audio: true };
// this stores the data for the audio blob
let chunks = [];
let onSuccess = function(stream) {
const mediaRecorder = new MediaRecorder(stream);
visualize(stream);
// when the user clicks on the record button
record.onclick = function() {
// start the recording
if (recording == false) {
mediaRecorder.start();
record.style.background = "red";
record.innerHTML = "<i class='fa-solid fa-stop'></i>";
recording = true;
} else {
// stop the recording
mediaRecorder.stop();
record.style.background = "";
record.innerHTML = "<i class='fa-solid fa-microphone'></i>";
recording = false;
}
}
// when the user clicks on the stop button
mediaRecorder.onstop = function(e) {
// create an audio blob in webm format
const blob = new Blob(chunks, { type: "audio/webm" });
// add it to the form
const formData = new FormData();
formData.append('audio', blob, 'recording.webm');
// send the audio blob to the server
fetch('/record', {
method: 'POST',
body: formData
})
.then(response => response.json())
.then(data => {
// display the response on the browser
const out = '<div class="text-primary-emphasis fw-bolder pt-3">' + data["input"] +
'</div><div class="text-body-emphasis">' + data["output"] + '</div>';
printout.innerHTML += out;
// move to the bottom of the div
border.scrollTop = border.scrollHeight;
})
.catch((error) => {
console.error('Error:', error);
});
chunks = [];
}
// add the data into chunks when it's available
mediaRecorder.ondataavailable = function(e) {
chunks.push(e.data);
}
}
let onError = function(err) {
console.log('The following error occurred: ' + err);
}
navigator.mediaDevices.getUserMedia(constraints).then(onSuccess, onError);
}
else {
console.log('getUserMedia not supported on your browser!');
}
...
Then we tie the record button’s click event to toggle between starting and stopping the MediaRecorder. Once recording starts, the MediaRecorder fires dataavailable events, and we push the data into a chunks list.
We also capture the stop event on the MediaRecorder. When the MediaRecorder is stopped, we create a Blob with the chunks list and specify the mime type as audio/webm.
Then we place the audio Blob into a form and send it to the local server. The server should return with the transcribed input and the LLM response, which we place on the browser.
The server
The local server is a Flask web application. I placed the Flask application in a file named app.py. This imports the server variable from server.py, which we use to start the server.
I used the dotenv library to load various environment variables from a .env file, including OPENAI_API_KEY, SERPER_API_KEY and VOICE. The two API keys are self-explanatory, and VOICE is the voice ID that tells the TTS engine which voice to use when reading the text aloud.
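A .env file for Talkie might look something like this (the values below are placeholders; load.py, shown later, also reads a LOAD_DIR variable for the directory of local files to index):
OPENAI_API_KEY=sk-...
SERPER_API_KEY=...
# a macOS voice ID, for example
VOICE=com.apple.speech.synthesis.voice.samantha
LOAD_DIR=./docs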
import logging
import webbrowser
from contextlib import redirect_stdout
from io import StringIO
from dotenv import load_dotenv
load_dotenv()
from server import server
logger = logging.getLogger(__name__)
if __name__ == '__main__':
# open browser in a separate thread
with redirect_stdout(StringIO()):
chrome_path = 'open -a /Applications/Google\ Chrome.app %s'
webbrowser.get(chrome_path).open("http://localhost:3721")
# start server
server.run("127.0.0.1", 3721, debug=True)
I also start up the Chrome browser at the same time. This assumes you’re running on a macOS computer and have Chrome installed. I use Chrome because MediaRecorder doesn’t work reliably on Safari.
The heart of Talkie is in the server.py file, especially the record handler.
import os
import whisper
from flask import Flask, render_template, request
from chains import get_chat_chain, get_search_agent, get_qa_chain
import pyttsx3
# get path for static files
static_dir = os.path.join(os.path.dirname(__file__), 'static')
if not os.path.exists(static_dir):
static_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'static')
# audio file
audio_file = 'recording.webm'

# load whisper model
model = whisper.load_model('small.en')
chat_chain = get_chat_chain()
search_agent = get_search_agent()
qa_chain = get_qa_chain()
# start server
server = Flask(__name__, static_folder=static_dir, template_folder=static_dir)
@server.route('/')
def landing():
return render_template('index.html')
@server.route('/record', methods=['POST'])
def record():
# get file from request and save it
file = request.files['audio']
file.save(audio_file)
# transcribe the audio file using Whisper and extract the text
audio = whisper.load_audio(audio_file)
result = model.transcribe(audio)
text = result["text"]
# remove the temp audio file
if os.path.exists(audio_file):
os.remove(audio_file)
# predict the response to get the output
# output = chat_chain.predict(human_input=text)
# output = search_agent.run(input=text)
output = qa_chain.run(text)
# say out the response
engine = pyttsx3.init()
engine.setProperty('rate', 190)
engine.setProperty('voice', os.environ['VOICE'])
engine.say(output)
engine.startLoop()
# remove the temp audio file
if os.path.exists(audio_file):
os.remove(audio_file)
return {"input": text, "output": output.replace("\n", "<br />")}
First, I get the audio data from the form and save it into a temporary file, which I clean up before returning the results. Then I load the audio file and use Whisper (with the small.en model) to transcribe it into text.
Next, I pass the input text to one of the chains or agents created earlier (which I will describe in a short while); in the code above, the RetrievalQA chain is active, with the chat chain and search agent calls commented out. The chain or agent returns a response, and I use the pyttsx3 engine to say it out loud.
The chains.py file has all the LangChain stuff. I set up three functions:
Create and return an LLMChain chat chain
Create and return an AgentExecutor whose agent uses the Serper API tool to do a Google search
Create and return a RetrievalQA chain that answers questions from the local document index
Each function returns a response output that will be read out loud by pyttsx3.
from langchain import OpenAI, LLMChain, PromptTemplate
from langchain.memory import ConversationBufferWindowMemory, ConversationBufferMemory
from langchain.agents import initialize_agent, Tool
from langchain.chat_models import ChatOpenAI
from langchain.utilities import GoogleSerperAPIWrapper
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
# get a chat LLM chain, following a prompt template
def get_chat_chain():
# create prompt from a template
template = open('template', 'r').read()
prompt = PromptTemplate(
input_variables=["history", "human_input"],
template=template
)
# create an LLM chain with conversation buffer memory
return LLMChain(
llm=OpenAI(temperature=0),
prompt=prompt,
verbose=True,
memory=ConversationBufferWindowMemory(k=10),
)
# get a chat agent that uses the Serper API to search using Google Search
def get_search_agent():
# set up the tool
search = GoogleSerperAPIWrapper()
tools = [ Tool(name = "Current Search", func=search.run, description="search")]
# create and return the chat agent
return initialize_agent(
tools=tools,
llm=ChatOpenAI(),
agent="chat-conversational-react-description",
verbose=True,
memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True)
)
def get_qa_chain():
vectordb = Chroma(persist_directory='.',
embedding_function=OpenAIEmbeddings())
retriever = vectordb.as_retriever()
return RetrievalQA.from_chain_type(
llm=ChatOpenAI(temperature=0),
chain_type="stuff",
retriever=retriever)
The get_chat_chain function first creates a PromptTemplate from the template file shown below. Notice the two variables, {history} and {human_input}, embedded in this template, which we also declare when we create the PromptTemplate object.
Assistant is a large language model trained by OpenAI.
Assistant is designed to be able to assist with a wide range of tasks,
from answering simple questions to providing in-depth explanations and
discussions on a wide range of topics. As a language model, Assistant
is able to generate human-like text based on the input it receives,
allowing it to engage in natural-sounding conversations and provide
responses that are coherent.
Assistant is constantly learning and improving, and its capabilities
are constantly evolving. It is able to process and understand large
amounts of text, and can use this knowledge to provide accurate and
informative responses to a wide range of questions. Additionally,
Assistant is able to generate its own text based on the input it
receives, allowing it to engage in discussions and provide explanations
and descriptions on a wide range of topics.
Overall, Assistant is a powerful tool that can help with a wide range
of tasks and provide valuable insights and information on a wide range
of topics. Whether you need help with a specific question or just want
to have a conversation about a particular topic, Assistant is here to
assist.
{history}
Human: {human_input}
Assistant:
We use this template to pass in both the current input and the conversation history, which is stored in a ConversationBufferWindowMemory. We specify that it keeps records of up to 10 past human inputs.
The function then initialises an LLMChain with the prompt and the OpenAI LLM with the temperature set to 0 (the default is 0.7), and returns it. By default, this uses the text-davinci-003 GPT-3 model.
The get_search_agent function creates and returns an AgentExecutor that contains an agent of the type chat-conversational-react-description. It also has a tools list with a single tool, the GoogleSerperAPIWrapper, which, as its name suggests, uses the Serper API to call Google Search. The agent uses ChatOpenAI, which by default uses the gpt-3.5-turbo chat model, and it uses ConversationBufferMemory for memory.
The get_qa_chain function needs an index to be preloaded with the local files and documents. To do this, we have a load.py file.
import os
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from dotenv import load_dotenv
load_dotenv()
if not (os.path.exists('chroma-collections.parquet') and
os.path.exists('chroma-embeddings.parquet')):
loader = DirectoryLoader(os.environ['LOAD_DIR'])
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
vectordb = Chroma.from_documents(
documents=docs,
embedding=OpenAIEmbeddings(),
persist_directory='.')
vectordb.persist()
The loader checks if the index files exist (we are using the default names, chroma-collections.parquet and chroma-embeddings.parquet). If they don’t, we use the DirectoryLoader to load all the files in a given directory and convert them into documents, which we then split into chunks of 1,000 characters using a text splitter. After that, we use OpenAI embeddings to convert the document chunks into vectors and store them in the Chroma index files.
In the get_qa_chain function, we load the index from the Chroma files and then get a retriever from it, which we can use to create the RetrievalQA chain.
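Once load.py has been run and the index files exist, the chain can be tried out on its own; a quick check might look something like this (the question is just an example):
from dotenv import load_dotenv
load_dotenv()

from chains import get_qa_chain

# build the RetrievalQA chain on top of the persisted Chroma index
qa_chain = get_qa_chain()

# ask a question about the indexed documents
print(qa_chain.run('What are the loaded articles about?'))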
LLM applications
At the risk of sounding grandiose: if ChatGPT plugins feel like the beginning of the LLM ecosystem, LangChain feels like the dawn of the LLM application.
This is a big deal. Before this, training LLMs seemed to be the main thing, and it was the domain of large tech companies training proprietary, closed models. You needed to spend tens or hundreds of millions of dollars on huge amounts of data and compute over months or years.
Then it was fine-tuning models, which was great because you didn’t need to train a model from scratch; you just needed to provide a (relatively) small amount of data on top of a generic model to ‘tune’ it.
Then it turned out you don’t even need to do that. You can just do in-context learning and ‘train’ the generic model on the fly. Cue prompt engineering.
With LangChain now (and I’m sure there will be more libraries and tools coming out of the woodwork), the front gate is now kicked open, and the raging hordes have poured in. Paraphrasing Simon Willison — that furious typing sound you can hear is thousands of hackers around the world churning out LLM applications and taking over the world.