Passing An Audio File To An LLM
In this article, I'll explain how we can pass an audio file to an LLM, and I'll be using OpenAI as our LLM.
Many people prefer audio and video tutorials over reading, and for podcast lovers in particular, listening seems more effective than reading a book, an e-book, or an article. It is also quite common that, after a certain period of time, we forget portions of a tutorial. To get those insights back, re-watching or re-listening is the only option, which can be very time-consuming.
So, the best solution is to build a small AI-based application, in just a few lines of code, that can analyze the audio and answer whatever questions the user asks.
Utilizing generative AI could be the best option here, but the problem is that we can't pass audio to an LLM directly, since it is text-based. Let's dive into this article to understand, step by step, how we can make this work.
High-Level Steps
To execute the solution end-to-end, we need to work with the components/libraries below:
Audio to Text Generator
- For transcript generation, we will be using AssemblyAI
Embedding Generator
- For generating the embeddings, we will be using OpenAIEmbeddings
Vector Database
- Chroma will be used as an in-memory database for storing the vectors
Large Language Model
- OpenAI as LLM
All of these are wrapped under a library called LangChain, so we will be utilizing that heavily too.
First of all, we need to grab the keys as shown below:
Get An OpenAI API Key
To get the OpenAI key, go to https://openai.com/, log in, and grab a key from the API keys section of your account.
Get An AssemblyAI API Key
To get the AssemblyAI key, go to AssemblyAI | Account, log in, and grab the key from your account dashboard.
Install Packages
Install these packages:
assemblyai>=0.17.0
openai>=0.28.0
sentence-transformers>=2.2.2
langchain>=0.0.278
chromadb>=0.4.8
tiktoken>=0.5.1
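If you're using pip, you can install them all in one go (a sample command; adjust it for your own environment):

pip install "assemblyai>=0.17.0" "openai>=0.28.0" "sentence-transformers>=2.2.2" "langchain>=0.0.278" "chromadb>=0.4.8" "tiktoken>=0.5.1"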
Import Required Packages
Once the dependent libraries are installed, import the packages below:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AssemblyAIAudioTranscriptLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
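Both API keys also need to be visible to these libraries before we run anything. Here is a minimal sketch, assuming you expose them as environment variables (the OpenAI integrations read OPENAI_API_KEY and the AssemblyAI SDK reads ASSEMBLYAI_API_KEY; replace the placeholder values with the keys you grabbed earlier):

import os

# Placeholder values only - substitute your own keys
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ASSEMBLYAI_API_KEY"] = "your-assemblyai-key"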
Transcribe Audio
The next step is to extract text out of the audio file. Here is the sample audio I've used for this article.
doc = AssemblyAIAudioTranscriptLoader(
    file_path="https://storage.googleapis.com/aai-docs-samples/nbc.mp3"
).load()
Here is what doc looks like:
[Document(page_content="Load time, a new president and new congressional makeup. Same old partisan divides, right? Yes and no. There's the traditional red blue divide you're very familiar with, but there's a lot more below the surface going on in both parties…", metadata={'language_code': <LanguageCode.en_us: 'en_us'>, 'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3', 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, …, 'words': [… {'text': 'are', 'start': 26418, 'end': 26566, 'confidence': 0.7851}, {'text': 'what', 'start': 26588, 'end': 26726, 'confidence': 0.99999}, …], …})]
The full metadata carries complex objects (like word-level timings) that we don't need for retrieval, so let's trim it down to just the audio URL using the code below:
doc[0].metadata = {"audio_url": doc[0].metadata["audio_url"]}
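A quick print confirms only the URL is left:

print(doc[0].metadata)
# {'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3'}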
Chunk The Text
I'm using a chunk size of 700 characters here, but you can change this number.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=0)
texts = text_splitter.split_documents(doc)
texts will look like this:
[Document(page_content="Load time, a new president and new congressional makeup. Same old partisan divides, right?…"), Document(page_content="supporters of former President Donald Trump. We're going to call them the Trump Republican. Another 17%…"), Document(page_content="Republicans are firmly against compromising with Biden in order to gain consensus on legislation, as y…"), Document(page_content="make it easier on yourself to form a governing coalition, something the Biden White House may want to think about. When we come back…")]
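If you're curious how many chunks the splitter produced before embedding them, a quick check looks like this:

print(len(texts))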
Generate Embeddings And Save To Database
In this step, we will generate embeddings for the chunks above, store them in Chroma, and instantiate the chat model, using the lines below:
db = Chroma.from_documents(texts, OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
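Note that this Chroma instance lives in memory, so the vectors vanish when the process exits. If you want to reuse them across runs, Chroma can also persist to disk (a sketch, assuming a writable local folder of your choice):

db = Chroma.from_documents(texts, OpenAIEmbeddings(), persist_directory="./chroma_db")  # "./chroma_db" is an assumed path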
Query And Get The Response
This step creates a RetrievalQA chain; passing our query to it will give us an answer.
chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}))
query = "What this audio file is all about?"
chain({"query": query})
Here is the output:
{'query': 'What this audio file is all about?', 'result': 'The audio file discusses the current political landscape in the United States, specifically focusing on the divisions within the Democratic and Republican parties. It mentions the emergence of four political parties within these two major parties and discusses their differing views on compromising with President Biden to pass legislation.'}
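Since the chain is reusable, you can keep asking follow-up questions. For example (a hypothetical query based on the answer above):

query = "What are the four political parties mentioned in the audio?"
chain({"query": query})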
You can see how easy it was to transcribe the audio and get our questions answered.
Takeaway
I hope this gave you an idea of how to read an audio file and ask questions about it. If anything feels unclear at any point, please feel free to watch my video explaining these steps: