Passing An Audio File To An LLM
In this article, I'll explain how we can pass an audio file to an LLM, and I'll be using OpenAI as our LLM.
Many people prefer audio and video tutorials over reading, and for podcast lovers in particular, listening seems more effective than reading a book, an e-book, or an article. It is also quite common that, after a certain period of time, we forget portions of a tutorial. To get those insights back, re-watching or re-listening is the only option, which can be very time-consuming.
So, the best solution is to build a small AI-based application, in just a few lines of code, that can analyze the audio and answer whatever questions the user asks.
Utilizing generative AI could be the best option here, but the problem is that we can't pass audio to an LLM directly, since it is text-based. Let's dive into this article to understand, step by step, how we can make this work.
High-Level Steps
To execute the solution end-to-end, we need to work with the components/libraries below:
Audio to Text Generator
- For transcript generation, we will be using AssemblyAI
Embedding Generator
- For generating the embeddings, we will be using OpenAIEmbeddings
Vector Database
- Chroma will be used as an in-memory database for storing the vectors
Large Language Model
- OpenAI as LLM
All of these are wrapped under a library called LangChain, so we will be utilizing that heavily too.
First of all, we need to grab the keys as shown below:
Get An OpenAI API Key
To get the OpenAI key, go to https://openai.com/, log in, and grab a key from the API keys section of your account.
Get An AssemblyAI API Key
To get the AssemblyAI key, go to AssemblyAI | Account, log in, and grab the key from your account dashboard.
Install Packages
Install these packages:
assemblyai>=0.17.0
openai>=0.28.0
sentence-transformers>=2.2.2
langchain>=0.0.278
chromadb>=0.4.8
tiktoken>=0.5.1
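If you're using pip, you can install them all in one go (a sample command; adjust it for your own environment):

pip install "assemblyai>=0.17.0" "openai>=0.28.0" "sentence-transformers>=2.2.2" "langchain>=0.0.278" "chromadb>=0.4.8" "tiktoken>=0.5.1"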
Import Required Packages
Once the dependent libraries are installed, import the packages below:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AssemblyAIAudioTranscriptLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
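Both API keys also need to be visible to these libraries before we run anything. Here is a minimal sketch, assuming you expose them as environment variables (the OpenAI integrations read OPENAI_API_KEY and the AssemblyAI SDK reads ASSEMBLYAI_API_KEY; replace the placeholder values with the keys you grabbed earlier):

import os

# Placeholder values only - substitute your own keys
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ASSEMBLYAI_API_KEY"] = "your-assemblyai-key"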
Transcribe Audio
The next step is to extract text out of the audio file. Here is the sample audio I've used for this article.
doc = AssemblyAIAudioTranscriptLoader(
    file_path="https://storage.googleapis.com/aai-docs-samples/nbc.mp3"
).load()
Here is what doc looks like:
[Document(page_content="Load time, a new president and new congressional makeup. Same old partisan divides, right? Yes and no. There's the traditional red blue divide you're very familiar with, but there's a lot more below the surface going on in both parties…", metadata={'language_code': <LanguageCode.en_us: 'en_us'>, 'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3', 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': None, …, 'words': [… {'text': 'are', 'start': 26418, 'end': 26566, 'confidence': 0.7851}, {'text': 'what', 'start': 26588, 'end': 26726, 'confidence': 0.99999}, …], …})]
The full metadata carries complex objects (like word-level timings) that we don't need for retrieval, so let's trim it down to just the audio URL using the code below:
doc[0].metadata = {"audio_url": doc[0].metadata["audio_url"]}
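A quick print confirms only the URL is left:

print(doc[0].metadata)
# {'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3'}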
Chunk The Text
I'm using a chunk size of 700 characters here, but you can change this number.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=0)
texts = text_splitter.split_documents(doc)
texts will look like this:
[Document(page_content="Load time, a new president and new congressional makeup. Same old partisan divides, right?…"), Document(page_content="supporters of former President Donald Trump. We're going to call them the Trump Republican. Another 17%…"), Document(page_content="Republicans are firmly against compromising with Biden in order to gain consensus on legislation, as y…"), Document(page_content="make it easier on yourself to form a governing coalition, something the Biden White House may want to think about. When we come back…")]
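If you're curious how many chunks the splitter produced before embedding them, a quick check looks like this:

print(len(texts))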
Generate Embeddings And Save To Database
In this step, we will generate embeddings for the chunks above, store them in Chroma, and instantiate the chat model, using the lines below:
db = Chroma.from_documents(texts, OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
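Note that this Chroma instance lives in memory, so the vectors vanish when the process exits. If you want to reuse them across runs, Chroma can also persist to disk (a sketch, assuming a writable local folder of your choice):

db = Chroma.from_documents(texts, OpenAIEmbeddings(), persist_directory="./chroma_db")  # "./chroma_db" is an assumed path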
Query And Get The Response
This step creates a RetrievalQA chain; passing our query to it will give us an answer.
chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}))
query = "What this audio file is all about?"
chain({"query": query})
Here is the output:
{'query': 'What this audio file is all about?', 'result': 'The audio file discusses the current political landscape in the United States, specifically focusing on the divisions within the Democratic and Republican parties. It mentions the emergence of four political parties within these two major parties and discusses their differing views on compromising with President Biden to pass legislation.'}
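Since the chain is reusable, you can keep asking follow-up questions. For example (a hypothetical query based on the answer above):

query = "What are the four political parties mentioned in the audio?"
chain({"query": query})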
You can see how easy it was to transcribe the audio and get our questions answered.
Takeaway
I hope this gave you an idea of how to read an audio file and ask questions about it. If anything feels unclear at any point, please feel free to watch my video explaining these steps: