How to set up and run multimodal RAG in 4 lines of code!
Doing cool things with data!

Introduction
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to enhancing the capabilities of large language models (LLMs) by incorporating external knowledge sources. By combining the generation capabilities of LLMs with the ability to retrieve relevant information from databases, RAG models can produce more informed and contextual outputs.
Until about 6–9 months ago, setting up and running RAG models was a complex and time-consuming process, involving multiple components and intricate configurations. Fortunately, recent innovations in the field have significantly simplified the process of building and deploying RAG models. These innovations include the availability of various open-source vector databases, seamless integration with both open-source and closed-source language models, flexible chunking and embedding strategies, and the ability to incorporate data from multiple sources.
One library I recently used is EmbedChain. I have been a long-term LangChain user, so that tends to be my go-to. But I was pleasantly surprised that I could set up a multimodal RAG pipeline with EmbedChain in less than 10 minutes. I want to share the steps so you can also speed up your own RAG experimentation and deployments.
About EmbedChain
EmbedChain is an open-source framework that makes it easy to build and deploy retrieval-augmented generation (RAG) applications powered by large language models (LLMs). Its “Conventional but Configurable” approach caters to both software and machine learning engineers.
Key advantages of EmbedChain include:
- Simplifies RAG Development: Building robust RAG pipelines involves complexities like data integration, chunking, indexing, vector storage, and more. EmbedChain streamlines this process.
- Flexible Architecture: Choose components like LLMs, vector databases, data loaders, chunkers, and retrieval strategies to tailor the pipeline to your needs.
- Efficient Data Handling: EmbedChain automatically loads data, generates embeddings for relevant chunks, and stores them in your chosen vector database.
- User-Friendly APIs: Beginners can build LLM apps in just 4 lines of code, while advanced users can deeply customize the RAG pipeline.
The core workflow is straightforward:
- Add Data: Automatically load, chunk, embed, and index your data sources.
- Query: Turn user questions into embeddings to retrieve relevant documents.
- Generate: Use retrieved documents to craft precise answers with an LLM.
Whether you’re an expert or novice, EmbedChain abstracts away RAG complexities so you can focus on building powerful AI applications tailored to your data and use case.
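To make the add, query, generate flow concrete, here is a toy sketch in plain Python. It is not how EmbedChain is implemented: the `ToyRAG` class, the word-count "embedding", and the in-memory list standing in for a vector database are all assumptions made for illustration, and the final LLM call is left out so that only the prompt is built.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyRAG:
    """Hypothetical minimal pipeline: add data, retrieve chunks, build a prompt."""

    def __init__(self):
        self.index = []  # list of (chunk, vector) pairs; stands in for a vector DB

    def add(self, text):
        # Add Data: chunk, embed, and index the source text.
        for chunk in chunk_text(text):
            self.index.append((chunk, embed(chunk)))

    def retrieve(self, question, k=1):
        # Query: embed the question and rank stored chunks by similarity.
        q = embed(question)
        ranked = sorted(self.index, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

    def build_prompt(self, question, k=1):
        # Generate: a real pipeline would send this prompt to an LLM.
        context = "\n".join(self.retrieve(question, k))
        return f"{context}\n\nQuery: {question}\n\nHelpful Answer:"


rag = ToyRAG()
rag.add("Diffusion models generate images by gradually removing noise.")
rag.add("Vector databases store embeddings for similarity search.")
print(rag.build_prompt("Where are embeddings stored?"))
```

A real pipeline swaps each toy piece for a proper component (an embedding model, a vector store, an LLM), which is the part EmbedChain handles for you.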
Testing EmbedChain on a multimodal pipeline with PDFs and YouTube videos
So let’s build our short and simple EmbedChain pipeline. For this experiment, I chose a mix of YouTube videos and PDFs. I am curious to learn how the Sora model works, based on the information and theories available online and on YouTube. (There is no official paper from OpenAI, just a technical report with limited details.)
I start by defining my sources:
youtube_sources = ['https://www.youtube.com/watch?v=fG3IE9dkyKY',
                   'https://www.youtube.com/watch?v=5SOKVN3hav4',
                   'https://www.youtube.com/watch?v=r6Go6dGxrxg']
pdf_sources = ['2402.17177.pdf', 'Sora_technical_report_OpenAI.pdf']
And import the library:
import os
os.environ["OPENAI_API_KEY"] = "sk-"  # placeholder; use your own key
from embedchain import App
from embedchain.models.data_type import DataType
Getting your app up and running takes 3 simple steps:
1. Define the EmbedChain app. You can optionally pass a config; I share the details of mine below.
2. Add your data to the app. Use DataType to tell the app which type of data to expect, for example YOUTUBE_VIDEO and PDF_FILE in my case. This is so elegant in its design. At this step, your data is chunked, embedded, and added to a vector store.
3. Query your app.
This is it!
## Define the EmbedChain app
app = App.from_config(config=config)
## Add your sources to the app
for video in youtube_sources:
    app.add(video, data_type=DataType.YOUTUBE_VIDEO)
for pdf in pdf_sources:
    app.add(pdf, data_type=DataType.PDF_FILE)
## Query the app
app.query("Is the CLIP model used in SORA pipeline. If yes, how?")
The library is flexible, so you can customize specific pieces if you want to. This is done by passing a config, as shown below. But this is optional: you can use the default config to get started.
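One detail of the config that follows is the custom prompt, which contains `$context` and `$query` placeholders. These look like Python `string.Template` placeholders, so the sketch below shows how such a template could be filled with retrieved chunks and the user's question. The `fill_prompt` helper and the sample chunks are hypothetical, and this illustrates the idea rather than EmbedChain's internal code.

```python
from string import Template

# Same prompt text as in the config, wrapped in a standard-library Template.
PROMPT = Template(
    "Use the following pieces of context to answer the query at the end.\n"
    "If you don't know the answer, just say that you don't know, don't try to make up an answer.\n"
    "$context\n\nQuery: $query\n\nHelpful Answer:"
)


def fill_prompt(chunks, query):
    """Join retrieved chunks and substitute them into the template (hypothetical helper)."""
    return PROMPT.substitute(context="\n".join(chunks), query=query)


# Made-up chunks, only to show the substitution.
filled = fill_prompt(["Chunk one.", "Chunk two."], "What is in the context?")
print(filled)
```

If you write your own prompt, keeping both placeholders means the retrieved context and the question both reach the model.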
## Define your params
config = {
    'vectordb': {
        'provider': 'chroma',
        'config': {
            'collection_name': 'my-collection',
            'dir': 'db',
            'allow_reset': True
        }
    },
    'embedder': {
        'provider': 'openai',
        'config': {
            'model': 'text-embedding-3-small'
        }
    },
    'llm': {
        'provider': 'openai',
        'config': {
            'model': 'gpt-3.5-turbo-0125',
            'temperature': 0.5,
            'top_p': 1,
            'stream': False,
            'prompt': (
                "Use the following pieces of context to answer the query at the end.\n"
                "If you don't know the answer, just say that you don't know, don't try to make up an answer.\n"
                "$context\n\nQuery: $query\n\nHelpful Answer:"
            ),
            'system_prompt': (
                "You are an expert at looking at the provided context and answering user's query."
            ),
        }
    }
}
The RAG responses were good.
### Query
app.query("Is the CLIP model used in SORA pipeline. If yes, how?")
### Response
"Yes, the CLIP-like conditioning mechanism in Sora receives
LLM-augmented user instructions and potentially visual prompts
to guide the diffusion model in generating styled or themed videos.
This aspect of Sora's functionality showcases significant advancements
in the vision domain."
### Query
app.query("Was image captioning used to generate training data for SoRA? If yes, which model and how")
### Response
"Yes, image captioning was used to generate training data for SoRA.
The model utilized for this purpose is a video captioner capable of
producing detailed descriptions for videos. This video captioner was
trained to generate high-quality (video, descriptive caption) pairs
for all videos in the training data, which were then used to fine-tune
SoRA to improve its instruction following ability."
Conclusion
EmbedChain is a promising open-source framework that allows you to quickly build powerful retrieval-augmented generation (RAG) applications. By efficiently integrating language models and data from multiple sources, EmbedChain simplifies the creation of context-aware AI that understands natural queries. Its flexibility and ease of use make it an attractive option for leveraging the full capabilities of RAG technology across various domains and skill levels. I hope this short post encourages you to give it a shot.
At Opal AI, we have built multi-agent pipelines for our customers to solve real-world problems. Email me at [email protected] if you are interested in collaborating.






