
LANGCHAIN — Multi-Modal RAG Template
Software is a great combination between artistry and engineering. — Bill Gates
A multi-modal RAG (Retrieval Augmented Generation) template is a tool that allows developers to create applications for processing and generating responses using both text and visual content. This tutorial will provide an overview of how to use the LangChain multi-modal RAG template for slide decks.
Design
The multi-modal RAG template for slide decks follows the same principle as RAG apps on text documents, but it retrieves relevant visual content from slide decks. There are two general approaches to this problem:
Multi-modal Embeddings
This approach involves extracting the slides as images and using multi-model embeddings to retrieve the relevant slide image(s) based on the user input. The retrieved images are then passed to a multi-modal language model (in this case, GPT-4V) for answer synthesis.
Multi-vector Retriever
In this approach, the slides are again extracted as images, but GPT-4V is used to summarize each image. The image summaries, along with links to the original images, are then retrieved based on similarity to the user input. Finally, the retrieved images are passed to GPT-4V for answer synthesis.
Evaluation
To evaluate these methods, a public benchmark based on an investor presentation slide deck from Datadog was created. The benchmark consists of 10 questions, and the performance of the multi-modal approaches was compared with text-only RAG using LangSmith.
The evaluation results showed that both multi-modal approaches significantly outperformed text-only RAG. The multi-vector retriever with image summary achieved the highest accuracy, demonstrating the effectiveness of using GPT-4V for structured data extraction from images.
Deployment
The LangChain team has released a template that uses both Chroma and OpenCLIP multi-modal embeddings, making it easy for developers to get started with multi-modal RAG applications for slide decks. The template allows users to upload a presentation and run the playground with just two commands.
Conclusion
Multi-modal LLMs have the potential to unlock the visual content in slide decks for RAG applications. The benchmark evaluation showed that both multi-modal approaches outperformed text-only RAG. While there are trade-offs between the approaches, the release of the template will aid in the testing and deployment of multi-modal RAG apps.
In conclusion, the multi-modal RAG template provides a powerful tool for developers to create applications that can effectively process and generate responses using both text and visual content from slide decks.






