How RAG will change with Multi Modal LLM and Generative AI Introduction

Summary

The website content discusses the evolution of RAG (Retrieval-Augmented Generation) with the integration of Multi Modal Large Language Models (LLMs) and Generative AI, particularly focusing on the implications of GPT4 Vision models for enhancing RAG applications.

Abstract

The article titled "How RAG will change with Multi Modal LLM and Generative AI" delves into the transformative potential of multi-modal RAG systems, emphasizing the new capabilities brought by GPT4 Vision models. It outlines the shift from traditional text-based RAG to multi-modal systems that can interpret and utilize both text and visual information. The author highlights three emerging patterns for integrating text and images in RAG solutions, which range from minimal disruption to existing systems to more advanced, disruptive approaches that involve creating a unified index for both text and image embeddings. These patterns aim to leverage the rich content in infographics and other visual materials that were previously inaccessible to RAG applications. The article concludes by acknowledging the evolving nature of the field and invites readers to contribute their insights and experiences with multi-modal RAGs.

Opinions

The author believes that multi-modal LLMs, such as GPT4 Vision, will significantly enhance RAG applications by enabling them to process and utilize visual information, which was a limitation in the past.
There is an anticipation that the integration of multi-modal data will lead to the development of new use cases and the need for updated patterns in RAG solutions.
The article suggests that while the existing RAG systems can benefit from the new multi-modal capabilities with minimal changes, more integrated approaches will likely become standard as the technology matures.
The author expresses that it's too early to determine the superiority of any single implementation pattern, as each has its strengths in specific scenarios.
Acknowledging the limitations of the current approach, the author points out that in Pattern 1, questions can only be asked in text format, which might not be ideal for all use cases.
The author is optimistic about the future of multi-modal RAGs, indicating that the field is evolving and that more patterns and insights will emerge as the technology advances.

How RAG will change with Multi Modal LLM and Generative AI Introduction

Problem Statement

So, RAG is pretty 2023, and with the availability of GPT4 vision models, we are talking about multi-modal RAG. What needs to be considered now, what changes we might encounter and what new updates we need to focus on for this multi-modal LLMs is covered as part of this blog. My focus for this blog is to consolidate some of the “asks” from my customers with some new use cases around multi-modal LLMs and publish them as thoughts and patterns. Some of these patterns are little futuristic and might be art of possible with the multi-modal RAGs.

So, let us dive into the RAG for multi-modals.

Solution

The multi modal LLMs give us more power. There were many projects in the past where infographic information like bar charts and pie charts that contained tons of content and materials were simply ignored coz we did not had the means (or the means was expensive and not time to material conducive) to decipher and use the knowledge to power our applications. Now, with the arrival of multi modal LLMs (we will be discussing with GPT4 Vision)- we have the option to extend our existing RAG based solutions and make them even better!

Although multi modal LLMs have opened up new use cases and options which I might discuss in a separate blog. In this one, I am focussing on RAG itself (hence the title). For RAG based solution in LLMs, below are some of the patterns emerging on how these text and images can be combined.

There are 3 clear patterns — (I am sure we can come up with more), but for now lets discuss more on these 3.

Imagine a use case where we received 100s of files of varied formats. The formats can be PDFs or DocX or similar etc. Now the content of the files have images ( not OCR data but info graphics like bar charts or graphs or even images that have maps and overlays etc.), our responsibility is to educate our RAG application to learn on the knowledge from these documents that include the text as well as images.

We have 3 implementation patterns we can adopt. It is too early to tell which one is better coz each one works well based on specific scenarios.

Pattern 1 is where I feel most of the existing RAG can benefit without any disruptive changes. In this case we extract the images from the documents into an isolated landing area. We build and run execution process that can load each image at a time and process them using GPT4 vision model to extract content, summary, synopsis and key value pairs as text blob. Once the information is available we can use existing embedding models to convert text into vectors to be stored into cognitive search ( or any other vector DB). When the application is invoked by the users ( or system), the question is converted to embedding ( or we can do advanced similarity search on question) and used for search against the vector database. The results will be either content from the text or the image ( text synopsis of the image ). For generating the response, we can pass the content itself or if we have added reference of the image to the image synopsis then we can also retrieve the original content and display image.

One of the downside of this approach is that we can only ask questions as text.

For pattern 2 – we have a hybrid option. In this case, we build 2 indexes ( again minimum disruption)- one we build with text embeddings using the usual models and save in existing index. The other is where we use a multi modal embeddings ( like CLIP) to vectorise the image itself and save in a separate index. Once the application executes and someone asks a question, either providing a text question or image uploaded as question; either way the question can be converted to an embedding ( based on type) and used to search across both the indexes to gather possible answer and finally using LLM to generate response.

The third and final pattern is a bit disruptive but this will become the standard process soon as more and more companies will implement their own multi modal embeddings model. In this case – we build a single index. The index will contain both the text and image embeddings. We run 2 pipelines – one to process the text where the text content is embedded with multi modal embeddings like CLIP and stored in vector database with references and links. The second pipeline is exactly same as first one (with multi modal embeddings etc) except that instead of text we process images and load them into the vector DB. Once all the data is loaded then any question asked ( image or text based) will be first embedded and then used to perform a search against the single index that has all the embedded data now . The result then can be used as a context and used for response generation.

Conclusion

Multi Modal LLM based RAGs are making things interesting. As this is an evolving space and I am sure more patterns will evolve and new options will change the landscape. I will keep updating this blog as I learn more and publish some pros and cons of these options. Do add your comments and your suggestions below.