Nayan Paul

Summary

The web content outlines practical strategies for enhancing accuracy, performance, and cost-efficiency in advanced Retrieval-Augmented Generation (RAG) applications powered by Generative AI and Large Language Models (LLMs).

Abstract

The article emphasizes the challenges and solutions in developing RAG applications with LLMs that meet business expectations in terms of accuracy, performance, and cost. The author, who has extensive experience in the field, presents a series of vetted and implemented recommendations categorized into cost optimizations and performance & accuracy optimizations. These include caching, prompt management, natural language rules engine (NLRE) implementation, hallucination prevention, prompt size optimization, chunk optimization, output structuring, post-production monitoring, the use of smaller language models (SLMs), rate limiting, managed vs. classic provisioned throughput units (PTUs), and load balancing. Additionally, the article discusses end-to-end optimization of the RAG pattern, model evaluation, dynamic model selection, memory management, multithreading, question optimization, hierarchical document management, custom ontology, query routing, task breakdown, and planning. The author stresses the importance of these strategies for building production-grade applications and provides references to their other blogs for detailed implementation guidance.

Opinions

  • The author believes that building LLM applications that truly deliver business value is challenging, with many initial designs being only proofs of concept rather than production-ready.
  • They advocate for a thorough evaluation of each recommendation per use case to impact business outcomes positively.
  • The author suggests that some advanced optimization techniques may not be necessary for proofs of concept but are crucial for production-grade applications.
  • They emphasize the importance of post-production monitoring for continuous optimization based on feedback.
  • The author is in favor of using smaller language models (SLMs) where appropriate and considers rate limiting a prudent approach to cost management.
  • They highlight the need for a balanced selection of models and the use of advanced techniques like agents, memory management, and custom ontologies to enhance performance and accuracy.
  • The author values the creation of a reusable prompt management system and considers it a key component in optimizing LLM applications.
  • They propose the use of a natural language rules engine to guide conversations and reduce the number of calls to LLMs.
  • The author sees the management of memory and the breakdown of tasks as essential for providing specific and accurate responses in conversational applications.
  • They recommend the use of hierarchical document management and custom ontologies to improve domain-specific accuracy and business adoption.

Practical ways to improve accuracy and performance and optimize cost for advanced RAG applications with GenAI and LLMs

Problem Statement

I say this to everyone I meet — "building LLM applications is easy; building LLM applications that are accurate and can really produce business value is hard".

Many times we have built RAG-based, LLM-powered applications only to realize that achieving business expectations in terms of performance, accuracy, and cost is near impossible. Many organizations have to re-build and re-design applications because the initial version was just a POC (proof of concept) and did not have the rigor, design, and considerations needed for an application that can be deployed to production.

In this blog, I will focus on what we can do to ensure we can build and deploy business-ready applications.

All of the steps below are ones I have personally vetted, researched, implemented, and used in many production-grade applications.

Without further ado, here we go -

Solution

The solution is a list of recommendations that we should evaluate per use case, deciding which of them make sense for your project and impact the business outcomes. I have divided the key recommendations into two categories: those that impact the bottom line (cost), and those that impact performance & accuracy.

Now, I understand that some of these recommendations are advanced scenarios, and you might be asking me about the value of implementing them for POCs. But just to recollect, this whole blog is about your desire to implement a production-grade LLM-powered application in the first place. If you want to stop at POCs, then obviously there is no real need to evaluate all of these recommendations, as some of them need additional technical assessment specific to your use case.

Lastly, the section below is a highlight of considerations; how to implement each of the recommendations is already covered in my other blogs. This blog answers the why and the what (not the how).

With the disclaimers out of the way, let's dive into the pointers for optimizing RAG applications.

Cost Optimizations :

  • Avoid calling LLMs all the time : The short answer is 'caching'. No matter which critical application you are building, human behavior ensures that, over a specific time span, audiences tend to ask repetitive questions. I have other blogs that discuss how to implement caching, recency, the order of refreshes, etc. When I have implemented a cache in my LLM applications, I have seen projects where the overall cost was reduced by more than 50% in some scenarios. (A minimal caching sketch is included after this list.)
  • Organize the number of calls to the LLM to get the right answer (Prompt Management) : I have done a blog on this as well. Prompt management, in simple words, means de-coupling the prompt from the application layer so we can manage, govern, test, evaluate, and finalize the optimal prompt for the application. Once we invest in the right prompt template, the answers are optimized as well. This reduces the back and forth and the number of calls to the LLM needed to get the right answer. (See the prompt registry sketch after this list.)
  • Optimize the number of calls to the LLM to get the right answer (NLRE) : I have covered this in a blog as well. This is a novel concept. For some of my customers, I have implemented a natural language rules engine that can guide the conversation, based on rules, toward a desired structure and, through that process, collect the right set of questions and points that need to be collated before asking the LLM to respond. Many times we get sub-optimal answers because we have asked a partial question. The NLRE provides that structure and guidance for interacting with LLMs.
  • Get the answer correct the first time by preventing hallucinations : Hallucinations are the 'Achilles heel' of LLM applications. This is not some knob we can just turn to fix things. Hallucinations need to be addressed at every step of the journey: by parsing the right data, building an optimal knowledge base, cross-verifying responses, and evaluating outcomes based on factual correctness and business value. Once we optimize every step of the application and have a fallback process, we should be able to get the right answer, thereby reducing the calls to LLMs.
  • Prompt size optimization through compression : The size of the input (or prompt) that goes to the LLM drives the bulk of the cost of each completion. Ideally, in RAG, we extract the right and meaningful context and include it as part of the prompt. Not all of the context items have equal importance. Tagging the chunks and compressing the context without losing meaning are some of the advanced techniques used to reduce cost.
  • Chunk optimization through re-ranking : Similar to the one above, but in this case, instead of compressing for meaning, we filter the chunks through a second round of re-ranking and optimization. This is also one of the newer and more advanced optimization techniques. (A small re-ranking sketch follows this list.)
  • Reduce the size of the response from the LLM through output structuring : The other side of the coin. Cost can be controlled and managed by reducing the number of words and tokens that come out of the LLM. From that point, enabling output parsing and structuring ensures that we ask for just enough information from the LLM and do not make the responses unnecessarily verbose. (See the structured-output sketch after this list.)
  • Post-production monitoring to continuously monitor and get feedback (to optimize) : Not all optimizations can be preemptive and pre-defined. A lot of optimizations and ideas come from iterative work on the aspects of RAG that can be improved. This means we need to build and set up a post-production monitoring capability that can continuously monitor the application across cost, SLA, accuracy, kinds of prompts, etc. The intention is not just to keep monitoring, but to use the feedback to extend and optimize the application (by optimizing prompts, output, and so on).
  • Smaller models (SLMs) : The talk of the town. With Azure launching newer SLMs, a lot of use cases (or parts of larger use cases) can be implemented with small language models instead of being over-reliant on large language models. A lot of organizations are adopting and using SLMs along with LLMs to solve complex use cases.
  • Rate Limiting : Another of the newer and more advanced techniques. Azure OpenAI models specifically offer "pay per use", "classic PTU", and "managed PTU" options to choose from for right-sizing and control. With PTUs we can also set thresholds and limits on the use of the LLMs. Using right-sizing and controls with rate limiting is a "better safe than sorry" story; note that rate limiting is reactive rather than proactive. (A token-bucket sketch follows this list.)
  • Observability KPIs for quota management : An extension of rate limiting, but with more direction around what changes and setup we need across DEV/UAT/PROD, and around the quota needed per use case.
  • Managed vs. classic PTUs vs. load balancing : This section focuses on a specific ask. Imagine we are tasked with implementing an application on GPT-4 where there are 100 active users on the system at any point. Considering the restrictions on the number of calls per minute, we can stitch together a solution by implementing load balancing across multiple LLM deployments or PTUs. This setup is more geared toward the "path to production", where we can do volumetric, traffic, and demographic analysis to get the optimized costing. (A round-robin sketch follows this list.)
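As a minimal sketch of the caching recommendation above (the call_llm helper is a placeholder for whatever model client you actually use, and the in-memory dict stands in for a real cache store such as Redis):

```python
import hashlib
import time

def call_llm(prompt: str) -> str:
    # Hypothetical model client; wire this to your actual LLM endpoint.
    raise NotImplementedError

_CACHE: dict[str, tuple[str, float]] = {}
CACHE_TTL_SECONDS = 60 * 60  # refresh cached answers every hour

def cached_answer(question: str) -> str:
    """Return a cached answer for repeated questions, calling the LLM only on a miss."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit is not None:
        answer, stored_at = hit
        if time.time() - stored_at < CACHE_TTL_SECONDS:
            return answer  # cache hit: no LLM call, no token cost
    answer = call_llm(question)
    _CACHE[key] = (answer, time.time())
    return answer
```

A real deployment would also consider semantic (embedding-based) matching so paraphrased questions hit the same cache entry.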
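One way to read the prompt-management point (a sketch under my own assumptions, not the author's exact system): keep versioned prompt templates outside the application code and render them at call time, so a template can be tested, governed, and swapped without touching the application layer.

```python
import string

# Hypothetical prompt registry; in practice this could live in a database or config store.
PROMPT_REGISTRY = {
    ("answer_with_context", "v2"): string.Template(
        "Answer the question using only the context below.\n"
        "Context:\n$context\n\nQuestion: $question\nAnswer:"
    ),
}

def render_prompt(name: str, version: str, **fields: str) -> str:
    """Look up a governed prompt template and fill in the runtime fields."""
    template = PROMPT_REGISTRY[(name, version)]
    return template.substitute(**fields)

prompt = render_prompt(
    "answer_with_context", "v2",
    context="PTUs are provisioned throughput units for Azure OpenAI.",
    question="What is a PTU?",
)
```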
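For the re-ranking bullet, a toy second-pass filter over already-retrieved chunks (the keyword-overlap score is a stand-in assumption for a cross-encoder or LLM-based re-ranker):

```python
def overlap_score(question: str, chunk: str) -> float:
    """Toy relevance score: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def rerank_and_trim(question: str, chunks: list[str], keep: int = 3) -> list[str]:
    """Second-pass re-rank of retrieved chunks; keep only the top few to shrink the prompt."""
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    return ranked[:keep]

chunks = [
    "PTU stands for provisioned throughput unit.",
    "The cafeteria opens at 8 am.",
    "PTUs are billed per unit of reserved capacity.",
]
print(rerank_and_trim("what is a PTU and how is it billed", chunks, keep=2))
```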
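For output structuring, a sketch of asking for a compact JSON answer instead of free-form prose (the schema and the call_llm placeholder are illustrative assumptions, not a specific SDK feature):

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical model client; wire this to your actual LLM endpoint.
    raise NotImplementedError

def ask_structured(question: str, context: str) -> dict:
    """Ask for a short, schema-bound JSON answer to cap the number of output tokens."""
    prompt = (
        "Answer using only the context. Respond with JSON of the form "
        '{"answer": "<one sentence>", "confidence": "high|medium|low"} and nothing else.\n'
        f"Context: {context}\nQuestion: {question}"
    )
    raw = call_llm(prompt)
    return json.loads(raw)  # fail fast if the model drifts from the requested structure
```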
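Rate limiting on the application side can be as simple as a token bucket in front of the model client. This is a generic sketch of that guard; service-side quotas and PTU limits are configured in Azure itself and are separate from this code.

```python
import time
import threading

def call_llm(prompt: str) -> str:
    # Hypothetical model client; wire this to your actual LLM endpoint.
    raise NotImplementedError

class TokenBucket:
    """Application-side rate limiter: allow at most `rate` calls per `per` seconds."""
    def __init__(self, rate: int, per: float):
        self.capacity = rate
        self.tokens = float(rate)
        self.per = per
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            refill = (now - self.updated) * self.capacity / self.per
            self.tokens = min(self.capacity, self.tokens + refill)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

bucket = TokenBucket(rate=60, per=60.0)  # e.g. 60 LLM calls per minute

def guarded_call(prompt: str) -> str:
    if not bucket.allow():
        raise RuntimeError("rate limit reached, retry later")
    return call_llm(prompt)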
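And for the load-balancing scenario, a round-robin sketch across several model deployments (the deployment names and the call_deployment helper are illustrative assumptions; a production setup would also handle retries and health checks):

```python
import itertools

# Hypothetical deployment names covering PTU and pay-per-use endpoints.
DEPLOYMENTS = ["gpt4-ptu-east", "gpt4-ptu-west", "gpt4-payg-backup"]
_rotation = itertools.cycle(DEPLOYMENTS)

def call_deployment(deployment: str, prompt: str) -> str:
    # Hypothetical per-deployment client; wire this to the endpoint for the given deployment.
    raise NotImplementedError

def balanced_call(prompt: str) -> str:
    """Spread calls round-robin across deployments so no single quota is exhausted."""
    deployment = next(_rotation)
    return call_deployment(deployment, prompt)
```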

Performance & Accuracy Optimizations :

  • Optimize the RAG pattern end-to-end : A blog I have already written. There is not a lot to put here as a one-liner, but considerations such as proper ingestion and extraction strategies, chunking, optimizations, metadata, etc. all come together to provide the optimized application outcome that the business wants. Needless to say, the data and AI engineering effort is just as important as the prompt engineering effort. (A chunking sketch follows this list.)
  • Choose the right model through evaluations : Models differ in their accuracy, cost, and so on, so the overall performance and cost are tied to model selection. One of the key solutions is building a model evaluation framework based on the use case and business know-how, and using that framework to identify the correct model. This is typically done by collecting ground truths and using them to score model responses collectively. (A minimal evaluation loop follows this list.)
  • Optimize runtime model calls using switchboards : Something I have discussed and blogged about. Instead of relying on a single model selected for a specific use case, we can make a case for selecting models dynamically based on the intent and classification of the question itself. In this scenario, the model is chosen on demand per question. This approach can reduce cost considerably by not having one default model answer every question at any time. (See the switchboard sketch after this list.)
  • Advanced capabilities like agents : We all know LLM agents. They specialize in solving a specific use case or task. If we build our applications with an 'agent first' approach and then choose the right agent for the right task, we can optimize the response and add efficiency to the tasks.
  • Manage memory to get more specific answers in one shot : Typically, in conversational use cases, it is important to keep track of the conversation and provide the required context from earlier history to answer questions. If we can externalize and manage the memory in an efficient way, then we can 'light up' only the required parts of the memory in a conversation, thereby getting better performance and accuracy. (A memory sketch follows this list.)
  • Custom multithreading : Sometimes we ignore the basics. In some of my projects, just by introducing Python-based multithreading and speeding up the search, the assembly of context, etc., we have seen huge performance gains. This is one for everyone to try out. (A ThreadPoolExecutor sketch follows this list.)
  • Question optimization : One of the advanced techniques. Using question formatting, question optimization, multi-hop answering, etc., we can get huge accuracy improvements. This one needs to be evaluated on a per-use-case basis, because sometimes the accuracy comes at a cost of time or money.
  • Hierarchical document management : Another of the advanced concepts. In this case, the knowledge repository is not only set up and managed as a flat sequence of items but organized into two structures: one managed as a sequence of items, and a second managed hierarchically so that search, commonality detection, and retrieval can be achieved quickly.
  • Custom ontology / semantic layer : Use cases are domain specific, and accuracy and business adoption come through business mapping. Adding an ontology and semantic layer, and pairing them with a knowledge graph or incorporating them through a natural language rules engine, ensures that the application has more knowledge about the domain than the baseline LLMs.
  • Query routing : More like function calling, but with a broader reach. In this kind of setup, the LLM application can route queries to independent processors, each of which has a customized RAG underneath. This is more like a RAG of RAGs.
  • Task breakdown and planning : I have blogged about this. Build the application through an MVC setup. The model is the LLM itself, but the controller is a special processor that can take a question, break it down, and design a plan for executing each step. Each step is then executed through the view part of MVC to guide the outcomes. Finally, the controller assembles the answer and produces the final result. (A planner sketch follows this list.)
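As one small piece of the end-to-end RAG pipeline, a sketch of fixed-size chunking with overlap (the sizes are illustrative and should be tuned per corpus and per embedding model):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character chunks before embedding and indexing."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # overlap keeps sentences from being cut off at chunk boundaries
    return chunks
```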
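For model evaluation, a minimal ground-truth scoring loop (the ask_model helper, the candidate model names, and the exact-substring scorer are all assumptions; a real framework would use richer metrics such as semantic similarity or LLM-as-judge scoring):

```python
def ask_model(model: str, question: str) -> str:
    # Hypothetical helper that calls a specific candidate model.
    raise NotImplementedError

GROUND_TRUTH = [
    {"question": "What does PTU stand for?", "answer": "provisioned throughput unit"},
    # ... more curated question/answer pairs collected from business SMEs
]

def score_model(model: str) -> float:
    """Fraction of ground-truth answers the model reproduces (toy exact-substring match)."""
    hits = 0
    for item in GROUND_TRUTH:
        response = ask_model(model, item["question"]).lower()
        if item["answer"].lower() in response:
            hits += 1
    return hits / len(GROUND_TRUTH)

# Compare candidates and pick the best accuracy/cost trade-off, e.g.:
# scores = {m: score_model(m) for m in ["gpt-4", "gpt-35-turbo", "phi-3-mini"]}
```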
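For the switchboard idea, a sketch that classifies the question and picks a model per intent (the intent rules and model names are illustrative assumptions; the classifier could itself be a small model or the NLRE described earlier):

```python
def classify_intent(question: str) -> str:
    """Toy intent classifier; in practice this could be a small model or a rules engine."""
    q = question.lower()
    if any(word in q for word in ("compare", "analyze", "why")):
        return "complex_reasoning"
    return "simple_lookup"

MODEL_SWITCHBOARD = {
    "complex_reasoning": "gpt-4",       # more expensive, higher quality
    "simple_lookup": "gpt-35-turbo",    # cheaper, good enough for lookups
}

def pick_model(question: str) -> str:
    """Select the model dynamically based on the question instead of a fixed default."""
    return MODEL_SWITCHBOARD[classify_intent(question)]
```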
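For memory management, a sketch of externalized conversation memory where only the most relevant earlier turns are included per question (the keyword filter is an assumption standing in for a proper semantic search over the history):

```python
class ConversationMemory:
    """Keep the full history externally, but only 'light up' the relevant turns per question."""
    def __init__(self, max_turns_in_prompt: int = 4):
        self.history: list[tuple[str, str]] = []  # (user, assistant) pairs
        self.max_turns_in_prompt = max_turns_in_prompt

    def add(self, user: str, assistant: str) -> None:
        self.history.append((user, assistant))

    def relevant_context(self, question: str) -> str:
        """Return only the turns most related to the current question, as prompt context."""
        q_words = set(question.lower().split())
        scored = sorted(
            self.history,
            key=lambda turn: len(q_words & set(turn[0].lower().split())),
            reverse=True,
        )
        selected = scored[: self.max_turns_in_prompt]
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in selected)
```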
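The multithreading point is mostly standard-library work; here is a sketch that fans out retrieval calls with ThreadPoolExecutor (the search_one helper is an assumption for whatever index or API you query):

```python
from concurrent.futures import ThreadPoolExecutor

def search_one(source: str, question: str) -> list[str]:
    # Hypothetical helper that queries one index or API and returns matching chunks.
    raise NotImplementedError

def parallel_search(question: str, sources: list[str]) -> list[str]:
    """Query every source concurrently instead of sequentially; the work is I/O-bound, so threads help."""
    with ThreadPoolExecutor(max_workers=min(8, max(len(sources), 1))) as pool:
        results = list(pool.map(lambda s: search_one(s, question), sources))
    return [chunk for chunks in results for chunk in chunks]
```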
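And for task breakdown and planning, a sketch of the controller idea: ask the model for a plan, execute each step, then assemble the final answer (the call_llm helper and the JSON plan format are assumptions, not the author's exact design):

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical model client; wire this to your actual LLM endpoint.
    raise NotImplementedError

def plan_and_answer(question: str) -> str:
    """Controller: break the question into steps, run each step, then compose the final answer."""
    plan_prompt = (
        "Break the question into 2-4 independent sub-questions. "
        f"Respond with a JSON list of strings only.\nQuestion: {question}"
    )
    steps = json.loads(call_llm(plan_prompt))
    partial_answers = [call_llm(f"Answer briefly: {step}") for step in steps]
    combined = "\n".join(f"- {s}: {a}" for s, a in zip(steps, partial_answers))
    final_prompt = (
        "Using these partial findings, answer the original question.\n"
        f"Findings:\n{combined}\nQuestion: {question}"
    )
    return call_llm(final_prompt)
```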
