Streamline Your Prompts to Decrease LLM Costs and Latency
Discover 5 techniques to optimize token usage without sacrificing accuracy
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*yOBg0C11KoAt-bDLci5nxg.png)
High costs and latency are among the key obstacles to launching LLM Apps in production, and both are strongly related to prompt size
The versatility of Large Language Models has led many to treat them as a silver-bullet solution for pretty much any task. When you combine LLMs with tools such as RAG and API calling and give them detailed instructions, they can often perform at a near-human level.
The problem is that this all-in-one approach quickly leads to exploding prompt sizes, which result in infeasible costs and latency for each call.
If you use LLMs to optimize a high-value activity such as coding, keeping costs down is not the top priority: you can afford to wait half a minute for code that would normally take you half an hour to write. However, when using LLMs in customer-facing apps that handle thousands of chats, costs and latency can make or break your solution.
This article is the first in a series in which I will share insights from building an LLM-powered Real Estate Search Assistant, ‘Mieszko’, for Resider.pl. My key focus is bridging the gap between building an impressive POC and the struggles of actually running LLMs in production.
Prompt is all you have
While building Mieszko I relied heavily on LangChain, an amazing framework that helps structure your LLM apps. It adds neat levels of abstraction and provides efficient prompt templates that you can call with a few lines of code.
Its ease of use can make you forget that, no matter how complex your solution is, at the end of the day every component is packed into one long piece of text sent to the LLM: our prompt. Here is a high-level overview of what your LLM Agent prompt might look like.
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*317B1zDZHvVOgcecHNdqzg.png)
What drives costs and latency?
LLMs are some of the most complex AI models ever built. However, their latency and costs are driven mainly by the number of tokens processed on input and output, not by the actual difficulty of the task.
At the time of writing this article, the formula for OpenAI’s flagship GPT-4 Turbo pricing looks as follows:
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*fczZl_5PlS7vqPF29I0D7g.gif)
Latency is not as straightforward, but from my use-case experiments the best approximation came down to:
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*AozJymp8MS2Tgu7L1k0WYw.gif)
From a cost perspective, input tokens are usually the main contributor. They are 3x cheaper than output tokens, but for more complex tasks your prompt will be much longer than the user-facing output.
Latency, however, is driven mainly by output tokens, which take around 200 times longer to process than input tokens.
When using LLM Agents, ~80% of the tokens will come from the input, which contains both the initial prompt and the Agent’s reasoning. Those input tokens will be the key cost component, but they will have limited influence on response time.
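To make these relationships concrete, here is a rough back-of-the-envelope sketch in Python. The per-token prices match OpenAI’s GPT-4 Turbo list prices at the time of writing; the latency constant is a placeholder that only reflects the roughly 200x input/output ratio above, so treat it as an assumption to measure for your own setup, not the exact formula from the charts.

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # GPT-4 Turbo list prices at the time of writing: $0.01 / 1K input, $0.03 / 1K output
    return input_tokens * 0.01 / 1000 + output_tokens * 0.03 / 1000

def estimate_latency_s(input_tokens: int, output_tokens: int,
                       seconds_per_output_token: float = 0.03) -> float:
    # seconds_per_output_token is an illustrative placeholder; measure it for your deployment.
    # Output tokens dominate: each one takes roughly 200x longer than an input token.
    return (input_tokens / 200 + output_tokens) * seconds_per_output_token

# Example: a 4,000-token agent prompt producing a 300-token answer
print(estimate_cost_usd(4000, 300))   # ~$0.049, mostly driven by the input tokens
print(estimate_latency_s(4000, 300))  # ~9.6 s with the placeholder constant, mostly driven by the output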
Count your tokens
When you are building any app, apart from the initial prompt template itself, you will need at least some of the following components:
- Memory (all previous messages have to be added to each call)
- Tools (such as APIs available for LLM calls), with their detailed instructions
- Retrieval Augmented Generation (RAG) and the Context it produces
Technically you can add as many tools as you want. However, if you rely on the simplest Agent & Tools architecture, it won’t scale very well: for every available tool, you will need to send detailed instructions, such as API docs, every time you call the Agent.
When preparing a new component, it is worth considering how many tokens it will add to each call. If you are using OpenAI models, you can quickly check how token-hungry your new tool or additional instructions are:
- Directly on the website: https://platform.openai.com/tokenizer
- If you prefer to keep it in Python, you can use tiktoken with this simple function:
import tiktoken

def num_tokens_from_string(string: str, model_name: str) -> int:
    """Return the number of tokens the given string uses for the specified OpenAI model."""
    try:
        encoding = tiktoken.encoding_for_model(model_name)
    except KeyError:
        raise KeyError(
            f"Error: No encoding available for the model '{model_name}'. "
            "Please check the model name and try again."
        )
    return len(encoding.encode(string))
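For example, you can check how many tokens a new tool description would add to every call (the description below is a made-up example):

# Hypothetical tool description, just to illustrate the check
tool_docs = "search_offers(city: str, max_price: int) -> list  # returns matching listings"
print(num_tokens_from_string(tool_docs, "gpt-4"))  # number of tokens this adds to each call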
How to keep your prompts streamlined without sacrificing accuracy
At first encounter, LLMs may seem overwhelming, but at the end of the day, it’s worth remembering that we are still working with software. This means that we have to expect bugs, manage levels of abstraction and modularity, and look for more efficient solutions to handle subtasks.
Here are some general tips I found especially useful while developing Mieszko. For most of these topics, I will write follow-up articles in the coming weeks, so if you don’t want to miss out, I encourage you to subscribe to my profile here to receive notifications when they are published.
1. Split large prompts into several layered calls
Modularity and abstraction are among the key software engineering principles. Trying to handle complex problems with a single prompt is about as effective as writing spaghetti code.
When building Mieszko, one of the key performance boosts came from splitting prompts into two components:
- A Decision Agent with general guidelines about the next steps it could take and how to handle outputs from the Execution prompts
- Execution Agents/Chats with detailed instructions for specific tasks such as offer search, data comparison, or real estate knowledge
Layered Decision and Execution diagram
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*hW25pl-WEAxJmxB3o1yy0g.png)
This architecture allowed us to first pick which task-specific prompt was needed, without sending the token-heavy execution instructions with every call, reducing average token usage by over 60%.
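For illustration, here is a minimal sketch of that layered idea. The prompts, labels, and routing logic are simplified placeholders, not the actual Mieszko implementation, and `llm` is assumed to be a LangChain chat model as in the other snippets in this article.

from langchain_core.prompts import ChatPromptTemplate

# Lightweight Decision prompt: it only routes and carries no tool documentation.
decision_prompt = ChatPromptTemplate.from_template(
    "Classify the user request as one of: offer_search, comparison, knowledge.\n"
    "Answer with the label only.\n"
    "Request: {input}"
)

# Token-heavy Execution prompts live in separate templates, sent only when needed.
execution_prompts = {
    "offer_search": ChatPromptTemplate.from_template(
        "You are a real estate search assistant. <detailed search instructions>\n{input}"
    ),
    "comparison": ChatPromptTemplate.from_template(
        "You compare property offers. <detailed comparison instructions>\n{input}"
    ),
    "knowledge": ChatPromptTemplate.from_template(
        "You answer real estate questions. <detailed domain knowledge>\n{input}"
    ),
}

def answer(user_input: str) -> str:
    # First, cheap call: decide which task-specific prompt is needed.
    task = llm.invoke(decision_prompt.format_messages(input=user_input)).content.strip()
    # Second call: send only the chosen Execution prompt, not all of them.
    chosen = execution_prompts.get(task, execution_prompts["knowledge"])
    return llm.invoke(chosen.format_messages(input=user_input)).content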
2. Always monitor the final prompt sent in a call
The final prompt the LLM receives may end up a long way from your initial template. Enriching a prompt template with tools, memory, context, and the Agent’s internal reasoning can inflate your prompt by thousands of tokens.
Furthermore, LLMs can have the resilience of Rocky Balboa, getting up and providing reasonable answers even with buggy and contradictory prompts.
Going through dozens of calls and seeing what the LLM actually received gave me insights that led to key breakthroughs and bug fixes. I strongly recommend using LangSmith for in-depth analysis.
If you want the easiest solution you can also enable debugging in LangChain, which will provide you with the exact prompt sent in each call along with a tonne of useful information.
import langchain

# Prints the exact prompts sent to the LLM, along with other call details
langchain.debug = True
3. If something can be handled with a few lines of code think twice before feeding it to an LLM
One of the biggest mistakes when working with LLMs is forgetting that you can still write code. In my case, some of the biggest improvements came from handling the most trivial tasks upstream or downstream of the LLM calls with plain Python functions.
This is especially convenient in LangChain, where you can easily chain LLM calls with classic Python functions. Below is a simplified example I used to handle the LLM defaulting to English despite instructions to keep answers in the same language as the message.
We don’t need an LLM to detect a language; we can do it faster and more accurately with solutions like the Google Translate API or even a simple Python library. Once we know the input language, we can pass it explicitly to the prompt instructions, reducing the workload handled inside the LLM call.
from langdetect import detect
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# `llm` is your chat model (e.g. an OpenAI chat model instantiated elsewhere)
prompt = ChatPromptTemplate.from_template(
    """
Summarize the following message in {language}:

Message: {input}
"""
)

def detect_language(input_message):
    # Detect the language locally instead of asking the LLM to do it
    input_message["language"] = detect(input_message["input"])
    return input_message

detect_language_step = RunnablePassthrough.assign(input_message=detect_language)

chain_summarize_in_message_language = (
    detect_language_step
    | RunnablePassthrough.assign(
        language=lambda x: x["input_message"]["language"]
    )
    | prompt
    | llm
)

# Example call: chain_summarize_in_message_language.invoke({"input": "Cześć, jak się masz?"})
4. Be frugal with Agent prompts, as they are usually called at least twice
Agents with tools can take LLM capabilities to the next level; however, they are also very token-hungry. Here is a diagram showing the general logic of an LLM Agent answering a query.
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*4dwDhcdVrACxjHkY4vU_sQ.png)
As shown above, Agents usually need to call the LLM at least twice: first to plan how to use the tools, and then to parse their outputs into the final answer. As a result, any prompt savings you achieve are effectively doubled.
For some easier tasks, the Agent approach can be overkill, and a simple instruction prompt with a Completion call will provide similar results twice as fast. In the case of Mieszko, we decided to replace the Multi-Agent architecture with an Agent plus task-specific Completions for the majority of tasks.
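As a small illustration (the prompt and names below are mine, not from Mieszko, and `llm` is the same assumed chat model as in the other snippets), a simple task can be served with a single direct call instead of the plan-and-parse Agent loop:

from langchain_core.prompts import ChatPromptTemplate

# One task-specific prompt, one LLM round-trip, no tool planning or output-parsing step.
rephrase_prompt = ChatPromptTemplate.from_template(
    "Rewrite the following listing description in a friendly, concise tone:\n{input}"
)
simple_rephrase_chain = rephrase_prompt | llm
# simple_rephrase_chain.invoke({"input": "Spacious 3-room flat, 4th floor, no lift."})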
5. Use LLMs for reasoning but calculate and aggregate with SQL or Python
The most recent top LLMs have come a long way from the days of early GPT-3.5, which couldn’t get simple mathematical formulas right. Nowadays they are more than capable of aggregating hundreds of numbers for tasks such as calculating segment averages.
But they are even better at writing Python or SQL, which can execute complex calculations with 100% accuracy without needing trillions of parameters. Sending hundreds of numbers in an LLM call also costs hundreds of tokens: in the best case, every 3 digits are converted into one token, so larger numbers are represented by multiple tokens each.
To get faster analytical answers at a lower cost, use the LLM to understand the question and the available data, and to translate it into a more appropriate analytical language such as SQL or Python.
Then execute the generated code in a function that connects to the data and feeds only the final, aggregated output back to the LLM.
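Here is a minimal sketch of this pattern, assuming a local SQLite database with a hypothetical offers(city, price, area_m2) table and the same assumed `llm` object as in the earlier snippets:

import sqlite3
from langchain_core.prompts import ChatPromptTemplate

# Hypothetical schema, used only for illustration
sql_prompt = ChatPromptTemplate.from_template(
    "Translate the question into a single SQLite query over the table "
    "offers(city TEXT, price REAL, area_m2 REAL). Return only the SQL.\n"
    "Question: {question}"
)

def answer_analytical_question(question: str, db_path: str = "offers.db") -> str:
    # 1. Use the LLM only for reasoning: turn the question into SQL.
    sql = llm.invoke(sql_prompt.format_messages(question=question)).content.strip()
    # (In production, validate or whitelist the generated SQL before executing it.)
    # 2. Let the database do the math, with 100% accuracy and zero tokens.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()
    # 3. Send only the small, aggregated result back to the LLM to phrase the answer.
    return llm.invoke(
        f"Question: {question}\nQuery result: {rows}\nAnswer the question briefly."
    ).content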
Summary
I hope these general tips gave you a better understanding of the cost and latency drivers in LLM Apps and how to optimize them. This article was meant as a teaser of the key lessons learned from developing my first LLM-powered production App. In future articles, I will dive into more specific, hands-on examples, starting with:
- Improve LLM performance and reliability with Python guardrails
- Boost your LLM Agents with Modularity and Abstraction
- Understand and monitor your LLM Apps with LangSmith
Last but not least, a big shout-out to Filip Danieluk, co-creator of the Mieszko engine. Endless hours of discussions and pair programming led to the insights described in this series, which are as much Filip’s as mine.