OpenAI Models’ Drift, Fast & Stupid? — New Insights and Consequences
Your business might be at risk! We address in this blog post, one of the most dangerous side effect of relying on proprietary models: model Drift! Imagine building an AI product based on OpenAI models, and then noticing something weird… the model doesn’t seem to behave the same way! Worst: the performance of your product went down drastically!
Model Drift and Impact AI products: If you are working with OpenAI APIs you need to be aware of this!
The Problem: You invested time and money on an AI solution relying on OpenAI APIs, but now, your solution doesn’t work (as well) anymore!
Is there data supporting this? What can you do about it??

Disclamer: This blog post is based on
- This new research paper analyzing GPT-4 and GPT3.5 drift.
- 2023's overall evolution of LLMs (read my 2023 AI retrospective)
- My experience building AI products both and as Head of AI in a scale-up and, and as an AI/ML consultant (I was mentioned by Microsoft as one of the early AI solution designers on Azure OpenAI services).
Migrating from OpenAI’s 0314 models to the 0613 version proved to be challenging due to the very reasons we will discuss below!
What is Model Drift?
Definition: Model drift refers to the change in model behavior and performance over time.
This can occur due to a variety of reasons, including changes in the underlying data, updates to the model, or shifts in the problem space (e.g., if production data changed compared to what you used to deal with).
A recent study evaluating the behavior and performance of GPT-3.5 and GPT-4, two of OpenAI’s large language models (LLMs), found that their behavior can vary significantly over a relatively short period.
And this is a BIG problem!
Paper’s Key Findings
Before we explore at how we could explain & mitigate these issues, let’s look at the results reported in the research paper.

The study evaluated the models on four tasks: solving math problems, answering sensitive questions, code generation, and visual reasoning.
The findings were quite revealing:
- Solving Math Problems: GPT-4’s accuracy in solving math problems dropped from 97.6% in March to a mere 2.4% in June. Surprisingly, on the other hand, GPT-3.5’s accuracy improved from 7.4% to 86.8%. The response length of GPT-4 became much more compact, while GPT-3.5’s response length grew by about 40%. This is a tremendous improvement for GPT3.5, but we can’t ensure that it will be the case across all scenarios, including the one you are using in Production…
- Answering Sensitive Questions: GPT-4 answered fewer sensitive questions from March (21.0%) to June (5.0%) while GPT-3.5 answered more (from 2.0% to 8.0%). It was likely that a stronger safety layer was deployed in the June update for GPT-4, while GPT-3.5 became less conservative.
- Code Generation: A significant decrease in the expected quality … The code generation responses of the models became more verbose and less directly executable over time.
- Visual Reasoning: For both GPT-4 and GPT-3.5, there was a 2% improvement in the exact match rate from March to June. However, for more than 90% of visual puzzle queries, the March and June versions produced the exact same generation.
Personal Note: It’s not “all bad”. Yet, this is a narrow analysis of the models’ performance showing a significant change overtime! As a developer, I’m expecting the API to remain at least consistent, maintaining a similar performance overtime! Otherwise,
- at best, it will induce a significant testing and adaptation overhead (both in terms of time & cost) to maintain the product at a satisfactory performance level. As a matter of fact, some issues might be due to a different output formatting (easily fixable with prompt engineering).
- Worst case, the performance decreases so much making the API useless to tackle the tasks handled in production!! This is a real business issue!
How to explain this Drift in the case of OpenAI models ?
The behavior of these models can change over time due to various factors, which is sometimes referred to as ‘drift’.
Here are some of the potential reasons for this drift:
- Enhanced Safety and Alignment: To ensure the models align more closely with user needs and to mitigate misuse, OpenAI may add safety layers on top of the base model. These measures could potentially influence the model’s outputs.
- Model Size Reduction for Efficiency: To improve the speed and reduce the cost of running the models, OpenAI might employ techniques like quantization, which reduces the precision of the numbers used in the model. This could potentially lead to subtle changes in the model’s behavior. Quantization is known to be a very simple trick to make the model much smaller and faster. But it comes at a performance cost!
- Updated Training Datasets: The training dataset for the model is updated periodically to reflect user feedback. Small changes in the dataset can lead to improvements in some areas while potentially decreasing performance in others.
These changes are generally aimed at maintaining safety, reducing costs, and enhancing user experience. While these updates are designed to improve the model’s overall performance, they may also lead to observable drifts in the model’s responses over time…
The Implications? Trust Issues and Business Impact!
These findings highlight the need for continuous monitoring of LLM quality and the challenges of integrating LLMs into larger workflows. The study also questions whether LLM services like GPT-4 are consistently getting better over time.
For users or companies who rely on LLM services as a component in their ongoing workflow, it is recommended that they implement similar monitoring analysis for their applications. This will help in identifying and mitigating the effects of model drift, ensuring that their AI products continue to deliver optimal performance.
- Your business might be at risk: this is a big deal! Service performance decrease means that you could see your users churn without actually knowing why! And a lost client is very hard to get back!
- AI Ops is becoming a real thing: building an actual workflow to continuously test and validate AI models/APIs in production will (re)emerge as an essential skill. You can’t trust the AI model API provider ; you’ll have to validate it through out the whole experimentation and production cycle.
A New Argument to Move to Open Source Models?
The only way to ensure that the intrinsic performance of your models does not change is to own your model!
New open source models are emerging such as MPT-30B or LLaMA v2 announced by Meta released this week. This model is fully open source and available for commercial use.
Fine-tuning the model and hosting it will come at (a significant) cost compared to relying on a plug and play solution such as OpenAI. Yet it does come with a full control over the evolution of such as model and might avoid some bad surprise.
Plug and play AI platforms enabling you to fully own, fine-tune and leverage your models, will emerge as a very strong alternative.
Conclusion
Model drift, as observed in OpenAI’s GPT-4 and GPT3.5 models, is a significant concern for businesses. The drift or change in model performance over time can lead to reduced product efficacy and customer satisfaction.
This raises an imperative need for businesses to adopt AI Ops, continuously testing and validating their AI models in production to catch and mitigate the effects of model drift.
Additionally, the emergence of open-source models like MPT-30B and LLaMA v2 offers a potential viable alternative. Businesses could now exert full control over their models, potentially mitigating the risks associated with proprietary models’ drift and ensuring more consistent performance.
This comes of course at an increased R&D cost… you’ll have to find the trade-off.
