Ignacio de Gregorio

LoRA Land: An Army of ChatGPT Killers Is Here

The AI Version of David vs. Goliath Has Been Set in Motion

Is this the moment open-source was waiting for? Corporations need to pay attention to this.

And you should too.

Predibase, an LLM platform, has announced a set of models that might change the course of history for Generative AI at the enterprise level.

Named LoRA Land, this suite of very small models outperforms ChatGPT (the GPT-4 version) on many downstream tasks, while all the models can be deployed on a single GPU and cost an average of just 8 dollars each to fine-tune, which sounds absolutely outrageous.

A new chapter opens in the Generative AI war between private and open models, a chapter you can try for yourself today.

This insight, among others, was previously shared in my weekly newsletter, TheTechOasis.

If you want to be up-to-date with the frenetic world of AI while also feeling inspired to take action or, at the very least, to be well-prepared for the future ahead of us, this is for you.

🏝Subscribe below🏝

The Hardest Decision

The emergence of foundation models, pre-trained models that can be used for a plethora of different tasks, officially ignited the flame of Enterprise AI back in 2022.

A Painful Journey

Before that, deploying AI was a challenge, and a very risky one.

You had to train a specific model for every single task, and according to Gartner, the chances of failure were almost 90%.

But now AI comes with foundation models that are already very good, and that we simply need to ground in the task at hand.

This is usually done with prompt engineering, optimizing the way we interact with the model to maximize performance, because fine-tuning is simply too expensive. That is precisely the paradigm that is about to change.

However, at the end of the day, it all boils down to choosing the right LLM for the right task, which is unequivocally one of the hardest decisions in Generative AI today.

But at first, if you’re just starting, the decision might seem quite straightforward.

Proprietary vs open-source

A quick look at the most popular benchmarks will show that ChatGPT, Gemini, and Claude are the best models.

Also, the companies behind them are heavily capitalized, enabling them to offer dirt-cheap prices. However, using these proprietary models comes at a cost, just not an economic one:

You have absolutely no control over the model.

This could mean that, for instance, the model could be updated unexpectedly and force you into reengineering your deployments.

Adding insult to injury, you also need to trust these companies not to use the confidential data you send to their models.

And, by the way, these companies have every incentive in the world to use your data to further refine their models, especially considering that data is becoming a licensed ($$) asset.

On the other hand, open-source models like LLaMA or Mixtral 8×7B, while offering generally lower quality, give you absolute control over the model.

Also, you have guarantees that your confidential data, the most important asset for companies today, is never compromised.

Luckily, the quality gap with frontier private models like ChatGPT or Gemini can be closed using fine-tuning (further training a model on a particular task).

Data comes first, size second

Even though open-source models do lag behind today, with enough fine-tuning on a particular task you can push their performance beyond what ChatGPT offers.

There are plenty of examples of models ten or even a hundred times smaller than GPT-4 beating it after enough fine-tuning.

However, the issues with this are two-fold when approaching this task conventionally:

1. The fine-tuning trade-off. Fine-tuning to a particular downstream task sacrifices the model's generality due to catastrophic forgetting.

For most enterprise use cases, this is usually not a problem. I mean, to create a customer service assistant you don’t need your chatbot to be capable of rapping about Norwegian salmon.

2. The business case changes. Instead of paying a price per token as OpenAI does, LLM providers will now charge you for dedicated instances, as your fine-tuned model has to run on an independent, dedicated GPU instance or cluster, which is far more expensive.

For example, doing this yourself on AWS with on-demand pricing for a g5.2xlarge instance to serve a custom Llama-2-7B model will cost you $1.21 per hour, or about $900 per month to serve 24x7, for every instance of the model, and that of course multiplies as you scale.
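If you want to sanity-check those numbers, here's a quick back-of-the-envelope sketch; the $1.21 hourly rate is the one quoted above, while the instance counts are hypothetical examples:

```python
# Back-of-the-envelope cost of serving dedicated fine-tuned models 24x7.
# $1.21/hour is the on-demand g5.2xlarge rate quoted above; the model
# counts below are hypothetical examples.
HOURLY_RATE_USD = 1.21
HOURS_PER_MONTH = 24 * 30  # ~720 hours

cost_per_model = HOURLY_RATE_USD * HOURS_PER_MONTH
print(f"1 dedicated model:  ${cost_per_model:,.0f}/month")  # ~$871

# Every additional fine-tuned task needs its own instance, so the bill
# scales linearly with the number of models:
for n_models in (5, 25):
    print(f"{n_models} dedicated models: ${n_models * cost_per_model:,.0f}/month")
```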

Therefore, even though fine-tuning is the ideal case scenario for most enterprise use cases that generally require top performance in one task, it’s prohibitively expensive.

But with Predibase’s announcement, things have now changed.

Before you continue reading, please be aware that I am neither sponsored by nor affiliated with Predibase. I'm simply using their disruptive new offering as a proxy to illustrate how, in my humble opinion, open-source will eventually become an economically viable option, which was not the case before, regardless of the platform you choose.

Choosing between one LLM platform and another (other examples include Databricks, together.ai, and Groq) requires time and careful consideration, and you should know that you don't actually need a platform if you are tech-savvy enough.

Thus, any choice should come with the required due diligence, an effort this article by no means replaces.

An Army of Experts

There’s no way around it. Predibase achievement simply doesn’t make sense.

In fact, it seems they have achieved the impossible: offering models superior to the best models out there in a cost-efficient way.

Specifically, they have released 25 fine-tuned versions of Mistral-7B, each superior to ChatGPT's most advanced version at a particular task, despite being quite literally hundreds of times smaller.

They have built this success along two dimensions:

  • QLoRA fine-tuning
  • LoRAX deployment framework

QLoRA, quantized low-rank fine-tuning

QLoRA is without a doubt the hottest way of fine-tuning models today. In it, two aspects come into play:

1. Quantization, where we reduce the precision of the parameters stored in memory. Think of it as, instead of storing the model's parameters in full precision, like 0.288392384923, you round them and store them as 0.3, incurring a rounding error in exchange for reduced memory requirements.

I am not covering quantization at length in this article, but the key intuition is that quantization aims to find the 'sweet spot' between making the model less memory-hungry and degrading performance as little as possible (see the toy sketch after this list).

2. LoRA fine-tuning, where we optimize a network by adding an adapter that allows us to fine-tune only a small portion of the weights.
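To make the first idea concrete, here is a toy sketch of uniform rounding quantization; note this is a simplification for intuition, not the 4-bit NormalFloat scheme QLoRA actually uses:

```python
import numpy as np

def quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Round each weight to one of 2**n_bits evenly spaced levels
    spanning the tensor's own value range."""
    levels = 2 ** n_bits - 1
    scale = (w.max() - w.min()) / levels
    return np.round((w - w.min()) / scale) * scale + w.min()

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1000).astype(np.float32)

for bits in (8, 4, 2):
    error = np.abs(weights - quantize(weights, bits)).max()
    print(f"{bits}-bit grid -> max rounding error {error:.5f}")
# Fewer bits mean less memory per weight but a larger rounding error;
# quantization looks for the sweet spot between the two.
```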

About the latter, the idea is simple: The conventional approach to fine-tuning is updating all the parameters in a network to train a model on the new set of data, which is very expensive considering we are talking about billions of parameters updated multiple times.

However, in LoRA’s case, you only update a very small portion by benefiting from the fact that most downstream tasks an LLM performs are intrinsically low-rank.

But what does that mean and why does it matter so much?

The rank of a matrix is the maximum number of rows or columns that are linearly independent, meaning that they can't be reconstructed by combining the other rows or columns.

In simple terms, it measures how much non-redundant information a matrix holds: rows or columns that are linearly dependent add no additional information.

In other words, to get optimal results we simply need to optimize the small meaningful portion of the weights that do matter.

So what do we do?

Well, we take the matrix of weights of the model and decompose it into two matrices that represent its low-rank equivalents.

Please be aware that this is an oversimplification. In reality, there is no such thing as ‘one weight matrix’. This is done on a layer-by-layer basis, but the principle we are discussing still applies.

As you can see below, we can decompose a 5-by-5 matrix into a 5-by-2 and a 2-by-5 matrix (with 2 being the rank of this particular matrix).

Consequently, we do not update the full-sized matrix, but the low-rank ones, reducing the number of parameters to update from 25 to 20 (in this particular case). At realistic layer sizes, where ranks like 8 or 16 stand against dimensions in the thousands, the savings become enormous.
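Here's a minimal NumPy sketch of that arithmetic, extended with the LoRA-style additive update; the numbers match the 5-by-5 example above, and the variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

# The 5x5 example above: a rank-2 update expressed as two small factors.
B = rng.normal(size=(5, 2))  # 10 trainable parameters
A = rng.normal(size=(2, 5))  # 10 trainable parameters
delta_W = B @ A              # a full 5x5 matrix, but only rank 2

print(np.linalg.matrix_rank(delta_W))                   # -> 2
print("entries in the full matrix:", delta_W.size)      # 25
print("parameters actually trained:", B.size + A.size)  # 20

# LoRA in a nutshell: the base weights W stay frozen, and only the
# low-rank factors A and B are updated during fine-tuning.
W = rng.normal(size=(5, 5))  # frozen base weights
W_effective = W + delta_W    # what the adapted layer computes with
```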

This works because LLMs usually have far more parameters (weights) than any given task needs, meaning we don't have to update the whole matrix to obtain the same results, just the portion of weights that matter.

And this benefit compounds incredibly well, as you can see in the image below: even at a rank of 512 (compared to the 2 we just saw), you are still only updating 1.22% of the model's total weights while still getting top improvements.

Adding to this, as the original base model has been quantized, the memory requirements for hosting the base model, in this case Mistral-7B, drop even further, compounding the overall savings.

However, the biggest problem still stands: every fine-tuned model requires its own dedicated GPU, right?

As long as that stands true, it defeats the purpose of fine-tuned models by making them economically unfeasible.

But this is where LoRAX comes in.

One GPU, 100 LLMs

The idea behind LoRAX is that, since each adapter adds its own small set of weights on top of the original, unchanged model, the base weights stay the same regardless of which fine-tuned version is used, as long as all the fine-tuned models come from the same base model (in this case Mistral-7B, although the framework is LLM-agnostic, as you may imagine).

Consequently, LoRAX can efficiently manage a set of fine-tuned weights (called adapters) that are loaded and evicted dynamically from one single GPU depending on the types of requests users send:

Source: Predibase

In concise terms, depending on the different requests users send, the system automatically detects which adapter each request requires and loads its weights onto the base model.

Source: Predibase
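As a conceptual sketch of that request flow, consider the toy server below; the class and function names are hypothetical illustrations, not Predibase's actual API:

```python
from collections import OrderedDict

# Conceptual sketch of LoRAX-style multi-adapter serving. All names
# here (AdapterServer, fetch_adapter_weights, ...) are hypothetical.

def fetch_adapter_weights(adapter_id: str) -> dict:
    # Stand-in for loading an adapter's small A/B matrices from storage.
    return {"id": adapter_id, "A": "low-rank factor", "B": "low-rank factor"}

class AdapterServer:
    def __init__(self, max_loaded: int = 2):
        self.cache = OrderedDict()    # adapter_id -> adapter weights (LRU order)
        self.max_loaded = max_loaded  # how many adapters fit in GPU memory

    def get_adapter(self, adapter_id: str) -> dict:
        """Load the requested adapter, evicting the least recently
        used one if the memory budget is exceeded."""
        if adapter_id not in self.cache:
            if len(self.cache) >= self.max_loaded:
                evicted, _ = self.cache.popitem(last=False)  # evict LRU
                print(f"evicted adapter '{evicted}'")
            self.cache[adapter_id] = fetch_adapter_weights(adapter_id)
        self.cache.move_to_end(adapter_id)  # mark as most recently used
        return self.cache[adapter_id]

# Requests for different tasks hit the same frozen base model; only the
# tiny adapter is swapped in and out of GPU memory.
server = AdapterServer(max_loaded=2)
for task in ["sentiment", "summarize", "sentiment", "code-gen"]:
    adapter = server.get_adapter(task)
    print(f"serving '{task}' with adapter {adapter['id']}")
```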

And what does all this sum up to cost-wise?

Well, for an approximate total price of 200 dollars (8 dollars per fine-tuning, times 25), you get a set of 25 models that individually outcompete GPT-4 at their given tasks, all running on one single A100 GPU, with all the added benefits of having total control over your models.

For a more detailed explanation of LoRAX, read here.

Open-source’s ‘it moment’?

All in all, this is, quite literally, a dream come true for customers and enterprises alike: the best of both worlds, economically viable LLMs that offer top performance at particular tasks.

In fact, Predibase might just have given us a glimpse of the future of Enterprise GenAI. Given the quality/price ratio open-source models are starting to offer, it's becoming harder and harder to look away from them as the primary solution for companies willing to embrace the GenAI revolution, be that with Predibase or any of the other LLM platforms available to you.

On a final note, if you have enjoyed this article, I share similar thoughts in a more comprehensive and simplified manner for free on my LinkedIn.

Looking forward to connecting with you.

Try LoRA Land models for free here.
