Meta’s Self-Rewarding Models, the Key to SuperHuman LLMs?

Meta, the company behind Facebook, WhatsApp, and the Ray-Ban Meta smart glasses, has announced a highly promising AI breakthrough: Self-Rewarding Language Models.

Their results have allowed a fine-tuned Llama 2 70B model to surpass models like Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard, despite presumably being much smaller (the sizes of those proprietary models are not public).

However, that is not the true breakthrough. These new models also look like a reasonable path to creating the first superhuman LLMs, even if that means taking one step closer to losing control over our best AI models.

But what does that mean? And is that a good thing?

Let’s find out.

The Rise of a New Alignment Method

To this day, humans play a crucial role in the creation of every frontier model, such as ChatGPT or Claude.

Alignment, the secret sauce

As explained in my newsletter from two weeks ago, the later stages of the training process of our best language models include human preference training.

In a nutshell, we make our models achieve higher utility and reduce the risk of harmful responses by teaching them to respond in the way a human expert would.

The gist is that we have to build a very expensive human preferences dataset: a large collection of prompts, each paired with two candidate responses, where a human expert has decided which response is better.
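To make the shape of such a dataset concrete, here is a minimal sketch in Python. The `PreferencePair` class, the `make_pair` helper, and the example prompts are all hypothetical names invented for illustration; real datasets usually carry extra metadata such as annotator IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One row of a preference dataset (hypothetical layout)."""
    prompt: str
    chosen: str    # the response the human expert preferred
    rejected: str  # the response the human expert judged worse


def make_pair(prompt: str, response_a: str, response_b: str,
              a_is_better: bool) -> PreferencePair:
    """Store the annotator's verdict so the winner always sits in `chosen`."""
    if a_is_better:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


# Two illustrative rows; every verdict here would come from a human expert.
dataset = [
    make_pair(
        "Explain what an API is.",
        "An API is a contract that lets two programs talk to each other.",
        "It's a computer thing.",
        a_is_better=True,
    ),
    make_pair(
        "How do I pick a strong password?",
        "Just use your birthday.",
        "Use a long, unique passphrase and store it in a password manager.",
        a_is_better=False,
    ),
]
```

Every row costs an expert judgment, which is why building this dataset at scale is so expensive.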

This requires extensive (expert) human labor. Once you have this dataset, you can go in one of two directions:

Source: DPO research paper (Rafailov et al)

Tangible Rewarding through Reinforcement Learning from Human Feedback (RLHF): you build a reinforcement learning pipeline around a separate reward model (typically of about the same size and quality as the model being trained) and run a policy optimization process in which the trained model learns to maximize the reward. In other words, you score your model’s responses with the reward model and train the model to achieve higher scores.
Intrinsic Rewarding through Direct Preference Optimization (DPO): the model is optimized directly on the preference pairs instead of materializing the reward. The reward is implicit in the policy itself, so finding the optimal policy maximizes the reward without ever computing it explicitly. In simpler terms, you apply an algebra trick to the RLHF objective that removes the need for a separate reward model, dramatically decreasing the complexity and cost of alignment.
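The "algebra trick" can be sketched for a single preference pair. This is a simplified, scalar version of the DPO loss from Rafailov et al., not a training-ready implementation: the function name `dpo_loss` and its argument names are my own, and the inputs are assumed to be summed log-probabilities of each full response under the model being trained (the policy) and under a frozen reference copy.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    The implicit reward of a response is beta * (log-prob under the policy
    minus log-prob under the reference). No reward model is ever called.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet.
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # log(2), roughly 0.693

# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
```

Minimizing this loss pushes the policy to rank the chosen response above the rejected one relative to the reference, which is the same goal RLHF pursues through an explicit reward model.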

Although very new, DPO is already being cited as a major breakthrough, as it has shown results equal to or even better than RLHF while being dramatically cheaper and easier to build.

But Meta has taken DPO and gone a step further with the question… do we actually need humans?

The SuperAlignment Problem

Despite their undeniable credentials, both RLHF and DPO are still bottlenecked by us humans.

The reason is that they require the human preferences dataset, which means that these models can only aspire to be as good as the humans building the dataset.

In other words, they are constrained by the limitations of our species.

So how can we align the superhuman models of the future if alignment itself depends on us?

Recently, OpenAI’s work on weak-to-strong generalization suggested that humans may have some ability to align superhuman models: a weaker supervisor can still teach a stronger model how to behave without dragging it all the way down to the supervisor’s own limitations.

However, researchers concluded that this was definitely not enough, and that we need “something else” to guarantee our superhuman models of the future don’t spiral out of control.

And that thing could be Self-Rewarding Models, Meta’s way of saying “Humans, get out of the way”.

The Self-Improving Paradigm

In their paper, Meta’s researchers propose a new method in which the model is trained with the DPO method we just talked about while generating its own, non-human rewards.

If proven at scale, it would be the best of both worlds and a complete revolution.

Let’s take a look.

Circling back to the two previously covered methods, RLHF is very expensive and requires a reward model.

DPO does not, but it still requires humans to define the boundaries and how the model should behave.

Also, in both cases, the reward signal doesn’t get better over time: it is fixed, capped by the quality of the reward model (RLHF) or of the human preference data (DPO).

Instead, Meta’s new iterative framework defines a training pipeline where the current model (Mt) first generates a set of candidate responses and scores them itself, acting as its own judge. In other words, it autonomously builds the preference dataset. That dataset is then used with the DPO method (meaning there’s no need for a reward model) to obtain the new, aligned model, Mt+1.
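Here is a minimal sketch of how self-assigned scores could be turned into a preference pair. The `build_pair` helper is a hypothetical name, and the rule it applies (take the highest-scored and lowest-scored candidates, and skip the prompt when the scores tie) is my reading of the paper, so treat it as an assumption rather than the authors' exact code.

```python
from typing import Dict, List, Optional, Tuple


def build_pair(prompt: str,
               scored: List[Tuple[str, float]]) -> Optional[Dict[str, str]]:
    """Turn self-scored candidates into one preference pair.

    `scored` holds (response, score) tuples, where each score was assigned
    by the model itself rather than by a human annotator.
    """
    best = max(scored, key=lambda item: item[1])
    worst = min(scored, key=lambda item: item[1])
    if best[1] == worst[1]:
        return None  # a tie carries no preference signal, so skip this prompt
    return {"prompt": prompt, "chosen": best[0], "rejected": worst[0]}


# Hypothetical scores the model gave to its own three candidate answers.
candidates = [
    ("A vague one-line answer.", 2.0),
    ("A detailed, correct explanation.", 5.0),
    ("An off-topic reply.", 1.0),
]
pair = build_pair("What is DPO?", candidates)
```

The resulting pairs have the same shape as a human-built preference dataset, which is why they can be fed to DPO unchanged.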

Then, they take Mt+1 and repeat the process to get a new model, Mt+2, and so on.
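The outer loop can be sketched in a few lines. Everything here is a stand-in: `iterate_self_rewarding`, `toy_build`, and `toy_train` are hypothetical names, and the toy functions only track a version counter so the example runs without any real model or training step.

```python
from typing import Any, Callable, List


def iterate_self_rewarding(model: Any,
                           prompts: List[str],
                           build_preference_data: Callable[[Any, List[str]], list],
                           dpo_train: Callable[[Any, list], Any],
                           iterations: int = 3) -> List[Any]:
    """Return [M0, M1, ..., Mn], each trained on data built by its predecessor."""
    models = [model]
    for _ in range(iterations):
        current = models[-1]
        pairs = build_preference_data(current, prompts)  # Mt generates and judges
        models.append(dpo_train(current, pairs))         # DPO step yields Mt+1
    return models


# Toy stand-ins: a "model" is just a (name, version) tuple.
def toy_build(model, prompts):
    return [(p, f"strong answer from {model}", f"weak answer from {model}")
            for p in prompts]


def toy_train(model, pairs):
    name, version = model
    return (name, version + 1)


models = iterate_self_rewarding(("M", 0), ["What is DPO?"], toy_build, toy_train)
```

The key design point is that `build_preference_data` receives the latest model on every pass, so the judge improves along with the policy instead of staying fixed.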

In other words, the model is not only becoming better aligned with each iteration (the objective all along) but it’s also learning to get better at scoring responses, which in turn explains why the model in the newer iteration is better than the previous one.

In layman’s terms, while in traditional alignment methods humans are required to build the preference dataset, here the model plays both the role of aligner and aligned.

And the results?

In just three iterations, the fine-tuned Llama 2 70B model was already on par with the best of the bunch, and the self-improving mechanism showed no signs of saturation over those iterations.

It’s, quite literally, the first time we have seen a self-improving LLM that doesn’t require humans to dictate what’s “good”.

This is huge, as we have already seen what self-improving methods achieve, with examples like the superhuman AlphaGo completely obliterating everyone in the game of Go by playing against itself.

If we now have the power to train LLMs in self-improving pipelines, we could build superhuman language models, whatever those turn out to be.

But as with everything, there’s a trade-off.

Alienating Humans

It’s no secret that humans are the bottleneck to building superhuman models.

However, eliminating humans from the training process and relinquishing any ‘say’ on the matter is something that needs to be carefully evaluated.

We are already extremely bad at explaining how LLMs ‘think’, but at least we have control over them.

Now, while the former issue is far from solved, we are proposing to let them decide how to optimize, possibly losing control over them.

This lack of control is simply frightening, as the possibility of these models going rogue is far from zero.

On the flip side, this could open up a new world of possibilities and breakthroughs that our limited minds cannot comprehend and thus cannot achieve without these superhuman models.

So the question is… where do we draw the line?