|PERSPECTIVES| AI| LARGE LANGUAGE MODELS|
A Requiem for the Transformer?
Will the transformer be the model that leads us to artificial general intelligence? Or will it be replaced?
The transformer has dominated the world of artificial intelligence for six years, achieving state-of-the-art results in nearly every subdomain of artificial intelligence. From natural language processing (NLP) to computer vision to sound and graphs, there are dedicated transformers with excellent performance.
- How much longer will this dominion last?
- Is the transformer really the best architecture out there?
- Will it be replaced in the near future?
- What are the threats to its dominance?
This article attempts to answer these questions. Starting with why the transformer has been so successful and what elements have allowed it to establish itself in so many different domains, we will analyze whether it still has unchallenged dominance, what elements threaten its supremacy, and whether there are potential competitors.
A brief history of an empire
“All empires become arrogant. It is their nature.” ― Edward Rutherfurd
Empires inevitably fall, and when they do, history judges them for the legacies they leave behind. — Noah Feldman
“Attention Is All You Need” is the basis of artificial intelligence as we know it today. The roots of generative AI and its success are in a single seed: the transformer.
The transformer was initially designed to solve the lack of parallelization of RNNs and to be able to model long-distance relationships among words in a sequence. The idea was to provide the model with a system to discriminate which parts of the sequence were most important (where to pay attention). This was all designed to improve machine translation.
These elements, though, allowed the transformer to understand the text better. In addition, parallelization allowed the model to scale both in size and to larger datasets. The rise of GPUs further showed the benefits of a parallelizable architecture like the Transformer.
The Transformer thus emerged as the new king of AI. An empire grew in a very short time. In fact, today, all the popular models are built on the Transformer: ChatGPT, Bard, GitHub Copilot, Mistral, LLaMA, Bing Chat, Stable Diffusion, DALL-E, Midjourney, and so on.

This is because the Transformer was quickly adapted to many tasks beyond language.
Even the vastest empires fall at some point; what is happening to the Transformer dominion?
The giant with feet of clay

When the transformer was introduced, its performance shocked the world and gave rise to a parameter race. For a time, models grew so quickly that the trend was called the “new Moore’s law of AI.” The growth continued until Megatron (530 B) and Google PaLM (540 B) were released in 2022. And yet we still haven’t seen a trillion-parameter model.
When deep convolutional networks showed their efficiency (VGG models), ConvNets went from 16 layers of VGG16 to 201 layers of DenseNet201 in a short time. Results and performance aside, it is a testament to the interest of the community. This pattern of horizontal and vertical growth (and incremental changes to the base model) stopped in 2021 when the community became convinced that Vision Transformers (ViTs) were superior to ConvNets.

Why did the growth of transformers stop? Were they replaced as well?
No, but some of the premises that led to transformer growth have disappeared.
This growth was motivated by the so-called power law: performance improves predictably as the number of parameters increases. On top of that, according to OpenAI, some properties emerge abruptly with scale, so a larger model develops abilities that would not be observable below a certain size. Too bad that, according to Stanford researchers, these emergent properties may be a mirage produced by the way performance is measured.
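The mirage argument can be illustrated with a toy calculation. The sketch below assumes a made-up, smooth curve for per-token accuracy as a function of model size (it is not measured data) and then scores the same hypothetical model with an all-or-nothing metric that requires every token of a 10-token answer to be correct. The names `per_token_accuracy` and `exact_match` are invented for this example.

```python
import math

def per_token_accuracy(n_params: float) -> float:
    """Hypothetical smooth curve: accuracy rises gradually with log10(parameters)."""
    return 1.0 / (1.0 + math.exp(-(math.log10(n_params) - 9.0)))

def exact_match(n_params: float, answer_len: int = 10) -> float:
    """All-or-nothing metric: every one of answer_len tokens must be correct."""
    return per_token_accuracy(n_params) ** answer_len

# Same underlying model quality, two different ways of measuring it.
for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e}  per-token={per_token_accuracy(n):.3f}  "
          f"exact-match={exact_match(n):.3f}")
```

Under these assumptions the per-token score climbs steadily, while the exact-match score stays near zero for small models and then rises sharply. The apparent jump comes from the metric, not from a sudden change in the model.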
Scaling up a model means spending much more. More parameters, more computation, more infrastructure, more electrical consumption (and more carbon emissions). Is it worth it?
Actually, DeepMind showed with Chinchilla that performance depends not only on the number of parameters but also on the amount of training data. So if you want a model with billions of parameters, you need enough tokens to train it. Too bad that we humans don’t produce enough text to train a trillion-parameter model.
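To get a feel for the numbers, here is a back-of-the-envelope sketch. It relies on two commonly quoted approximations rather than exact results: roughly 20 training tokens per parameter for a compute-optimal model, and roughly 6 floating-point operations per parameter per token for training. Both constants and the function names are assumptions made for illustration.

```python
TOKENS_PER_PARAM = 20      # rough rule of thumb, an approximation
FLOPS_PER_PARAM_TOKEN = 6  # rough training-cost approximation

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate number of training tokens a model of this size would want."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in floating-point operations."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

for n_params in (70e9, 540e9, 1e12):
    n_tokens = compute_optimal_tokens(n_params)
    flops = training_flops(n_params, n_tokens)
    print(f"{n_params:.2e} params -> {n_tokens:.2e} tokens, {flops:.2e} FLOPs")
```

With this rule of thumb, a trillion-parameter model would ask for on the order of 20 trillion tokens, which is where the data bottleneck comes from.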

In addition, it is not just the quantity of text that affects the performance of a model; it is also the quality of the text. This is a sore point too, because collecting huge amounts of text without filtering (that is, downloading indiscriminately from the Internet) is not a good idea.
Generating the training text with artificial intelligence is not a good idea either. In theory, one could use an LLM and ask it to produce text indefinitely. The problem is that a model trained on this text can only mimic the other LLM, and certainly not outperform it.
Overall, our key takeaway is that model imitation is not a free lunch: there exists a capabilities gap between today’s open-source LMs and their closed-source counterparts that cannot be closed by cheaply fine-tuning on imitation data. (source)
An additional point is that these huge models are also problematic for deployment. Smaller models have good performance, especially for some tasks. One can distill and get much smaller models specialized for a specific task.
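As a rough idea of how distillation works, the sketch below computes a classic soft-target distillation loss for a single example: the student is pushed to match the teacher's temperature-softened output distribution. This is a minimal illustration in plain Python, not the recipe of any particular paper or library; the logits are made-up numbers and the function names are hypothetical.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]        # made-up teacher logits
good_student = [3.5, 1.2, 0.4]   # close to the teacher
bad_student = [0.5, 1.0, 4.0]    # prefers the wrong class
print(distillation_loss(good_student, teacher))
print(distillation_loss(bad_student, teacher))
```

In practice this term is usually combined with the ordinary loss on the true labels, and the student is a much smaller network than the teacher.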
Take-home message: the huge transformer paradigm is in crisis. The idea that every year we will see a bigger and bigger model is over.
After all, the issue is practicality (and cost). AI can cost a lot of money once it goes into production. For example, Microsoft is reportedly losing huge amounts of money on GitHub Copilot (about $20 per user per month). According to one report, ChatGPT costs about $700,000 per day to run, and investors may no longer cover the cost if ChatGPT does not become profitable.
Therefore, we can expect companies to be more interested in developing smaller models with a specific task and business case in mind.
Okay, the transformer no longer grows, but is it still the best architecture in the game?
Well, let’s talk about that in the next section…
Convolution is still on fire
First, why was the transformer successful everywhere?
In its initial description, the transformer brought together three basic concepts: a position-aware representation of the sequence (embedding + positional encoding), a mechanism for relating the elements of the sequence to each other (self-attention), and a hierarchical representation (layer stacking).
When the article Attention Is All You Need was published, it built on a decade of research in NLP and put together the best of what had been published previously:
- Word embeddings were revolutionary in 2013 because they turned words into vector representations. In addition, arithmetic on these vectors carried semantic and grammatical meaning (the classic example being king - man + woman ≈ queen).
- Self-attention was an improvement on the revolutionary idea that not all elements of the sequence are equally important. It also sidestepped the long-standing vanishing gradient problem of recurrent neural networks.
- The hierarchical representation, on the other hand, came from twenty years of convolutional neural networks where we realized that by stacking layers the model learns an increasingly complex representation of the data.
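The three ingredients can be sketched in a few lines of NumPy. This is a deliberately stripped-down illustration, not the full architecture: it uses a single attention head, random weights, and random vectors standing in for learned token embeddings, and it omits layer normalization and the feed-forward sublayers. The sinusoidal positional encoding follows the formula of the original paper; the sizes are arbitrary.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position signal (d_model is assumed to be even)."""
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # every token scores every token
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, n_layers = 5, 8, 3

# 1) position-aware representation (random vectors stand in for embeddings)
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)

# 2) + 3) self-attention, stacked layer upon layer with residual connections
for _ in range(n_layers):
    w_q, w_k, w_v = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))
    x = x + self_attention(x, w_q, w_k, w_v)

print(x.shape)  # (5, 8)
```

Nothing in this code assumes the input is text, which hints at why the same recipe transfers to other kinds of sequences.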

These elements made it successful in the NLP field, but at the same time, they were the key to winning in other fields as well. First, its very weak inductive bias made it adaptable to almost any type of data. Second, hierarchical representation and the ability to connect elements of a sequence have applications far beyond NLP.
A story of success, except that the transformer has remained the same as it was in 2017 and is beginning to age badly.
The beating heart of transformers is ultimately self-attention. But it is a heart that pumps too much blood: its computational cost grows quadratically with the length of the sequence.
Therefore, several groups have tried to find a linear-cost substitute for attention. So far, however, these variants have generally shown inferior performance.
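To make the quadratic versus linear distinction concrete, here is a sketch of one family of such substitutes, kernel-based linear attention. It assumes the softmax is replaced by a positive feature map (here elu(x) + 1, one choice that appears in the literature); this changes what the layer computes, which is one plausible reason such variants lose quality. The example is non-causal, uses random inputs, and the function names are invented for illustration.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: strictly positive, so the normalizer below is never zero."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention_quadratic(q, k, v):
    """Builds the full n x n weight matrix: cost grows as n^2."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    weights = phi_q @ phi_k.T                       # (n, n)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def kernel_attention_linear(q, k, v):
    """Same result, reassociated so no n x n matrix is formed: cost grows as n."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                                # (d, d_v) summary of keys/values
    normalizer = phi_q @ phi_k.sum(axis=0)          # (n,)
    return (phi_q @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out_quadratic = kernel_attention_quadratic(q, k, v)
out_linear = kernel_attention_linear(q, k, v)
print(np.allclose(out_quadratic, out_linear))
```

The trick is only matrix associativity: once the softmax is gone, the keys and values can be summarized first, and the sequence length n never appears squared.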
And what looks like a good substitute? Nothing less than the old convolution. As the authors of Hyena showed, by adapting convolution a little bit, you get a good model with transformer-like performance.
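The reason convolution is attractive for long sequences is that a kernel as long as the sequence itself can be applied in O(n log n) time with the FFT instead of O(n²). The sketch below shows only that building block and checks it against a direct convolution. It is not the Hyena operator: the kernel here is random, whereas Hyena parametrizes its filters implicitly and combines them with gating.

```python
import numpy as np

def causal_conv_direct(signal, kernel):
    """Reference implementation: direct causal convolution, O(n^2) for long kernels."""
    n = len(signal)
    return np.convolve(signal, kernel)[:n]

def causal_conv_fft(signal, kernel):
    """Same causal convolution computed in the frequency domain, O(n log n)."""
    n = len(signal)
    size = 2 * n  # zero-padding prevents the circular convolution from wrapping around
    spectrum = np.fft.rfft(signal, n=size) * np.fft.rfft(kernel, n=size)
    return np.fft.irfft(spectrum, n=size)[:n]

rng = np.random.default_rng(0)
n = 256
signal = rng.normal(size=n)   # stand-in for one channel of a token sequence
kernel = rng.normal(size=n)   # a filter as long as the sequence itself
print(np.allclose(causal_conv_fft(signal, kernel),
                  causal_conv_direct(signal, kernel)))
```

Each output position still depends on every earlier position, as in causal attention, but the cost no longer scales with the square of the sequence length.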
This is ironic because since 2021 Vision Transformers (ViTs) have been believed to be superior to ConvNets in computer vision. This seemed to be the end of the unchallenged dominance of convolutional networks (ConvNets) in what until recently was their realm. But instead?
It seems that the ConvNets have had their revenge. Astonishing, like imagining that dinosaurs could return to dominate the continents by driving out the mammals. In truth, a recent article published by DeepMind basically states that the comparison between ViTs and ConvNets was not fair. Given the same compute budget, ConvNets perform on par with ViTs on ImageNet.
Another article points in the same direction: convolutional networks seem to be competitive with transformers:
The same winners also win at smaller scales. Among smaller backbones, ConvNeXt-Tiny and SwinV2-Tiny emerge victorious, followed by DINO ViT-Small. (source)

There are three primary factors that influence the performance of such a model: its architecture, the pretraining algorithm, and the pretraining dataset. (source)
Now, if the pretraining algorithm and the pretraining dataset are the same, only the model architecture remains. However, all things being equal, the alleged superiority of ViTs does not seem to emerge. So much so that the conclusion of the DeepMind authors reads like an admission of defeat for ViTs:
Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly. (source)
Ouch. So can we say that ViTs are not superior to convolutional networks, at least in computer vision? Not quite:
We note however that ViTs may have practical advantages in specific contexts, such as the ability to use similar model components across multiple modalities. (source)
The authors point out that ViTs may still be preferable when we are interested in multimodal models. Although features can also be extracted from a convolutional network, it is certainly more convenient to use the same architecture across multiple modalities.
This is a very important point, however: empirical data show that, at least in computer vision, the transformer is not superior to other architectures. This leads us to wonder whether its dominance will soon be questioned in other fields of artificial intelligence as well. For example, what is happening in the transformer’s home field? Is it still the best model in natural language processing?
The text dominion has a fragile basis
Short answer: yes, but its supremacy could end. Let’s start with why it has been so successful in NLP.
The initial advantage of the transformer over RNNs was that it is easily parallelized. This led to the initial euphoria and the parameter race. In the process, we realized what allowed the Transformer to win in NLP: in-context learning.
In-context learning is a very powerful concept: all it takes is a few examples and the model is capable of mapping a relationship between input and output. All this without even updating a single parameter.
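In practice, "a few examples" simply means a few input/output pairs written into the prompt. The helper below is a minimal sketch of how such a few-shot prompt could be assembled; the function name, the labels, and the translation pairs are made up for illustration, and no model is called.

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Format (input, output) pairs followed by an unanswered query."""
    lines = []
    for source, target in examples:
        lines.append(f"{input_label}: {source}")
        lines.append(f"{output_label}: {target}")
        lines.append("")  # blank line between demonstrations
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # left open for the model to complete
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("house", "maison"), ("dog", "chien")]
prompt = build_few_shot_prompt(examples, "cat")
print(prompt)
```

The model's weights stay frozen; the mapping from English to French is inferred from the demonstrations in the context alone.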
Basically, this was an unanticipated (and not yet 100% understood) effect of self-attention. According to Anthropic, there are induction heads: attention heads that look back over the context for earlier occurrences of the current pattern and copy what followed, which allows this mapping.
This miracle is the basis of the supposed reasoning capabilities of the models. In addition, the fact that one could experiment so much with the prompt allowed for incredible results.
In practice, without having to train the model again, prompting techniques could be created to improve the model’s capabilities at inference time. Chain of thought is the best example of this approach. Using this ploy, an LLM is able to solve problems that require reasoning (math problems, coding problems, and so on).
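A minimal sketch of the zero-shot flavor of chain of thought follows: the prompt ends with a trigger phrase that nudges the model to write out intermediate steps, and the final answer is then parsed from the generated reasoning. The trigger phrase is the one popularized in the zero-shot chain-of-thought literature; the helper names are hypothetical, and the completion is hand-written here in place of a real model call.

```python
import re

COT_TRIGGER = "Let's think step by step."

def build_cot_prompt(question):
    """Append the reasoning trigger after the question."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def extract_final_number(completion):
    """Return the last number in the model's reasoning text, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return float(numbers[-1]) if numbers else None

prompt = build_cot_prompt(
    "A farmer has 3 pens with 4 sheep each and sells 5. How many are left?"
)

# Hypothetical completion, written by hand for this example.
fake_completion = (
    "There are 3 * 4 = 12 sheep. After selling 5, 12 - 5 = 7 remain. "
    "The answer is 7."
)
print(prompt)
print(extract_final_number(fake_completion))
```

Taking the last number is a crude heuristic; it is enough to show that the whole technique lives in the prompt and in the post-processing, not in the weights.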

One must take into account, though, that:
However, this multi-step generation process does not inherently imply that LLMs possess strong reasoning capabilities, as they may merely emulate the superficial behavior of human reasoning without genuinely comprehending the underlying logic and rules necessary for precise reasoning. (source)
In other words, we have created a parrot that has seen all of human knowledge and can connect the question in the prompt with what it saw during training.
Why is this advantage extremely precarious?
Because the parrot does not have to be a transformer. We just need a model that takes natural language instructions as input and can do in-context learning, after which we can use the whole arsenal of prompt engineering techniques as if it were a transformer.
Ok, so if we do not necessarily need the transformer, where is our new “stochastic parrot”?
Bureaucracy slows down innovation
The main reason we have not seen it yet is that research in industry is currently focused on bringing the transformer (despite its flaws) into production. It is also risky to put into production an architecture that may be better but whose behavior we understand less well.
Let’s dig into it…
First, Google, Meta, Amazon, and other big tech companies have huge amounts of resources. Such large companies, however, are burdened by an elephantine internal bureaucracy:
Google is a “once-great company” that has “slowly ceased to function” thanks to its bureaucratic “maze.” (source)
This increase in bureaucracy results in reduced productivity and an overall slowdown. In order to implement a small change, one must have the approval of increasingly long chains of command and follow increasingly complex protocols. In short, it seems that big tech has the same problem that has plagued empires.
This obviously impacts innovation as well:
“If I had to summarize it, I would say that the signal to noise ratio is what wore me down. The innovation challenges … will only get worse as the risk tolerance will go down.” Noam Bardin, former Google executive. (source)
Of course, there are also well-founded reasons for companies like Google or Microsoft to be more cautious in their choices. For example, Google lost billions in market capitalization when Bard incorrectly answered a question about the James Webb Space Telescope.







