Thomas Reid


Image by Author (Ideogram)

Do LLMs really need to be that big?

An article by Meta, Cisco, and MIT researchers suggests not.

Researchers from some of the top organizations in the AI space recently showed that large language models (LLMs) can have 40%-50% of their layers removed with little to no loss in performance.

They applied their technique to models ranging from 2 billion to 70 billion parameters across several model families, including Llama, Qwen, Mistral, and Phi.

The original research article is accessible using the link below.

Article summary

The article presents a study of a simple layer-pruning strategy for large language models (LLMs), revealing that significant portions of these models can be removed with minimal impact on performance.

Authored by a team from Meta FAIR, UMD, Cisco, Zyphra, MIT, and Sequoia Capital, the study focuses on optimizing layer pruning to enhance the efficiency of AI models, specifically targeting the often overlooked deeper layers of LLMs.

By employing parameter-efficient finetuning (PEFT), specifically quantization combined with Low-Rank Adapters (QLoRA), the researchers demonstrated that it’s possible to significantly reduce the computational resources required for finetuning and to improve inference speed, all without sacrificing accuracy.
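To make this concrete, below is a minimal sketch of what a QLoRA-style setup looks like in practice, assuming the Hugging Face transformers, peft, and bitsandbytes libraries. The model name and adapter hyperparameters are illustrative choices of mine, not values taken from the paper.

```python
# Minimal QLoRA-style sketch: 4-bit quantized base model plus small
# trainable low-rank adapters. Model name and hyperparameters are
# illustrative, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit to cut memory use.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # illustrative model choice
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapters; only these small matrices are trained
# during the brief "healing" finetune after pruning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # a common, minimal choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```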

The core discovery is that large fractions of LLMs’ layers can be pruned away without negatively affecting their question-answering capabilities. This is achieved by identifying and removing similar or redundant layers and then applying a minimal amount of finetuning to adjust the pruned models.
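As a rough illustration of that idea (not the authors’ exact code), the sketch below scores each block of consecutive layers in a Hugging Face Llama-style model by how little it changes the hidden representation on a small calibration batch, then deletes the most redundant block. The paper uses an angular-distance measure between layer representations; plain cosine similarity stands in for it here, and all function and variable names are my own.

```python
# Sketch of similarity-based layer pruning for a Llama-style model whose
# transformer blocks live in model.model.layers. Assumes the tokenizer
# has a pad token set if `texts` is a batch.
import torch
from torch import nn

@torch.no_grad()
def find_most_redundant_block(model, tokenizer, texts, n_prune):
    """Return the start index of the n_prune consecutive layers whose
    removal would change the hidden representation the least."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    hidden = model(**inputs, output_hidden_states=True).hidden_states
    num_layers = len(model.model.layers)
    best_start, best_sim = 0, -1.0
    for start in range(num_layers - n_prune + 1):
        # hidden[i] is the representation entering layer i, so comparing
        # hidden[start] with hidden[start + n_prune] measures how much the
        # block of layers [start, start + n_prune) actually changes it.
        a = hidden[start].float().flatten(1)
        b = hidden[start + n_prune].float().flatten(1)
        sim = torch.nn.functional.cosine_similarity(a, b, dim=-1).mean().item()
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start

def drop_layers(model, start, n_prune):
    """Delete n_prune consecutive transformer blocks starting at `start`."""
    keep = [layer for i, layer in enumerate(model.model.layers)
            if not (start <= i < start + n_prune)]
    model.model.layers = nn.ModuleList(keep)
    model.config.num_hidden_layers = len(keep)
    return model
```

After pruning, the model is lightly finetuned (for example with the QLoRA setup sketched above) to recover any lost performance.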

This approach not only suggests a potential for considerable efficiency gains in deploying AI models but also prompts a reevaluation of how deep learning networks utilize their parameters, hinting either that the deeper layers are underutilized or that the shallow layers do more of the work of storing knowledge than previously assumed.

In practice, this means AI developers might not need as many computational resources as previously thought, making advanced AI technologies accessible to a broader range of users and applications. It is also good news for the energy “doomers” who argue that the AI revolution will be stopped in its tracks by a shortage of the electricity needed to power the huge numbers of GPUs that LLMs typically require.

The findings also pose intriguing questions for future research, particularly regarding the optimal utilization of neural network layers and the potential for further innovations in AI efficiency.

The numbers game

The article highlights several key figures that substantiate the authors’ claims about the effectiveness of layer pruning in large language models (LLMs).

  1. Up to 50% Layer Pruning with Minimal Performance Degradation: The study finds that up to half of the layers in popular LLMs can be pruned with “minimal degradation in downstream performance.” This significant reduction demonstrates the potential for optimizing LLMs without compromising their ability to perform complex tasks like question-answering.
  2. Single GPU Experiments: All experiments were conducted using a single A100 GPU, emphasizing the reduced computational resources required for both finetuning and inference after layer pruning. This makes state-of-the-art AI models more accessible and cost-effective to deploy.
  3. Efficiency Gains: The combination of quantization and Low-Rank Adapters (QLoRA) with layer pruning is highlighted as a method to further “reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand.” This approach is touted for its practicality in enhancing the efficiency of deploying AI models (a rough way to check the memory and latency effect on your own hardware is sketched after this list).
  4. Robustness to Pruning: The research illustrates that the LLMs are robust to the deletion of layers, suggesting that “current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.” This observation raises questions about the optimal architecture and training strategies for LLMs.
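For the memory and latency claims in point 3, here is a quick-and-dirty benchmark you could run yourself. It assumes a CUDA device, a loaded tokenizer, and the pruning helpers sketched earlier; the variable names in the usage comments are hypothetical, and any numbers will depend entirely on your hardware and model.

```python
# Rough before/after comparison of generation latency and peak GPU memory.
import time
import torch

@torch.no_grad()
def quick_benchmark(model, tokenizer, prompt, max_new_tokens=64):
    """Time a short generation and report peak GPU memory in GB."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return elapsed, peak_gb

# Hypothetical usage, comparing a model before and after pruning:
# t0, m0 = quick_benchmark(original_model, tokenizer, "Explain layer pruning.")
# t1, m1 = quick_benchmark(pruned_model, tokenizer, "Explain layer pruning.")
# print(f"latency {t0:.2f}s -> {t1:.2f}s, peak memory {m0:.2f}GB -> {m1:.2f}GB")
```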

Summary

These findings show the potential of layer pruning as a powerful tool for optimizing LLMs, reducing their computational footprint while maintaining, or only slightly impacting, their performance. This research opens up new avenues for making advanced AI models more sustainable and accessible, challenging conventional beliefs about the necessity of every layer in deep learning architectures. Also, if the predicted energy savings materialize, it’ll make the green lobby happier, which is no bad side effect either.

Ok, that’s all from me for now. Hopefully, you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.

If you liked this content, I think you’ll find these related articles interesting too.
