T </b>library developed by the Hugging Face team.</p><div id="1963"><pre><span class="hljs-keyword">from</span> peft <span class="hljs-keyword">import</span> LoraConfig, TaskType
<span class="hljs-comment"># OR </span>
<span class="hljs-comment"># target_modules = [</span>
<span class="hljs-comment"># "query_key_value",</span>
<span class="hljs-comment"># "dense",</span>
<span class="hljs-comment"># "dense_h_to_4h",</span>
<span class="hljs-comment"># "dense_4h_to_h",</span>
<span class="hljs-comment"># ]</span></pre></div><p id="ac2d">You can also target all the dense layers in the transformers architecture:</p><div id="6856"><pre><span class="hljs-comment"># From https://github.com/artidoro/qlora/blob/main/qlora.py</span>
<span class="hljs-keyword">import</span> torch

<span class="hljs-keyword">def</span> <span class="hljs-title function_">find_all_linear_names</span>(<span class="hljs-params">args, model</span>):
    <span class="hljs-comment"># Collect the short name of every nn.Linear layer so LoRA can target them all</span>
    cls = torch.nn.Linear
    lora_module_names = <span class="hljs-built_in">set</span>()
    <span class="hljs-keyword">for</span> name, module <span class="hljs-keyword">in</span> model.named_modules():
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">isinstance</span>(module, cls):
            names = name.split(<span class="hljs-string">'.'</span>)
            lora_module_names.add(names[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(names) == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> names[-<span class="hljs-number">1</span>])
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">list</span>(lora_module_names)</pre></div><p id="085f">You now just have to add the “initialized” adapters to a pretrained model of your choice.</p><div id="fd22"><pre><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM
<span class="hljs-keyword">from</span> peft <span class="hljs-keyword">import</span> get_peft_model
model = AutoModelForCausalLM.from_pretrained(model_id)
lora_model = get_peft_model(model, peft_config)
lora_model.print_trainable_parameters()
<span class="hljs-comment"># "trainable params: 1855499 || all params: 355894283 || trainable%: 0.5213624069370061"</span></pre></div><p id="bcfc">Once trained, you can either save the adapters separately or merge them into the model.</p><div id="a43b"><pre><span class="hljs-comment"># Save only adapters</span>
lora_model.save_pretrained(...)
<span class="hljs-comment"># Save merged model</span>
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained(...)</pre></div><blockquote id="bfdf"><p>Sources:
<a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora#common-lora-parameters-in-peft">Hugging Face PEFT: LoRA</a> documentation</p></blockquote><h1 id="762b">Quantization</h1><p id="4bee">I cannot talk about LoRA without talking about quantization. Both of these techniques were combined in the paper <a href="https://arxiv.org/abs/2305.14314"><i>QLORA: Efficient Finetuning of Quantized LLMs</i></a>, and were later integrated into the Hugging Face ecosystem through the <code>bitsandbytes</code>, <code>peft</code>, and <code>accelerate</code> libraries.</p><p id="5f71">Let’s dig into it!</p><h2 id="cf26">What is the quantization process?</h2><p id="51b8">Quantization is a technique that reduces the precision of an element without losing its overall <i>meaning</i>.</p><p id="533e">For instance, in the case of a picture, quantization consists of reducing the number of pixels, while keeping a decent resolution of the image.</p><figure id="ea15"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*QB6VpIqS6YpuWwWn.jpg"><figcaption>Quantization of a picture</figcaption></figure><blockquote id="490c"><p>But how does it apply to float numbers?
Wait a minute, what is a float number?</p></blockquote><p id="6d2b">Certainly, one cannot grasp the concept of quantization without first understanding how computers represent numbers.</p><h2 id="2bb0">Float numbers fundamentals</h2><figure id="6425"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*GeT8lpMG6juQmWtq.png"><figcaption>Floating Point 32 representation (<a href="https://en.wikipedia.org/wiki/Single-precision_floating-point_format">source</a>)</figcaption></figure><p id="d792">Our computers are binary, which means they only exchange information through 0s and 1s.</p><p id="8fb3">In order to represent numbers, a specific system called the <b>Floating-Point format</b> was designed, which allows computers to understand a wide range of numerical values. The most common representation is the <b>single-precision floating-point </b>format, composed of <b>32 bits</b> <i>(one bit = 0 or 1)</i>.</p><p id="81e5">Various formats exist, such as <b>half-precision (16 bits) </b>or<b> double-precision (64 bits)</b>. In short, the greater the number of bits used, the broader the range and the finer the precision of the numbers that can be represented.</p><p id="4b3f">Here’s a video I suggest you watch if you want to know how the floating-point format works:</p>
<figure id="9704">
<div>
<div>
<img class="ratio" src="http://placehold.it/16x9">
<iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fgc1Nl3mmCuY%3Fstart%3D910%26feature%3Doembed%26start%3D910&display_name=YouTube&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dgc1Nl3mmCuY&image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fgc1Nl3mmCuY%2Fhqdefault.jpg&key=a19fcc184b9711e1b4764040d3dc5c07&type=text%2Fhtml&schema=youtube" allowfullscreen="" frameborder="0" height="480" width="480"></iframe>
</div>
</div>
</figure><blockquote id="0bcb"><p>Ok great Jérémy, but what do I do with that information?</p></blockquote><p id="711b"><b>What if I told you that you can still achieve great performance without this degree of precision?</b></p><p id="d064">Models like <b>GPT-3.5 </b>or <b>Bloom-175B </b>are really large, since they are composed of billions of parameters.</p><p id="4557">For 175 billion parameters in <b>FP32 </b>format, this would represent:</p><p id="cb91"><b>175 × 10⁹ × 4 bytes = 700 GB, </b>or <b>350 GB in half-precision, which would have to be stored in the GPU for the fine-tuning process!</b></p><blockquote id="dde5"><p>Remember that <b>1 byte = 8 bits!</b></p></blockquote><p id="1966">So how do we shrink these models?</p><h2 id="019d">Quantization: from FP32 to Int8</h2><p id="57bc"><i>As you may have understood if you have watched the YouTube video above,</i> <b>Int8</b> stores integers on 8 bits. With the symmetric scheme used here, values are kept in <b>[-127, 127]</b> (the full Int8 range is [-128, 127], but -128 is left unused).</p><blockquote id="82fd"><p>1 bit for the sign, and 7 bits for the number: 2⁷ = 128,
And don’t forget 0! Thus <b>[-127, 127]</b></p></blockquote><p id="f803">Let’s say you want to reduce a vector of float numbers into Int8 format:</p><p id="eba4"><b>v = [-1.2, 4.5, 5.4, -0.1]</b></p><p id="500f">What you can do is take the maximum absolute value of <b>v </b>(here <b>5.4</b>) and scale all the numbers into the <b>range of Int8 [-127, 127]</b>. To do this, you need to calculate the coefficient</p><p id="83be"><b>α = 127 / max(|v|) = 127 / 5.4 ~ 23.5</b></p><p id="ecbc">Now if you scale all numbers in<b> v</b> by <b>α,</b> and you round the result, you get:</p><p id="e716"><b>round(α · v) = [-28, 106, 127, -2]</b></p><p id="a8d9">Now, if you want to <b>de-quantize </b>this vector, you just have to do the opposite operation (divide by <b>α</b>), and you get back to the initial vector!</p><p id="0ae7"><b>v = [-1.2, 4.5, 5.4, -0.1] </b><i>after de-quantization</i></p><p id="b5ab">Great, you performed quantization and de-quantization without losing any information!</p><p id="b191">At least that’s what you might think, but in reality, we did lose precision when rounding each value. Nevertheless, in this specific case, the difference is not visible because we decided to represent numbers with only one decimal.</p><h2 id="a12d">Now, what happens when there is an outlier?</h2><p id="f79e">Let’s say we now have this vector:</p><p id="01e0"><b>v’ = [-1.2, 70, 5.4, -0.1]</b></p><p id="59a0">The highest number is now 70, which can be considered an outlier. <b>
If we reproduce the exact same process, we get after de-quantization:</b></p><p id="e2f3"><i>de-quantized </i><b>v’ =</b> <b>[-1.1, 70, 5.5, 0.0]</b></p><p id="099d">A loss of precision starts to appear!</p><p id="d43c"><b>Now, let’s consider the same loss applied to an LLM consisting of 7 billion parameters: the lack of precision will accumulate across the neural network, causing the complete loss of meaningful information and resulting in pure noise.</b></p><blockquote id="5d30"><p>Furthermore, let’s remember we chose an 8-bit format, but the result would be much worse with 4-bit or even 3-bit. Try it yourself!</p></blockquote><p id="5c9d" type="7">“The problem is that at a scale of 6.7B parameters and above, 75% of hidden state sequences are affected (with outliers). So this absolutely wrecks quantization.” (Tim Dettmers)</p><p id="2d8a">But a group of people found a way to apply quantization to LLMs!</p><h2 id="5d82">LLM.int8() made quantization possible at large scales</h2><p id="1080">The paper <a href="https://arxiv.org/pdf/2208.07339.pdf">LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale</a> introduced a method to bypass this <i>outlier issue</i>.</p><figure id="168d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*0Lm-NygUiW0Q8mSf"><figcaption>LLM-int8 method (<a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/">source</a>)</figcaption></figure><p id="e5a1">Instead of quantizing all the parameters, which would lead to a decrease in performance as we have just showcased, quantization is used during the matrix multiplication process, combining <b>mixed-precision decomposition</b> and <b>vector-wise quantization </b><i>(see the image below)</i>.</p><p id="8223"><b>In other words, during the <i>matrix multiplication process</i>, vectors containing outliers (above a <i>threshold</i>) are extracted from the weights matrix, resulting in two multiplications. 
Matrices of small numbers (representing 99.9% of the values according to the <a href="https://arxiv.org/pdf/2208.07339.pdf">paper</a>) are <i>quantized</i>, while large numbers are kept in <i>FP16</i>.</b></p><p id="6e45">The output of the <i>small-numbers multiplication </i>is then <b>de-quantized, </b>and added to the other output, following the <b>mixed-precision decomposition principles</b>.</p><figure id="ac4d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*3aHpt_oKTn7irJG7IaX2Zg.png"><figcaption>Matrix multiplication using vector-wise quantization and mixed-precision decomposition. Large numbers are grouped together and kept in FP16, while small ones are quantized in Int8. (<a href="https://arxiv.org/pdf/2208.07339.pdf">paper</a>)</figcaption></figure><blockquote id="c5dd"><p><b>Important notes: </b>
The <b>quantization</b> technique introduced here is used during inference only (matrix multiplications), which means you actually don’t have a lighter model composed of 8-bit numbers!
However, the <b>GPU memory footprint</b> <b>during inference is reduced</b>, which is different!
You actually even have a <b>slightly heavier model </b>because of the way this technique is implemented! (by 0.1% for models up to 13B according to the <a href="https://arxiv.org/pdf/2208.07339.pdf">paper</a>)</p></blockquote><p id="b1be">Experiments on BLOOM-175B showed a reduction of the <b>memory footprint by a factor of 1.96x without any performance degradation!</b></p><p id="ed88">This technique enables access to large models that could not previously fit into GPU memory. Then Hugging Face spread it worldwide!</p><h2 id="cdaa">Does it mean you can fine-tune this quantized model?</h2><p id="475f"><b>No.</b></p><p id="ca0c">Indeed, recent work has shown that this technique only works for inference and is not suitable for training (<a href="https://arxiv.org/abs/2304.13013">source</a>). But does it mean we cannot fine-tune a quantized model?…</p><p id="c86f"><i>You probably know where I’m heading with this.</i></p><p id="fd84">What if we could reduce the<b> GPU memory footprint using quantization</b>, <i>AND</i> <b>train new adapters with the LoRA technique</b>?</p><p id="c6fd"><a href="https://arxiv.org/pdf/2305.14314.pdf"><b>QLoRA</b></a> was born! And the research team succeeded in quantizing the pretrained model to <b>4-bit</b>!</p><blockquote id="4b47"><p>In the <a href="https://arxiv.org/pdf/2305.14314.pdf">paper</a>, new techniques were introduced to successfully quantize the models, like <b>double-quantization </b>and <b>4-bit NormalFloat</b>, which I haven’t covered in this article. If you want to know more about them, I suggest you read the QLoRA paper, in addition to checking out the <a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes">blog post</a> from Hugging Face.</p></blockquote><h2 id="b81e">How to use quantization in your code?</h2><p id="52c3">First, you need to install the <code>bitsandbytes</code> and <code>accelerate</code> libraries:</p><div id="ac02"><pre>pip install -q bitsandbytes
pip install -q accelerate
pip install -q peft==0.4.1</pre></div><p id="b05c">You can then load the model with <i>4-bit </i>or <i>8-bit </i>quantization by passing the argument <code>load_in_4bit=True</code> or <code>load_in_8bit=True</code> when calling the <code>from_pretrained</code> method.</p><div id="39c4"><pre><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    <span class="hljs-string">"facebook/opt-350m"</span>,
    load_in_4bit=<span class="hljs-literal">True</span>,
    device_map=<span class="hljs-string">"auto"</span>,
)
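...</pre></div><p id="51c2">You can also explore more advanced options with the <code>BitsAndBytesConfig</code> class.</p><div id="284e"><pre><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM, BitsAndBytesConfig

<span class="hljs-comment"># Example NF4 settings (an assumption: the definition of nf4_config is not shown in the original snippet)</span>
nf4_config = BitsAndBytesConfig(load_in_4bit=<span class="hljs-literal">True</span>, bnb_4bit_quant_type=<span class="hljs-string">"nf4"</span>)

model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=<span class="hljs-string">"auto"</span>,
    quantization_config=nf4_config,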
)</pre></div><p id="3e90">Your model is <b>almost </b>ready for fine-tuning.</p><p id="836d">Indeed, to enable your model to train nicely, you need to:</p><ul><li><b>Freeze the quantized parameters to prevent any training</b>,</li><li><b>Cast all layer norms and the LM head (which haven’t been quantized) to FP32 to ensure the stability of your model</b>,</li><li><b>Call</b> <code>model.enable_input_require_grads()</code> <b>in case you use <i>gradient checkpointing</i></b></li></ul><div id="5a5b"><pre><span class="hljs-keyword">import</span> torch

<span class="hljs-keyword">for</span> name, param <span class="hljs-keyword">in</span> model.named_parameters():
    <span class="hljs-comment"># freeze base model's layers</span>
    param.requires_grad = <span class="hljs-literal">False</span>
<span class="hljs-comment"># cast all non int8 or int4 parameters to fp32</span>
<span class="hljs-keyword">for</span> param <span class="hljs-keyword">in</span> model.parameters():
    <span class="hljs-keyword">if</span> (param.dtype == torch.float16) <span class="hljs-keyword">or</span> (param.dtype == torch.bfloat16):
        param.data = param.data.to(torch.float32)
<span class="hljs-keyword">if</span> use_gradient_checkpointing:
    <span class="hljs-comment"># For backward compatibility</span>
    model.enable_input_require_grads()</pre></div><p id="af80">This preparation is now handled in the <code>peft</code> library (<code>peft==0.4.1</code> here) by the <code>prepare_model_for_kbit_training()</code> function (<a href="https://github.com/huggingface/peft/blob/main/src/peft/utils/other.py">source code</a>).</p><p id="be93"><b>That’s it! You now have a quantized model!</b></p><blockquote id="5ddf"><p>Source:
<a href="https://huggingface.co/blog/hf-bitsandbytes-integration">A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale</a>, Hugging Face blog post
<a href="https://arxiv.org/pdf/2208.07339.pdf">LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale</a>, paper
<a href="https://arxiv.org/pdf/2305.14314.pdf">QLORA: Efficient Finetuning of Quantized LLMs</a>, paper
<a href="https://www.youtube.com/watch?v=gc1Nl3mmCuY&ab_channel=0612TVw%2FNERDfirst">Floating point numbers</a>, youtube
<a href="https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/">Tim Dettmers blog post about LLM.int8()</a>
<a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes">Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA</a>, Hugging Face blog post</p></blockquote><h1 id="5da1">Let’s wrap it all up in code!</h1><p id="25eb">Now that you have an overview of <b>Gradient Checkpointing, LoRA, and Quantization</b>, let’s write the code to prepare an LLM from the Hugging Face hub for fine-tuning.</p><div id="4521"><pre>pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git</pre></div><div id="cf7d"><pre><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
<span class="hljs-keyword">from</span> peft <span class="hljs-keyword">import</span> (
    get_peft_model,
    LoraConfig,
    TaskType,
    prepare_model_for_kbit_training,
)</pre></div><div id="a33d"><pre><span class="hljs-comment"># Import the model</span>
gradient_checkpointing = <span class="hljs-literal">True</span>
model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    use_cache=<span class="hljs-literal">False</span> <span class="hljs-keyword">if</span> gradient_checkpointing <span class="hljs-keyword">else</span> <span class="hljs-literal">True</span>,  <span class="hljs-comment"># this is needed for gradient checkpointing</span>
    device_map=<span class="hljs-string">"auto"</span>,
    load_in_4bit=<span class="hljs-literal">True</span>,
)
<span class="hljs-comment"># Prepare the model (freeze, cast FP32, enable_require_grads, activate gradient checkpointing)</span>
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=gradient_checkpointing,
)</pre></div><div id="a09a"><pre><span class="hljs-comment"># Prepare Peft model by adding Lora</span>
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=modules,
    lora_dropout=0.1,
    bias=<span class="hljs-string">"none"</span>,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, peft_config)</pre></div><p id="1135">Your <code>model</code> is now ready to be fine-tuned with efficient GPU memory management!</p><h2 id="bde3">Side note</h2><p id="78de">Hugging Face has made the process much easier by creating <b>SFTTrainer, </b>a subclass of <code>Trainer</code> that handles everything we have talked about so far.</p><p id="0b7a">In a few lines of code, you can prepare your model for efficient fine-tuning with quantization and LoRA.</p><p id="de6b">I suggest you have a look at the <a href="https://huggingface.co/docs/trl/main/en/sft_trainer">official documentation</a>.</p><div id="56d7"><pre><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM
<span class="hljs-keyword">from</span> trl <span class="hljs-keyword">import</span> SFTTrainer
model = AutoModelForCausalLM.from_pretrained(
    <span class="hljs-string">"EleutherAI/gpt-neo-125m"</span>,
    load_in_4bit=<span class="hljs-literal">True</span>,
    device_map=<span class="hljs-string">"auto"</span>,
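)
<span class="hljs-comment"># `trainer` is an SFTTrainer instance built from `model` and a training dataset (its construction is not shown here)</span>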
trainer.train()</pre></div><h1 id="fe2f">To conclude</h1><p id="e14e">In this article, we addressed a challenge that arises during the fine-tuning of Large Language Models: <b>how to fit the training on a single GPU.</b></p><p id="9f87">We focused on 3 techniques to reduce the memory footprint: <b>gradient checkpointing</b>, <b>LoRA, </b>and <b>Quantization (</b>which led us to <b>QLoRA)</b>.</p><p id="be22">We then saw how to apply these techniques to our code by leveraging the Hugging Face implementations with <b>PEFT, BitsAndBytes, </b>and <b>Transformers.</b></p><p id="c870">The goal of this article was to provide a deep, yet simple, view of the existing techniques you can leverage to fine-tune your own LLMs in your projects.</p><p id="95ec">I’m convinced that using a technique already implemented in libraries is one thing, <b>but knowing what it does and when to use it</b> will make you a better ML Engineer / Data Scientist!</p><p id="5105">I hope you enjoyed reading it!</p><p id="1a2f">I didn’t think I would dig so deep into papers and code repositories to write this article. But overall it was worth it since I learned so much in the process!</p><p id="8612"><a href="https://medium.com/@jeremyarancio/subscribe"><b>You can join my newsletter to get notified of my latest articles.</b></a></p><p id="0347">If you enjoy reading stories like these and want to support me as a writer, you can do it by getting a <a href="https://medium.com/@jeremyarancio/membership">Medium subscription</a> (it’s $5 a month, giving you unlimited access to articles like this one, and I get a small commission without any additional fee for you!).</p><p id="b14d">Happy coding!</p></article></body>
Fit Your LLM on a single GPU with Gradient Checkpointing, LoRA, and Quantization: a deep dive
Anyone who has ever tried to fine-tune a Large Language Model knows how hard it is to manage GPU memory.
“RuntimeError: CUDA error: out of memory”.
This error message has been haunting my nights.
Models with 3B, 7B, or even 13B parameters are large, and fine-tuning them is long and tedious. Running out of memory during training can be both frustrating and costly.
But don’t worry, I got you!
In this article, we go through 3 techniques that you should know, or that you may already use without knowing how they work: Gradient Checkpointing, Low-Rank Adapters, and Quantization.
These will help you avoid running out of memory during your training and save you a lot of time.
If you’re not familiar with fine-tuning an LLM, I wrote an article on this topic where I walk you through fine-tuning Bloom-3B on the Lord Of The Rings books.
Gradient checkpointing is a technique that stores only a minimal number of layers during neural network training and recomputes the others on the fly when they are needed.
To understand this process, we need to understand how back-propagation is performed and how layers are stored in the GPU memory throughout the process.
Forward and backward propagation fundamentals
The forward and backward propagations are the two phases of deep neural network training.
During the forward pass, the input is vectorized (transforming images into pixels and texts into embeddings), and each element is processed throughout the neural network via a succession of linear multiplications and activation functions (non-linear functions such as sigmoid or ReLU).
The output of the neural network, referred to as the head, is designed to produce the desired output, such as classification or next-word prediction. The vectorized prediction is then compared to the expected result, and the loss is calculated using a specific loss function, such as cross-entropy, L2-norm, etc…
The back-propagation can begin.
Based on the loss value, the weights and biases of each layer will be updated with the objective of minimizing the loss. This update process starts from the end of the neural network and propagates toward the beginning.
Now that we have given a quick and simple reminder of the neural network training algorithm, let’s talk about how calculations are stored in memory.
“Poor-memory” algorithm
A simple approach would be to retain only the essential layers required for the back-propagation and release them from memory once their usage is complete.
But as you can observe in the image above, the peak number of layers stored in memory at the same time is non-optimal. We need to find a way to store a lower number of elements in memory while keeping the back-propagation working.
“Poor computation time” algorithm
One way to reduce the memory footprint would be to re-compute each layer from the beginning of the neural network during back-propagation.
But in this case, the computation time would increase significantly, making training impractical for large models.
So what if we could have the best of both worlds?
Who said gradient checkpointing?
Optimize computation and memory with Gradient checkpointing
This technique saves “checkpoints” to compute “missing” layers during back-propagation.
In simpler terms, instead of computing the layer from the beginning, as demonstrated in the previous example, the algorithm starts the computation from the nearest checkpoint.
An optimal strategy to balance memory storage and computation time is to place a checkpoint every O(sqrt(n)) layers, with n the number of layers. This way, the additional computation during the backward pass corresponds to roughly one additional forward pass.
This technique enables training larger models on smaller GPUs at the expense of additional computation time (~20%).
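To make the sqrt(n) trade-off concrete, here is a minimal sketch in plain Python. It is only a simplified cost model, not how any library implements checkpointing: the function name `checkpoint_plan` and the way memory and recomputation are counted are assumptions made for illustration.

```python
import math


def checkpoint_plan(n_layers):
    """Simplified cost model: one checkpoint every ~sqrt(n) layers."""
    segment = max(1, math.isqrt(n_layers))
    # Indices of the layers whose outputs are kept in memory
    checkpoints = list(range(0, n_layers, segment))
    # Peak storage: all checkpoints plus one segment rebuilt during the backward pass
    peak_stored = len(checkpoints) + segment
    # Every layer that is not a checkpoint is recomputed once (about one extra forward pass)
    recomputed = n_layers - len(checkpoints)
    return segment, checkpoints, peak_stored, recomputed


segment, checkpoints, peak_stored, recomputed = checkpoint_plan(16)
print(f"checkpoint every {segment} layers: {checkpoints}")
print(f"peak stored layers: {peak_stored} (instead of 16), recomputed layers: {recomputed}")
```

Under this model, a 100-layer network would keep about 20 layers in memory at its peak instead of 100, while recomputing at most one forward pass worth of layers.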
How to implement gradient checkpointing with Transformers
You can easily use the gradient checkpointing technique with the following code in the transformers library.
from transformers import AutoModelForCausalLM, TrainingArguments
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_cache=False,  # must be False when gradient checkpointing is enabled
    **default_args,  # any other from_pretrained keyword arguments
)
model.gradient_checkpointing_enable()
Low-Rank Adapters (LoRA)
LoRA is a technique developed by the Microsoft team to accelerate the fine-tuning of large language models. To evaluate this approach, they implemented it on GPT-3 175B and achieved a substantial reduction in the number of trained parameters.
But how does it work?
Low-Rank Adapters approach (source) (You’ll finally understand this image after reading this section!)
Their approach consists of freezing all parameters of a pretrained model and adding new trainable parameters to specific modules in the transformer architecture, like the Attention modules (query, key, value, but it also works for other modules).
To implement those adapters, they exploit the linearity of dense layers, shown in the following equation: h = Wo·x + B·A·x, with x (dimension: k) and h (dimension: d) as the layers before and after multiplication, Wo (d x k) as the pretrained weights, and B & A as the new weight matrices.
The dimensions of the matrices B & A are respectively (d x r) and (r x k) with r << min(d, k).
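A quick back-of-the-envelope sketch shows why r << min(d, k) matters. The layer size (4096 x 4096) and the rank r=8 below are illustrative assumptions, not values taken from a specific model, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d, k, r):
    """Compare the size of a dense weight (d x k) with its LoRA adapters B (d x r) and A (r x k)."""
    full = d * k        # parameters of the frozen pretrained matrix
    lora = r * (d + k)  # parameters of B plus parameters of A
    return full, lora


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"frozen weights: {full:,}")
print(f"trainable LoRA weights: {lora:,} ({100 * lora / full:.2f}% of the layer)")
```

For this single layer, the adapters hold 65,536 trainable parameters against 16,777,216 frozen ones, which is the kind of ratio reported by `print_trainable_parameters()` on a full model.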
In other words, new dense layers are grafted onto the existing ones without making the training process more complex. But instead of training all the parameters, only a tiny portion is updated!
At the start of fine-tuning, the weight matrix BA is initialized to 0, so the adapted model starts out identical to the pretrained one. The update BAx is then scaled by α/r, with α a constant. When optimizing the weights with the Adam algorithm, tuning α is roughly the same as tuning the learning rate.
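The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration with tiny hand-rolled matrices: the helper names `matvec` and `lora_forward` are hypothetical, and the initialization (A random, B zero) follows the choice described in the LoRA paper so that BA starts at 0.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x), with W0 frozen and only A and B trainable."""
    base = matvec(W0, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank path: k -> r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 3, 4, 2, 16
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # (d x k), frozen
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # (r x k), random init
B = [[0.0] * r for _ in range(d)]                                # (d x r), zero init
x = [1.0, -2.0, 0.5, 3.0]

h = lora_forward(W0, A, B, x, alpha, r)
# Because B is zero, BA is zero and the output equals the pretrained output
print(h == matvec(W0, x))
```

Training then only moves A and B away from this starting point, while W0 never changes.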
The paper tested different LoRA configurations and found that r=8 (or above), applied to a variety of modules, performs best.
Hyperparameter tuning of LoRA (query, key, value, output) (source)
Once your LoRA model is fine-tuned, you can either merge the adapter weights into the pretrained weights to obtain a single model, or save only the adapters and load them on top of the pretrained model later.
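Merging works because the adapter path is linear: folding (α/r)·BA into the pretrained matrix gives the same output as keeping the two paths separate. Here is a minimal sketch with tiny hand-picked matrices; the helper names and the values are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


alpha, r = 2, 1
scale = alpha / r
W0 = [[1, 0, 2], [0, 1, -1]]  # (d x k) pretrained weights
B = [[1], [2]]                # (d x r) trained adapter
A = [[1, -1, 0]]              # (r x k) trained adapter
x = [1, 2, 3]

# Adapter kept separate: W0 x + scale * B (A x)
adapter_out = [w + scale * u for w, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Adapter merged: (W0 + scale * B A) x
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
merged_out = matvec(W_merged, x)

print(adapter_out, merged_out)
```

Both paths return the same vector, which is why a merged model adds no inference overhead compared with the original architecture.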
How to implement LoRA in your code?
It is possible to exploit the LoRA technique by using the PEFT library developed by the Hugging Face team.
You can also target all the dense layers in the transformers architecture:
# From https://github.com/artidoro/qlora/blob/main/qlora.pydeffind_all_linear_names(args, model):
cls = torch.nn.Linear
lora_module_names = set()
for name, module in model.named_modules():
ifisinstance(module, cls):
names = name.split('.')
lora_module_names.add(names[0] iflen(names) == 1else names[-1])
You’ve now just to add the “initialized” adapters to a pretrained model of your choice.
from transformers import AutoModelForCausalLM
from peft import get_peft_model
model = AutoModelForCausalLM.from_pretrained(model_id)
lora_model = get_peft_model(model, peft_config)
lora_model.print_trainable_parameters()
# "trainable params: 1855499 || all params: 355894283 || trainable%: 0.5213624069370061"
Once trained, you can either save the adapters separately or merged them into the model.
# Save only adapaters
lora_model.save_pretrained(...)
# Save merged model
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained(...)
I cannot talk about LoRA without talking about quantization. Both of these techniques were fused in the paper QLORA: Efficient Finetuning of Quantized LLMs, and were later implemented into the Hugging Face through the library bitsandbytes , peft , and accelerayte .
Let’s dig into it!
What is the quantization process?
Quantization is a technique that reduces the precision of an element without losing the overall meaning of it.
For instance, in the case of a picture, quantization consists in reducing the number of pixels, while keeping a decent resolution of the image.
Quantization of a picture
But how does it apply to float numbers?
Wait a minute, what is a float number?
Certainly, one cannot grasp the concept of quantization without first understanding how computers represent numbers.
Our computers are binary, which means they only exchange information through 0 and 1’s.
In order to represent numbers, a specific system called the Floating-Point format was designed, which allows computers to understand a wide range of numerical values. The most common representation is the single-precision floating-point format, composed of 32 bits(one bit = 0 or 1).
Various formats exist, such as half-precision (16 bits) or double-precision (64 bits). In short, the greater the number of bits used, the broader the range of numbers that can be accommodated.
Here’s a video I suggest you watch if you want to know how the float-point format works:
Ok great Jérémy, but what do I do with that information?
What if I tell you you can still achieve great performance without this degree of precision?
Models like GPT-3.5 or Bloom-175B are really large (as you may understand since they’re composed of a bunch of parameters.
In FP32 format, this would represent:
175*10⁹. 4 bytes = 700Gb, or 350Gb in half-precision, which would have to be stored in the GPU for the fine-tuning process!
Remember than 1 byte = 8 bits!
So how do we shrink these models?
Quantization: from FP32 to Int8
As you may have understood if you have watched the youtube video above,Int8 represents any number between [-127, 127].
1 bit for the sign, and 7 bit for the number: 2⁷ = 128,
And don’t forget 0 ! Thus [-127, 127]
Let’s say you want to reduce a vector of float numbers into Int8 format:
v = [-1.2, 4.5, 5.4, -0.1]
What you can do is define the maximum of v (here 5.4) and scale all the numbers into the range of Int8 [-127, 127]. To do this, you need to calculate the coefficient
α = 127 / max(v) = 127 / 5.4 ~ 23.5
Now if you scale all numbers in v by α, and you round the result, you get:
α.v = [-28, 106, 127, -2]
Now, if you want to de-quantize this vector, you just have to do the opposite maneuver, and you get back to the initial vector!
v = [-1.2, 4.5, 5.4, -0.1] after de-quantization
Great, you performed quantization and de-quantization without losing any information!
At least that’s what you might think, but in reality, we did lose precision when rounding each value. Nevertheless, in this specific case, the difference is not substantial because we decided to represent numbers with only one decimal.
Now, what happens in the case of the presence of an outlier?
Let’s say we now have this vector:
v’ = [-1.2, 70, 5.4, -0.1]
The highest number is now 70, which can be considered an outlier. And if we reproduce the exact same process, we got after qe-quantization:
de-quantized v’ =[-1.1, 70, 5.5, 0.0]
A loss of precision starts to appear!
Now, let’s consider the same loss applied to an LLM consisting of 7 billion parameters: the lack of precision will accumulate across the neural network, causing the complete loss of meaningful information and resulting in pure noise.
Furthermore, let’s remember we chose 8-bit format, but the result would be much worse with 4-bit or even 3-bit. Try it yourself!
“The problem is that at a scale of 6.7B parameters and above, 75% of hidden state sequences are affected (with outliers). So this absolutely wrecks quantization.” (Tim Dettmers)
But a group of people found a way to apply quantization to LLMs!
LLM.int8() made quantization possible at large scales
Instead of quantizing all of the parameters, which would degrade performance as we have just seen, quantization is applied during the matrix multiplication, combining mixed-precision decomposition and vector-wise quantization (see the figure below).
In other words, during the matrix multiplication, vectors containing outliers (values above a threshold) are extracted from the matrices, which results in two separate multiplications. The small-number matrices (representing 99.9% of the values according to the paper) are quantized, while the large numbers are kept in FP16.
The output of the small-number multiplication is then de-quantized and added to the other output, following the mixed-precision decomposition principle.
Matrix multiplication using vector-wise quantization and mixed-precision decomposition. Large numbers are grouped together and kept in FP16, while small ones are quantized in Int8. (paper)
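The decomposition can be sketched in plain Python on small matrices. This is a simplified illustration and not the bitsandbytes implementation. It assumes the outliers are detected per feature dimension of the input `X`, with a hypothetical threshold of 6.0, and that one absmax scale is used per row of `X` and per column of `W`. Python floats stand in for FP16.

```python
def absmax_scale(vector, q_max=127):
    """Scaling factor mapping the largest magnitude in `vector` to q_max."""
    largest = max((abs(x) for x in vector), default=0.0)
    return q_max / largest if largest > 0 else 1.0


def mixed_precision_matmul(X, W, threshold=6.0):
    """Compute X @ W (X: m x k, W: k x n), keeping outlier dimensions unquantized."""
    m, k, n = len(X), len(W), len(W[0])

    # 1. Feature dimensions (columns of X) that hold at least one outlier.
    outlier_dims = [j for j in range(k)
                    if any(abs(X[i][j]) >= threshold for i in range(m))]
    regular_dims = [j for j in range(k) if j not in outlier_dims]

    # 2. Vector-wise scales, computed on the regular dimensions only.
    x_scales = [absmax_scale([X[i][j] for j in regular_dims]) for i in range(m)]
    w_scales = [absmax_scale([W[j][c] for j in regular_dims]) for c in range(n)]

    # 3. Quantize the regular part to integers in [-127, 127].
    Xq = [[round(X[i][j] * x_scales[i]) for j in regular_dims] for i in range(m)]
    Wq = [[round(W[j][c] * w_scales[c]) for c in range(n)] for j in regular_dims]

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for c in range(n):
            # Integer multiplication, then de-quantization of the result.
            int_acc = sum(Xq[i][t] * Wq[t][c] for t in range(len(regular_dims)))
            regular = int_acc / (x_scales[i] * w_scales[c])
            # Outlier dimensions are multiplied without any quantization.
            outlier = sum(X[i][j] * W[j][c] for j in outlier_dims)
            out[i][c] = regular + outlier
    return out


X = [[0.5, -1.0, 40.0, 0.2],
     [1.5, 0.3, -0.4, -2.0]]  # the third feature dimension contains an outlier
W = [[0.1, -0.2],
     [0.4, 0.3],
     [-0.5, 0.6],
     [0.7, -0.1]]

result = mixed_precision_matmul(X, W)
print([[round(value, 2) for value in row] for row in result])
```

Because the outlier dimension never goes through the Int8 scale, the remaining values keep the full [-127, 127] range to themselves and the result stays close to the exact product.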
Important notes:
The quantization technique introduced here is only used during inference (in the matrix multiplications), which means you don’t actually end up with a lighter model made of 8-bit numbers!
However, the GPU memory footprint during inference is reduced, which is a different thing!
You actually even get a slightly heavier model because of how this technique is implemented! (by 0.1% for models up to 13B according to the paper)
Experiments on BLOOM-175B showed a reduction of the memory footprint by a factor of 1.96x without any performance degradation!
This technique gives access to large models that could not previously fit into GPU memory. Hugging Face then made it widely available through its libraries!
Does it mean you can fine-tune this quantized model?
No.
Indeed, recent papers have shown that this technique only works for inference and is not suitable for training (source). But does it mean we cannot fine-tune a quantized model?…
You probably know where I’m heading with this.
What if we could reduce the GPU memory footprint using quantization, AND train new adapters with the LoRA technique?
QLoRA was born! And the research team succeeded in quantizing the pretrained model to 4-bit!
The paper introduces new techniques to quantize the models successfully, such as double quantization and 4-bit NormalFloat, which I haven’t covered in this article. If you want to know more, I suggest reading the QLoRA paper and checking out the blog post from Hugging Face.
How to use quantization in your code?
First, you need to install the bitsandbytes and accelerate libraries (for example with pip install bitsandbytes accelerate).
You can then load the model with 4-bit or 8-bit quantization by passing the argument load_in_4bit=True or load_in_8bit=True when calling the from_pretrained method.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_4bit=True,
    device_map="auto",
)
...
You can also play with the advanced usage with the BitsAndBytesConfig class.
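As an illustration, here is a sketch of what such a configuration might look like with the 4-bit NormalFloat and double-quantization options from the QLoRA paper. The values chosen below are assumptions to adapt to your own setup, and the snippet assumes a CUDA GPU with recent versions of transformers, bitsandbytes and accelerate installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed settings inspired by the QLoRA recipe; adjust them to your hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the weights to 4-bit
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
)

# The configuration object is passed instead of the load_in_4bit flag.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Keeping the compute dtype in bfloat16 is a common choice on recent GPUs; on older hardware, torch.float16 is the usual fallback.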
Before training on top of a quantized model, you also need to:
Freeze the quantized parameters to prevent any training,
Cast all layer norms and the LM head (which haven’t been quantized) to FP32 to ensure the stability of your model,
Call model.enable_input_require_grads() in case you use gradient checkpointing.
import torch

# Freeze the base model's layers
for name, param in model.named_parameters():
    param.requires_grad = False

# Cast all non-int8 or int4 parameters to FP32
for param in model.parameters():
    if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
        param.data = param.data.to(torch.float32)

if use_gradient_checkpointing:
    # For backward compatibility
    model.enable_input_require_grads()
This preparation is now handled by the prepare_model_for_kbit_training() method of the peft library (version 0.4.1 at the time of writing) (source code).
Now that you have an overview of Gradient Checkpointing, LoRA, and Quantization, let’s write the code to prepare an LLM from the Hugging Face hub for fine-tuning.
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    prepare_model_for_kbit_training,
)
# Import the model
gradient_checkpointing = True
model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    use_cache=not gradient_checkpointing,  # the cache must be disabled for gradient checkpointing
    device_map="auto",
    load_in_4bit=True,
)
# Prepare the model (freeze, cast to FP32, enable_input_require_grads, activate gradient checkpointing)
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=gradient_checkpointing,
)
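# Prepare the PEFT model by adding LoRA
# `modules` is the list of layer names to adapt, e.g. the names collected
# with the find_all_linear_names(args, model) helper shown earlier
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=modules,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, peft_config)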
Your model is now ready to be fine-tuned with optimal GPU memory management!
Side note
Hugging Face has made the process much easier by creating SFTTrainer, a subclass of Trainer that handles everything we have talked about until now.
In a few lines of code, you can prepare your model for efficient fine-tuning with quantization and LoRA.
In this article, we addressed a challenge that arises during the fine-tuning of Large Language Models: how to fit the training on a single GPU.
We focused on 3 techniques to reduce the memory footprint: gradient checkpointing, LoRA, and Quantization (which led us to QLoRA).
We then saw how to apply these techniques to our code by leveraging the Hugging Face implementations with PEFT, BitsAndBytes, and Transformers.
The goal of this article was to provide a deep, yet simple, view of the existing techniques you can leverage to fine-tune your own LLMs in your projects.
I’m convinced that using a technique already implemented in a library is one thing, but knowing what it does and when to use it will make you a better ML Engineer / Data Scientist!
I hope you enjoyed reading it!
I didn’t think I would dig so deep into papers and code repositories to write this article. But overall it was worth it since I learned so much in the process!