QLoRA: Key Quantization and Fine-tuning Techniques in the Era of Large Language Models
QLoRA combines model quantization with LoRA parameter-efficient fine-tuning. By applying QLoRA, it is possible to:
- Fine-tune a 65B-parameter LLM on a single 48GB GPU.
- Fine-tune a 33B LLaMA model on a single 24GB GPU, which means a 33B model can be fine-tuned on a consumer-grade graphics card such as an NVIDIA RTX 4090 or 3090.
QLoRA greatly lowers the barrier to fine-tuning. In an era of expensive compute, it can reduce GPU memory usage many times over with almost no loss of performance, significantly lowering the threshold for both training and inference of large language models.
Core Idea
QLoRA[1] loads the model in 4-bit precision; during training, the values are dequantized to bf16 on the fly for computation. With LoRA[2], the original model parameters are frozen and do not participate in training, which significantly reduces the GPU memory required for training.
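As a concrete illustration, here is a minimal sketch of this setup using the Hugging Face transformers, bitsandbytes, and peft integration; the model name and LoRA hyperparameters are placeholders, and details may vary across library versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # use the NF4 data type
    bnb_4bit_use_double_quant=True,         # double-quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for computation
)

# Placeholder checkpoint; replace with the model you want to fine-tune.
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", quantization_config=bnb_config)

# Attach small trainable low-rank adapters; the frozen 4-bit base weights stay untouched.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)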
The important improvements of QLoRA are shown in Figure 1:
These include:
- 4-bit NormalFloat (NF4) Quantization: It adopts a new data type called NF (NormalFloat), which is theoretically optimal for weights that follow a normal distribution. Additionally, the NF type helps mitigate the impact of outliers.
- Double Quantization: A second quantization applied to the quantization constants produced by the first quantization step.
- LoRA: Low-Rank Adaptation of large language models[2]
- Paged Optimizer
Next, this article will discuss each of these improvements.
NF4 Quantization
K-bit Absmax Quantization
The main task of model quantization in deep learning is to convert high-precision floating-point numbers in neural networks into low-precision numbers.
It typically involves converting data types with more bits to data types with fewer bits, such as converting from a 32-bit floating-point number to an 8-bit integer.
To ensure the entire range of the lower-bit data type is utilized, the input data type is commonly rescaled into the target range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor. For example, quantizing a 32-bit floating-point (FP32) tensor X to an Int8 tensor with the range [-127, 127] is done as follows:

X_Int8 = round( (127 / absmax(X_FP32)) * X_FP32 ) = round( c * X_FP32 )

where c = 127 / absmax(X_FP32) is the quantization constant or quantization scale. Dequantization is simply the inverse: X_FP32 ≈ X_Int8 / c.
For example, let’s assume the tensor to be quantized is x = [-3.1, 1.4, 2.0, 99.0]. We scale x to the range [-127, 127] with c = 127 / 99 ≈ 1.283, and the result is [-4, 2, 3, 127].
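The following short sketch reproduces this example in plain PyTorch (the function names are only illustrative, not part of any library):

import torch

def absmax_quantize_int8(x: torch.Tensor):
    # Quantization constant: scales the largest magnitude to 127.
    c = 127.0 / x.abs().max()
    x_int8 = torch.round(c * x).to(torch.int8)
    return x_int8, c

def absmax_dequantize(x_int8: torch.Tensor, c: torch.Tensor):
    # Dequantization only recovers an approximation of the original values.
    return x_int8.to(torch.float32) / c

x = torch.tensor([-3.1, 1.4, 2.0, 99.0])
x_q, c = absmax_quantize_int8(x)
print(x_q)                        # tensor([ -4,   2,   3, 127], dtype=torch.int8)
print(absmax_dequantize(x_q, c))  # roughly [-3.12, 1.56, 2.34, 99.00]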
The most obvious drawback of this quantization method is that the distribution of the quantized values can differ significantly from the original distribution. If the values to be quantized contain a large outlier, most of the quantized values end up squeezed into a narrow band around 0, which significantly degrades performance: the quantization does not fully utilize the available range of the low-bit data type and loses much of the original information.
To mitigate the effect of outliers, one approach is to divide the input tensor into independently quantized blocks, where each block has its own quantization constant c; this will be discussed in the section on Double Quantization. Another approach is clever design of the quantization levels themselves, making full use of the 256 (8-bit) or 16 (4-bit) representable values of the target data type.
Quantile Quantization
So how can we make full use of these limited numbers?
The most intuitive idea is to start from the distribution of the input data: sort all the values to be quantized in ascending order and divide them into sixteen equal-sized parts. The smallest part is mapped to the first quantization level, the second part to the second level, and so on. This ensures that each quantization level receives the same number of input values.
This method is called Quantile Quantization; the familiar median is simply the 0.5 quantile. Quantile quantization keeps the distribution of the quantized data as close as possible to the original distribution.
However, in practical applications, the cost of Quantile Quantization is high because for each batch of values to be quantized, corresponding quantiles need to be calculated based on their distribution.
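As a rough illustration of the idea (and of why it is costly), the sketch below builds the 16 levels from the empirical quantiles of a tensor; the helper names are hypothetical, and the quantiles must be recomputed for every new tensor:

import torch

def quantile_levels(x: torch.Tensor, k: int = 16) -> torch.Tensor:
    # Empirical quantiles at the midpoint of each of the k equal-probability bins.
    probs = (torch.arange(k, dtype=torch.float32) + 0.5) / k
    return torch.quantile(x, probs)

def quantile_quantize(x: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # Map each value to the index of its nearest level (a 4-bit code for k = 16).
    return torch.argmin((x.unsqueeze(-1) - levels).abs(), dim=-1)

x = torch.randn(1024)
levels = quantile_levels(x)          # must be recomputed for every tensor
codes = quantile_quantize(x, levels)
x_hat = levels[codes]                # dequantized approximation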
4-bit NormalFloat (NF4)
Given the issues with naive Quantile Quantization, a natural approach is to find a more general method for dividing quantiles that doesn’t require re-computation for each batch of data.
According to the QLoRA paper, pre-trained weights generally follow a zero-centered normal distribution with standard deviation σ. By rescaling with σ, every weight tensor can be transformed into a single fixed distribution that exactly fits the range targeted by NF4 (here, [-1, 1]).
Motivated by this, QLoRA computes the quantization levels q_j from the quantiles of the standard normal distribution, so they never need to be re-estimated from the data.
The remaining question is how to compute these 16 levels. The specific process, as implemented in bitsandbytes, is as follows[3]:
import torch
from scipy.stats import norm  # norm.ppf is the inverse CDF of the standard normal distribution

def create_normal_map(offset=0.9677083, use_extra_value=True):
    if use_extra_value:
        # one more positive value, this is an asymmetric type
        v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
        v2 = [0]*(256-15)  ## we have 15 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    else:
        v1 = norm.ppf(torch.linspace(offset, 0.5, 8)[:-1]).tolist()
        v2 = [0]*(256-14)  ## we have 14 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()

    v = v1 + v2 + v3
    values = torch.Tensor(v)
    values = values.sort().values
    values /= values.max()
    assert values.numel() == 256
    return values
Three key elements of this code are:
- norm.ppf is the percent point function of the standard normal distribution, i.e., the inverse of its cumulative distribution function (CDF). It takes a probability between 0 and 1 and returns the corresponding value on the x-axis. For example, norm.ppf(0.5) returns approximately 0.0, because in the standard normal distribution 50% of the values are less than 0; norm.ppf(0.9) returns approximately 1.28, because around 90% of the values are less than 1.28.
- torch.linspace(offset, 0.5, n) generates a sequence of n evenly spaced values starting at offset and ending at 0.5. This sequence is used as the input to norm.ppf.
- The use_extra_value parameter determines the number of non-zero values in the mapping table: if it is True, the table has 15 non-zero values; otherwise it has 14.
The function create_normal_map (with use_extra_value=True) follows these steps:
- Set offset = 0.9677083.
- Generate 8 probability values: torch.linspace(offset, 0.5, 9)[:-1] = tensor([0.9677, 0.9092, 0.8508, 0.7923, 0.7339, 0.6754, 0.6169, 0.5585]).
- Find their pre-images under the standard Gaussian CDF, resulting in v1 = [1.8481308221817017, 1.3361188173294067, 1.0397897958755493, 0.8144894242286682, 0.6245115995407104, 0.4548477232456207, 0.29742005467414856, 0.14707496762275696].
- Generate 256 - 15 zeros, forming v2.
- Generate 7 probability values: torch.linspace(offset, 0.5, 8)[:-1] = tensor([0.9677, 0.9009, 0.8341, 0.7673, 0.7004, 0.6336, 0.5668]).
- Find their pre-images under the standard Gaussian CDF and negate them, resulting in v3 = [-1.8481308221817017, -1.2866554260253906, -0.9704037308692932, -0.7298591732978821, -0.5256847143173218, -0.34148547053337097, -0.1682722121477127].
- Normalize v1 + v2 + v3 to the range [-1, 1] to obtain the final values.
I drew a picture to illustrate the generation process of v1 and v3, to make this easier to follow. Please refer to Figure 2 below:
From Figure 2, it can be seen that offset = 0.9677 ensures that the mapping table covers both sides of the normal distribution equally, leaving the same tail probability of 1 - 0.9677 = 0.0323 at each end, which makes the quantization more uniform and balanced.
The paper provides the pre-calculated quantiles (with the padded zeros removed), as shown in Figure 3:
The standard Gaussian distribution is divided by these points, as shown in Figure 4:
In summary, the purpose of this function is to create the 16 values of the NF4 data type, padded with 256 - 16 zeros so that they can be fed into an 8-bit quantization function. The bitsandbytes library uses an 8-bit quantization routine to “emulate” NF4.
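To see the 16 underlying NF4 levels explicitly, one can simply deduplicate the 256-entry table returned above (a small sanity check, not part of the library API):

nf4_levels = torch.unique(create_normal_map())  # sorted, 16 distinct values
print(nf4_levels.numel())  # 16
print(nf4_levels)          # spans [-1.0, 1.0], with 0.0 represented exactly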
Note that the values are computed in two asymmetric groups (v1 and v3) because a symmetric k-bit quantization has no exact representation of zero, and an exact zero is an important property for quantizing padding and other zero-valued elements with no error.
The NF4 levels are placed according to the quantiles of the Gaussian distribution: sparse at the two tails and dense near the center, so the distribution of the levels matches the distribution of the data. Compared to naive uniform partitioning, this allocates the grid points far more efficiently, while the loss in accuracy remains small.
After obtaining the NF4 quantiles, each value w to be quantized is first downscaled by the quantization constant c and then mapped to the nearest level q_j to obtain its quantized value.
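Here is a minimal sketch of this mapping for a single block, reusing the imports and the nf4_levels tensor from the snippets above (the real bitsandbytes kernels pack the 4-bit codes and are far more optimized):

def nf4_quantize_block(w: torch.Tensor, levels: torch.Tensor):
    # Per-block quantization constant: scale the block into [-1, 1].
    c = w.abs().max()
    w_scaled = w / c
    # Map each scaled value to the index of the nearest NF4 level (a 4-bit code).
    codes = torch.argmin((w_scaled.unsqueeze(-1) - levels).abs(), dim=-1)
    return codes, c

def nf4_dequantize_block(codes: torch.Tensor, c: torch.Tensor, levels: torch.Tensor):
    # Look up the levels and undo the scaling.
    return levels[codes] * c

w = torch.randn(64)                           # one block of 64 weights
codes, c = nf4_quantize_block(w, nf4_levels)
w_hat = nf4_dequantize_block(codes, c, nf4_levels)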
Double Quantization
Block-wise Quantization
We know that the essence of quantization is to map values from a larger range into a smaller one. We can use a constant c to scale the values down proportionally, and the same constant c can then be used to dequantize the quantized values back to (an approximation of) their original form.
However, if the data contains outliers, they distort the choice of c and squeeze the remaining values into a small range. Block-wise quantization addresses this by quantizing one block at a time, with each block using its own independent quantization constant c.
Since quantization constants are typically stored as FP32, the memory usage can become significant when there are a large number of blocks.
The approach of QLoRA
QLoRA divides the parameters into blocks of size 64, and each block gets its own quantization constant c. With Double Quantization, QLoRA then quantizes these quantization constants themselves to FP8, using a block size of 256, which further reduces memory consumption:
- Before Double Quantization, the quantization constants add an extra 32/64 = 0.5 bits of memory per parameter.
- After Double Quantization, they add only 8/64 + 32/(64*256) = 0.127 bits of memory per parameter, as verified by the small calculation below.
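These overhead figures follow directly from the block sizes, assuming one FP32 constant per 64-parameter block before Double Quantization, and one FP8 constant per block plus one FP32 second-level constant per 256 first-level constants afterwards:

block_size_1 = 64    # parameters per first-level quantization block
block_size_2 = 256   # first-level constants per second-level block

bits_before = 32 / block_size_1                                      # 0.5 bits per parameter
bits_after = 8 / block_size_1 + 32 / (block_size_1 * block_size_2)   # ~0.127 bits per parameter
print(bits_before, bits_after)  # 0.5 0.126953125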
Paged Optimizer
The paged optimizer mechanism transfers optimizer states to CPU memory when GPU memory runs short, and loads them back when the optimizer states need to be updated. This effectively reduces the peak occupancy of GPU memory. The QLoRA paper states that this mechanism is necessary to train a 33-billion-parameter model on a 24GB GPU.
It can be enabled simply by setting the TrainingArguments parameter optim = 'paged_adamw_32bit' [4].
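For example, a minimal configuration sketch with the Hugging Face Trainer (all other arguments are placeholders):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qlora-output",         # placeholder output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",         # enable the paged AdamW optimizer
)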
LoRA
The mechanism of LoRA[2] is not the focus of this article, so I won’t go into much detail here. In simple terms, fine-tuning actually learns an update ΔW on top of the original base parameters; ΔW has the same shape and size as the original weight matrix W.
However, if we decompose it as AB = ΔW, the matrices A and B contain far fewer parameters than W. The idea of LoRA is to train A and B instead of the entire W while freezing the original model, which greatly reduces the required computational cost.
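A minimal sketch of this idea in PyTorch (the dimensions and rank are arbitrary; real implementations such as peft also add a scaling factor and dropout):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        # Freeze the original weights; only A and B are trained.
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.zeros(base.out_features, r))        # zero init, so AB = 0 at the start
        self.B = nn.Parameter(torch.randn(r, base.in_features) * 0.01)

    def forward(self, x):
        # Frozen base output plus the low-rank update x (AB)^T.
        return self.base(x) + x @ self.B.t() @ self.A.t()

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
y = layer(torch.randn(2, 4096))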
Conclusion
This article mainly introduces the key points of QLoRA, focusing on the ideas and quantization process of NF4.
In fact, more and more open-source projects are starting to use QLoRA for fine-tuning large models.
Moreover, with frameworks like transformers and bitsandbytes, it is straightforward to set up QLoRA-style fine-tuning.
Finally, if there are any omissions or errors in this article, please kindly provide feedback. Thank you.
References
[1]: T. Dettmers, A. Pagnoni, A. Holtzman and L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314. 2023.
[2]: E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. 2021.
[3]: T. Dettmers. bitsandbytes: create_normal_map.
[4]: T. Dettmers, A. Pagnoni, A. Holtzman and L. Zettlemoyer. QLoRA Github Project.