Florian June

Summary

QLoRA is a novel approach that combines model quantization with LoRA parameter fine-tuning to enable efficient training and inference of large language models on consumer-grade GPUs.

Abstract

The QLoRA technique represents a significant advance in machine learning by enabling the fine-tuning of large language models (LLMs) with up to 65 billion parameters on a single 48GB GPU. This is achieved through a combination of 4-bit model quantization, which includes methods such as NormalFloat (NF4) quantization and Double Quantization, and Low-Rank Adaptation (LoRA) for parameter updates. The core idea behind QLoRA is to load the model at reduced precision, dequantize the weights during training, and use LoRA to update only a small subset of parameters, thus significantly reducing memory requirements. The paper discusses the NF4 quantization method in detail, which is designed to handle the distribution of pre-trained model weights more effectively than traditional quantization methods. Additionally, QLoRA employs a paged optimizer to manage memory spikes, block-wise quantization to limit the impact of outliers, and Double Quantization to keep the memory overhead of the per-block quantization constants small. These improvements allow models with billions of parameters to be trained on GPUs with limited memory, democratizing access to LLM fine-tuning.

Opinions

  • The authors of QLoRA emphasize the importance of reducing the threshold for fine-tuning large language models, making it more accessible and cost-effective.
  • There is an opinion that traditional quantization methods do not fully utilize the available quantization levels and lose the original differences or information, which QLoRA aims to address.
  • The paper suggests that Quantile Quantization, while effective, is computationally expensive, and QLoRA provides a more general and efficient solution.
  • The use of NF4 Quantization is presented as a superior method for quantizing model weights, particularly because it is tailored to the normal distribution of pre-trained parameters.
  • The authors believe that the paged optimizer is a necessary component for training large models on GPUs with limited memory, for example fitting a 33-billion-parameter LLaMA model on a 24GB GPU.
  • The reference to LoRA indicates that the authors consider it a key technique for reducing the computational cost of fine-tuning large models by focusing on training a smaller set of parameters.
  • The conclusion of the article implies that QLoRA is gaining traction in the open-source community and that frameworks like transformers and bitsandbytes facilitate its adoption for fine-tuning tasks.

QLoRA: Key Quantization and Fine-tuning Techniques in the Era of Large Language Models

QLoRA combines both model quantization and LoRA parameter fine-tuning methods. By applying QLoRA, it is possible to:

  • Fine-tune a 65B-parameter LLM on a single 48GB GPU.
  • Fine-tune a 33B LLaMA model on a single 24GB GPU, which means it can be done on a consumer-grade graphics card such as an Nvidia RTX 4090 or 3090.

QLoRA greatly lowers the barrier to fine-tuning. In an era of expensive compute, it cuts GPU memory requirements many times over with almost no loss of performance, making both inference and training of large language models far more accessible.

Core Idea

QLoRA[1] loads the model at 4-bit precision; during training, the values are dequantized to BF16 for computation. By using LoRA[2], the original model parameters are frozen and do not participate in training, which significantly reduces the GPU memory required for training.

The important improvements of QLoRA are shown in Figure 1:

Figure 1: Different finetuning methods and their memory requirements. QLORA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes. Source: [1]

This includes:

  • 4-bit NormalFloat (NF4) Quantization: a new data type, NF (NormalFloat), that is information-theoretically optimal for weights that follow a normal distribution. The NF type also helps mitigate the impact of outliers.
  • Double Quantization: a second quantization applied to the quantization constants themselves, further reducing memory overhead.
  • LoRA: Low-Rank Adaptation of large language models[2].
  • Paged Optimizer: paging optimizer states out of GPU memory to handle memory spikes during training.

Next, this article will discuss each of these improvements.

NF4 Quantization

K-bit Absmax Quantization

The main task of model quantization in deep learning is to convert high-precision floating-point numbers in neural networks into low-precision numbers.

It typically involves converting data types with more bits to data types with fewer bits, such as converting from a 32-bit floating-point number to an 8-bit integer.

To ensure the entire range of the lower-bit data type is utilized, the input data type is commonly rescaled into the target data type range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor. For example, the process of quantizing a 32-bit floating-point (FP32) tensor to an Int8 tensor with a range of [-127, 127] is as follows:

X^Int8 = round( (127 / absmax(X^FP32)) × X^FP32 ) = round( c^FP32 × X^FP32 )

where c^FP32 = 127 / absmax(X^FP32) is the quantization constant, or quantization scale.

For example, let's assume the tensor to be quantized is x = [-3.1, 1.4, 2.0, 99.0]. The absolute maximum is 99.0, so c = 127 / 99.0 ≈ 1.28. Scaling x into the range [-127, 127] and rounding gives:

x_quant = round(1.28 × [-3.1, 1.4, 2.0, 99.0]) = [-4, 2, 3, 127]

The most obvious drawback of this quantization method is that the overall distribution of the quantized values differs significantly from the original distribution. If there is a single large value among the values to be quantized, most of the quantized values are squeezed into a narrow band around 0, which significantly degrades performance: the quantization does not fully utilize the available quantization levels and loses the original differences between values.
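To make the mechanics concrete, here is a minimal sketch of absmax Int8 quantization and dequantization (the function names are illustrative, not from any library); note how the outlier 99.0 forces the other values into a narrow band of integers:

import torch

def absmax_quantize_int8(x: torch.Tensor):
    # quantization constant: c = 127 / absmax(x)
    c = 127.0 / x.abs().max()
    q = torch.round(c * x).to(torch.int8)   # quantized values lie in [-127, 127]
    return q, c

def absmax_dequantize_int8(q: torch.Tensor, c: torch.Tensor):
    # invert the scaling to recover an approximation of the original values
    return q.to(torch.float32) / c

x = torch.tensor([-3.1, 1.4, 2.0, 99.0])
q, c = absmax_quantize_int8(x)
print(q)                                    # tensor([ -4,   2,   3, 127], dtype=torch.int8)
print(absmax_dequantize_int8(q, c))         # roughly [-3.12, 1.56, 2.34, 99.00]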

To mitigate the impact of outliers, one approach is to divide the input tensor into blocks that are quantized independently, each with its own quantization constant c; this will be discussed in the section on Double Quantization. Another approach is, through clever design, to make better use of the limited 256 (8-bit) or 16 (4-bit) representable values of the target data type.

Quantile Quantization

So how can we make full use of these limited numbers?

The most intuitive idea is to start from the distribution of the input data: sort all the values to be quantized in ascending order and divide them into sixteen equal-sized parts. The smallest part is mapped to the first quantized value, the next part to the second quantized value, and so on, so that the original data is spread evenly across the quantized values.

This method is called Quantile Quantization; the familiar median, for instance, is simply the 0.5 quantile. This quantization scheme ensures that the distribution of the quantized data stays as close as possible to the original distribution.

However, in practical applications, the cost of Quantile Quantization is high because for each batch of values to be quantized, corresponding quantiles need to be calculated based on their distribution.
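As a rough illustration of why this is costly (a sketch of the idea, not the paper's implementation), a 4-bit quantile codebook can be built from the empirical quantiles of each batch, and it must be recomputed whenever the data changes:

import torch

def quantile_codebook(x: torch.Tensor, k: int = 4):
    # 2^k bin edges at evenly spaced probabilities, then take the bin centers
    probs = torch.linspace(0, 1, 2**k + 1)
    midpoints = (probs[:-1] + probs[1:]) / 2      # centers of the 16 equal-mass bins
    return torch.quantile(x, midpoints)           # 16 codebook values for this batch

x = torch.randn(4096)                             # one batch of values to quantize
codebook = quantile_codebook(x)                   # must be recomputed for every new batch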

4-bit NormalFloat(NF4)

Given the issues with naive Quantile Quantization, a natural approach is to find a more general method for dividing quantiles that doesn’t require re-computation for each batch of data.

According to the QLoRA paper, pre-trained parameters generally follow a zero-centered normal distribution with standard deviation σ. By scaling by σ, all weights can be transformed into a single fixed distribution that fits exactly the [-1, 1] range used by QLoRA.

Motivated by this, QLoRA computes the values qj from the quantiles of the standard normal distribution.

The remaining problem is how to compute these 16 quantile values.

The specific process is as follows[3]:

import torch
from scipy.stats import norm


def create_normal_map(offset=0.9677083, use_extra_value=True):

    if use_extra_value:
        # one more positive value, this is an asymmetric type
        v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
        v2 = [0]*(256-15) ## we have 15 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()
    else:
        v1 = norm.ppf(torch.linspace(offset, 0.5, 8)[:-1]).tolist()
        v2 = [0]*(256-14) ## we have 14 non-zero values in this data type
        v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()

    v = v1 + v2 + v3

    values = torch.Tensor(v)
    values = values.sort().values
    values /= values.max()

    assert values.numel() == 256

    return values

The three key elements of this code are:

  • norm.ppf is the percent point function of the normal distribution, i.e., the inverse of the cumulative distribution function (CDF). It takes a probability value between 0 and 1 and returns the corresponding point on the x-axis. For example, norm.ppf(0.5) returns approximately 0.0, because in the standard normal distribution 50% of the values are less than 0, and norm.ppf(0.9) returns approximately 1.28, because around 90% of the values are less than 1.28.
  • The function torch.linspace(offset, 0.5, n) generates a sequence of evenly spaced values starting from offset and ending at 0.5, with a total of n elements. This sequence is used as the input for the norm.ppf function.
  • The use_extra_value parameter determines the number of non-zero values in the mapping table. If it is True, the mapping table will have 15 non-zero values; otherwise, it will have 14 non-zero values.

The function create_normal_map proceeds through the following steps (a quick numerical check follows the list):

  • Set offset = 0.9677083
  • Generate 8 probability values: torch.linspace(offset, 0.5, 9)[:-1] = tensor([0.9677, 0.9092, 0.8508, 0.7923, 0.7339, 0.6754, 0.6169, 0.5585])
  • Find their pre-images under the standard Gaussian CDF, resulting in v1 = [1.8481308221817017, 1.3361188173294067, 1.0397897958755493, 0.8144894242286682, 0.6245115995407104, 0.4548477232456207, 0.29742005467414856, 0.14707496762275696]
  • Generate 256 - 15 zeros, forming v2
  • Calculate 7 probability values: torch.linspace(offset, 0.5, 8)[:-1] = tensor([0.9677, 0.9009, 0.8341, 0.7673, 0.7004, 0.6336, 0.5668])
  • Find their pre-images under the Gaussian CDF, resulting in v3 = [-1.8481308221817017, -1.2866554260253906, -0.9704037308692932, -0.7298591732978821, -0.5256847143173218, -0.34148547053337097, -0.1682722121477127]
  • Normalize v1 + v2 + v3 to the range [-1, 1] to obtain the final values.
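As a quick sanity check (an illustration, not part of the original code), the 16 NF4 values can be recovered from the 256-entry map returned above by dropping the padding zeros and keeping a single zero:

values = create_normal_map()                      # 256 entries, normalized to [-1, 1]
nf4 = values[values != 0]                         # the 15 non-zero NF4 values
nf4 = torch.cat([nf4, torch.zeros(1)]).sort().values
print(nf4)                                        # 16 values, matching Figure 3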

I drew a picture to illustrate the generation process of v1 and v3, in order to facilitate better understanding. Please refer to Figure 2 below:

Figure 2: The generation process of v1 and v3

From Figure 2, it can be seen that offset = 0.9677 truncates an equal tail probability (1 - 0.9677 = 0.0323) on each side of the normal distribution, giving the generated mapping table equal coverage on both sides and making the quantization more uniform and balanced.

The paper provides the pre-calculated quantiles (with the padded zeros removed), as shown in Figure 3:

Figure 3: Final values of NF4 data type. Source: [1]

The standard Gaussian distribution is divided by these points, as shown in Figure 4:

Figure 4: Standard Gaussian distribution is divided by values of NF4 data type

In summary, the purpose of this function is to create the 16 values of the NF4 data type, padded with zeros (256 - 16 of them) so that they can be used within an 8-bit quantization function. The bitsandbytes library uses this 8-bit quantization machinery to "emulate" NF4.

Note that the values are computed in two asymmetric groups (v1 with 8 values and v3 with 7) because a symmetric k-bit quantization has no exact representation of zero, and an exact zero is an important property for quantizing padding and other zero-valued elements with no error.

The NF4 grid points are placed according to the quantiles of the Gaussian distribution: sparse at both tails and dense in the middle, so the distribution of NF4 points matches the distribution of the data. Compared to naive uniform partitioning, the grid points are allocated far more efficiently, while the loss in accuracy is small.

After obtaining the NF4 quantiles, each value w to be quantized is first downscaled by the quantization constant c (the block's absolute maximum) and then mapped to the nearest qj to obtain its quantized value.
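Below is a minimal sketch of this mapping step for a single block (illustrative, not the bitsandbytes kernel); codebook stands for the 16-value NF4 table, e.g. the nf4 tensor constructed earlier:

import torch

def nf4_quantize(w: torch.Tensor, codebook: torch.Tensor):
    c = w.abs().max()                             # absmax quantization constant for this block
    scaled = w / c                                # scale the block into [-1, 1]
    idx = (scaled.unsqueeze(-1) - codebook).abs().argmin(dim=-1)  # nearest codebook entry
    return idx.to(torch.uint8), c                 # 4-bit indices (stored here in uint8) plus c

def nf4_dequantize(idx: torch.Tensor, c: torch.Tensor, codebook: torch.Tensor):
    return codebook[idx.long()] * c               # look up the NF4 value and rescale

w = torch.randn(64)                               # one block of weights
idx, c = nf4_quantize(w, nf4)
w_hat = nf4_dequantize(idx, c, nf4)               # approximate reconstruction of w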

Double Quantization

Block-wise Quantization

We know that the essence of quantization is to map values from a larger range to a smaller range. We can use a constant c to proportionally reduce the values. In this way, we can easily use the same constant c to dequantize the quantized values back to their original (approximate) form.

However, if the data contains outliers, they distort the choice of c and cause the remaining values to collapse into a small range. Block-wise quantization solves this by quantizing one block at a time, with each block using its own independent quantization constant c.

Since quantization constants are typically stored as FP32, the memory usage can become significant when there are a large number of blocks.

The approach of QLoRA

QLoRA divides the parameters into blocks of size 64, and each block computes its own quantization constant c. QLoRA then applies Double Quantization: the quantization constants themselves are quantized to FP8, with a block size of 256, further reducing memory consumption.

  • Before Double Quantization, each parameter requires an additional 32 / 64 = 0.5 bits of memory for its quantization constant.
  • After Double Quantization, each parameter requires only an additional 8 / 64 + 32 / (64 × 256) ≈ 0.127 bits of memory (verified in the short calculation below).
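The overhead figures quoted above follow from a one-line calculation:

before = 32 / 64                                  # one FP32 constant per 64-parameter block
after = 8 / 64 + 32 / (64 * 256)                  # FP8 constant per block + FP32 constant per 256 blocks
print(before, after)                              # 0.5 and about 0.127 extra bits per parameter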

Paged Optimizer

The paged optimizer mechanism allows optimizer states to be transferred to CPU memory when GPU memory is limited and loaded back when the optimizer state needs to be updated, which effectively reduces peak GPU memory occupancy. The QLoRA paper states that this mechanism is necessary to train a model with 33 billion parameters on a 24GB GPU.

This mechanism can be enabled simply by setting the TrainingArguments parameter optim = 'paged_adamw_32bit'[4].
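For example (the other arguments are illustrative placeholders; optim is the only setting relevant here):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./qlora-output",        # illustrative path
    per_device_train_batch_size=4,      # illustrative value
    optim="paged_adamw_32bit",          # enables the paged AdamW optimizer from bitsandbytes
)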

LoRA

The mechanism of LoRA[2] is not the focus of this article, so I won't go into much detail here. In simple terms, fine-tuning effectively learns an update ΔW on top of the original base parameters, where ΔW has the same shape and size as the original weight matrix W.

However, if we decompose the update as ΔW = BA, the matrices A and B together contain far fewer parameters than W. The idea of LoRA is to train A and B instead of the whole W while keeping the original model frozen, which greatly reduces the required computational cost.
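For intuition, here is a minimal sketch of the idea (an illustration, not the peft implementation), wrapping a frozen linear layer with a trainable low-rank update ΔW = BA:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze the original W (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A: r x k, small random init
        self.B = nn.Parameter(torch.zeros(d, r))          # B: d x r, zero init so ΔW = BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # base output plus the low-rank update: (x A^T) B^T == x (BA)^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))                 # only A and B receive gradients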

Conclusion

This article mainly introduces the key points of QLoRA, focusing on the ideas and quantization process of NF4.

In fact, more and more open-source projects are starting to use QLoRA for fine-tuning large models.

Moreover, with frameworks like transformers and bitsandbytes, it is convenient to set up fine-tuning in the QLoRA manner.
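As a sketch of such a setup (the model id, target modules, and hyperparameters are placeholders; the classes shown are the standard transformers, bitsandbytes, and peft APIs):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_use_double_quant=True,         # double quantization of the constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for computation
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # placeholder target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA matrices A and B are trainable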

Finally, if there are any omissions or errors in this article, please kindly provide feedback. Thank you.

References

[1]: T. Dettmers, A. Pagnoni, A. Holtzman and L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314. 2023.

[2]: E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. 2021.

[3]: T. Dettmers. bitsandbytes: create_normal_map.

[4]: T. Dettmers, A. Pagnoni, A. Holtzman and L. Zettlemoyer. QLoRA GitHub Project.
