Introduction to AI Model Quantization Formats

Demystifying Model Quantization: Enhancing AI Deployment and Efficiency

When downloading models from HuggingFace, you might often notice terms like fp16, GPTQ, or GGML in the model names. For those unfamiliar with model quantization, these terms might seem puzzling. I was also confused at first, but after some research, I gained a better understanding. This article will introduce some common model quantization formats. As I am not a machine learning expert, this article will only provide a basic overview of these formats. Corrections are welcome if there are any inaccuracies.

What is Quantization?

In AI models, particularly deep learning models, quantization typically refers to converting model parameters (such as weights and biases) from floating-point numbers to lower bit-width integers, like converting from 32-bit floating numbers to 8-bit integers. In layman’s terms, quantization is akin to simplifying a detailed book written in advanced vocabulary into a brief summary or a children’s story. This summary or children’s story takes up less space and is easier to disseminate, but it might lose some details from the original book.

Why Quantize?

The primary purposes of quantization are as follows:

Reducing Storage Requirements: Quantized models significantly reduce in size, making them easier to deploy on devices with limited storage resources, such as mobile or embedded systems.
Speeding Up Computation: Integer operations are generally faster than floating-point operations, especially on devices without dedicated floating-point hardware support.
Reducing Energy Consumption: On some hardware, integer operations consume less energy.

However, quantization has a downside: it might lead to a decrease in model accuracy. By representing original floating numbers with lower precision, some information may be lost, meaning the model’s capability could degrade. To balance this accuracy loss, researchers have developed various quantization strategies and techniques, such as dynamic quantization and weight sharing, to minimize the loss in model capability while reducing the model’s requirements as much as possible. For instance, if a model’s full capability is 100, and its size and memory requirements for inference are also 100, after quantization, the model’s capability might decrease to 90, but its size and memory requirements might drop to 50. This is the purpose of quantization.

FP16/INT8/INT4

On HuggingFace, if a model name doesn’t have a specific label, like Llama-2-7b-chat or chatglm2-6b, it generally indicates that these models are full-precision (FP32, though some are half-precision FP16). However, if a model name includes terms like fp16, int8, int4, such as Llama-2-7B-fp16, chatglm-6b-int8, or chatglm2-6b-int4, which means these models are quantized. Here, fp16, int8, and int4 indicate the quantization precision of the model.

The order of quantization precision from high to low is: fp16>int8>int4. The lower the quantization precision, the smaller the model size and the memory required for inference, but the weaker the model's capability.

Take ChatGLM2-6B as an example. The full-precision (FP32) version of this model is 12G in size, and the memory required for inference is about 12-13G. However, the quantized INT4 version is only 3.7G in size, with memory requirements for inference dropping to 5G. As can be seen, quantization significantly reduces both the model size and memory requirements.

FP32 and FP16 precision models need to run on GPU servers, whereas INT8 and INT4 precision models can run on CPUs.

GPTQ

GPTQ is a method of model quantization that can quantize language models to INT8, INT4, INT3, or even INT2 precision without significant performance loss. If you see model names withGPTQ tags such as Llama-2-13B-chat-GPTQon HuggingFace, which means these models have been quantized using GPTQ. For instance, the full-precision version Llama-2-13B-chat is 26G in size, but when quantized to INT4 precision using GPTQ, the size is reduced to 7.26 G.

If you are using the open-source model LLama, you can use the [GPTQ-for-LLaMA](https://github.com/qwopqw

op200/GPTQ-for-LLaMa) library for GPTQ quantization, which can quantize the related Llama models to INT4 precision.

However, a more popular GPTQ quantization tool now is AutoGPTQ, which can quantize any Transformer model, not just Llama. Huggingface has already integrated AutoGPTQ into Transformers, and you can refer to this guide for its usage.

GGML

Before discussing GGML, it’s important to mention the llama-cpp project. Developed by Georgi Gerganov, it’s a pure C/C++ version of the Llama model, with the primary advantage of fast inference on CPUs without needing GPUs. The author then extracted the model quantization part of this project to create a model quantization tool: GGML, with the GG in the project name representing the initials of the author's name.

On HuggingFace, if you see model names with GGML tags such asLlama-2-13B-chat-GGML, which means these models have been quantized using GGML. Some GGML models also include terms like q4, q4_0, q5, etc., such as Chinese-Llama-2-7b-ggml-q4. The q4 here refers to the GGML quantization method, extending from q4_0 onwards to q4_0, q4_1, q5_0, q5_1, and q8_0. You can see the data of various methods after quantization here.

GGUF

Recently, some models on HuggingFace have been spotted with GGUF tags, like Llama-2-13B-chat-GGUF. GGUF is a new feature added by the GGML team. Compared to GGML, GGUF can add additional information to the model, which was not possible with the original GGML models. GGUF is also designed to be extensible, allowing new features to be added to the model in the future without breaking compatibility with older models.

However, this feature is a Breaking Change, meaning that models quantized with newer versions of GGML will be in the GGUF format. This implies that the old GGML format will gradually be replaced by GGUF, and the old GGML format cannot be directly converted to GGUF.

For more information on GGUF, refer to this discussion.

GPTQ vs GGML

GPTQ and GGML are currently the two main methods of model quantization, but what are the differences between them? Which one should you choose?

Here are some key points of comparison:

GPTQ performs faster on GPUs, whereas GGML is faster on CPUs.
For quantized models of the same precision, GGML models are slightly larger than GPTQ models, but their inference performance is roughly equivalent.
Both can quantize Transformer models on HuggingFace.

Therefore, if your model runs on a GPU, GPTQ is recommended for quantization. If your model runs on a CPU, GGML is recommended.

Groupsize

On HuggingFace, regardless of the quantization format, model names often include terms like 32g, 128g, for example, pygmalion-13b-4bit-128g. What do these terms mean?

The g in 128g stands for groupsize. In quantization technology, weights might be divided into groups of a certain size (groupsize), with specific quantization strategies applied to each group. This approach can help improve quantization effects or maintain model performance.

The values for groupsize are 1024, 128, and 32, with GPTQ’s default groupsize being 1024. If no value is assigned to groupsize, it defaults to -1 (note that this is not 0). Groupsize impacts the model's accuracy and the memory size required for inference. The order of accuracy and memory consumption for models of the same precision, from highest to lowest, is: 32 > 128 > 1024 > None(-1). This means that None(-1) has the lowest accuracy and memory consumption, while 32 has the highest.

Conclusion

This article summarizes the common quantization formats of models on HuggingFace. Quantization technology is a crucial aspect of AI model deployment, significantly reducing the size of the model and the memory required for inference. To truly integrate large language models into the lives of ordinary people, enabling them to run on everyone’s smartphones and achieve real ‘ubiquity,’ mastering some quantization techniques is essential.

Follow me to learn about various artificial intelligence and AIGC new technologies. You can exchange ideas and leave comments or questions in the comment section.