Benjamin Marie

Summary

The website content discusses the integration of AutoGPTQ for quantizing large language models (LLMs) within the Hugging Face Transformers and TRL libraries, facilitating efficient and cost-effective model inference and fine-tuning.

Abstract

The article highlights the recent advancements in quantizing large language models (LLMs) using GPTQ, now simplified through Hugging Face's Transformers and TRL libraries. AutoGPTQ, an efficient implementation of the GPTQ algorithm, allows for quantization with 4-bit, 3-bit, or 2-bit precision, resulting in faster inference times and serializable models. The process enables users to load, serialize, quantize, and fine-tune LLMs such as Llama 2. Despite the high VRAM requirements for quantization, which may necessitate the use of cloud computing resources like Google Colab PRO's A100 GPUs, smaller models can be quantized on consumer hardware. The article also provides code snippets for quantization and serialization using the Transformers library, and it points to additional resources for those interested in further exploration, including quantized models available on the Hugging Face Hub.

Opinions

  • The author suggests that GPTQ quantization is superior to other methods like bitsandbytes nf4 due to its serializability and faster inference.
  • The author notes that while quantization with GPTQ is resource-intensive, it is still cost-effective, especially for smaller models that can be processed on consumer hardware.
  • The author recommends using the default "c4" dataset for calibration, indicating it yields reasonable results.
  • The author advises setting the desc_act option to True if inference speed is not a priority, to maintain better perplexity.
  • The author finds the disable_exllama option confusing and suggests it should be set to True when planning to use the model on configurations with limited VRAM that require splitting the model across multiple devices.
  • The author has uploaded quantized versions of Llama 2 7B models to the Hugging Face Hub for public use.
  • The author encourages readers to consult their newsletter, The Kaitchup, for the full article and additional context on quantization and fine-tuning LLMs with GPTQ.

Quantize LLMs with GPTQ Using Hugging Face Transformers

GPTQ is now much easier to use

Many large language models (LLMs) on the Hugging Face Hub are quantized with AutoGPTQ, an efficient and easy-to-use implementation of GPTQ.

GPTQ quantization has several advantages over other quantization methods such as bitsandbytes nf4. For instance, GPTQ models are serializable and faster for inference. You will find a detailed comparison between GPTQ and bitsandbytes quantizations in my previous article:

LLMs quantized with AutoGPTQ are fast and efficient, but there was one obstacle to their widespread adoption: They weren’t natively supported by Hugging Face libraries.

This is not the case anymore. Last week, Hugging Face announced that Transformers and TRL now natively support AutoGPTQ.

With Transformers and TRL, you can:

  • Quantize an LLM with GPTQ with 4-bit, 3-bit, or 2-bit precision
  • Load a GPTQ LLM from your computer or the HF hub
  • Serialize a GPTQ LLM
  • Fine-tune a GPTQ LLM

In this article, I show you how to quantize an LLM with Transformers. If you are interested in fine-tuning an LLM quantized with GPTQ, I did it here:

The notebook to reproduce my experiments is available here (notebook #12):

4, 3, and 2-bit quantizations with Transformers GPTQ

GPTQ requires a lot of GPU VRAM. Quantizing Llama 2 7B isn’t possible on consumer hardware. In my experiments, the VRAM consumption peaked at 33 GB while it only used 6 GB of CPU RAM. I had to use the A100 of Google Colab PRO, which has 40 GB of VRAM.

Quantization with GPTQ is also slow. It took 35 min with one A100, which cost approximately $0.75. The quantization speed and VRAM/RAM consumption are the same for the 4-bit, 3-bit, and 2-bit precisions. Note: I quantized with 3-bit and 2-bit precisions just out of curiosity. As we will see below, the models quantized at these precisions are almost useless.

While you can’t quantize Llama 2 with GPTQ on consumer hardware, doing it with cloud computing is still quite cheap. Smaller models (<4B parameters) can be quantized on consumer hardware (less than 24 GB of VRAM, e.g., with an RTX 3090).
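
To put the three precisions in perspective, here is a rough sketch of how much storage the weights alone need at each bit width, next to the usual 16-bit baseline. It assumes a nominal 7 billion parameters for Llama 2 7B and ignores the quantization metadata (scales and zero points) and any layers kept in higher precision, so real checkpoints are somewhat larger. It says nothing about the peak VRAM needed during the quantization itself.

```python
# Back-of-the-envelope weight storage at different precisions.
# Assumption: a nominal 7e9 parameters; quantization metadata is ignored.

def weight_size_gb(num_params: float, bits: int) -> float:
    """Return the storage needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits / 8 / 1e9

NUM_PARAMS = 7e9  # nominal size, not the exact parameter count

# 16-bit is the half-precision baseline; 4, 3, and 2 are the GPTQ precisions
for bits in (16, 4, 3, 2):
    print(f"{bits:>2}-bit: {weight_size_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions, the 4-bit weights take a quarter of the space of the 16-bit ones.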

To quantize with GPTQ, I installed the following libraries:

pip install transformers optimum accelerate auto-gptq

The quantization and serialization with Transformers are done as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# 4-bit GPTQ, calibrated on the c4 dataset tokenized with the model's tokenizer
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
# Passing the config triggers the quantization while the model is loaded
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
# Serialize the quantized model and its tokenizer to the Hub
model.push_to_hub("kaitchup/Llama-2-7b-gptq-4bit")
tokenizer.push_to_hub("kaitchup/Llama-2-7b-gptq-4bit")

Note: Of course, you will need to change the repository specified in “push_to_hub”. Alternatively, you may call model.save_pretrained to save your model locally.
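
The local alternative mentioned in the note could look like the sketch below. save_pretrained and from_pretrained are standard Transformers calls, but the helper names and the directory naming scheme are hypothetical. Running the quantization requires a GPU and access to the Llama 2 weights, so the heavy imports are kept inside the functions.

```python
# Sketch: quantize, save to a local directory, and reload later.
# The helper names and the directory layout are illustrative assumptions.

def local_dir(model_id: str, bits: int) -> str:
    """Derive a local directory name from the model ID and the precision."""
    return f"{model_id.split('/')[-1]}-gptq-{bits}bit"

def quantize_and_save(model_id: str, bits: int = 4) -> str:
    """Quantize with GPTQ, then write the model and tokenizer to disk."""
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    config = GPTQConfig(bits=bits, dataset="c4", tokenizer=tokenizer)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", quantization_config=config
    )
    out_dir = local_dir(model_id, bits)
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir

def load_quantized(path: str):
    """Reload a saved GPTQ model; its quantization settings come from the saved config."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
    return model, tokenizer

# Usage (needs a GPU with enough VRAM and access to the gated weights):
# out_dir = quantize_and_save("meta-llama/Llama-2-7b-hf", bits=4)
# model, tokenizer = load_quantized(out_dir)
```

As far as I know, the Transformers documentation also suggests moving the model to a single device before saving if it was quantized with a device_map, so check the guidance for your installed version.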

The most important line is the one calling GPTQConfig:

quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

  • bits: Precision of the quantization. You can set it to 4, 3, or 2.
  • dataset: The dataset used for calibration. I would leave “c4”, which seems to yield reasonable results. Other datasets are supported according to the documentation.
  • tokenizer: The tokenizer of Llama 2 7B that will be applied to c4.
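
As a small illustration of these three arguments, the sketch below wraps the call in a hypothetical helper that checks the precision before building the config. The set of allowed values simply mirrors the ones discussed in this article. Passing a list of strings as the dataset follows the Transformers documentation as I understand it, and the sample texts are placeholders.

```python
# Sketch: build a GPTQConfig after validating the requested precision.
# `make_gptq_config` is a hypothetical helper, not part of Transformers.

ARTICLE_BITS = (4, 3, 2)  # precisions discussed in this article

def make_gptq_config(tokenizer, bits=4, dataset="c4"):
    """Return a GPTQConfig for one of the precisions covered here."""
    if bits not in ARTICLE_BITS:
        raise ValueError(f"bits must be one of {ARTICLE_BITS}, got {bits}")
    # Imported lazily: requires transformers (with optimum and auto-gptq)
    from transformers import GPTQConfig
    return GPTQConfig(bits=bits, dataset=dataset, tokenizer=tokenizer)

# Usage (assumes `tokenizer` is the Llama 2 tokenizer loaded earlier):
# config_c4 = make_gptq_config(tokenizer, bits=4)
# config_custom = make_gptq_config(
#     tokenizer,
#     bits=4,
#     dataset=["First calibration text.", "Second calibration text."],
# )
```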

There are two other important options that I left at their default values:

  • desc_act

Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference but the perplexity may become slightly worse. Also known as act-order.

  • disable_exllama

Whether to use exllama backend. Only works with bits = 4.

If inference speed is not your concern, you should set desc_act to True.

The name disable_exllama is confusing: setting it to True turns exllama support off. By default, exllama is used. If you plan to run the model on a setup with limited VRAM, where device_map splits the model across multiple devices, you should set disable_exllama to True.
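
The advice on these two options can be condensed into the sketch below. gptq_options is a hypothetical helper that only encodes the recommendations above. The option names are the ones available when this article was written, and later Transformers releases may rename them, so check the documentation of your installed version.

```python
# Sketch: choose desc_act and disable_exllama following the advice above.
# `gptq_options` and `make_config` are hypothetical helpers.

def gptq_options(speed_matters: bool, split_across_devices: bool) -> dict:
    """Map two deployment questions to GPTQConfig keyword arguments."""
    return {
        # act-order: better perplexity, but slower inference
        "desc_act": not speed_matters,
        # True turns the exllama backend OFF
        "disable_exllama": split_across_devices,
    }

def make_config(tokenizer, bits: int = 4, **options):
    """Build the GPTQConfig; transformers is imported lazily."""
    from transformers import GPTQConfig
    return GPTQConfig(bits=bits, dataset="c4", tokenizer=tokenizer, **options)

# Usage (assumes `tokenizer` is the Llama 2 tokenizer loaded earlier):
# options = gptq_options(speed_matters=False, split_across_devices=True)
# quantization_config = make_config(tokenizer, **options)
```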

I uploaded Llama 2 7B models quantized with GPTQ to the HF Hub:

This is an extract of an article that was first published in The Kaitchup, my newsletter. You can read the full article here:
