Quantize LLMs with GPTQ Using Hugging Face Transformers
GPTQ is now much easier to use

Many large language models (LLMs) on the Hugging Face Hub are quantized with AutoGPTQ, an efficient and easy-to-use implementation of GPTQ.
GPTQ quantization has several advantages over other quantization methods such as bitsandbytes nf4. For instance, GPTQ models are serializable and faster for inference. You will find a detailed comparison between GPTQ and bitsandbytes quantizations in my previous article:
LLMs quantized with AutoGPTQ are fast and efficient, but there was one obstacle to their massive adoption: They weren’t natively supported by Hugging Face libraries.
This is not the case anymore. Last week, Hugging Face announced that Transformers and TRL now natively support AutoGPTQ.
With Transformers and TRL, you can:
- Quantize an LLM with GPTQ to 4-bit, 3-bit, or 2-bit precision
- Load a GPTQ LLM from your computer or the HF Hub
- Serialize a GPTQ LLM
- Fine-tune a GPTQ LLM
In this article, I show you how to quantize an LLM with Transformers. If you are interested in fine-tuning an LLM quantized with GPTQ, I did it here:
The notebook to reproduce my experiments is available here (notebook #12):
4, 3, and 2-bit quantizations with Transformers GPTQ
GPTQ requires a lot of GPU VRAM. Quantizing Llama 2 7B isn’t possible on consumer hardware. In my experiments, the VRAM consumption peaked at 33 GB while it only used 6 GB of CPU RAM. I had to use the A100 GPU of Google Colab Pro, which has 40 GB of VRAM.
Quantization with GPTQ is also slow. It took 35 minutes on one A100, which cost approximately $0.75. The quantization time and VRAM/RAM consumption are the same for the 4-bit, 3-bit, and 2-bit precisions. Note: I quantized with 3-bit and 2-bit precisions just out of curiosity. As we will see below, the models quantized with these precisions are almost useless.
While you can’t quantize Llama 2 with GPTQ on consumer hardware, doing it in the cloud is still quite cheap. Smaller models (<4B parameters) can be quantized on consumer hardware (24 GB of VRAM or less, e.g., an RTX 3090).
To quantize with GPTQ, I installed the following libraries:
pip install transformers optimum accelerate auto-gptq
The quantization and serialization with Transformers are done as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# 4-bit GPTQ, calibrated on samples from the c4 dataset
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
# Passing quantization_config triggers the quantization while the model loads
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
# Serialize the quantized model and its tokenizer to the Hub
model.push_to_hub("kaitchup/Llama-2-7b-gptq-4bit")
tokenizer.push_to_hub("kaitchup/Llama-2-7b-gptq-4bit")

Note: Of course, you will need to change the repository specified in “push_to_hub”. Alternatively, you may call model.save_pretrained to save your model locally.
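The local alternative mentioned in the note, and the reloading of a GPTQ model, could look roughly like the sketch below. The helper names save_locally and load_quantized are hypothetical (they are not part of Transformers), and the Transformers import is deferred so that the helpers can be defined on a machine without the library. Loading a GPTQ checkpoint is assumed to need the same packages as quantization (optimum and auto-gptq) plus a GPU.

```python
def save_locally(model, tokenizer, out_dir):
    """Write the quantized model and its tokenizer to a local directory.

    Hypothetical helper: it only relies on the save_pretrained method
    that Transformers models and tokenizers expose.
    """
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir


def load_quantized(path_or_repo):
    """Load a GPTQ model from a local directory or a Hub repository.

    Hypothetical helper. No GPTQConfig is passed here: the quantization
    settings are expected to be stored in the checkpoint's config.
    """
    # Deferred import: only needed when a model is actually loaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(path_or_repo, device_map="auto")
    return model, tokenizer
```

For example, save_locally(model, tokenizer, "./llama-2-7b-gptq-4bit") followed by load_quantized("./llama-2-7b-gptq-4bit") should give back a model ready for inference; the directory name is only an illustration.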
The most important line is the one calling GPTQConfig:
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

- bits: Precision of the quantization. You can set it to 4, 3, or 2.
- dataset: The dataset used for calibration. I would leave “c4”, which seems to yield reasonable results. Other datasets are supported according to the documentation.
- tokenizer: The tokenizer of Llama 2 7B that will be applied to c4.
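To get an intuition for what the bits option controls, here is a toy sketch using plain uniform round-to-nearest quantization. This is not GPTQ: GPTQ additionally uses the calibration data to compensate for the rounding error, so real errors will differ. The random Gaussian "weights" and the helper names are assumptions made only for illustration.

```python
import random


def quantize_rtn(weights, bits):
    """Uniform round-to-nearest quantization (a toy stand-in, not GPTQ)."""
    levels = 2 ** bits - 1                      # number of steps on the grid
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels            # width of one step
    codes = [round((w - w_min) / scale) for w in weights]
    return [w_min + code * scale for code in codes]  # dequantized values


def mean_abs_error(original, approx):
    return sum(abs(x - y) for x, y in zip(original, approx)) / len(original)


random.seed(0)
# Made-up weights: 4096 samples from a narrow Gaussian.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {}
for bits in (4, 3, 2):
    errors[bits] = mean_abs_error(weights, quantize_rtn(weights, bits))
    print(f"{bits}-bit: {2 ** bits} values per weight, mean abs error = {errors[bits]:.5f}")
```

Each bit removed halves the number of values a weight can take (16, 8, then 4), so the rounding error grows quickly as the precision drops.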
There are two other important options that I left to default:
- desc_act: Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference but the perplexity may become slightly worse. Also known as act-order.
- disable_exllama: Whether to use exllama backend. Only works with bits=4.
If inference speed is not your concern, you should set desc_act to True.
The name disable_exllama is confusing: setting it to True turns exllama off, while False (the default) keeps exllama on. If you plan to run the model on a configuration with limited VRAM, where device_map has to split the model across multiple devices, you should set disable_exllama to True.
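The advice on these two options can be summarized in a small sketch. The helper gptq_options and its arguments favor_quality and split_across_devices are hypothetical names introduced here; only bits, dataset, tokenizer, desc_act, and disable_exllama come from the GPTQConfig call discussed above. Disabling exllama for 3-bit and 2-bit is my reading of "only works with bits=4". Newer Transformers releases may expose the exllama switch under a different name, so check the GPTQConfig documentation for your installed version.

```python
def gptq_options(bits=4, favor_quality=False, split_across_devices=False):
    """Collect keyword arguments for GPTQConfig (hypothetical helper)."""
    if bits not in (2, 3, 4):
        raise ValueError("bits must be 2, 3, or 4")
    return {
        "bits": bits,
        "dataset": "c4",
        # act-order: better perplexity, slower inference
        "desc_act": favor_quality,
        # True switches exllama OFF; it is only usable with 4-bit models
        "disable_exllama": split_across_devices or bits != 4,
    }


def build_config(tokenizer, **choices):
    """Build the GPTQConfig itself; needs transformers to be installed."""
    from transformers import GPTQConfig  # deferred import

    return GPTQConfig(tokenizer=tokenizer, **gptq_options(**choices))
```

For a model that will be split across a small GPU and the CPU, where perplexity matters more than speed, build_config(tokenizer, favor_quality=True, split_across_devices=True) would set both desc_act and disable_exllama to True.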
I uploaded Llama 2 7B models quantized with GPTQ to the HF Hub:
- 4-bit version (3.9 GB)
- 3-bit version (3.1 GB)
- 2-bit version (2.3 GB)
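These sizes are larger than a naive "parameters times bits divided by 8" estimate. The back-of-the-envelope sketch below suggests why, under several assumptions that are mine and not measured from the checkpoints: the layer dimensions are taken from the public Llama 2 7B configuration, only the linear layers inside the transformer blocks are quantized, the embeddings, output head, and norms stay in 16-bit, and each group of 128 weights stores one 16-bit scale and one packed zero-point.

```python
# Architecture numbers assumed from the public Llama 2 7B configuration.
VOCAB, HIDDEN, LAYERS, INTERMEDIATE = 32_000, 4_096, 32, 11_008
GROUP_SIZE = 128  # assumed quantization group size


def estimate_size_gb(bits):
    """Rough checkpoint size in GB (1e9 bytes) for a given precision."""
    attention = 4 * HIDDEN * HIDDEN        # q, k, v, and o projections
    mlp = 3 * HIDDEN * INTERMEDIATE        # gate, up, and down projections
    quantized = LAYERS * (attention + mlp)  # weights stored with `bits` bits
    # Tensors assumed to stay in 16-bit: embeddings, output head, norms.
    kept_fp16 = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN
    groups = quantized // GROUP_SIZE
    total_bytes = (
        quantized * bits / 8   # packed quantized weights
        + groups * 2           # one 16-bit scale per group
        + groups * bits / 8    # one packed zero-point per group
        + kept_fp16 * 2        # unquantized 16-bit tensors
    )
    return total_bytes / 1e9


for bits in (4, 3, 2):
    print(f"{bits}-bit: about {estimate_size_gb(bits):.1f} GB")
```

Under these assumptions, the estimates land close to the sizes of the uploaded models, which suggests that most of the overhead comes from the tensors kept in 16-bit rather than from the quantized layers.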
This is an extract of an article that was first published in The Kaitchup, my newsletter. You can read the full article here:





