Summary

The website content provides a detailed guide on how to successfully load and run large language models (LLMs) in 4-bit quantization mode on a GPU with limited VRAM, specifically focusing on the use of Huggingface Transformers and bitsandbytes.

Abstract

The author of the content describes a personal journey of overcoming challenges encountered while attempting to run a 4-bit quantized LLM using Huggingface Transformers. Initially faced with numerous errors and compatibility issues, the author persisted and eventually found a solution. The guide outlines the necessary steps to set up a Python virtual environment with specific package versions, load the LLM in 4-bit mode, initialize parameters and functions, and execute the model on a GPU. The solution enables the use of LLMs, including a 70B model, on consumer-grade GPUs without relying on cloud services or platforms like Google Colab. The author emphasizes the practicality of running such models on local machines, providing complete code examples and references to resources that aided in the process.

Opinions

The author initially struggled with running 4-bit LLMs due to various technical issues, indicating a lack of straightforward solutions in the existing documentation or tools.
The LLaVA model's instructions inadvertently helped the author resolve previous issues with running 4-bit LLMs, suggesting the value of diverse model documentation and community contributions.
The author expresses satisfaction with Exllama, describing it as good and fast, which may imply a recommendation for similar use cases.
The author's success in loading and running any LLM in 4-bit without issues reflects a positive outcome from persistence and minor adjustments in the setup.
By sharing the experience and solution, the author conveys a sense of community contribution and the belief that others can benefit from this guide to run LLMs on their own hardware.

Load and Run Any 4-bit LLM Using Huggingface Transformers

Solve the 4-bit LLM setup problems all at once

Transformer, image by Andrew Zhu using SDXL

I was trying to run an open-source LLM with Huggingface Transformers in 4-bit quantization mode. Why 4-bit quantization? Because on a GPU with 24G VRAM, a 30B model only fits when it is quantized to 4-bit.

I read several articles, such as "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA" [1]. In theory, it should work. In reality, I hit tons of errors and exceptions: packages that were incompatible with each other, incorrect Transformers package versions, and so on. I almost gave up and planned to just stay with Exllama [2] to run GPTQ 4-bit quantized models. By the way, Exllama is good and fast.

Then one day I tried out the LLaVA [3] model. Following its instructions, I successfully loaded and ran LLaVA on my GPU. On a whim, inside the LLaVA Python virtual environment, I tried the 4-bit LLM with Transformers again, following the Huggingface article [1]. Magically, a completely new exception was reported, so I knew that something I couldn't fix before had now been fixed.

Soon, with some minor changes, I could run any LLM in 4-bit without issues. Here are the steps to make it work.

A working solution to load up any LLM in 4-bit quant

Step 1. Set up a Python venv and install the right packages

torch==2.0.1
torchvision==0.15.2
transformers==4.35.0
tokenizers>=0.14,<0.15
sentencepiece==0.1.99
shortuuid
accelerate==0.21.0
peft==0.4.0
bitsandbytes==0.41.0
pydantic<2,>=1
markdown2[all]
numpy
scikit-learn==1.2.2
gradio==3.35.2
gradio_client==0.2.9
requests
httpx==0.24.0
uvicorn
fastapi
einops==0.6.1
einops-exts==0.0.4
timm==0.6.13
ipywidgets
diffusers
ipykernel
protobuf==3.20.1
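
Since most of the trouble described above came from mismatched package versions, it can help to confirm what actually ended up in the environment before loading a model. The sketch below assumes the list above was installed into a fresh venv (for example by saving it as requirements.txt and running pip install -r requirements.txt). The check_pins helper and the PINNED dictionary are hypothetical names introduced here, and only a handful of the pins are repeated.

```python
from importlib import metadata

# A few pins copied from the list above; extend as needed.
PINNED = {
    "torch": "2.0.1",
    "transformers": "4.35.0",
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.41.0",
}

def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local builds such as "2.0.1+cu118" still count as a match.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches

if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, the installed versions match these pins.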

Step 2. Load the LLM in 4-bit mode

import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    StoppingCriteria,
    StoppingCriteriaList,
)

model_name = "/path/to/ai_models/zephyr-7b-beta"

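m = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code   = True,
    quantization_config = BitsAndBytesConfig(
        load_in_4bit                = True,             # store the weights in 4-bit
        bnb_4bit_compute_dtype      = torch.bfloat16,   # run the matrix multiplications in bfloat16
        bnb_4bit_use_double_quant   = True,             # also quantize the quantization constants
        bnb_4bit_quant_type         = "nf4",            # 4-bit NormalFloat data type
    ),
    torch_dtype = torch.bfloat16,
    device_map  = "auto",   # spread across all visible GPUs; use {"": 0} to pin everything to GPU 0
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1  # BOS id used by Llama/Mistral-family tokenizers; adjust for other families
print(f"Successfully loaded the model {model_name} into memory")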

Step 3. Initialise parameters and functions

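# Stop on the tokenizer's own end-of-sequence token ("</s>" for Zephyr).
# A hard-coded "<|endoftext|>" only works for models that have that token in their vocabulary.
stop_token_ids = [tokenizer.eos_token_id]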

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_id in stop_token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

model = m
pipe = transformers.pipeline(
    task                = "text-generation",
    model               = model,
    tokenizer           = tokenizer,
    return_full_text    = True,                 # langchain expects the full text (prompt + completion)
    device_map          = "auto",
    # default generation parameters; gen_text() overrides some of them per call
    stopping_criteria   = stopping_criteria,    # without this the model will ramble
    temperature         = 0.15,                 # 'randomness' of outputs; must be > 0, lower is more deterministic
    top_p               = 0.15,                 # sample from the smallest set of tokens whose probabilities add up to 15%
    top_k               = 0,                    # 0 disables top-k filtering, so sampling relies on top_p
    max_new_tokens      = 768*4,                # max number of tokens to generate in the output
    repetition_penalty  = 1.1                   # without this the output begins repeating
)

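def gen_text(system_prompt: str = None, input_text: str = "hello") -> str:
    """Build a chat prompt from the system and user messages and generate a reply."""
    if system_prompt is None:
        system_prompt = "You are a friendly chatbot who always responds in the style of a pirate"
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": input_text},
    ]
    # render the messages as plain text using the model's own chat template
    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize              = False,
        add_generation_prompt = True,
    )
    # these per-call values override the defaults given to the pipeline
    outputs = pipe(
        prompt,
        max_new_tokens = 1024,
        do_sample      = True,
        temperature    = 0.2,
        top_k          = 50,
        top_p          = 0.95,
    )
    # with return_full_text=True the result holds the prompt followed by the reply
    return outputs[0]["generated_text"]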
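
Because the pipeline is created with return_full_text = True, the string returned by gen_text contains the rendered prompt followed by the model's reply. If only the reply is wanted, one option is to cut the prompt off the front. The sketch below is a minimal illustration of that idea: strip_prompt is a hypothetical helper, and the example strings are made-up stand-ins so that it runs without a model (they are not the exact output of any chat template).

```python
def strip_prompt(full_text: str, prompt: str) -> str:
    """Return only the newly generated part of a full-text pipeline output."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    # Fall back to the whole text if the prompt is not a literal prefix.
    return full_text.strip()

# Stand-in strings so the sketch runs without loading a model.
example_prompt = "<|user|>\nhello\n<|assistant|>\n"
example_output = example_prompt + "Ahoy, matey!"
print(strip_prompt(example_output, example_prompt))  # Ahoy, matey!
```

To use it with the code above, gen_text would need to return the prompt alongside the generated text, or the pipeline could be created with return_full_text = False instead.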

Step 4. Run it

system = "You are smart. You can solve arithmetic problems, and you can do reasoning and logical inference. The answer is critical to me."
# named "question" rather than "input" to avoid shadowing the Python builtin
question = '''
Jane is faster than Joe. Joe is faster than Sam. Is Sam faster than Jane?
'''
r = gen_text(system_prompt=system, input_text=question)
print(r)

That is it. I hope you see the correct result from the LLM running on your own GPU.

Run a 70B LLM on Your Machine

If you have two GPUs, each with 24G VRAM, you can run a 70B LLM (the largest Llama 2 model) on your own machine: no GPU cloud, no Google Colab, and you can use it whenever you need it.

Here is how to do it. In the model loading code, setting device_map = "auto" loads the model distributed evenly across the available CUDA devices. Say you have two RTX 3090s: each GPU will hold around 18 GB of the weights of a 4-bit 70B model. If you want the model to run only on CUDA:0, change it back to device_map = {"": 0}.
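
The per-GPU figure can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below counts only the quantized weights at 4 bits per parameter. It ignores the quantization constants, activations, the KV cache, and CUDA overhead, so the real footprint is somewhat higher than these numbers. The function names are hypothetical.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: float = 4.0) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

def per_gpu_gb(n_params_billion: float, n_gpus: int, bits_per_param: float = 4.0) -> float:
    """Weights per GPU, assuming an even split across n_gpus devices."""
    return weight_memory_gb(n_params_billion, bits_per_param) / n_gpus

print(weight_memory_gb(70))   # 35.0 -> too large for a single 24G card
print(per_gpu_gb(70, 2))      # 17.5 -> fits on each of two 24G cards
print(weight_memory_gb(30))   # 15.0 -> a 30B model fits on one 24G card
```

The 17.5 GB per card for the weights alone is in line with the roughly 18 GB per GPU mentioned above.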

Here is the complete code for your convenience:

import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    StoppingCriteria,
    StoppingCriteriaList,
)

model_name = "/path/to/ai_models/zephyr-7b-beta"

m = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code   = True,
    quantization_config = BitsAndBytesConfig(
        load_in_4bit                = True,             # store the weights in 4-bit
        bnb_4bit_compute_dtype      = torch.bfloat16,   # run the matrix multiplications in bfloat16
        bnb_4bit_use_double_quant   = True,             # also quantize the quantization constants
        bnb_4bit_quant_type         = "nf4",            # 4-bit NormalFloat data type
    ),
    torch_dtype = torch.bfloat16,
    device_map  = "auto",   # spread across all visible GPUs; use {"": 0} to pin everything to GPU 0
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1  # BOS id used by Llama/Mistral-family tokenizers; adjust for other families
print(f"Successfully loaded the model {model_name} into memory")

References

[1] Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA, https://huggingface.co/blog/4bit-transformers-bitsandbytes

[2] Exllama, https://github.com/turboderp/exllama

[3] LLaVA, https://github.com/haotian-liu/LLaVA