A Step-by-Step Guide to Running Mistral 7B on a Single GPU with Google Colab
How to run your AI efficiently through 4-bit quantization (Colab notebook included!)
The world of Large Language Models (LLMs) is evolving fast, with breakthrough models like Llama 2 and Falcon emerging continuously.
On September 27, 2023, the French startup Mistral AI set the tech world abuzz when it unveiled its first model, Mistral 7B, which it claims is the most powerful language model for its size to date. Mistral AI also represents a promising opportunity for Europe to carve out its own path in the rapidly evolving field of AI.
The Objective of This Article
The objective of this article is to show you how to efficiently load and run Mistral 7B on Google Colab using just a single GPU. The magic ingredient for success? 4-bit quantization and QLoRA, both of which will be explained later.
You can also find the Google Colab notebook used in our article here, so you can explore and experiment with Mistral AI on your own!
What is Mistral 7B?
Mistral-7B-v0.1 is a small yet powerful Large Language Model adaptable to many use cases. It can perform a variety of natural language processing tasks and supports a sequence length of 8k tokens. For instance, it is well suited to text summarisation, classification, text completion and code completion.
Here’s the low-down on Mistral 7B:
- Performance Beyond Compare: According to Mistral AI, Mistral 7B outperforms Llama 2 13B on all the benchmarks they tested.
- More Efficient: Thanks to Grouped-query attention (GQA) and Sliding Window Attention (SWA), Mistral 7B delivers faster inference and handles longer sequences with ease.
- Open for All: Released under the permissive Apache 2.0 license, Mistral 7B can be used freely, including for commercial purposes.
What are quantization and QLoRA?
Mistral 7B may be smaller than its peers, but getting it to run and train on consumer hardware remains a huge challenge.
To run it on a single GPU, we need to load the model in 4-bit precision, the quantization scheme that QLoRA relies on, to reduce memory usage.
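To make the memory problem concrete, here is a back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. The parameter count of roughly 7.24 billion is an approximation used for illustration, and the estimate ignores activations, the KV cache and quantization constants, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough weight-memory estimate for a ~7B-parameter model at different precisions.
# ASSUMPTION: 7.24e9 is an approximate parameter count, used only for illustration.
N_PARAMS = 7.24e9

BITS_PER_PARAM = {"float32": 32, "float16/bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gib(n_params: float, bits: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3


for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>17}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, 16-bit weights need about 13.5 GiB, which leaves very little headroom on a 16 GB GPU, while 4-bit weights need roughly 3.4 GiB.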
The QLoRA solution
QLoRA, which stands for Quantized LLMs with Low-Rank Adapters, is an efficient finetuning approach. It uses 4-bit quantization to compress a pretrained language model and trains small low-rank adapters on top of the frozen weights. According to the paper, this comes without performance tradeoffs compared to standard 16-bit model finetuning.
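The "Low-Rank Adapters" part is what keeps finetuning cheap: instead of updating a full weight matrix, only two thin matrices are trained. The sketch below counts the trainable parameters for a single matrix. The 4096 x 4096 shape and the rank of 16 are hypothetical values chosen for illustration, not settings taken from this guide.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# ASSUMPTION: a hypothetical square projection matrix and an illustrative rank.
d_out = d_in = 4096
rank = 16

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)

print(f"full matrix : {full:,} params")
print(f"LoRA adapter: {lora:,} params ({100 * lora / full:.2f}% of the full matrix)")
```

For this example the adapter holds fewer than 1% of the parameters of the matrix it adapts, which is why the frozen base model can stay in 4-bit while only the adapter is trained in higher precision.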
The abstract of the QLoRA paper:
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. […]
QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. […]
Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA.
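To get a feel for what 4-bit quantization does to weights, here is a deliberately simplified toy in pure Python. It is not NF4: it uses evenly spaced integer levels with a per-block absmax scale, whereas NF4 places its 16 levels according to a normal distribution. The block size of 64 and the random test weights are assumptions made for this sketch.

```python
import random


def quantize_block(block, n_bits=4):
    """Absmax-quantize one block of floats to signed integer codes plus one scale."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4 bits
    absmax = max(abs(x) for x in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def quantize(weights, block_size=64):
    """Quantize a flat list of weights block by block."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


random.seed(0)
# ASSUMPTION: toy weights drawn from a normal distribution, not real model weights.
weights = [random.gauss(0.0, 0.02) for _ in range(256)]

blocks = quantize(weights)
restored = [x for codes, scale in blocks for x in dequantize_block(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(f"{len(blocks)} blocks, max abs reconstruction error: {max_err:.5f}")
```

Each block stores small integer codes plus one scale (the "quantization constant"). Those constants are exactly what double quantization compresses further.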
Step-by-Step walkthrough
Let’s get started! You can also directly open the Google Colab notebook here, where all the instructions have already been prepared for you to explore!
Step 1 — Install necessary packages
QLoRA uses bitsandbytes for quantization and is integrated with Hugging Face’s PEFT and transformers libraries.
Since we want to make sure we’re using the latest features, we’ll install transformers, peft and accelerate from source, while bitsandbytes is installed from PyPI.
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
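Because these installs track the latest development versions, it can help to record what actually got installed before moving on. The snippet below is a small sanity check using only the standard library. The helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for pkg in ("bitsandbytes", "transformers", "peft", "accelerate"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```

If any package shows up as not installed, rerun the corresponding pip command (and restart the Colab runtime if needed) before continuing.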
Step 2 — Define quantization parameters through the BitsAndBytesConfig from transformers
Now we’ll configure the quantization settings using BitsAndBytesConfig from the transformers library.
Here is a quick explanation of the arguments that can be tweaked:
- load_in_4bit=True: Specifies that we want to convert and load the model in 4-bit precision.
- bnb_4bit_use_double_quant=True: Uses nested quantization for more memory-efficient inference and training.
- bnb_4bit_quant_type="nf4": The 4-bit integration comes with two quantization types, FP4 and NF4. The NF4 dtype stands for Normal Float 4 and was introduced in the QLoRA paper. By default, FP4 quantization is used.
- bnb_4bit_compute_dtype=torch.bfloat16: The compute dtype sets the dtype used during computation. By default, the compute dtype is float32, but it can be set to bf16 for speedups.
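import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # use NF4 instead of the default FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # run computations in bf16
)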
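To see why nested (double) quantization is worth enabling, we can redo the arithmetic from the QLoRA paper. The sketch assumes the figures described there: one 32-bit constant per block of 64 weights, and with double quantization, 8-bit constants that are themselves grouped in blocks of 256 with one 32-bit constant each. The defaults in your installed bitsandbytes version may differ, and the parameter count is again an approximation.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight spent on one quantization constant per block."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, const_bits=8,
                               second_block_size=256, second_const_bits=32):
    """Extra bits per weight when the constants are themselves quantized."""
    first_level = const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


plain = constant_overhead_bits()
nested = double_quant_overhead_bits()

# ASSUMPTION: approximate parameter count, used only for illustration.
n_params = 7.24e9
saved_gib = (plain - nested) * n_params / 8 / 1024**3

print(f"overhead without double quantization: {plain:.3f} bits per weight")
print(f"overhead with double quantization:    {nested:.3f} bits per weight")
print(f"estimated saving for this model size: {saved_gib:.2f} GiB")
```

Under these assumptions the constants cost about 0.127 bits per weight instead of 0.5, which works out to roughly 0.3 GiB saved for a model of this size.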
Step 3 — Load the Mistral 7B with quantization
Now we specify the model ID and then we load it with our previously defined quantization configuration.
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
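Once the model is loaded, it is worth checking how much memory it actually occupies. Models from transformers expose a get_memory_footprint() method that returns a size in bytes. The small wrapper below (footprint_gib is a name invented for this example) converts it to GiB.

```python
def footprint_gib(model) -> float:
    """Return the memory footprint reported by the model, converted to GiB."""
    return model.get_memory_footprint() / 1024**3


# Usage in the notebook, after Step 3 has created `model`:
# print(f"Model footprint: {footprint_gib(model):.2f} GiB")
```

The exact figure depends on your library versions and on which layers are kept in higher precision, so expect it to land somewhat above the pure 4-bit estimate.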
Step 4 — Once loaded, run a generation and give it a try!
Finally, we’re ready to bring Mistral 7B into action.
1. Let’s start by testing its text generation abilities. You can use the following template:
PROMPT= """ ### Instruction: Act as a data science expert.
### Question:
Explain to me what is Large Language Model. Assume that I am a 5-year-old child.
### Answer:
"""
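# Tokenize the prompt and move the tensors to the device the model was loaded on
encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(model.device)
# Sample up to 1000 new tokens; the EOS token doubles as the padding token
generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])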
The model has followed our instructions and explained the concept of a Large Language Model quite well!
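One thing you may notice in the output: for decoder-only models, generate returns the prompt tokens followed by the newly generated ones, so the decoded text repeats the prompt. A minimal sketch of how to keep only the new tokens is shown below with plain Python lists. The token ids are made-up toy values and new_tokens_only is a helper name invented for this example.

```python
def new_tokens_only(generated_ids, prompt_length):
    """Drop the first `prompt_length` ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in generated_ids]


# ASSUMPTION: toy token ids, not real tokenizer output.
prompt_ids = [101, 102, 103]
generated = [prompt_ids + [7, 8, 9]]

print(new_tokens_only(generated, len(prompt_ids)))  # [[7, 8, 9]]

# With the tensors from this step, the same idea is a slice on the second axis:
# prompt_length = model_inputs["input_ids"].shape[1]
# answer = tokenizer.batch_decode(generated_ids[:, prompt_length:], skip_special_tokens=True)
```

Passing skip_special_tokens=True to batch_decode additionally hides markers such as the end-of-sequence token in the printed answer.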
2. Now let’s test Mistral 7B’s coding skills.
messages = [
{"role": "user", "content": "write a python function to generate a list of random 1000 numbers between 1 and 10000?"}
]
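# Wrap the conversation in the model's chat format and tokenize it
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
# Move the input ids to the device the model was loaded on
model_inputs = encodeds.to(model.device)
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])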
It seems the model nails the code!
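If you are curious what apply_chat_template does behind the scenes, the instruct model expects user turns wrapped in [INST] ... [/INST] markers. The function below is only an approximation of that format written for illustration. The tokenizer's own template remains the authoritative version, and exact whitespace may differ between model revisions.

```python
def format_mistral_instruct(messages, bos="<s>", eos="</s>"):
    """Approximate the [INST] chat format for alternating user/assistant turns."""
    parts = [bos]
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("Roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # ASSUMPTION: assistant replies are closed with the end-of-sequence token.
            parts.append(f"{message['content']}{eos}")
    return "".join(parts)


example = [{"role": "user", "content": "Write a haiku about GPUs."}]
print(format_mistral_instruct(example))
```

This also explains why the first test in this step, which used a custom "### Instruction" template, is a different style of prompting from the chat-formatted request used for the coding test.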
Wrap-up
A quick recap
In summary, we have seen that Mistral 7B is a very interesting alternative to popular models like Llama 2 and Falcon.
It’s free and smaller, yet more efficient. It allows full customization and can be finetuned easily with compelling performance.
Throughout this guide, we’ve walked you through how to run Mistral 7B on a single GPU using the 4-bit quantization behind the QLoRA approach. If you want to go further, I highly recommend delving into the Hugging Face article “Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA”.
What’s Next?
More practical guides are coming on how to tailor Mistral 7B to your specific use cases, including:
- How to Fine-tune Mistral 7B with Your Own Data: Dive into the art of customizing Mistral 7B with your unique datasets.
- How to Fine-tune Mistral 7B Using Only YAML Configuration Files: We’re going to simplify the fine-tuning process by demonstrating how to drive it with YAML configuration files.
Follow me and stay tuned!