Running Mixtral 8x7b on M1 16GB

Summary

The web content describes how to run the Mixtral 8x7B AI model, a Mixture of Experts (MoE) system, on a 16GB M1 Pro laptop using quantization methods and specific software configurations for optimal performance.

Abstract

The article discusses the Mixtral 8x7B AI model by Mistral AI, which offers capabilities comparable to GPT-3.5 and strong multilingual support. Despite its large size of 47B parameters, the model can be run on a 16GB M1 Pro laptop. Its Mixture of Experts design activates only a subset of the parameters during inference (about 13B), which reduces the compute per token, while a new 2-bit quantization method called QuIP compresses the weights enough to fit in memory without a substantial loss in quality. The article provides a step-by-step guide to set up and run the model using llama.cpp: cloning the repository, compiling the code, downloading the quantized model, and executing it with command-line arguments that control GPU offloading and the number of CPU threads. The article concludes with a demonstration of the model's functionality and a note on its slow generation speed, while also promoting an AI service that offers similar capabilities to ChatGPT Plus at a lower cost.

Opinions

The author expresses admiration for the Mixtral 8x7B model's ability to run on a 16GB M1 Pro laptop, considering its size and complexity.
There is an emphasis on the efficiency and practicality of using a Mixture of Experts model, which allows for access to a large number of parameters while maintaining a lower active parameter count during inference.
The author seems impressed by the QuIP quantization method, highlighting its state-of-the-art status and its ability to compress the model significantly with minimal quality loss.
The author provides a subjective assessment of the model's performance, describing the generation speed as slow but still impressive given the hardware constraints.
A value proposition is made for an AI service alternative to ChatGPT Plus, suggesting it as a cost-effective option for similar performance.

Running Mixtral 8x7b on M1 16GB

Mistral AI has made a big impression on the AI landscape with its Mixtral 8x7B model. Comparable to GPT-3.5 in terms of answer quality, this model also has robust support for languages such as French and German. It’s especially impressive that we can now run a substantial 47B-parameter model on a 16GB M1 Pro laptop.

Essentially, Mixtral 8x7B is a Mixture of Experts (MoE) model. Instead of a single large feed-forward block, each layer contains eight smaller expert blocks, and a router network selects two of them for every token. The model therefore has access to 47B parameters in total while actively using only about 13B per token during inference, so generation costs roughly as much compute as a 13B model. Note that this saves compute, not memory: all 47B parameters still have to be loaded. The context size is 32k tokens.
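To make the routing idea concrete, here is a minimal sketch of top-2 expert selection for a single token in plain Python. It is only an illustration under stated assumptions: the toy experts, the router logits, and the function name moe_layer are made up for this example and are not Mixtral’s actual code. The point it shows is that only the two selected experts are evaluated, and their outputs are mixed with weights from a softmax over the two selected router scores.

```python
import math

def moe_layer(x, router_logits, experts):
    """Route one token vector x through the two highest-scoring experts."""
    # Indices of the two largest router logits (the "top-2" experts).
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over only the two selected logits gives the mixing weights.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Only the two chosen experts are evaluated; the other six are skipped.
    outputs = [experts[i](x) for i in top2]
    mixed = [sum(w * out[d] for w, out in zip(weights, outputs))
             for d in range(len(x))]
    return top2, weights, mixed

# Hypothetical toy experts: expert k simply scales its input by (k + 1).
experts = [lambda v, k=k: [(k + 1) * a for a in v] for k in range(8)]
# Hypothetical router scores for one token (one score per expert).
router_logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

top2, weights, mixed = moe_layer([1.0, 2.0], router_logits, experts)
print("selected experts:", top2)
print("mixing weights:", weights)
print("mixed output:", mixed)
```

In a real model the experts are learned feed-forward networks and the router logits come from a learned linear layer, but the select-two-then-mix structure is the same.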

What about running on only 16GB? Obviously, we will use llama.cpp, but with a caveat :) We can’t just run the model as is; we need to quantise it, i.e. store its weights at a lower precision. This is where one solution shines: QuIP. It is a new SOTA method for 2-bit (!) quantisation, which allows such crazy compression with a relatively small loss in quality.
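A quick back-of-the-envelope calculation shows why quantisation is the deciding factor. The sketch below assumes the 47B parameter count quoted above and reads the "2.10bpw" in the model filename used later in this article as an average of 2.10 bits per weight. It counts only the weights and ignores the context cache, activations, and whatever else the operating system needs, so treat the numbers as rough lower bounds.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 47e9  # total parameter count quoted in the article

fp16_gb = model_size_gb(N_PARAMS, 16)   # unquantised 16-bit weights
q2_gb = model_size_gb(N_PARAMS, 2.10)   # assumed 2.10 bits per weight

print(f"16-bit weights:   {fp16_gb:.1f} GB")
print(f"2.10-bit weights: {q2_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone need about 94 GB, while the 2.10-bit version needs roughly 12.3 GB, which fits into 16GB but leaves little headroom.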

So, let’s start!

Install llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Now download the quantised model itself (the file used below is mixtral-instruct-8x7b-2.10bpw.gguf). You can also find other models quantised with this method, such as LLAMA-2-70b or Mistral-7b.

Put it in the ./llama.cpp/models folder and you are ready to run:

./main -m ./models/mixtral-instruct-8x7b-2.10bpw.gguf -ngl 0 -t 6

The -ngl option sets the number of model layers offloaded to the GPU, so -ngl 0 means CPU-only usage. BUT if you have more than 20GB of RAM, feel free to set it to 1 to use the Metal GPU and enjoy the huge speedup 🚀

-t 6 sets the number of threads. Typically you want to set it to the number of performance cores (P-cores) in your CPU (it’s 6 for my M1 Pro).
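If you are not sure how many P-cores your machine has, a small helper can look it up. The sketch below assumes that macOS on Apple Silicon reports the performance-core count under the sysctl key hw.perflevel0.physicalcpu; on any other system, or if the lookup fails, it falls back to the total number of logical CPUs. The helper name performance_core_count is made up for this example.

```python
import os
import subprocess

def performance_core_count():
    """Return the P-core count on Apple Silicon, else the logical CPU count."""
    try:
        # Assumption: this sysctl key holds the P-core count on Apple Silicon.
        result = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True,
        )
        return int(result.stdout.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        # sysctl missing, key unknown, or unexpected output: use a safe fallback.
        return os.cpu_count() or 1

threads = performance_core_count()
# Print the run command from this article with the detected thread count.
print(f"./main -m ./models/mixtral-instruct-8x7b-2.10bpw.gguf -ngl 0 -t {threads}")
```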

And Voilà, it’s alive!

Question: Let q(z) = z**2 - 3*z + 4. Calculate the remainder when 17 is divided by q(3).
Answer: 1[end of text]

llama_print_timings:        load time =    2529.21 ms
llama_print_timings:      sample time =       5.09 ms /    42 runs   (    0.12 ms per token,  8249.85 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   48960.64 ms /    42 runs   ( 1165.73 ms per token,     0.86 tokens per second)
llama_print_timings:       total time =   48978.25 ms
Log end

Note that it takes slightly less than a minute (about 49 seconds for 42 tokens) to generate such a short piece of text, which is … slow 🐢. Nevertheless, it’s still impressive that we can run it at all!
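As a sanity check, both the model’s answer and the reported speed can be reproduced with a few lines of Python. The numbers are copied from the prompt and from the llama_print_timings output above; nothing else is assumed.

```python
def q(z):
    """The polynomial from the test question."""
    return z**2 - 3*z + 4

# q(3) = 9 - 9 + 4 = 4, so we need the remainder of 17 divided by 4.
remainder = 17 % q(3)
print("remainder:", remainder)

# Values taken from the "eval time" line of the timing log.
eval_ms = 48960.64
runs = 42

ms_per_token = eval_ms / runs
tokens_per_second = runs / (eval_ms / 1000)
print(f"{ms_per_token:.2f} ms per token")
print(f"{tokens_per_second:.2f} tokens per second")
```

The remainder is 1, so the model’s answer is correct, and the computed speed matches the 1165.73 ms per token and 0.86 tokens per second in the log.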