Running Mixtral 8x7b on M1 16GB
Mistral AI has made a big impression with its Mixtral 8x7B model. Comparable to GPT-3.5 in answer quality, it also has robust support for languages such as French and German. Especially impressive is that we can now run this substantial 47B-parameter model on a 16GB M1 Pro laptop.

Essentially, Mixtral 8x7B is a Mixture of Experts (MoE) model. Instead of one large feed-forward block, each layer holds eight smaller expert blocks, and Mixtral’s router network engages only two of them per token. The model therefore has access to 47B parameters while actively using only about 13B per token during inference, which keeps it much faster than a dense model of the same total size. It also offers a 32k-token context size.
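To make the routing idea concrete, here is a minimal toy sketch of top-2 expert selection for a single token. It is not Mixtral’s actual implementation: the eight "experts" are hypothetical stand-ins that just scale their input, and the router scores are made up for illustration.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    # Rank experts by router score and keep the best two.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top2 = ranked[:2]
    # Softmax over the two selected scores only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, experts, router_logits):
    """Mix the outputs of the two selected experts; the other six are never run."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))

# Eight toy experts (hypothetical): expert k multiplies its input by k + 1.
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
# Made-up router scores for one token.
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

print(top2_route(logits))
print(moe_layer(1.0, experts, logits))
```

Only two of the eight experts are evaluated per token, which is why the active parameter count is so much smaller than the total.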
What about running on only 16GB? We will use llama.cpp, but with a caveat :) We can’t just run the model as-is: we need to quantise it (reduce the precision of its parameters). This is where one solution shines: QuIP, a new SOTA method for 2-bit (!) quantisation, which allows this extreme compression with a relatively small loss in quality.
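A rough back-of-the-envelope calculation shows why quantisation is needed. The sketch below assumes 47B parameters and counts weights only; it ignores the KV cache, activations and runtime overhead, so real memory use will be higher. The 2.10 bits-per-weight figure is taken from the model file name used in the run command.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 47e9  # total parameter count quoted in this post

for label, bpw in [("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("2.10 bpw", 2.10)]:
    print(f"{label:>9}: {model_size_gb(N_PARAMS, bpw):6.1f} GB")
```

Under these assumptions, only the roughly 2-bit variant leaves any headroom on a 16GB machine.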
So, let’s start!
Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Now download the model itself here. You can find other models like LLAMA-2-70b or Mistral-7b quantised with this method.
Put it in the ./llama.cpp/models folder and you are ready to run:
./main -m ./models/mixtral-instruct-8x7b-2.10bpw.gguf -ngl 0 -t 6
Here -ngl 0 sets CPU-only usage (no layers are offloaded to the GPU). But if you have more than 20GB of RAM, feel free to set it to 1 to use the Metal GPU and enjoy the huge speedup 🚀
-t 6 sets the number of threads. Typically you want to set it to the number of P-cores in your CPU (it’s 6 for my M1 Pro).
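If you are not sure how many P-cores your machine has, a small helper like the one below can suggest a value for -t. It is a sketch under an assumption: that macOS on Apple Silicon reports the performance-core count through the sysctl key hw.perflevel0.physicalcpu. On systems where that key or the sysctl command is missing, it falls back to the total logical CPU count, which may be higher than ideal.

```python
import os
import subprocess

def performance_core_count():
    """Best-effort P-core count; falls back to all logical CPUs."""
    try:
        # Assumed key: performance-level-0 physical cores on Apple Silicon macOS.
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return int(out)
    except (OSError, subprocess.CalledProcessError, ValueError):
        # sysctl missing, key unknown, or unexpected output.
        return os.cpu_count() or 1

print(f"suggested -t value: {performance_core_count()}")
```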
And voilà, it’s alive!
Question: Let q(z) = z**2 - 3*z + 4. Calculate the remainder when 17 is divided by q(3).
Answer: 1[end of text]
llama_print_timings: load time = 2529.21 ms
llama_print_timings: sample time = 5.09 ms / 42 runs ( 0.12 ms per token, 8249.85 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 48960.64 ms / 42 runs ( 1165.73 ms per token, 0.86 tokens per second)
llama_print_timings: total time = 48978.25 ms
Log end
Note that it takes slightly less than a minute to generate such a short piece of text, which is… slow 🐢. Nevertheless, it is still impressive that we can run it at all!
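As a sanity check on the output above, the short sketch below recomputes the generation speed from the eval time line of the log and verifies the model’s answer to the maths question. The regular expression is my own guess at a pattern that fits the log lines shown here, not an official llama.cpp log format specification.

```python
import re

# The eval line copied from the log above.
EVAL_LINE = ("llama_print_timings: eval time = 48960.64 ms / 42 runs "
             "( 1165.73 ms per token, 0.86 tokens per second)")

def tokens_per_second(line):
    """Recompute throughput from the 'X ms / N runs' part of a timing line."""
    m = re.search(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:runs|tokens)", line)
    if m is None:
        raise ValueError("line does not look like a llama_print_timings entry")
    elapsed_ms, count = float(m.group(1)), int(m.group(2))
    return count / (elapsed_ms / 1000.0)

def q(z):
    """The polynomial from the prompt: q(z) = z**2 - 3*z + 4."""
    return z**2 - 3*z + 4

print(round(tokens_per_second(EVAL_LINE), 2))  # generation speed in tokens/s
print(17 % q(3))                               # remainder of 17 divided by q(3)
```

Since q(3) = 9 - 9 + 4 = 4 and 17 = 4 * 4 + 1, the remainder is 1, so the model’s answer is correct, and the recomputed speed matches the 0.86 tokens per second reported in the log.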





