Summary

Elon Musk has open-sourced the Grok-1 large model, a 314 billion parameter AI, which is now available for commercial use under the Apache 2.0 license, potentially revolutionizing AI applications despite its demanding computational requirements.

Abstract

Musk has made good on his promise to release the Grok-1 large model as open-source software. This model, with its staggering 314 billion parameters, stands out in the open-source community for its size and potential. Although it has general linguistic capabilities, further fine-tuning is necessary for specific applications. The model's architecture includes unique features such as a specialized attention mechanism with a scaling factor, a GELU-like activation function, and four layers of Layernorm with dimensionality reduction. It also utilizes RoPE with float32 type and has a vocabulary size of 131072. The xAI team, responsible for Grok-1, has already achieved notable success, outperforming many models on the MMLU benchmark. However, the computational resources required for pre-training and fine-tuning Grok-1 are substantial, necessitating multiple high-end GPUs.

Opinions

The author expresses excitement about the Grok-1 model's potential, noting its impressive performance and unique design features.
There is an acknowledgment that while the model is powerful, its size makes it challenging for average users to utilize without significant computational resources.
The author suggests that releasing versions of Grok-1 with fewer parameters could increase its accessibility and adoption within the developer community.
The author is optimistic about the future impact of Grok-1, anticipating that it will lead to new applications and innovations in the AI field.
There is an appreciation for the classic optimization techniques used in Grok-1, such as RMS norm and RoPE, which are common in large language models.

Musk Makes Good on Promise

99% of People Can’t Use It: The Grok-1 Large Model is Now Open Source

How Many GPUs Are Needed to Pre-train Grok-1?

Recently, Musk has fulfilled his promise by open-sourcing both the weights and architecture of the Grok-1 large model!

Grok-1 is a behemoth with an astonishing 314 billion parameters, making it a standout among open-source large models.

However, such a massive model might be too cumbersome for the average user to utilize with ease.

Although Grok-1 has mastered general linguistic abilities, it still requires further fine-tuning for specific scenarios to better tackle a variety of real-world applications.

The good news is that Grok-1 is now available under the Apache 2.0 license, supporting commercial use.

Next, let’s delve into the details and usage scale of this model!

Let’s see what changes and surprises it can bring to our work and lives.

Model Details

Today, I took a look at Grok’s code and have some findings I’d like to share with you all.

Indeed, the Grok model has some unique designs and impressive performances.

First, let’s examine the attention mechanism of the Grok model.

It introduces a unique scaling factor, specifically 30 times tanh(x/30).

This scaling factor acts as a limiter, linearly scaling smaller logit values so they fall within a range of -30 to 30, while larger values are capped within this interval.

Next, Grok opts for an activation function similar to GELU, akin to the Gemma model.

Moreover, Grok has four layers of Layernorm, which is more compared to the two layers in the Llama model.

However, similar to Gemma, Grok performs dimensionality reduction in the final RMS Layernorm layer.

Additionally, I noticed that Grok’s RoPE (Rotary Positional Encoding) exclusively uses the float32 type, consistent with the Gemma model.

In Grok, the RoPE multiplier is set to 1, and QKV (Query, Key, Value) includes biases, but the output layer and MLP (Multilayer Perceptron) layer do not have biases.

Finally, regarding the vocabulary size, Grok’s vocabulary consists of 131072 entries, while Gemma’s vocabulary is larger, reaching 256000.

https://twitter.com/ArtificialAnlys/status/1769502975242961301

Since October 2023, the xAI team began training this model and achieved a 73% score on the MMLU benchmark, surpassing many well-known models.

Overall, the Grok model indeed incorporates many classic optimization techniques commonly used in large language models, such as RMS norm and RoPE.

It would be even more exciting if the team could open-source versions of the model with fewer parameters to lower the barrier to entry!

Such models would be more readily accepted and applied by the broader developer community.

Usage

How many GPUs are needed to pre-train the Grok-1 large model?

The Grok-1 model consists of numerous expert subnetworks, each with approximately 30B parameters.

There are a total of 2048 such subnetworks.

However, during training, only the two most powerful experts are utilized, so the actual number of parameters to consider is 60B.

I plan to use A100 GPUs, each with 40GB of memory.

But 60B parameters require 240GB of memory! That means at least 6 A100 GPUs are needed.

And that’s not all; training also involves gradient computation and backpropagation.

So, I might need 12 A100 GPUs!

What about fine-tuning Grok-1 without pre-training; how many GPUs would that require?

Using Grok-1 directly? The weights can be downloaded from:

xai-org/grok-1 · Hugging Face

We're on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

This model is not only large, with 50 checkpoint files, each file ranging from 1–3GB, but also occupies about 300GB of storage space in total.

Given the parameter count exceeds 300B, efficient computation requires using multiple GPUs in parallel.

Drawing on previous estimates with the 314B Moe model, handling a model of this scale would need at least 8 A100 GPUs.

Because of the model’s size, inference speed on a single GPU would be slow.

Parallel processing with 8 A100 GPUs could handle several dozen samples per second.

So, using Grok-1’s pre-trained weights for a chat application would require at least 8 A100 GPUs, with a total memory exceeding 5TB.

Indeed, this is a massive undertaking!

Conclusion

Now that you have a deeper understanding of the Grok-1 large model, including its unique design, performance, and the resources required for its use.

What impact do you think Grok-1 will have on the future? Are you looking forward to seeing more applications and innovations based on this model?

Feel free to share your thoughts!

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles will be published, as well as full access to thousands of stories from other authors.

I am Li Meng, an independent open-source software developer, and author of SolidUI, highly interested in new technologies, and focused on the AI and data fields. If you find my content interesting, please follow, like, and share. Thank you!