Putting The World’s Largest AI Supercomputer into Perspective

Source: AI-generated by Author using Flux

Elon Musk has just announced that xAI has finally brought its Colossus cluster online, a computer accelerated by an installed base of 100,000 NVIDIA H100 GPUs.

The biggest AI computer the world has ever seen (and it’s not even close) boasts some of the most astonishing numbers one can fathom and helps us toy with the idea of how large the next generation of models will be.

And let me tell you, they are huge!

That’s A Lot of FLOPs

AI workloads require an insane amount of computation to be viable. The reason is that the models are huge digital files boasting billions of parameters, occupying tens or hundreds of gigabytes and, in some cases, like frontier models, terabytes.

To make matters worse, due to their structure, these models are queried every single time they need to predict something, which, in text terms, means they are queried for every new word they predict.

Therefore, to prevent latency from becoming unbearable, we need to keep them in GPU memory, which is scarce even on advanced GPUs. Consequently, deploying these models confidently can require hundreds or even thousands of GPUs working in unison.
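To put rough numbers on this memory pressure, here is a minimal sketch. It assumes 2 bytes per parameter (16-bit weights) and 80 GB of memory per GPU (the capacity of the common H100 variant), and it counts only the weights, ignoring activations and caches. The function names are illustrative.

```python
import math

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def min_gpus_for_weights(n_params: float, gpu_memory_gb: float = 80.0,
                         bytes_per_param: int = 2) -> int:
    """Lower bound on GPUs needed just to fit the weights (assumed 80 GB each)."""
    return math.ceil(weight_memory_gb(n_params, bytes_per_param) / gpu_memory_gb)

# Example: a 405-billion-parameter model stored at 16-bit precision.
size_gb = weight_memory_gb(405e9)
gpus = min_gpus_for_weights(405e9)
print(f"{size_gb:.0f} GB of weights, at least {gpus} GPUs")
```

This is only a floor: serving many users at once, or training the model, multiplies the memory and compute required.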

And although everything revolves around linear algebra, meaning each of these calculations isn’t particularly complex, the sheer number of them inevitably ensures the global complexity is enormous.

But plain words carry little weight compared to seeing real numbers. So, if we want to envision how large the next model frontier will be, how costly is it to train an LLM?

Estimating LLM costs

To determine this, we must calculate the number of FLOPs required. Here, FLOPs means floating-point operations: the total number of calculations required to train (or run) the model. This is different from FLOPs per second (FLOP/s), which measures how fast the hardware performs them.

Following OpenAI’s scaling laws paper, we can estimate the total amount of FLOPs with the equation Cost = 6 x N x D, where:

‘N’ refers to the number of non-embedding parameters of the model (for very large models, the embedding parameters are negligible, so we can just take the global value)
‘D’ refers to the amount of training tokens used to train the model

But is this very simplistic formula accurate? Let’s take Llama 3.1 405B, the state-of-the-art LLM by Meta, as an example:

‘N’ = 405 billion (as mentioned, only 2 billion out of 405, or 0.49%, are embedding parameters, so we can take the total value)
‘D’ = 15 trillion tokens as reported by Meta themselves

Applying this formula gives us a total amount of FLOPs Meta needed to train this model as:

Total FLOPs = 6 x 405 x 10⁹ x 15 x 10¹² = 36,450 x 10²¹, or roughly 3.6 x 10²⁵, within about 4% of the 3.8 x 10²⁵ Meta actually reported.
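The same arithmetic can be written as a short sketch. The function name is illustrative, and the inputs are the figures quoted above (405 billion parameters, 15 trillion tokens, 3.8 x 10²⁵ reported FLOPs).

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute: Cost = 6 x N x D."""
    return 6 * n_params * n_tokens

estimate = training_flops(405e9, 15e12)   # N = 405B parameters, D = 15T tokens
reported = 3.8e25                         # value reported by Meta, as quoted above
gap = (reported - estimate) / reported    # relative shortfall of the estimate

print(f"estimate: {estimate:.3e} FLOPs")
print(f"gap vs. reported: {gap:.1%}")
```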

Another interesting thing we can do is estimate how long it took them to train the model.

Estimating Training Run Length

Using the actual research paper, we know they used a cluster of 16k NVIDIA H100s. We also know the model was trained in mixed precision, with 16-bit weights (BF16/FP16, or 2 bytes per weight) and Adam optimizer states in FP32 (4 bytes per value).

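At that precision, the peak performance of a single NVIDIA H100 is 1,979 TeraFLOPs per second according to NVIDIA's datasheet. Bear in mind that this figure assumes structured sparsity (the dense figure is roughly half), so it is an optimistic ceiling.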

Although this is peak performance (not the actual throughput of each GPU), that means that 16k H100s have a total compute power of 1,979 x 10¹² x 16 x 10³ = 31,664 x 10¹⁵, or 3.2 x 10¹⁹ FLOPs per second.

Consequently, to reach the total training cost Meta reported for Llama 3.1 405B, the cluster had to run for 3.8 x 10²⁵ / 3.2 x 10¹⁹ = 1.2 x 10⁶ seconds, or about 14 days.
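As a quick check of this ideal-case duration, here is a small sketch using only the figures quoted above. The constant names are illustrative, and the per-GPU peak is the article's 1,979 TFLOP/s figure, not a measured throughput.

```python
PEAK_FLOPS_PER_GPU = 1979e12   # peak figure used in the text, in FLOP/s
N_GPUS = 16_000
TOTAL_TRAINING_FLOPS = 3.8e25  # total compute reported by Meta, as quoted above
SECONDS_PER_DAY = 86_400

cluster_flops_per_s = PEAK_FLOPS_PER_GPU * N_GPUS
ideal_seconds = TOTAL_TRAINING_FLOPS / cluster_flops_per_s
ideal_days = ideal_seconds / SECONDS_PER_DAY

print(f"cluster peak: {cluster_flops_per_s:.3e} FLOP/s")
print(f"ideal duration: {ideal_seconds:.2e} s, about {ideal_days:.1f} days")
```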

But here’s the thing: that estimate is not even close. The model was actually trained for 54 days, after accounting for an average Model Flop Utilization ranging from 38–43% along three stages of pre-training, as shown in the image below:

Model Flop Utilization, or MFU, defines the actual per-GPU throughput we can obtain on average during training.

Long story short, it’s not worth it to estimate the ideal training duration, as reality is much, much more daunting. In this case, the model took almost four times as many days to train as the theoretical value.
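One way to read the gap between 14 and 54 days is to back out the throughput each GPU actually sustained. The sketch below does this from the article's own figures; it is a rough implied average, not Meta's reported MFU, and the variable names are illustrative.

```python
TOTAL_TRAINING_FLOPS = 3.8e25  # total compute reported by Meta, as quoted above
N_GPUS = 16_000
ACTUAL_DAYS = 54               # actual training duration quoted above
PEAK_FLOPS_PER_GPU = 1979e12   # peak figure used in the text, in FLOP/s

actual_seconds = ACTUAL_DAYS * 86_400
# Average throughput each GPU must have sustained to finish in 54 days.
sustained_per_gpu = TOTAL_TRAINING_FLOPS / (N_GPUS * actual_seconds)
# Fraction of the quoted peak that this represents.
fraction_of_peak = sustained_per_gpu / PEAK_FLOPS_PER_GPU

print(f"implied sustained throughput: {sustained_per_gpu / 1e12:.0f} TFLOP/s per GPU")
print(f"fraction of quoted peak: {fraction_of_peak:.0%}")
```

The result lands near the 500 TFlops/GPU figure used later for the Colossus estimates.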

And what were the total economic costs?

A 16k NVIDIA H100 cluster costs, roughly speaking, $960 million in capital costs.

At $30,000 per GPU, the capital cost of GPUs alone is $480 million
In AI data center numbers, land, contractors, and other necessary equipment represent another 50% of the total cost, adding the extra $480 million.

As for the running costs of the training run, things are a little bit trickier.

If we take the average US industrial electricity tariff of $0.083/kWh
And assume that the actual power required to run each GPU sits at around double the Thermal Design Power (TDP) of the GPU (according to SemiAnalysis, to account for networking, cooling, and other overheads), which according to NVIDIA is 700 W, that means that the actual power load required is 1,400 W x 16,000 = 22.4 MW

Therefore, running for 54 days at max power, such a data center would consume 22.4 x 10³ kW x 24 hours x 54 days = 29.03 x 10⁶ kWh, which at $0.083/kWh costs around $2.41 million, or just 0.25% of the total cost of ownership (TCO).
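The cost figures above can be reproduced with a short sketch. Every input is an assumption taken from the text ($30,000 per GPU, non-GPU costs equal to GPU costs, 700 W TDP doubled for overheads, a tariff of 0.083 per kWh treated here as US dollars, 54 days); the constant names are illustrative.

```python
N_GPUS = 16_000
GPU_PRICE_USD = 30_000
TDP_WATTS = 700
POWER_OVERHEAD = 2.0          # assumed multiplier for networking, cooling, etc.
TARIFF_USD_PER_KWH = 0.083    # assumed average US industrial tariff
TRAINING_DAYS = 54

gpu_capex = N_GPUS * GPU_PRICE_USD
total_capex = 2 * gpu_capex   # GPUs are assumed to be half of the capital cost

power_kw = N_GPUS * TDP_WATTS * POWER_OVERHEAD / 1_000
energy_kwh = power_kw * 24 * TRAINING_DAYS
energy_cost = energy_kwh * TARIFF_USD_PER_KWH
share = energy_cost / total_capex

print(f"capital cost: ${total_capex / 1e6:.0f}M")
print(f"power draw: {power_kw / 1_000:.1f} MW, energy: {energy_kwh / 1e6:.2f}M kWh")
print(f"electricity: ${energy_cost / 1e6:.2f}M ({share:.2%} of capital cost)")
```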

As you can see, running costs for AI training are negligible. For AI inference, though, that’s a totally different story.

Knowing all this, we can now answer the big question: How large a model can we train with a data center like Colossus?

A Truly Colossal Beast

At 100,000 H100s, Elon’s cluster’s total theoretical peak throughput is enormous, boasting a total value of 1,979 x 10¹⁷ FLOPs per second.

Thus, to train Llama 3.1 405B, it would theoretically take around 3.8 x 10²⁵ / 1.979 x 10²⁰ = 1.92 x 10⁵ seconds = 53 hours, or just 2.2 days.

Yes, you read that right: 2 days to train a state-of-the-art model.

Of course, if we assume that xAI can’t reach a higher MFU than Meta (it would probably be way lower due to the sheer size of Colossus), that number would actually be 9-ish days assuming 500 TFlops/GPU, still nuts for a model of the size and quality of Llama 3.1 405B.
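Both Colossus durations follow from the same division. The sketch below assumes the figures quoted in the text (100,000 GPUs, 3.8 x 10²⁵ total FLOPs, and either the 1,979 TFLOP/s peak or a sustained 500 TFLOP/s per GPU); the helper name is illustrative.

```python
TOTAL_TRAINING_FLOPS = 3.8e25  # Llama 3.1 405B training compute, as quoted above
N_GPUS = 100_000

def training_days(flops_per_gpu: float) -> float:
    """Days needed to deliver the total compute at a given per-GPU throughput."""
    seconds = TOTAL_TRAINING_FLOPS / (N_GPUS * flops_per_gpu)
    return seconds / 86_400

days_at_peak = training_days(1979e12)      # theoretical peak per GPU
days_at_sustained = training_days(500e12)  # assumed sustained throughput per GPU

print(f"at peak: {days_at_peak:.1f} days")
print(f"at 500 TFLOP/s per GPU: {days_at_sustained:.1f} days")
```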

But we can take it a step further. At that scale, xAI could decide to train a much, much larger model. For instance, if we set a budget of 100 days, how large of a model could we train with such a cluster?

Let’s reverse-engineer the numbers.

I’m assuming the training precision is also mixed (BF16/FP16 for the weights, FP32 for the Adam optimizer states).

Running such a cluster at peak performance would mean a total FLOPs budget of 100,000 GPUs x 1,979 x 10¹² FLOPs per second per GPU x 3600 seconds x 24 hours x 100 days = 170,985,600 x 10¹⁹ FLOPs, or 1.71 x 10²⁷ FLOPs.

Using the same scaling-laws formula we saw earlier and fixing the number of training tokens (the size of the training dataset) at the 15 trillion tokens Meta used, that would give us the following:

I’m fixing the data set size because we want to know the largest model we could train for a known state-of-the-art data set such as Meta’s. However, it’s actually preferable to maximize the data set size instead of the model size.

1.71 x 10²⁷ = 6 x {Model_size} x 15 x 10¹² ⇒ Model size = 19 x 10¹² or a 19 trillion parameter model, ten times larger than the alleged maximum size of the current frontier; a truly humongous beast that only a Colossus-sized cluster could build.

Again, this is the maximum theoretical size. If we assume a sustained 500 TFlops/GPU, that gives a 4.8 trillion parameter model, still more than twice the alleged maximum size of the current frontier.
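The reverse calculation is just the scaling-laws formula solved for N. This sketch assumes the values from the text (100,000 GPUs, 100 days, 15 trillion tokens, and either the quoted peak or a sustained 500 TFLOP/s per GPU); the function name is illustrative.

```python
N_GPUS = 100_000
BUDGET_DAYS = 100
N_TOKENS = 15e12  # dataset size fixed at the 15 trillion tokens Meta used

def max_params(flops_per_gpu: float) -> float:
    """Largest N such that 6 x N x D fits within the cluster's compute budget."""
    budget_flops = N_GPUS * flops_per_gpu * BUDGET_DAYS * 86_400
    return budget_flops / (6 * N_TOKENS)

n_at_peak = max_params(1979e12)      # theoretical peak per GPU
n_at_sustained = max_params(500e12)  # assumed sustained throughput per GPU

print(f"at peak: {n_at_peak / 1e12:.1f} trillion parameters")
print(f"at 500 TFLOP/s per GPU: {n_at_sustained / 1e12:.1f} trillion parameters")
```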

If we assume that the leak from TSMC during SEMICON (between 3–5 trillion parameters) is real, this second number is much closer to reality.

Costs-wise, Colossus is also on another planet.

The capital costs sit around $6 billion (at a $30k price tag per GPU; probably closer to $4 billion considering the hefty discounts NVIDIA will have provided to xAI)
The running costs of training that model would be insane, too. Powering such a data center requires roughly 140 MW. Running for 100 days, that would consume 336 million kWh, which at $0.083/kWh costs $27.9 million.

For reference, 140 MegaWatts is enough power to serve electricity to 56,000 homes at the low end and 126,000 homes at the high end.
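For readers who want to check the Colossus power figures, here is a minimal sketch. It reuses the earlier assumptions (700 W TDP doubled for overheads, a tariff of 0.083 per kWh treated as US dollars) and backs out the average per-home load implied by the 56,000 and 126,000 home counts; the names are illustrative.

```python
N_GPUS = 100_000
TDP_WATTS = 700
POWER_OVERHEAD = 2.0        # assumed multiplier for networking, cooling, etc.
TARIFF_USD_PER_KWH = 0.083  # assumed average US industrial tariff
TRAINING_DAYS = 100

power_kw = N_GPUS * TDP_WATTS * POWER_OVERHEAD / 1_000
energy_kwh = power_kw * 24 * TRAINING_DAYS
energy_cost = energy_kwh * TARIFF_USD_PER_KWH

# Average load per home implied by the two home counts quoted in the text.
kw_per_home_high = power_kw / 56_000
kw_per_home_low = power_kw / 126_000

print(f"power: {power_kw / 1_000:.0f} MW, energy: {energy_kwh / 1e6:.0f}M kWh")
print(f"electricity: ${energy_cost / 1e6:.1f}M")
print(f"implied load per home: {kw_per_home_low:.2f} to {kw_per_home_high:.2f} kW")
```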

Will Grok-3 break all records?

With the industry still digesting Grok-2, Elon wouldn’t have built a $4–6 billion data center if he weren’t about to train a humongous model that can also be run at scale.

Thus, Grok-3 could be the first next-generation AI model. It could represent a two-order-of-magnitude increase over the FLOPs budget of the current state-of-the-art, which would also answer the great question:

Is scaling all we need?

Things could get nasty if Grok-3 fails to deliver that long-awaited step-function increase despite having 100 times more compute poured into it than was used to train GPT-4.

If so, maybe The Bitter Lesson isn’t true after all, and all we have left is thoughts and prayers for the companies that have invested almost all their cash into scaling AI.
