Cerebras to light up datacenters in North America and France packed with AI accelerators


Cerebras has begun deploying more than a thousand of its dinner-plate-sized accelerators across North America and parts of France as the startup looks to establish itself as one of the largest and fastest suppliers of AI inference services.

The expansion, confirmed at the HumanX AI conference in Las Vegas, will see Cerebras – by the end of this year – bring online new datacenters in Texas, Minnesota, Oklahoma, and Georgia, along with its first facilities in Montreal, Canada, and France.

Of these facilities, Cerebras will maintain full ownership of the Oklahoma City and Montreal sites, while the remainder are jointly operated under an agreement with Emirati financier G42 Cloud.

The largest of the US facilities will be located in Minneapolis, Minnesota, and will feature 512 of its CS-3 AI accelerators, totaling 64 exaFLOPS of FP16 compute when it comes online in the second quarter of 2025.

Unlike many of the large-scale AI supercomputers and datacenter buildouts announced over the past year, Cerebras’s will be powered by its in-house accelerators.

Announced a year ago this week, Cerebras’s CS-3 systems feature a wafer-scale processor measuring 46,225 mm², which contains four trillion transistors spread across 900,000 cores and 44 GB of SRAM.

Next to the hundreds of thousands of GPUs hyperscalers and cloud providers are already deploying, a thousand-plus CS-3s might not sound like much compute, until you consider that each is capable of 125 petaFLOPS of highly sparse FP16 performance, compared with just 2 petaFLOPS for an H100 or H200 and 5 petaFLOPS for Nvidia’s most powerful Blackwell GPUs.
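For a rough sense of what those paper figures add up to, here’s the back-of-the-envelope arithmetic behind the Minneapolis number and the GPU comparison (all sparse FP16 ratings as quoted above, not sustained real-world throughput):

```python
# Back-of-the-envelope check of the quoted sparse FP16 figures.
CS3_PFLOPS = 125        # per CS-3, as quoted by Cerebras
H100_PFLOPS = 2         # per H100/H200, sparse FP16
BLACKWELL_PFLOPS = 5    # per top-end Blackwell GPU, sparse FP16

minneapolis_systems = 512
total_pflops = minneapolis_systems * CS3_PFLOPS
print(f"512 CS-3s ≈ {total_pflops / 1000:.0f} exaFLOPS")                      # ~64
print(f"H100 equivalents: {total_pflops / H100_PFLOPS:,.0f}")                 # 32,000
print(f"Blackwell equivalents: {total_pflops / BLACKWELL_PFLOPS:,.0f}")       # 12,800
```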

When the CS-3 made its debut, Cerebras was still focused exclusively on model training. Since then, however, the company has expanded its offering to include inference, and claims it can serve Llama 3.1 70B at up to 2,100 tokens a second.

This is possible, in part, because large language model (LLM) inference is primarily memory-bound, and while a single CS-3 doesn’t offer much in the way of capacity, it makes up for that with memory bandwidth, which peaks at 21 petabytes per second. An H100, for reference, offers nearly twice the memory capacity, but just 3.35 TBps of bandwidth. However, this alone only gets Cerebras to around 450 tokens a second.
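The bandwidth argument comes down to a simple rule of thumb: each generated token requires streaming roughly all of the model’s weights from memory once, so single-stream decode speed tops out near bandwidth divided by weight bytes. The sketch below applies that rule to the figures above; the 140 GB weight footprint for a 70B model at FP16 is our assumption, and the result is a ceiling, not a prediction, since it ignores KV-cache traffic, batching, and the cost of sharding a model across multiple wafers.

```python
# First-order, memory-bound decode estimate: tokens/s ≈ bandwidth / weight bytes.
# This is an upper bound only; it ignores KV-cache reads, batching, kernel
# efficiency, and the overhead of splitting a model across several devices.

def decode_ceiling_tok_per_s(params_b: float, bytes_per_param: float,
                             bandwidth_gb_per_s: float) -> float:
    weight_gb = params_b * bytes_per_param          # e.g. 70B * 2 bytes = 140 GB
    return bandwidth_gb_per_s / weight_gb

print(f"H100 @ 3.35 TB/s:   ~{decode_ceiling_tok_per_s(70, 2, 3_350):.0f} tok/s")        # ~24
print(f"CS-3 SRAM @ 21 PB/s: ~{decode_ceiling_tok_per_s(70, 2, 21_000_000):,.0f} tok/s") # ~150,000
```

The gap between that theoretical ceiling and the roughly 450 tokens a second Cerebras actually reports without speculative decoding is a reminder of how much the simple formula leaves out.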

As we’ve previously discussed, the remaining performance is achieved via a technique called speculative decoding, which uses a small draft model to generate the initial output while a larger model acts as a fact-checker to preserve accuracy. So long as the draft model doesn’t make too many mistakes, the performance improvement can be dramatic: up to a 6x increase in tokens per second, according to some reports.
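For the curious, here’s a toy sketch of the idea as a greedy accept/reject loop. The two stand-in “models” are hypothetical placeholders, and a real system verifies an entire draft chunk in a single batched forward pass of the large model; treat this as an illustration of the technique, not Cerebras’s implementation.

```python
# Toy sketch of greedy speculative decoding: a cheap draft model proposes a
# chunk of tokens, the expensive target model keeps the longest agreeing prefix
# and contributes one corrected token on a mismatch.

from typing import Callable, List

Token = str
Model = Callable[[List[Token]], Token]  # returns the next (greedy) token

def speculative_decode(draft: Model, target: Model, prompt: List[Token],
                       max_new: int = 16, chunk: int = 4) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft model cheaply proposes a chunk of tokens.
        proposed, ctx = [], list(out)
        for _ in range(chunk):
            tok = draft(ctx)
            proposed.append(tok)
            ctx.append(tok)
        # 2. Target model verifies; keep the longest prefix it agrees with.
        accepted = 0
        for i, tok in enumerate(proposed):
            if target(out + proposed[:i]) == tok:
                accepted += 1
            else:
                break
        out += proposed[:accepted]
        # 3. The target contributes one token per pass, so progress never stalls.
        if len(out) - len(prompt) < max_new:
            out.append(target(out))
    return out[:len(prompt) + max_new]

# Tiny demo: the "target" counts upward; the "draft" agrees except after "3".
def target(seq):  # hypothetical stand-in for the large model
    return str(int(seq[-1]) + 1)

def draft(seq):   # hypothetical stand-in for the small draft model
    return "0" if seq[-1] == "3" else str(int(seq[-1]) + 1)

print(speculative_decode(draft, target, ["0"], max_new=8))
# ['0', '1', '2', '3', '4', '5', '6', '7', '8']
```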

Amid a sea of GPU bit barns peddling managed inference services, Cerebras is leaning heavily on its accelerator’s massive bandwidth advantage and experience with speculative decoding to differentiate itself, especially as “reasoning” models like DeepSeek-R1 and QwQ become more common.

Because these models rely on chain-of-thought reasoning, a response could require thousands of tokens of “thought” to reach a final answer, depending on its complexity. So the faster you can churn out tokens, the less time folks are left waiting for a response, and, presumably, the more they’re willing to pay for the privilege.
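As a very rough illustration of what that means for wait times (the 5,000-token thinking budget and the 50-tokens-a-second baseline are our assumptions; the other two rates are the Cerebras figures quoted above):

```python
# Hypothetical wait times for a reasoning model that "thinks" for 5,000 tokens.
reasoning_tokens = 5_000               # assumed chain-of-thought budget
for tok_per_s in (50, 450, 2_100):     # assumed GPU service; Cerebras without/with speculative decoding
    print(f"{tok_per_s:>5} tok/s -> {reasoning_tokens / tok_per_s:6.1f} s of thinking")
```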

Of course, with just 44 GB of memory per accelerator, supporting larger models remains Cerebras’s sore spot. Llama 3.3 70B, for instance, requires at least four of Cerebras’s CS-3s to run at full 16-bit precision. A model like Llama 3.1 405B – which Cerebras has demoed – would need more than 20 to run with a meaningful context size. As fast as Cerebras’s SRAM might be, the company is still some way from serving multi-trillion-parameter models at anything close to the speeds it’s advertising.
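The weight math alone, at two bytes per parameter for 16-bit precision, tells most of that story; KV cache and activations only push the counts higher, which is how a usable 405B deployment ends up needing more than 20 systems:

```python
# How many 44 GB CS-3s do the FP16 weights alone occupy? (2 bytes per parameter;
# KV cache, activations, and replication for throughput are extra.)
import math

SRAM_PER_CS3_GB = 44

def cs3s_for_weights(params_billion: float, bytes_per_param: int = 2) -> int:
    return math.ceil(params_billion * bytes_per_param / SRAM_PER_CS3_GB)

print(cs3s_for_weights(70))    # 4  -> 140 GB of weights
print(cs3s_for_weights(405))   # 19 -> 810 GB of weights, before any context
```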

With that said, the speed of Cerebras’s inference service has already helped it win contracts with Mistral AI and, most recently, Perplexity. This week, the company announced yet another customer win with market intelligence platform AlphaSense, which, we’re told, plans to swap three closed-source model providers for an open model running on Cerebras’s CS-3s.

Finally, as part of its infrastructure buildout, Cerebras aims to extend API access to its accelerators to more developers through an agreement with model repo Hugging Face.

Cerebras’s inference service is now available as part of Hugging Face’s Inference Providers line-up, which provides access to a variety of inference-as-a-service providers, including SambaNova, TogetherAI, Replicate, and others, via a common interface and API.
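In practice, that makes trying Cerebras roughly a one-argument change in the huggingface_hub client. The sketch below is our rendering of that flow based on Hugging Face’s Inference Providers documentation; the provider string, model name, and token are placeholders to check against the current docs rather than a guaranteed configuration.

```python
# Rough sketch: calling Cerebras via Hugging Face Inference Providers.
# The model ID and HF_TOKEN environment variable below are placeholders.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(provider="cerebras", api_key=os.environ["HF_TOKEN"])

response = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",   # example model ID
    messages=[{"role": "user", "content": "Summarize wafer-scale inference in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The common interface means swapping between Cerebras and the other providers in the line-up is largely a matter of changing that provider string. ®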


