Meta has announced two new GPU clusters that will see the firm provide improved infrastructure for dealing with the taxing compute demands of artificial intelligence (AI) systems.
Marking a “major investment in Meta’s AI future,” the firm announced the addition of two 24k GPU data center-scale clusters that boast heightened throughput and reliability for AI workloads.
These GPUs will support both Meta’s current Llama 2 model and its upcoming Llama 3 model, as well as the company’s wider research and development projects across generative AI and other areas.
The announcement was described by the firm as “one step in our ambitious infrastructure roadmap”, and will see the tech giant acquire 350,000 Nvidia H100 GPUs to expand its portfolio.
Meta said the expansion project will deliver a total compute power equivalent to nearly 600,000 H100s upon completion.
“As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow’s needs,” the firm said in a statement.
“That’s why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond.”
Meta focused on building “end-to-end” AI systems in its latest pair of GPU clusters, emphasizing researcher and developer experience as a means of guiding production.
With high-performance network fabrics working alongside 24,576 Nvidia Tensor Core H100 GPUs, these new clusters are able to support “larger and more complex” models than Meta’s previous RSC clusters.
One of the new clusters was built with “remote direct memory access (RDMA) over converged Ethernet (RoCE),” while the other features an “Nvidia Quantum 2 InfiniBand fabric,” both geared towards improved network functionality.
Both clusters were built using Meta’s in-house open GPU hardware platform Grand Teton, which itself is built on generations of AI that integrate “power, control, compute, and fabric interfaces into a single chassis for better overall performance.”
“Grand Teton allows us to build new clusters in a way that is purpose-built for current and future applications at Meta,” the firm said.
Generative AI also consumes data in huge volumes, the firm said, meaning the next generation of GPUs need to take storage into account.
Meta’s “home-grown” Linux storage system does this in its latest GPU cluster offerings, which will work in parallel with a version of Meta’s Tectonic distributed storage solution.
Though Meta reports that there were initial performance issues with these larger clusters, changes to its internal job scheduler helped optimize both GPU clusters to “achieve great and expected performance.”