AWS opens cluster of 40K Trainium AI accelerators to researchers

Amazon wants more people building applications and frameworks for its custom Trainium accelerators and is making up to 40,000 chips available to university researchers under a $110 million initiative announced on Tuesday.

Dubbed “Build on Trainium,” the program will provide compute hours to AI academics developing new algorithms, looking to increase accelerator performance, or scale compute across large distributed systems.

“A researcher might invent a new model architecture or a new performance optimization technique, but they may not be able to afford the high-performance computing resources required for a large-scale experiment,” AWS explained in a recent blog post.

And perhaps more importantly, the fruits of this labor are expected to be open-sourced by researchers and developers so that they can benefit the machine learning ecosystem as a whole.

As altruistic as this all might sound, it’s to Amazon’s benefit: The cloud giant’s custom silicon, which now spans the gamut from CPUs and SmartNICs to dedicated AI training and inference accelerators, was originally designed to improve the efficiency of its internal workloads.

Developing low-level application frameworks and kernels isn’t a big ask for such a large company. However, things get trickier when you start opening up the hardware to the public, which in large part lacks these resources and expertise, necessitating a higher degree of abstraction. This is why we’ve seen Intel, AMD, and others gravitate toward frameworks like PyTorch or TensorFlow to hide the complexity associated with low-level coding. We’ve certainly seen this with AWS products like SageMaker.

Researchers, on the other hand, are often more than willing to dive into low-level hardware if it means extracting additional performance, uncovering hardware-specific optimizations, or simply getting access to the compute necessary to move their research forward. What is it they say about necessity being the mother of invention?

“The knobs of flexibility built into the architecture at every step make it a dream platform from a research perspective,” Christopher Fletcher, an associate professor at the University of California at Berkeley, said of Trainium in a statement.

It isn’t clear from the announcement whether all 40,000 of those accelerators are Amazon’s first- or second-generation parts. We’ll update if we hear back on this.

The second-generation parts, announced roughly a year ago during Amazon’s Re:Invent event, saw the company shift focus toward everyone’s favorite flavor of AI: large language models. As we reported at the time, Trainium2 is said to deliver 4x faster training performance than its predecessor and triple its memory capacity.

Since any innovations uncovered by researchers — optimized compute kernels for domain-specific machine learning tasks, for example — will be open-sourced under the Build on Trainium program, Amazon stands to benefit from its crowdsourcing of software development.

Naturally, throwing hardware at academics is a tale as old as university computer science programs, and to support these efforts, Amazon is extending access to technical education and enablement programs to get researchers up to speed. This will be handled through a partnership with the Neuron Data Science community, an organization led by Amazon’s Annapurna Labs team. ®
