IBM says it’s been running ‘AI supercomputer’ since May but chose now to tell the world


IBM is the latest tech giant to unveil its own “AI supercomputer,” this one composed of a bunch of virtual machines running within IBM Cloud.

The system, known as Vela, which the company claims has been online since May last year, is touted as IBM’s first AI-optimized, cloud-native supercomputer, built for developing and training large-scale AI models.

Before anyone rushes off to sign up for access, IBM stated that the platform is currently reserved for use by the IBM Research community. In fact, Vela has become the company’s “go-to environment” for researchers creating advanced AI capabilities since May 2022, including work on foundation models, it said.

IBM states that it chose this architecture because it gives the company greater flexibility to scale up as required, and also the ability to deploy similar infrastructure into any IBM Cloud datacenter around the globe.

But Vela is not running on any old standard IBM Cloud node hardware; each node is a twin-socket system with 2nd Gen Xeon Scalable processors configured with 1.5TB of DRAM and four 3.2TB NVMe flash drives, plus eight 80GB Nvidia A100 GPUs, the latter connected by NVLink and NVSwitch.
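
For a sense of what that spec looks like from software, here is a minimal Python sketch using the NVIDIA management library bindings (pynvml, installed via nvidia-ml-py) to inventory a node’s GPUs and active NVLink links. It is illustrative only, not IBM’s tooling, and the expected figures in the comments come from the description above.

```python
# Minimal sketch: inventory a node's GPUs and NVLink status with pynvml.
# Illustrative only -- not IBM's tooling. Expected values are taken from
# the article's description of a Vela node (8x A100 80GB, NVLink-connected).
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f"GPUs found: {count}")

for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older bindings return bytes, newer return str
        name = name.decode()
    mem_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1024**3
    # Count active NVLink lanes; unsupported link indices raise an NVML error.
    links = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                links += 1
        except pynvml.NVMLError:
            break
    print(f"GPU {i}: {name}, {mem_gb:.0f} GB, {links} active NVLink links")

pynvml.nvmlShutdown()
```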

This makes the Vela infrastructure closer to that of a high-performance computing (HPC) site than typical cloud infrastructure, despite IBM’s insistence that it was taking a different path because “traditional supercomputers weren’t designed for AI.”

It is also notable that IBM chose to use x86 processors rather than its own Power10 chips, especially as the latter were touted by Big Blue as ideally suited for memory-intensive workloads such as large-model AI inferencing.

The nodes are interconnected using multiple 100Gbps network interfaces arranged in a two-level Clos structure, designed to provide multiple data paths for redundancy.
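
The redundancy argument is easy to see with a toy model: in a two-level Clos (leaf-spine) fabric, every leaf switch connects to every spine, so traffic between nodes on different leaves has one equal-cost path per spine. The switch counts in the sketch below are invented for illustration, not Vela’s actual fabric dimensions.

```python
# Toy model of path redundancy in a two-level Clos (leaf-spine) fabric.
# Every leaf connects to every spine, so two leaves are joined by one
# distinct path per spine. Switch counts here are illustrative only.

def clos_paths(num_spines: int, src_leaf: str, dst_leaf: str) -> list[tuple]:
    """Enumerate leaf -> spine -> leaf paths between two leaf switches."""
    if src_leaf == dst_leaf:
        return [(src_leaf,)]  # same leaf: no spine hop needed
    return [(src_leaf, f"spine{s}", dst_leaf) for s in range(num_spines)]

paths = clos_paths(num_spines=4, src_leaf="leaf0", dst_leaf="leaf3")
print(f"{len(paths)} equal-cost paths:")  # one per spine
for path in paths:
    print(" -> ".join(path))
```

Losing a spine switch or an uplink therefore removes one path while the others keep carrying traffic, which is the redundancy property the Clos layout buys.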

In a blog post, IBM explains its reasons for opting for a cloud-native architecture, which center on cutting the time required to build and deploy large-scale AI models as much as possible.

“Do we build our system on-premises, using the traditional supercomputing model, or do we build this system into the cloud, in essence building a supercomputer that is also a cloud?” the blog asks.

IBM claims that by adopting the latter approach it has compromised somewhat on performance but gained considerably on productivity. This comes down to the ability to configure all the necessary resources through software, as well as having access to services on the wider IBM Cloud, such as loading data sets onto IBM’s Cloud Object Store instead of building dedicated storage infrastructure.
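
As a rough illustration of that data-loading workflow, the sketch below uses the ibm-cos-sdk Python package, which exposes an S3-style client for IBM Cloud Object Storage. The credentials, endpoint, bucket, and file names are placeholders, not anything from IBM’s setup.

```python
# Minimal sketch: upload a training data shard to IBM Cloud Object Storage
# using the ibm-cos-sdk package (`pip install ibm-cos-sdk`), which provides
# an S3-compatible client. All credentials and names below are placeholders.
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="YOUR_API_KEY",                    # placeholder
    ibm_service_instance_id="YOUR_COS_INSTANCE_CRN",  # placeholder
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

# upload_file handles multipart transfers for large objects automatically.
cos.upload_file("train_shard_000.tar", "my-training-bucket",
                "data/train_shard_000.tar")
```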

Big Blue also said it opted to operate all the nodes in Vela as virtual machines rather than bare metal instances as this made it simpler to provision and re-provision the infrastructure with different software stacks required by different AI users.

“VMs would make it easy for our support team to flexibly scale AI clusters dynamically and shift resources between workloads of various kinds in a matter of minutes,” IBM’s blog explains.

But the company claims that it found a way to optimize performance and reduce the virtualization overhead to less than 5 percent, close to bare metal performance.
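
For clarity, an overhead figure like that is straightforward arithmetic: the fraction of bare metal throughput lost when the same workload runs inside a VM. The sample numbers below are invented for illustration.

```python
# Back-of-the-envelope sketch of how a virtualization overhead figure is
# derived: compare workload throughput on bare metal against the same
# workload inside a VM. The sample numbers are hypothetical.

def overhead_pct(bare_metal: float, virtualized: float) -> float:
    """Relative performance lost to virtualization, as a percentage."""
    return (bare_metal - virtualized) / bare_metal * 100

# e.g. training throughput in samples/second (invented values)
print(f"{overhead_pct(1000.0, 955.0):.1f}% overhead")  # 4.5%, under the 5% figure
```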

This included configuring the bare metal host for virtualization with support for Virtual Machine Extensions (VMX), single-root IO virtualization (SR-IOV) and huge pages, among other unspecified hardware and software configurations.
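
On a Linux KVM host, the first three of those features are visible through standard /proc and /sys interfaces, as the sketch below shows. It is a generic check, not IBM’s configuration, and exact paths can vary by distribution.

```python
# Minimal sketch: check a Linux host for the features the post mentions --
# VMX (hardware virtualization), huge pages, and SR-IOV virtual functions.
# Generic /proc and /sys reads, not IBM's tooling; run as an ordinary user.
from pathlib import Path

# VMX appears as a CPU flag on Intel hosts ("svm" is the AMD equivalent).
cpuinfo = Path("/proc/cpuinfo").read_text()
print("VMX supported:", "vmx" in cpuinfo.split())

# Huge pages reserved by the kernel are reported in /proc/meminfo.
for line in Path("/proc/meminfo").read_text().splitlines():
    if line.startswith("HugePages_Total"):
        print(line)

# SR-IOV-capable NICs expose sriov_numvfs in sysfs; nonzero means VFs are enabled.
for vf_file in Path("/sys/class/net").glob("*/device/sriov_numvfs"):
    print(f"{vf_file.parent.parent.name}: {vf_file.read_text().strip()} virtual functions")
```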

Further details of the Vela infrastructure can be found on IBM’s blog.

IBM is not the only company using the cloud to host an AI supercomputer. Last year, Microsoft unveiled its own platform using Azure infrastructure combined with Nvidia’s GPU accelerators, network kit, and its AI Enterprise software suite. This was expected to be available for Azure customers to access, but no time frame was specified.

Other companies that have been building AI supercomputers, but following the traditional on-premises infrastructure route, include Meta and Tesla. ®



