Hardware keeps getting faster, but it’s still worth taking a step back periodically and revisiting your code. You might just uncover a little tweak that wrings more efficiency or throughput out of your systems than you’d expect.
That’s exactly what researchers at the Cheriton School of Computer Science at the University of Waterloo have managed to do. Their change amounts to roughly 30 lines of code added to the Linux kernel’s network stack, and they say it could curb datacenter power consumption by up to 30 percent.
Those changes have now made their way to the public as part of the 6.13 kernel release.
The project was born out of a desire to explain how user-level networking approaches could achieve such outsized performance compared to kernel-level ones, Professor Martin Karsten told El Reg.
Traditionally, he explained, Linux networking has been interrupt driven: as new data arrives over the network, the NIC raises an interrupt, and the CPU core pauses its current task to process the packets. That approach is well suited to environments where multiple users run jobs simultaneously.
“In the old school system the operating system was a facilitator of multi-user activities,” Karsten said. “You have a server; you have lots of people logged in doing all sorts of little things; and the operating system constantly needs to look after everybody and establish fairness.”
A lot has changed since then. Many modern throughput-oriented workloads — think reverse proxies or caching — can consume resources equivalent to multiple traditional systems. For these kinds of apps, Karsten tells us, it can be more efficient for the application to poll the network when it’s ready to take on more work.
“My application either has work or it doesn’t have work. If it does have work, why would I bother looking at the network? I’m already busy,” Karsten explained. “I do the work that I have, and then I’ll look at the network.”
With fewer interrupt requests, or IRQs, to service, the host CPU can spend more time crunching numbers and less time being pulled away to handle packets it isn’t ready to process.
This is already possible in user space, but it does take some work. You need to know whether your application will benefit from this approach and then implement it, Karsten added.
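In fact, Linux has offered a per-socket hook for this since kernel 3.11: the SO_BUSY_POLL socket option tells the kernel to busy-poll the device queue for a given number of microseconds before falling back to sleeping on an interrupt. Here's a minimal sketch of opting in; the UDP socket and the 50 µs budget are illustrative assumptions, not tuning advice:

```c
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Busy-poll the receive queue for up to 50 µs before sleeping on an
     * interrupt. On older kernels, setting this may require CAP_NET_ADMIN. */
    int usecs = 50;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");

    /* ... recv() as usual; reads now busy-poll before blocking ... */
    return 0;
}
```

System-wide equivalents exist as the net.core.busy_read and net.core.busy_poll sysctls, but either way the burden of knowing when polling pays off falls on whoever runs the application.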
Constantly polling the network also comes with overheads of its own. “When traffic is low, you’re still burning through your core, which is extremely inefficient with respect to power consumption,” he added.
With this insight, the research team looked for a way to get the best of both. What they came up with was a kernel patch introducing adaptive polling. During periods of heavy traffic, the host polls the network for a new chunk of data as soon as it has finished processing the last. If traffic dies down and there are no new numbers to crunch, the system reverts to an interrupt-driven approach, saving energy in the process. More importantly, because all of this is handled in the kernel, it’s essentially automatic.
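The user-facing plumbing for this is already visible in mainline: since 6.9, an epoll instance can be switched into preferred-busy-poll mode via an ioctl Damato contributed, and the 6.13 work adds a per-NAPI irq-suspend-timeout parameter, set over the netdev netlink API, that keeps the NIC’s interrupts masked while the application keeps up with its queue. A minimal sketch of the epoll side, assuming a 6.9-or-newer kernel and using purely illustrative parameter values:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
/* Mirror the kernel's uapi definitions from <linux/eventpoll.h> (6.9+)
 * for libcs that don't carry them yet. */
struct epoll_params {
    uint32_t busy_poll_usecs;   /* busy-poll this long before sleeping */
    uint16_t busy_poll_budget;  /* max packets per busy-poll attempt */
    uint8_t  prefer_busy_poll;  /* 1 = keep polling rather than re-arm IRQs */
    uint8_t  __pad;             /* must be zero */
};
#define EPIOCSPARAMS _IOW(0x8A, 0x01, struct epoll_params)
#endif

int main(void)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return 1;
    }

    /* Illustrative numbers: poll for up to 64 µs, 64 packets per attempt.
     * Budgets above 64 require CAP_NET_ADMIN. */
    struct epoll_params params = {
        .busy_poll_usecs  = 64,
        .busy_poll_budget = 64,
        .prefer_busy_poll = 1,
    };

    if (ioctl(epfd, EPIOCSPARAMS, &params) < 0)
        perror("EPIOCSPARAMS");  /* expected on kernels older than 6.9 */

    /* ... add sockets with epoll_ctl() and run the usual epoll_wait()
     * loop; with IRQ suspension configured on the NIC's NAPI instance,
     * interrupts stay off while the loop keeps finding work ... */
    return 0;
}
```

With prefer_busy_poll set and an irq-suspend-timeout configured for the device’s NAPI instance, interrupts remain suspended for as long as successive epoll_wait() calls keep returning events; the timeout is the safety net that re-arms them if the application stalls.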
The researchers first demonstrated a rudimentary implementation of this capability in a paper published in 2023. At that point, Karsten realized the work could have implications beyond a research paper and began working with Linux kernel developer and Fastly engineer Joe Damato to integrate the functionality.
According to Karsten, the performance implications of the patch could be quite significant. Early testing showed it could boost throughput by up to 45 percent without compromising tail latency. Meanwhile, for the same load, he said the reduction in resources previously wasted on interrupt-heavy network overheads could curb power consumption by as much as 30 percent.
Of course, those are best-case scenarios, and not every application will see that level of improvement. Karsten tells us that throughput-heavy apps should benefit the most. “There are applications out there such as Memcached… that don’t do much else but network communication.”
Even if the savings are much lower on average, he argues that still adds up to a considerable amount of power across all the Linux boxes in the wild.
Speaking of which, these savings won’t be realized overnight, as it could take a while before a kernel sporting the modifications makes its way into the kind of long-term-support (LTS) releases favored by enterprise customers.
Unfortunately, even when the kernel does see widespread adoption in the datacenter, it may not do much for AI clusters. That’s because AI and HPC applications have long preferred a technology called remote direct memory access, or RDMA, which bypasses the kernel’s network stack altogether.
“This approach eliminates the need for CPU cycles in network data processing, forming the foundation of high-performance interconnect technologies,” Gilad Shainer, SVP of Networking at NVIDIA, told The Register.
Even so, for Karsten, the patch underscores the importance of revisiting software stacks not just at the kernel or application level, but also in middleware, libraries, and everything in between. “I think there are so many inefficiencies that we can eradicate and I think the time is soon,” he said.
“Nobody spent any money on this in the past because why worry if hardware is twice as fast next year,” he added. “But if that party ever stops, then we better look at the software.”
And this is exactly where Karsten is planning to focus his research going forward. ®