UCARE Research Group Catalogs “Fail-Slow” Nightmares for Large-Scale Systems

Cloud providers, data centers, computer clusters, and other large-scale computer systems share a common boogeyman: the fail-slow. Unlike its more dramatic cousin the fail-stop, which simply shuts down a software program or hardware component, the fail-slow is a subtler and more nefarious culprit, throttling performance in mysterious ways. Tracking down the source of a fail-slow fault can take hundreds of valuable hours, and the root cause can be the last thing you’d expect: a single faulty cooling fan, high altitude, or even a poorly placed desk chair.

UCARE, a UChicago Computer Science systems research group led by Haryadi Gunawi, specializes in just these types of problems: faults that are vanishingly rare on a single machine but can become a major nuisance at a larger scale. Fail-slow hardware is a perfect example, yet it remains largely unacknowledged and understudied by the CS community.

“If we want to build robust software we need to understand the failure mode of our hardware,” said Gunawi, a Neubauer Family Assistant Professor in the Department of Computer Science. “To me, this is a new failure mode that many people should think about when they want to build large-scale systems.”

So in a recent paper for the USENIX Conference on File and Storage Technologies (FAST '18), titled “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” Gunawi and graduate student Riza Suminto collected over 100 fail-slow horror stories from operators of large-scale systems at universities, national laboratories, and private companies. The author list for the final paper represents an impressive cross-section of these categories, from Twitter, Huawei, and Pure Storage to the University of Utah, the University of Chicago Research Computing Center, and the Argonne and Los Alamos National Laboratories.

The survey found little overlap between the root causes of fail-slow events, reinforcing the unpredictability of these faults. Respondents reported failures at every point in the hardware chain, from storage to CPU to memory to network, that dragged down performance across the full system. Most frustrating were the “cascading root causes,” where an original event as seemingly innocuous as a broken fan or a clogged air filter set off a Rube Goldberg-style chain of events that could cripple an entire cluster.

For example, in one report, the failure of a single fan caused the other fans in a cooling system to work at maximum power, creating excess noise and vibration that slowed disk performance, which in turn slowed software processes. One center located at an altitude of 7,500 feet saw a cooling defect that affected its CPUs, a bug that wasn’t noticeable during the manufacturer’s sea-level tests. And in a particularly low-tech example, a technician rocking in an office chair loosened disk drives in a stack, creating a system-wide ripple effect that was impossible to diagnose through error logs.

What many of these stories had in common was a very rare event or a minute variation in hardware manufacturing, magnified in a large-scale system to become a serious issue. The authors recommend more frequent fault tests for common fail-slow causes, along with improved transparency and data collection around both hardware and software performance.

“The complexity of our software and hardware ecosystems in the cloud scale is outpacing our efforts in debugging, verification, testing of these systems,” Gunawi said. “When we talk about a scale of thousands of machines, the probability that you see one or a few hardware components start limping is actually not that small.”
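
A back-of-the-envelope calculation shows why scale matters here. The sketch below is an illustration, not something taken from the paper: it assumes each hardware component independently has some small probability of running slow at a given moment, and the 0.1% figure and the function name are hypothetical values chosen only for the example.

```python
def prob_at_least_one_slow(n_components: int, p_slow: float) -> float:
    """Chance that at least one of n independent components is fail-slow.

    Assumes every component limps with the same probability p_slow,
    independently of the others (a simplifying assumption).
    """
    return 1.0 - (1.0 - p_slow) ** n_components


if __name__ == "__main__":
    # Hypothetical per-component probability of limping: 0.1%.
    p = 0.001
    for n in (1, 100, 1_000, 10_000):
        print(f"{n:>6} components: {prob_at_least_one_slow(n, p):.4f}")
```

Under these assumptions, a fault that is negligible on one machine becomes more likely than not once a system reaches about a thousand components. Real fail-slow faults are not independent (the cascading cases above are the opposite), so this is only a rough lower-effort intuition, not a model of any surveyed system.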

The paper’s claim to be voicing an “unspoken truth” has so far been borne out by coverage in ZDNet and on CS blogs, as well as by conversations Gunawi has had at conferences with engineers from some of the largest tech companies. By sharing these stories publicly, Gunawi hopes the paper will draw more attention to these faults and energize computer scientists to find new ways of preventing a common cause of large-scale system nightmares.

“Many people still don't believe this problem, only large-scale operators believe this problem, and the goal is just to say, ‘believe us.’ That's why all these authors signed on to this paper,” Gunawi said. “If we can convince the community that fail-slow hardware is real, I bet the community will be able to deal with this problem, because we are all smart people.”