
The concurrency bug is one of the computer programmer’s most notorious enemies. Occurring when two pieces of code execute simultaneously, producing a corrupted result, these bugs are easy to make but hard to find and fix. Concurrency bugs have been blamed for medical accidents, massive regional blackouts, and online theft. When researchers find a new method for finding concurrency bugs, they consider it a success if it locates and corrects even a few errors in software.
But last month at the 2019 ACM Symposium on Operating Systems Principles (SOSP) meeting in Canada, a paper from UChicago CS, Microsoft Research, and the University of California, Berkeley revealed a new approach that captured thousands of concurrency bugs. The accomplishment was so far beyond past attempts, it earned the conference’s Best Paper award, a prestigious honor at this very selective, high-impact biennial conference.
Led by Guangpu Li, a graduate student advised by UChicago CS Professor Shan Lu, the paper presented an out-of-the-box approach to detecting concurrency bugs that outperformed decades of research on the topic.
“It's a very simple idea for a very challenging problem,” Lu said. “People have been looking at this problem for more than 40 years, and from that very early time, it was the same idea: that you just analyze the program using very expensive algorithms and compute the logical time. What we're saying is a different idea, and it's very simple. We were worried that it was too simple...but then the result is very good.”
The paper focused on a particular type of concurrency bug called thread-safety violations, or TSVs, which often evade established testing methods and show up in production code. Previous approaches for finding concurrency bugs either require computationally intensive analysis of code to find potential conflicts, or require running tests many times over to sniff out bugs.
Instead of further tweaks to these techniques, Li and his co-authors tried a totally new approach, seeking a middle ground with low overhead and high accuracy, detecting many more bugs without raising the false positive rate. Their concurrency testing tool first detects “near-misses,” where two parts of the code execute on the same object close to each other in time. It then injects delays of different durations to only these potentially problematic pairs, ruling out the rest, in an attempt to detect whether a true concurrency bug might occur. Depending on how effective each delay injection is — for example, does it make two code elements execute closer to each other or not — the delay-injection plan is further refined. The two-step process can analyze a new program in one run, saving developers both time and the cost of running multiple tests or expensive program analysis.
“In the first step, we observe two pieces are code that are close by, and in the next step, you put in some delays to try to push them to execute together,” Lu said. “So you don’t need to do any complicated analysis. It's basically just a very lightweight approach that turns out to work very well.”