For decades, the number one mission of databases was speed. Researchers and software developers raced to design systems that sifted through ever larger and more complex datasets under the hood, returning desired answers to their users as quickly as possible. With Moore’s Law scaling up the available computation, resource use wasn’t usually a concern. But that luxury is coming to an end, and resource efficiency is a new priority as more computation shifts to the pay-for-compute cloud and to remote devices.
Aaron Elmore, assistant professor at UChicago CS, develops database models that address this need, giving users the power to sacrifice speed for reduced resource use and cost. His approach, intermittent query processing (IQP), grafts machine learning prediction onto database processing, providing more efficient computation for systems working with bursty data or intermittent monitoring. As a new recipient of the CAREER award, the National Science Foundation's most prestigious award in support of early-career faculty, Elmore will continue designing these innovative systems for data-driven applications.
[Read about the five UChicago CS faculty who received NSF CAREER awards in the 2021 cycle.]
Many modern technologies, such as Internet of Things devices, sensors, and e-commerce platforms, produce “bursty data” — data that arrives in spaced-out, unpredictable packets. But databases are typically designed either for relatively static datasets that are updated infrequently or, more recently, for steady streams of data. Systems serve these workloads with batch processing or continuous query strategies, but bursty data is a poor fit for either framework.
“Database systems traditionally were designed such that when you ask them to do something, they do everything they can do to get you that answer right now, or if data is constantly coming in, as soon as they get data they update the answer,” Elmore said. “My vision was to think about what we can do when we know that either the data, or a user’s interest, is going to be bursty. In this case, how can we be smart about deferring work to save resources?”
An example might be a bank keeping track of customer accounts and wanting to monitor balances that are higher than the average. While the overall average can be easily updated as new records are added to the database, searching for outliers requires comparing every account to the average every time it changes, a more computationally intensive task. IQP analyzes and separates out these “easy” and “hard” tasks, and then schedules them at the frequency that meets the user’s expectations, whether that’s determined by time, cost, or available computation.
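The bank example above can be sketched in a few lines of code. This is a hedged illustration, not Elmore's actual system — the class and parameter names are hypothetical — but it shows the core split: the average is cheap to maintain incrementally on every update, while the full above-average scan is expensive and can be deferred until a scheduled refresh.

```python
class DeferredOutlierMonitor:
    """Illustrative sketch: maintain the average eagerly ("easy" task),
    but defer the full outlier scan ("hard" task) until several updates
    have accumulated, trading answer freshness for saved computation."""

    def __init__(self, defer_n=3):
        self.balances = {}   # account -> current balance
        self.total = 0.0     # running sum, kept up to date incrementally
        self.pending = 0     # updates since the last outlier refresh
        self.defer_n = defer_n
        self.outliers = set()

    def update(self, account, balance):
        # "Easy" task: O(1) incremental maintenance of the average.
        self.total += balance - self.balances.get(account, 0.0)
        self.balances[account] = balance
        self.pending += 1
        # "Hard" task: the O(n) scan runs only on the deferred schedule.
        if self.pending >= self.defer_n:
            self.refresh_outliers()

    def refresh_outliers(self):
        avg = self.total / len(self.balances)
        self.outliers = {a for a, b in self.balances.items() if b > avg}
        self.pending = 0
```

In a real IQP system the refresh schedule would be chosen to meet the user's time, cost, or computation target rather than a fixed update count, but the division of work is the same.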
“It might be a case where you're on an edge or a battery-powered device, and you have limited resources. Or you could be on a cloud, where you’re paying for everything that you do,” Elmore said. “We wanted a knob that somebody could turn and say, ‘if I slow things down, or if I'm willing to trade this off, can I save some amount of money?’”
Elmore’s system folds in machine learning to help make these decisions by predicting when new data will arrive and how long different tasks will take. Because the database language SQL is declarative — users specify what they want, not how to do it — these estimates aren’t simple, Elmore said. So IQP builds a model on previous runs of the query, predicting future runtimes and resource usage. Users can then make their choices about how much they prioritize cost versus performance, and the database will automatically find the level of operation that satisfies those high-level goals.
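A toy sketch conveys the flavor of that decision loop. All names here are hypothetical assumptions for illustration: a simple model estimates the next runtime from past runs of the query, and the scheduler picks the longest deferral interval that still keeps answers within a user-specified staleness bound — the "knob" between cost and performance.

```python
def predict_runtime(history):
    """Estimate the next run's duration as the mean of past runtimes.
    (Stand-in for a learned model built from previous query runs.)"""
    return sum(history) / len(history)

def choose_interval(history, max_staleness, candidates):
    """Pick the largest refresh interval (fewest runs, lowest cost)
    whose worst-case answer age -- the interval plus the predicted
    runtime -- still fits within the user's staleness bound."""
    est = predict_runtime(history)
    feasible = [c for c in candidates if c + est <= max_staleness]
    # Fall back to the most aggressive schedule if none fits.
    return max(feasible) if feasible else min(candidates)
```

Turning the knob here means loosening `max_staleness`: a larger bound admits longer intervals, so the query runs less often and consumes fewer resources at the price of staler answers.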
Thus far, Elmore has created early versions of IQP in Spark and with the programming language Rust. The latter project, nicknamed CrustyDB, is used in his Introduction to Database Systems course, co-taught with Assistant Professor Raul Castro Fernandez. The IQP system is also part of a project with Sanjay Krishnan and Michael Franklin called CrocodileDB — so named because crocodiles remain motionless for long periods of time before they quickly strike, just like a database running intermittent queries.
The research involved in developing these systems goes hand in hand with Elmore’s teaching and mentoring work. The pedagogical databases he uses to teach database principles to computer science and Master’s in Computational Analysis and Public Policy (MS-CAPP) students include many IQP features, allowing students with an interest in database research to easily contribute to the project. Elmore has also taught younger students database basics as a mentor in the “BigDataX: From Theory to Practice in Big Data computing at eXtreme Scales” REU program and through the compileHer student group and their annual tech capstone event for middle school girls.
“As the value of data continues to grow, database systems are increasingly a critical part for teams working on data science, AI, big data, or IoT platforms,” Elmore said. “As faculty, it is essential to train the generation working with data to understand the solutions, principles, and opportunities of database systems.”