A new method to measure the tidiness of data repositories and help researchers clean up their “data swamps” took first place in the ACM Student Research Competition for Undergraduates at the 2018 Supercomputing conference in Dallas, TX. The project, conducted by students Luann Jung and Brendan Whitaker with advisors Kyle Chard and Aaron Elmore of UChicago CS, was part of last summer’s BigDataX Research Experiences for Undergraduates (REU) program, held at UChicago and the Illinois Institute of Technology.
Anyone who has used a computer knows how easy it is for data to become disorganized. Without the strictest drag-and-drop discipline, folders quickly become cluttered with files of different formats or subject matter, making it difficult or impossible to find the right file when needed in the future. If this disorder is a problem for personal computers, it is magnified exponentially in the archives of scientific projects that hold terabytes and terabytes of data.
Over the 10 weeks of the BigDataX program, Jung, a first-year student at MIT, and Whitaker, a fourth-year at Ohio State University, worked with a group of UChicago CS researchers seeking solutions for this issue, colloquially known as the “data swamp.” In a CERES-funded project, Chard, Elmore, Ian Foster, Michael Franklin and Blase Ur are developing automated processes that reorganize data repositories or databases and make the information within more reusable and discoverable. But in order to create these improvements, researchers need a way to measure what they’re improving — attaching a number to the dirtiness or cleanliness of a given repository.
“It's an important challenge because a huge portion of scientific research nowadays is heavily reliant on statistical inference from large quantities of data,” Whitaker said. “The work-hour split of a data scientist is often said to be nearly 90% preprocessing and 10% inference, training, and testing. It's this ‘heterogeneous’ quality that necessitates such a time sink, and it's my view that quantifying this quality is the first step in creating systems that do it automatically and do it well.”
Measuring Clutter With Clusters
In their paper, “Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories,” Jung and Whitaker constructed a parallel pipeline that uses clustering methods to quickly assess the tidiness of a given repository. The pipeline processes text files and tabular data (such as CSV or TSV files) and clusters them according to their shared features.
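To give a feel for this kind of clustering step, here is a minimal sketch, not the authors' actual pipeline: it extracts a crude word-token feature set from each text file and greedily groups files whose features overlap. The `features`, `jaccard`, and `cluster_files` helpers and the similarity threshold are illustrative assumptions, not names or parameters from the paper.

```python
def features(text):
    """Crude feature extraction: the set of lowercase word tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_files(files, threshold=0.5):
    """Greedily place each file in the first cluster whose representative
    feature set is similar enough, otherwise start a new cluster.
    `files` maps filename -> raw text contents."""
    clusters = []  # list of (representative feature set, [filenames])
    for name, text in files.items():
        feats = features(text)
        for rep, members in clusters:
            if jaccard(feats, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((feats, [name]))
    return [members for _, members in clusters]

files = {
    "a.txt": "temperature pressure humidity readings",
    "b.txt": "pressure temperature humidity log readings",
    "c.txt": "invoice payment receipt total",
}
print(cluster_files(files))  # → [['a.txt', 'b.txt'], ['c.txt']]
```

A production pipeline would use richer features (column headers for tabular files, TF-IDF weights for text) and a proper clustering algorithm, but the idea is the same: files about the same subject end up in the same group regardless of where they sit in the directory tree.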
Broadly, the cleanliness score is then calculated from how well those clusters map onto the file directories in the repository. If a given cluster is heavily represented in a small number of directories and only rarely appears elsewhere, that’s considered well-organized. Conversely, a cluster whose files turn up at moderate frequency across many different directories, with no sharp drop-off between where it is and isn’t present, is considered a sign of disorder.
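One simple way to turn that intuition into a number, offered here as a hedged illustration rather than the paper's exact formula, is to measure how concentrated each cluster is across directories using normalized entropy: a cluster confined to one directory has zero entropy (tidy), while a cluster spread evenly over many directories has maximal entropy (swampy). The `cleanliness` function and its input format are assumptions for this sketch.

```python
import math
from collections import Counter

def cleanliness(assignments):
    """`assignments` maps filepath -> cluster id. Returns a score in [0, 1]:
    1.0 when every cluster lives in a single directory, near 0 when
    clusters are smeared evenly across the directory tree."""
    by_cluster = {}
    for path, cid in assignments.items():
        directory = path.rsplit("/", 1)[0]
        by_cluster.setdefault(cid, []).append(directory)
    scores = []
    for dirs in by_cluster.values():
        counts = Counter(dirs)
        if len(counts) == 1:
            scores.append(1.0)  # cluster entirely within one directory
            continue
        n = len(dirs)
        # Shannon entropy of the cluster's spread over directories,
        # normalized by the maximum possible entropy for this spread.
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        scores.append(1.0 - entropy / math.log(len(counts)))
    return sum(scores) / len(scores)

tidy = {"climate/a.csv": 0, "climate/b.csv": 0, "finance/x.csv": 1}
messy = {"climate/a.csv": 0, "finance/b.csv": 0, "misc/x.csv": 1}
print(cleanliness(tidy))   # → 1.0
print(cleanliness(messy))  # → 0.5
```

The key design choice is that the score rewards concentration rather than any particular directory naming scheme, so it can be computed automatically without knowing what the "right" organization looks like.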
To test their new score, Jung and Whitaker created synthetic datasets where they could manually shuffle how the files were organized, and also used a real data repository from the Carbon Dioxide Information Analysis Center. In both evaluations, run using the Chameleon cloud computing testbed, the new cleanliness score outperformed previously published measures.
Overall, the project combined many of the skills emphasized by the BigDataX program, which is designed to “promote a data-centric view of scientific and technical computing, at the intersection of distributed systems theory and practice.” At Supercomputing, the project beat out dozens of competitors for first place in the undergraduate category, which also qualifies it for the ACM Student Research Competition Grand Finals.