If the 2010s were the decade of “Big Data,” the 2020s will be the decade of Data Discovery. Companies and organizations from virtually every industry and sector heard the gospel of big data and started keeping, collecting, and managing larger and larger databases. But possessing that data is one thing, and doing something valuable with it is another. Entities that can quickly access and combine the data they have and use it to drive their decisions will have all the advantages going forward.
New UChicago CS assistant professor Raul Castro Fernandez builds tools to make that data discovery process easier for all. In previous work at MIT and Imperial College London, he created systems that help organizations locate the data they need, combine data across different formats, and process it for use in predictive analytics, machine learning, and other applications.
“We've collaborated with companies that have thousands of databases, and each database contains hundreds of tables, and each table can be hundreds of thousands or sometimes millions and billions of records,” Raul said. “So it becomes really hard to find any data that might be relevant to solve a particular question. I build systems that tap into those kinds of sources and help people identify the data that they need.”
Growing up in Spain, Raul originally studied to be an electrical engineer. But his interests drifted to computer science, where he got swept up in the race to make databases run faster and faster. Over time, he observed that the bottleneck was shifting from how quickly a database returns query results to a more complex problem: how do you even find the right data to query?
“The main problem that people have today when they try to solve a question that requires data is actually figuring out where the data is in the first place,” Raul said. “Then it gets even more complicated, because once you find the data, it is going to be in a different shape, and how do you actually transform it? I became more interested in this general area of how we can make it easier for people to actually tap into the value that we are seeing in the data.”
Raul has addressed these issues by creating tools such as Aurum and Termite. Aurum, Latin for “gold,” allows users to write simple queries searching for the type of data they need for a given project, within a system scalable to thousands of databases. Termite solves data integration problems when users are faced with combining tables or spreadsheets with unstructured text formats, placing all data in an intermediate representation that also makes finding relevant documents easier.
“I can ask the system: I'm interested in this table, which documents are close to it? And then that's going to help me navigate across the many, many millions of documents that might be there, to find the handful of them that actually might be relevant,” Raul said. “We bring techniques that have been successful in machine learning and natural language processing to the realm of databases to solve these problems that really happen everywhere.”
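The idea of asking which documents are "close" to a table can be illustrated with a toy example. The Python sketch below is not Termite's actual method, which the article describes only as drawing on machine learning and natural language processing. It assumes a much simpler stand-in: the table and each document are reduced to word counts, and closeness is measured by cosine similarity. The table, the document names, and every function name here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def table_to_vector(rows):
    """Flatten a table (a list of dicts) into a bag-of-words count vector."""
    tokens = []
    for row in rows:
        for column, value in row.items():
            tokens.extend(tokenize(column))
            tokens.extend(tokenize(str(value)))
    return Counter(tokens)


def text_to_vector(text):
    """Turn free text into a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_documents(table_rows, documents, k=2):
    """Return up to k document names ranked by similarity to the table.

    Documents that share no words with the table are left out.
    """
    query = table_to_vector(table_rows)
    scored = [
        (cosine(query, text_to_vector(text)), name)
        for name, text in documents.items()
    ]
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical sample data, invented purely for illustration.
TABLE = [{"vineyard": "north block", "disease": "mildew", "yield_tons": 12}]
DOCS = {
    "report.txt": "Mildew disease spread through the north block vineyard this season.",
    "memo.txt": "Quarterly telecom network outage summary.",
    "notes.txt": "Yield estimates for the vineyard were revised.",
}

if __name__ == "__main__":
    print(closest_documents(TABLE, DOCS, k=3))
```

Placing tables and text in the same vector space is what allows a single similarity measure to compare them. A production system would presumably use a far richer representation than raw word counts.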
To test these tools, Raul has worked with companies such as BASF, British Telecom, and Merck, applying them to the companies’ own databases. He’s also dabbled in entrepreneurship himself, co-founding a company that analyzed sensor data for vineyards and helped them predict disease, crop yield, and other outcomes, and another that aggregated news articles into concise and trustworthy summaries.
At UChicago, he will continue developing Aurum, Termite, and other data discovery systems, while also branching out into a new area: the economics of data. The idea offers a potential non-technical solution to the compatibility and accessibility bottlenecks of data-driven applications: creating markets and incentives for data owners to share their datasets in easily consumable formats with the users who need them.
“How do we think about these kinds of issues to solve these fundamental data problems of discovery and integration?” Raul said. “That's one of the main lines that I'm interested in today.”
It’s also one of the primary reasons Raul came to UChicago, where he hopes to collaborate with economists on these data marketplace ideas. He’s creating a course on the topic for next fall, after teaching Introduction to Databases in winter quarter and a new course, with Assistant Professor Blase Ur, on data ethics and responsible data science in the spring.
Raul also joins the department’s growing Chidata team, which pursues projects in databases, data science, and other processes and platforms. He seeks partnerships with campus researchers from all disciplines who want to do more with data, a mission that casts a wide net.
“A common theme in my research is to always try to take ideas and make them practical and use them with people in the field,” Raul said. “And when you look at who's suffering data problems today, that’s almost everybody, right? The cool thing about UChicago is that you just walk a few blocks in any direction and you talk to people in the medical school or you talk to the economics department or you talk to neuroscientists. We all have a common interest in solving those data problems to get something done.”