"Data Curation at Scale"

Dong Deng, Massachusetts Institute of Technology

Wednesday, March 14, 2018 at 12:00 Noon

Room 368 (CIT 3rd Floor)

Data curation (ingest, transformation, cleaning, schema mapping, deduplication, and consolidation) of raw data sets consumes up to 80% of a data scientist’s time. Integrating silos of enterprise data is also a major challenge for business users. To address these issues, we have built an end-to-end data curation system, Data Civilizer, in cooperation with the Qatar Computing Research Institute.

In this talk, I will briefly introduce the Data Civilizer system and then discuss two of the components that I have built. First, I will discuss entity consolidation in Data Civilizer. This module accepts a collection of clusters of records thought to represent the same entity (i.e., duplicates) and merges each cluster into a single “golden” record. Next, I will show how to address the key challenges in enabling scalable entity matching in Data Civilizer. Finally, I will conclude the talk with my vision for the future of data sharing and data access.

Dong Deng is a postdoctoral associate in the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT, where he works with Prof. Michael Stonebraker and Prof. Samuel Madden. He is interested in data management and data science, with a special focus on data curation. Dong obtained his PhD degree from Tsinghua University in 2016 with the highest doctoral dissertation award. He has also received a Siebel Scholarship, a Google PhD Fellowship, and a Microsoft PhD Fellowship, and he publishes regularly in top venues including SIGMOD, PVLDB, and ICDE.

Host: Professor Stan Zdonik