Thesis Proposal


"Quantifying Uncertainty in Data Analysis"

Yeounoh Chung

Friday, March 9, 2018 at 3:00 P.M.

Room 368 (CIT 3rd Floor)

In the age of big data, uncertainty in data grows constantly with its volume, variety, and velocity. Data is noisy, biased, and error-prone. Compounding the problem of uncertain data is uncertainty in data analysis. A typical end-to-end data analysis pipeline involves cleaning and processing the data, summarizing its different characteristics, and running more complex machine learning algorithms to extract interesting patterns. The problem is that all of these steps are error-prone and imperfect. From the input data to the output results, uncertainty propagates and compounds. This thesis deals with the problem of uncertainty in data analysis.

First, we look at how uncertainty in the form of missing, unknown data items affects aggregate query results (e.g., AVG, COUNT, MIN/MAX) that are common in exploratory data analysis. It is challenging to make sure that we have collected all the important data items needed to derive correct analysis results, especially with real-world big data; there is always a chance that some items of unknown impact are missing from the collected data set. To this end, we propose sound techniques to answer aggregate queries under the open-world assumption (the data set may or may not be complete).
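For intuition, one classical building block for reasoning about unseen items is species estimation. The sketch below uses a Chao1-style estimator to extrapolate a COUNT from a sample that contains duplicates; it is only an illustration of the idea under simplified assumptions, not the thesis's actual estimator, and the sample data are made up.

    from collections import Counter

    def chao1_estimate(observations):
        """Chao1-style lower-bound estimate of the total number of distinct
        items, including those never observed, from a sample with duplicates."""
        counts = Counter(observations)
        c = len(counts)                                   # distinct items seen
        f1 = sum(1 for v in counts.values() if v == 1)    # items seen exactly once
        f2 = sum(1 for v in counts.values() if v == 2)    # items seen exactly twice
        if f2 == 0:
            return c + f1 * (f1 - 1) / 2.0                # bias-corrected variant
        return c + (f1 * f1) / (2.0 * f2)

    # A small crowdsourced sample of entity ids, with duplicates.
    sample = ["a", "b", "a", "c", "d", "b", "e", "f", "f", "g"]
    print(chao1_estimate(sample))  # estimated COUNT under the open-world assumption

Similar corrections for the unseen portion of the data can then be propagated into SUM and AVG estimates.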

Next, we examine uncertainty in the form of data errors. It is almost guaranteed that any real-world data set contains some type of error (e.g., missing values, inconsistent records, duplicate entities). This is an important source of uncertainty in data analysis, because such errors would almost surely corrupt the analysis results. Unfortunately, there has not been much work on measuring data quality or estimating the number of undetected, remaining errors in a data set; the current best practice in data cleaning is to apply a number of orthogonal data cleaning algorithms, or to crowdsource the task, in the hope that the increased cleaning effort will eventually yield a perfect data set. To guide such cleaning efforts, we propose techniques to estimate the number of remaining errors in a data set.
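As a rough illustration of how remaining errors can be estimated, the sketch below applies a classical capture-recapture (Lincoln-Petersen) estimate to the overlap between two independent error-detection passes. The detector names and cell ids are hypothetical, and the thesis's estimators are more refined than this two-detector special case.

    def estimate_total_errors(errors_a, errors_b):
        """Lincoln-Petersen capture-recapture estimate of the total number of
        errors, given the error sets flagged by two independent detectors."""
        a, b = set(errors_a), set(errors_b)
        overlap = len(a & b)
        if overlap == 0:
            raise ValueError("no overlap between detectors; estimate undefined")
        return len(a) * len(b) / overlap

    # Cell ids flagged by a rule-based cleaner and by crowd workers (made-up data).
    found_by_rules = {1, 4, 7, 9, 12, 15}
    found_by_crowd = {4, 7, 8, 12, 20}
    total = estimate_total_errors(found_by_rules, found_by_crowd)
    remaining = total - len(found_by_rules | found_by_crowd)
    print(round(total, 1), round(remaining, 1))  # 10.0 errors total, 2.0 still undetected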

Lastly, we look at uncertainty in the quality of a trained machine learning model. As machine learning (ML) systems become democratized, helping users easily debug their models becomes increasingly important. In fact, ensuring model quality leads to more accurate and reliable data analysis results. To ensure that a given model performs well at a given task, ML practitioners consider various test performance metrics (e.g., log loss, accuracy, recall) over the entire data set, as well as over smaller data slices. The key issue is that there are too many data slices to examine by hand. Here, we present an automated data slicing tool for model validation that brings the user's attention to a handful of large, problematic, and interpretable data slices.
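To make the idea concrete, the sketch below scans single-feature slices of a test set and flags those that are both large and noticeably worse than the model's overall log loss. The function name, thresholds, and toy data are illustrative assumptions, not the actual tool, which additionally accounts for statistical significance and multi-feature slices.

    import numpy as np
    import pandas as pd
    from sklearn.metrics import log_loss

    def find_problem_slices(df, y_true, y_prob, features, min_size=100, margin=0.1):
        """Scan one-feature slices (feature = value) of a pandas DataFrame and
        flag slices that are large and have noticeably higher log loss than
        the overall test set."""
        y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
        overall = log_loss(y_true, y_prob, labels=[0, 1])
        flagged = []
        for col in features:
            for value in df[col].unique():
                mask = (df[col] == value).to_numpy()
                if mask.sum() < min_size:
                    continue                    # skip tiny, hard-to-interpret slices
                slice_loss = log_loss(y_true[mask], y_prob[mask], labels=[0, 1])
                if slice_loss > overall + margin:
                    flagged.append((f"{col} = {value}", int(mask.sum()), slice_loss))
        # Largest problematic slices first, so the user sees the most impactful ones.
        return sorted(flagged, key=lambda t: -t[1])

    # Toy usage: the model is uninformed on the "US" slice and accurate on "BR".
    df = pd.DataFrame({"country": ["US"] * 150 + ["BR"] * 150})
    y_true = np.array([0, 1] * 150)
    y_prob = np.concatenate([np.full(150, 0.5), np.abs(y_true[150:] - 0.1)])
    print(find_problem_slices(df, y_true, y_prob, ["country"], min_size=50))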

Host: Professor Tim Kraska