Thesis Defense

 

"Quantifying Uncertainty in Data Exploration"

Yeounoh Chung

Friday, November 9, 2018 at 1:00 P.M.

Room 101 (CIT 1st Floor)

In the age of big data, uncertainty in data constantly grows with its volume, variety, and velocity. Data is noisy, biased, and error-prone. Compounding the problem of uncertain data is uncertainty in data analysis. A typical end-to-end data analysis pipeline involves cleaning and processing the data, summarizing its different characteristics, and running more complex machine learning algorithms to extract interesting patterns. The problem is that all of these steps are error-prone and imperfect. From the input data to the output results, uncertainty propagates and compounds. This thesis addresses these problems of uncertainty in data analysis.

First, it is shown how uncertainty in the form of missing, unknown data items can affect aggregate query results, which are very common in exploratory data analysis. It is challenging to ensure that all the important data items have been collected to derive correct analysis results, especially when dealing with real-world big data; there is always a chance that some items of unknown impact are missing from the collected data set. To this end, sound techniques are proposed to answer aggregate queries under the open-world assumption (the data set may or may not be complete).

Next, uncertainty in the form of data errors is examined. It is almost guaranteed that any real-world data set contains some form of data error. This is an important source of uncertainty in data analysis because such errors would almost surely corrupt the final analysis results. Unfortunately, there has not been much work on measuring the quality of data in terms of the number of data errors remaining in the data set. Data cleaning best practices employ a number of (orthogonal) cleaning techniques, algorithms, or human/crowd workers in the hope that the increased cleaning effort will result in a perfectly clean data set. To guide such cleaning efforts, techniques are proposed to estimate the number of undetected errors remaining in a data set.

Lastly, two uncertainty problems related to machine learning (ML) model quality are addressed. ML is one of the most popular tools for learning from and making predictions on data, and ensuring good ML model quality leads to more accurate and reliable data analysis results. The most common practice for model quality control is to evaluate various test performance metrics on separate validation data sets; the problem is that overall performance metrics can fail to reflect the performance on smaller subsets of the data. At the same time, evaluating the model on all possible subsets of the data is prohibitively expensive, which is one of the key challenges in solving this uncertainty problem. Furthermore, missing, unknown data items can also degrade model quality: most ML/inference models perform badly on unseen instances if similar cases were not learned during training.

Host: Professor Tim Kraska