# Thesis Proposal

### "Probabilistic approaches for rigorous and efficient analysis of statistical properties of large datasets"

### Lorenzo De Stefani

### Friday, November 16, 2018 at 2:00 P.M.

### Room 368 (CIT 3rd Floor)

The analysis of statistical properties of data is a core goal in computer science. The challenges and approaches to this task have evolved over time due to the explosion in the size of the datasets being considered and to the development of new analysis tools. Rather than computing the exact quantities of interest, computing high-quality estimates often allows for a considerable reduction in computational effort while retaining meaningful and reliable insight into the statistical properties of the data.

In this thesis, we study the applications of probabilistic analysis and statistical learning theory to the analysis of large datasets. Using these concepts allows us to rigorously characterize the relationship between the amount of input *training data*, the *quality*, in terms of *confidence* and *precision*, of the approximations achievable for a certain task, and the complexity of the task itself.

In the first part of this work, we study the application of statistical learning tools, such as Rademacher Complexity and VC dimension, to the classical *hypothesis testing* problem.
As a preliminary result, we introduce **RadaBound**, a rigorous, efficient and practical procedure for controlling the generalization error when using a holdout sample for multiple adaptive testing. Practical data analysis and machine learning are usually iterative and adaptive processes. Based on the results of the hypotheses (or models) tested so far, the user selects a new hypothesis (or model) to test until an adequate model is found. Standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses. To evaluate the performance of such an iterative analysis process we need to quantify the accumulated errors in running a sequence of non-independent tests on the same data. Our solution is based on a new application of the *Rademacher Complexity* generalization bounds adapted to dependent tests. We demonstrate the statistical power and practicality of our method through extensive simulations and comparisons to alternative approaches.
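The key quantity behind this style of bound is the *empirical* Rademacher Complexity: the expected supremum, over the hypotheses tested so far, of their correlation with random sign vectors on the holdout sample. As an illustration only (this is a generic Monte-Carlo estimator, not the **RadaBound** procedure itself, and the function and variable names are ours), it can be estimated as follows:

```python
import numpy as np

def empirical_rademacher(values, n_trials=1000, rng=None):
    """Monte-Carlo estimate of the empirical Rademacher complexity.

    values: (num_hypotheses, n) array; row h holds the evaluations of
    hypothesis h on the n holdout points.
    """
    rng = np.random.default_rng(rng)
    num_h, n = values.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)  # random Rademacher signs
        # supremum over the tested hypotheses of the signed empirical mean
        total += np.max(values @ sigma) / n
    return total / n_trials
```

A small estimate indicates that the hypotheses tested so far cannot fit random noise on the holdout set, which is what licenses reusing that set across adaptive tests.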

We also explore the application of the Rademacher Complexity uniform convergence bounds to the setting of multiple hypothesis testing. Traditional methods offering control of the *Family Wise Error Rate* (FWER) do not scale well, as their *statistical power* decreases considerably as the number of hypotheses being considered grows.
In our **RadeFWER** approach, the statistical power of the procedure decreases with the *actual complexity* of the considered hypotheses, captured in a *data-dependent* way by estimating the Rademacher Complexity. We verify experimentally the correctness of our approach and the increase in statistical power it achieves compared to classical FWER control procedures.
As proposed future work, we aim to apply the Rademacher-based bounds to obtain statistical procedures which provide control of the *False Discovery Rate* (FDR), an alternative, less restrictive control criterion for hypothesis testing.
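To make the power loss of the classical approach concrete: the Bonferroni correction, the textbook FWER procedure, tests each of the m hypotheses at level α/m, so the per-test rejection threshold shrinks linearly in m regardless of how simple or correlated the hypotheses actually are. A minimal sketch:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni FWER control: reject H_i iff p_i <= alpha / m.

    With m hypotheses, the per-test level is alpha / m, so power
    vanishes as m grows even when the hypotheses are highly related.
    """
    m = len(p_values)
    return [p <= alpha / m for p in p_values]
```

A complexity-adaptive procedure instead ties the threshold to a data-dependent measure of the hypothesis class, which is the gap **RadeFWER** targets.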

We investigate the importance of rigorous statistical control of multiple hypothesis testing in the context of the visual exploration of large datasets. We argue that data representations should indeed be considered hypotheses, and that the statistical significance of the observed phenomena should correspondingly be rigorously ascertained. In our preliminary results, we build upon the *Alpha Investing* testing scheme to obtain a procedure which allows for adaptive visual exploration while controlling the *marginal False Discovery Rate* (mFDR), that is, the ratio between the expected number of false positives and the expected number of discoveries.
As proposed work, we aim to use statistical learning analysis tools both to obtain stronger FWER guarantees and to extend the scope of the control to the validation of recommendations of *"interesting"* visualizations. In particular, our work-in-progress approach is based on the analysis of the *Vapnik-Chervonenkis* (VC) dimension of the class of queries corresponding to the visualizations being considered.
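The alpha-investing idea can be sketched in a few lines: the analyst holds a budget ("wealth") of significance, spends part of it on each visualization treated as a hypothesis test, and earns a payout back on each discovery, which is what permits unbounded adaptive exploration under mFDR control. The wealth-update rule below is one common variant in the style of Foster and Stine; the class name, parameters, and spend rule are illustrative, not the thesis procedure:

```python
class AlphaInvesting:
    """Sketch of sequential testing with alpha-investing.

    Wealth starts at alpha; each test spends part of the wealth,
    and each rejection (discovery) earns back a fixed payout.
    """
    def __init__(self, alpha=0.05, payout=0.025):
        self.wealth = alpha
        self.payout = payout

    def test(self, p_value, spend_fraction=0.5):
        # Spend a fraction of the current wealth on this hypothesis
        # (the spend rule is a hypothetical policy for illustration).
        level = self.wealth * spend_fraction
        reject = p_value <= level
        if reject:
            self.wealth += self.payout            # payout on a discovery
        else:
            self.wealth -= level / (1.0 - level)  # cost of a non-rejection
        return reject
```

Because wealth is replenished only by discoveries, the long-run ratio of false positives to discoveries stays controlled even though the sequence of tests is chosen adaptively.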

In the second part of this thesis, we study massive dynamic graphs observed as an adversarial stream of edge insertions and deletions. The goal of our analysis is to count the number of occurrences of a given small sub-graph pattern, or *"motif"*, of interest.

In our preliminary results, we present **Trìest**, a suite of one-pass streaming algorithms based on the *reservoir sampling* and *random pairing* sampling schemes. The **Trìest** algorithms yield high-quality unbiased estimates of the number of *triangle motifs* in fully dynamic edge streams while using a small local memory in which a small, random sample of the edges is maintained. We present a complete analysis of the variance of the proposed algorithms and corresponding large deviation bounds.
We also present **Tiered Sampling**, an extension of the **Trìest** paradigm which employs multiple reservoir sample tiers in order to accurately estimate the count of rare and complex patterns such as 4- and 5-cliques. We evaluate the performance of both approaches through extensive experimental analysis on real-world graphs with up to five billion edges, while maintaining only a small fraction (<0.5\%) of each graph as a sample.
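The core mechanism can be sketched for the insertion-only case: keep a uniform reservoir of at most M edges, and when a new edge arrives, credit each triangle it closes with the sampled edges, reweighted to compensate for edges the reservoir dropped. This is a simplified sketch in the spirit of the improved **Trìest** estimator under stated assumptions (insertion-only stream, illustrative names, not the paper's notation or its fully dynamic random-pairing variant):

```python
import random

def streaming_triangle_estimate(stream, M, seed=None):
    """One-pass triangle-count estimate with an edge reservoir of size M.

    For each arriving edge (u, v), every common neighbor of u and v in
    the sampled sub-graph contributes weight p, where p corrects for the
    probability that both closing edges survived in the reservoir.
    """
    rng = random.Random(seed)
    sample = set()      # reservoir of edges, stored as sorted pairs
    neighbors = {}      # adjacency of the sampled sub-graph
    t = 0               # number of edges seen so far
    estimate = 0.0

    for u, v in stream:
        t += 1
        # reweighting factor: 1 while everything fits, then grows with t
        p = 1.0 if t <= M else max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        common = neighbors.get(u, set()) & neighbors.get(v, set())
        estimate += p * len(common)

        # standard reservoir-sampling step for the new edge
        if t <= M:
            keep = True
        elif rng.random() < M / t:
            out = rng.choice(sorted(sample))   # evict a random sampled edge
            sample.discard(out)
            neighbors[out[0]].discard(out[1])
            neighbors[out[1]].discard(out[0])
            keep = True
        else:
            keep = False
        if keep:
            sample.add((u, v) if u < v else (v, u))
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    return estimate
```

When M is at least the stream length the reservoir keeps every edge, p stays 1, and the estimate reduces to the exact triangle count; the interesting regime is M much smaller than the stream, where the reweighting keeps the estimate unbiased at the cost of added variance.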

As proposed work, we aim to apply the **Trìest** paradigm to the study of *temporal graph motifs*, that is, occurrences of sub-graph patterns characterized not only by their topology but also by their temporal behavior. Such results would provide the means to characterize the *evolution* of the properties of large graphs over time.

Host: Professor Eli Upfal