My research interests include building more efficient big data analytics systems designed for modern, UDF-driven data pipelines, handling dirty data more robustly, and designing future systems for sensor-driven machine learning products. I am also interested in computer vision applications, especially for self-driving cars, and in how to derive statistically sound insights from data.
Data Science in Python at Native Code Speed
Tuplex, short for tuples and exceptions, is a novel big data analytics framework that is inherently more robust to exceptions and errors produced when processing raw data. Built on a C++ backend with an easy-to-use Python frontend, it provides a novel programming paradigm, an order of magnitude faster execution than Apache Spark, and efficient parsers for raw data. The project is currently under heavy development and available at tuplex.cs.brown.edu. A preprint can be accessed below.
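The idea of treating exceptions as part of the data flow can be illustrated with a minimal sketch in plain Python. This is not the Tuplex API: the function name `run_with_resolvers` and the resolver dictionary are hypothetical, chosen only to show a UDF running on the common case while rows that raise are routed to a per-exception-type resolver instead of aborting the whole job.

```python
from typing import Any, Callable, Dict, List, Tuple, Type


def run_with_resolvers(
    rows: List[Any],
    udf: Callable[[Any], Any],
    resolvers: Dict[Type[BaseException], Callable[[Any], Any]],
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Apply udf to every row; retry failing rows with a matching resolver.

    Hypothetical helper for illustration only (not part of Tuplex).
    Returns (results, unresolved), where unresolved holds (row, exception)
    pairs for rows that no resolver could handle.
    """
    results: List[Any] = []
    unresolved: List[Tuple[Any, Exception]] = []
    for row in rows:
        try:
            results.append(udf(row))  # common case: the UDF succeeds
        except Exception as exc:
            resolver = resolvers.get(type(exc))
            if resolver is None:
                unresolved.append((row, exc))  # keep the row for inspection
                continue
            try:
                results.append(resolver(row))  # exceptional case
            except Exception as resolver_exc:
                unresolved.append((row, resolver_exc))
    return results, unresolved


# Dirty input: one division by zero and one row with a string instead of a number.
rows = [(10, 2), (3, 0), (8, 4), ("7", 1)]
results, unresolved = run_with_resolvers(
    rows,
    udf=lambda r: r[0] / r[1],
    resolvers={ZeroDivisionError: lambda r: 0.0},
)
```

In this sketch the division by zero is resolved to a default value, while the row with the string is reported back together with its exception rather than crashing the pipeline.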
A framework for secure data exploration
Visual representations of data (visualizations) are important and widely used tools in data analytics, as they give users visual insight into patterns in the observed data in a simple and effective way. When visualizations are derived from sample data, however, there is a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem arises especially when differences between visualizations are compared, e.g. to identify an interesting deviation in a pair of observations among many possible pairs. Using results from Vapnik-Chervonenkis theory, we developed a novel framework, VizRec, to provide a safe space for visualization recommendations. This work has been published at DSAA 2019.
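To give a feel for the kind of guarantee Vapnik-Chervonenkis theory provides, here is a minimal sketch of a uniform deviation bound. It is not the method of the published framework. It assumes the standard result that, with probability at least 1 - delta, every frequency estimated from n samples over a family of VC dimension d is within sqrt((c / n) * (d + ln(1 / delta))) of its true value. The theory only guarantees that c is some universal constant; the default `c=0.5` and both function names are assumptions made for illustration.

```python
import math


def vc_deviation_bound(n: int, vc_dim: int, delta: float, c: float = 0.5) -> float:
    """Uniform bound on |sample frequency - true frequency| over the family.

    c is an assumed value for the universal constant in the bound.
    """
    if n <= 0 or vc_dim < 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0, vc_dim >= 0 and 0 < delta < 1")
    return math.sqrt((c / n) * (vc_dim + math.log(1.0 / delta)))


def is_significant(observed_difference: float, n: int, vc_dim: int,
                   delta: float, c: float = 0.5) -> bool:
    """Flag a difference between two estimated frequencies as trustworthy.

    Each of the two frequencies may be off by up to eps, so their difference
    may be off by up to 2 * eps; only larger differences are reported.
    """
    eps = vc_deviation_bound(n, vc_dim, delta, c)
    return abs(observed_difference) > 2.0 * eps


eps_small_sample = vc_deviation_bound(n=1_000, vc_dim=2, delta=0.05)
eps_large_sample = vc_deviation_bound(n=4_000, vc_dim=2, delta=0.05)
```

Because the bound holds simultaneously for all queries in the family, it does not degrade as more pairs of visualizations are compared, and quadrupling the sample size halves the bound.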
- Spiegelberg, L., Yesentharao, R., Schwarzkopf, M., & Kraska, T. (2020). Tuplex: Data Science in Python at Native Code Speed [pub preprint]
- De Stefani, L., Spiegelberg, L., Kraska, T., & Upfal, E. (2019). VizCertify: A framework for secure data exploration. [DSAA'19] [pub] [slides]
- Spiegelberg, L., & Kraska, T. (2019). Tuplex: Robust, Efficient Analytics When Python Rules [VLDB'19] [pub]
- Engel, J., Scherer, M., & Spiegelberg, L. (2017). One-Factor Lévy-Frailty Copulas with Inhomogeneous Trigger Rates. In M. B. Ferraro, P. Giordani, B. Vantaggi, M. Gagolewski, M. Ángeles Gil, P. Grzegorzewski, & O. Hryniewicz (Eds.), SMDS'17
This is an (incomplete) list of other projects I have worked on in the past:
Adaptive Extendible Hashmaps
with a better in-bucket strategy
Extendible hashmaps are a commonly used data structure in filesystems and databases for dynamic hashing (i.e. a growable hashmap). In this work we relaxed the assumption of using fixed buckets and experimented with different in-bucket structures to adapt better to different workloads.
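For readers unfamiliar with the data structure, the following is a minimal sketch of textbook extendible hashing, not the adaptive variant from the project. The class names are hypothetical, and the in-bucket structure is simply a Python dict; that is the component one would swap out to experiment with different in-bucket strategies.

```python
class Bucket:
    def __init__(self, local_depth: int, capacity: int):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}  # in-bucket structure (assumption: a plain dict)

    def is_full(self) -> bool:
        return len(self.items) >= self.capacity


class ExtendibleHashMap:
    """Directory of 2**global_depth slots indexed by the low bits of the hash."""

    def __init__(self, bucket_capacity: int = 4):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _index(self, key) -> int:
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._index(key)].items.get(key, default)

    def put(self, key, value) -> None:
        # Note: more than `capacity` keys with identical hashes would split
        # forever; a production version needs overflow handling for that case.
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or not bucket.is_full():
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory: slot i and slot i + old_size share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.bucket_capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Repoint the directory slots whose newly significant bit is set.
        for i, slot in enumerate(self.directory):
            if slot is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the entries between the old bucket and its sibling.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if hash(k) & high_bit else bucket
            target.items[k] = v


table = ExtendibleHashMap(bucket_capacity=2)
for i in range(100):
    table.put(i, i * i)
```

Only the bucket that overflows is split, and the directory doubles only when that bucket already uses all of the directory's bits, which is what makes the structure grow incrementally.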
in Apache Spark
Ratings and curated lists are widespread, and their values carry a natural order. Ordinal regression is a modified version of the popular linear model that takes this ordering into account. In this project we extended Spark with an ordinal regression model and performed various experiments.
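As an illustration of how the ordering enters the model, here is a minimal sketch of the prediction step of a proportional-odds (cumulative logit) model, one common form of ordinal regression. It is written in plain Python rather than Spark, it is not the project's implementation, and the function names, weights, and thresholds are made-up values for the example. The model assumes P(y <= k | x) = sigmoid(threshold_k - w·x) with increasing thresholds.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probabilities(x: Sequence[float], weights: Sequence[float],
                          thresholds: Sequence[float]) -> List[float]:
    """Return P(y = k | x) for k = 0..K-1, where K = len(thresholds) + 1."""
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in increasing order")
    score = sum(w * xi for w, xi in zip(weights, x))
    # Cumulative probabilities P(y <= k); the last category always reaches 1.
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)  # P(y = k) = P(y <= k) - P(y <= k-1)
        previous = c
    return probabilities


def predict_rating(x: Sequence[float], weights: Sequence[float],
                   thresholds: Sequence[float]) -> int:
    """Return the most probable category index."""
    probabilities = ordinal_probabilities(x, weights, thresholds)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Four ordered categories (e.g. ratings 0..3) need three thresholds.
example_probabilities = ordinal_probabilities(
    x=[1.0, 2.0], weights=[0.5, -0.25], thresholds=[-1.0, 0.0, 1.0]
)
```

A single weight vector is shared across all categories; only the thresholds differ, which is how the model encodes that the categories are ordered rather than independent classes.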