My research interests include building more efficient big data analytics systems designed for modern, UDF-driven data pipelines, handling dirty data more robustly, and designing future systems for sensor-driven Machine Learning products. I am also interested in Computer Vision applications, especially for self-driving cars, and in how to retrieve statistically sound insights from data.
Processing for the normal case
Tuplex, short for tuples and exceptions, is a novel big data analytics framework that is inherently more robust to the exceptions and errors produced when processing raw data. Written with a powerful C++ backend and an easy-to-use Python frontend, it provides a novel programming paradigm, execution an order of magnitude faster than Apache Spark, and efficient parsers for raw data. The project is currently under heavy development and available at tuplex.cs.brown.edu.
A framework for secure data exploration via visual representation
Visual representations of data (visualizations) are tools of great importance and widespread use in data analytics, as they give users visual insight into patterns in the observed data in a simple and effective way. When visualizations are derived from sample data, however, there is a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem arises especially when differences between visualizations are compared, e.g. to identify an interesting deviation in one pair of observations among many possible pairs. Using theorems from Vapnik and Chervonenkis, we developed a novel framework, VizRec, to provide a safe space for visualization recommendations. This work is currently under review and can be accessed as a preprint here.
Parsing CSV files at a faster pace
Pacer is developed as part of the Tuplex project to provide efficient code-generated parsers for CSV files. Properties of a CSV file are detected automatically from a small data sample and then used to generate efficient LLVM IR code.
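The sampling step can be illustrated with a small sketch. This is not Pacer's code and it generates no LLVM IR; it only shows, under simplifying assumptions, how a few leading rows can reveal the delimiter and per-column types that a specialized parser could then rely on. The function name infer_csv_properties and the candidate delimiter set are hypothetical.

```python
import csv
import io


def infer_csv_properties(text, sample_lines=100):
    """Guess delimiter, column names and column types from the first rows.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    sample = "\n".join(text.splitlines()[:sample_lines])
    # Pick the candidate delimiter that occurs most often in the sample.
    delimiter = max(",;\t|", key=sample.count)
    rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
    header, body = rows[0], rows[1:]

    def column_type(values):
        # Try the narrowest type first and fall back to wider ones.
        for caster, name in ((int, "int"), (float, "float")):
            try:
                for value in values:
                    caster(value)
                return name
            except ValueError:
                continue
        return "str"

    types = [
        column_type([row[i] for row in body if i < len(row)])
        for i in range(len(header))
    ]
    return {"delimiter": delimiter, "columns": header, "types": types}


props = infer_csv_properties("id;price;name\n1;2.5;apple\n2;3;pear\n")
print(props)
```

A code generator could use such inferred properties to emit a parser for exactly this layout, with a slower general path for rows that do not match the sample.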
Data centric compilation
Python is becoming the de facto choice for data scientists. While there is a wide selection of Machine Learning frameworks, data preparation remains a bottleneck. Natpy, short for Native Python, samples input data and, based on the inferred properties, compiles native code for compilable UDFs written in Python. This project is part of the larger Tuplex effort.
- De Stefani, L., Spiegelberg, L., Kraska, T., & Upfal, E. (2018). VizRec: A framework for secure data exploration via visual representation. (under review)
- Engel, J., Scherer, M., & Spiegelberg, L. (2017). One-Factor Lévy-Frailty Copulas with Inhomogeneous Trigger Rates. In M. B. Ferraro, P. Giordani, B. Vantaggi, M. Gagolewski, M. Ángeles Gil, P. Grzegorzewski, & O. Hryniewicz (Eds.), Soft Methods for Data Science (pp. 205–212). Cham: Springer International Publishing.
- Spiegelberg, L. (2017). Model-free approaches for evaluation counterparty credit risk (Master’s thesis). Technische Universität München.
- Spiegelberg, L. (2013). Ocean circulation studies in the Labrador Sea (Bachelor’s thesis). Technische Universität München.
This is a (non-exhaustive) list of other projects I have worked on in the past:
Adaptive Extendible Hashmaps
with a better in-bucket strategy
Extendible Hashmaps are a commonly used data structure in filesystems and databases for dynamic hashing (i.e. a growable hashmap). In this work we relaxed the assumption of using fixed buckets and experimented with different in-bucket structures to adapt better to different workloads.
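For readers unfamiliar with the baseline, here is a minimal sketch of classic extendible hashing with fixed-capacity buckets. It is not the project's implementation: the class names are hypothetical, and the in-bucket structure is simply a Python dict, which is the part the project varied.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}  # in-bucket structure; a plain dict in this sketch


class ExtendibleHashMap:
    def __init__(self, bucket_capacity=4):
        self.global_depth = 1
        self.capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _index(self, key):
        # Use the lowest global_depth bits of the hash as the directory slot.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._index(key)].items.get(key, default)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            # Bucket is full: split it and retry.
            # Note: many keys with identical hashes would split forever.
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; both halves point to the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Redirect the slots whose newly significant bit is set.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the entries between the old bucket and its sibling.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if hash(k) & high_bit else bucket
            target.items[k] = v


table = ExtendibleHashMap(bucket_capacity=2)
for i in range(100):
    table.put(i, i * i)
print(table.get(7), table.global_depth)
```

Only the overflowing bucket is split, so growth stays local; the directory doubles only when that bucket already uses all of the directory's bits.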
Ordinal Regression
in Apache Spark
Ratings and curated lists are widespread. Ordinal regression is a modified version of the popular linear model that takes the ordering property of such data into account. In this project we extended Spark with an ordinal regression model and performed various experiments.
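As an illustration of how an ordinal model uses the ordering, the sketch below computes class probabilities under the cumulative-logit (proportional-odds) formulation, one common variant. Whether the Spark extension used this exact formulation is not stated here, and the function name and parameters are hypothetical.

```python
import math


def ordinal_probabilities(score, thresholds):
    """Class probabilities for one example under a cumulative-logit model.

    score: the linear predictor w.x of the example.
    thresholds: increasing cut points separating the K ordered classes
                (K - 1 values).
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # P(y <= k) for each class k; the last class has cumulative mass 1.
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probabilities, previous = [], 0.0
    for c in cumulative:
        # P(y = k) is the difference of consecutive cumulative values.
        probabilities.append(c - previous)
        previous = c
    return probabilities


print(ordinal_probabilities(0.0, [-1.0, 1.0]))
```

A single weight vector is shared across all classes and only the thresholds differ, which is what encodes the ordering of the labels.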