About
Hello world! I am a 4th year Ph.D. student in Computer Science at Brown University. My advisors are Dr. Malte Schwarzkopf and Dr. Tim Kraska in the database management group.
My research interests include building more efficient big data analytics systems designed for modern, UDF-driven data pipelines, handling dirty data more gracefully, and designing future systems for sensor-driven machine learning products. I am also interested in computer vision applications, especially for self-driving cars, and in how to retrieve statistically sound insights from data.
Before coming to Brown, I worked for Mentat Innovations as a Data Scientist and for BMW's self-driving car group as a Machine Learning/Data Engineer.
Research Projects
Tuplex
Data Science in Python at Native Code Speed
Tuplex, short for tuples and exceptions, is a novel big data analytics framework that is inherently more robust to the exceptions and errors that arise when processing raw data. Built on a C++ backend with an easy-to-use Python frontend, it provides a novel programming paradigm, execution an order of magnitude faster than Apache Spark, and efficient parsers for raw data. The project is currently under heavy development and is available at tuplex.cs.brown.edu. A preprint can be accessed below.
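The idea of keeping failing rows apart from the normal result can be illustrated in plain Python. The sketch below is not Tuplex's API or implementation; `run_pipeline` and its return shape are hypothetical names chosen only for illustration.

```python
from typing import Any, Callable, Iterable, List, Tuple


def run_pipeline(
    rows: Iterable[Any],
    udf: Callable[[Any], Any],
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Apply udf to every row and collect failures instead of aborting.

    Hypothetical helper for illustration only; not part of Tuplex.
    """
    results = []     # outputs of rows that were processed cleanly
    exceptions = []  # (offending row, raised exception) pairs
    for row in rows:
        try:
            results.append(udf(row))
        except Exception as exc:  # dirty row: record it and keep going
            exceptions.append((row, exc))
    return results, exceptions


# Dirty input: one row cannot be parsed as an integer.
raw = ["1", "2", "n/a", "4"]
ok, failed = run_pipeline(raw, lambda s: int(s) * 10)
print(ok)                                                   # [10, 20, 40]
print([(row, type(exc).__name__) for row, exc in failed])   # [('n/a', 'ValueError')]
```

The pipeline finishes despite the malformed row, and the failed row stays available for later inspection or repair.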
VizCertify
A framework for secure data exploration
Visual representations of data (visualizations) are important and widely used tools in data analytics because they give users simple and effective visual insight into patterns in the observed data. When visualizations are derived from sample data, however, there is a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem arises especially when differences between visualizations are compared, e.g., to identify an interesting deviation in one pair of observations among many possible pairs. Using theorems from Vapnik and Chervonenkis, we developed a novel framework, VizCertify, that provides a safe space for visualization recommendations. This work was published at DSAA 2019.
Publications
- Spiegelberg, L., Yesantharao, R., Schwarzkopf, M., & Kraska, T. (2020). Tuplex: Data Science in Python at Native Code Speed. [pub preprint]
- De Stefani, L., Spiegelberg, L., Kraska, T., & Upfal, E. (2019). VizCertify: A framework for secure data exploration. [DSAA'19] [pub] [slides]
- Spiegelberg, L., & Kraska, T. (2019). Tuplex: Robust, Efficient Analytics When Python Rules. [VLDB'19] [pub]
- Engel, J., Scherer, M., & Spiegelberg, L. (2017). One-Factor Lévy-Frailty Copulas with Inhomogeneous Trigger Rates. In M. B. Ferraro, P. Giordani, B. Vantaggi, M. Gagolewski, M. Ángeles Gil, P. Grzegorzewski, & O. Hryniewicz (Eds.), SMDS'17
Other Projects
This is an incomplete list of other projects I have worked on in the past:
Picture Perfect Plates
Building a model to classify restaurant images
Together with TripAdvisor, Inc., we developed a computer vision model from the ground up to classify restaurant images into different categories. More can be read about this project here and here.
Two-stage Dictionary Selection
via greedy selection
In this project, we developed a two-stage optimization algorithm for dictionary selection under a supermodularity assumption, based on greedy minimization of weakly supermodular set functions. Our final report can be found here.
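Greedy minimization of a set function can be sketched as follows. This is a generic single-stage greedy loop, not the project's two-stage algorithm; `greedy_minimize` and the toy `cost` objective are hypothetical and chosen only to make the selection step concrete.

```python
from typing import Callable, List, Sequence


def greedy_minimize(
    candidates: Sequence[float],
    objective: Callable[[List[float]], float],
    k: int,
) -> List[float]:
    """Pick up to k candidates, each time adding the one that lowers
    the objective the most (ties go to the earliest candidate)."""
    selected: List[float] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best = min(remaining, key=lambda c: objective(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy objective (an assumption for this example): total distance from
# each point to its nearest selected element.
points = [0, 1, 2, 10, 11, 12]


def cost(selection: List[float]) -> float:
    return sum(min(abs(p - s) for s in selection) for p in points)


chosen = greedy_minimize(points, cost, k=2)
print(chosen, cost(chosen))  # [2, 11] 5
```

Each round evaluates the objective once per remaining candidate, so the loop needs on the order of k times the number of candidates objective evaluations.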
Adaptive Extendible Hashmaps
with a better in-bucket strategy
Extendible hashmaps are a commonly used data structure in filesystems and databases for dynamic hashing (i.e., a growable hashmap). In this work, we relaxed the assumption of fixed buckets and experimented with different in-bucket structures to adapt better to different workloads.
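For readers unfamiliar with the baseline, here is a minimal sketch of textbook extendible hashing with fixed-capacity buckets. The class and method names are hypothetical, and the plain `dict` inside each bucket is only a placeholder for an in-bucket structure; it is not the adaptive strategy explored in the project.

```python
class Bucket:
    def __init__(self, local_depth: int, capacity: int):
        self.local_depth = local_depth  # number of hash bits this bucket "owns"
        self.capacity = capacity
        self.items = {}                 # placeholder in-bucket structure


class ExtendibleHashMap:
    """Textbook extendible hashing, indexing by the low bits of hash(key).

    Assumes keys with distinct hashes; more than `capacity` keys with the
    same hash would make the split loop in put() run forever.
    """

    def __init__(self, bucket_capacity: int = 2):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _index(self, key) -> int:
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._index(key)].items.get(key, default)

    def put(self, key, value) -> None:
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)  # bucket is full: split it and retry

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory; slot i and slot i + old_size share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.bucket_capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Directory slots with the new bit set now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the old items by the same bit.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            (sibling if hash(k) & high_bit else bucket).items[k] = v


table = ExtendibleHashMap(bucket_capacity=2)
for i in range(100):
    table.put(i, i * i)
print(table.get(7))  # 49
```

Only the overflowing bucket is split on growth, so the directory doubles occasionally while most buckets stay untouched.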
Ordinal regression
in Apache Spark
Ratings and curated lists are widespread, and their values are naturally ordered. Ordinal regression is a modification of the popular linear model that takes this ordering into account. In this project, we extended Spark with an ordinal regression model and performed various experiments.
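One common formulation of ordinal regression is the cumulative-link (proportional odds) model, in which P(y <= k | x) = sigmoid(theta_k - w·x) for increasing thresholds theta_k. The sketch below assumes that formulation; the source does not state which variant the project implemented, and the function names and example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ordinal_probabilities(
    x: Sequence[float],
    weights: Sequence[float],
    thresholds: Sequence[float],
) -> List[float]:
    """Class probabilities under a proportional odds model.

    thresholds must be sorted in increasing order; K - 1 thresholds
    yield K ordered classes.
    """
    score = sum(w * xi for w, xi in zip(weights, x))
    # Cumulative probabilities P(y <= k), with P(y <= last class) = 1.
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)  # P(y = k) = P(y <= k) - P(y <= k - 1)
        previous = c
    return probs


# Example with made-up numbers: the linear score is 0.5 - 0.5 = 0.
probs = ordinal_probabilities([1.0, 2.0], [0.5, -0.25], [-1.0, 0.0, 1.0])
print(len(probs), round(sum(probs), 6))  # 4 1.0
```

A single weight vector is shared across all classes, and only the thresholds differ, which is how the model encodes the ordering of the labels.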