About


Hello world! I am a second-year Ph.D. student in Computer Science at Brown University, working under the supervision of Dr. Tim Kraska in the database management group.

My research interests include building more efficient big data analytics systems designed for modern, UDF-driven data pipelines, dealing with dirty data in a better way, and designing future systems for sensor-driven Machine Learning products. I am also interested in Computer Vision applications, especially for self-driving cars, and in how to retrieve statistically sound insights from data.

Before coming to Brown, I worked as a Data Scientist for Mentat Innovations and as a Machine Learning/Data Engineer for BMW's self-driving car group.

Research Projects

Tuplex
Processing for the normal case

Tuplex, short for tuples and exceptions, is a novel big data analytics framework that is inherently more robust to the exceptions and errors produced when processing raw data. Built around a powerful C++ backend and an easy-to-use Python frontend, it provides a novel programming paradigm, an order-of-magnitude faster execution than Apache Spark, and efficient parsers for raw data. The project is currently under heavy development and available at tuplex.cs.brown.edu.
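
To give a feel for the Python frontend, here is a minimal sketch of the kind of pipeline Tuplex targets. The API shown (a Spark-like Context with parallelize/map/collect) reflects the current prototype and may still change while the project is under heavy development.

```python
from tuplex import Context

c = Context()

# Rows whose UDF raises an exception (here: the division by zero in the
# second tuple) are not silently dropped and do not crash the job; Tuplex
# tracks them separately so the normal, exception-free path stays fast.
result = c.parallelize([(1, 2), (3, 0), (5, 4)]) \
          .map(lambda a, b: a / b) \
          .collect()

print(result)  # values for the rows that processed without an exception
```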

VizRec
A framework for secure data exploration via visual representation

Visual representations of data (visualizations) are tools of great importance and widespread use in data analytics, as they provide users with visual insight into patterns in the observed data in a simple and effective way. When visualizations are computed from sample data, however, there is a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem arises especially when differences between visualizations are compared, e.g. to identify an interesting deviation in one pair of observations among many possible pairs. Using theorems from Vapnik-Chervonenkis theory, we developed VizRec, a novel framework that provides a safe space for visualization recommendations. This work is currently under review and can be accessed as a preprint here.
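
The toy sketch below illustrates only the underlying intuition, not the actual VizRec algorithm: before recommending a difference between two bar heights estimated from a sample, require that the gap exceeds a deviation bound that holds simultaneously over all candidate comparisons. The numbers and the simple Hoeffding/union bound are assumptions for illustration; VizRec derives sharper guarantees from Vapnik-Chervonenkis theory.

```python
import math

def deviation_bound(n, num_comparisons, delta=0.05):
    """Bound on |sample frequency - true frequency| that holds for all
    compared quantities simultaneously with probability >= 1 - delta
    (simple Hoeffding + union bound, used here only for illustration)."""
    return math.sqrt(math.log(2 * num_comparisons / delta) / (2 * n))

def is_significant(freq_a, freq_b, n, num_comparisons):
    # Recommend the pair only if the observed gap cannot be explained
    # by sampling noise alone.
    eps = deviation_bound(n, num_comparisons)
    return abs(freq_a - freq_b) > 2 * eps

# Comparing two bar heights (sample frequencies) among 100 candidate pairs,
# estimated from a sample of 10,000 rows:
print(is_significant(0.31, 0.27, n=10_000, num_comparisons=100))
```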

Pacer
Parsing CSV files at a faster pace

Pacer is developed as part of the Tuplex project to provide efficient code-generated parsers for CSV files. Using a small data sample, properties of a CSV file are detected automatically, which then makes it possible to generate efficient LLVM IR code.
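
As a rough illustration of the sampling idea (written in Python here, whereas Pacer actually emits LLVM IR), one can detect the delimiter and per-column types from a small sample and then specialize the parser for that common case:

```python
import csv
from io import StringIO

def detect_properties(sample_text, num_rows=100):
    """Infer the delimiter and per-column types from a small sample so that
    a parser specialized for the common case can be generated."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=",;\t|")
    rows = list(csv.reader(StringIO(sample_text), dialect))[:num_rows]

    def col_type(values):
        for candidate in (int, float):
            try:
                [candidate(v) for v in values]
                return candidate
            except ValueError:
                continue
        return str

    return dialect.delimiter, [col_type(col) for col in zip(*rows)]

delim, types = detect_properties("1,2.5,hello\n3,4.0,world\n")
print(delim, [t.__name__ for t in types])  # prints: , ['int', 'float', 'str']
```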

Natpy
Data centric compilation

Python is nowadays winning the race as the de facto choice for data scientists. While there is a wide selection of Machine Learning frameworks, data preparation remains a bottleneck. Natpy, short for Native Python, samples the input data and, based on the inferred properties, compiles compilable UDFs written in Python to native code. This project is part of the larger Tuplex effort.
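
The sketch below is only an analogy for the compilation step: once a sample tells us the input types, a simple Python UDF can be specialized to native code instead of being interpreted row by row. It uses Numba purely for illustration; Natpy performs its own LLVM-based code generation inside Tuplex.

```python
import numba

# A "compilable" UDF: simple enough that its behaviour is fully determined
# by the input types, which we pretend were inferred from a data sample.
def clean_price(price, discount):
    return price * (1.0 - discount)

# Suppose sampling showed both columns are float64: request a specialized
# native version of the UDF for exactly that case.
native_clean_price = numba.njit("float64(float64, float64)")(clean_price)

print(native_clean_price(100.0, 0.2))  # 80.0
```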

Publications

  • De Stefani, L., Spiegelberg, L., Kraska, T., & Upfal, E. (2018). VizRec: A framework for secure data exploration via visual representation. (under review)
  • Engel, J., Scherer, M., & Spiegelberg, L. (2017). One-Factor Lévy-Frailty Copulas with Inhomogeneous Trigger Rates. In M. B. Ferraro, P. Giordani, B. Vantaggi, M. Gagolewski, M. Ángeles Gil, P. Grzegorzewski, & O. Hryniewicz (Eds.), Soft Methods for Data Science (pp. 205–212). Cham: Springer International Publishing.
  • Spiegelberg, L. (2017). Model-free approaches for evaluating counterparty credit risk (Master’s thesis). Technische Universität München.
  • Spiegelberg, L. (2013). Ocean circulation studies in the Labrador Sea (Bachelor’s thesis). Technische Universität München.

Other Projects

This is an (incomplete) list of other projects I have worked on in the past:

Picture Perfect Plates
Building a model to classify restaurant images

Together with TripAdvisor, Inc., we developed a computer vision model from the ground up to classify restaurant images into different categories. More about this project can be read here and here.

Two-stage Dictionary Selection
via greedy selection

In this project we developed a two-stage optimization algorithm for dictionary selection under a supermodularity assumption, based on greedy minimization of weakly supermodular set functions. Our final report can be found here.
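
For reference, a plain single-stage greedy baseline for dictionary selection looks roughly like the sketch below (the data and the least-squares objective are illustrative assumptions); the two-stage algorithm and its guarantees for weakly supermodular objectives are described in the report.

```python
import numpy as np

def greedy_dictionary_selection(candidates, signals, k):
    """Greedily pick k atoms (columns of `candidates`) that minimize the
    residual of least-squares reconstructions of `signals`."""
    selected = []
    for _ in range(k):
        best_atom, best_err = None, np.inf
        for j in range(candidates.shape[1]):
            if j in selected:
                continue
            D = candidates[:, selected + [j]]
            coeffs, *_ = np.linalg.lstsq(D, signals, rcond=None)
            err = np.linalg.norm(signals - D @ coeffs)
            if err < best_err:
                best_atom, best_err = j, err
        selected.append(best_atom)
    return selected

rng = np.random.default_rng(0)
atoms = rng.standard_normal((50, 20))                     # 20 candidate atoms
signals = atoms[:, [3, 7]] @ rng.standard_normal((2, 5))  # built from atoms 3 and 7
print(greedy_dictionary_selection(atoms, signals, k=2))   # typically recovers {3, 7}
```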

Adaptive Extendible Hashmaps
with a better in-bucket strategy

Extendible hashmaps are a commonly used data structure for dynamic hashing (i.e. a growable hashmap) in filesystems and databases. In this work we relaxed the assumption of fixed buckets and experimented with different in-bucket structures to better adapt to different workloads.
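
The compact Python sketch below shows a textbook extendible hashmap in which the in-bucket structure is a pluggable component (here just an unsorted list of pairs); the adaptive strategies we experimented with would swap in different bucket classes, e.g. sorted arrays or small hash tables.

```python
class ListBucket:
    """Simplest in-bucket strategy: an unsorted list of (key, value) pairs."""
    def __init__(self, depth, capacity=4):
        self.depth, self.capacity, self.items = depth, capacity, []

    def get(self, key):
        for k, v in self.items:
            if k == key:
                return v
        raise KeyError(key)

    def put(self, key, value):
        self.items = [(k, v) for k, v in self.items if k != key] + [(key, value)]

    def all_items(self):
        return list(self.items)

    def full(self):
        return len(self.items) > self.capacity


class ExtendibleHashMap:
    def __init__(self, bucket_cls=ListBucket):
        self.global_depth = 1
        self.bucket_cls = bucket_cls
        self.directory = [bucket_cls(1), bucket_cls(1)]

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._index(key)].get(key)

    def put(self, key, value):
        bucket = self.directory[self._index(key)]
        bucket.put(key, value)
        while bucket.full():
            self._split(bucket)
            bucket = self.directory[self._index(key)]

    def _split(self, bucket):
        if bucket.depth == self.global_depth:   # directory is too small: double it
            self.directory += self.directory
            self.global_depth += 1
        b0, b1 = self.bucket_cls(bucket.depth + 1), self.bucket_cls(bucket.depth + 1)
        for k, v in bucket.all_items():         # redistribute by the next hash bit
            (b1 if (hash(k) >> bucket.depth) & 1 else b0).put(k, v)
        for i, b in enumerate(self.directory):  # repoint directory entries
            if b is bucket:
                self.directory[i] = b1 if (i >> bucket.depth) & 1 else b0


m = ExtendibleHashMap()
for i in range(100):
    m.put(f"key{i}", i)
print(m.get("key42"))  # 42
```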

Ordinal regression
in Apache Spark

With ratings and curated lists being widespread, ordinal regression is a variant of the popular linear model that takes the ordering of such data into account. In this project we extended Spark with an ordinal regression model and performed various experiments.
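
Our implementation lives inside Spark, but the core idea can be sketched independently of it: exploit the ordering of the labels by reducing the problem to cumulative binary problems P(y > k). The toy version below uses NumPy and scikit-learn purely for illustration and is not the Spark code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SimpleOrdinalRegression:
    """Toy ordinal regression via binary decomposition: fit one classifier
    per threshold k estimating P(y > k), then read the predicted class off
    the cumulative probabilities."""

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.models_ = [LogisticRegression().fit(X, (y > k).astype(int))
                        for k in self.classes_[:-1]]
        return self

    def predict(self, X):
        # P(y > k) for each threshold, framed by P(y > min-1)=1 and P(y > max)=0
        p_greater = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        cum = np.hstack([np.ones((X.shape[0], 1)), p_greater, np.zeros((X.shape[0], 1))])
        probs = cum[:, :-1] - cum[:, 1:]        # P(y = k) per class
        return self.classes_[np.argmax(probs, axis=1)]

# Example: 1-5 star ratings driven by a single latent score.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = np.clip(np.round(3 + 1.5 * X[:, 0] + rng.normal(scale=0.5, size=500)), 1, 5).astype(int)
model = SimpleOrdinalRegression().fit(X, y)
print((model.predict(X) == y).mean())  # training accuracy, typically well above chance
```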