New England Database Symposium

New England Database Society

Friday, May 4, 2012

sponsored by Netezza Corporation

NEDS

Statistical Data-Analysis in an RDBMS Almost for Free
Christopher Ré
University of Wisconsin-Madison

Friday, May 4, 2012, 4PM
HP/Vertica Computer Science Lounge (Volen 104), Brandeis University

(preceded by a wine and cheese reception at 3:00 pm, and followed by dinner at 6:00 pm)

Abstract:

The main question driving my research is: how does one deploy statistical data-analysis tools to enhance data-driven systems? Our goal is to find the abstractions that one needs to deploy and maintain such systems. In this talk, I describe my group's attack on this question by building a diverse set of statistics-based, data-driven applications: a system whose goal is to read the Web and answer complex questions, a muon detector in collaboration with a neutrino telescope called IceCube, and social-science applications involving rich content (OCR and speech data). Even in this diverse set, we have found common abstractions that we are exploiting to build systems.

In the technical portion of the talk, I discuss one such abstraction, which we found while attempting to answer the question: how can we bring sophisticated data-analysis tools to data that lives in an RDBMS? My technical message is that the algorithmic problems underlying many statistical data-analysis techniques can be solved with a classical algorithm called incremental gradient descent that is no more difficult to compute than a SQL AVG. To demonstrate our point, we have implemented this method on top of a handful of commercial and open-source databases. Our approach is often faster than special-purpose tools and avoids a messy export-reimport cycle.
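
To illustrate the comparison with AVG, here is a minimal Python sketch. It is not the speaker's implementation: the function names, the least-squares objective, the step size, and the toy data are all assumptions made for illustration. Both computations keep a small state and update it once per row, which is the shape of a user-defined aggregate.

```python
from functools import reduce

# AVG as a fold: the state is (running total, row count).
def avg_step(state, value):
    total, count = state
    return (total + value, count + 1)

def avg(values):
    total, count = reduce(avg_step, values, (0.0, 0))
    return total / count

# Incremental gradient descent as a fold: the state is the model
# vector w. Each row (x, y) contributes one gradient step on the
# squared error 0.5 * (w.x - y)^2 (an assumed objective).
def igd_step(w, row, lr):
    x, y = row
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def igd(rows, dim, epochs=500, lr=0.05):
    w = [0.0] * dim
    for _ in range(epochs):  # one epoch = one scan over the table
        w = reduce(lambda state, row: igd_step(state, row, lr), rows, w)
    return w

# Hypothetical table: y = 2*x + 1, with a constant feature for the intercept.
rows = [((float(x), 1.0), 2.0 * x + 1.0) for x in range(4)]
weights = igd(rows, dim=2)
mean_y = avg([y for _, y in rows])
```

The only structural difference from AVG is that the state is a model vector rather than a sum and a count, and that the table is scanned several times instead of once.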

Papers, software, virtual machines containing installations of our software with data, and links to applications that are discussed in this talk are available from http://www.cs.wisc.edu/hazy.

Speaker's Bio:

Christopher (Chris) Ré is an assistant professor in the Department of Computer Sciences at the University of Wisconsin-Madison. The goal of his work is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington, Seattle, under the supervision of Dan Suciu. For his PhD work in the area of probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. Chris's papers have received four best-paper or best-of-conference citations (best paper in PODS 2012 and best-of-conference in PODS 2010, twice, and one in ICDE 2009). Chris received an NSF CAREER Award in 2011 and was recently granted his first patent.