New England Database Symposium

New England Database Society

Friday, October 07, 2011

sponsored by Netezza Corporation

NEDS

MADDER and Self-Tuning Data Analytics on Hadoop with Starfish

Shivnath Babu
Duke University

Friday, October 07, 2011, 4PM
HP/Vertica Computer Science Lounge (Volen 104), Brandeis University

(preceded by a wine and cheese reception at 3:30 pm, and followed by dinner at 6:00 pm)

Abstract:

Timely and cost-effective analytics over "big data" is now a key ingredient for success in businesses and scientific disciplines. The Hadoop platform (consisting of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces for expressing analysis tasks) is an emerging choice for big data analytics. Out of the box, Hadoop's performance can be poor, causing suboptimal use of resources, time, and money (e.g., in pay-as-you-go clouds). Unfortunately, practitioners of big data analytics, such as business analysts, computational scientists, and researchers, often lack the expertise to tune the Hadoop platform for good performance.

I will introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to system workloads and user needs to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in the Hadoop platform. While Starfish's design is guided by work on self-tuning database systems, I will discuss how new analysis practices over big data (dubbed the MADDER principles) pose new challenges, leading us to different design choices in Starfish. Starfish is under active development and is available at: http://www.cs.duke.edu/starfish

Speaker's Bio:

Shivnath Babu is an Assistant Professor of Computer Science at Duke University. He received his Ph.D. from Stanford University in 2005. He has received a U.S. National Science Foundation CAREER Award and three IBM Faculty Awards. His research interests are in ease-of-use and manageability of data-intensive computing systems, automated problem diagnosis and cluster sizing for systems running on cloud platforms, and automated detection of and recovery from data corruption caused by hardware faults, software bugs, or human mistakes.