New England Database Society sponsored by Netezza Corporation
NEDS
MADDER and Self-Tuning Data Analytics on Hadoop with Starfish
Shivnath Babu
Duke University
Friday, October 07, 2011, 4PM
HP/Vertica Computer Science Lounge (Volen 104),
Brandeis University
(preceded by a wine and cheese reception at 3:30 pm, and followed by dinner at 6:00 pm)
Abstract:
Timely and cost-effective analytics over "big data" is now a key ingredient
for success in businesses and scientific disciplines. The Hadoop
platform---consisting of an extensible MapReduce execution engine,
pluggable distributed storage engines, and a range of procedural to
declarative interfaces to express analysis tasks---is an emerging
choice for big data analytics. Hadoop's performance out of the box can
be poor, causing suboptimal use of resources, time, and money (e.g., in
pay-as-you-go clouds). Unfortunately, practitioners of big data
analytics such as business analysts, computational scientists, and
researchers often lack the expertise to tune the Hadoop platform for
good performance.
I will introduce Starfish, a self-tuning system for big data analytics.
Starfish builds on Hadoop, while adapting to system workloads and user
needs to provide good performance automatically, without any need for
users to understand and manipulate the many tuning knobs in the Hadoop
platform. While Starfish's design is guided by work on self-tuning
database systems, I will discuss how new analysis practices (dubbed the
MADDER principles) over big data pose new challenges, leading us to
different design choices in Starfish. Starfish is under active
development and is available at: http://www.cs.duke.edu/starfish
Shivnath Babu is an Assistant Professor of Computer Science at Duke
University. He earned his Ph.D. from Stanford University in 2005. He has
received a U.S. National Science Foundation CAREER Award and three IBM
Faculty Awards. His research interests are in ease-of-use and
manageability of data-intensive computing systems, automated problem
diagnosis and cluster sizing for systems running on cloud platforms, and
automated detection of and recovery from data corruption caused by
hardware faults, software bugs, or human mistakes.
Maintained by Olga Papaemmanouil olga AT cs.brandeis.edu