The 42nd IPP Symposium

MapReduce and Parallel DBMSs: Together at Last

Andy Pavlo, Brown University

The MapReduce (MR) paradigm is heralded as a revolutionary new platform for large-scale, massively parallel data access. Some proponents claim that the extreme scalability of MR will relegate relational database management systems (DBMSs) to the status of legacy technology. In this talk, however, I will discuss the results of our recent benchmark study, which suggest that using MR systems to perform tasks best suited to DBMSs yields less than satisfactory results [PPR+09]. This leads us to conclude that MR is more akin to an Extract-Transform-Load (ETL) system than a DBMS, as it can quickly load and analyze large amounts of data in an ad hoc manner [SAD+10]. As such, it is complementary to DBMS technology rather than a competitor.

I will also discuss how the DBMS community has embraced MR technologies in the last year, and which DBMS features are being incorporated into popular open-source MR implementations.

Andrew Pavlo is a third-year Computer Science PhD candidate with Brown University's Data Management Group under the guidance of Stanley Zdonik. His research includes the development of automatic database design and optimization algorithms for H-Store, Michael Stonebraker's OLTP database project. Prior to arriving at Brown, Andrew was a systems programmer for the Condor Project at the University of Wisconsin-Madison under Miron Livny. He completed his undergraduate degree at the Rochester Institute of Technology in New York.