Thesis Defense


"C-MR: Continuous Execution of MapReduce Workflows for Stream Processing"

Nathan Backman

Thursday, July 26, 2012, at 2:00 P.M.

Lubrano Conference Room

Data processing frameworks provide application programmers with an interface to manipulate and analyze data. This thesis studies a novel parallel stream processing model, designed for workflow-based data processing frameworks, that leverages application performance requirements to motivate the flexible scheduling and fine-grained allocation of data to computing nodes.

We demonstrate this processing model through the design and implementation of the Continuous-MapReduce (C-MR) data processing framework. C-MR abstracts away the complexities of parallel stream processing and workflow scheduling while providing the simple and familiar MapReduce programming interface with the addition of stream window semantics. Its novel processing model enables: 1) fine-grained, workflow-wide load balancing across computing nodes; 2) the evolving application of data and task parallelism models as guided by application performance requirements; and 3) a novel scheduling framework that supports gradual transitions between scheduling policies relative to application performance and/or resource availability.

This work explores the potential of the C-MR processing model by studying our single-host implementation of C-MR, which supports parallel execution on non-dedicated and heterogeneous computing nodes (both multi-core CPUs and GPUs). We then study this processing model through the implementation of a distributed version of C-MR that supports execution on multiple hosts. The distributed version follows the generalizable strategy of employing hierarchical instances of the C-MR processing model, and it required modifications to the data acquisition and load-balancing strategies. Experimental results from these studies show that the C-MR processing model can effectively support the continuous execution of workflows of MapReduce jobs for stream processing while remaining resilient to stream and resource fluctuations, owing to the processing model's flexibility and diversification of processing responsibilities.

Host: Ugur Cetintemel