A continuous MapReduce framework for stream-oriented applications



The Continuous-MapReduce (C-MR) framework is a parallelized stream processing engine being developed at Brown University. C-MR supports low-latency stream processing through the familiar MapReduce interface, augmented with integrated window semantics. In keeping with the philosophy of MapReduce, C-MR abstracts away the complexities of parallelized data stream management and windowing for complex workflows of MapReduce jobs.

To support the continuous application of MapReduce jobs to data streams over parallel hardware platforms, it is necessary to provide the following key features and functionalities:
  • Little or no changes to the standard MapReduce programming model
  • Support for complex workflows of MapReduce jobs
  • Automatic window management to maintain stream order in the midst of parallelized operations
  • A workflow-wide, latency-oriented scheduler that can effectively utilize the available parallel resources to deal with time-varying load and spikes common in real-world streams
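To make the first requirement concrete, the sketch below shows how an unmodified MapReduce word count could be applied to one window of a timestamped stream. The `WindowSpec` class, the `run_window` helper, and the tuple layout `(timestamp, key, value)` are illustrative assumptions, not the actual C-MR API; note that the user-supplied `map_fn` and `reduce_fn` are ordinary MapReduce code with no window logic in them.

```python
from collections import defaultdict

# Hypothetical window specification; field names are illustrative only.
class WindowSpec:
    def __init__(self, size, slide):
        self.size = size    # window extent in application-time units
        self.slide = slide  # window advance in application-time units

# Standard, unmodified MapReduce user code: a word count.
def map_fn(_, line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_window(tuples, window, map_fn, reduce_fn):
    """Apply one MapReduce pass to the tuples whose application
    timestamps fall inside a single window (a sequential sketch;
    C-MR would partition this work across parallel nodes)."""
    lo, hi = window
    groups = defaultdict(list)
    for ts, key, value in tuples:
        if lo <= ts < hi:               # window membership by app timestamp
            for k, v in map_fn(key, value):
                groups[k].append(v)     # shuffle: group map output by key
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

stream = [(0, None, "a b a"), (1, None, "b c"), (5, None, "a")]
print(run_window(stream, (0, 5), map_fn, reduce_fn))  # {'a': 2, 'b': 2, 'c': 1}
```

The point of the sketch is that windowing lives entirely in the framework (`run_window`), so the programming model the user sees is unchanged.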
Our work describes the design and implementation of Continuous-MapReduce. We make the following high-level contributions:
  • Transparently augmenting the MapReduce processing model to support windows for parallel stream processing. This includes automatic windowed stream management, preservation of temporal order after partitioned stream processing and prior to temporal aggregation, and windows defined over application timestamps, all without changing the basic MapReduce programming model and semantics.
  • Effectively leveraging heterogeneous parallel computing nodes. We provide the design and implementation of a generic, parallel stream processing system that can effectively utilize the available computing resources, such as CPUs and GPUs, with potentially heterogeneous capabilities.
  • Supporting flexible and dynamic workflow scheduling to deal with variations in load and resource availability. To this end, we present a progressive scheduler, designed specifically for generic computing nodes, that incrementally and continuously transitions between scheduling policies based on the status of system resources. For example, a system that normally uses a latency-minimizing scheduling policy can switch to a memory-saving policy under memory stress to avoid thrashing.
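The progressive transition in the last contribution can be sketched as a blend of two operator-scoring policies, with the blend weight driven by memory utilization. The scoring functions, thresholds, and operator fields below are assumptions for illustration, not the actual C-MR scheduler.

```python
# Illustrative scoring policies; field names are hypothetical.
def latency_score(op):
    # Latency policy: prefer operators closest to producing workflow output.
    return -op["distance_to_output"]

def memory_score(op):
    # Memory policy: prefer operators that drain the largest input queues.
    return op["queue_bytes"]

def pick_next(ops, mem_utilization, low=0.6, high=0.9):
    """Pick the next operator to run. Below `low` memory utilization the
    choice is pure latency minimization; above `high` it is pure memory
    saving; in between, the two policies are blended linearly, giving an
    incremental rather than abrupt policy switch."""
    if mem_utilization <= low:
        w = 0.0
    elif mem_utilization >= high:
        w = 1.0
    else:
        w = (mem_utilization - low) / (high - low)

    def normalize(scores):
        # Scale scores into [0, 1] so the two policies blend comparably.
        lo_s, hi_s = min(scores), max(scores)
        span = (hi_s - lo_s) or 1.0
        return [(s - lo_s) / span for s in scores]

    lat = normalize([latency_score(op) for op in ops])
    mem = normalize([memory_score(op) for op in ops])
    blended = [(1 - w) * l + w * m for l, m in zip(lat, mem)]
    return max(range(len(ops)), key=lambda i: blended[i])

ops = [{"distance_to_output": 0, "queue_bytes": 100},
       {"distance_to_output": 3, "queue_bytes": 9000}]
print(pick_next(ops, 0.3))   # 0: low memory pressure, latency policy wins
print(pick_next(ops, 0.95))  # 1: high memory pressure, memory policy wins
```

The design choice worth noting is the continuous weight `w`: rather than flipping policies at a single threshold (which can oscillate near the boundary), the scheduler shifts its preference gradually as pressure rises.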

This work has been partially supported by the National Science Foundation under grants No. IIS-0916691 and No. IIS-0448284.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.