Distributed Stream Processing Engine |
 |
|
Software
Borealis is a distributed stream processing engine that is being developed at
Brandeis University,
Brown University, and
MIT. Borealis builds on our previous efforts in the area
of stream processing: Aurora and Medusa.
The Borealis project is no longer an active research project.
The last two releases of Borealis are:
The Borealis software runs on Linux x86-based computers.
The instructions to install Borealis
have details about system compatibility and software installation.
The Borealis Application Programmer's Guide
describes how to write application programs using Borealis. The
Borealis Developer's Guide
and the paper,
A Distributed Catalog for the Borealis Stream Processing Engine,
describes the distributed operation of the Borealis system software.
The current version of Borealis includes the following modules:
- A Stream Processing Engine that provides the core low-latency stream processing functionality with a rich set of stream-oriented operators.
- A Coordinator that deploys a network of cooperating Borealis stream
processing engines, distributes query processing across multiple
machines, and maintains integrity and correct operation as the network
is dynamically mutated.
- A Load Manager that monitors run-time load and dynamically moves operators across machines to improve performance.
- A Load Shedder that detects and eliminates CPU overload from multiple
machines by selectively dropping tuples in a coordinated fashion.
- A Fault Tolerance module that runs redundant query replicas to deal with various failure modes and achieve high availability.
- A Revision Processing mechanism that process corrections of erroneous
input tuples by generating corrections of previously output query
results.
The release also includes the following tools:
- A Graphical Query Editor that simplifies the composition of streaming
queries using a boxes and arrows interface. Queries may also be written
using a textual (XML-based) query definition language.
- A System Visualizer that dynamically displays the Borealis network topology and various run-time statistics.
- A Stream Connection Generator to
generate code to marshal data between applications and Borealis.
Copyright 2006, Brandeis University, Brown University, and Massachusetts
Institute of Technology; all rights reserved.
Redistribution and use in source and binary forms,
with or without modification, are permitted provided that the following
conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither Brandeis University, Brown University, MIT names, nor the names
of individual contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS
PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Overview
Over the last several years, a great deal of progress has been made in
the area of stream processing engines (SPEs). Several groups have developed
working prototypes and many
papers have been published on detailed aspects of the technology such
as stream-oriented languages, basic resource management, and resource-constrained one-pass query
processing. While this
work is an important first step, fundamental mismatches remain between
the requirements of many streaming applications and the capabilities
of first-generation systems.
In the Borealis project, we
identify and address the following shortcomings of current stream
processing techniques:
- Distributed, highly-available operation: Distributed operation
across a cluster of commodity machines is the most economical way of
achieving high scalability and availability, two key design concerns common
to many streaming domains. Furthermore, typical streaming workloads exhibit
significant bursts and variations over all time-scales, requiring the
ability to dynamically distribute load and deal with transient overloads. We
are investigating novel distributed architectures and algorithms customized
on the basis of the requirements and characteristics of stream processing
applications. Examples include load distribution algorithms that are
tolerant of load spikes and high availability approaches that enable
parallel, low latency recovery.
- Dynamic revision of query results: In many real-world
streams, corrections or updates to previously processed data are
available only after the fact. For instance, many popular data
streams, such as the Reuters stock market feed, often include
so-called revision records, which allow the feed originator
to correct errors in previously reported data. Furthermore, stream
sources (such as sensors), as well as their connectivity, can be
highly volatile and unpredictable. As a result, data may arrive late
and miss its processing window, or may be ignored temporarily due to
an overload situation. In all these cases, applications are forced to
live with imperfect results, unless the system has means to revise its
processing and results to take into account newly available data or
updates.
- Dynamic query modification: In many stream processing
applications, it is desirable to change certain attributes of the
query at run time. For example, in the financial services domain,
traders typically wish to be alerted of interesting events,
where the definition of ``interesting'' (i.e., the corresponding
filter predicate) varies based on current context and results. In
network monitoring, the system may want to obtain more precise results
on a specific subnetwork, if there are signs of a potential
Denial-of-Service attack. Finally, in a military stream application
that Mitre explained to us, they wish to switch to a ``cheaper'' query when the
system is overloaded. For the first two applications, it is sufficient to
simply alter the operator parameters (e.g., window size, filter predicate),
whereas the last one calls for altering the operators that compose the
running query.
- Flexible and highly-scalable optimization: Currently,
commercial stream processing applications are popular in industrial
process control,
financial services, and network monitoring. Here we see a server heavy optimization problem
--- the key challenge is to process high-volume data streams on a
collection of resource-rich ``beefy'' servers. Over the horizon, we
see a very large number of applications of wireless sensor technology
(e.g., RFID in retail applications, cell phone services). Here, there is a
sensor heavy optimization problem --- the key
challenges revolve around extracting and processing sensor data from a
network of resource-constrained ``tiny'' devices. Further over the
horizon, we expect sensor networks to become faster and increase in
processing power. In this case the optimization problem becomes more
balanced, becoming sensor heavy, server heavy. Thus, there will be a need
for a more flexible optimization structure that can deal with a very
large number of devices and perform sensor-heavy
server-heavy resource management and optimization.
|
Publications
-
Fast and Highly-Available Stream Processing over Wide Area Networks
Jeong-Hyon Hwang, Ugur Cetintemel, and Stan Zdonik, 24th International Conference on Data Engineering (ICDE'08), Cancun, Mexico, April 2008.
-
Staying FIT: Efficient Load Shedding Techniques for Distributed Stream
Processing
Nesime Tatbul, Ugur Cetintemel, Stan Zdonik, 33rd International Conference on Very Large Data Bases (VLDB'07), Vienna, Austria, September 2007.
-
Fast and Reliable Stream Processing over Wide Area Networks
Jeong-Hyon Hwang, Ugur Cetintemel, and Stan Zdonik, IEEE 1st International Workshop on Scalable Stream Processing Systems (ICDE SSPS'07), Istanbul, Turkey, April 2007.
-
A Cooperative, Self-Configuring High-Availability Solution for Stream Processing
Jeong-Hyon Hwang, Ying Xing, Ugur Cetintemel, and Stan Zdonik, 23rd International Conference on Data Engineering (ICDE'07), Istanbul, Turkey, April 2007.
-
Providing Resiliency to Load Variations in Distributed Stream Processing
Ying Xing, Jeong-Hyon Hwang, Ugur
Cetintemel, Stan Zdonik, 32nd International Conference on Very Large
Data Bases (VLDB'06), Seoul, Korea, September 2006.
-
Window-aware Load Shedding for Aggregation Queries over Data Streams
Nesime Tatbul, Stan Zdonik, 32nd
International Conference on Very Large Data Bases (VLDB'06), Seoul,
Korea, September 2006.
- Revision Processing in a Stream Processing Engine: A High-Level Design
E. Ryvkina, A. S. Maskey, M. Cherniack, S. Zdonik, 22nd International
Conference on Data Engineering (ICDE'06), Atlanta, GA, April 2006.
- Dealing
with Overload in Distributed Stream Processing Systems
Nesime Tatbul, Stan Zdonik, IEEE International
Workshop on Networking Meets Databases (NetDB'06), Atlanta, GA, April 2006.
-
Fault-Tolerance
in the Borealis Distributed Stream Processing System
Magdalena Balazinska, Hari
Balakrishnan, Samuel Madden, and Michael Stonebraker, ACM SIGMOD International
Conference on Management of Data (SIGMOD'05), Baltimore, MD, June 2005.
- The
8 Requirements of Real-Time Stream Processing
Michael Stonebraker, Ugur Cetintemel, and
Stan Zdonik, 21st International Conference on Data Engineering
(ICDE'05), Tokyo, Japan, April 2005.
-
Dynamic
Load Distribution in the Borealis Stream Processor
Ying Xing, Stan Zdonik, and
Jeong-Hyon Hwang, 21st International Conference on Data Engineering
(ICDE'05), Tokyo, Japan, April 2005.
-
High-Availability
Algorithms for Distributed Stream Processing
Jeong-Hyon Hwang, Magdalena
Balazinska, Alexander Rasin, Ugur Cetintemel, Michael Stonebraker, and
Stan Zdonik, 21st International Conference on Data Engineering
(ICDE'05), Tokyo, Japan, April 2005.
-
The
Design of the Borealis Stream Processing Engine
Daniel J. Abadi, Yanif Ahmad,
Magdalena Balazinska, Ugur Cetintemel, Mitch Cherniack, Jeong-Hyon
Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther
Ryvkina, Nesime Tatbul, Ying Xing, and Stan Zdonik, 2nd Biennial
Conference on Innovative Data Systems Research (CIDR'05), Asilomar, CA,
January 2005.
-
Scalable Distributed Stream Processing
M. Cherniack, H. Balakrishnan, M. Balazinska,
D. Carney, U. Cetintemel, Y. Xing, S. Zdonik. In proceedings of the First
Biennial Conference on Innovative Database Systems (CIDR'03), Asilomar, CA,
January 2003.
Demonstrations:
- Distributed
Operation in the Borealis Stream Processing Engine
Yanif Ahmad, Bradley Berg, Ugur
Cetintemel, Mark Humphrey, Jeong-Hyon Hwang, Anjali Jhingran, Anurag
Maskey, Olga Papaemmanouil, Alex Rasin, Nesime Tatbul, Wenjuan Xing,
Ying Xing, Stan Zdonik, 2nd International Conference on Geosensor Networks,
Boston, MA, October 2006 (invited demo)
- Distributed
Operation in the Borealis Stream Processing Engine
Yanif Ahmad, Bradley Berg, Ugur
Cetintemel, Mark Humphrey, Jeong-Hyon Hwang, Anjali Jhingran, Anurag
Maskey, Olga Papaemmanouil, Alex Rasin, Nesime Tatbul, Wenjuan Xing,
Ying Xing, Stan Zdonik, ACM SIGMOD International Conference on
Management of Data (SIGMOD'05), Baltimore, MD, June 2005. Best
Demonstration Award. (Overview
Poster, Game Poster)(Some photos from the demo)
-
Magdalena Balazinska, Hari
Balakrishnan, Michael Stonebraker, ACM SIGMOD International Conference
on Management of Data (SIGMOD'04), Paris, France, June 2004. (Load Management and High
Availability (ppt))
|
People
Brandeis:
Brown:
MIT:
|
Acknowledgements
This project is based
upon work supported by the National Science Foundation under Grant No.
0325703 and 0086057. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s)
and do not necessarily reflect the views of the National Science
Foundation
|
|
|