Borealis Application Programming Guide

Updated: Tuesday June 21, 2005

Borealis is built with open source tools and runs on Linux x86 computers. The Winter 2005 version contains very little support for application programmers. A version will be posted in the Spring of 2005 with more user support.

1. PROGRAM BUILD PROCESS

Before starting, install the packages described under the heading "Packages Needed For Borealis". To build the software, read the comments in the following script, set up your environment, and then run the script:

borealis/utility/unix/build.borealis.sh

To see how to build and run a Borealis application, see the examples in the borealis/utility/test/simple/ directory.

An application consists of your C++ code (<program>.cc), one or more XML files (<program>.xml), and marshaling code (<Program>Marshal.h and <Program>Marshal.cc). The marshal program parses your XML files using the borealis.dtd and generates the marshaling code.

                                <program>.cc
<program>.xml -> [ marshal ] -> <Program>Marshal.h -> [ c++ ] -> <program>
borealis.dtd                    <Program>Marshal.cc

To build a Borealis application you will need the marshal program in a directory listed on your PATH variable. The marshal program is built in the borealis/tool/marshal/ directory. It can be built with:

> build.borealis.sh  -tool     # Build the Borealis tools.

To run a Borealis application you need to have the BigGiantHead program in a directory listed on your PATH variable. The BigGiantHead program is built along with the borealis and CentralLookupServer in the borealis/src/src/ directory. It can be built with:

> build.borealis.sh            # Build borealis, CentralLookupServer and BigGiantHead.

Run these programs in separate terminal windows in this order: CentralLookupServer; borealis; <program>

<program> launches the BigGiantHead, which sets up and starts the network and then exits. <program> then continues to run the application.

<program>.xml --> [ BigGiantHead ] <-----> [ borealis ]
                     [ program ]   <-----> [ borealis ]

Eventually the BigGiantHead will replace the CentralLookupServer. It will continue to run instead of just quitting. It will be able to read additional XML to modify the network on the fly. It will also validate the network (type checking) as it is constructed.

2. NETWORK DEFINITION

Networks are defined by Borealis XML files with the following format. The borealis.dtd contains a more formal definition and is commented. Note that XML is case-sensitive.

<?xml version="1.0"?>
<!DOCTYPE borealis SYSTEM "http://www.cs.brown.edu/research/borealis/borealis.dtd">

<borealis>
  <input   stream={stream name}  schema={schema name}    />
  <output  stream={stream name} [schema={schema name}]   />

  <box  name={box name}    type={transform}  >
    <in   stream={input stream name}   />
    <out  stream={output stream name}  />
    <parameter  name={parameter name}   value={parameter value} />
    <access    table={table name} />
  </box>
 
  <query  name={query name} >
    <box  name={box name}    type={transform}  >
       <in   stream={input stream name}   />
       <out  stream={output stream name}  />
       <parameter  name={parameter name}   value={parameter value} />
    </box>
  </query>

  <connectionpointview name={view name} stream={stream name} >
    <order field={field name} />?
  ( <size  value={number of tuples} />
  | <range start={start tuple}  end={end tuple} />
  )
  </connectionpointview>
</borealis>

A deployment file can be written to specify how the Borealis network is distributed over Borealis components. If you do not use a deployment file then a default deployment will be used. Deployment files use the following format.

<?xml version="1.0"?>
<!DOCTYPE borealis SYSTEM "http://www.cs.brown.edu/research/borealis/borealis.dtd">

<deploy   [recovery=( amnesia | upstream | passive | active )] >
  <publish    stream={stream name}  [endpoint={primary}]  />
  <subscribe  stream={stream name}  [endpoint={monitor}]  [gap={gap size}] />

  <node  primary={Borealis node}  query={query name ...}
         [backup={Borealis node}]
       [recovery=( amnesia | upstream | passive | active )] />

  <region  node={Borealis node}  [endpoint={regional component}] />
  <global  endpoint={global component} />
</deploy>

Data and control portals are specified by the endpoint, node, primary and backup attributes. Each designates a communications portal for a component that receives data or control information from another component. A portal is designated by the IP address of the computer running the receiving component and a port number. The IP address may be given as the name of the computer or as a dotted IP address. The port number is a 16 bit unsigned integer used to select a unique communications channel. Portals are encoded as:

[{host address}] [:{port number}]

The host address defaults to the computer running the Head. Different default values are used for port numbers depending on the type of component. Constants are defined in src/modules/catalog/distributed/Diagram.h.

Component              Default port    Used for
Borealis node          15000           node, primary, backup, publish endpoint
Output monitor         25000           subscribe endpoint
Head                   35000           dynamic deployment
Regional component     45000           region endpoint
Global component       55000           global endpoint
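
The portal encoding and its defaults can be illustrated with a small parser. This is only a sketch, not the code Borealis uses: the constant names, the Portal struct and parsePortal() are hypothetical (the real constants are in Diagram.h), and it assumes the host and port are written without embedded white space.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Default ports from the table above. These names are hypothetical;
// the real constants are defined in Diagram.h.
constexpr std::uint16_t kNodePort     = 15000;
constexpr std::uint16_t kMonitorPort  = 25000;
constexpr std::uint16_t kHeadPort     = 35000;
constexpr std::uint16_t kRegionalPort = 45000;
constexpr std::uint16_t kGlobalPort   = 55000;

struct Portal {
    std::string   host;
    std::uint16_t port;
};

// Parse "[{host address}][:{port number}]". Any part that is omitted
// is taken from defaultHost or defaultPort.
Portal parsePortal(const std::string& text,
                   const std::string& defaultHost,
                   std::uint16_t      defaultPort)
{
    Portal portal{defaultHost, defaultPort};

    const std::string::size_type colon = text.find(':');
    const std::string host = text.substr(0, colon);  // whole text if no colon
    if (!host.empty()) {
        portal.host = host;
    }
    if (colon != std::string::npos) {
        const std::string digits = text.substr(colon + 1);
        if (digits.empty() ||
            digits.find_first_not_of("0123456789") != std::string::npos) {
            throw std::invalid_argument("bad port in portal: " + text);
        }
        const unsigned long value = std::stoul(digits);
        if (value > 65535UL) {
            throw std::out_of_range("port exceeds 16 bits: " + text);
        }
        portal.port = static_cast<std::uint16_t>(value);
    }
    return portal;
}
```

For example, parsePortal(":25001", headHost, kMonitorPort) keeps the default host and overrides only the port, while an empty string yields both defaults.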

When no deployment XML files are passed to the Head, a default deployment is used. A publish element is assigned to each input element and a subscribe element is assigned to each output element. No regional or global components are deployed. No backup nodes are deployed either.
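
To make the default rule concrete, the sketch below writes out deployment XML with one publish element per input and one subscribe element per output. It only illustrates the rule as described here; the function name is hypothetical, endpoints are left to their defaults, node placement is not shown, and the Head's actual implementation may differ.

```cpp
#include <string>
#include <vector>

// Build deployment XML text that mirrors the default rule: a publish
// for every input stream and a subscribe for every output stream.
// Stream names are copied as given; no XML escaping is attempted.
std::string defaultDeploymentXml(const std::vector<std::string>& inputs,
                                 const std::vector<std::string>& outputs)
{
    std::string xml = "<deploy>\n";
    for (const std::string& stream : inputs) {
        xml += "  <publish    stream=\"" + stream + "\" />\n";
    }
    for (const std::string& stream : outputs) {
        xml += "  <subscribe  stream=\"" + stream + "\" />\n";
    }
    xml += "</deploy>\n";
    return xml;
}
```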

When testing a system with multiple components it is often convenient to deploy them all on a single computer. The -l option on the Head will override the portals specified in the deployment XML with portals on the local host computer.

Applications may also pass XML to the Head using dynamic deployment. When no XML files are given to the Head via the command line it runs in persistent mode. Applications pass XML to the Head using the RPC calls deployXmlString and deployXmlFile in the HeadServer class. Applications can include borealis/tool/head/HeadClient.h to access these methods. Note that Global components require that the Head be run in persistent mode.

2.1 Stream Definition

A Borealis network is a dataflow diagram. The network can be distributed across several processing nodes.

There are three types of dataflows: streams, inputs and outputs. Streams are dataflows that are inside a node. They are implicitly declared from their name in stream attributes on boxes. Inputs and outputs are dataflows connecting an application program to Borealis nodes.

<input   stream={stream name}  schema={schema name}   />
<output  stream={stream name} [schema={schema name}]  />

An input passes data from an application to a node, and an output sends data from a node to an application. For now, output streams must be connected to streams that are in turn connected to box outputs. In other words, an output cannot be directly connected to an input.

The publish and subscribe elements in deployment XML connect application components to streams. Multiple components may publish data to an input stream and several components may subscribe to an output stream. Consequently there may be one or more publish or subscribe elements per stream.

Streams have one incoming (source) connection and may fan out to several outgoing (sink) connections.

The data passed over inputs and outputs are flat structures defined by a schema. Schemas for outputs will eventually be declared implicitly. For now they must be defined.

<schema  name={schema name} >
    <field  name={field name}  type={field type}  [size={string size}]/> ...
</schema>

Field types may be:

int         A 32 bit signed integer.
long        A 64 bit signed integer.
single      A 32 bit IEEE floating point value.
double      A 64 bit IEEE floating point value.
string      A fixed length, zero filled sequence of bytes.
timestamp   A 32 bit time value.
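
As a rough picture of how these field types map onto a flat C++ structure, consider the sketch below. The struct, its field names and the helper are hypothetical and are not the code that the marshal program generates. The packed layout and the signed 32 bit timestamp are assumptions made for the illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical schema with one field of each type. Packing is assumed
// so that the struct is a flat run of bytes with no padding.
#pragma pack(push, 1)
struct PacketTuple {
    std::int32_t  id;         // int:       32 bit signed integer
    std::int64_t  bytes;      // long:      64 bit signed integer
    float         ratio;      // single:    32 bit IEEE floating point
    double        total;      // double:    64 bit IEEE floating point
    char          label[8];   // string:    fixed length (size=8), zero filled
    std::int32_t  time;       // timestamp: 32 bit time value (signedness assumed)
};
#pragma pack(pop)

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "IEEE single and double sizes are assumed");
static_assert(sizeof(PacketTuple) == 4 + 8 + 4 + 8 + 8 + 4,
              "packed layout should contain no padding");

// Store a value in a fixed length string field. Short values are zero
// filled; long values are truncated to the field size.
template <std::size_t N>
void setFixedString(char (&field)[N], const std::string& value)
{
    std::memset(field, 0, N);
    std::memcpy(field, value.data(), std::min<std::size_t>(value.size(), N));
}
```

Note that a string field filled to its full size has no terminating zero byte, so it should not be read with functions that expect a C string.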

2.2 Query Definition

Processing in a Borealis node is performed by boxes. A query is a group of boxes. This is useful for starting or stopping the group as a whole. For clarity, other related elements may be defined within a query, but this does not affect the behavior of the network.

<query  name={query name} >
    <box  name={box name}    type={transform}  >
        <in   stream={input stream name}   />
        <out  stream={output stream name}  />
        <parameter  name={parameter name}   value={parameter value} />
    </box>
</query>

If a box is not wrapped in a query, a query will be defined containing just the one box. The name of the query will be the same as the box name.
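
The naming rule for unwrapped boxes can be sketched as follows. The Box struct and groupIntoQueries() are hypothetical names used only for illustration; an empty query name stands for a box that is not wrapped in a query.

```cpp
#include <map>
#include <string>
#include <vector>

struct Box {
    std::string name;
    std::string query;   // empty when the box is not wrapped in a query
};

// Map each query name to the names of its boxes. A box without a query
// gets a query of its own, named after the box.
std::map<std::string, std::vector<std::string>>
groupIntoQueries(const std::vector<Box>& boxes)
{
    std::map<std::string, std::vector<std::string>> queries;
    for (const Box& box : boxes) {
        const std::string& query = box.query.empty() ? box.name : box.query;
        queries[query].push_back(box.name);
    }
    return queries;
}
```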

2.3 Connection Point View Definition

<connectionpointview name={view name} stream={stream name} >
        <order field={field name} />?
      ( <size  value={number of tuples} />
      | <range start={start tuple}  end={end tuple} />
      )
</connectionpointview>

CPViews can be placed on any arc of the network when the network is constructed or modified. A CPView accumulates the tuples that flow on that arc according to a specification defined by the user. The specification also determines whether a CPView is fixed or moving. A fixed view stores a set of tuples that does not change over time. A moving view stores a window of tuples defined with respect to the latest tuple seen on the arc, so the set of stored tuples changes as new tuples flow along the arc. Several CPViews with different specifications can coexist on the same arc.

The user can define a size or a range and, optionally, a prediction function.

Size can be given as a number of tuples or in terms of values of the order_by field of the stream on which the CPView is defined. If the specification defines a size, the CPView is moving, and the size determines how many of the most recent tuples or values the CPView stores. A range is defined by start and end parameters (again, either tuples or order_by values). A range can define either a fixed or a moving CPView. If start and end are absolute, the CPView is fixed because the stored tuples do not change. If start and end are relative (e.g. now and "now - 1 hour"), the view is moving. The prediction function is an optional parameter and should be defined only for CPViews that are used for time travel into the future. For now this parameter is not used.
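
The difference between a moving view defined by size and a fixed view defined by an absolute range can be sketched with two small classes. These are illustrative only and are not the Borealis implementation: the class names and the Tuple layout are hypothetical, the size is counted in tuples, and the range is assumed to be inclusive and expressed in order field values.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct Tuple {
    long order;   // value of the field the view is ordered on
    int  value;   // payload
};

// Moving view defined by a size: keeps only the most recent tuples.
class MovingSizeView {
public:
    explicit MovingSizeView(std::size_t size) : size_(size) {}

    void observe(const Tuple& tuple)
    {
        window_.push_back(tuple);
        if (window_.size() > size_) {
            window_.pop_front();          // drop the oldest tuple
        }
    }
    const std::deque<Tuple>& tuples() const { return window_; }

private:
    std::size_t       size_;
    std::deque<Tuple> window_;
};

// Fixed view defined by an absolute range: keeps every tuple whose
// order value lies in [start, end] and never discards one.
class FixedRangeView {
public:
    FixedRangeView(long start, long end) : start_(start), end_(end) {}

    void observe(const Tuple& tuple)
    {
        if (tuple.order >= start_ && tuple.order <= end_) {
            stored_.push_back(tuple);
        }
    }
    const std::vector<Tuple>& tuples() const { return stored_; }

private:
    long               start_;
    long               end_;
    std::vector<Tuple> stored_;
};
```

Feeding the same arc to both views shows the contrast: the moving view's contents keep shifting as tuples arrive, while the fixed view stops changing once the arc has passed the end of its range.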

CPViews can be used for attaching ad-hoc queries and for time travel. Time travel can go into either the past or the future, and it can happen either on the main query network or on a copy of the branch of the network downstream from the CPView. For a CPView to time travel into the future, the prediction function parameter F must be defined for the view. Time travel is started by issuing a replay() command on the CPView. The parameters of replay() determine the part of the past or future in which time travel happens.