Array Design (week of 8/17/06)



Borealis

Borealis Type Checker

Borealis Expression Library

Aggregate -Parse and handle aggregate box. StandardAggregate contains most of the implementation. This will be minimally impacted by arrays in the expression phase because it only uses expressions in its arguments.

EString -Basic string type for standard functions. May make a user-defined type for Arrays.

EStringPool -manages allocation and string management. Will not be impacted by arrays.

ExprLexer -Lexical analyzer for string expressions. Invoked in expression parse method. Takes input streams and breaks them into tokens. Will need to add previously unimplemented operators like the dot product and brackets.

ExprParser -Builds an expression tree from lexer output. If we build the expressions properly this will probably require minimal modification, only for new ops

ExprParserTokenTypes.txt -Defines strings to be regarded as tokens. Will need to add our new ops and type to this list.

Will we need to have an input method in XML for comparing arrays (i.e. a filter predicate that could ask if an array is 3x3 and filled with ones)?

Expr.g -Add support for dot, [], etc. I am leaning against making arrays declarable in our initial pass (as in the above example).

ExprUtil -Support functions

ExprWalker -This generates and evaluates(?) expressions from SQL-style queries. Do boxes translate themselves into this in their respective classes? This will require no modification if everything is implemented in the standard fashion.

Expression -Base class that all expressions are built from. May want to add a hook to this base class to differentiate between array and scalar contents for memory management purposes. Will need to implement is_impl and evalImpl methods for our standard functions.

EvalContext -Tells the expression which tuples it refers to

MathFunctions -Implementations of various templated math ops. Will need TypedArray versions of these. These will be wrappers for the equivalent Blitz library calls.

NArgs - This enforces that an expression has the correct number of arguments. It also has the base class for Function. The Function factories will need to be modified to create array funcs.

SQL* - Manage and access tables of expressions. This will require minimal to no intervention for arrays.

Arrays should probably be like TypedExpression<>.

StandardAggregate -Window management and other standard functions. Should not be impacted by arrays

StandardFunctions -Generic functions such as min/max/strlen/int. Mostly helpers. Will need to add a few of our own here.

modules/common/DataType - Will need to add our type(s) to this. Is there any convenient way to make this templatable (i.e. Array<int>[3][3][3])? It does not look like it at first glance. This will mean the from-string method will not have the correct length.



Boxes

Qbox - Constructs streams and matches them with schemas. Will need to interface with the catalog for type checking and validation of streams

Will need to deal with the Tuple creation to allocate arrays.

Use join as an example b/c it requires allocation to create a new tuple of a different type

Basic Design Ideas



Basic Data Flow

Basic Construction



I need to start thinking about finding a camera or something for an initial data stream.



Push vs. Pull

Pull

Push

+simple hand-off, easier traceability

+seems more intuitive from a stream processing perspective

-may require more up front setup

-puts more burden on the node



Table vs. Array Part of Tuple (2 pipes vs. 1)

Table

+Minimizes duplication

+encapsulation of memory management / optimization techniques

+ may be more extensible b/c we are designing from the ground up

-more initial time / planning investment

-may break previous structures



Tuples with Arrays



Do scaling or delta image work for original demo

Proof out xml this week

Non-uniform scaling requires many conditionals, both for the boundaries of the image and for scaling up versus down. Implementing this with traditional Borealis boxes would therefore require a large number of boxes. Potential implementation strategies:

Tons and Tons of Boxes
+ Conforms to standard DB operators
+ Will flex the majority of our new capability
- Ugly to code
- harder to debug

Implementing Scale as an arithmetic operator
+ simpler design
+ code essentially already done
- Seems like a hack
- Will not be a good showcase for SPE

Making a single custom box
+ Easier optimization
- Is this splitting hairs with the arithmetic op one?

Hybrid Solution: Specialized boxes such as vertical_scale_up, with traditional filters to route them to the right boxes
+ still shows stream processing / db ops
+ Simplifies out boundary conditions and allows faux statefulness (i.e. calculating ratio of old size to new size)
- Will need to implement custom boxes





Do we want to have each query chain have its own copy of the array to operate on in place or is it preferable for them all to share one copy?

- Going with in-place stream operations right now.

Does it make sense to break the arrays up in cases like filter that have partial matches? How can we handle these break ups? Designate output by indices or ? Do images make sense in arbitrary partial form? Might it make more sense that if we have a match to the predicate we return a sub-array?



Box design

Needs to:

How does a box get its XML schema?

Will our designed loading mechanism have any major pitfalls w/the current box implementation?

Does it make more sense to create new array boxes or can the current ones be extended to accept these Blitz pointers?

Blitz may not work b/c it needs to know dimensions at compile time. Unfortunately boost is no better on that front. (May require custom expression evaluator regardless of partial eval after all).

Can use *=, +=, etc. to avoid unnecessary copies in Blitz?





Boxes

All: need to adapt to looking up pointers to arrays.

Should boxes have a way of indicating to the ArrayAcceptor/Manager how many tuples in a stream it may need?

Filter: When acting on an array it will pass on cells that match and set the rest of the cells to null in its out stream

Map: Apply transformations in place to matrices and send to out stream

Union: Will probably not change much

Bsort: Will do comparisons and emissions in place. Need to make sure that table keeps n instances of the array.

Aggregate: The way we handle this depends on both the aggregating operation as well as the window. For min(..), max(..) we can just hang on to the appropriate answer and compare it with the rest of the inputs. For average, we will need to hang on to <window size> number of arrays. Is there a better way to do this? Are there other operators that we should implement for aggregate?

Join: Will hopefully not have to buffer too many arrays at once, may want to artificially limit number or work out caching scheme if it is too many. Will have to do the same with aggregate. Caching scheme may have been required anyway (as opposed to having all the tuples in memory at once).

Resample: This will use the solution to the problems of aggregate operators

Custom Boxes: What will we need for input information? What are the requirements of custom box coding now?

Expression Evaluation



Model for Generic Expression Evaluation (we may modify/optimize for specific apps)

The expression manager needs to

Who is going to primarily be using this library:

Needs to be fairly generic; specific apps will build on basic arithmetic and comparison ops; few to no optimizations will be transformations

Reflection Libraries

Reflex

Reflection C++

Do we need to rewrite the whole thing or just the FieldExpressions, acceptable template types, etc.?

Do we need to rewrite it all so that expression components are more accessible?

Our current expression model is extremely obfuscated.

We do need to retain the context model and perhaps expression walker.

Major modules:

ExpressionParser – takes in string, produces tree form of expression (as well as perhaps other access methods)

ExpressionEvaluator – evaluates expressions when a tuple comes in that is of the relevant stream

Expression itself will need to at least add functions for concatenation, append, etc. as well as defining common math ops and iterator functions

May want to allow binding operator, although I am not sure how this will fit into our current XML paradigm

Should we make the context more loosely coupled with the expressions themselves?

If an entire expression is regarded as one object how should it be accessed?



Proposal Notes

coordinate info - int32 (a coordinate must be declared like a schema; coordinates are loaded into a table as declared in the XML). Coordinate systems exist in categories/classes. A coordinate system declaration should include the coordinate system type (i.e. GPS, or X, Y, Z for airplanes) and two functions: one to transform it to the user-defined generic format of its coordinate type, and the other to convert data of that type to its format.



Wire Format

Header Info:

int n = num_dimensions

enum base_type (can be int, float, etc.)

int dim[n] the size of each dimension

int32 coordinate_system;

bool partition

bool dense

bool regular_stride

float stride_size[n]

partition dimensions[n]

partition number - tells us which partition we are dealing with in row-major order

float sensor_location[3]

Dense Array, Regular Stride Representation:

one element after another in row-major form; 0/null for blank values

Sparse Array (stride does not matter, n-dimensional array):

n floats, one for each dimension coordinate followed by the cell value; repeated in row major form for each value in the array

Dense Array; irregular stride

Each cell in row-major format prefaced with its coordinate

Tuples within Borealis

When arrays arrive on the server/Borealis side, a module called an ArrayAcceptor will intercept them and convert the serialized data into a Blitz++ object. It will then wrap it in a smart pointer, which will be put into any tuple referring to it on that host. The tuple itself will contain the header information structure. Arrays will be operated on in place by each box wherever possible. The only time an array should need to be copied is when a stream diverges down two different paths that will both modify it. The majority of the modifications between this and our current version of Borealis will revolve around memory management and rewriting most or all of our expression library.

Expression Wildcarding

When arrays are compared in expressions or predicates they do not have to match the dimension of what they are being compared to. They do need to correspond right to left though. For example, an int[640][480][3] can be compared to an int[3] or int[480][3] but not int[640]. Arrays can also be compared to scalars on a cell by cell basis.

Box Semantics

Aggregate – will accept a window of tuples, with window length defined by either tuple count or window size. Slide will be defined similarly. Windowing will be managed by dynamic reference counting on each array object. Each time an array is added to a window its reference count is increased (just as it is when it is accessed by any non-windowing box). When it is removed from the window the reference count is decremented. When the reference count goes to zero the array's destructor is automatically called. Aggregate will continue to operate with the standard functions it has now (min, max, average, etc.). The results will be a composite result based on each cell (i.e. a min over a window of 640x480 images will return the minimum value for each cell over all of the frames in that window).

Join - Joins will have a windowing scheme similar to aggregates. If we are joining two arrays we will create a new one with an extra dimension with pointers to the previous two arrays.

Filter – Filter will compare arrays or scalars. In the case of a superset being compared to a subset (i.e. int[640][480][3] < int[3]) filter will iterate through and null out values that do not match the predicate.

Map - Map will apply operators found in our rewritten expression library. It will not differ much from our previous implementation.

Concatenate – This will basically be a special case of join. It will check that the two incoming arrays have compatible dimensions and throw an exception if they do not.

Locate(Array1, Array2, function, slide[n]) – The box will first check that the two arrays are roughly compatible. The function will need to define the size of its chunks in an XML schema before it can be passed into a locate box. This does not necessarily have to be the size of Array2; the function may scale or rotate the feature, etc. Like our windowing scheme, this may benefit from defining both the sub-array sizes that are passed into the function and the slide/overlap (in case the feature straddles two blocks). Slide tells us how much to move in each dimension of Array1 (Array1 is n-dimensional). After all of this is established, Locate will iterate through Array1 in row-major order, advancing one slide unit each iteration. It will collect the function returns as an array, which will be its output. Each function return will be in the form of one array of size 2n. The first n values will be the lower left-hand corner of the location of the found feature. The second n values will be the size of the match in each dimension.

Pivot(array, attribute_list_1, attribute_list_2) – Pivot will substitute the contents of the dimensions specified in the first list with the values in the second list. This will operate on the array in place.

Regrid – This seems like it will be a special case of Map, with more extensibility regarding dimensions. Or should map support the in and out array having different dimensions?



Schema Examples

A video frame and scaling vector:

<schema name="frame_and_scale">

<field name="time" type="int32"/>

<field name="frame" type="int[800][600][3]"/>

<field name="vector" type="float[2]"/>

</schema>

Marshal would translate this to a header of:

struct frame_and_scale_frameField_header
{
    int num_dimensions = 3;
    enum base_type = int32; // may have to make this more flexible for user-defined types later
    int dim[n] = (800, 600, 3);
    int32 coordinate_system = 0;
    bool partition = 0;
    bool dense = 1;
    bool regular_stride = 1;
    float stride_size[n] = (1, 1, 1);
    partition_dimensions[n] = (0, 0, 0);
    partition_number = 0;
    float sensor_location[3] = (0, 0, 0);
}

When the client sends a tuple it will pass in a basic C++ array (i.e. int[800][600][3]); Blitz is not necessary on this side of the equation. The client side will flatten the array and send it over the wire. The ArrayAcceptor will de-serialize it, etc. The rest of this is described above.


Basic Architecture Layout



9/29 - Brad added a 1-bit flag to the tuple header to designate whether or not it contains arrays


FilterArray Add – created it in the src/modules/queryprocessor/boxes directory, conformed to standard form

Modified Files