Programmable Systolic Arrays

Richard Paul Hughey
Ph.D. Dissertation

Department of Computer Science
Brown University
Providence, Rhode Island 02912

CS-91-34
May 1991
Programmable Systolic Arrays

Richard Paul Hughey

B. A., Swarthmore College, 1985
B. S., Swarthmore College, 1985
Sr. M., Brown University, 1987

Thesis
Submitted in partial fulfillment of the requirements for the
Degree of Doctor of Philosophy in the Department of Computer Science
at Brown University.

May, 1991
© Copyright 1991
by
Richard Paul Hughey
Vita

RICHARD Hughey was born in Philadelphia, Pennsylvania, on July 31, 1963. He worked toward his high school diploma in Swarthmore, Pennsylvania, Ridgewood, New Jersey, Beverly Hills, California, and Frankfurt, Germany, and it was awarded to him by the Village of Ridgewood in 1981. He received the B. A. degree in mathematics with distinction and the B. S. degree in engineering with distinction from Swarthmore College (Swarthmore, Pennsylvania) in 1985. He received the Sc. M. degree in computer science from Brown University (Providence, Rhode Island and Providence Plantations) in 1987. His research interests include VLSI architecture and design, parallel computing, parallel programming, and the applications of computer science to computational biology. He is a member of the IEEE and the IEEE Computer Society, and was inducted by Phi Beta Kappa, Sigma Xi, and Tau Beta Pi as an undergraduate. He can be reached by electronic mail at rph@cs.brown.edu.
Abstract

SYSTOLIC arrays can solve computationally intensive problems many times faster than traditional computers or supercomputers. Because VLSI systems require many months to design, simulate, fabricate, and test, research has turned to programmable systolic arrays. This thesis introduces a general, programmable systolic array architecture: the Systolic Shared Register (SSR) architecture. The Systolic Shared Register architecture preserves the simple communication of single-purpose systolic arrays while providing a fully programmable systolic co-processor.

To test the SSR architecture, I designed, simulated, and had fabricated the Brown Systolic Array. B-SYS is an 8-bit SSR machine: each 16-element register bank in the linear array is shared between two functional units, providing simple and efficient systolic communication. The 85,000-transistor chip worked on first fabrication and a 10-chip, 470-element prototype array performs sequence comparison over 80 times faster than its Intel 80386 host. A custom-designed board could magnify this to over 600 times faster for a 10-chip co-processor. A 32-chip B-SYS system could process over three billion 8-bit operations every second.

Although the SSR paradigm stands by itself, it is instructive to simultaneously consider programming methods and applications. This thesis describes the principle of software fault detection for the automatic generation of fault-tolerant programs, an efficient alternative to rigid hardware methods. Additionally, this thesis introduces the New Systolic Language; NSL is a programming language for systolic co-processors based on data stream computation. With the aid of NSL, several systolic applications are examined in detail, in particular sequence comparison problems from the Human Genome Project. A comparison of B-SYS to existing parallel machines reveals its unequivocal superiority in its targeted area.
Acknowledgments

THIS thesis, like any other, owes a great debt to a considerable number of people. First, I thank my advisor Daniel Lopresti. He prepared the stage for this work with a single-purpose sequence comparison chip, and his comments and criticisms have been a constant influence on this work. I also thank my committee, Steve Reiss and John Savage, whose careful reading greatly improved this thesis. Ken Zadeck provided important critical advice early in the project.

The discussions of computational biology applications have been greatly enhanced by the efforts of Steve Altschul and Tom White to impart knowledge to an ignorant computer scientist.

While at Brown, I have had the pleasure of knowing many fine people, in particular a large crowd of card players always eager for a break (Mark Boddy, Ken Basye, Glenn Caroll, Eugene Charniak, Tony Davis, M6l Leifer, Paul Markowitz, Gail Mitchell, Maury Neburg, Gene Santos, Solomon Shimony, and Lynn Stein Melnick). A few people who resisted the influences of Bridge and "Oh Roll" include Cheryl Harkness, Tim Johnson, Scott Meyers, Rob Ravenscroft (who, in addition to being an excellent brewer, proved eager to show that all problems areTEX-reducible), Kate Sanders, and Cathy Chevon.

Several VLSI classes endured assignments on various versions of the Brown Systolic Array simulator. I thank Glenn Caroll, Paul Howard, Gail Mitchell, Marion Nodine, and Elizabeth Schriver for their comments.

My pursuit of an advanced degree was influenced and encouraged by four undergraduate professors: David Bowler who introduced me to electronics and digital logic, Charles Ginestad who introduced me to mathematics and academic research, and Charles Kelemen and Steve Platt who introduced me to computer science.

I would like to extend a special thanks to Todd Karalskashian and Karen Ohl, for reasons innumerable.

Finally, I enthusiastically thank my sisters Alice and Barbara, and my parents Ginnie and Joe. Without their support and love throughout my life, this thesis would never have been possible.
8 Conclusions 143
A B-SIM User's Guide 147
B B-SYS User's Guide 149
C Sequence Comparison on the Connection Machine 151
Bibliography 155
List of Tables

1.1 The Genetic Code in RNA form. ........................................ 3
2.1 Systolic algorithms. ...................................................... 14
2.2 Systolic building blocks. ............................................... 23
2.3 Systolic systems. .......................................................... 24
4.1 Carry lookahead addition. .............................................. 51
4.2 B-SYS result block control words. ................................. 53
4.3 B-SYS generate and propagate control words. .................... 53
4.4 B-SYS pin names. .......................................................... 61
5.1 B-SYS chip status. ......................................................... 78
5.2 B-SYS prototype board key. ............................................ 80
5.3 B-SYS prototype I/O addresses. ....................................... 81
5.4 B-SYS sequence comparison timings. ............................... 82
5.5 Comparison of single fault detection methods. ................... 90
6.1 Static Config class members. .......................................... 90
6.2 NSL operators and functions. ......................................... 105
6.3 SStream member functions. ............................................. 109
7.1 Differences modulo 256. .............................................. 114
7.2 B-SYS cell program lengths for various applications .......... 140
C.1 CM-3 sequence comparison. ......................................... 152
List of Figures

2.1 Systolic insertion sort ........................................ 10
2.2 Systolic implementation of Horner’s method ............... 11
2.3 Comparison of AAC with AGCA ................................ 13
2.4 P- NAC architecture ........................................... 17
2.5 Connection Machine processor architecture ................ 19
2.6 Warp data path .................................................. 20
2.7 Massively Parallel Processor architecture .................. 21
2.8 SDEF program for Horner’s method .......................... 29
2.9 Heart’s program for Horner’s method ....................... 30
2.10 Occam program for Horner’s method ....................... 32
2.11 W2 cell program for Horner’s method ..................... 33
2.12 Paris program for Horner’s method ....................... 34

3.1 Linear SSR processor array .................................... 40
3.2 Systolic Shared Register networks ........................... 43
3.3 Methods of inter-processor communication .................. 44
3.4 Phased systolic sort ............................................ 47

4.1 Segment of the B-SYS array .................................... 50
4.2 B-SYS arithmetic logic unit .................................. 51
4.3 B-SYS instruction word ....................................... 52
4.4 B-SYS phased systolic sort ................................... 54
4.5 B-SIM TeX snapshot .......................................... 55
4.6 B-SIM Tango animation ....................................... 56
4.7 Example CMOS circuits ...................................... 57
4.8 CMOS layer key ............................................... 58
4.9 Floor-plan of the Brown Systolic Array ..................... 60
4.10 B-SYS pin grid array ....................................... 61
4.11 B-SYS clocks and diagnostic signals ...................... 62
4.12 Block diagram of functional unit and register bank .......... 63
4.13 RAM cell and layout ........................................ 64
4.14 Propagate logic block ....................................... 65
4.15 8 logic block and operand latches ........................ 66
4.16 One-bit carry block ......................................... 66
7.15 Uniform comparison network for phenylalanine—leucine. 131
7.16 Wildcard comparison network for phenylalanine—leucine. 131
7.17 NSL amino acid encoding comparison. 132
7.18 NSL protein alignment cell program. 134
7.19 NSL protein alignment main program. 135
7.20 Systolic transitive closure cell. 137
7.21 Transitive closure cell program. 137
7.22 Transitive closure main program. 138
7.23 Transitive closure I/O functions. 139

A.1 The Brown Systolic Array simulator. 148

C.1 CM sequence comparison driving routine. 153
C.2 CM sequence comparison cell program. 154
Chapter 1

Introduction

The systolic array is perhaps the most significant architectural development inspired by the revolution in very large scale integration (VLSI) fabrication technology. Systolic co-processors exhibit tremendous performance due to the ease with which processing elements may be replicated hundreds or thousands of times on a single VLSI chip. Since the introduction of the systolic paradigm, systolic co-processors have been proposed to solve a wide variety of problems including speech recognition, sorting and searching, image processing, and various dynamic programming algorithms. Unfortunately, the majority of these proposals assume special- or single-purpose processing elements. Each new VLSI building block requires many months to design, simulate, fabricate, and test. Because of this great effort required to construct new arrays, research has turned to programmable systolic arrays. This thesis presents a general, programmable systolic array architecture: the Systolic Shared Register (SSR) design. The Systolic Shared Register architecture preserves the simple communication of single-purpose systolic arrays while providing the user with a fully programmable systolic co-processor.

Programmable hardware co-processors cannot, of course, exist in a vacuum. Although the SSR paradigm could stand by itself, it is much more instructive to simultaneously consider two other aspects of co-processor design: algorithms and programming languages. Any machine constructed without a clear notion of the types of algorithms it is meant to solve will be poorly designed. Once target algorithms are considered, the problem of programming the system must also be investigated.

The majority of this thesis describes the design, implementation, and use of the Brown Systolic Array (B-SYS), a working VLSI implementation of the SSR architecture. B-SYS was specifically designed for the algorithmic needs of molecular biologists but can efficiently execute many other systolic algorithms. This thesis will discuss the types of algorithms that are well suited to the Brown Systolic Array and present a general framework for programming any systolic co-processor.

The remainder of this introduction will first present some general background information — a brief overview of systolic arrays and their applications. Next, the major contributions of this thesis are summarized, and finally the organization of this work is presented.
1.1 Systolic Arrays

Systolic arrays were originally proposed as a class of parallel architectures by H. T. Kung and C. E. Leiserson in 1978. The utility of the systolic principle rests in its formalism for mapping algorithms to large processor arrays. Instead of a labyrinth of individually programmed processing elements, the typical special-purpose systolic array consists of one or two types of cells connected in a regular network, such as a line, ring, square mesh, or hexagonal mesh.

Systolic algorithms are designed with the knowledge that individual processing elements are unable, without great cost, to send information to arbitrary or distant locations. In a planar VLSI process, multiprocessors requiring arbitrary communication will use the vast majority of chip area for routing. The point of a systolic implementation is to put as much processing power as possible in as little space as possible. Thus, in the words of Foster and Kung, a systolic algorithm will have the following properties:

- The algorithm can be implemented by only a few different types of simple cells.
- The algorithm's data and control flow is simple and regular, so that cells can be connected by a network with local and regular interconnections. Long distance and irregular communication is thus minimized.
- The algorithm uses extensive pipelining and multiprocessing. Typically, several data streams move at constant velocity over fixed paths in the network, interacting at cells where they meet. In this way a large number of cells are active at one time so that the computation speed can keep up with the data rate.

As a result of these principles, data is written to and read from the processor array only at array boundaries, while computation takes place throughout the array. For example, data might enter on one side and results might exit on the other. The description of a systolic algorithm, therefore, must specify the operation of the systolic cells in addition to the data movement between the cells. Systolic arrays can efficiently manage large quantities of data without the potential for bus contention or network congestion.

1.2 Applications

Systolic arrays were originally applied to digital signal processing, and now linear arrays for convolution and other types of digital filtering and transformation abound. Often, such systems require only a small number of processing elements (for example, a finite impulse response (FIR) filter requires \( K \) processing elements, not some number proportional to the length of the input stream) so that small systolic pipelines can be formed.

Various types of mesh architecture are used for the 2-dimensional image processing versions of the basic digital signal processing filters and transformations. Many matrix operations can also be performed on mesh systolic arrays.

Combinatorial problems, those requiring only integers and simple operations, present the largest achievable gains with systolic processing. Since such problems only require
<table>
<thead>
<tr>
<th>UUU</th>
<th>Phenylalanine</th>
<th>UCU</th>
<th>Serine</th>
<th>UAU</th>
<th>Tyrosine</th>
<th>UGU</th>
<th>Cysteine</th>
</tr>
</thead>
<tbody>
<tr>
<td>UUC</td>
<td>Leucine</td>
<td>UCC</td>
<td></td>
<td>UAC</td>
<td></td>
<td>UGC</td>
<td></td>
</tr>
<tr>
<td>UUA</td>
<td></td>
<td>UCA</td>
<td></td>
<td>UAG</td>
<td>Stop</td>
<td>UGA</td>
<td>Stop</td>
</tr>
<tr>
<td>UUG</td>
<td></td>
<td>UGA</td>
<td></td>
<td>UGG</td>
<td>Tryptophan</td>
<td></td>
<td></td>
</tr>
<tr>
<td>CUU</td>
<td></td>
<td>CUC</td>
<td>Proline</td>
<td>CAU</td>
<td>Histidine</td>
<td>CGU</td>
<td>Arginine</td>
</tr>
<tr>
<td>CUC</td>
<td></td>
<td>CCC</td>
<td></td>
<td>CAC</td>
<td></td>
<td>CGC</td>
<td></td>
</tr>
<tr>
<td>CGA</td>
<td></td>
<td>CCA</td>
<td></td>
<td>CAA</td>
<td>Glutamine</td>
<td>CGA</td>
<td></td>
</tr>
<tr>
<td>CUG</td>
<td></td>
<td>CCG</td>
<td></td>
<td>CAG</td>
<td></td>
<td>CGG</td>
<td></td>
</tr>
<tr>
<td>AUU</td>
<td>Isoleucine</td>
<td>AGU</td>
<td>Threonine</td>
<td>AUA</td>
<td>Asparagine</td>
<td>AGU</td>
<td>Serine</td>
</tr>
<tr>
<td>AUC</td>
<td></td>
<td>ACC</td>
<td></td>
<td>AAC</td>
<td></td>
<td>AGC</td>
<td></td>
</tr>
<tr>
<td>ATA</td>
<td></td>
<td>ACA</td>
<td></td>
<td>AAA</td>
<td>Lysine</td>
<td>AGA</td>
<td>Arginine</td>
</tr>
<tr>
<td>AUG</td>
<td>Methionine</td>
<td>ACG</td>
<td></td>
<td>AAG</td>
<td></td>
<td>AGG</td>
<td></td>
</tr>
<tr>
<td>GUU</td>
<td>Valine</td>
<td>GCU</td>
<td>Alanine</td>
<td>GAU</td>
<td>Aspartic acid</td>
<td>GGU</td>
<td>Glycine</td>
</tr>
<tr>
<td>GUC</td>
<td></td>
<td>GCG</td>
<td></td>
<td>GAC</td>
<td></td>
<td>GGC</td>
<td></td>
</tr>
<tr>
<td>GUA</td>
<td></td>
<td>GCA</td>
<td></td>
<td>GAA</td>
<td>Glutamic acid</td>
<td>GGA</td>
<td></td>
</tr>
<tr>
<td>GUG</td>
<td></td>
<td>GCG</td>
<td></td>
<td>GAG</td>
<td></td>
<td>GGG</td>
<td></td>
</tr>
</tbody>
</table>

Table 1.1: The Genetic Code in RNA form.

Simple operations, small processing elements are sufficient. Only small processing elements can form the largest, most parallel machines. Many combinatorial problems (including some image processing applications) can efficiently use hundreds of thousands or even millions of processing elements, a complexity not achievable with complicated processors or routing schemes.

The strongest influence on the design of the Brown Synaptic Array has been the combinatorial problems of the Human Genome Project, biology's first "big science" project. The project's goal is to determine the exact structure and function of human DNA (deoxyribonucleic acid). Nuclear acids such as DNA are macromolecules, long sequences of nucleotides, that encode an organism's genetic data. The DNA located in a cell encodes the structure of an organism in thousands or millions of nucleotides — three billion in the case of Homo sapiens. Ribonucleic acid (RNA), which can serve as an intermediate step in the production of proteins from the DNA (in which case it is called messenger RNA), contains up to a few thousand nucleotides. The four nucleotides composing a DNA molecule are: adenine (A), guanine (G), cytosine (C), and thymine (T). Uracil (U) is substituted for thymine in the case of RNA. In computer science terms, a DNA molecule is a string of characters from the four-character alphabet Σ = {A, G, C, T}; while for RNA, Σ = {A, G, C, U}.

A nucleic acid chain may encode many amino acid sequences (proteins). Each three nucleotides, a codon, specify which amino acid should be added to the amino acid chain. Each protein is generated from the DNA with a complex chemical decryption algorithm involving the messenger RNA. The codes for each amino acid in terms of the RNA nucleotides are shown in Table 1.1; protein specifications average 1000 codons in length.

*Although all humans are different, these differences are the result of slight variations in the DNA.*
or, equivalently, proteins are chains of around 1000 amino acids.

Nucleic and amino acid sequence data are useless without tools for their analysis. Perhaps the most commonly desired analysis is the determination of the similarity between two pieces of DNA, either to locate where they overlap (so that partial experimental data may be joined) or where they differ (to find, for example, the region responsible for a genetic disorder). Computational biologists have several metrics for comparison which involve different costs between nucleotides as well as the possible use of affine costs or gap penalties to indicate a cost for the deletion of large sections of genetic information. Similar analysis can be performed over the 20-character alphabet of amino acids to determine evolutionary distances between proteins. 15, 16, 18, 136, 154

Exact sequence comparison (with or without gaps) is an \( O(n^2) \) algorithm: on a sequential machine, time proportional to the square of the sequence length is required to solve the problem. 6 On the other hand, a linear systolic array with \( n \) processing elements can solve the problem in time directly proportional to the length of the input sequence. A full-dedged B-SYS system with five hundred processing elements could perform the sequence comparison problem over 500 times faster than its host. Projecting this to a 10,000-element sequence, not uncommon in the world of computational biology, the uniprocessor host (such as an Intel 80386) would require approximately 10 minutes, while a 212-chip B-SYS system (a more ambitious B-SYS implementation using current technology would require only 20 chips) would only need 0.06 seconds, about the time it takes to copy a database string from disk to memory. This factor of 10,000 speedup is crucial for the database applications of linear systolic arrays: B-SYS is envisioned as a filter which can quickly select a small number of interesting sequences from an entire database of known genetic information. The GenBank database (release 66.0, December 15, 1990) contains 51,966,092 bases (nucleotides) from 41,057 entries, while the PIR International Protein Sequence Database (release 27.0, December 31, 1990) contains 7,620,688 residues (amino acids) in 26,798 entries. Assuming sequence lengths of 1240 bases, an inexpensive 32-chip B-SYS co-processor could complete an exact search of the entire Genbank database in about 2.24 days, while on an Intel 80386 this search would require over 10 years (1.3 years on a Cray-2). Researchers can often prune the search space to a smaller collection of sequences (e.g., those from specific organisms or chromosomes), in which case a B-SYS system can perform many more comparisons in a much shorter amount of time, enabling quick experimentation with different similarity metrics.

Several sequence comparison problems, as well as their solutions on the Brown Systolic Array, will be considered in Chapter 7.

The combination of simplicity and size makes combinatorial problems ideal for systolic solution. Since there are many different variations on these problems, truly useful hardware must be easily programmable. As with all hardware, there is also a desire to get vast computing power for a very low cost. Systolic Shared Register architectures,
and the Brown Systolic Array in particular, are a perfect co-processor solution for these problems.

1.3 Contributions

This thesis presents a system designed to solve the inadequacies of both systolic hardware and systolic programming languages for a wide variety of applications. A close examination of existing systolic hardware and software systems will be seen to yield the following observations:

- Single-purpose systolic machines are too inflexible to be, in general, useful.
- Many arrays of simple processing elements include complicated message-passing circuitry, unnecessary for systolic algorithms, which raises the cost and complexity of the machine.
- Regardless of message-passing hardware, most systolic systems have excessively complicated methods of interprocessor communication.
- Many of these systems have ultra-powerful processing elements with very large amounts of memory. Although useful for some problems, many simple systolic algorithms do not need these resources. Thus, the extra power is often a waste of space and money.
- Many programming languages used with systolic systolic systems have problems similar to those of systolic hardware: their use requires complicated and convoluted thought.

The Systolic Shared Register architecture and the New Systolic Language, the major conceptual contributions of this thesis, address these issues. To meld ideas and application, the use of the Brown Systolic Array is also investigated.

1.3.1 A Class of Programmable Systolic Architectures

The Systolic Shared Register architecture eliminates the complicated inter-processor communication present in many machines. Processing elements are split into functional units and register banks. Adjacent functional units can thus communicate and compute using the shared registers. The four characteristics of the general SSR architecture are:

- Broadcast instructions.
- Regular topologies.
- Systolic shared registers.
- A stream model of data flow.

As shall be seen, these attributes lead to simple and efficient systolic architectures for any topology or application.
1.3.2 A Systolic Array Implementation

Paper architectures are prevalent in the literature, but to gain a full appreciation of the Systolic Shared Register concept, a prototype system had to be built. The Brown Systolic Array, a linear SSR machine, was designed with the Magic VLSI design system developed at the University of California at Berkeley and extensively simulated with the Crystal timing analyzer and the COSMOS switch-level simulator. The 6.9 mm x 6.8 mm 2 µm Complementary Metal Oxide Silicon (CMOS) chip contains 85,000 transistors forming 47 processing elements. It was fabricated by the MOS implementation service (MOSIS), and the 94-pin chips were received in the late May of 1980. Of the twenty-four chips delivered, ten performed satisfactorily, and were combined to form a 470-processor array.

The prototype board for the 470-processor array was built for a personal computer with an Intel 80386 central processing unit (CPU). The resulting array is able to, for example, compare 470-character sequences over 80 times faster than the host. A custom-built board, such as that proposed in Section 5.2.4, could easily increase this to 600 or more times faster than the host using the same number of B-SYS chips.

The Brown Systolic Array successfully balances programmability, complexity, and cost to provide a powerful systolic co-processor for computational biology and other applications.

1.3.3 Software Fault Detection for Systolic Arrays

Any massively parallel machine must be resilient in the face of faults. When thousands or millions of processing elements are being used, a random fault at one processing element could corrupt an entire calculation. The simplistic method of fault detection is to run the entire computation twice and compare the two results. Although this method will often detect faults, it relays no information about where the fault occurred, allowing no array reconfiguration or software modification to avoid the faulty area. The solution to this problem is to check the computation periodically at many (or all) locations throughout array. Hardware fault detection methods rely on the duplication of memory and processing elements to mirror a computation. Unfortunately, these methods are fixed at implementation time: if more (or less) fault protection is desired, changes cannot be made. For example, many matrix operations can be made fault tolerant by adding checksum rows and columns. Ideally, when such operations are being performed the processing power previously allocated to fault detection should be reallocated to computation. This thesis presents the method of software fault detection, an ideal solution to the fault detection problem for programmable systolic arrays and SSR machines in particular.

1.3.4 Programming Methods for Systolic Arrays

Systolic programming languages generally fall into two categories: those which are abstract and those which are not. The former rely on mathematical mappings to define the systolic data movement and a separate program to specify the cell function. The latter
generally place a communications burden on the user: unnatural instructions must be used to move data between processing elements and deadlock is always a concern. Following from the idea of using data streams for systolic programming, the New Systolic Language (NSL) is introduced. NSL combines the simplicity of in-line control with the elegance of the abstract methods. The author has written a prototype NSL system in C++ to generate B-SYS code.

1.4 Overview of this Thesis

After this introduction follows a chapter analyzing the needs of a massively parallel yet inexpensive systolic array for combinatorial problems. Chapter 2 investigates previous work concerning all three parts of the systolic co-processor triad: algorithms, architecture, and programming languages. Section 2.1 further describes the concept of the systolic algorithm and considers an illustrative selection of systolic applications. The requirements of these applications guided the the Brown Systolic Array and New Systolic Language implementations. Section 2.2 examines several systolic and semi-systolic systems and then briefly summarizes a sampling of other hardware. The SSR architecture capitalizes on their features and avoids their drawbacks. Section 2.3 discusses and analyzes the current state of systolic programming, forming the basis for the New Systolic Language.

The class of Systolic Shared Register architectures is defined in Chapter 3. Chapter 4 details the B-SYS implementation: its architecture, design, and simulation. Chapter 5 considers the use of the B-SYS chips: fault detection and testing, board design, and prototype performance. Chapter 6 presents the NSL framework for systolic programming. Chapter 7 takes a close look at several applications on B-SYS, in particular several sequence comparison problems important to the Human Genome Project. A review of the B-SYS integrated systolic system concludes this thesis in Chapter 8.

Appendix A is a user’s guide to the B-SIM simulator of the Brown Systolic Array. Appendix B describes the use of the B-SYS prototype hardware. Appendix C considers sequence comparison on the Connection Machine.*

*Connection Machine and C* are registered trademarks of Thinking Machines Corporation. CM-2, CM, and Paris are trademarks of Thinking Machines Corporation.
Chapter 2

Related Work

This chapter explores the three arms of systolic system design: algorithms, architectures, and programming languages. Previous work in each of these areas is evaluated, setting the stage for the introduction of the Systolic Shared Register architecture and the New Systolic Language. This critical review is vital to a complete understanding of SSR architectures and the Brown Systolic array.

2.1 Systolic Algorithms

This section further examines the systolic paradigm, starting with a simple systolic sorting algorithm later used to illustrate programming techniques on SSR machines. Another canonical example, a more computational problem, is Horner’s method of polynomial evaluation. Programs for this problem in several systolic programming languages are discussed in Section 2.3. Sequence comparison, a dynamic programming algorithm of particular importance to the biological applications of the Brown Systolic Array, is also reviewed. Several variations on this fundamental problem, and their B-SYS programs, are analyzed in Chapter 7. Finally, a collection of systolic algorithms from computational geometry, cryptography, matrix algebra, and signal and image processing are summarized with an eye toward their respective hardware requirements.

2.1.1 Sorting

The simplest systolic sorting algorithm uses a linear array of processing elements. The function of each processing element is to read a number from the left, compare it with a stored value, and pass the smaller to the right. In this way, the smaller values will exit the array (from the rightmost cell) in sorted order. Using this algorithm, a type of parallel insertion sort, \( n + 1 \) numbers may be sorted on \( n \) processing elements in \( 3n + 1 \) steps. Each processor is assumed to have a local storage register and a scratch register. During each step, the processors will compare the number in the scratch register with the stored number and then store the larger of the two numbers and pass the smaller to the right. Several steps of this sorting algorithm are shown in Figure 2.1, wherein the top register is the local storage and the bottom is the scratch register. The array is
<table>
<thead>
<tr>
<th>Step</th>
<th>Array</th>
<th>Step</th>
<th>Array</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>2, 4 − 0, 0, 0, 0</td>
<td>5</td>
<td>4, 1, 2, 0, 0</td>
</tr>
<tr>
<td>1</td>
<td>3, 2 − 4, 0, 0, 0</td>
<td>6</td>
<td>0, 0, − 0, 4, 1, 0</td>
</tr>
<tr>
<td>2</td>
<td>1, 3 − 2, 0, 0, 0</td>
<td>7</td>
<td>0, 0, − 0, 0, 0, 3, 1</td>
</tr>
<tr>
<td>3</td>
<td>0, 1 − 3, 2, 0, 0</td>
<td>8</td>
<td>0, 0, − 0, 0, 4, 2</td>
</tr>
<tr>
<td>4</td>
<td>0, 0, − 1, 3, 0, 0</td>
<td>9</td>
<td>0, 0, − 0, 0, 0, 0, 3</td>
</tr>
</tbody>
</table>

The local storage register is enclosed in a dotted box. The lower scratch register is used for communication. The array is shown after each complete step, comprised of a comparison phase and a data movement phase.

Figure 2.1: Systolic insertion sort of 4, 2, 3, 1.

shown after the completion of each comparison and movement step. For example, during step 3 the values 4 and 2 are compared in the first processor. The value 2 is passed to the second while 4 is retained. The input value 3 is passed to the first processor at the same time it passed 2 to the second.

This example illustrates all the important principles of a systolic algorithm. Data moves through the array in a regular, synchronous fashion. All data movement is between adjacent processing elements. Finally, the basic operation of each cell is quite simple. For this reason, the replication of the basic cell to form a very large systolic array is easy using current technologies.

2.1.2 Polynomial Evaluation

The systolic implementation of Horner's method for evaluating a degree 4 polynomial illustrates the use of a systolic pipeline similar to that several digital signal processing algorithms. The \(d + 1\) coefficients are stored in the linear array of processors, while successive \(x\) values are passed through the array for evaluation. The polynomial evaluation is broken down into the familiar

\[
y_i = c_0 + \pi_i(c_1 + \pi_i(c_2 + \pi_i(c_3 + \pi_i(c_4 + \ldots )))).
\]

(2.1)
Thus, a polynomial of degree \( d \) may be efficiently evaluated with \( d + 1 \) multiplications and \( d + 1 \) additions. Equation 2.1 can be expressed with the following FOR loop program:

\[
\text{FOR } j = 1 \text{ TO } n \\
\text{FOR } i = d \text{ DOWNTO } 0 \\
x_i = x_i + \pi_j y_j,
\]

or recursively as:

\[
c_{i,j} = c_{i,j-1} \quad (2.2) \\
x_{i,j} = x_{i-1,j} \quad (2.3) \\
y_{i,j} = y_{i,j-1,j} + c_{i,j} \quad (2.4)
\]

The variables from Equation 2.1 have been expanded to be indexed by both loop indices. Note, however, that the \( c \) values remain invariant over changes in \( j \) and the same holds true for \( x \) over changes in \( i \). The initial conditions for this recurrence are:

\[
c_{i,0} = c_{d-i} \quad (2.5) \\
x_{i,0} = x_i \quad (2.6) \\
y_{i,0} = 0 \quad (2.7)
\]

where \( c_{d-j} \) and \( x_j \) refer to values in the original equation. The \( c \) values have been reversed for simplicity.

A number of processing elements equal to one more than the degree of the polynomial is used. This simple method is typical of many systolic programs: each processing element executes the same instructions (a multiply and an addition). The systolic algorithm is shown pictorially in Figure 2.2 and, as mentioned, programs for this algorithm are analyzed in Section 2.3.

2.1.3 Sequence Comparison

Sequence comparison, a primary application for the Brown Systolic Array, is the problem of determining the similarity of two (or more) sequences. The dynamic programming solution for basic sequence comparison problems is a combinatorial algorithm, involving only integers and simple operations. For general information on sequence comparison and its variations, the reader is referred to the literature \cite{1,136,14d}.

11
Let the two strings being compared be \( A = a_1a_2 \ldots a_n \) and \( B = b_1b_2 \ldots b_m \) with each \( a_i, b_j \in \Sigma \), where \( \Sigma \) is a finite alphabet. The distance between two strings \( A \) and \( B \) is denoted as \( \text{dist}(a_1 \ldots a_n, b_1 \ldots b_m) \), or \( d_{nm} \) for short, and refers to the minimum cost sequence of single-character edit operations (insert, delete, and transform) required to transform \( A \) to \( B \). For example, with an insertion and deletion cost of 1 and transformation cost of 2, the distance from the stringAACUG to ACCUGA is 3: the mutation of the second A to C, and the addition of the final A.

Dynamic programming can be used to calculate edit distances by the extension of partial homologies, or subsequence matchings. To calculate \( d_{i,j} \), the best of the three possibilities (add, delete, and transform \( a_i \) to \( b_j \)) is picked. The recurrence is:

\[
\begin{align*}
d_{0,0} &= 0 \quad (2.8) \\
d_{i,0} &= d_{i-1,0} + \text{dist}(a_i, \phi) \quad (2.9) \\
d_{0,j} &= d_{0,j-1} + \text{dist}(\phi, b_j) \quad (2.10) \\
d_{i,j} &= \min \left\{ d_{i-1,j-1} + \text{dist}(a_i, b_j) \right\} \quad (2.11)
\end{align*}
\]

The empty string and the null character are both denoted as \( \phi \) and it is convenient to assume that each string begins with a null character. In the initial conditions, \( d_{0,0} = \text{dist}(a_1 \ldots a_n, \phi) \) is the cost of deleting \( a_1 \ldots a_n \) to transform it to the empty string \( \phi \). The first case of Equation 2.11 is matching the character \( a_i \) to \( b_j \), and then adding the cost of matching the first \( i - 1 \) elements of \( A \) to the first \( j - 1 \) elements of \( B \). The second case is inserting a character into \( A \) to make \( A \) equal to \( B \), then matching the first \( i \) elements of \( A \) to the first \( j - 1 \) elements of \( B \) (in effect, the inserted character is matching \( b_j \)). The final case is the deletion of a character from \( A \) to make it equal to \( B \), and is symmetric to the previous case.

The dynamic programming solution to Equation 2.11 uses the string \( A \) to index the rows of a table and the string \( B \) to index the column. The elements of the table are then entered according to the definition of \( \alpha \) with the table entry \((i, j)\) corresponding to \( d_{i,j} \). The dynamic programming algorithm completes the table as the data dependencies allow (for example, \( d_{i,j-1} \) must be calculated before \( d_{i,j} \)). The reason this works is that if the minimum distance \( d_{nm} \) between the two strings matches \( a_i \) to \( b_j \), then the distance calculation must use the minimum matching of \( a_1 \ldots a_i \) to \( b_1 \ldots b_j \), for if not \( d_{nm} \) has not been calculated as a minimum.

The comparison of the two strings AAC and AGCA will result in the table of Figure 2.3 using the dynamic programming method. The cost of matching the two strings is the final (boxed) table entry, \( d_{3,4} = 3 \). Note that the cost of matching AAC to AGC can also be read from this matrix as \( d_{3,3} = 2 \). The actual sequence of edit operations used to make the "evolutionary change" may be kept track of as the matrix is calculated if such information is needed.

There are many different ways to map this computation to a systolic array. Perhaps the simplest method is to assign one processing element to each column of the dynamic

*Section 2.3 will discuss the problem of systolic mappings.
<table>
<thead>
<tr>
<th></th>
<th>A</th>
<th>G</th>
<th>C</th>
<th>A</th>
</tr>
</thead>
<tbody>
<tr>
<td>t</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>A</td>
<td>1 0 1 2 3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>A</td>
<td>2 1 2 3 4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>C</td>
<td>3 2 3 2 5</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Figure 2.3: Comparison of AAC with AGCA.

programming matrix (Figure 2.3). To initialize the computation, the first string is loaded into the array, one character per processing element. Then, the second string is shifted through the array along with weight values. Thus, $d_{ij}$ will be computed in processor $P_j$ at time $(i + j)$. The values necessary to complete this computation ($d_{i-1,j}$, $d_{i,j-1}$, and $d_{i-1,j-1}$) have already been calculated when it comes time to determine $d_{ij}$; $d_{i-1,j}$ and $d_{i,j-1}$ were calculated during the previous step in processors $P_j$ and $P_{j-1}$, respectively, and $d_{i-1,j-1}$ was calculated in $P_{j-1}$ two time steps ago. As can be seen, this data movement maintains the systolic property that data needed for computation in a processing element is computed near that processing element in both time and space.3

**Other Variations**

There are many variations on the sequence comparison problem, including sequence to subsequence comparison, longest common subsequence comparison, and the addition of affine gap penalties to the distance metric. For example, in the DNA case it is more likely that the elimination of 10 consecutive nucleotides is the result of a single event (a large cut) than the result of 10 independent events (each removing one of the nucleotides in the subchain). Judicious use of gap penalties will mirror this behavior in the cost function. Sankoff and Kniskel discuss algorithms for several other sequence comparison problems.108 A previous paper by this author considers the problem of comparing a nucleic acid to codings of an amino acid sequence using the genetic code.54

Considering the underlying encoding of the amino acids shown in Table 1.1 (page 3), some amino acid mutations (e.g., phenylalanine to leucine) are more likely than others (e.g., phenylalanine to arginine). Also, biologists consider the occurrence of some amino acids to be more important than others: in actual proteins, asparagine mutations approximately seven times more often that tryptophan. Thus, to detect significant alignments between proteins, a cost function dist(a,b) using different values for each possible mutation is required. The most common metric used is that proposed by Dayhoff.32,137

Without a programmable systolic array, the usefulness of different metrics and methods cannot be quickly evaluated.

Sequence comparison in all its varieties forms the basis of many data analysis problems involving biological sequences, speech processing, polymer analysis, and image processing. Several sequence comparison problems are considered in Chapter 7.

---

3This mapping will be covered in more detail in Section 7.1, where a B-SYS program for the computation is presented.
<table>
<thead>
<tr>
<th>Problem</th>
<th>BC</th>
<th>Data</th>
<th>CPU*</th>
<th>TP</th>
<th>CPUL</th>
<th>CPUL</th>
<th>CPUL</th>
<th>CPUL</th>
<th>N</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alternate Transcription</td>
<td>2</td>
<td>int</td>
<td>add</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Articulation Points</td>
<td>1</td>
<td>int</td>
<td>add</td>
<td>M</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Cartesian Points</td>
<td>1</td>
<td>any</td>
<td>cmp</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Channel Assignment</td>
<td>1</td>
<td>int</td>
<td>add</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Constrained Least Squares</td>
<td>2</td>
<td>float</td>
<td>div</td>
<td>T</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>CFD Paneling</td>
<td>1</td>
<td>char</td>
<td>cmp</td>
<td>M</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Convex Hull</td>
<td>1</td>
<td>int</td>
<td>add</td>
<td>M</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Convex Hull</td>
<td>1</td>
<td>float</td>
<td>add</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>1-D Convolution</td>
<td>1</td>
<td>float</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>2-D Convolution</td>
<td>1</td>
<td>any</td>
<td>mult</td>
<td>M</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Data Compression</td>
<td>1</td>
<td>char</td>
<td>cmp</td>
<td>L</td>
<td>k</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>1-D Decomposition</td>
<td>1</td>
<td>float</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>1-D FFT</td>
<td>1</td>
<td>float</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Distance Graph</td>
<td>1</td>
<td>int</td>
<td>add</td>
<td>H</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Distance Graph</td>
<td>1</td>
<td>int</td>
<td>add</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Inner Product</td>
<td>1</td>
<td>any</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>1-D FFT</td>
<td>2</td>
<td>float</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>FIR Filter</td>
<td>1</td>
<td>int</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>FIR Filter</td>
<td>2</td>
<td>int</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>Kalman Filtering</td>
<td>2</td>
<td>float</td>
<td>sqrt</td>
<td>M</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Long Multiplication</td>
<td>1</td>
<td>int</td>
<td>mult</td>
<td>L</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>LU Decomposition</td>
<td>2</td>
<td>any</td>
<td>recip</td>
<td>H</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
<tr>
<td>Matrix Inversion</td>
<td>4</td>
<td>any</td>
<td>mult</td>
<td>M</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>n</td>
<td>1</td>
</tr>
</tbody>
</table>

*Most complicated operation in algorithm.

*Topological: Linear, Mesh, Triangular mesh, Hexagonal mesh.

*Problem size, in terms of n.

*Sequential complexity of the obvious sequential algorithm, in terms of n.

*Parallel time, in terms of n.

*Output size, in terms of n.

*Amount of memory required per processor (order of magnitude) ≤ 10^6 words are required.

Table 2.1: Systolic algorithms.
2.1.4 Other Algorithms

Table 2.1 summarizes a selection of systolic algorithms. For each algorithm, several items are listed: the number of types of processing elements (equivalence classes), the type of data used (integers, characters, or floating-point numbers), the complexity of the required CPU, and the topology of the array needed to perform the algorithm. Several expressions relating to the serial and parallel complexity of each algorithm are also tabulated.

The ability to support every one of these algorithms would be a tremendous burden on the size of individual processing elements, and hence the possible amounts of parallelism in the final array. Examining the table, one notes that very few algorithms require more than one type of processing element, thus the broadcast of identical instructions to all elements of the array will be a reasonable method of control. Some algorithms with multiple equivalence classes, such as the alternate transcription problem for amino acids (Section 7.1.4), can be programmed with no instruction overhead because one cell program is a restriction of the other. The more complicated algorithms (constrained least squares, 1-D FFT, IIR, etc.) can still be executed by alternately turning on each equivalence class of processing elements to simulate multiple instruction streams.

On consideration of the presumed users of the Brown Systolic Array (in particular, computational biologists), complex arithmetic units may also be trimmed: the lack of floating-point arithmetic, division, and root extraction will eliminate the possibility of running several signal processing algorithms, however it will add an enormous amount of processing power to the final system since the chips will contain many more processing elements. Integer multiplication and division is not included for similar reasons, though of course products can be easily computed with addition.

Since B-SYS was initially designed as a programmable sequence comparison engine, a linear systolic array was the topology of choice. As can be seen from Table 2.1, a large number of applications run on linear systolic arrays, including some which seem inherently two-dimensional. For example, transitive closure, a type of matrix multiplication, can be performed on B-SYS. Although the algorithm (discussed in Section 7.2) provides only a factor of n speedup (versus the mesh algorithm's n²), it illustrates the versatility of the linear systolic array. Although in the ideal world one would have programmable systolic co-processors of all topologies and sizes, in the real world a single linear SSR machine will often be both practical and sufficient.

Finally, the majority of the algorithms shown do not require large amounts of memory; the wasteful feature may also be discarded.

Pruning the list of applications, the following are appropriate for the B-SYS implementation: alternate transcripion, channel assignment, data compression, distance graph, inner product, minimum spanning tree, priority queue, searching, sequence comparison, shuffles scheduling, sorting, stereo matching, and running order statistic. Several other applications can be executed on linear SSR machines like B-SYS, though not as efficiently because they require multiplication. These include: convex hull, 1-D convolution, 1-D deconvolution, inner product, 1-D FFT, FIR filter, IIR filter, log multiplication, matrix multiplication, nearest neighbor, and polygon intersection.

16
Although, as shall be seen, the Systolic Shared Register concept can be used with any systolic topology, B-SYS is a linear array which fills the niche of solving many combinatorial problems.

2.2 Systolic Hardware

The design of the Brown systolic array has been influenced by several systolic and semi-systolic architectures. This section examines a number of architectures which represent important milestones in the landscape of massively parallel machines. After considering a single-purpose machine, a massively parallel machines, and a powerful systolic multi-processor, the tapestry is filled out with a summary of over twenty machines. Many novel features present in these machines have been adapted to the B-SYS architecture, while several of the more cumbersome features have been avoided. The specifics of the Brown Systolic Array implementation follow directly from an examination of the attributes of both the machines in this section and the algorithms presented in the previous section.

2.2.1 The Princeton Nucleic Acid Comparator

The Princeton Nucleic Acid Comparator (P-NAC) was developed by Daniel Lopresti and is the direct predecessor of the Brown Systolic Array.\(^{109,110}\) P-NAC is designed to solve the sequence comparison problem over a four-character alphabet. Using the simple dynamic programming method of calculating edit distances (Section 2.13), P-NAC provides optimal asymptotic speedup by performing \(O(n^2)\) computations in \(O(n)\) time using \(O(n)\) processing elements. The processing elements are connected in a linear array,
and data is fed in from both ends of the array. Each processor is small — the 1985 nMOS implementation placed 30 processors on a 4.6 mm x 6.8 mm chip with a feature size of 4 μm (λ = 2) — and the resulting chip can compare nucleotide sequences quickly. For example, a 1990–processor array of P-NAC processors can compare two 1000-character strings in 130 ms. The architecture of the P-NAC basic cell is shown in Figure 2.4.

Unfortunately, P-NAC can solve only a single problem. Even slight variations to the sequence comparison problem, such as the use of different costs in the calculation of the edit distance, cannot be solved with P-NAC. Since building a new chip for each application is clearly inefficient, a programmable successor to P-NAC is needed.4

Because only one algorithm is ever executed on P-NAC, systolic communication is implemented with latches and fixed connections. A special-purpose implementation of the sorting algorithm of Section 2.1.1 would use a similar method of communication. A programmable systolic array should, if possible, include a method of systolic communication which is as simple as the intrinsic communication of these special-purpose architectures.

A programmable systolic array should preserve the vast parallelism easily attainable with P-NAC. Since nucleic acids such as DNA can be thousands of nucleotides long, thousands of processing elements can be used effectively to accelerate database searches for similar pairs of sequences. Without small and simple processors, such as those of the P-NAC system, the cost of such huge arrays would be prohibitive.

2.2.2 The Connection Machine

The Connection Machine features 64 K (65,536) single-bit processors and a hypercube routing scheme.182 Each Connection Machine processor (CM-2) can access 64 K bits of memory and a floating-point unit shared by thirty-two processing elements. The hypercube routing network provides arbitrary communication paths between processing elements. The four flag registers (see Figure 2.5) are available to neighboring and distant processing elements via the chip's router. The Connection Machine features "data level parallelism", which is to say that it is an SIMD (single instruction and multiple data stream) machine.

Although the Connection Machine supports non-systolic algorithms, it can also efficiently execute systolic algorithms intended for a square mesh or linear array of processing elements. Communication on the Connection Machine is, however, much less intuitive than the synchronous communication of single-purpose systolic arrays: the data must be explicitly moved through the array using the communication registers.

It might also be pointed out that bit-serial operations are slower than bit-parallel when data close to (but below) the computer's word size is being manipulated. In the

4Feature size and lambda-based design rules are defined in Section 4.3.1.

5The University of North Carolina at Chapel Hill is developing BioSCAN, a partially programmable biosequence comparison system based on a 2001 processor 0.4 μm CMOS chip. Although able to process different cost functions over a 30-character alphabet (the amino acid), it is unable to perform more complicated versions of sequence comparison such as those with insertions and deletions or gap penalties, and thus is still restricted in application.113 BioSCAN's algorithm is discussed in more detail in Section 7.1.
sequence comparison case, approximately ten times as many instructions are required on a bit-serial machine as on a bit-parallel machine with an eight-bit word. The use of bit-serial processing elements is partly the result of VLSI packaging restrictions: a chip with a large two-dimensional array of processing elements will likely not have enough pins to allow word-parallel input and output (I/O) at the chip boundaries. In fact, some chips use a smaller word size for chip I/O than for internal processing. This problem is not present in linear systolic arrays since they only communicate through the two end processors.

2.2.3 Warp

The Warp computer was developed at Carnegie Mellon University under the guidance of H. T. Kung, and is a powerful systolic multiprocessor. The present model, PC-Warp, consists of ten 10-MFLOPS (million floating-point operations per second) processors connected in a linear array. The array cells are independently programmable, each containing memory for 8 K instructions (thus, Warp is an MIMD, or multiple instruction and multiple data stream computer). Each Warp processing element features a 32-bit multiplier, a 32-bit adder, and 32 K words of local memory. Between neighboring processors there are three communication queues able to store 512 32-bit words each. The components of the processor are connected with a 32-bit crossbar switch. Thus, data may be placed on the queues directly from the units of CPU or from the processor’s local memory. Warp provides hardware control of the queues: a full or empty queue will cause the appropriate processor to wait until this condition changes. Figure 2.6 shows the data path of an individual Warp cell. A new version of Warp, called iWarp, is being developed jointly by Carnegie Mellon and the Intel Corporation and will have 64 processors and more memory per processor.

The Warp system can perform a large number of signal and image processing applications quickly. However, it is not especially suited for symbolic or simple computations.
on large amounts of data. Since the size of the array is limited by the size of the processing elements (one board each), Warp is unable to make full use of the vast parallelism inherent in many problems.

2.2.4 Other Hardware

A systolic building block is a chip (or board) which, when replicated, can be used to build powerful systolic arrays. As with complete systems, not all systolic building blocks have been made into arrays, but their architectures are of interest. This section briefly discusses several other systems which influenced the design of the Brown Systolic Array.

Splash

Splash, developed at the Supercomputer Research Center in Bowie, MD, can be best described as a firmware-programmable systolic array.\(^6\) It is a linear array of 32 Xilinx field-programmable gate arrays (FPGAs)\(^16\) with memory chips in between. Each FPGA has a 20×16 mesh of configurable one-bit logic blocks and 144 input/output blocks. The FPGAs also have sophisticated routing facilities for local and global connections. The FPGAs are used as programmable hardware: a program of connections between and functions for the logic blocks is preloaded into the array. As data is pumped through the array, it is processed according to the preloaded configuration. Simple applications, such as the basic sequence comparison problem on a 4-character alphabet, can be executed nearly as quickly as on comparable single-purpose hardware (see Section 7.1.2).
There are, however, two main problems with using the Splash system for general systolic algorithms. First, slight variations in algorithm require that the lengthy programming task be redone. Second, Splash implementations have a low processor density per FPGA. Splash is well suited for problems involving small words (2-bit words suffice for nucleotide sequence comparison costs), but performance degrades considerably as the word sizes increase. Of the algorithms which have been implemented on Splash, typically eight or sixteen systolic cells have been placed on each chip, a marked contrast to E-SYS's 47 cells per chip, especially when one notes that D-SYS does not push the limits of current VLSI technology as the Xilinx chips do. Still, for appropriate applications Splash performs very well and, in spite of programming difficulties, can be reconfigured for new applications or run faster than single-purpose VLSI chips can be fabricated.

Image Processing

Simple mesh architectures of bit-serial processors (i.e., the Connection Machine without its extensive routing) have been used for many image processing applications. The Massively Parallel Processor developed by Goodyear and NASA stands out as an early example (Figure 2.7). T. J. Fountain has surveyed several SIMD bit-serial image processing systems. Each of the eight machines surveyed has between 8 and 64 processors per chip. The NCR Geometric Arithmetic Parallel Processor (GAPP) also falls in this class of bit-serial mesh architectures.

Digital Signal Processing

Digital signal processing (DSP) generally requires a full range of floating-point operations. Among these more powerful systolic processors is the Programmable Systolic Chip.
The Instruction Systolic Array

The proposed Instruction Systolic Array (ISA) is a mesh of simple processing elements in which both data and control flow through the array. Instructions flow in one direction (north to south), and selectors (or mask bits) flow in the other (west to east). The ISA thus uses a clever solution to the problem of controlling \( n \) processing elements with only \( n \) control bit inputs each time step. The first proposal of the Brown Systolic Array considered the use of systolic instructions in a linear systolic array. However, the problems with this concept probably outweigh the benefits in a linear systolic array.

Hardware Summary

Table 2.2 summarizes information about the systolic building blocks described in this chapter as well as in the literature. For each building block, those figures of merit which could be gleaned from the reference are tabulated. As can be seen, the Brown Systolic Array (with its modest number of transistors) has a high number of bit-parallel processing elements on each chip: a tribute to the elegance of the Systolic Shared Register design.

Also, B-SYS has a relatively small amount of local memory and no off-chip memory. The advantages of this are speed and size: lengthy memory accesses are avoided and a higher processing element density can be achieved. Although only a small number of registers is used, it is sufficient for many of the applications presented in Table 2.1.

The length of the B-SYS word falls between the extremes (bit-serial and 32-bit floating-point). The clock speed is typical of many of the building blocks, and both the clock speed and the speed of individual processing elements could be improved with a new design. Floating-point arithmetic is, as mentioned, not needed for B-SYS’s target applications.

Table 2.3 tabulates information about complete systems: the total number of processing elements and memory, array topology, and system size. (Splash has not been included because of its variable number of systolic cells.) A full-sized B-SYS system with 1504 processing elements could provide a large computational power (3 or more billion operations per second) using very little space or, equivalently, at very low cost. Thus, B-SYS has one of the highest performance to cost ratio of the machines listed, a ratio higher than all general-purpose supercomputers (obviously, supercomputers can also crunch data on a wider range of problems than B-SYS).

It is important to remember that B-SYS, being a linear array, is eminently scalable: with nine bits of communication (plus clocks and possibly instructions), boards may be joined to form systems as large as desired, with only one or two boards requiring
<table>
<thead>
<tr>
<th>Chip</th>
<th>Year</th>
<th>Technology</th>
<th>Translators</th>
<th>PE per chip</th>
<th>Memory</th>
<th>Word (bits)</th>
<th>Clock (MHz)</th>
<th>Speed</th>
<th>Floating-point?</th>
<th>ALU</th>
<th>Control</th>
</tr>
</thead>
<tbody>
<tr>
<td>ICL, DAP**1,6</td>
<td>77</td>
<td>MSI</td>
<td>0.2</td>
<td>0 1 1 1</td>
<td>5.0MIPS</td>
<td>N Add/And</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>DTVF*3,7</td>
<td>81</td>
<td>3.5um CMOS</td>
<td>8 1 7 0 16 5</td>
<td>5.0MIPS</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>GLIP**2</td>
<td>82</td>
<td>5um CMOS</td>
<td>7 1 4 0 16(8)</td>
<td>5.2MOPS</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>OSG GRID**10</td>
<td>83</td>
<td>1.2um CMOS</td>
<td>50 64 32 8 1</td>
<td>10MIPS 10</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>MIP**15</td>
<td>83</td>
<td>HCMOS</td>
<td>800 8 36 1034 1</td>
<td>10</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>NTT AAP**14</td>
<td>83</td>
<td>2um CMOS</td>
<td>112 64 96 0 1</td>
<td>18MIPS 18</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>PSC**6</td>
<td>83</td>
<td>4um NMOS</td>
<td>26 1 54 0 8(9)</td>
<td>3</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>CM5**19</td>
<td>85</td>
<td>2um CMOS</td>
<td>10 16 8 4 1 4 4MIPS</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>GAP**17</td>
<td>85</td>
<td>CMOS</td>
<td>72 128 0 1 10 10MIPS</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>ICL, DAP,24</td>
<td>86</td>
<td>CMOS</td>
<td>3 16 0 7 1 0.6MIPS</td>
<td>N Add</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>Warp**27</td>
<td>85</td>
<td>TTL</td>
<td>3 16 0 4 32 5 10MIPS</td>
<td>Y Mult</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>IMS TMSO-30A*</td>
<td>86</td>
<td>CMOS</td>
<td>150 1 4096 1024 32 20 8MIPS</td>
<td>Y Mult</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>AIS-5000*14</td>
<td>87</td>
<td>Gate Array</td>
<td>8 0 32 1 7</td>
<td>OC</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>ChIP**22</td>
<td>87</td>
<td>2um CMOS</td>
<td>14 16 4 64 1 7</td>
<td>OC</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>ASAP T**5</td>
<td>88</td>
<td>1200 V 1 20</td>
<td>N</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
<td></td>
</tr>
<tr>
<td>APLIHC**4</td>
<td>89</td>
<td>CMOS</td>
<td>1 740 0 16 10 10MIPS</td>
<td>N Mult</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>PCWarp**27</td>
<td>87</td>
<td>TTL</td>
<td>134 0 32 32 8 16 10MIPS</td>
<td>Y Mult</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>B-SYS**13</td>
<td>89</td>
<td>1.5um CMOS</td>
<td>35 384 32 24</td>
<td>14 1.8MIPS</td>
<td>Y Mult</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
</tr>
<tr>
<td>Blitzem**17</td>
<td>90</td>
<td>1.3um CMOS</td>
<td>1100 128 1024 1 10 20</td>
<td>Y Mult</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>(Warp**26</td>
<td>90</td>
<td>1.2um CMOS</td>
<td>650 1 128 1600 32 20 20MIPS</td>
<td>Y PP, M</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>SLAP**7</td>
<td>90</td>
<td>2um CMOS</td>
<td>50 4 32 0 20 8</td>
<td>N Mult</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
<tr>
<td>LEA**2</td>
<td>90</td>
<td>CMOS</td>
<td>100 8 0 1 1</td>
<td>N</td>
<td>SIMD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>SIMD</td>
<td></td>
</tr>
</tbody>
</table>

*Addressing capability in K words.  *Off-chip.  *Data path is 4 bits wide, but registers are 32 bit.  Bit-serial communication to 8 nearest neighbors.  MOP time is for 32 bit instruction.

Table 2.2: Systolic building blocks (by date).
<table>
<thead>
<tr>
<th>Machine</th>
<th>Data Rate</th>
<th>Data Rate</th>
<th>Class</th>
<th>1.5M</th>
<th>10M</th>
<th>100M</th>
<th>1G</th>
<th>10G</th>
<th>100G</th>
<th>Size</th>
<th>Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td>WP27</td>
<td>89 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>PCCF</td>
<td>87 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SPARC</td>
<td>87 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>MCMAX-2</td>
<td>89 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Sun</td>
<td>89 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>WP15</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>CLIPPER</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>PCF</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AEG-SG6</td>
<td>87 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AEG-66</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AEG-66</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AEG-66</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AEG-66</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>AEG-66</td>
<td>86 10</td>
<td>1.5M</td>
<td>SIMD</td>
<td>Linear</td>
<td>11</td>
<td>Medium</td>
<td>10 MGFLOPS</td>
<td>BP</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 2.3: Syntactic systems (by size)
access to the system bus." This is a great contrast to mesh networks, where large numbers of communication lines must be used to join smaller meshes together, and hypercube systems for which the original communication protocols of the system can limit the dimension (size) of the hypercube.

The examples of this section, from P-RAC to Warp, represent three schools of systolic thought: the simple single-purpose array, the SIMD array of bit-serial processing elements, and the MIMD array of powerful processors. The single-purpose architectures lack versatility while the MIMD machines lack massive parallelism. All these types of architecture present a specialized and not entirely satisfactory method of implementing systolic communication.

2.3 Systolic Programming Languages\(^1\)

Having reviewed algorithms and architectures for systolic systems, it is now time to turn to the last part of the systolic co-processor triad: systolic programming languages. Although assembly language could certainly be used for all B-SYS applications, it is difficult and unwieldy for the casual user of the system. A close examination of several systolic programming languages will result in a list of criteria for the development of a new systolic programming language appropriate to the co-processor model of computation. The language should support different topologies, be part of a conventional language able to perform both host and co-processor functions and, perhaps most importantly, have an elegant method of programming different data flows.

In addition to motivating the introduction of the New Systolic Language (Chapter 6), this section continues to explore the systolic paradigm.

2.3.1 Language Types

In the torrent of papers on systolic solutions for divers applications, the twin problems of systolic programming and generalized systolic design have received somewhat less attention. The problem of systolic programming refers to building languages or subroutine libraries for general-purpose systolic processors or multiprocessors, providing methods to program an existing machine. On the other hand, generalized systolic design systems aid in mapping an algorithm to the systolic domain. Using such tools, various interconnection networks and data flows may be explored as a precursor to building or programming a systolic array. These systems are especially useful to the producer of single-purpose systolic arrays who can implement any systolic data flow in the custom network. Programmable systolic arrays can, in general, accommodate a large selection of data flows despite hardwired restrictions. This versatility implies that systolic design systems must be considered when analyzing programming languages for systolic arrays.

\(^*\)Of course, the boards would have to be specifically designed for such scaling so that, for example, clock skew between boards does not degrade performance.

\(^1\)The reader may wish to pursue this section directly before reading Chapter 6
Another dichotomy in the realm of systolic programming exists between those languages which abstract from hardware those which do not. Abstract programming methods are ideal for the algorithm designer who, having determined a systolic algorithm, must find a network with the appropriate data movement. Low-level programming can extract the highest performance from a programmable systolic array processor or multiprocessor. In practice, these two flavors of programming often correspond to systolic design systems and systolic programming languages, respectively. While it is clear that systolic design systems must be abstract, the converse is not required. Despite this, many systolic programming languages do not abstract from hardware, and indeed make the hardware painfully obvious to the user.

To motivate the examination of languages in Section 2.3.2, its general conclusion shall be stated forthwith. The landscape of systolic programming paradigms needs a simple language with two qualities. First, it should make use of gross hardware features without forcing the user to consider the means of data transfer. That is to say, the language should assume something about the network topology (i.e., it must be a member of a predetermined group) but not about the method of communication, be it synchronous or asynchronous. Second, it should be accessible to the programmer who desires a specific flow of data (e.g., two opposing data streams) but does not know the specific mathematical transformation necessary to arrive at that flow. Such a language would be ideal for the user of a programmable systolic co-processor. Given a board with, for example, a hexagonal mesh connection network, it should be no more difficult to program a simple linear systolic algorithm than the hexagonal version of matrix multiplication. It is hoped that the small sample of languages illustrated herein sufficiently supports these conclusions.

The abstract languages and methods have two orientations: functional and dependency. The two methods are essentially isomorphic, but both will be discussed in the next two sections.

Dependency Based Languages

Dependency mapping systems typically require some or all of the following information: a cell program, a dependency map, a network connection between processing elements, an assignment of code to individual processing elements, initialization declarations, and output declarations for the array. Some of these systems, such as Heartle144 (nominally in this class), provide the user with graphical input methods. Although useful for the design of new systolic algorithms, it is not clear that graphical input is needed to code a known algorithm. Other work with dependency graphs includes the quasi-regular array (QRA) method,38 HIFI,39 SDEP,42 the VLSI Array Compiler System (VACS),86 and the mapping methodologies of Lee and Kedem.103 Although it is beyond the scope of this thesis to present the methods used to derive dependency maps, their formulation is presented here, and an example of their use in Section 2.3.2.

Dependency mapping techniques have been used widely since their introduction by Moldovan.121,123 Dependency maps are a concise method of expressing the interrelations between computation variables and loop indices in systolic algorithms expressed as nested
FOR loops. Sequential solutions to dynamic programming problems typically fall in this class. The generalized loop program is:

\[ \text{FOR } i = I_i \text{ TO } I_f \text{ DO } S(i); \]  

(2.12)

in which \( i \) is a vector of index variables bounded by \( I_i \) and \( I_f \) (or, \( i \in \mathcal{I} \) for some index set \( \mathcal{I} \)), and \( S(i) \) is a collection of assignment statements between indexed variables, such as \( X(i) \) and \( Y(i) \). Sequential implementations, of course, most often increment the vector \( i \) in the fixed order of least significant index first, but most dynamic programming and other problems do not require this strictness in the order of evaluation. Dependence mapping techniques allow one to find many parallel mappings for a single problem.

A vector \( d = I_s - I_i \) is a data dependency vector if the following hold:

1. \( I_i < I_s \) (lexicographically)
2. \( Y(I_s) \) is an output variable for some statement \( s \in S(I_s) \)
3. \( X(I_i) \) is an input variable for that statement.\(^9\)

In short, the dependency vector states that in order to do the calculation indexed by \( I_i \), the calculation \( I - d \) must be done first. Multiples \( d' \) of a vector \( d \in \mathcal{D} \), the set of dependence vectors, may be ignored since if \( S_i \) depends on \( S_{i-2} \) it must also depend on \( S_{i-n} \).

The next problem is to map the evaluation of all the index points \( i \) to space (processing elements) and time while obeying all dependency vectors. The mapping will have the following form for some embedding \( E \) (a matrix):

\[ \begin{bmatrix} T \\ S \end{bmatrix} = EI. \]  

(2.13)

The left side of the equation is an array partitioned into a time coordinate \( T \) and one or more space coordinates \( S \). To be a legal systolic mapping, the following must also be true:\(^6\)

1. \( \forall d \in \mathcal{D}, T(Ed) < 0 \). Thus, if an operation is depended upon, it must take place before the dependent operation.
2. \( \forall t \in \mathcal{I}, T(ET) \geq 0 \). Thus, nothing can happen before the beginning of time.
3. \( \forall t \in \mathcal{I}, \|\rho(ET)\| < \infty \). Thus, an infinite (unbounded) number of processing elements is not allowed. Negative processor indices are legal.
4. \( \forall t, t' \in \mathcal{I}, \rho(ET) = \rho(ET') \Rightarrow T(ET) = T(ET') \). Thus, two index points cannot be evaluated at the same location and the same time.

As noted, it is beyond the scope of this thesis to consider the derivation of legal mappings, but the application of this formalism to Horner's method is considered in Section 2.3.2.

\(^9\)This is a slight modification and simplification to the work of Moldovan, intended to illustrate the main points of the mapping methodology.
Functional Languages

There are various types of functional languages. Some, such as the quasi-regular array (QRA) method and DIASTOL, derive systolic mappings from uniform or affine recurrence equations. Alternately, Luks and Jones provide a functional language for the expression of systolic algorithms. Finally, HIFI uses a graphical functional language. Since all dynamic programming problems can be expressed as uniform recurrence equations, recurrence-based languages are in some sense more powerful than dependency mapping schemes. However, this difference is slight in practice, and dependence tools and techniques predominate.

Low-level Languages

Low-level languages exist on a one (or more) per machine basis. These include W2 on Warp, Paris on the Connection Machine, Occam on Transputer arrays, REF Fortran, GAPF assembly language, the ISA language, and the low-level Brown Systolic Array language. Not all of these machines are primarily systolic, however the use of systolic principles can lead to fast and efficient programs on any parallel machine.

2.3.2 Programming Examples

This section examines several parallel programming languages, starting with the most abstract (in the author’s opinion) and ending with the most low-level. The code fragments all highlight the simple systolic algorithm for Horner’s method presented in Section 2.1.2. Many of the languages to be examined were not developed exclusively for programmable systolic co-processors; criticisms of these languages will be based entirely on their applicability to this field. Many of the features criticized are useful in each language’s primary functions, be it the design of single-purpose systolic arrays or the programming of message-passing multiprocessors. After considering the strengths and weaknesses of these as systolic array programming language, the criteria for a new systolic language will be proposed.

SDEF

The SDEF system, developed by Engstom and Capello, requires several parts for each program specification: an index set S, domain dependencies D, a spacetime embedding E, and a process function F. The language can express any uniform systolic algorithm because of its use of dependencies and spacetime mappings. Figure 2.8 illustrates the SDEF style of programming. The SDE file gives a mathematical definition of the required

---

1Uniform recurrence equations were introduced in the seminal work of Karp, Miller, and Winograd. For a discussion of the mapping of affine recurrence equations to systolic arrays, see the work of Yassoby and Capello.

2Most of the code fragments have not been tested as not all machines were available. Nevertheless, it is believed that the code fragments accurately display the features of each language.

28
systolic data movement. There are two index variables, one for the Cs and the other for both the Xs and Ys. Each $C_{i,j}$ (= $c_{i}$) depends on $C_{i,j-1}$, and is in fact equal to $C_{i,j-1}$ since the polynomial remains fixed. $y_{i,j}$ is the partial result of the $j$-th X value having undergone $i$ multiplications. SDEF uses inverted data dependency vectors, thus the vector $[0 \ -1]^T$ corresponds to $[0 \ 1]^T$ in the previous section. Referring to Equation 2.13,

\[
\begin{bmatrix}
T \\
S
\end{bmatrix}
= \begin{bmatrix}
1 & 1 \\
0 & 1
\end{bmatrix} I,
\]

(2.14)

or the computation of $y_{i,j}$ takes place at time $T = i + j$ in processor $P_j$.

The functional code language is a modified form of C. The programmer is relieved of worrying about how data is transferred, an advantage which will become apparent as more languages are considered. Each cell performs a transfer function (Figure 2.8b) from inputs to outputs, with the identity being the default; the output version of a variable $V$ is named $V'$, and if $V'$ does not appear as an index the statement $V' = V$ is assumed. Not shown in Figure 2.8 is the architectural specification which tells the SDEF system what sort of physical (or simulated) array the algorithm is to run on. The SDEF system will generate a data template describing the order in which data is to be loaded into the array in addition to code for each process or processor. SDEF is an excellent system for the development and exploration of systolic algorithms. It is, as its designers readily admit, too removed from hardware for some applications: there is an overhead associated with each process and the user cannot take advantage of low-level hardware features. A final disadvantage of the SDEF system is that, being a systolic design system, one cannot include host code and co-processor code in the same program. This means that the SDEF programmer cannot use the systolic processor as a hardware subroutine, making it difficult to link applications together. This ability is nearly a prerequisite for making full use of programmable systolic co-processors.
code horner_cell (c)
ports Xin, YIn, Xout, Yout;
begin real x, y, c;
1 x := y := 0;
2 repeat
3 y := y + x * c;
5 Xout <= x, Yout <= y;
6 x <= Xin, y <= YIn;
7 until ENS (y);
8 Xout <= x, Yout <= y;
end.

(a) Cell program

(b) Stream name definition

Figure 2.9: Heart's program for Horner's method.¹⁴⁴

Hearts

The Hearts language, developed by Lawrence Snyder, is a less abstract systolic programming environment but is also well suited to systolic algorithm experimentation. It was developed for the Poker parallel programming system, originally created for the CHIP computers, but also used elsewhere.¹⁴⁴ Each Hearts program has three parts: one graphical, one in code (Figure 2.9a), and one tabular (Figure 2.9b).

The graphical part of the program is a diagram of processing elements and their connections. Through the use of a library of data flows, the user may select and propagate common graphical dependencies to all processing elements. Although the Hearts language is not itself dependency based, the library of graphical dependencies adds an important abstractness to the system. The graphical part also includes the process assignment, which links specific code to specific processing elements (thus defining equivalence classes), and the port name assignment, which assigns names to each incoming and outgoing link (such as assigning Xin and YIn to two left ports and assigning Xout and Yout to two right ports in each processing element).

The program fragment of Figure 2.9a displays several notable features. The communication channels are used in a natural way; the infix operator <- can both query the ports (blocking if data is not yet available) and send data to the ports. Also, the streams

¹For clarity, the c input are not systolically loaded into the array but passed as parameters to the cell programs. The same approach is used in the Oceans example to follow.
can be queried for sentinels marking the end of the data, allowing the programmer to think asynchronously (thus, the code describes a wavefront array processor in the manner of S. Y. Kung⁴). For use as a systolic co-processor programming language, there are two aspects of Hearts which need improvement. First, the ports should be treated fully as variables in the program, allowing them to be rvalues and lvalues instead of restricting them to a single send or receive operation. Using the style of the Hearts program, the statements:

\[
Y_{out} := Y_{in} + X_{in} \ast C; \\
X_{out} := X_{in};
\]

could completely express the functionality of the cell program, but are illegal. Although Hearts is reasonably abstract with respect to specific hardware, it forces too much consideration of the actual data paths. Second, the interconnection network is externally specified while the actual direction of data movement is specified in the code. The cell program should not involve itself with which way streams move, as it must since the ports have no intrinsic direction: \(X_{in}\) and \(X_{out}\) are simply port names, and a programmer could easily write a program sending data from \(X_{out}\) to \(X_{in}\). This fragmentation of systolic movement is an unneeded complication. The simple movement of a systolic program could be better expressed with a textual statement such as:

\[
\text{right stream } Y;
\]

or some similar expression which indicates that the \(X\) values form a stream of data flowing to the right. Of course, the topology of the array must also be explicitly declared for statements like this to make sense.

The stream name assignment table (Figure 2.9b) is partly generated by the Hearts system: pad numbers and the entire destination half of the table are calculated directly from the graphical portion of the Hearts program. For each "dangling" edge of the systolic array, the user must enter a stream name, index, and direction (input or output). Stream names can be bound directly to files, simplifying data input. Also, the files may be divided up into a records of some size \(k\). The index field of each entry refers to which of the \(k\) items should be used for that particular pad. This is especially useful for distributing matrix elements across the boundary nodes of a systolic array. Since ports have no intrinsic direction, the direction of flow must be entered in the table as well. The Hearts stream table is a reasonably good implementation of systolic array input and output. However, the inability to express a systolic program as a single unit of code is a drawback of the Hearts system.

Occam

Occam, designed by Janos for their Transputer line of microprocessors, is an entirely textual language based on Hoare's notion of communicating sequential processes (CSP)⁶⁴ it includes asynchronous variables and parallel execution of statements. The Occam programmer is only concerned with data movement between processes, not processors. Nevertheless, systolic data movement can be forced upon the processes, as shown
PROC poly(VAR x[n], y[m], c[degree]) =
  VAR tmp:
  PROC inner (VALUE c, CHAN y.in, x.in, 
    CHAN x[n][degree], y[m][degree];
    VAR t, x, y;
    WHILE true
      SEQ j = [0 FOR n]
      x[j] = t * c
      PAR
      SEQ k = [0 FOR m]
      y[k] = t * x
      PAR
    ENDW
    Y.out = t;
  ENDP
  ENDP

Figure 2.10: Occam program for Horner's method.117

in the code of Figure 2.10. The use of streams ("channels", "portals") is similar to that of Hearts with a slight syntactic change: the <channel> ! <var> operation, like <portal> <- <var>, copies a value to a shared asynchronous variable; the Occam statement <channel> ? <var> corresponds to <var> <- <portal> in Hearts. The SEQ and PAR directives, with their associated indentation levels, specify statements to be executed sequentially or in parallel.

Ocaml has several useful features. The flow of data is expressed inside the program, though with some difficulty in the case of a systolic program: an array of channels is used to link together each "adjacent" process. The channels, as with Hearts, may be queried with infix operators. Finally, the PAR statement and associated parallel constructs can be used to avoid simple deadlocks which might otherwise occur using the channels. Unfortunately, Occaml has no concept of a systolic data stream. As can be seen from the example, the programmer must go to great lengths to set up systolic behavior between the processes. With Hearts, Occaml shares the use of overly present asynchronous communication. Since systolic algorithms have a regular flow of data, asynchronous variables (and the associated concern of deadlock prevention) are generally not required in systolic programming languages.

HEP Fortran

Fortran on the Denelcor HEP (Heterogeneous Element Processor) computer has two features to aid the parallel programmer.79 First, shared asynchronous variables may be declared by prefixing a name with a dollar sign. These variables are similar to the channels of Occaml or the ports of Hearts; however they have the advantage of being usable in any expression. This advantage is not without cost: the programmer must still be careful when using the shared variables since each use changes the state of the variable or port. For example the expression J = $I*$I might not make J a perfect
module polynomial (x in, c in, p out)
float x[100], c[100], p[100];
calclprogram (cid : 0 : 9)
begin
begin poly
begin
float coeff, xin, yin, ans;
int i;
end; end;
for i := 0 to 9 do begin
receive (L, X, xin, Z[11]);
receive (L, Y, yin, 0.0);
send (R, X, xin);
ans := coeff * yin * xin;
send (R, Y, ans, p[11]);
end; end;
end;

Figure 2.11: W2 cell program for Horner's method.

square, since the asynchronous variable $Z$ is "emptied" each time it is accessed. The result of an expression such as $2*$cos($X$) will depend on the order of evaluation of the expression. Second, HEP Fortran includes the *create* command which spawns an asynchronous process. This is very similar to the Occam PAR statement, except that the user is restricted to executing subroutines in parallel instead of arbitrary statements.

W2

The W2 programming language is used with the Warp multiprocessor discussed in Section 2.2.3. Warp's queues allow for more asynchronous behavior than the single-element shared variables of Occam and Hearts: the processors can get tremendously out of synchronization without producing the hardware blocking actions that turn off processing elements when queues become full or empty. The waveform style of programming made possible by the queues provides for simpler initialization and result extraction than will be seen in the Paris language example to follow. An example W2 program shown is Figure 2.11. As can be seen from the program fragment, the programmer is aware of every step of the way what hardware configuration is being worked on. The *send* and *receive* statements place data on and take data from the connecting queues. Each call must specify the physical name of the queue ('X' or 'Y'), as well as which side to take the information from ('L' or 'R'). In addition, *send* and *receive* statements require variables or values, respectively, for the end processing elements to write to or read from. A more general version would allow file or stream I/O from the operating system. There is no high-level facility for specifying several data streams, so the programmer must use the hardware queues judiciously if several logical streams are desired: a slight change in the order of *send* and *receive* statements will invalidate a program. Likewise, initialization values for the queues cannot be extracted from the main body of the program.
#define NSIZE 32  /* Allocate memory according to word size */
#define X 0
#define Y (X+NSIZE)
#define TMP (Y+NSIZE)
inner (xin, yout)
real xin, yout;
{
1  CM_f_write_to_processor_SL (0, X, xin, NSIZE);
2  CM_f_write_to_processor_SL (0, Y, 0.0, NSIZE);
3  CM_f_multiply_SL (TMP, C, X, NSIZE);
4  CM_f_add_SL (Y, Y, TMP, NSIZE);
5  yout = CM_f_read_from_processor_SL (degree, Y, NSIZE);
6  CM_send_to_newSL (Y, 1, CM_upward, NSIZE);
7  CM_send_to_newSL (X, 1, 1, CM_upward, NSIZE);
}

poly (x, y, c, degree, n)
real *x, *y, *c;
int degree, n;
{i = 1;
  /* set masks, clear registers, etc */

8  initialize();
9  for (j=0; i <= degree; i++)
10  CM_f_write_to_processor_SL (i, C, c[degree-i], NSIZE);
11  /* fill it up with no output */
12  for (i=0; i < degree; i++)
13     inner (x[i], y[0]);
14  /* continue with output */
15  for (; i < m; i++)
16     inner (x[i], y[i-degree]);
17  /* continue with no input */
18  for (; i < n + degree; i++)
19     inner (0.0, y[i-degree]);
}

Figure 2.12: Pairs program for Horner's method.¹⁴⁹

into a separate routine. The strong hardware orientation of W2 is too cumbersome for a
general-purpose systolic language.

Paris

The Connection Machine is not, of course, a predominantly systolic processor; it is,
however, a good illustration of systolic SIMD programming on a machine with many
simple processing elements. The code in Figure 2.12 is a C program making use of Paris

34
subroutine calls to control the Connection Machine. This program is worse than the W2 example: the initialisation of data streams and allocation of memory is even more primitive. Since the Connection Machine is a synchronous machine, the computation must be divided into three parts, depending on how far through the array the data stream has progressed. Thus line 12 of Figure 5.12 must be repeated with slight changes to reflect each section of the computation. As with W2, there is no abstract sense of a systolic data stream; each must be implemented by telling the processing elements to send data in a given direction. Likewise, there is no abstract concept of a variable; each must be allocated by hand. Despite these disadvantages, the Paris example does integrate host (front end) and co-processor (the Connection Machine) functions in a single program. Thus, a single program can control file I/O and problem setup in addition to instigating co-processor computations.\(^a\)

### 2.3.3 Conclusions

Having considered a selection of languages, from the mathematically elegant to the ponderous, several criteria for a systolic programming language have been developed. They are:

1. The language should cleanly separate systolic cell programs and data flow directives. Shared asynchronous variables are to be avoided because multiple references can produce unpredictable results. Instead, cell programs should be specified as pure transfer functions between systolic inputs and systolic outputs.

2. Programs should be able to execute both host and co-processor array functions, preferably in the context of a conventional language.

3. The programmer should be able to declare systolic data streams as such using concise flow directives. Stream declaration and initialisation should be accomplished apart from the cell program, dissociating cell operation from macroscopic data flow and enabling the reuse of standard data flows and cell operations.

4. The language should be independent of low-level co-processor features and functions, hiding array topology, processing element architecture, array size, and the method of systolic communication from the user. The programmer should not have to specify the physical names of queues, ports, registers, or bits.

Concerning the last point, all the languages just considered have assumed the correct size systolic array, a simplification which is often not valid. Running algorithms on larger than required arrays can always be done (inefficiently) with masking, and often with more sophisticated strategies. Running on smaller than optimal arrays is more difficult, and requires algorithmic analysis. Systolic compilers should be able to perform the first partitioning and, at least, provide ways for the user to perform the second.

\(^a\)The C\(^\circ\) language is a somewhat higher-level programming language for the Connection machine, though as shall be seen in Appendix C it is still closely tied to the physical hardware and does not enable elegant systolic programming.
The New Systolic Language, introduced in Chapter 6, provides a framework for satisfying these requirements.
Chapter 3

Systolic Shared Register Architectures

The Systolic Shared Register architecture is the first major conceptual contribution of this thesis. The Systolic Shared Register paradigm overcomes the cumbersome communication methods of the architectures described in Section 2.2; using its simple method of systolic communication, massively parallel VLSI co-processors may be easily constructed for any class of systolic algorithm.

3.1 Overview

This chapter presents the Systolic Shared Register (SSR) architecture. The design builds upon the problems and merits of the algorithms, architectures, and programming languages discussed previously. Perhaps the most critical issue is that of systolic communication; the simple communication of single-purpose systolic systems such as P-NAC (or the sorting array of Figure 2.1 on page 16) should be maintained, but in contrast to single- and special-purpose systolic arrays, the architecture must be fully and easily programmable. Additionally, to form very large and massively parallel co-processors, processing elements of simplicity similar to those of the Connection Machine, GAPP, or MPP are needed. However, as seen in the algorithm analysis of Section 2.1, large amounts of memory and complicated communication schemes are not required by many systolic algorithms. The four fundamental features of the Systolic Shared Register architecture are:

- broadcast instructions
- regular topology
- systolic shared registers
- a stream model of data flow.
The first two features follow directly from the definition of the class of systolic algorithms. The third feature, the use of shared registers, results from two new ways of looking at systolic co-processors. First, consider the logical structure of a traditional systolic system: an array of processing elements, with memory, connected by several nearest-neighbor data paths via communication registers or queues. More efficient systolic machines can be designed if the computational part and the storage part of the standard processing element are logically split into separate units, allowing more flexible dispositions of computation and memory throughout the VxSI chip. Second, in SSR designs there is no distinction between data memory and communication memory, somewhat reminiscent of early computer engineers discovering that the data and program memories need not be separate in hardware. Finally, the stream model of data flow, motivated by the analysis of systolic programming languages in Section 2.3, provides an elegant method for programming systolic co-processors.

3.2 Architecture Definition

In this section, each of the four defining aspects of the Systolic Shared Register architecture is considered in turn: broadcast instructions, regular topology, systolic shared registers, and data streams. A brief exploration of a generalized SSR architecture is also presented.

3.2.1 Broadcast Instructions

Per-processor programmability currently requires too much area for inclusion in high-density systolic arrays, a result of both the size of the program store and the complexity of the instruction sequencer. To form arrays with hundreds of thousands of processing elements, there should be a multitude of processing elements on each chip; otherwise, the physical space used by the co-processor would be excessively large. Typical independently programmable systolic processing elements, such as iWarp, only have one processing element per chip.14

Additionally, when large numbers of MIMD processors (hundreds or thousands) are used for systolic algorithms, the programs stored in each processor tend to be the similar: by definition, systolic algorithms require only a small number of cell programs.50 Even with a relatively small number of processors, published Warp programs typically only differentiate one of the end processors while all others execute the same cell program. Similarly, excluding some relational database applications, the systolic algorithms of Table 2.1 on page 14 require at most four different cell programs: one for each equivalence class of processing elements. Although in the general case an $E$ equivalence class program will take $E$ times longer to execute on an SIMD machine than on an $E$ or more instruction stream machine, some algorithms can be performed with little degradation, as noted in Section 2.1.4. Thus, multiple equivalence class programs can be efficiently executed using a single instruction stream.

The lack of exploitation of the inherent power of MIMD machines is a strong argument in favor of a single instruction stream. While a systolic instruction stream (discussed in
Section 2.2.4) is an interesting concept, broadcast instructions offer a simpler and more efficient method of control. For these reasons, SIMD broadcast instructions, which play an important part in the shared register design of an SSR architecture, are the method of choice.

3.2.2 Regular Topology

To support systolic algorithms, all of which use a regular topology by definition (see Section 1.1), the systolic co-processor itself should have a regular topology. A fixed-degree regular network with nearest-neighbor connections\(^*\) (such as a line, square mesh, or hexagonal mesh) provides for the orderly and extensible mapping of algorithms to the systolic array. Of these possibilities, a linear network is the best choice for the major target applications of the Brown Systolic Array. A linear network extends easily and requires only two I/O streams, one for each end, so matter how large the array is. Linear systolic arrays cannot, of course, efficiently compute all of the problems mentioned in Section 2.1. Linear arrays can, however, solve a large variety of systolic problems.

This topological constraint enforces the systolic principle of locality. To each processing element, even those at the boundary, the world must look the same: there are some number of neighbors, each with a relative name or direction. In a mesh, these relative names might be north, south, east, and west. This is important for two reasons. First, a regular connection network is easily extensible to larger numbers of processors. Second, relative processor names greatly facilitate the use of both broadcast instructions and shared registers.

3.2.3 Shared Registers for Communication

Before turning to the inter-processor communication scheme of an SSR machine, consider the distinction between the systolic data movement of special- or single-purpose arrays and that of programmable arrays. In the special-purpose case, data movement is implicit in the hardware. For example, a processor designed specifically for a sorting problem (such as that of Figure 2.1 on page 10) might consume a value with an internally stored one and make the smaller available to the next processor all in a single step. Thus, data is pictured as moving through the array at a speed of one processing element per step. Since the processors perform only one function, the timing of the data movement can be built into the hardware.

In a programmable systolic array, data cannot be moved automatically since different applications require different types of data movement. As an extreme example, the linear systolic matrix multiplication algorithm referred to in Table 2.1 on page 14, and which will be discussed in Section 7.2, requires three different types of data movement: two streams flow in one direction at rates of one and two time steps per processor, and a third stream flows in the opposite direction at a rate of one time step per processor. Therefore, in a programmable systolic array data movement must be explicitly specified

\(^*\)That is to say, a regular and planar network. The relaxation of this condition is considered in Section 3.3.3.
Figure 3.1: Linear SSR processor array and processor detail.

by the user. Several machines provide this ability with special instructions and special queues or registers for data movement — with Warp, this entails placing data on the X and Y queues and removing them in the correct order. In an SSR architecture, however, data moves through the array as a natural result of the user’s program.

Figure 3.1 shows a segment of a linear SSR array and a detail of a single functional unit with its neighboring register banks. Note that each functional unit ($F_i$) is adjacent to two register banks. Each functional unit can access data values from the register banks directly to its west and east for both input and output. Likewise, each register bank can send and receive data to and from two adjacent functional units. Input to and output from the array can take place at either end. Systolic communication in an SSR architecture uses these shared register banks. There are no internal registers in the functional units, though a small number of flag bits may be present. This separation of computation and storage is a hallmark of the SSR architecture.

Every instruction broadcast to the SSR array specifies the relative addresses of memory locations: specific registers in either the west or the east register bank.

There are two vaguely similar uses of shared memory in the literature, however both are MIMD multiprocessors with standard microcomputer chips. The lattice QCD machine developed at Columbia University by Christ and Tannen features a triangular mesh of sixteen Intel 80286/80387 processor boards, each of which is able to access one of the two buses in each neighbor. The other example, presented by Jagisch and others, connects Intel 80286/80387 processors with dual-ported random access memory (RAM) and an overlaying hypercube structure for message passing — again, not an example of the simple VLSI implementation of systolic communication that is the SSR paradigm.

40
the execution of an instruction, each processing element will read operands from relatively addressed registers and, after the computation, write the result to a relatively addressed register. A single instruction may easily access two (or more in the general case) neighboring register banks. The use of broadcast instructions and relative addressing prevents contention for single-result instructions; dual-ported memory is not required since no two functional units attempt to access the same register bank simultaneously. Simple switches between functional units and register banks can control the flow of information. While there is no potential for hardware conflicts, software conflicts using the shared register banks must be avoided, as shall be discussed in Section 3.4 and again in Chapter 6. The lack of hardware conflicts is a benefit of the regular topology and broadcast instructions required by the SSR design.

With Systolic Shared Registers, data flow takes place concurrently with computation. For example, the single instruction

$$E_{ij} \leftarrow \min(W_0, W_1)$$

(3.1)

not only performs a minimization but also moves data through the array from west to east. Unlike W2 with its send and receive statements, no additional instructions are required to initiate specific types of data flow. On the bit-serial machines discussed in Section 2.3, data must be placed in communication registers before it may be accessed by adjacent processing elements. In the case of the Connection Machine, this is done with the Paris send instruction, in which the router accesses the processor's flag registers. The Massively Parallel Processor and CAPP machines mentioned in Section 2.2.4 have registers dedicated to communication. No extra synchronization statements or special registers are needed in an SSR machine.

The shared register method does not require the use of hardware queues between processors and does not arbitrarily limit the number of communication channels between processors; the user may decide to use the first few registers of each register bank for local storage, the next for transmission of data from west to east, the next for another data stream, and so on. In summary, shared registers provide a small, fast, flexible, and elegant implementation of systolic communication.

3.2.4 Stream Programming

Finally, SSR machines are programmed using a data stream paradigm. In, for example, the UNIX® operating system, utilities typically read from and write to streams (stdin and stdout) which can then be sent to other programs using the UNIX pipe operator. Thus, various utilities can filter the output stream of a program until it is in the desired form: perhaps sorted and columnized.

The data stream model for SSR machines allows the inputs and outputs at both sides of a linear array to be linked to file streams or functions by the host program. In mesh architectures, all edge processing elements, singly or in groups, may be linked to data

*UNIX is a registered trademark of UNIX System Laboratories in the United States and other countries.
streams. For example, a data stream of \( n \times n \) element vectors forming an \( n \times n \) matrix could be linked to the entire west side of a square mesh. The use of a logical data stream is crucial: as shall be seen in Section 3.4, it is possible and, indeed, probable, that a single stream of data may use different physical registers at different times, changing register meanings to produce a more efficient program. This stream model of data flow allows the programmer to ignore such detail.

Stream programming enhances the use of SSR machines as massively parallel systolic co-processors: data can be fed to and received from the array continuously and asynchronously when data queues are provided between the host and the co-processor (discussed in Section 5.2.4). The instruction of Example 3.1 uses one input and one output. The input value (perhaps the next value of a file stream) is written to the easternmost register bank (\( W_2 \) of \( F_1 \) in Figure 3.1). Although this way at first seem odd, it is quite logical: \( R_0 \) is the only register bank not written to by the minimization instruction. The output from Example 3.1 is recovered from the easternmost register bank: in addition to being written to that register bank (\( R_3 \) in Figure 3.1), it is passed out of the array to the host. A programmable co-processor interface could discard unseeded outputs and provide default inputs without the aid of the host computer.

Unlike the other aspects of Systolic Shared Register machines, stream programming is not primarily a hardware issue: it represents a programming methodology. Stream programming will be further developed in the context of the New Systolic Language presented in Chapter 6.

### 3.2.5 Generalized SSR Machine

The systolic shared register paradigm is not restricted to linear arrays. Figure 3.2 shows abstract designs for square, hexagonal, and octagonal mesh SSR machines. In the octagonal mesh, each functional unit can communicate with eight neighboring functional units through its four adjacent register banks. Although there are two boundary edges on corner memory nodes in the hexagonal mesh and three in the octagonal mesh, just one I/O interface for each boundary register bank is needed since only one value is written to any specific register bank during the execution of an instruction. This is a direct result of the broadcast of instructions and relative addressing used in SSR machines.

The SSR architecture may be further generalised as follows. A generalised Systolic Shared Register machine is an undirected bipartite graph \( G = (F, R, C) \) composed of a set of function nodes \( F \), memory nodes \( R \), and communication links \( C \subseteq F \times R \). To preserve systolic features, the graph must be regular and of fixed degree. These two qualities ensure that graphs of arbitrary size may be constructed, a requirement for massively parallel systolic systems. For VLSI systems, \( G \) will be planar or toroidal.

---

*Although this formal definition of the SSR architecture is not necessary for understanding B-SYS, it is important to precisely define what has been only intuitively described.

*Note that the hexagonal and octagonal meshes featured in Figure 3.2 have the same number of function and memory nodes (like the linear array), while the square mesh has twice as many memory nodes. A hexagonal mesh could, of course, be implemented with a degree 2 network which would see three times as many register banks as functional units.

---

42
Cube-connected cycle networks (degree 3) may also use this paradigm. If the fixed degree restraint is relaxed, hypercube ($|C| = n \log n$) and shared memory ($|C| = n^2$) extensions to the SSR concept result. In general, $R_i$ refers to the $j$th element of register bank $i$, while $F_i$ is the $i$th functional unit.

Function nodes $F_1, F_2 \in F$ share memory node $R \in R$ if $(F_1, R) \in C$ and $(F_2, R) \in C$ or, alternatively, if $\text{dist}(F_1, F_2) = 2$. A functional unit $F$ has access to all registers in its neighborhood

$$\text{ngbd}(F) = \{R_{ij} \in R : R_i \in R \text{ and } (R_{ij}, F) \in C\}.$$  

As with the linear case, register banks are referred to relative to a functional unit (east, northwest, etc.): the regularity of the network and the SIMD control assure that multiple accesses of the same memory bank will not occur.
Each instruction in a program \( I^1, I^2, \ldots, I^r, \ldots \) performs a computation on some set of neighboring registers to determine the values of some other group of neighboring registers, thus each is of the form:

\[
R^t_i \leftarrow f(R^{t-1}_{i-1}, R^{t-1}_{i-2})
\]

\( R^t_i \) refers to register location \( i \) at time \( t \). The instruction can easily be generalized to different numbers of operands and results (for example, a flag operand and result are used in B-SYS). Each instruction \( I^r \) is executed simultaneously in all functional units at time \( t \). The presence of a mask or context flag, such as that in B-SYS, corresponds to including the mask flag and the original value of the result register(s) as operands to each instruction.

The Systolic Shared Register architecture is not just the foundation of the Brown Systolic Array; it is a versatile framework for parallel computing. Although best suited for VLSI, in which register banks and functional units may be arbitrarily positioned with ease, its structure may be used for any parallel machine.

### 3.3 An Evaluation of SSR Communication

It is instructive to examine the Systolic Shared Register design in light of the definition of systolic communication proposed by H. T. Kung. Kung separates memory-to-memory communication and systolic communication. In the former, information is passed from the memory of one processing element to the memory of another via communication queues. In systolic communication, however, information may be both directly
real from and written to the communication queues by the central processing unit. In both cases, there is a distinction between the queue memory used for communication and each processor's own local memory. Figure 3.3 shows these two methods for providing communication between processing elements in a linear array.

This distinction between the two types of memory is important: the communication memory defined by Kung only allows sequential access (as with Warp), while the local memory allows random access. Thus, a systolic array machine is one that can make direct use of data on its input queues. Since the register banks of an SSR architecture are essentially a set of single-element, bi-directional queues, the SSR architecture is systolic by this definition. However, the SSR method has taken the idea of systolic communication one step farther: the shared register banks allow random access, making local memories superfluous. Since SSR systems do not distinguish between memory types, the CPUs not only communicate with their communication registers but compute with them as well.

Another difference between communication queues and shared registers is that the SSR paradigm's use of communication registers forces synchronous execution on the system. If an item is to be read from a specific register, it must be written to that register by an adjacent processor before that time. This is a simple task with synchronous broadcast instructions and closely resembles the data movement present in special-purpose systolic arrays. On the whole, Systolic Shared Registers provide more efficient systolic communication than systolic queues.

3.4 Programming Techniques

Moving data through an SSR machine is more complicated than simply reading two operands from the west register bank (W) and storing the result in the east register bank (E). Since E is the west register bank of an adjacent functional unit F, writing to it may invalidate data used by F. Taking another look at the sorting algorithm of Figure 2.1, the operation of storing the maximum and passing the minimum has been conveniently lumped into one instruction. This operation may be represented as:

\[ E_1, W_0 \leftarrow \min \max (W_0, W_1) \tag{3.2} \]

where register 0 is used for local storage of the maximum and register 1 is used for communication. Since instructions typically have only one output register, this version is unrealistic. A possible translation is:

\[ E_1 \leftarrow \min (W_0, W_1) \]
\[ W_0 \leftarrow \max (W_0, W_1) \tag{3.3} \]

This program appears to have the same effect as Example 3.2, however \( E_1 \) for the \( i \)-th functional unit \( F_i \) is \( W_0 \) of \( F_{i+1} \); the writing to \( E_1 \) by \( F_i \) will invalidate the maximum computation of \( F_{i+1} \), and so on down the line. There are two solutions to this problem: the use of scratch registers and the use of phased programs.
3.4.1 Scratch Registers

Scratch registers are the obvious solution to the problem present in Example 3.3. Using \( W_2 \) as a scratch register, the program becomes:

\[
\begin{align*}
W_2 &\leftarrow \min(W_0, W_1) \\
W_0 &\leftarrow \max(W_0, W_1) \\
E_1 &\leftarrow W_2.
\end{align*}
\]  

(3.4)

At the cost of one register and one instruction, the overwriting problem has been solved. For the simple sorting algorithm, the extra instruction causes a large degradation in performance. A phased systolic program will provide a more efficient solution.

3.4.2 Phased SSR Programs

The second alternative is the use of a phased program. A phased program has several distinct phases during which different registers are used for communication. For example, the communication register might be \( E_2 \) in phase 1 and \( E_1 \) in phase 2. Switching back and forth between these two phases (called weaving) will result in a correct program which, though an additional register is still used, does not require any extra instructions. Two inner loops of the sorting program would then be:

\[
\begin{align*}
E_2 &\leftarrow \min(W_0, W_1) \\
W_0 &\leftarrow \max(W_0, W_1) \\
E_1 &\leftarrow \min(W_0, W_2) \\
W_0 &\leftarrow \max(W_0, W_2).
\end{align*}
\]  

(3.5)

Sorting with this technique is shown in Figure 3.4. The current scratch register is boxed, while the top register is used for local storage. During the first part of each phase, the minimum of the boxed scratch value and the top stored value is passed to the unused register of the adjacent processing element. During the second part, the maximum of the two values is locally stored.

This technique can also be applied to programs with more than two phases. For example, suppose a half-speed data stream is flowing from west to east. The general three-phase program would be:

\[
\begin{align*}
E_0 &\leftarrow f(W_0) \\
E_2 &\leftarrow f(W_1) \\
E_1 &\leftarrow f(W_0).
\end{align*}
\]  

(3.6)

This data stream may then be combined with a two-phase data stream to form a program with six (the least common multiple of two and three) phases.

The use of phased programs to avoid software conflicts will be further considered in Chapter 6, wherein the New Systolic Language's automatic support for phased programs will simplify the programming of Systolic Shared Register machines.

46
The local storage register is enclosed in a dotted box. The scratch register is shown in a solid box. The remaining register is used to receive data from the adjacent processing element. Functional units are not shown.

Figure 3.4: Phased systolic sort.
3.5 Conclusions

This chapter has presented a new architecture for massively parallel systolic processing: the Systolic Shared Register architecture. The SSR architecture combines the best features of special-purpose systolic arrays (many processors, synchronous data movement) with the power of programmable arrays. The four attributes of the Systolic Shared Register architecture are:

- SIMD broadcast instructions for control
- regular interconnections for systolic data flow
- shared registers for communication and computation
- data streams for programming.

The SSR paradigm is applicable to all types of programmable systolic co-processors, including linear arrays and various types of meshes. Although SSR programming has some intricacies, Systolic Shared Registers provide a simple and practical implementation of systolic communication.
Chapter 4

Design of the Brown Systolic Array

The basic architecture of the Brown Systolic Array is that of a linear Systolic Shared Register machine. Apart from this, there are many choices left open in the generic architecture which must be considered for each SSR implementation: functional unit complexity, register bank size, and word size are the most obvious. Proceeding down to the lowest design level, even after these architectural issues have been decided, potential implementations of the chosen features must be suitably analyzed to produce a working chip. This chapter considers all aspects of the B-SYS design, from the architecture down to the CMOS layout.

4.1 Architecture

Recall from Section 2.2 the three types of parallel machine examined. The single-purpose systolic array could achieve massive parallelism because of its size but was not programmable. The SIMD bit-serial machine was also massively parallel but had more memory and routing capabilities than is required for combinatorial problems. The elimination of these features could provide either the same processing power (for combinatorial problems) in a smaller space or a much more massively parallel machine in the same space. Space, of course, corresponds directly to cost. The use of a bit-parallel processing unit could improve performance as long as pin requirements remain feasible. Finally, the MIMD machine lacked the ability to become massively parallel in a reasonable amount of physical space: the processing elements were simply too large. The Brown Systolic Array, being an implementation of the SSR architecture, solves many of these problems for the domain of combinatorial problems.

Developing further the generic linear SSR array of Figure 3.1 on page 40, Figure 4.1 illustrates a B-SYS functional unit and its two adjacent register banks. The Brown Systolic Array uses an 8-bit word, large compared to bit-serial processor arrays and small compared to MIMD machines. For combinatorial problems, with their reliance on
symbols and small numbers, 8-bit words are sufficient. Common ASCII text uses seven or eight bits, thus B-SYS is well suited for text compression, comparison, and other applications. Although an 8-bit ALU cannot be expected to perform eight times faster than a 1-bit ALU, it is in general faster.

Each register bank has sixteen 8-bit registers, the same amount of storage per processing element as the GAPP chip (Section 2.2.4), and an amount suitable for many sequence comparison and other symbolic computations. Access to the register banks is controlled by complementary 8-bit switches on each side of the register bank.

Functional units have eight local flags \( F_0 \ldots F_7 \) which are used for the storage of ALU flag inputs and outputs, as well as the processing of conditionals. One of the flags \( F_0 \) is a context flag which enables the conditional execution of instructions. If the command broadcast to the array enables use of the context flag, only those with a set flag will write the ALU results to memory.\(^6\)

### 4.1.1 B-SYS Arithmetic Logic Unit

The Brown Systolic Array’s arithmetic logic unit, shown in Figure 4.2a, has two one-word inputs, \( A \) and \( B_i \), and a single-bit input \( C \). The outputs are a result word \( R \) and a bit \( Z \). The ALU has a bit-slice architecture: each bit of the output is computed in a manner independent of all other bits in the ALU and, indeed, all other bits on the chip.\(^5\)


\(^6\)Actually, a result is written to a register bank \( R_i \) if the functional unit to its east \( F_i \) has an asserted context flag. Thus, when the destination is in the east register bank, the writing of a result \( R \) will not be controlled by the functional unit \( F_i \) which generated it, but by \( F_{i+1} \). The writing of \( F_i \)'s flag output is always controlled by \( F_i \)'s context flag. This is caused by an easy design flaw which, when discovered, was deemed too costly to fix.

\(^A\) note on fonts used in the text. Architectural names are expressed as GP, \( pr_i \), and the like. Low-level control signals and operands are expressed as \( K_1 \), \( K_{1AB} \), and the like. Hexadecimal numbers are
The \( i \)-th bit-slice, detailed in Figure 4.2b, is loosely based on the design of the OM1 arithmetic logic unit from Mead and Conway.\(^{110}\) A carry lookahead ALU is used: generate and propagate signals are calculated from the \( a_i \) and \( b_i \) inputs and then used to calculate the carry bit output \( c_i \). The result output \( r_i \) is computed from the \( a_i \) and \( b_i \) inputs as well as the carry input \( c_i \) (equal to \( c_{i-1} \) from the previous stage). The generate and propagate signals have their standard meanings: if \( g_i \) is asserted then \( z_i \) is asserted. Otherwise, if \( p_i \) is asserted then \( z_i = c_i \). If neither of these cases hold, \( z_i \) is not asserted.

Arbitrary function logic blocks compute \( p_i \), \( g_i \), and \( r_i \). In the former cases, any 2:1 function (specified by \( CP \) and \( CG \)) and in the latter any 3:1 function (specified by \( CR \)) may be used. For example, consider the addition operation. In this case, a carry is generated at the \( i \)-th stage if \( a_i \) and \( b_i \) are both 1. A carry is propagated though the \( i \)-th stage if either of the inputs is 1. The sum bit is the exclusive-or of the inputs \((a_i \oplus b_i \oplus c_i)\), expressed as \( 0C_{16} \) or \( 6C \).
The logic block control word is the appropriate column of the function table. In the case of addition, the result function \( CR = 96 \_6 = 10010110_2 \) can be read from the \( r_1 \) column of Table 4.1 (the most significant bit of \( CR \) corresponds to \( a_1 = b_1 = c_1 = 1 \) in the table). The generate and propagate control signals, independent of \( c_1 \), are four bits long. Reading from the top half of the table, the generate function is \( CG = 8_{16} \) and the propagate function is \( CP = 6_{16} \). The propagate function \( CP = E_{16} \) would be equally valid for addition, corresponding to \( p_1 = a_1 + b_1 \). As can be seen from the above table, the order of significance for inputs to the function blocks is \( c \), \( b \), and then \( a \).

**B-SYS Microcode**

The 38-bit horizontal microcode format of a B-SYS instruction is shown in Figure 4.3. The instruction word may be conveniently divided into two major parts: the register input and output addresses, along with \( CH \), and the flag input and output addresses, along with \( CG \) and \( CP \). Each broadcast instruction has the following fields:

- \( B \): When asserted, all functional units execute the broadcast instruction regardless of the context flag. Otherwise, only those functional units with a 1 in flag \( F_0 \) will store the ALU results. This field corresponds to the Call control line of the B-SYS chip.
- \( CR \): eight bits expressing the 3:1 function used to calculate \( R \).
- \( A \): one bit specifying a register bank (left or right) and four bits specifying which register in that bank is the \( A \) input to the ALU.
- \( B \): five bits, as with \( A \).
- \( R \): five bits, as with \( A \), specifying the register in which to store the result.
- \( CG \): four bits expressing the 2:1 function for the generate signal of the carry chain.
- \( CP \): four bits expressing the 2:1 function for the propagate signal of the carry chain.
- \( C \): three bits specifying the address of the input flag to the ALU (\( c_0 \)).
- \( Z \): three bits specifying the address for the output flag from the ALU (\( s_1 \)).

All fields are required and there are no immediate operands. The \( CR \), \( CG \), and \( CP \) control words are broadcast to each bit slice of the ALU, while the remaining fields
<table>
<thead>
<tr>
<th>Name</th>
<th>Function</th>
<th>( CR_{B} )</th>
<th>Name</th>
<th>Function</th>
<th>( CR_{B} )</th>
</tr>
</thead>
<tbody>
<tr>
<td>zero</td>
<td>0</td>
<td>00</td>
<td>one</td>
<td>1</td>
<td>FF</td>
</tr>
<tr>
<td>fnA</td>
<td>a</td>
<td>AA</td>
<td>fnC</td>
<td>c</td>
<td>F0</td>
</tr>
<tr>
<td>xorAC</td>
<td>( a \oplus c )</td>
<td>5A</td>
<td>jorAC</td>
<td>ac</td>
<td>C0</td>
</tr>
<tr>
<td>nbotxorC</td>
<td>( \overline{a \oplus c} )</td>
<td>6B</td>
<td>xorAB</td>
<td>( a \oplus b )</td>
<td>68</td>
</tr>
<tr>
<td>xorAB</td>
<td>( a \oplus b )</td>
<td>8B</td>
<td>andAB</td>
<td>ab</td>
<td>8B</td>
</tr>
<tr>
<td>nandAB</td>
<td>( \overline{a \cdot b} )</td>
<td>1F</td>
<td>orAB</td>
<td>( a \cdot b )</td>
<td>EE</td>
</tr>
<tr>
<td>xorAB</td>
<td>( a \oplus b )</td>
<td>11</td>
<td>xorABC</td>
<td>( a \oplus b \oplus c )</td>
<td>66</td>
</tr>
<tr>
<td>selectABonC</td>
<td>( a \oplus b \oplus c )</td>
<td>4C</td>
<td>selectABonC</td>
<td>( a \oplus b \oplus c )</td>
<td>CA</td>
</tr>
</tbody>
</table>

Table 4.2: B-SYS result block control words.

<table>
<thead>
<tr>
<th>Name</th>
<th>Function</th>
<th>( CG_{16} )</th>
<th>( CP_{16} )</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td>Zero</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Zone</td>
<td>1</td>
<td>F</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Zadd</td>
<td>carry</td>
<td>8</td>
<td>6</td>
<td>0</td>
</tr>
<tr>
<td>Zadda</td>
<td>increment A</td>
<td>0</td>
<td>A</td>
<td>1</td>
</tr>
<tr>
<td>Zsub</td>
<td>borrow</td>
<td>4</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Zconst</td>
<td>C</td>
<td>0</td>
<td>F</td>
<td>0</td>
</tr>
<tr>
<td>notzeroA</td>
<td>a \neq 0 ?</td>
<td>A</td>
<td>F</td>
<td>0</td>
</tr>
<tr>
<td>zeroA</td>
<td>a = 0 ?</td>
<td>0</td>
<td>S</td>
<td>1</td>
</tr>
<tr>
<td>matchAB</td>
<td>B ; a = b ?</td>
<td>8</td>
<td>F</td>
<td>1</td>
</tr>
<tr>
<td>nomatchAB</td>
<td>( \overline{B} \neq a ) ?</td>
<td>0</td>
<td>6</td>
<td>1</td>
</tr>
<tr>
<td>equalAB</td>
<td>( A = B )</td>
<td>0</td>
<td>9</td>
<td>1</td>
</tr>
<tr>
<td>notequalAB</td>
<td>( A \neq B )</td>
<td>6</td>
<td>F</td>
<td>0</td>
</tr>
</tbody>
</table>

Table 4.3: B-SYS generate and propagate control words.

are used by the processor to access the correct memory elements. If the user only needs one of the two results (\( R \) and \( Z \)), the other must be placed in a scratch register or flag. If a program requires certain constants, they must be loaded into one of the registers beforehand. Often, the first step of a program will be to place values in several flag registers and clear several locations in the register banks. Constants needed in all processing elements can be constructed in a small number of B-SYS statements without having to shift them through the array.\(^4\)

B-ASM Assembly Language

Table 4.2 displays the hexadecimal codes for several result logic block control words. The selection command selectABonC chooses between operands according to a flag input; it is vital for the processing of conditionals without the use of the mask flag. Since masking operations require additional instructions to set and clear \( F_{5} \), register selection is often the most efficient way to process conditionals.

\(^4\)For example, registers can be set to arbitrary values in eight steps by shifting in ones and zeroes from the flags.

53
On appropriate flag inputs the Manchester carry chain can perform several test functions. For example, with \( C = 1, CP = 5_{16}, \) and \( CG = 0_{16}, \) a carry will never be generated, and the input carry will be propagated only if \( ai = 0. \) Thus, these two codes test the \( A \) operand for equality to zero. Several generate and propagate control words are shown in Table 4.3.

Figure 4.4 shows a program for the phased sorting algorithm of Figure 3.4 on page 47. The program (from the Brown Systolic Array Simulator) uses the C preprocessor as an assembler. In addition to the instruction fields just described, each command also has information describing host I/O: when results should be written to and read from the co-processor board, as described in Section 3.2.4. Programs such as this are used both when programming the Brown Systolic Array and when using its simulator.

4.1.2 Architecture Simulation

The Brown Systolic Array Simulator (B-SIM) proved crucial for algorithm development and early evaluation of the B-SYS architecture. The simulator simulates the data flow between the host and the systolic co-processor using UNIX pipes: when running the simulator, the host process analyzes the input assembly code with its primitive looping construct, combines instructions with input data, and passes the information to the board process. The board process simulates the array and, when requested, passes information back to the host process which can then send data to the appropriate files. The host
process can be modified for different programming models without change to the basic simulator.

In addition to passing information back to the host process, the board process can (on command) write the state of the entire array to a file. Using another program, this raw data from the board process can be transformed into TeX snapshots of the array similar to that of Figure 4.5. As an added feature, the board process can also send FIELD messages to an independently running systolic array animation system based on the Tango package. With the animator, the user can see operands move to the functional units and results move back to the appropriate register bank. Flags are represented with red (clear) and green (set) circles. Although, given a B-SIM program, the animation is entirely automatic, the user may also elect to include comments and pause commands at specific points in the animation and to highlight certain registers in the array. Such highlighting can reflect register bank meanings in a phased systolic sort, as illustrated by Figure 4.6. (Of course, two black and white foames do not do justice to the color Tango animation.)

The use of the Brown Systolic Array simulator is detailed in Appendix A.

4.2 Design

This section tours the design of the Brown Systolic Array. After a review of CMOS technology, the overall organization of B-SYS is described. As shall be seen, B-SYS
Figure 4.6: B-SIM Tango animation.
pans 85,000 transistors on a 6.9 mm x 6.8 mm chip, a density made possible by the simplicity and elegance of the SSR design. Finally, the implementation of various B-SYS subsystems is considered.

4.2.1 CMOS VLSI

Before proceeding, a brief review of (or introduction to) digital Complementary Metal Oxide Semiconductor (CMOS) VLSI is in order. As the name implies, CMOS has two basic circuit elements: the nFET (n-type field effect transistor) and the pFET (its complement). Digital CMOS circuits are formed with these two transistors, used as switches, and the supply voltages power (Vdd) and ground (GND). In the ideal circuit, there are no analog effects; that is to say, all signals match (in voltage) one of the supply voltages and all switches both do and do not conduct perfectly, depending on state. CMOS comes very close to this ideal case, certainly much closer than such technologies as nMOS (CMOS without the pFET) and transistor-transistor logic (TTL).

The n-transistor of Figure 4.7a is a positive switch; when its gate is connected to a high voltage (Vdd) the n-transistor will turn on, connecting the source and the drain terminals. The complementary p-transistor is a negative switch: when its gate is connected to a low voltage (GND) the p-transistor will conduct. CMOS transistors are nearly perfect switches. No current is required to maintain the FET's conductive state, although some current is required to change a transistor's state because of the gate capacitance. Also, nFETs conduct ground signals with no voltage loss and pFETs conduct power signals with no voltage loss. CMOS designs therefore use nFETs to control low signals and pFETs to control high signals.

The fourth terminal of the nFET, labeled "substrate", is a connection to the base material on which the FET is fabricated (for an nFET, this is p-material — silicon with an electron deficiency yielding a net positive charge). In digital CMOS circuit design, substrate terminals are always connected to ground (for nFETs) or power (for pFETs),

Figure 4.7: Example CMOS circuits: transistor, inverter, and inverter layout.
and will not be included in future transistor diagrams.

Figure 4.7b shows a CMOS inverter, the basic circuit of the CMOS logic family. When the input \( \text{In} \) is high, the \( n \)-transistor pulls the output \( \text{Out} \) low and the \( p \)-transistor is off. An analogous case holds when the input is low. This gate will use no power except when the input signal is changing levels, at which time power and ground will be briefly connected as one transistor switches on and the other switches off; during static operation, neither the gate node nor the output node (which is connected to other perfect transistor gates) requires a flow of current.

What has just been described is an ideal circuit, but CMOS transistors are not quite so perfect and do not pass voltages perfectly. The ideal model is, nevertheless, close to the truth and sufficient for most aspects of digital circuit design.

CMOS circuits are formed by placing layers of various materials on top of one another. The major abstract layers for B-SYS, manufactured in a typical CMOS technology, are shown in Figure 4.8. There are five basic layers: polysilicon, two metal layers, and two types of diffusion. Polysilicon is the gate material of transistors: whenever a line of polysilicon crosses either diffusion layer, a transistor is formed. All other layers do not interfere with each other unless connected. For example, polysilicon and both metal layers may pass on top of one another without any electronic effect. The first metal layer (metall) is the primary connection medium: it can be attached to all other layers using contacts or, in the case of metal2, vias. The second layer of metal, which can only be connected to the first, is well suited for control signal and other long-distance routing. Substrate contacts, usually ignored in transistor diagrams, are crucial in the design of functioning circuits. Beyond these basic layers, more advanced (and more costly) technologies may have additional layers of metal or polysilicon.

A symbolic layout of the CMOS inverter (using only one layer of metal) is shown in Figure 4.7c. Information pertaining to the electrical and switching characteristics of such CMOS gates as well as the use of substrate contacts may be found in the literature.5,158,151

58
B-SYS was designed in a lambda-based scalable CMOS style. In scalable circuit design, layouts are made using an abstract unit of measurement (λ) and various design rules, such as polysilicon having to be both 2 λ wide and spaced a similar amount from other polysilicon. The general figure of merit for a fabrication process is its feature size, or the smallest size of material which can be reliably fabricated. In abstract units, this corresponds to a 2λ × 2λ square. Thus, when B-SYS was implemented in a 2 micron CMOS process, λ was set to 1 micron. The great advantage of scalable CMOS is that all parts of a digital system (excepting pads, which require a certain physical size that does not depend entirely on feature size) may be designed without concern for the eventual technology — the basic functional unit and register bank of B-SYS were designed well before the final implementation technology was known. The inverter in Figure 4.7c could be implemented in a 0.6 μm process as easily as a 4 μm process.

The University of California at Berkeley’s Magic VLSI design system (both version 4 and beta version 6) was used to lay out the B-SYS chip. A scalable CMOS technology specification for Magic, obtained from the MOS Implementation Service (MOSIS, the fabricator of B-SYS), provided rules for the Magic design rule checker. The design rule checker, running constantly during design sessions, highlights rule violations for correction, a great aid to the circuit designer.

The circuit layouts in this chapter are plots of symbolic Magic designs.

4.2.2 Overview of the B-SYS Chip

The B-SYS chip (shown Figure 4.9 without connecting wires) measures 6.9 mm × 8.8 mm. Each of the 84 square pad cells ringg the chip measures 200 λ × 200 λ, or 100 μm square. These are so “large” both because powerful (hence, large) inverters are needed to drive signals off of the chip and because wires must be bonded from the chip’s pads to the pins in the packaging. The pads, obtained from MOSIS, were designed for use with Magic by Charles Seitz at the California Institute of Technology.

The B-SYS chip contains 47 functional units, each coupled with its own register bank (identical to the structure shown in the expanded processor of Figure 4.9), and one additional register bank. Each row of the chip has its own control circuitry, including a finite state automata (FSA) for control signal generation and two decoders, one for the 16-word register banks and one for the eight flags. Also visible are row buffers for the 16 bits of function information associated with each instruction. The entire chip has 85,000 transistors, a modest size in these days of million-transistor chips.

Because sending information on- and off-chip is slow, it is not practical for end functional units to obtain ALU operands from either adjacent chips or the board. For this reason, the register bank shared between the rightmost processing element of one chip (block 47 in Figure 4.9) and the leftmost of the next (block 1) is duplicated. This performs the function of a write-through cache so that chip I/O does not take place during the loading of instruction operands. Whenever a result is written to the right, the value is sent both to the shadow register bank (above the B-SYS logo) and off-chip.

*Lambda-based design rules were introduced by Medd and Conway. 129*
At the same time, an input is received (either from an adjacent chip or the board) for the leftmost register bank (part of block 1 in Figure 4.9).

The mapping of the pads to the 84-pin pin grid array (PGA) is shown in Figure 4.10. The pins have the standard spacing of 0.1 inches, and the square package measures 1.2 inches on each side (about four times the length or width of the actual chip).

The meanings of the pin names are presented in Table 4.4. The pins may be grouped into five categories: clocking (4), instruction (38), data (9 for each side, including a control line for masking), diagnostic output (6), and source voltages (12). The remaining 6 pins are unused. Instruction signals correspond exactly to those described in Section 4.3.1. Of the four signals in the clocking group, three are actual clocks: K1, K2, and K3. These roughly correspond to decoding the memory address and precharging memory (or evaluating the carry chain), accessing the register bank, and clearing the register selection. There are three primary states the chip can be in: \( P_o, P_i, \) and \( P_{out} \). During
Figure 4.10: B-SYS pin grid array.

<table>
<thead>
<tr>
<th>Pin</th>
<th>Description</th>
<th>I/O</th>
<th>Pin</th>
<th>Description</th>
<th>I/O</th>
</tr>
</thead>
<tbody>
<tr>
<td>K1</td>
<td>Clock</td>
<td>I</td>
<td>C0</td>
<td>Clock Control</td>
<td>I/O</td>
</tr>
<tr>
<td>K2</td>
<td>Clock</td>
<td>I</td>
<td>C1</td>
<td>Clock Control</td>
<td>I/O</td>
</tr>
<tr>
<td>K3b</td>
<td>Clock</td>
<td>I</td>
<td>C2</td>
<td>Clock Control</td>
<td>I/O</td>
</tr>
<tr>
<td>Ini</td>
<td>Ignore Clocks</td>
<td>I</td>
<td>C3</td>
<td>Clock Control</td>
<td>I/O</td>
</tr>
<tr>
<td>C0-C2</td>
<td>C Flag Input Address</td>
<td>I</td>
<td>B0-B3, BL</td>
<td>B0-B3, BL</td>
<td>I</td>
</tr>
<tr>
<td>B0-B2</td>
<td>Z Flag Output Address</td>
<td>I</td>
<td>X0-X3, XL</td>
<td>X0-X3, XL</td>
<td>I</td>
</tr>
<tr>
<td>B3-B7</td>
<td>Right Result Bus</td>
<td>I/O</td>
<td>L0-L7</td>
<td>Left Write Bus</td>
<td>I/O</td>
</tr>
<tr>
<td>Rm</td>
<td>Right Write Control</td>
<td>I</td>
<td>Lm</td>
<td>Left Write Bus</td>
<td>I/O</td>
</tr>
<tr>
<td>C0i</td>
<td>Ignore mask flag</td>
<td>I</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Ca</td>
<td>Phase A</td>
<td>O</td>
<td>RDV</td>
<td>Propagate Control</td>
<td>I</td>
</tr>
<tr>
<td>Ch</td>
<td>Phase B</td>
<td>O</td>
<td>V1-V5, VT</td>
<td>Generate Control</td>
<td>I</td>
</tr>
<tr>
<td>Cz</td>
<td>Phase R</td>
<td>O</td>
<td>G1-G4</td>
<td>Result Control</td>
<td>I</td>
</tr>
<tr>
<td>Cni</td>
<td>Phase R or I</td>
<td>O</td>
<td>n5,n5,n5-n8</td>
<td>A Operand Address</td>
<td>I</td>
</tr>
<tr>
<td>Cni</td>
<td>Phase R or I (slow)</td>
<td>O</td>
<td>Null pad</td>
<td>Ground</td>
<td>—</td>
</tr>
</tbody>
</table>

Table 4.4: B-SYS pin names.

the first two states, the A and B inputs are latched in each ALU. During the last phase \((P_{st})\), the results \(R\) and \(Z\) are calculated and either stored or discarded, depending on the masking conditions. As is perhaps indicated by its name, this phase is longer than the others to allow time for receiving and storing values from adjacent chips. The chip then moves to the neutral \(P\) state, where it remains until signaled (with the init line) to move on, allowing the host to set up the next instruction.

The data outputs from the execution of an instruction are first available during \(P_{st}\).
and remain until they decay or the next \( P_k \) is entered. To aid board design, the chip also provides output of \( P_0, P_1, P_2, P_{n+1} \) and \( P_{n+1} \) for use with edge-triggered latches, and a \( \text{RDY} \) signal. The \( \text{RDY} \) signal is briefly asserted at the end of \( P_n \), and can be used to control the latching of a new instruction. Figure 4.11 illustrates the following just described. During the first cycle of three clocks (K1, K2, and K3) the input signal was asserted, preventing B-SYS from entering \( P_0 \) until the end of the second cycle. As can be seen, the K2 and K3 signals are allowed to overlap.

The layout of a functional unit and register bank pair is shown in Figure 4.12. In addition to the major features to be discussed shortly (the register bank, function blocks, carry chain, flag circuitry, left and right register bank selectors), several control signal buffers can also be seen. This chapter closes with the completely expanded layout of a functional unit and register bank pair on page 74.

Because of the simplicity of the SSR architecture, the 1730 transistors of a functional unit and register bank pair are densely packed in a 999 \( \lambda \times 607 \lambda \) cell. A larger chip with smaller technology (B-SYS used \( \lambda = 1 \mu m \)) could fit 500 or more functional units and register banks on a single chip, producing an even more powerful single-chip systolic co-processor.

### 4.2.3 B-SYS Subsystem Design

Many of the B-SYS subsystems are based on circuits found in the literature, in particular the books by Annaratone and by Weste and Eshraghian. The units have been designed so that control signals can pass over them from left to right in the second metal layer. This design style is crucial for easily forming large arrays of processing elements:
the basic cells can simply be abutted to form all necessary connections. In practice (Figure 4.9), the cells are not quite abutted to allow room for control signal buffers and supply voltage connections.

Subsystems (and the entire chip) were simulated with Carnegie Mellon University’s COSMOS (Compiled Simulator for MOS) switch-level simulator and the Crystal timing analyzer.16,128 It must be noted that Crystal is a notoriously pessimistic timing analyzer, and thus all Crystal results are loose upper bounds. These two tools guided all aspects of the layout.

Register Banks

Storage elements, both the flags and the registers, use a standard six-transistor CMOS static (without a refresh cycle) random access memory (RAM) cell. The cell, a pair of cross-connected inverters, is shown in Figure 4.13. To read a value from the cell, first the bit and the bit’ lines are precharged to Vdd. Then, the desired cell is selected with the sel line. Depending on the value stored, one of the bit lines will be pulled low while the other will remain high. The reason for precharging is size: nFETs do not pass high voltages well, so either precharging or the inclusion of two additional pFETs and the complement of the selection control signal is required. Since the latter choice would significantly increase the size of the RAM cell without a significant decrease in access time (pFETs are slower than nFETs), precharging is used.

To store a value in the cell, one of the bit lines is raised and the other lowered. When
the select line is raised, the cell will flip to store the new value.

At the top of each column of RAM cells (Figure 4.12) is control circuitry for the 16-element column: logic for precharging during the precharge clock $K_1$ of $P_a$ and $P_b$ and for writing to the RAM cell during $P_{+1}$. Above this is the left or right selection block which connects a register bank to the appropriate functional unit during each phase of instruction execution.

Through extensive experimentation with Crystal, optimal precharge transistor sizes (the larger the transistor, the faster the precharge but the slower the signalling of the precharge) and write buffer transistor sizes were determined. The use of a sense amplifier, an analog CMOS circuit which can shorten the read access time of RAM circuits, was found to yield little improvement for B-SYS' small register banks. Crystal estimates that the register precharge time is 20 ns, the evaluate time is 20 ns, and the write time is 25 ns.

Flags

The eight flags are similar in design to the 16-element RAM columns of the register banks. A selected flag is latched into the C register during $P_b$, and the C flag is written during $P_a$. The flag block of Figure 4.12 also controls access to the RAM cells. The first flag ($P_0$), the Cell control line (corresponding to the '!' field of the instruction word in Figure 4.3), and the Cry signal are used to determine whether or not a memory write should take place, indicated with the Cm control line. Cm activates memory writing in both the register banks and the flag block.

Function Blocks

The implementation of an arbitrary function block is illustrated in Figure 4.14. For any combination of $a_1$ and $b_1$ values, a single control line ($P_0$, $P_1$, $P_2$, or $P_3$) is selected. The value on this line is then used as the logic block's result $P$. As has been mentioned,
nFETs do not pass high voltages well. To speed evaluation in this logic block, a weak pull-up pFET is used. Because it is weak, a low signal from a control line will be able to dominate the circuit and turn off the pFET. Although precharging could have been used instead for the P and G logic blocks, it could not have been used for the R block since the \( c_3 \) input to the R block is liable to change after evaluation has commenced. Precharging will not work unless circuit inputs are constant during evaluation. Special clocking could enable the use of precharging, but this solution is deemed better. According to Crystal, the R logic block requires 27 ns to evaluate after receiving an input (\( a_1 \), \( b_1 \), or \( c_2 \)), while the P and G blocks require 15 ns to respond. Changes in control inputs at the chip pads require 23 ns for \( P \), 21 ns for \( G \), and 35 ns for \( R \) to stabilize. Since control words are constant for each individual instruction, these times are not important because they are all shorter than the duration of \( P_a \).

**ALU Latches**

The bottom of each ALU bit-slice includes latches for the \( a_1 \) and \( b_1 \) inputs, inverters to provide their complements, precharge transistors to quicken register bank evaluation, and a tristate buffer for the \( R \) value. The R logic block and these latches are shown in Figure 4.15. This figure is also an excellent illustration of the design style: control signals are passed over the local logic in a regular grid. In each register bank and functional unit pair, 98 control, data, and supply voltage lines pass over all eight ALU and RAM bit-slices, all but ten in the second metal layer. Given the spacing and width requirements.
of this layer, 124 is the maximum practical number of wires which could cross over the cell, though this would eliminate any use of metal2 for local connections. Both transistor logic and communication and control signals are densely packed throughout the B-SYS chip.

Manchester Carry Chain

Instead of the traditional carry look-ahead tree, B-SYS uses a Manchester carry chain, first proposed by Kilburn and others in 1966 when a 24-bit bipolar transistor adder could evaluate in 200 “millimicrosec.” The basic cell is illustrated in Figure 4.16a. It computes complemented carry signals, so that during the evaluate phase (ϕ is high), the output 2 will be grounded if there is a generate signal. Otherwise, the input ε is passed
if there is a propagate signal or the precharged value of $\bar{x} = 1$ remains after evaluation.

As can be seen from the layout (Figure 4.16b), a double carry chain is used. The top carry chain propagates the carry as quickly as possible from $c_0$ to $c_4$ to $s_2$, while the bottom carry chain drives the $R$ logic blocks in the ALU bit-logic (Figure 4.17a). The top $s_3$ is used to drive both the top and the bottom $c_4$, an arrangement suggested by Weste and Eshraghian.

A transistor diagram of a 4-bit segment of the carry chain is given in Figure 4.17b. The input is sent through a dynamic inverter and the output is buffered. This electronic separation of each set of four carry cells is necessary because of the large capacitance present in series chains of transistors.

As an early experiment, a ripple carry ALU was tested for B-SYS. In this design, the carry's computed not by a special carry chain but by another 3:1 arbitrary function logic block. Thus, the $i$-th logic block must be evaluated before the $(i+1)$-st logic block. Crystal experiments with this ripple carry adder show a carry propagation time of 400 ns. The Manchester carry chain requires less than 5 ns to evaluate.

Control Circuitry

The row control logic has two major parts: the control signal generator at the top (the right in the rotated Figure 4.18) and the decoders at the bottom (the left in the rotated figure). The clocks and input addresses pass from the top to the bottom of the cell so
Figure 4.18: Block diagram of the control circuitry.

that rows can be vertically abutted. Also visible are several control signal buffers and the 8-bit data bus from the adjacent row of processing elements turning to meet the next row.

The FSA is of simple design: four static latches, one for each state (from the bottom, $P_a$, $P_b$, $P_c$, and $P_1$), four next state decoders to the left, and four output decoders to the right. The next state decoders latch an input value during $K1$ which is then stored in the static state latch during $K3$ (this is the value seen in Figure 4.11). The value is made available for control signal generation by the output decoders during the next $K1$. The $Ca$ and $Cb$ control signals (which control the $A$ and $B$ operand latches of the ALUs) are asserted only during the $K1$ and $K2$ clocks. The state output decoders for $P_a$ and $P_b$ are more complicated, asserting the $Ci$ line during $K1$ of $P_a$ and maintaining it until the end ($K3$) of the first $P$. The intermediate values, latched during $K3$ in the state latches and connected to the edge of the control cell with the second metal layer, are used to speed address translation.

Decoders

The decoders are based on those of Annaratone. Address decoding commences during $K3$ when the register selection of the previous phase is cleared. At this point, the address to decode ($A$, $B$, or $R$ for register banks) is selected according to the intermediate values of the FSA mentioned in the previous section. The decoders shown in Figure 4.18 continue the address decryption during the precharge ($K1$) part of the current phase. The inverters (labeled decov) allow the decoded value to pass to individual register banks during $K1$, where the signal is buffered again. The value is not passed to the individual RAM cells until $K2$ since otherwise data would be corrupted during the RAM
precharge. The decoder stops driving the address line at the end of K2, long enough for the register banks to dynamically latch the value. The individual register banks will clear the selection during K3.

The purpose of this scheme is, of course, speed. Decoding register addresses and passing the values to register banks takes a relatively long time (89 ns). By controlling the flow of information (when to make a selection) as close to the RAM cells as possible, a quicker response time is available (34 ns).

The layout of the flag decoder is shown in Figure 4.19. The address inputs are F0, F1, and F2. The selection signals s0–s7, seen at the top of the figure, are buffered, enabled, and cleared in the row control block before being passed to the processing elements. In contrast to the register bank selection signals, this is feasible because flag selection lines only drive one flag in each of the eight flag blocks of the row.

Signal Buffering

One of the most important aspects of chip design, as opposed to subsystem design, is signal buffering. Without the comisent use of sufficiently strong buffers with equal rise and fall times, chip performance will be severely lacking. All control signals are generated locally to each row and then passed to the eight ALUs of the row. All eight functional units are driven by the same control signal, each buffering to eight gates. Thus, the row controller passes the control signal to eight buffers which then distribute the signal locally to 64 bit-slices. Typically, the local control signal is generated 20 ns to 30 ns after the clock (K1, K2, or K3) assertion. Earlier experiments in ripple buffering with in-line inverter produced 70 ns to 90 ns delays, provoking this design switch to faster buffering.

Input and Output

Inputs to the B-SYS chip drive at most six gates, one for each row. These are low-capacitance (minimum size) gates except in the case of the CR, CG, and OP control signals. Since a lengthy setup time is available for all these control lines (7R and most of
the increased load does not degrade performance. Outputs from the B-SYS chips use buffer chains of increasing size to drive the large inverters present in the output pads. Driving a signal off-chip through this buffer chain requires approximately 50 ns according to Crystal. Experimental results show this to be much closer to 15 ns than to 50 ns.

The data buses, both left and right, are more complicated. Tri-state I/O pads from Charles Seitz's pad set are used. Input is enabled on the appropriate side when an end processing element's local Tri line is asserted, while output is enabled a short while before this using a row's Ctrl signal. The delay for input enabling is to assure correct treatment of the mask bits. The masking bit for the rightmost processing element is read from off-chip, and this control setup ensures that the correct (non-transient) value of the mask bit is used. In a B-SYS implementation without a mask-bit anomaly (see the footnote on page 50), both the left and the right mask inputs and outputs (depending data flow direction) would be treated this way.

4.3 Design Verification

Logic designs are, of course, worthless without verification. This section briefly considers the logic simulation, timing analysis, and power estimation used to design the Brown Systolic Array chip.

4.3.1 Logic

Logic simulation proved to be an early stumbling block in the design of the Brown Systolic Array because many switch-level simulators are unable to simulate the basic 6-transistor static RAM cell described in Section 4.2.3. Initial simulation used the eam simulator, part of the Magic (version 4) design suite. Discovery of the RAM problem led to the use of Slic, written by Alexander Sherstinsky at the MIT Microelectronics Center. Although this simulator correctly handled the RAM cells, it did not handle indeterminate signals well (for example, an ox gate with one asserted input and one indeterminate input would have an indeterminate output). The third and, thankfully, final simulator used was COSMOS from Ronald Bryant and others at Carnegie Mellon University.

COSMOS is a compiled logic simulator: it generates C and assembly language code allowing the circuit to be run as a program. Although this compilation can make the analysis of simple circuits long and tedious, the performance for large circuits is well worth the trouble. Each subsystem was individually simulated, as were entire rows of eight functional unit and register bank pairs. Chip simulation was performed with only four processing elements in the chip at a time: the remaining ones were replaced with stub modules that passed control and data signals from one side to the other.

4.3.2 Timing

As has been mentioned, timing analysis was performed with the Crystal timing analyzer, a pessimistic design tool. Because of the use of properly designed buffers, the B-SYS
Figure 4.20: Maximum clock speeds from Crystal analysis.

chips can be run faster than the worst-case critical path found with Crystal. For example, the first and the eight’s processing elements do not need to be executing in lock-step, while the first and the second do need to be reasonably well synchronized. Also, note that all outputs occur far from the row control units, thus inter-chip communication will be reasonably well synchronized.

Applying this principle of allowing small local clock skew (and larger global clock skew) to Crystal timing analysis yields the minimum instruction cycle time of 80 ns shown in Figure 4.20. The dominant requirements are as follows. First, 20 ns are required to precharge memory. Second, 10 ns are required to latch a decoded RAM selection while 30 ns are needed to read a RAM selection (writing is faster). Although only 5 ns are needed to clear a memory selection, 50 ns are needed to propagate the left or right aspect of register selection to memory banks, a task which must be completed before entering K2. Finally, an extra 5 ns are allocated to each clock phase to compensate for buffers with slightly unequal rise and fall times.

The actual chips can be expected to perform at this speed, and perhaps at even faster speeds because of the pessimistic nature of the Crystal timing analyzer.

4.3.3 Power

Power estimation is crucial for the sizing of source voltage conductors. From such calculations, the average current densities of wires may be derived. Much the same as the use heavy extension cords with air conditioners, VLSI wires must be appropriately sized according to their expected current flow.

In VLSI, pad power requirements are often dominant because of the large drivers needed to send information off-chip. The pads used for B-SYS feature wide supply voltage conductors (50 μm) which ring the chip and are tied to every power and ground pin. Because B-SYS has a relatively small number of output pads, this design is sufficient.

In addition to pad power use, it is important to ensure sufficient power distribution to the main body of the chip, both during static and dynamic operation. The B-SYS chip has negligible static power requirements: in a steady state, there are no direct

71
connections between power and ground and only sufficient current for maintaining gate voltages is required. This is a marked contrast to nMOS and some CMOS design styles which use the MOS equivalent of pull-up resistors.

Dynamic power consumption is the major current sink in CMOS designs. As gates are switched on and off at high speeds, current must be provided to charge and discharge gate and node capacitances. The general equation for dynamic power use is

$$P_d = CV_d^2 f$$

(4.1)

which is to say that the dynamic power dissipation $P_d$ is equal to the product of the load capacitance, the square of the supply voltage, and the operating frequency. The average current $I_{ave}$ is then

$$I_{ave} = \frac{P_d}{V_{dd}}$$

(4.2)

For the thumbnail B-SYS power and current calculations, the values $V_{dd} = 5 \text{V}$ and $f = 1 \text{MHz}$ were used. Some subsystems switch at higher or lower frequencies, and a corresponding value for $f$ was used. Node capacitances were obtained from Crystal and conservative estimates.

The worst case power consumption of a B-SYS functional unit and register bank pair is $P_d = 7.2 \text{mW}$, or $342 \text{mW}$ for all 47 blocks in the B-SYS chip. These correspond to average current requirements of $1.44 \text{mA}$ per block and $69 \text{mA}$ for the 47 blocks. Adding the control circuitry and other buffers to this total yields a conservative maximum power estimate of $375 \text{mW}$ or average current of $75 \text{mA}$ for the entire chip, excluding pads. The rule of thumb for conductor sizing is that at most $1.0 \text{mA}$ per micron of metal width should be used. (Conductor sizing is not based on $\lambda$ units.) For B-SYS' $75 \text{mA}$, a $75 \mu\text{m}$ conductor is needed to provide sufficient current to the body of the array. Referring to Figure 4.9, a $44 \mu\text{m}$ wire is stretched between pads V1 and V4 as well as G2 and G5, a width sufficient for $88 \text{mA}$. Because the wires are so thick, it was important to position these main supply pads carefully to avoid any bends in the wire (a bend in a $44 \mu\text{m}$ wire requires $88 \mu\text{m}^2$).

Each B-SYS row has a grid of 17 power lines and 16 ground lines. In the case of ground, the 16 lines have a total width of over $50 \mu\text{m}$, well above the requirements for carrying $12 \text{mA}$ of current to each row. Between processing elements, the power and ground lines are tied together to ensure that no supply line carries a disproportionate amount of current.

4.4 Conclusions

This chapter has considered all aspects of the B-SYS design, from the architecture to CMOS layouts to assign verification. As has been seen, the design of B-SYS is the result of both interactions between the design levels and experimental testing within levels as the design progressed. A strict layout style was used to simplify the placement of subsystems on the chip. The Brown Systolic Array's Magic design hierarchy has 100 cells, only 46 of which
contain transistors. Although each B-SYS chip has 85,000 transistors, less than eight tenths of one percent of the transistors (656) had to be laid out by hand. Such relative simplicity in the design of powerful co-processors is a prime advantage of the Systolic Shared Register architecture and systolic arrays in general.

The extensive testing and analysis of all parts of the architecture and design, using the B-SIM simulator, a logic simulator, and a timing analyzer, were crucial to creating a chip that was to work on first fabrication.
Figure 4.21: Fully expanded B-SYS functional unit and register bank.
Chapter 5

Fault Testing and Fault Detection

Faults are an ever-present fact of hardware design, both in the manufacture and assembly of hardware systems as well as in their daily use. This chapter considers first the testing of the Brown Systolic Array chips and the assembly of the B-SYS prototype system. Then, moving to a more abstract level, the problem of transient fault detection during array operation is examined. The principle of software fault detection for programmable systolic arrays, a flexible alternative to hardware methods, is introduced.

5.1 Overview

There are two types of fault detection: off-line and on-line. Off-line testing is used to find permanent faults created by fabrication errors, an unavoidable aspect of VLSI manufacture. This initial screening is performed by shifting the appropriate test vectors into and out of the chip and then checking the results. In the case of the Brown Systolic Array, this was done with the aid of a logic analyzer and pattern generator, as described in Section 5.2.

On-line fault detection takes place as the array is being used, and can find transient (one-time) and intermittent (recurring) faults as well as new permanent faults. Having such a short duration, transient faults are the most restrictive case and thus the most important to consider. Faults can occur in functional units, register banks, and control circuitry; a fault in reading an operand, evaluating an instruction, or writing to memory will corrupt the result.

Before continuing, it is necessary to both define the term "fault" and consider the generic causes of faults. The execution of an instruction is faulty if, after execution, the value stored in some register of some register bank and the expected value of that register are not identical. A fault is detectable if, in a fault-free computation after the occurrence of the fault, there exist two registers in the neighborhood of a specific functional unit that are unequal as a direct result of the fault. Of course, the assumption of the equality of the two registers is more restrictive than necessary: any form of redundant computation will
suffice, such as working with logical compliments, multi-base arithmetic, and shifted
operands. In general, the detection method is assumed to be implemented efficiently in
the functional units. Because additional faults can corrupt the redundant computation,
it is kept to detect the fault soon after its occurrence.

Every hardware implementation has its own probable causes of faults and linkages
between them. However, there are three basic categories of faults which bear considera-
tion.

1. A fault may be randomly and independently introduced by some external cause
(e.g., gamma rays). These are called independent faults.

2. A nearby fault may trigger another fault. For example, an entire register bank or
control block could fail, resulting in a large number of faults. These are referred
to as space-linked faults.

3. A fault may persist for several clock cycles. These time-linked faults take into
account that a fault is not necessarily a single event lasting for the duration of a
single instruction. Permanent faults are time-linked faults which last forever, while
intermittent faults have higher than normal chances of returning.

Although it is not the purpose of this discussion to provide in-depth probabilistic analysis,
it will prove useful to keep these different fault models in mind when considering fault
testing and detection.

5.2 Fault Testing

Twenty-four Brown Systolic Array chips were received from MOSIS in the late May of
1990. The chips were subjected to a battery of fault testing routines with the aid of
a logic analyzer and pattern generator. Ten chips passed the test procedures and were
combined to form the L-SYS prototype system. Despite the prototype’s limitations
(namely, a slow bus), the 470-element array can perform over one hundred million 8-bit
operations per second (108 MOPS).

In addition to describing the test procedure and the B-SYS prototype board, this
section examines the performance of a sequence comparison program used as the final
stage of the test procedure and considers the design of an ideal board for the Brown
Systolic Array.

5.2.1 Chip Screening

In spite of the simplicity of the Systolic Shared Register architecture and the depend-
ability of Magic design, no complicated CMOS circuit can escape fabrication defects.
Fortunately, testing an SSR machine is a straightforward task: the regularity of the
architecture and its small number of basic units lead to simple test procedures, a great
contrast to general-purpose microprocessors. The separation of register banks and func-
tional units implies that, to a large extent, the testing of these two units can be performed

76
separately. Since information is passed into and out of an SSR machine using the register banks, they are tested first. Streaming numbers through the register banks will, in addition to testing the registers themselves, perform minor testing on the functional units and control circuitry. After checking the performance of the register banks, the test procedure moves to a more exhaustive verification of functional units. Each logical component of the functional units (the logic blocks, carry chain, and flags) is individually tested to pinpoint any defects on the chip. The information generated by these tests may then be used to enhance the yield of future designs.

The Brown Systolic Array chips were tested for defects using a Hewlett Packard 16500A logic analysis system, including a 16510A state and timing analyzer and a 16520A pattern generator. The chips were controlled by the pattern generator and their responses verified by the state analyzer. The initial setup buffered all pattern generator outputs to the chip in a manner similar to that of the final prototype board. This setup design motivated by the desire to combine the functionality of the test frame with that of the final prototype board as much as possible.

After confirming that a chip's clock and control circuitry worked correctly using the diagnostic outputs listed in Table 4.4 on page 61, the following tests were performed with the pattern generator and state analyzer:

**AA55** The AA55 test shifts the numbers $A_{10}$ through $A_{14}$ through the array (these numbers will detect both bridge faults between adjacent bits and stuck bits). Streaming these numbers through the array in one direction will quickly discover most functional unit, register bank, and control circuit errors. However, no information about the location of the fault (or faults) is available.

**InOut** The InOut test shifts numbers into the array from one side and then shifts them out again from that same side. Using this test, the location of register bank and functional unit errors can be identified. For example, if there is a fault in $F_5$ (the fifth functional unit from the west), this will be reflected in five correct results followed by 42 incorrect results when the InOut test is performed from the west side. Testing from both the west and the east side can isolate control errors to specific rows.

**Flags** The Flags test first sets and clears the flags in each functional unit using the Zero and Zone control words of Table 4.3 on page 53. Next, the flags are shifted into a working (as determined by the previous tests) register, say $W_0$, by performing eight additions of the type $W_0 = W_0 + W_0 + Z$. Finally, these results are shifted out and compared to the expected values.

**AddSub** The AddSub test also checks the carry chains and the $P$ and $G$ logic blocks. First, one flag is cleared and one register is set to 1. Then, two streams are shifted through the array, one stream having 1 subtracted from it and the other having 1 added to it by each functional unit. The result of this test should be the alternating numbers 48 and 209 ($256 - 47$).
Mask The Mask test is, as its name implies, a test of the masking capabilities of the functional units. First, half the context flags are set (odd functional units) and the other half cleared (even). Then, two streams of numbers are shifted into the array. The functional units with asserted context flags copy the first stream to the second, and the resulting stream is shifted out of the array. The test is then repeated, reversing the two sets of functional units.

The state analyzer was used to record invalid output for analysis. By far the most frequent cause of testing errors was chip seating, or ensuring that all 84 pins were firmly connected to the PGA socket. The data busses (R7-R0 and U7-L0) were particularly critical, perhaps because their drivers are less powerful than those of the latch chips, and thus less able force signals over bad connections.

The results of these tests are displayed in Table 5.1. As can be seen, ten of the chips worked correctly. Also, the test suite was able to provide exact information about many of the fabrication defects. It is difficult to draw any conclusions from the faults both because of the small sample size and the lack of information about the positioning of the chips on the wafer. However, all the defects cluster around the outside of the chip (row 0, row 5, the control circuitry, and processing elements close to the edge of the chip, as seen in Figure 4.9 on page 60), possibly implying that the defects occurred in the process of

<table>
<thead>
<tr>
<th>Chip</th>
<th>Comments</th>
<th>Chip</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Works.</td>
<td>13</td>
<td>Row 5 stuck low, perhaps caused by bad precharge phase. Works.</td>
</tr>
<tr>
<td>2</td>
<td>(F_2) bit 2 stuck low. Masking does not work.</td>
<td>14</td>
<td>Works.</td>
</tr>
<tr>
<td>3</td>
<td>Right register bank of (F_2) does not work.</td>
<td>15</td>
<td>Works.</td>
</tr>
<tr>
<td>4</td>
<td>Works.</td>
<td>16</td>
<td>Works.</td>
</tr>
<tr>
<td>5</td>
<td>Row 3 does not address (F_0) correctly, producing a stuck-at-1 fault.</td>
<td>17</td>
<td>(F_2) bit 6 stuck low.</td>
</tr>
<tr>
<td>6</td>
<td>Control circuitry in row 0 does not work.</td>
<td>18</td>
<td>Output does not work, always high.</td>
</tr>
<tr>
<td>7</td>
<td>Works.</td>
<td>19</td>
<td>Row 5 stuck high, perhaps caused by bad decode phase.</td>
</tr>
<tr>
<td>8</td>
<td>Works.</td>
<td>20</td>
<td>Undependable masking of (F_2) bit 0 and (F_2) bit 2.</td>
</tr>
<tr>
<td>9</td>
<td>Works.</td>
<td>21</td>
<td>Row 5 stuck high, perhaps caused by bad decode phase.</td>
</tr>
<tr>
<td>10</td>
<td>Control failure.</td>
<td>22</td>
<td>Control errors in row 2.</td>
</tr>
<tr>
<td>11</td>
<td>Works.</td>
<td>23</td>
<td>(F_2) bit 0 stuck low. (F_{10}) bit 0 stuck high. Error in CR signals or blocks. Works.</td>
</tr>
<tr>
<td>12</td>
<td>Flags in rows 0 and 1 do not work.</td>
<td>24</td>
<td></td>
</tr>
</tbody>
</table>

Table 5.1: B-SYS chip statuses.
cutting the wafer into chips or bonding the chips to the PGA packages. Unfortunately, apart from using much less of the chip's total area, such faults are difficult to avoid by design modification.

The chips were tested with a 100 ns clock cycle time (K1 to K3), as illustrated in Figure 5.1. Because of the pattern generators' timing gradations, 50 ns was the next fastest speed that could be checked, and a speed at which the support logic and timing analyzer combination could not reliably deliver clock signals to the chip. Experimentation with the clock phases at 100 ns seems to indicate, however, that the B-SYS chips can function at considerably higher speeds.

5.2.2 The Prototype Board

A prototype board for the ISA* bus was acquired from JDR Microdevices. This board, configured as an I/O device, provides both bus interface circuitry and address decoding. To the board's bus interface, latches for instructions, inputs, and outputs were added, as well as room for 12 B-SYS chips (Figure 5.2, the support logic is itemized in Table 5.2). Headers were used during chip testing to connect the pattern generator outputs to the instruction and input latches. The board's one-thousand connections were made by the author with a hand-held wire-wrap gun.

The B-SYS prototype array uses the ISA bus' I/O addresses 300 through 30C. The first three 16-bit words are for instruction and input (I7-I0) signals to the array while the last is used to read output values (O7-O0) from the co-processor. The meanings of individual bits are shown in Table 5.3.

As the board was assembled (running into many chip seating problems), several test programs were executed to check its operation. In addition to assembly language and

*Industry Standard Architecture, IEEE Standard P996, also referred to as the AT bus.

5On the subject of address decoding, the board's preprogrammed PAL (programmable array logic) chips did a less than admirable job of interpreting bus traffic. The first set of PALS interfered with bus traffic whenever a read operation to the graphics card took place. A second set of PALS was tried, but this set interfered with disk I/O. The first set was chosen as the lesser evil.
Figure 5.2: B-SYS prototype board.

<table>
<thead>
<tr>
<th>Label</th>
<th>What</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td>1-12</td>
<td>PGA</td>
<td>84-pin PGA sockets for 12 B-SYS chips</td>
</tr>
<tr>
<td>A</td>
<td>373*</td>
<td>1-to-2 latch for X1, X2, X3, and loc.</td>
</tr>
<tr>
<td>B</td>
<td>373</td>
<td>Latch CG3-C60 and CP3-CP0</td>
</tr>
<tr>
<td>C</td>
<td>373</td>
<td>Latch CR7-CR0.</td>
</tr>
<tr>
<td>D</td>
<td>373</td>
<td>Latch X1, X3-X6, B1, B3, B2</td>
</tr>
<tr>
<td>E</td>
<td>373</td>
<td>Latch B1-B0, A1, A3-A0.</td>
</tr>
<tr>
<td>F</td>
<td>373</td>
<td>Latch Cell, Z2-Z0, C1-C0.</td>
</tr>
<tr>
<td>P</td>
<td>00</td>
<td>NAND, control signal inversion</td>
</tr>
<tr>
<td>Q</td>
<td>00</td>
<td>NAND, enable signal for N.</td>
</tr>
<tr>
<td>HA</td>
<td>Hdr</td>
<td>Clock input header</td>
</tr>
<tr>
<td>HB</td>
<td>Hdr</td>
<td>D header, now tied to L.</td>
</tr>
<tr>
<td>HC</td>
<td>Hdr</td>
<td>C header, now tied to M.</td>
</tr>
<tr>
<td>HD</td>
<td>Hdr</td>
<td>D header, now tied to L.</td>
</tr>
<tr>
<td>HE</td>
<td>Hdr</td>
<td>E header, now tied to M.</td>
</tr>
<tr>
<td>HF</td>
<td>Hdr</td>
<td>F header, now tied to L.</td>
</tr>
<tr>
<td>H1n</td>
<td>Hdr</td>
<td>G/2 header, now tied to M.</td>
</tr>
</tbody>
</table>

Table 5.2: B-SYS prototype board key.

- *374000: Quad NAND gate
- *374023: Tri-state octal D-type latch
- *374042: 1-to-2 header suitable for pattern generator output
<table>
<thead>
<tr>
<th>Address</th>
<th>15</th>
<th>14</th>
<th>13</th>
<th>12</th>
<th>11</th>
<th>10</th>
<th>9</th>
<th>8</th>
<th>7</th>
<th>6</th>
<th>5</th>
<th>4</th>
<th>3</th>
<th>2</th>
<th>1</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<td>300</td>
<td>CG3</td>
<td>CG2</td>
<td>CG1</td>
<td>CG0</td>
<td>CP3</td>
<td>CP2</td>
<td>CP1</td>
<td>CP0</td>
<td>CR7</td>
<td>CR6</td>
<td>CR5</td>
<td>CR4</td>
<td>CR3</td>
<td>CR2</td>
<td>CR1</td>
<td>CR0</td>
</tr>
<tr>
<td>304</td>
<td>X1</td>
<td>X2</td>
<td>X2</td>
<td>X1</td>
<td>X0</td>
<td>B4</td>
<td>B3</td>
<td>B2</td>
<td>B1</td>
<td>B0</td>
<td>A1</td>
<td>A0</td>
<td>A3</td>
<td>A2</td>
<td>A1</td>
<td>A0</td>
</tr>
<tr>
<td>308</td>
<td>Z1</td>
<td>Z2</td>
<td>Z2</td>
<td>Z1</td>
<td>Z0</td>
<td>C2</td>
<td>C1</td>
<td>C0</td>
<td>17</td>
<td>16</td>
<td>15</td>
<td>14</td>
<td>13</td>
<td>12</td>
<td>11</td>
<td>10</td>
</tr>
<tr>
<td>30C</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 5.3: B-SYS prototype I/O addresses.

---

Figure 5.3: Prototype clocking.

C versions of the pattern generator tests, more sophisticated programs, such as sorting and sequence comparison, were tested.

5.2.3 Prototype Performance

Figure 5.3, a transcription of logic analyzer data, illustrates clocking on the prototype system. The clocks are produced by the pattern generator which triggers a set of five clock cycles whenever the third word of an instruction is written to the board (I/O address 308). During the first set of three clocks (K1, K2, and K3) the init signal was asserted, preventing B-SYS from entering P8 until the end of the second cycle. As can be seen, the processing speed of the B-SYS prototype board is overwhelmed by the time required to write three 16-bit words to the prototype board (IOW is the ISA bus' asserted low I/O device write line). Also note that the decay of B-SYS outputs, mentioned on page 62, takes place approximately 2 ms after the end of an instruction.

As the board was assembled, correct execution of a typical application was verified. The sequence comparison problem of Section 2.1.3 was solved both on B-SYS and on
<table>
<thead>
<tr>
<th>Processors</th>
<th>File Access</th>
<th>C</th>
<th>B-SYS</th>
<th>B-SYS MOPS</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>94</td>
<td>7.6</td>
<td>63.0</td>
<td>4.2</td>
<td>24</td>
<td>15.0</td>
</tr>
<tr>
<td>141</td>
<td>9.1</td>
<td>144.</td>
<td>7.9</td>
<td>32</td>
<td>18.8</td>
</tr>
<tr>
<td>188</td>
<td>9.9</td>
<td>216.</td>
<td>10.6</td>
<td>43</td>
<td>20.3</td>
</tr>
<tr>
<td>235</td>
<td>10.6</td>
<td>336.</td>
<td>13.2</td>
<td>54</td>
<td>25.5</td>
</tr>
<tr>
<td>282</td>
<td>33.5</td>
<td>460.</td>
<td>15.8</td>
<td>65</td>
<td>30.4</td>
</tr>
<tr>
<td>339</td>
<td>33.6</td>
<td>653.</td>
<td>18.3</td>
<td>76</td>
<td>35.7</td>
</tr>
<tr>
<td>376</td>
<td>33.3</td>
<td>876.</td>
<td>20.9</td>
<td>86</td>
<td>40.9</td>
</tr>
<tr>
<td>423</td>
<td>33.5</td>
<td>1106.</td>
<td>23.4</td>
<td>97</td>
<td>46.2</td>
</tr>
<tr>
<td>470</td>
<td>33.4</td>
<td>1330.</td>
<td>26.3</td>
<td>108</td>
<td>50.0</td>
</tr>
</tbody>
</table>

Times are in milliseconds.
C and B-SYS times ±5%, file access times ±3%

Table 5.4: B-SYS sequence comparison timings.

the 80386, and then the results were compared (the B-SYS solution will be described in Section 7.1). Timings for this algorithm, as well as the disk access time for loading the sequences, are shown in Table 5.4. In all cases, the routine had to be executed multiple times to compensate for the low granularity of the system clock. For the full array, the C routine was executed 5 times and the B-SYS routine 213 times. It is interesting to note that although the prototype board is slow, it can keep pace with file access to the system's hard disk.

In addition to this algorithm, Chapter 7 analyzes a faster sequence comparison algorithm which performs over 80 times faster than the 80386, as well as a variety of other applications.

5.2.4 The Ideal Board

As is illustrated by the timing diagram (Figure 5.3), the B-SYS prototype board does not provide an efficient interface between the host and the co-processor. The large amount of time spent accessing the board reduces performance so a mere 108 MOPS, or 0.23 MOPS per processing element. This section considers ways to drive the array at full speed, approaching the 4 MOPS per processing element indicated by the optimal clocking scheme of Figure 4.11 on page 62.

Systolic algorithms typically repeat a single cell program many times. Thus, over the course of a computation, only a small set of instructions is needed. This leads to the first refinement: instead of requiring the host machine to access the board for each instruction, a collection of instructions could be preloaded to an on-board instruction memory and indexed by the host. To execute a preloaded instruction, the host would send, for example, an 8-bit instruction address and an 8-bit input value to the board, reducing the bus use to once per instruction (twice if operands must be read from the board). Such an arrangement could triple the speed of the co-processor, moving it up to 0.75 MOPS per processing element. In spite of this improvement, the processing speed would still be controlled by the bus access time since the host must access the board to
execute each instruction regardless of whether or not it must transfer data to or from the board. Performance could be further improved by decoupling the board and the bus.

The final refinement is to include both a program memory and a microsequencer on the co-processor board. Also, to allow asynchronous operation, blocking input and output queues could be used to transfer data between the host and the board. This board, referred to as B-SYS*, is illustrated in Figure 5.4 (note that it is decidedly not an ISA board). Entire routines would be preloaded to the board and could be initiated with a single command. The program, in addition to storing the required instructions, would also have specifications for default inputs (such as zero or the value of an on-board register) to limit the use of the queues. The host could then fill the input queue and remove data from the output queue at its leisure, though it must ensure that deadlocks do not occur. Also, this system would use the B-SYS diagnostic outputs to control instruction, input, and output latching. Such a system could drive the array very quickly, perhaps increasing performance to 2 MOPS or more per processing element. With this setup, a one-board, 32-chip array (1504 processing elements) could provide 3 GOPS of systolic co-processing power.
5.3 Fault Detection

Having described permanent fault testing for the Brown Systolic Array chips, this section turns to transient and intermittent fault detection during the execution of arbitrary systolic programs. Fault detection research has generally focused on two strategies: hardware fault detection and algorithmic fault detection. Typical hardware fault detection methods duplicate each computation with additional hardware units and compare outputs for discrepancies with additional circuitry. Algorithmic methods focus on the high-level algorithm; matrix operations may be made fault-tolerant if checksum rows and columns are added to the arrays. Such an approach requires no special hardware but is limited in its utility because rigorous mathematical analysis is required for each new application.

This section proposes a third approach: software fault detection for programmable systolic arrays. Automatic program transformations for creating fault-tolerant programs are considered, as well as refinements which make use of specific features of the original program. Unlike algorithmic methods, software fault detection does not rely on a mathematical study of the problem being solved. Unlike hardware methods, the user has full control over all available resources (processors, memory, and time). The user can make the computation more robust by allocating additional resources to fault detection or even eliminate the software fault detection if either an alternate type of detection or no detection is preferred.

Thus, software fault detection methods allow a programmable systolic array to be used both when obtaining maximum performance is the primary concern and when fault detection is so important that the user is willing to trade decreased performance for an increased ability to detect faults. Instead of building two separate systems, one optimized for performance and the other for fault detection, software fault detection makes use of existing hardware with no special modifications; the inherent redundancy of the systolic array is used for fault detection. Systolic programs are automatically transformed from their original version to a version able to detect any predetermined number of both permanent and transient faults. Unlike algorithmic schemes, software fault detection is easily automated. The tradeoff is that more robust programs require either more time to execute or more processing elements to execute on.

5.3.1 Hardware Fault Detection

Perhaps the best known approach to fault detection and correction is triple modular redundancy (TMR). In this classic scheme, each computation is performed on three independent hardware units which then vote to determine the result. A generalization of this is d-way modular redundancy (d-MR). For example, 3MR allows only single fault detection without correction. TMR, in addition to more than trebling the hardware cost, will slow down the computation: each comparison or vote adds hardware complexity and computational delays.

*Parts of this section were presented by the author at the Second IEEE Symposium on Parallel and Distributed Processing.*

24
In the roving spares method (RS) proposed by Shombert and Abraham,\textsuperscript{143} $k$ extra processing elements independently "roam" through the array checking processors in turn. Since processors do not actually move, each processing element must be able to act as a spare or as an active processor. Roving spares work best with multiple instruction and multiple data stream (MIMD) computation since the spares will be able to act autonomously. Unless there is one spare for each processing element, a transient fault can go undetected.

Token triggered comparison with duplicate data (TTCDD), presented by Choi and others, is more appropriate for machines with either broadcast instructions or no instructions (i.e., systolic machines in which each cell performs some fixed function each and every time unit).\textsuperscript{27} In this approach, $k$ extra processing elements must be added to the array as well as extra routing and comparison circuitry for each processing element. As the computation proceeds, a sequence of $k$ tokens is cycled through the array. These tokens determine which computations should be replicated for fault testing. In effect, these are roving spares without autonomy. As with the RS method, unless $k = N$, the size of the original array, transient and intermittent faults can go undetected.

5.3.2 Software Fault Detection

Checksum methods for detecting faults during the execution of parallel matrix operations have recently gained attention.\textsuperscript{12,26} By adding checksum entries to matrices, 85% or better fault coverage may be achieved for problems such as matrix multiplication, fast Fourier transform, $QR$ factorization, singular value decomposition, and adaptive filtering.\textsuperscript{11} This tolerance is accomplished entirely in software; complicated hardware changes are not required. However, these routines do not apply to general systolic programs but instead result from a careful analysis of a specific computation. This section proposes fault detection methods for arbitrary algorithms. The user is not required to perform a complicated analysis to determine the means of fault detection. Instead, simple schemes which duplicate computation in software are considered.

As an example, consider the code fragment

$$E_0 \leftarrow \min(W_0, W_1). \quad (5.1)$$

Since this statement not only performs a minimization but also moves data eastward,
making this single statement fault-tolerant will illustrate the process of making entire programs fault-tolerant.

To detect transient faults in arbitrary programs, computation must be duplicated and monitored: without complete coverage, a transient fault can be missed. Since arbitrary programs have no predetermined relationship between the data in various parts of the processor array (as opposed to checksums), full duplication is required. General fault detection thus requires at least twice as much time or at least twice as many processing elements as the original program. With software fault detection, even when twice as many processing elements are used, the actual hardware remains fixed: special routing and comparison circuitry is not required by the method.

Perhaps the most obvious method of software fault detection is to use a gross replication of the computation, shown in Figure 5.5, in which output results are compared as they leave the array. The two computation streams, $A$ and $A'$, should produce identical results — after the execution of each instruction, $A_i$ should equal $A'_i$ for all i. Whenever the results from the $A$ and $A'$ execution are not identical, a fault has occurred. Gross replication of computation requires double the amount of time or equivalently, double the number of functional units and memory nodes. For all this effort, only one comparison is made during each time step, leaving the potential for many undiscovered faults. If the same hardware is used for both computation streams (time multiplexing), the same time-linked fault could occur in both streams and go undetected. Also, gross duplication of computation will not provide any indication as to where the fault occurred. Locating faults is critical for array reconfiguration and software avoidance; it may be desirable to reconfigure the array to avoid functional units or memory banks with intermittent faults. Without the location, one may as well replace the entire array and hope for the best. Parallel fault detection methods are a vast improvement over this simple scheme.

A better way to detect transient faults on an SSR machine is to replicate the computation stream as shown in Figure 5.6. Two identical (in the fault-free case) computation streams flow through the array at the same time. Since there are two different and neighboring memory locations which, in a fault-free computation, are identical, faults are detectable in the sense of Section 5.1. Comparisons may be made at all functional units as frequently as desired, providing a strong method for detecting and locating transient faults. Thus, fault detection can occur close, in both time and space, to the actual fault. For the moment, it will be assumed that functional units have a dependable (faultless) method for reporting faults, such as the wired-or is Figure 5.6. Taking the statement of Example 5.1, the $A'$ computation stream is assigned the second block of
Figure 5.7: Skewed replicated computation streams for fault detection.

four registers (i.e., \( E_0 \) is shadowed by \( E_t \)). The code is then expanded to process both computation streams and relay the fault information to the host computer:

\[
E_0 \leftarrow \min(W_0, W_t) \quad (5.2.1)
\]
\[
E_t \leftarrow \min(W_t, W_1) \quad (5.2.2)
\]
\[
wired-\text{or} \leftarrow (E_t \neq E_0). \quad (5.2.3)
\]

Here, the first instruction (5.2.1) operates on the \( A \) (original) stream, while the next instruction (5.2.2) operates on the \( A' \) stream. Input values must be duplicated and made available to both the \( A \) and \( A' \) computation streams.

Most systolic algorithms have both moving and fixed data streams. Since errors in the computation of fixed (local) data will be reflected latter in the moving data stream, it is not necessary to compare entire register banks. Only those registers used by the moving data streams need to be checked (by comparing corresponding values in the computation streams).

One of the problems with the simple duplication scheme just described is the ability for a corrupt register bank to go unnoticed. If, for example, a fault results in a bit position always being written as a ‘1’ throughout an entire register bank, the \( A \) and \( A' \) values would be both the same and incorrect. Although computing the \( A' \) stream as the logical negation of the \( A \) stream would solve this case, bridge faults (in which neighboring nodes are forced to the same value via a connecting bridge) would remain a problem. The solution to this space-linked problem is the skewed replicated computation streams (SRCS) method of permanent and transient fault detection (Figure 5.7). The computation is again replicated on two independent computation streams, \( A \) and \( A' \), however the streams are skewed by one register bank. Faults are still detected by comparing corresponding \( A_i - A'_i \) pairs using a broadcast instruction, and when they differ a fault is signalled.

Further replication to \( d \) computation streams will provide better performance in terms of the worst-case number of detectable faults (\( d - 1 \)).

Using this method, the fault-tolerant version of Example 5.1 is:

\[
E_0 \leftarrow \min(W_0, W_1) \quad (5.3.1)
\]
\[
E_t \leftarrow \min(W_t, W_5) \quad (5.3.2)
\]
\[
wired-\text{or} \leftarrow (W_t \neq E_0). \quad (5.3.3)
\]

Of course, this transformation is nearly identical to that of Example 5.2, however the skewing of the computation streams is a critical conceptual point. Again, although
Example 5.1 expanded to three statements in Example 5.3, SRC6 does not generally produce a factor of three slow-down since a systolic cell program with $S$ non-local data streams expands an SRC6 program with $2f + S$ statements, where $f$ is the original number of instructions. For added fault detection (or correction), SRC6 can be extended to $d$ computation streams requiring cell programs of size $df + S(d - 1)$.

SRC6 has several advantages over the previous two schemes. First, unlike gross duplication of computation, the location of the faulty processor can be determined by host query. Such faulty processors may then be avoided with appropriate software modifications or (if available) array reconfiguration. Second, the skewing of the streams protects the system from catastrophic memory bank failure: since $A_i$ and $A'_i$ are in different memory banks, a corrupt memory bank will not invalidate the method. Also, since $A_i$ and $A'_i$ share a common functional unit, no extra data communication is required for the comparison. If that common functional unit is faulty, an adjacent functional unit will be able to detect the problem when it operates on a different data set (i.e., $A_{i-1}$ or $A_{i+1}$).

For programmable systolic arrays with small amounts of memory, the use of skewed replicated computation streams could overwhelm the register banks. The solution to this problem is to use twice as many functional units and register banks, interleaving the $A$ and $A'$ data sets in neighboring registers (Figure 5.8). This method of interleaved replicated computation streams (IRCS) has the flavor of a hardware approach but does not require the use of special circuitry or modified functional units. As one might expect from the use of more processors, an IRCS program will be faster than the corresponding SRC6 program. Using IRCS, even functional units process one computation stream ($A$) while odd units process the other ($A'$). Reading from and writing to the west register bank takes place as normal (i.e., even processing elements have $A$ values and odd have $A'$ values for their respective west register banks). Accessing the east data set is more complicated since it is no longer adjacent to the appropriate functional unit. For each original access of the east register bank, a SIMD instruction to shift values from east to west (for a read) or west to east (write) must be added to create the fault-tolerant version. Example 5.1 is transformed via IRCS to

$$E_0 \leftarrow \text{mix}(W_0, W_1) \quad (5.4.1)$$

$$E_0 \leftarrow W_0 \quad (5.4.2)$$

$$\text{wreed-0s} \leftarrow (E_0 \neq W_0) \quad \text{and} \quad (\text{Odd processor}) \quad (5.4.3)$$

Here, instruction (5.4.2) shifts values throughout the array to the east (recall that $W_0$
is \( E \) for an adjacent processor). For better fault protection, a data check in the even processors similar to statement (5.4.3) may be added after statement (5.4.1). As with SRCS, the number of extra instructions depends on the number of moving data streams in the cell program. Taking this number \( S \) into account, the SRCS cell program will have \( I + 2S \) instructions. Thus, an SRCS program does not increase the leading coefficient of the execution time, yielding nearly the same performance as the original program.

This method may be extended to \( d \) replications of the computation, in which case \( dN \) processors are required to execute the cell program of \( I + 2S(d - 1) \) instructions.

5.3.3 Propagation of Fault Information

There are two problems with using a wired-on bus for fault reporting. First, indication of the location of the fault is not available. Second, of course, this is a hardware fault notification method. Since it requires the availability of special hardware, it is not useful for some extant arrays, B-SYS among them. The alternative is a data stream for fault notification which, unfortunately, might be compromised by a faulty register bank or functional unit. This section considers methods for generating dependable fault detection streams in software.

First, word-size redundancy in the fault stream can increase its dependability. In the B-SYS case, this would imply that one 8-bit value indicates a correct computation and the 255 other values indicate a faulty one. For example, any non-zero value could represent a fault. Such a scheme will provide some resilience to single-bit errors.

The two preferred methods of the previous section, skewed replicated computation streams and interleaved replicated computation streams, protect against functional unit errors by computing corresponding parts of the two computation streams in different functional units and register banks. If there is only one fault stream, it could be corrupted by a single functional unit or register bank. To provide added security (beyond the word-size redundancy of the error message), two fault detection streams may be used, one traveling in each direction. With two notification streams, fault location can be readily determined without host query: if fault flags exit opposite ends of the array during the same step, the fault occurred in the central functional unit or one of its register banks. Failures in other functional units and register banks can be pinpointed by observing the delay between fault reports in the two fault streams, a manner reminiscent of the InOut test used to screen the B-SYS chips. Multiple transient faults cannot be located with ease because of the I/O bounds on lines: systolic arrays. If only one stream indicates a fault, an error in one of the fault detection streams has occurred.

The use of fault detection streams does not significantly increase the size of the fault-tolerant program. Locally generated fault information on the \( S \) data streams of a program can be combined as it is computed. Then, this information can be combined with the fault detection stream as it is shifted though the array, a feature of the SIS architecture. Thus, one instruction per cell program iteration per fault detection stream is added by the use of software fault notification.
<table>
<thead>
<tr>
<th>Resource</th>
<th>Software Approaches</th>
<th>Hardware Approaches</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hardware</td>
<td>$N+1$</td>
<td>$2N+1$</td>
</tr>
<tr>
<td>Time</td>
<td>$T^{\frac{N+1}{2}}$</td>
<td>$T^{\frac{N}{2}}$</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Hardware</th>
<th>$N$-RS</th>
<th>$N$-TTCDD</th>
</tr>
</thead>
<tbody>
<tr>
<td>Time</td>
<td>$T(1+\delta)$</td>
<td>$T(1+\delta)$</td>
</tr>
<tr>
<td></td>
<td>$T(1+\epsilon)$</td>
<td>$T(1+\epsilon)$</td>
</tr>
</tbody>
</table>

Table 5.5: Comparison of single fault detection methods.

5.3.4 Analysis

As mentioned earlier, algorithmic methods are perhaps the most efficient means of fault detection but cannot be applied to general programs. This section will restrict itself to comparing skewed replicated computation streams and interleaved replicated computation streams with the hardware methods previously described.

For all practical purposes, RS and TTCDD require $k = N$, since if $k < N$ the probability that a transient fault will not be detected because the spares (or tokens) were in the wrong place is $1 - \frac{k}{N}$. Of course, any single permanent fault will be detected with $k = 1$, since the token or spare will eventually arrive at the faulty processing element.

Table 5.5 summarises the methods when applied to the task of detecting any single transient fault. In this table, $N$, $T$, $I$, and $\delta$ are the number of processors, amount of time, and number of cell program instructions and moving data streams used by the original code. All running times in Table 5.5 are in terms of the original running time $T$. The $\epsilon$ and $\delta$ terms represent overhead due to the special circuitry required by the hardware methods. The hardware methods are similar because they have been normalized for single fault detection. Shmohorib estimates that $\delta_1 \leq 0.33$ for the RS case, and such a figure seems reasonable for the other methods (it is likely that $\delta_1 \geq \delta_2 \geq \delta_3$).

The value $\delta = 0.33$ is comparable to a 2RCS program operating on four data streams, or two data and two fault streams, with $I = 12$.

As can be seen, 2RCS and all the hardware method execution times are similar even when $\delta < 0.33$. 2RCS is an appealing solution when extra processing elements are not readily available.

5.4 Conclusions

Permanent faults in the B-SYS chips were detected with a suite of test programs; the simplicity of the Systolic Shared Register design lead to a simple test pan. The B-SYS prototype board, hampered by its slow bus interface, can nevertheless run at 108 million operations per second. A more sophisticated board, perhaps the most important future task involving the Brown Systolic Array, could execute over 1 billion operations per second using only 16 chips.

This section also introduced the method of software fault detection, which provides flexible fault coverage at a reasonable speed. In contrast hardware approaches, the
interleaved replicated computation stream (IRCS) and skewed replicated computation stream (SRCS) methods do not require special hardware and do allow the user to vary the degree of fault detection. As seen by the use of automatic program transformations, software fault detection is a simple and versatile alternative to specialized algorithmic methods and complicated hardware solutions.
Chapter 6

A Framework for Systolic Programming

The systolic paradigm is possibly the most efficient means of harnessing the vast power of massively parallel processors. However, systolic programming language development has not kept pace with systolic hardware development. As has been seen, most systolic programming languages draw tools from a bag of special tricks that are neither intuitive nor elegant. This chapter proposes the New Systolic Language as a general solution to the problem of systolic programming. Additionally, the systolic data stream aspect of the SSR architecture is explored. The New Systolic Language and its stream model of data flow overcome the difficulties and adapt the advantages of existing systolic programming methods.

6.1 Overview

The features of the New Systolic Language (NSL) draw heavily on the analysis of systolic programming languages and design systems presented in Section 2.2. Recall that none of these languages is entirely satisfactory for systolic co-processor programming. Many force the user to consider the implementation details of a specific systolic machine, while others force the user into a very abstract world, requiring the specification of abstract dependency vectors and mappings. Although both of these views are helpful at times, it is felt that a new methodology for programming systolic co-processors is required. To repeat the conclusions of the language analysis (page 35), a systolic co-processing language should, in the author’s opinion, have the following characteristics:

1. The language should cleanly separate systolic cell programs and data flow directives. Shared asynchronous variables are to be avoided because multiple references can produce unpredictable results. Instead, cell programs should be specified as pure transfer functions between systolic inputs and systolic outputs.

2. Programs should be able to execute both host and co-processor array functions, preferably in the context of a conventional language.

93
3. The programmer should be able to declare systolic data streams as such using concise flow directives. Stream declaration and initialization should be accomplished apart from the cell program, dissociating cell operation from macroscopic data flow and enabling the reuse of standard data flows and cell operations.

4. The language should be independent of low-level co-processor features and functions, hiding array topology, processing element architecture, array size, and the method of systolic communication from the user. The programmer should not have to specify the physical names of queues, ports, registers, or bits.

The goal is to design a general-purpose systolic programming environment suitable for a wide variety of machines, from the Connection Machine, which can emulate meshes of arbitrary dimension, to the fixed linear topology of the Brown Systolic Array. The programmer should be able to quickly, concisely, and naturally specify the systolic cell functions and data flows of common systolic algorithms without any knowledge of the target architecture, apart from the fact that it is capable of supporting systolic operations over some broad range of topologies.

By the nature of this thesis, NSL's focus is strongly, though not entirely, on the B-SYS programmable systolic co-processor. As will become evident in the programming examples of the next chapter, low-level B-SYS code is not the easiest programming method for the Brown Systolic Array, in spite of its efficient implementation of systolic communication. Thus, this chapter considers more intuitive ways to program both B-SYS and programmable systolic arrays in general. In concert with this, the study of the SSR architecture is completed with a full explanation and exploration of systolic data stream programming, the fourth defining attribute of the SSR architecture. Although the Systolic Shared Register architecture can produce very efficient NSL implementations, it is not required by the NSL paradigm. The abstract idea of a systolic stream can be implemented on any parallel processor that can provide or simulate communication between processing elements over a regular network.

The New Systolic Language has been designed for generic systolic co-processing systems with the following traits:

- There is a host and a co-processor. Instructions and data flow from the host to the co-processor while results flow from the co-processor to the host.
- The host allocates all co-processor resources.
- All common arithmetic and logical operations are supported by the co-processor, though some may require multiple instructions.

It is believed that these criteria fully represent the interactions between a programmable systolic co-processor and its host computer.

As a consequence of being designed for systolic programming, the language does not support arbitrary programming of individual processing elements. Currently, algorithms that require more than one equivalence class of cell program (see Section 2.1.4) must define systolic streams to manually distinguish processing elements, as is done in
the alternate transcription program of the next chapter (Section 7.1.4). It is expected that NSL will be extended to support any constant number of equivalence classes; for broadcast machines, the several logical instructions streams would be automatically converted to a single SIMD program, while for MIMD machines, all cell programs would be evaluated simultaneously.

The current NSL implementation is a prototype system which has proven useful for exploring the problem of systolic programming. As such, it generates code but does not execute it either on the B-SIM simulator or on the actual hardware (which is connected to a different machine). Nevertheless, experimentation with NSL has lead to many refinements and extensions to the original language specification. The NSL prototype has been designed with an eye toward implementing a complete system, and the prototype forms a firm basis for the development of an integrated system for systolic co-processor algorithm development, programming, simulation, animation, and execution.

This chapter continues with an overview of the New Systolic Language and a look at NSL cell programs and main programs. The C++ prototype implementation is then considered, followed by a review of possible enhancements to the prototype system. In conclusion, NSL is evaluated and future research problems in the domain of systolic programming languages are considered.

6.2 NSL Programs

The systolic stream is the most important aspect of the New Systolic Language. Conceptually, the systolic stream can be viewed as a flow of data buckets passing by the processing elements. During each time step, one bucket in the stream is directly accessible to each processing element. The value contained in the bucket may be used in computation and, if desired, changed. Before the next evaluation of the cell program, this bucket will have moved downstream to the next processing element or delay register, making room for a new input value.

Systolic: speed refers to the number of time steps required for data to travel from one processing element to the next. That is, a stream implements some number of logical delay elements between processing elements which can be represented as shown in Figure 6.1. The figure diagrams a westward flowing stream of speed 1 and an eastward flowing stream of speed 3. Speed 0 streams are used for immobile, local data. Processing elements can also look small distances upstream and downstream, accessing future inputs.
and past outputs, as shall be described latter.

Systolic streams are defined in NSL main programs and, using the `store_code()` and `execute_code()` NSL procedures, are passed to the systolic cell programs. Systolic cell programs do not concern themselves with stream speed or type: NSL will assign registers and perform functions according to the declared type and speed of the systolic stream at runtime.

As seen in Figure 6.2, the major parts of the NSL system can be divided into three categories: the cell program, the main program, and the NSL system. As mentioned, the NSL cell program specifies computation on abstract systolic data streams, independent of the macroscopic data movement. Cell programs can call user-defined routines which process NSL data types (registers, flags, and data streams) as well as a large number of predefined operators and functions. The NSL prototype system was developed in the
object-oriented C++ language which provided the ability to overload operators. Thus, common operations have been defined for all basic systolic data types (i.e., when $x$ and $y$ are systolic streams, the operation $x+y$ is evaluated by NSL to generate co-processor instructions).

NSL routines and NSL cell programs are not called like common C++ routines. References to an NSL cell program are restricted to calls from other NSL routines and NSL cell programs and to use as a parameter to the store_code() function of Figure 6.2. The NSL system itself will call the cell program and, as a consequence of the overloaded operators, generate code for the systolic co-processor.

NSL main programs control the flow of information through the systolic co-processor by defining systolic data streams. Systolic stream flow specifications include data type (integer or character) and precision information as well as the direction and speed of flow. In the current implementation, flow direction is restricted to eastward and westward, the limits of a linear systolic array; an NSL implementation for mesh systolic arrays would enable several more directions of flow.

In addition to flow information, streams include input and output specifications. Whenever a stream moves through the array, an input is needed and an output produced at the array boundaries. These inputs and outputs can be linked to files, functions, and arrays. By default, the input is zero and the output is ignored.

When the NSL system is initialized, information about the co-processor or simulator size, topology, and architectural features is defined. Much of this information is accessible to the user for the construction of routines that must depend on hardware parameters. This initialization informs NSL the type of systolic array required by the programmer's application — the NSL system must emulate that architecture if it differs from the actual hardware. A complete NSL system could, for example, automatically execute hexagonal-mesh programs on square- or triangular-mesh co-processors.

After initialization, a call to the store_code() procedure in the user's program will generate co-processor code. The cell program and stream flow specifications are combined and converted to the co-processor's native tongue. Hardware-specific resource optimization may also be performed.

The systolic co-processor is controlled by the execute_code() routine. This NSL function retrieves the previously stored code and sends it to the co-processor in whatever manner is appropriate to the machine. Using the I/O specifications of the systolic streams, the execute_code() routine also sends input to and store output from the co-processor. As mentioned, the stream I/O specifications can link these inputs and outputs to files, functions, or arrays. In essence, the store_code() procedure can be regarded as NSL compilation while execute_code() performs NSL execution. Note, however, that these are done dynamically within a C++ program, so that NSL parameters and function definitions can remain indeterminate until the C++ program is executed on the host.

In addition to overloading operators, the NSL prototype system takes advantage of several other features from C++. First, C++ can convert between object types given the appropriate specifications, for example the register part of the flag and register result of an arithmetic operation can be automatically extracted. Second, C++ allows the


```c
#include "nsi.h"

void
hvner (SStream C, SStream X, SStream Y)
{
    Y = Y + (C * x);
}
```

Figure 6.3: NSL cell program for Horner's method.

definition of object constructors and destructors. When registers or flags are declared, NSL constructors are called to allocate one of the systolic co-processor's storage locations. Similarly, to destroy registers or flags, destructors are automatically called to return the resource to the pool of available resources. Thus, the scope of register and flag allocation follows normal scoping conventions.

6.2.1 Cell Programs

NSL cell programs are C++ procedures that make use NSL's special systolic data objects. No information about data flow, data type, or array topology is present in the cell program.

An NSL cell program for Horner's method (Section 2.1.2) is shown in Figure 6.3. Following the first requirement for systolic programming, this cell program is a pure transfer function that does not involve the macroscopic flow of data; no information about the systolic streams, apart from their formal parameter names, is available to the cell program. A stream name (Y) as an rvalue (on the right-hand side of an assignment statement) refers to the input value of that stream, while a stream name as an lvalue (left-hand side) will set the output value of that stream (corresponding to Y' in the SDEF systolic design system example of Figure 2.8 on page 29). Stream names which do not occur as lvalues (C and x) are given the identity transformation (C=C and x=x), as with SDEF. For more complicated access to the streams, several other options will be detailed later.

Cell programs can use most C++ arithmetic and logical operations and can declare registers or flags for local use, as is done in the NSL cell program for sorting (Figure 6.4). In addition to the predefined NSL operators and routines, users may program additional routines involving the NSL data types for use with cell programs, such as the following minimization routine:

---

For those unfamiliar with C++, the notation SStream C indicates that the parameter C is a reference to a SStream object equivalent to the Pascal var declaration. Constant parameters can be specified with the const keyword.

The notation of the Heats example (Figure 2.9 on page 30) can also be used, giving direct access to the stream input (Y, in X) and output (Y, out X) registers.

98
```c
#include "nsl.h"
void
sort (SStream & Max, SStream & Min)
{
    Flag f;
    Max = select (f = (Min > Max), Min, Max);
    Min = select (f, Max, Min);
}
```

Figure 6.4: NSL cell program for sorting.

<table>
<thead>
<tr>
<th>Field</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>Config::n</td>
<td>Array size.</td>
</tr>
<tr>
<td>Config::topo</td>
<td>Array topology.</td>
</tr>
<tr>
<td>Config::reg</td>
<td>Registers per register bank.</td>
</tr>
<tr>
<td>Config::flags</td>
<td>Flags per functional unit.</td>
</tr>
<tr>
<td>Config::edges</td>
<td>Number of banks accessible to each functional unit.</td>
</tr>
<tr>
<td>Config::banks</td>
<td>Number of banks per functional unit.</td>
</tr>
<tr>
<td>Config::word</td>
<td>Length of word in bits.</td>
</tr>
<tr>
<td>Config::zero_flag</td>
<td>Pointer to constant zero flag.</td>
</tr>
<tr>
<td>Config::one_flag</td>
<td>Pointer to constant one flag.</td>
</tr>
<tr>
<td>Config::mask_flag</td>
<td>Pointer to mask flag.</td>
</tr>
</tbody>
</table>

Table 6.1: Static Config class members.

```c
Register
min (const Register* r1, const Register* r2)
{
    return (select ((r1 < r2), r1, r2));
}
```

Cell programs and NSL routines can also make use of C++ looping constructs to create functions which depend on the target architecture, such as the following NSL routine to generate a test vector for bridge faults:

```c
Register
bridge_test (void)
{
    Register result;
    for (int i = 0; i < Config::word; i++ )
    {
        result <<= 2; result++;
    }
    return (result);
}
```

The static Config class stores the systolic co-processor's configuration — size, topology, bits per word, pointers to constant flags, and the like, as tabulated in Table 6.1.

NSL cell programs only have systolic streams as arguments and never return a value. Data is passed to and from the main program (running on the host) through the systolic
#include "nsi.h"
main(void) {
  int n=8;
  // Array length 5, 16 registers, 8 flags, two directions,
  // 1 register bank per PE, linear array, 8-bit word.
  //Config:setup (n, 16, 8, 2, 1, LINEAR, B);

  // Fixed stream of one-byte integer coefficients
  Fix2Stream Coeff (INT, NSI_BYTES);

  // Two mobile streams
  SSStream X (EAST, NSI_SPEED1, INT, NSI_BYTES1);
  SSStream Y (EAST, NSI_SPEED1, INT, NSI_BYTES1);

  // Preprocess and store the cell program,
  store_code ("Horner", horner, &Coeff, &X, &Y);

  // Use a file for input of coefficients,
  // reversing them in the order they appear
  // in the file.
  Coeff.source ("file1", n, REVERSE, 0);
  // Read n X inputs from file2, then default to 0.
  X.source ("file2", n, IDENT, 0);
  // Place n Y outputs in file3 after waiting n
  // steps (Y inputs default to 0)
  Y.sink ("file3", n, IDENT, n);

  // Execute the complete cell program until the
  // Y stream has received n outputs.
  execute_code ("Horner", &Coeff, &X, &Y);
}

Figure 6.5: NSL main program for Horner's method.

data streams according to their input and output specifications. General NSL routines,
such as min() and bridge_test(), can process and return any NSL or C++ data struc-
ture.

In summary, NSL cell programs are C++ routines that make use of the NSL Register,
Flag, and SSStream data types. The systolic cell programs contain no information about
data movement, array configuration, or stream initialization and result extraction; cell
programs can be easily reused with different data movements.

6.2.2 Main Programs

The NSL calling routines of Figures 6.5 and 6.6 have full control over the systolic array:
they configure the array, maintain the data streams, and call systolic cell programs as
#include "nsl.h"
main(void)
{
    int n=50;
    Config::setup (n, 16, 8, 2, 1, LINEAR, 0);

    FixStream Local (INT, NSL_BYTE1, 0);
    SStream Result (FAST, NSL_SPEEDI, INT, NSL_BYTE1);

    store_code ("Sort", sort, &Local, &Result);

    // Take n numbers in as-is order, then default to 2i6
    // (the infinity value)
    Result.source ("unsorted", n, IDENT, 255);
    // Sink n output results to a file, but start output
    // collection 2n steps into execution.
    Result.sink ("sorted", n, IDENT, 2n);

    execute_code ("Sort", &Local, &Result);
}

Figure 6.6: NSL main program for sorting.

needed. First, the array is initialized to the type (in this case, linear) and size desired. This is followed by several data stream definitions. Then, the code is preprocessed and stored, and inputs and outputs are linked to the systolic streams. Next, the systolic program is executed, and finally the streams are released for future use (in this case, at the completion of the main program).

Co-processor code is generated by calling the store_code() procedure with a symbolic name, the cell program name, and as many stream specifiers as are required. The store_code() function executes the systolic routine, in which operations involving NSL objects generate co-processor code but do not perform any computation on the systolic co-processor. The code is then analyzed to determine if weaving is necessary and, if so, the number of cycles required for the phased systolic program (see Section 3.4.2).

After determining weave conditions, code is regenerated using the revised stream flow specifications and then is compressed to eliminate temporary variables. Figures 6.7 and 6.8 show the verbose and final B-SYS code generated by the NSL system for the Horner's method cell and main programs (multiplication has been replaced by an exclusive-or operation to shorten the B-SYS assembly language output). In addition to the standard instruction fields described in Section 4.1.1, numeric stream identifiers are given, corresponding to the read and write fields of B-ASM assembly language (Section 4.1.1). Whenever a stream is accessed, input or output values are written to and read from the co-processor according to the stream definition. The sink() and source() member functions of the stream class specify the stream inputs and outputs as files, functions, arrays or default values.

After confirming that the systolic stream specifiers are appropriate to the stored
; Initialization of flags
1 Rfn Oxza W0 W0 P0f0 0x0 0x0 FF F7 Set mask flag
2 Rfn Oxza W0 W0 P0f0 0x0 0x0 F6 F6 Set one flag
3 Rfn Oxza W0 W0 P0f0 0x0 0x0 F6 F5 Clear zero flag
4 Rfn Oxza E14 E14 P0f0 0x0 0x0 F6 F5 Set register
5 Rfn Oxza E14 E14 P0f0 0x0 0x0 F6 F5 IOSTream Word copy

; Load fixed stream into array.
1 Rfn Oxza E15 E15 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
2 Rfn Oxza E16 E16 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
3 Rfn Oxza E15 E15 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
4 Rfn Oxza E16 E16 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
5 Rfn Oxza E15 E15 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy

; Execute_code routine Horner (O)
; Code for Horner cycle 0 of 1
1 Rfn Ox06 W16 W14 W11 P0f0 0x0 0x0 F6 F5 Logical_bop XOR i=0
2 Rfn Ox06 W11 W13 W12 P0f0 0x0 0x0 F6 F4 Arith_op ADD i=0
3 Rfn Oxza W12 W12 E13 P0f0 0x0 0x0 F6 F5 <2> IOSTream Word copy
4 Rfn Oxza W16 W16 E16 P0f0 0x0 0x0 F6 F5 <1> IOSTream Word copy

Figure 6.7: Uncompressed B-SYS code generated for Horner's method.

; Initialization of flags
1 Rfn Oxza W0 W0 W0 P0f0 0x0 0x0 0x0 FF F7 Set mask flag
2 Rfn Oxza W0 W0 W0 P0f0 0x0 0x0 F6 F6 Set one flag
3 Rfn Oxza W0 W0 W0 P0f0 0x0 0x0 F6 F5 Clear zero flag
4 Rfn Ox06 E14 E14 E14 P0f0 0x0 0x0 F6 F0 Set register
5 Rfn Oxza E16 E16 E15 P0f0 0x0 0x0 F6 F5 IOSTream Word copy

; Load fixed stream into array.
1 Rfn Oxza E15 E15 E15 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
2 Rfn Oxza E16 E16 E16 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
3 Rfn Oxza E15 E15 E15 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
4 Rfn Oxza E16 E16 E16 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy
5 Rfn Oxza E15 E15 E15 P0f0 0x0 0x0 F6 F5 <0> IOSTream Word copy

; Execute_code routine Horner (O)
; Code for Horner cycle 0 of 1
1 Rfn Ox06 W16 W14 W11 P0f0 0x0 0x0 F6 F5 Logical_bop XOR i=0
2 Rfn Ox06 W11 W13 W12 P0f0 0x0 0x0 F6 F4 Arith_op ADD i=0
3 Rfn Oxza W16 W16 W14 P0f0 0x0 0x0 F6 F5 <1> IOSTream Word copy

Figure 6.8: B-SYS code generated for Horner's method.

code, the execute_code() routine locates code for the named routine and executes it on the co-processor (or, in the case of the NSL prototype system, prints out the code). The code is executed until all outputs have been generated, as determined by the stream specification. A step_code() function is available for running cell programs through one or more iterations instead of until output completion.

Two iterations of the NSL sorting program are shown in Figure 6.9. Results are
automatically woven between the registers $R_{13}$ and $R_{14}$. As can be seen, this assembly language code is semantically identical to the B-SIM sorting program of Figure 4.4 on page 54, though of course the NSL source code is much easier to understand.

6.3 The NSL Implementation

The New Systolic Language prototype is a collection of C++ objects which allocates and deallocates co-processor resources. The objects, corresponding to flags, sets of registers, and systolic streams, can be manipulated with a wide variety of predefined operators and functions. In addition, there are several functions available for controlling the co-processor (for example, setting and clearing the mask bit) and the available resources (for example, accessing all data in a systolic stream, not just the cell inputs and outputs).

The structure of the basic data objects (classes Register, Flag, and RPair) is displayed in Figure 6.10. The Register object is an abstract register: it has its own type (i.e., signed integer, unsigned integer, or character) and precision (in bytes). Each physical register (in the case of B-SYS, sixteen) and flag (in the case of B-SYS, eight) has its own unique identifying object, a Reg obj or Flag obj. These are allocated and deallocated by the Register and Flag classes according to normal C++ scoping rules. A reference counter is used to determine when an object may be safely returned to the pool of free resources. Only pointers are used to refer to these basic resources because of C++’s prudence to copy objects as needed. Thus, the reference counter and pointer use ensures both that multiple copies of a resource do not exist and that resources do not vanish until all references to them have been eliminated. In the case of B-SYS, three of the flags are permanently removed from the resource pool during configuration for special use, and their addresses are stored in the static Config class, as can be seen from Table 6.1.

The RPair, consisting of a Flag and a Register (either can have zero length), is the basic unit of NSL computation. Most C++ operators have been defined for RPair.
objects, as well as methods for automatically converting Register, Flag, and SStream objects to the RFpair class. RFpair objects can also be manually or automatically cast to Register or Flag objects, in the process of which the other half of the RFpair becomes inaccessible. The operators and functions implemented for the NSL prototype system are displayed in Table 6.2. Unfortunately, the C++ selection operator :: cannot be overloaded.

Register and Flag objects do not persist between cell program iterations: systolic stream must be used for persistent data, both flow and fixed. Each stream acquires some number of registers according to its data type, speed, and use of wearing. Data in a speed n stream must take n steps to reach the next processing element, so n delay registers must be allocated to the stream. A stream moving at a speed of one requires one register, while a stream moving at a speed of two requires one buffer element and one other register. One additional register is allocated when wearing is required. For example, the Max stream in the sorting example uses two registers since it is a speed 1 stream with wearing.

SStream objects (Figure 6.11), as seen in the code examples, must also specify inputs and outputs for the stream, though a lack of either is acceptable. The structure of the InputObj and OutputObj classes used by SStream objects is displayed in Figure 6.12. Each member stores a mapping, such as as-is or reverse order (similar to Heartex), a number of inputs to give or outputs to receive, and the number already processed. The various

![Diagram of NSL prototype data objects](image-url)
 subclasses of IOobj store additional information, such as default input values and delays before recording output values, buffers for files, and pointers to arrays or functions.

An NSL program for sequence comparison (Sections 2.1.3 and 7.1) is shown in Figure 6.13. This program illustrates the use of functions and arrays for input and output: the function seqsight() returns the value \( d_{i,j} = t \) for each time step \( t \) (the index of the time step is passed to the routine when called by NSL). More importantly, the program illustrates advanced manipulation of the systolic stream object type. In the sequence comparison algorithm, three previous results are required: \( d_{i-1,j-1}, d_{i,j-1}, \) and \( d_{i-1,j}, \) the first two having been computed in \( F_{j-1} \) (at times \( t - 2 \) and \( t - 1 \)) and the remaining in \( F_j \) (at time \( t - 1 \)). Since cost data is required from two time steps ago, the cost stream moves at a speed of 2 (thus, three registers are allocated to the stream because weaving is required). The output of \( F_j \)'s cell program will be \( d_{i,j} \), and the input is \( d_{i-1,j-1} \). The problem is to access \( d_{i,j-1} \), which will be next time step's stream input, and \( d_{i-1,j} \), which was last time step's stream output. In short, \( F_j \) must look upstream and downstream from its current position. In NSL, these values are accessible using array notation, indexing the systolic data stream relative to the current input or output. Thus, the expression \( \text{Cost}[0] \) is equivalent to both \( \text{Cost}.in() \) and \( \text{Cost} \) as an rvalue, all of which refer to the stream input value \( d_{i-1,j-1} \). \( \text{Cost}[1] \) retrieves the value one time unit down the systolic stream, or \( d_{i,j} \) (produced by \( F_j \) last time step). Similarly, negative indices retrieve values upstream, so \( \text{Cost}[-1] \) retrieves the value one time unit up the systolic stream, or \( d_{i,j-1} \) (produced by \( F_{j-1} \) last time step). Thus, functional unit \( F_j \) sees Cost as a stream of data flowing past it, and is able to look short distances upstream and downstream as it computes values to place in that stream. The implementation of these options is simple because shared registers are used for systolic communication. The methods available for

<table>
<thead>
<tr>
<th>Operator</th>
<th>Meaning</th>
<th>Operator</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>-</td>
<td>Unary minus</td>
<td>^=</td>
<td>Logical negation</td>
</tr>
<tr>
<td>&amp;</td>
<td>Bitwise and</td>
<td>*=</td>
<td>Bitwise and</td>
</tr>
<tr>
<td>*</td>
<td>Bitwise exclusive-or</td>
<td>/=</td>
<td>Bitwise exclusive-or</td>
</tr>
<tr>
<td></td>
<td>Bitwise or</td>
<td>&lt;&lt;=</td>
<td>Bitwise or</td>
</tr>
<tr>
<td>&lt;</td>
<td>Left shift</td>
<td>&gt;&gt;=</td>
<td>Left shift</td>
</tr>
<tr>
<td>&gt;&gt;</td>
<td>Right shift</td>
<td>&lt;=</td>
<td>Right shift</td>
</tr>
<tr>
<td>&lt;</td>
<td>Less than</td>
<td>=&gt;</td>
<td>Less than or equal</td>
</tr>
<tr>
<td>&gt;</td>
<td>Greater than</td>
<td>==</td>
<td>Greater than or equal</td>
</tr>
<tr>
<td>==</td>
<td>Equal</td>
<td></td>
<td></td>
</tr>
<tr>
<td>+</td>
<td>Addition</td>
<td></td>
<td></td>
</tr>
<tr>
<td>-</td>
<td>Subtraction</td>
<td></td>
<td></td>
</tr>
<tr>
<td>*</td>
<td>Multiplication</td>
<td></td>
<td></td>
</tr>
<tr>
<td>++</td>
<td>Increment</td>
<td></td>
<td></td>
</tr>
<tr>
<td>&amp;&amp;</td>
<td>Flag and</td>
<td></td>
<td></td>
</tr>
<tr>
<td>!</td>
<td>Flag negation</td>
<td></td>
<td></td>
</tr>
<tr>
<td>match()</td>
<td>Matching bit position</td>
<td></td>
<td></td>
</tr>
<tr>
<td>i</td>
<td>Selection</td>
<td></td>
<td></td>
</tr>
<tr>
<td>set(i)</td>
<td>Nonmatch()</td>
<td></td>
<td></td>
</tr>
<tr>
<td>force_mask()</td>
<td>Unconditionally set mask</td>
<td></td>
<td></td>
</tr>
<tr>
<td>use_mask()</td>
<td>Do or not obey mask flag</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 6.2: NSL operators and functions.
use with SStream objects are tabulated in Table 6.3.

The resulting B-SYS code for the sequence comparison problem is displayed in Figure 6.14. As can be seen, the current NSL compression algorithm is not optimal: the cell program's length would be reduced by one instruction (17%) if the character matching and character movement (the first and last statements) were combined. The stream of costs has been allocated three registers ($W_{11}$, $W_{12}$, and $W_{13}$) because it is a speed 2 stream with weaving.

### 6.4 Implementation Enhancements

The current NSL prototype implementation fully understands all the features and operators mentioned herein, but although it generates valid B-SYS code it is unable to run the code on either the simulator or the co-processor hardware. Since the goal of this research was to investigate languages for general-purpose systolic programming, an examination of several NSL applications and the generated B-SYS code proved sufficient to evaluate the programming paradigm.
Experimentation with the New Systolic Language prototype system has led to many observations about and refinements of the New Systolic Language. Several of these have already been incorporated in the NSL system, in particular the ability to access upstream and downstream data just described. However, there are several implementation enhancements which should be incorporated into future versions of the NSL system. These include:

1. Systolic streams should be modified to have three principle flow attributes: direction or location, speed, and persistence. In the current implementation, only neighboring locations, such as east or west, may be specified to indicate logically adjacent processing elements. This could be generalized to allow connections to any relatively-addressed local processing element. For example, in one of the applications to be discussed in the next chapter (Section 7.1.5), it would be useful to specify a stream that automatically hops over processing elements, sending data from each \( P_i \) to \( P_{i+2} \) instead of the standard \( P_{i+1} \).

The speed of a systolic stream can remain as defined in the NSL prototype system, equal to the number of delay elements between processing elements. However, in conjunction with the systolic speed, a persistence attribute should be added to each stream to indicate how long data persists in the systolic stream beyond the minimum requirements of the speed definition. Thus, a fixed systolic stream (\( \text{speed 0} \)) with a persistence of 2 would make available both the current step’s and the previous step’s stream input value. Similarly, mobile streams should allow persistences in case stream input values must be reused latter.
#include "ns1.h"
define DEL_COST 1

void sequence (SStream &Ch1, SStream &Ch2, SStream &Cost)
{
  // Cost input is d_{i-1,j-1}
  // Cost[0] is last step's input, or d_{i,j-1}
  // Cost[i] is last step's output, or d_{i-1,j}.
  Cost = select (match (Ch1, Ch2),
    Cost,
    select (Cost[1] < Cost[0],
      Cost[1],
      Cost[0]) + DEL_COST);
}

int initweight (const int n) // n-th initial weight d_{n,n}
{
  return n;
}

main(void)
{
  int n = 6;
  int final_cost;
  Config::setup (n, 16, 8, 2, 1, LINEAR, 0);

  // Fixed stream of one-byte characters.
  FixSStream Seq1 (CHB, NSL_BYTE1);

  // Two mobile streams
  SStream Seq2 (EAST, NSL_SPEED1, CHAB, NSL_BYTE1);
  SStream Weight (EAST, NSL_SPEED2, INT, NSL_BYTE1);

  store_code ("Sequence", sequence, &Seq1, &Seq2, &Weight);

  Seq1.source ("seq1", n);
  Seq2.source ("seq2", n);
  // Use a function for n input values.
  Weight.source (initweight, n);
  // Only save one result (in an array).
  Weight.sink (final_cost, 1, IDENT, 2*n-1);

  execute_code ("Sequence", &Seq1, &Seq2, &Weight);
  cout << "Sequence Distance: " << final_cost;
}

Figure 6.13: NSL cell program for sequence comparison.
Table 6.3: SStream member functions.

<table>
<thead>
<tr>
<th>Function</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>S()</td>
<td>Stream input (value) or output (value).</td>
</tr>
<tr>
<td>S.in()</td>
<td>Stream input value.</td>
</tr>
<tr>
<td>S[0]</td>
<td>Stream input value.</td>
</tr>
<tr>
<td>S.out()</td>
<td>Stream output value.</td>
</tr>
<tr>
<td>S=</td>
<td>Assignment of stream output value.</td>
</tr>
<tr>
<td>S.source()</td>
<td>Assignment of file, function, or array as stream's input source.</td>
</tr>
<tr>
<td>S.wink()</td>
<td>Assignment of file, function, or array as stream's result sink.</td>
</tr>
<tr>
<td>S.force_woven()</td>
<td>Force S to be woven.</td>
</tr>
</tbody>
</table>

; Initialization of flags
; fnm Oxaa w0 w0 w0 P&fn Ox0 Ox0 F7 F7 Set mask flag
; fnm Oxaa w0 w0 w0 P&fn Ox0 Ox0 F6 F6 Set one flag
; fnm Oxaa w0 w0 w0 P&fn Ox0 Ox0 F5 F6 Clear zero flag
; fnm Ox0 E14 E14 P&fn Ox0 Ox0 F0 F0 Set register
; Load fixed stream into array.
; fnm Oxaa E16 E16 E16 P&fn Ox0 Ox0 F5 F5 <0> IDStream= Word copy
; fnm Oxaa E15 E15 E15 P&fn Ox0 Ox0 F6 F6 <0> IDStream= Word copy
; fnm Oxaa E16 E16 E16 P&fn Ox0 Ox0 F6 F6 <0> IDStream= Word copy
; fnm Oxaa E16 E16 E16 P&fn Ox0 Ox0 F6 F6 <0> IDStream= Word copy
; fnm Oxaa E16 E16 E16 P&fn Ox0 Ox0 F6 F6 <0> IDStream= Word copy
; fnm Oxaa E15 E15 E15 P&fn Ox0 Ox0 F6 F6 <0> IDStream= Word copy
; Execute_code routine Sequence (O)
; Code for Sequence cycle 0 of 2
; fnm Oxaa W14 W14 E10 P&fn Ox0 Ox8 F6 F4 Rel_op MATCH
; fnm Ox06 W13 E12 W10 P&fn Ox9 Ox4 F5 F3 Rel_op SUB
; fnm Oxac W13 E12 E10 P&fn Ox0 Ox0 F3 F3 select
; fnm Ox0a E10 E10 E10 P&fn Ox0 Ox0 F3 F3 Arith_uop INCR
; fnm Oxac W12 E10 E12 P&fn Ox0 Ox0 F4 F4 C2> select
; fnm Oxaa W14 W14 E14 P&fn Ox0 Ox0 F5 F6 <1> IDStream= Word copy
; Code for Sequence cycle 1 of 2
; fnm Ox08 W14 W14 W1 P&fn Ox0 Ox8 F6 F4 Rel_op MATCH
; fnm Ox06 W13 E12 E11 P&fn Ox9 Ox4 F5 F3 Rel_op SUB
; fnm Oxac W12 E13 W11 P&fn Ox0 Ox0 F3 F3 select
; fnm Ox0a W11 W11 W11 P&fn Ox0 Ox0 F3 F3 Arith_uop INCR
; fnm Oxac W13 E12 E13 P&fn Ox0 Ox0 F4 F4 C2> select
; fnm Oxaa W14 W14 E14 P&fn Ox0 Ox0 F5 F6 <1> IDStream= Word copy

Figure 6.14: B-SYS code generated for sequence comparison.
2. Support for on-board program storage should be added. The NSL system would store a selection of routines on the co-processor board for autonomous execution, perhaps using one of the many well-studied instruction caching algorithms. The separation of the store_code() and execute_code() routines will aid in this task. Additionally, the optimization routines used by NSL should be improved to more efficiently allocate the co-processor’s resources. This would include, for example, examination of cell programs for integer constants which can then be loaded before the execution of a cell program (currently, integer constants are constructed in every processing element during each cell program iteration). More efficient means of accessing the mask flag should also be considered.

3. When insufficient memory is available in processing elements for execution of the cell program, NSL should automatically allocate multiple processing elements per systolic cell program, reconfiguring the systolic streams as necessary. The protein scanning algorithm of Section 7.1.5 will illustrate the need of this feature.

4. Support for programmable systolic co-processors other than the Brown Systolic Array should be included, as well as a general-purpose simulator interface, possibly based on the Tango system mentioned in Section 4.1.2.

As can be seen, the NSL framework is not yet entirely full, however experimentation with the prototype system has lead to many important ideas and methods for systolic programming.

6.5 Conclusions

The New Systolic Language greatly expedites B-SYS programming, freeing the programmer from many tedious tasks. The obvious question is whether or not it has satisfied the goal of providing a simple systolic programming language for the creation of machine-independent programs using a logical structure suitable for systolic co-processors. As the examples of this chapter have illustrated, NSL does provide a concise and intuitive interface for systolic co-processor programming. NSL uses the clean separation of cell function and data movement present in SDEF, while remaining suitable for use with real hardware. The pitfalls of shared asynchronous variables and low-level systolic communication directives have been deftly sidestepped. The design is not limited to any particular machine or network topology, and future extensions could make it truly independent of hardware.

The most obvious future development issue is the extension of NSL to topologies of two or more dimensions. In concert with this, stream direction specifications to access arbitrary processing elements within these topologies should be added. A vector notation similar to that of the SDEF system could specify, for example, a stream flowing to the processing element two units north and one unit east. More common flows, such as northwest, would only require simple textual specification. Depending on actual network topology, some of these streams may require the generation of extra co-processor data.
movement instructions (i.e., simulating an octagonal mesh on a square mesh). Also, support for two-dimensional input and output specifications, perhaps similar to those of Hearts, should be provided: a systolic stream composed of entire rows from a matrix, distributed along a column of processing elements, should be simple to specify and easy to use. Many of these issues should be developed and refined concurrently with the stream enhancements discussed in the previous section.

Support for wavefront programming should be investigated. Some algorithms, such as systolic data compression, do not have a fixed relationship between input to and output from the array: the delay between input and output depends on how compactly the text can be compressed. Methods for automatically providing sentinels or other mechanisms which implement wavefront programming should be evaluated.

Since NSL is intended for arbitrary systolic co-processors without program modification (apart from NSL initialization via the Configuration call), support for oversized and undersized arrays must be developed. In the former case, this could involve controlling a variable-length co-processor, masking certain processing elements, or user-guided cell program analysis to find algorithmic methods of coping with large arrays. For example, in the case of sequence comparison, extra processing elements can be loaded with wild-card characters which match all other characters. Thus, by the sequence comparison algorithm, these processing elements will not affect the cost values exiting the array. Similarly, padding strings with null characters (\$) will have an entirely predictable effect on the distance computation. Performing functions on undersized arrays is a more complicated, though still achievable, goal. With guidance from the programmer, NSL should be able to partition problems for undersized arrays by storing intermediate results in the host's memory.

The New Systolic Language prototype system is a framework for a complete and integrated systolic co-processing environment for arbitrary topologies and implementations. Cell programs are, as they should be, independent of both the implementation and the form of systolic communication; cell programs do not access specific data queues or communication registers, but only access the abstract systolic stream data type. NSL main programs, written in C++, can execute host and co-processor functions in a simple and intuitive manner. Additionally, NSL has been designed for use with independently controlled co-processors; a complete NSL system would extract the highest possible performance from systems similar to B-SYS*.

Systolic communication is expressed clearly and concisely with the definition of systolic stream objects. The user can link files, functions, and arrays to the streams, and additional input and output specifications can be easily added to the NSL class library. Finally, the user can specify different topologies, though at present only linear systolic arrays are supported.

The case for the New Systolic Language is further supported in the next chapter, wherein several applications are programmed in NSL.
Chapter 7

B-SYS Applications

In an effort to gauge the performance and success of the Brown Systolic Array, this chapter considers the programming of several applications on the system. The problem which has had the greatest influence on the design of B-SYS has been sequence comparison, a problem critical to the analysis of the human genome. A system such as B-SYS (Section 5.2.4) would bring both high performance and programmability to this problem, features which are exploited in the several sequence comparison variations discussed here. The Brown Systolic Array is not just a programmable sequence comparison machine, it can perform a variety of other tasks to speed combinatorial computations. As an example, a B-SYS program for transitive closure, traditionally solved on 2-dimensional systolic arrays, is examined.

7.1 Sequence Comparison

As has been mentioned, there are many variations on the basic sequence comparison problem, including use of different character matching costs, searches for a subsequence, searches for the longest common subsequence, and the introduction of gap penalties for breaking or joining a sequence. P-NAC can only perform sequence comparison over a four-character alphabet with hardwired costs. BioSCAN, first mentioned in the footnote on page 18, can perform subsequence searches over a 28-character alphabet with programmable costs but without insert or deletion operations (only point mutations are allowed), as described in Section 7.1.5. Splash can theoretically perform many types of sequence comparison, however more complicated versions, such as comparisons over large alphabets or with different cost metrics, will result in a low number of systolic cells per field-programmable gate array. Because it provides a high concentration of fully programmable processing elements, the Brown Systolic Array is well suited to be a general-purpose programmable sequence comparison engine.

To begin this examination of sequence comparison on the Brown Systolic Array, two sequence comparison programs for the algorithm considered in Sections 2.1.3, 5.2.3, and 6.3 are analyzed. Next, a sequence comparison algorithm with an affine (or gap) cost function is presented. These three algorithms are most appropriate for nucleic-acid
sequence comparison, in which all four characters (nucleotides) are treated identically. Next, the problem of finding an amino acid encoding region within a nucleic acid is examined. Finally, continuing to move up the hierarchy of biological structures, the analysis of two amino acid chains (proteins) in a manner similar to that of BioSCAN is considered.

### 7.1.1 Simple Sequence Comparison

There is a problem with the sequence comparison algorithm of Figures 6.13 on page 108. If the sequences are over 128 characters long, the total edit distance could exceed 256, a number beyond the range of one-byte integers. That is, if the two sequences have no matching characters (for example, one being a sequence of the letter 'A' and the other of the letter 'C'), the final cost \( d_{128,128} \), corresponding to deleting the entire first string (at a cost of 128) and inserting the entire second string (128), will be 256. With one-byte integers, the nefarious principle of integer wraparound will turn this result into \( d_{128,128} = 0 \), incorrectly indicating a perfect match.

One solution is the use higher-precision integers, however this will make the cell program depend on the length of the array: longer array\s and sequences will require the assignment of more registers to the cost data stream to maintain adequate precision. As an alternative, BioSCAN imposes minimum 0 and maximum 16384 values on \( d_{i,j} \), the latter being treated as an infinity value which is never decremented.\(^{142}\) Character matching costs must be scaled according to sequence length because the threshold is not programmable. Regardless of scaling, these arbitrary caps will limit BioSCAN\'s effectiveness when processing sequences of fifty thousand or more characters.

A much better solution is the modulo sequence comparison method developed by Lopresti, who noticed that although global cost differences between two strings can be large, local differences (such as the difference between \( d_{i-1,j} \) and \( d_{i,j-1} \)) are always small.\(^ {129} \) The problem is to determine which of two values (\( d_{i-1,j} \) or \( d_{i,j-1} \)) is the smaller given that integer wraparound may have occurred. Thus, when comparing values modulo 256, the number 255 is smaller than the number 1 because \( 1 - 255 = 2 \pmod{256} \), and the alternative (a difference of 254) is impossible by the magnitude constraint on local differences. The four cases in comparing two numbers \( z \) and \( y \) modulo 256 are shown in Table 7.1, along with the \( Z \) flag and \( R \) type results of each subtraction. As can be seen, it is not the carry output \( Z \) that indicates which value is the smaller modulo 256,
but the most significant bit of the result. Module \( m \) sequence comparison can be used whenever the local differences are strictly less than \( 2^n \), 128 for 8-bit integers (that is, when \( |a - b| < 2^n \), \( \min(a, b) \) can be unambiguously determined by comparing \( a \pmod{m} \) and \( b \pmod{m} \)).

An NSL cell program for modulo sequence comparison is displayed in Figure 7.1. A C program similar to that of Figure 7.2 was used to drive the B-SYS prototype system as the board was being assembled. In this program, the \texttt{Instruction()} and \texttt{Instruction()} macros execute an instruction on the B-SYS prototype board with a zero input or an arbitrary input, respectively. The \texttt{ReadBays()} macro queries the prototype board for an output value. The macros follow from the port assignments of Table 5.3 on page 81; the macros are defined in Figure 7.3. As can be seen, input and output data values are passed to and from the prototype board in complemented form. The \texttt{post_bays()} routine processes the modulo sequence comparison data to extract the true \( d_{\text{SN}} \) result by applying the constraint on local differences.

As mentioned in Section 5.2.3, this sequence comparison program was executed as the array was being built, and its timings are tabulated in Table 5.4 on page 82. As can be seen in Figure 7.4, the Brown Systolic Array prototype system greatly outperforms the host 25 MHz Intel 80386 microprocessor.

### 7.1.2 Comparison of One String Against Many

Unfortunately, the sequence comparison program of the previous example only maintains 50% processor utilization: as the string enters the array all functional units to the east of the first character are not performing useful computation, and as the string exits the

---

"In the real version, the two loop iterations are split into two halves. During the first part, the input sequence \( x \) is streamed into the array but results are not collected. In the second part, null characters are used for character input while cost results are stored. As commented in the figure, not shown in the initialization during which the sequence \( t \) is loaded into the array and several flags and registers are set.

115
int bcompare (register unsigned char *s, int n, register unsigned char *t)
{
    int i;

    /* Store sequence t in the array and initialize */
    /* flags and registers. F7 = 0, F6 = 1. */
    /* ED = fixed character, W1 = moving character */

    /* First n steps (loading sequence) and second n steps */
    /* combined for simplicity */
    for (i = 0; _CX>0; i < 2*n; i++)
        /* Compare two previous costs (W2, E2) */
        ZInstruction(1, xorAB, W2, E2, WF, Zeab, F7, F1);
        /* But do the comparison mod 256 */
        ZInstruction(1, fna, WF, WF, WF, Zeab, F1, F1);
        /* Select the minimum place in WF */
        ZInstruction(1, selectABonC, W2, E2, WF, Zconst, F1, F1);
        /* Add 1, the insertion or deletion cost, to WF */
        ZInstruction(1, xorC, WF, WF, WF, Zeab, F6, F1);
        /* Compare the characters in s1 and s0, passing the s1 */
        /* character east and feeding a character to the array */
        Instruction (1, fna, WF, ED, EI, matchAB, F7, F2, *s);
        /* If the character's matched, use W4 as the cost, */
        /* otherwise use WF. Store result in R4 and load d(t,i)*/
        Instruction (1, selectABonC, W4, WF, RA, Zconst, F2, F5, _CX);
        ReadBytes (+s);
        _CX++; *s++; r++;

        /* Repeat, exchanging the meanings of W2/E2 with W4/E4 */
        ZInstruction(1, xorAB, W4, E4, WF, Zeab, F7, F1);
        ZInstruction(1, fna, WF, WF, WF, Zeab, F1, F1);
        ZInstruction(1, selectABonC, W4, E4, WF, Zconst, F1, F1);
        ZInstruction(1, xorC, WF, WF, WF, Zeab, F6, F1);
        Instruction (1, fna, WF, ED, EI, matchAB, F7, F2, *s);
        Instruction (1, selectABonC, W2, WF, E2, Zconst, F2, F5, _CX);
        ReadBytes (+s);
        _CX++; *s++; r++;
    }
    return (post_bytes (r, n));
}

Figure 7.2. B-SYS modulo sequence comparison program.
#define W1_PORT 0x300
#define W2_PORT 0x304
#define W3_PORT 0x308
#define IN_PORT 0x30C

#define opoint(port) asm ("mov dx, port; out dx, ax; ")

#define Instruction(Call, fla, a, b, d, g, c, z, inval) \  \ \ \ \  \\ __AL = ((gPfa) << 8) \ | \ (fka); opoint (W1_PORT); \  \ \ \ \  \\ __AL = (((((b) << 5) \ | \ (d)) << 5) \ | \ (a)) << 1; opoint (W2_PORT); \  \ \ \ \  \\ __AL = inval; \ AL = "AL; \\ __AH = ((((((Call) << 3) \ | \ (Z)) << 4) \ | \ (C); \  \ \ \ \  \\ opoint (W3_PORT); \ asm ("jmp $+2;\});

#define EmInstruction(Call, fla, a, b, d, g, c, z) \  \ \ \ \  \\ __AX = ((gPfa) << 8) \ | \ (fka); opoint (W1_PORT); \  \ \ \ \  \\ __AX = (((((b) << 5) \ | \ (d)) << 5) \ | \ (a)) << 1; opoint (W2_PORT); \  \ \ \ \  \\ __AX = ((((((Call) << 3) \ | \ (Z)) << 4) \ | \ (C)) << 8) \ | \ 0xff; \  \ \ \ \  \\ opoint (W3_PORT); \ asm ("jmp $+2;\});

#define ReadSys(var) \  \ \ \  \\ {asm ("mov dx, IN_PORT; \ in ax,dx; \ var = \ AL;\});

array all functional units to the west of the final character are not performing useful computation. Database search typically involves comparing one string against many other strings (the database) to find interesting or similar sequences, and for this variation of sequence comparison close to 100% processor utilization can be maintained.

To maintain high processor utilization, the systolic period (the time between streaming one sequence and the next through the array) must be reduced to approximately \( n \) from 2\( n \). To achieve this another systolic stream, a reset stream, is introduced to the algorithm. This stream will always have a value of either 0010 or 0110. The FF16 value indicates that a reset should be processed. That is, a reset signal indicates that valid data (data involving the new sequence) is only available upstream from the current functional unit (downstream data involves the exiting sequence). In NSL, this reset stream can be used as shown in Figure 7.5. The reset signal forces a functional unit 'to choose Cost 0d at (d1) as the minimum of d1,1 and d1,0, regardless of the true modulo minimum. The next problem is to ensure that this value is used in the computation of d2. To prevent the old d1,0 from being chosen as d1,0 (as it is when the characters in \( F_j \) match), the null (non-matching) character is passed in the mobile character stream in conjunction with the reset signal. Finally, whenever the reset signal is introduced, the initial weights (d0,0 values) sent to the array are restarted from 0.

The actual program used to test this on the prototype hardware was written in C, with a slightly more efficient reset computation (Figure 7.6). The generate and propagate control signals of A15 and F16 always propagate a carry input \( C = 1 \) to the carry output \( Z \), and will also assert \( Z \) whenever the most significant bit of the reset stream is set. That is, when the reset stream input is FF16, a carry is always generated; without the reset signal, the flag indicating the modulo minimum of d1,1 and d1,0 is left

117
unchanged. The inclusion of the reset stream only requires one additional instruction, an instruction which concurrently shifts the reset stream through the array and modifies the modulo minimum cost selection according to the reset stream's value. Such efficiency would not be possible without the Systolic Shared Register implementation of systolic communication.

In Figure 7.1, two reset signals are sent through the array to preserve the correct systolic phasing, a complication which is avoided in NSL. Also note that each iteration of the bcompare() routine returns the cost of matching the stored sequence against the previous sequence sent into the array. Since only n cell program iterations are executed during each call to bcompare(), the current input sequence is left in the array, having not yet sent any data back to the host. As with the previous program, the raw modulo
#include "nsl.h"
define DEL_COST 1

void
sequence (SStream Char1, SStream Char2, SStream Cost, SStream Reset)
{
    Flag f;
    // set f to bit 7 of the difference
    i = (Cost[i-1] - Cost[i+1]) << 1;
    // Reset = Off0 or Off1. If Reset = Off1, always choose
    // Cost[i-1], otherwise use minimum of costs (mod 264).
    f = (Reset != Off1) || f;
    Cost = select (match (Char1, Char2),
                   Cost,
                   select (f, Cost[i-1], Cost[i+1]) + DEL_COST);
}

Figure 7.5: NSL many-against-one sequence comparison.

output data is processed to extract the true \(d_{m,n}\) value.
This optimized sequence comparison algorithm maintains 100% processor utilization,
though only 99.6% of the computation involves the input strings: the remaining 0.4% of
the computation involves the reset stream (these percentages are based on a 470-element
array). This program can compare 470-character sequences on the B-SYS prototype
system in 16.8 ms per sequence, corresponding to a speedup of 91.2 times faster than
the 80386. Speedup ratios up to this point are displayed in Figure 7.7. A B-SYS* implementation
with 470 processing elements could be expected to perform 470-character sequence comparison over 600 times faster than the 80386.

Performance Analysis
Comparing a 470-element systolic co-processor to a single microprocessor is not exactly fair. True evaluation of the Brown Systolic Array requires a performance comparison to a variety of machines. Fortunately, sequence comparison is a well-studied problem, in particular the four-character, many-against-one sequence comparison problem just described. Core and others have analyzed this problem on the Intel IPSC hypercube, Connection Machine (CM-1), and Encore Multimax.36 Cheever and others have considered a signal processing approach to DNA correlation.26 Lopresti has implemented the dynamic programming method on a variety of machines, including massively parallel and traditional supercomputers as well as systolic co-processors.4,111

The basic sequence comparison algorithm requires \(O(n^2)\) resources, so that the performance of even the most powerful traditional supercomputer (with some small number of powerful central processing units) cannot surpass that of a simple systolic co-processor, even one with a slow bus interface such as the B-SYS prototype or P-NAC. Figure 7.8

119
int bcompare (register unsigned char *a, register unsigned char *b) /* Moving sequence */
{ int n, /* Array size */
  int i; /* Array of b-sys results */

  /* Send 2 reset signals into the array with 0 weights. */
  ZInstruction(1, xorABC, L2, R2, LF, Zemb, F7, F1); /* Compare costs */
  ZInstruction(1, fnA, LF, LF, LF, Zemb, F1, F1); /* Check MSB */
  Instruction(1, fnA, L6, L6, R6, oxAF, F1, F1, oxRF); /* Check reset */
  ZInstruction(1, selectABonC, L2, R3, LF, Zconst, F1, F1); /* sel min */
  ZInstruction(1, xorAC, LF, LF, LF, Zadda, F6, F1); /* Add 1 */
  ZInstruction(1, fnA, L1, R0, R1, matchAB, F7, F2); /* Compare char */
  ZInstruction(1, selectABonC, L4, LF, R4, Zconst, F2, F2); /* ReadBsys */

  for (i = 0; _CX<0; i < n; i++)
  { /* First pass */
    ZInstruction(1, xorABC, L2, R2, LF, Zemb, F7, F1); /* Compare costs */
    ZInstruction(1, fnA, LF, LF, LF, Zemb, F1, F1); /* Check MSB */
    Instruction(1, fnA, L6, L6, R6, oxAF, F1, F1); /* Check reset */
    ZInstruction(1, selectABonC, L2, R2, LF, Zconst, F1, F1); /* sel min */
    ZInstruction(1, xorAC, LF, LF, LF, Zadda, F6, F1); /* Add 1 */
    Instruction(1, fnA, L1, R0, R1, matchAB, F7, F2, *);/* Compare char */
    Instruction(1, selectABonC, L4, LF, R4, Zconst, F2, F2, _CX); /* ReadBsys */
    r++;
  _CX++; z++; /* Second pass */
    ZInstruction(1, xorABC, L4, R4, LF, Zemb, F7, F1); /* Compare costs */
    ZInstruction(1, fnA, LF, LF, LF, Zemb, F1, F1); /* Check MSB */
    ZInstruction(1, fnA, L6, L6, R6, oxAF, F1, F1); /* Check reset */
    ZInstruction(1, selectABonC, L4, R4, LF, Zconst, F1, F1); /* sel min */
    ZInstruction(1, xorAC, LF, LF, LF, Zadda, F6, F1); /* Add 1 */
    Instruction(1, fnA, L1, R0, R1, matchAB, F7, F2, *);/* Compare char */
    Instruction(1, selectABonC, L2, LF, R2, Zconst, F2, F2, _CX); /* ReadBsys */
    r++;
  _CX++; z++; } /* Reconstruct result of the previous sequence from modulo data */
  return (post_bsys (r, n));
}

Figure 7.6: B-SYS code for many-against-one sequence comparison.

120
graphs the performance of several different machines on the many-against-one sequence comparison problem (execution times are per 100 sequences).\textsuperscript{5} The timings for the Cray-2, Multiflow Trace, P-NAC, and Splash array are extrapolated from data on 100 comparisons of 100-character sequences.\textsuperscript{58} For the first two machines, the extrapolation follows an $O(n^2)$ relationship, while for latter two the extrapolation follows an $O(n)$ relationship until the hardware bounds of the machine are reached, at which point execution time becomes $O(n^2)$ because the problem must be partitioned. The B-SYS timings are derived from the experimental results just described and the B-SYS* timings are estimated (both with a combined $O(n)$ and $O(n^2)$ relationship). The Connection Machine timings are more complicated, and are based on configuring the Connection Machine as a set of systolic pipelines of the appropriate length. The timings are from the experimental results presented in Appendix C.

The BioSCAN system has not been included in Figure 7.8 because a working system will not exist for at least another year. It can be expected to perform perhaps five times faster than B-SYS* (assuming that data can be rapidly passed to the BioSCAN chip), though only for sequence comparison without insertions or deletions or gap penalties but with programmable weights. Although potentially faster, it is not in the same league as B-SYS or Splash because it is not generally programmable — it is a more appropriate match to P-NAC.

As can be seen from Figure 7.8, B-SYS* is competitive: it is approximately twenty

---

\textsuperscript{5}The figure is a logarithmic plot; the lines corresponding to the traditional computers have a slope of 2 (indicating an $O(n^2)$ execution time), while the systolic co-processors have a slope of 1 (indicating an $O(n)$ execution time) for sequences of equal or shorter length than the array. Vertical distances between curves indicate speedup factors between machines.

121
times faster than the P-NAC system (indicating both a faster bus interface and a higher-performance VLSI implementation than P-NAC). The Splash programmable hardware system is twice as fast as B-SYS* on sequences shorter than 400 characters, but the higher density of processing elements on the single-board B-SYS* system lets it perform about 1.8 times faster than Splash on sequences with 1500 or more characters. With a redesigned 500-processor B-SYS chip, a 32-chip (16000-element) B-SYS system would perform 2500 times faster than P-NAC and 20 times faster than Splash when $n > 16000$.

Recall also that B-SYS features general programmability: P-NAC's algorithm cannot be varied at all, while slight variations to the Splash algorithm will result in a much lower number of processing elements per Splash board. Programming the machines for 7-bit character comparison will have no effect on the number of available B-SYS functional
units, but the number of systolic cells on the Splash system will drop by one third, moving Splash's limit between $O(n)$ and $O(n^2)$ performance to $n = 250$, increasing the performance factor difference to 2.7 on sequences with 1500 or more characters. More complicated systolic algorithms (which cannot be handled at all by P-NAC or BioSCAN) will produce an even starker contrast between Splash and B-SYS.

The ease of programmability is also a concern. For this problem, P-NAC requires no lines of code; B-SYS requires tens of lines of code, and the Connection Machine and other supercomputers, without automatic support for systolic streams, require slightly longer programs. The Splash system, on the other hand, requires nearly 3000 lines of code to perform 4-bit sequence comparison.\textsuperscript{12} With the development of NSI, the programming and testing of new B-SYS applications requires little additional time beyond that required for selection of an appropriate systolic algorithm.

Thus, in addition to providing a higher concentration of systolic processing power than Splash, B-SYS is much more readily programmable, allowing rapid testing of new algorithms and metrics.

Linear systolic arrays are eminently scalable; the formation of large linear arrays has no inherent limitation, unlike I/O-bound meshes or complex hypercube networks. A ten-thousand-element B-SYS\textsuperscript{*} machine would require approximately seven boards, certainly a reasonable size when one considers the complexity of the Connection Machine. A Splash system, unable to pack systolic cells as tightly, would require almost thirty boards to compare 10,000-character sequences drawn from a 4-character alphabet. The CM-2, however, cannot be easily scaled beyond its 64 K processing elements because its message routers only support 16-dimensional hypercubes. (It is possible to compare long strings on short arrays, but not as efficiently or quickly.\textsuperscript{100})

Thus, execution time alone is not a just figure of merit, and the use of resources, both time and space, should be considered. To solve a specific problem as quickly as possible, the length of a linear systolic array should match the size of the data. Given this match, a specific amount of resources will be required to solve the problem. Figure 1.9 graphs the product of system size and execution time for the massively parallel machines. As a rough estimate of system size, the approximate number of chips in each system has been used. To compare sequences of a specific length, B-SYS\textsuperscript{*} requires 4 times fewer resources than Splash, a factor which will quickly rise as more complicated sequence comparison algorithms are programmed on the two machines. When complete, the BioSCAN system could be expected to have a resource measure many times better than that of B-SYS\textsuperscript{*} due to its high concentration of single-purpose processing elements. A more aggressive VLSI implementation of the B-SYS chip would improve the B-SYS\textsuperscript{*} values tenfold. The Connection Machine, with its large amount of hardware (over 4 K chips), is no longer an attractive solution to the sequence comparison problem: its high complexity overcomes the advantages of its massive parallelism.

It is hoped that this discussion of sequence comparison has left the reader convinced that although B-SYS can be slower than single-purpose systolic arrays on simple problems, the power of programmability more than makes up for the small difference in performance. Having examined sequence comparison performance in great detail, the remaining applications will only discuss implementations on the B-SYS system, for which
cell-program length is an accurate gauge of execution time (each instruction averages 4.35 ms on B-SYS and an estimated 0.54 ms on B-SYS*).

7.1.3 Sequence Comparison with Gap Penalties

Gap penalties are used by computational biologists who feel that the deletion or insertion of any number of nucleotides has an extra cost for the introduction of a gap in the sequence (computational biologists are by no means unanimous about what metrics should be used when comparing sequences). For example, the strings AUAUUC and UUAUUC both have a standard edit distance of 3 from AAC, however the latter breaks the sequence twice in its insertion of three uracil nucleotides, and thus could be considered the more distant from AAC. Applying a gap penalty of 2 will reflect this difference: the
costs would become 5 and 7 with the introduction of the gap penalty, indicating that 
AADUUUC is more similar to AAC than AUUAUC.

In general, molecular biologists assign a higher cost to gap creation than to point 
mutation, but an incrementally lower cost to the continuation of gaps. In evolutionary 
terms, it is difficult (costly) to break a nucleic acid, but once a break has occurred the 
cost of continuing that break is not high. For sequence comparison with gap penalties, 
the cost of deleting n contiguous nucleotides will be

$$\text{gap}(a_i) + \sum_{i=1}^{n} \text{dist}(\phi, a_i),$$

where the penalty for starting a gap by inserting or deleting the character c is \(\text{gap}(c).\)

For nucleic acid research, typical values are a mutation cost 

$$\text{dist}(c_1, c_2) = 1 \quad \forall c_1 \neq c_2,$$

a gap introduction cost \(\text{gap}(c)\) between 3 and 5, and an incremental deletion or insertion 
and an incremental deletion or insertion cost \(\text{dist}(c, \phi) = 0.1\) (when comparing nucleic acids, gap, mutation, and insertion or 
deletion costs are character invariant, leading to the affine deletion cost of \(n \cdot \text{dist}(c, \phi) + 
\text{gap}(c))\).169

Sequence comparison with gap penalties is somewhat complicated and is thus a good 
indicator of the power of the B-SYS Programmable Systolic Array. The recurrence 
equations for this type of sequence comparison are on three variables, \(f, g, h\), and cost 
values are obtained by minimizing these three variables.85

\[
\begin{align*}
    f_{0,0} &= \text{gap}(a_i) \\
    g_{0,0} &= \text{gap}(h_j) \\
    h_{0,0} &= 0 \\
    f_{i,0} &= f_{i-1,0} + \text{dist}(a_i, \phi) \\
    g_{i,0} &= g_{i-1,0} + \text{dist}(a_i, \phi) \\
    h_{i,0} &= h_{i,0} \\
    f_{0,j} &= f_{0,j-1} + \text{dist}(\phi, h_j) \\
    g_{0,j} &= g_{0,j-1} + \text{dist}(\phi, h_j) \\
    h_{0,j} &= h_{0,j} \\
    f_{i,j} &= \text{dist}(a_i, \phi) + \min \begin{cases} 
        f_{i-1,j} + \text{gap}(a_i) \\
        h_{i-1,j} + \text{gap}(a_i)
    \end{cases} \\
    g_{i,j} &= \text{dist}(\phi, h_j) + \min \begin{cases} 
        g_{i,j-1} + \text{gap}(b_j) \\
        h_{i,j-1} + \text{gap}(b_j)
    \end{cases} \\
    h_{i,j} &= \text{dist}(a_i, b_j) + \min \begin{cases} 
        f_{i-1,j-1} \\
        g_{i-1,j-1} \\
        h_{i-1,j-1}
    \end{cases}
\end{align*}
\]
\[
\begin{array}{cccc|cccc|cccc}
 f & \phi & A & C & g & \phi & A & C & h & \phi & A & C \\
 \hline
 0 & 2 & 3 & 4 & 5 & 0 & 2 & 3 & 4 & 5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 U & 4 & 5 & 6 & 7 & U & 4 & 5 & 6 & 7 & U & 4 & 5 & 6 & 7 & U & 4 & 5 & 6 & 7 \\
 U & 5 & 6 & 7 & 8 & U & 5 & 6 & 7 & 8 & U & 5 & 6 & 7 & 8 & U & 5 & 6 & 7 & 8 \\
 U & 7 & 8 & 9 & 10 & U & 7 & 8 & 9 & 10 & U & 7 & 8 & 9 & 10 & U & 7 & 8 & 9 & 10 \\
\end{array}
\]

Figure 7.10: Dynamic programming matrices for gap sequence comparison.

(7.7)

\[
d_{i,j} = \min \begin{cases} 
 f_{i,j} \\
 g_{i,j} \\
 h_{i,j}
\end{cases}
\]

In the computation of a specific \(d_{i,j}\) value, the \(h\) values determine the cost of mutating the character \(a_i\) to \(b_j\), while the \(f\) and \(g\) values determine the cost of introducing or continuing a gap in the \(A\) or \(B\) sequences, respectively. The comparison of \(AAC\) with \(AUAAUC\) using \(gap(c) = 2, dist(c_1, c_2) = 2, \) and \(dist(c, \phi) = 1\) yields the dynamic programming matrices of Figure 7.10.

An RSL program for gap sequence comparison using a gap penalty of 5, insertion or deletion cost of 0.1, and point mutation cost of 1 is presented in Figures 7.11 and 7.12. Because decimals are messy, the costs have been scaled by a factor of 10. In spite of this factor, differences between neighboring costs remain less than 128, so comparison modulo 256 is valid. Again, the modulo sequence comparison data must be processed (in linear time) to determine the true \(d_{\text{final}}\) value. The RSL program produces a 24-line R-SYS program, and gap sequence comparison will take four times longer than basic one-against-one sequence comparison and 3.9 times longer than basic many-against-one sequence comparison on addition of a reset stream.

7.1.4 Alternate Transcription

The alternate transcription problem is a generalization of sequence comparison in which there are several equally acceptable target sequences. One of the earliest studied alternate transcription problems was the matching of a sequence to a context-free grammar, with the motivation being the construction of error-correcting parsers. Wagner surveys the complexity of various formal language error correction problems and proves the context-sensitive case to be NP-hard.\(^{114}\) The redundancy in the meaning and pronunciation of words makes speech recognition another possible application.

Wagner has proposed a dynamic programming algorithm for comparing a regular language and a string.\(^{116}\) Kruskal and Sainioff describe a dynamic programming algorithm for comparing the languages of two finite state automata without feedback, which

\(^{*}\text{This section extends part of the work which lead to the author's master's degree.}^{18}\)

126
they refer to as networks. 86 A network for four code words is displayed at the top of Fig. 7.13. When two or more paths branch out from a node in the network, a choice must be computed. Two or more paths entering a node indicate the minimization operation of picking the best choice from a previous branching. The network version of the sequence comparison recurrence (Equations 2.8-2.11 on page 12) for a string $A$ and network $B$ is:

$$
\begin{align*}
    d_{0,0} &= 0 \\
    d_{i,0} &= d_{i-1,0} + \text{dist}(a_i, \emptyset) \\
    d_{0,j} &= d_{0,j-1} + \text{dist}(\emptyset, b_j) \\
    d_{i,j} &= \min \left\{ d_{i-1,j} + \text{dist}(a_{i-1}, b_j), \\
    & \quad \quad \quad d_{i,j-1} + \text{dist}(a_i, \emptyset), \\
    & \quad \quad \quad \text{and dist}(a_i, b_j) \right\} \\
\end{align*}
$$

The notation $j^*$ (from Kruskal and Sankoff) refers to all the predecessors of the current element $d_{i,j}$ in the second (columnar) coordinate, or all the nodes with edges entering node $j$ of the network. Note that there may be several different $d_{i,j}$ values in the network, corresponding to all nodes in $B$ that are a distance $j$ from the source (root) node. When comparing two networks, the $(i-1)$ terms in Equation 7.11 are replaced by $i^*$, and the first two cases of Equation 7.11 are minimized over $i^*$. The initial conditions for network
```c
#include "nsl.h"

int fgweight (const int n)
{
    return ((n + 30) * 256); // n*dist(c,0)+gap(c)
}

int hweight (const int n)
{
    return (n == 0 ? 0 : fgweight(h));
}
main(void)
{
    const int n = 5;
    int f[n], g[n], h[n];
    Config::setup (n, 16, 8, 2, 1, LINEAR, 0);
    // Two constants of 10 and 30 for mutations and
    // gaps (insert/delete assumed to be 1)
    FixStream Cmutex (INT, 1, 10);
    FixStream Gap0 (INT, 1, 30);
    SStream Seq1 (EAST, NSL_SPEED0, CHAR, NSL_BYTE1);
    Seq1.source (*"seq1", n);
    SStream Seq2 (EAST, NSL_SPEED1, CHAR, NSL_BYTE1);
    Seq2.source (*"seq2", n);
    SStream fWeight (EAST, NSL_SPEED2, INT, NSL_BYTE1);
    fWeight.source (*fgweight, n);
    fWeight.sink (f, n, IDENT, n);
    SStream gWeight (EAST, NSL_SPEED2, INT, NSL_BYTE1);
    gWeight.source (*fgweight, n);
    gWeight.sink (g, n, IDENT, n);
    SStream hWeight (EAST, NSL_SPEED2, INT, NSL_BYTE1);
    hWeight.source (*hweight, n);
    hWeight.sink (h, n, IDENT, n);
    store_code (*"GapSequence", gapsequence, &Seq1, &Seq2,
                &fWeight, &gWeight, &hWeight, &Gap0, &Cmutex);
    execute_code (*"GapSequence", &Seq1, &Seq2,
                  &fWeight, &gWeight, &hWeight, &Gap0, &Cmutex);
    post_process (f, g, h, n);
}
```

Figure 7.12: NSL main program for gap sequence comparison.
Figure 7.13: Network comparison.
comparison are similar to those of the basic algorithm, and the zeroth row or column is a super-source for the network.

Network sequence comparison is rather involved, and an example is in order. Consider a code used by a clan of secret agents that consists of the words SHOBOP, SHOWOPP, DOOBOP, and DOOWOPP. One day, a message with the word COOBWOP is received. Since all messages are of vital importance, it must be decoded. Not finding the word in the code book, the agent must determine which word the writer most likely meant. This is solved with string comparison and network comparison is more efficient than comparing COOBWOP to each of the four words in the code. The network and the associated dynamic programming calculations are shown in Figure 7.13. Looking at the places where minimization occurs (the third letter and the final φ), it is evident which branches were chosen. The closest match to the incorrect code word is DOOBOP, and the edit distance between DOOBOP and COOBWOP is three, with a mutation and a deletion. The minimizing route is indicated with boldface entries.

Protein encoding regions

A code of more pressing concern than that of the secret agents is the genetic code of Table 1.1 on page 3, in which each amino acid is encoded by up to six different codons (sets of three nucleotides). Given a desired amino acid sequence (corresponding to a specific protein), the problem is to determine its similarity to a nucleic acid. Pelto has used an approximate dynamic programming solution131 (an analysis of this approximation has been previously presented by this author143), and neural network solutions have also been proposed.44 Subsequence variations on this problem can locate protein encoding regions within a nucleic acid. This section describes an exact dynamic programming algorithm and its implementation on the B-SYS programmable systolic array.

First, it should be noted that the translations of individual amino acids are independent. Consider the 2-element amino acid sequence of phenylalanine followed by leucine. The translation of phenylalanine to a 3-nucleotide codon in no way affects the possible translations of leucine to a codon. Thus, the basic network has a width of at most six (the highest level of redundancy in the genetic code), as displayed in Figure 7.14. A careful analysis of the genetic code, however, can simplify the basic network considerably. In all
Figure 7.15: Uniform comparison network for phenylalanine—leucine.

Figure 7.16: Wildcard comparison network for phenylalanine—leucine.

cases, there is at most one pair of 2-nucleotide prefixes for every codon encoding of an amino acid. Thus, the initial segments of an amino acid network can be compressed, as seen in Figure 7.16.

Further simplification of the phenylalanine and leucine network is achieved with the wildcard encoding commonly used for nucleotides. Four bits are used to represent each nucleotide, encoding them as $A = 1000$, $C = 0100$, $G = 0010$, and $U = 0001$. Two characters match if they have any corresponding asserted bit positions, as was checked in previous examples with the B-SYS match instruction. Thus, the wildcard character 0101 will match both $C$ and $U$ (called the pyrimidines), 1010 will match both $A$ and $G$ (the purines), and 1111 will match all four nucleotides. Using wildcards, the network is further compressed to have a maximum width of 2, as illustrated in Figure 7.16. A close examination of Table 1.1 on page 3 will reveal that all amino acids can be represented with width-2 wildcard networks.

Implementation of this algorithm on a programmable systolic array is simple. The two sequence comparison tracks of the network are evaluated sequentially. After the completion of each pair of node updates, every third processing element (corresponding to the last nucleotide of a codon) selects the minimum of its two tracks to pass to the next processing element. An NSL cell program for this algorithm can be seen in Figure 7.17. Note that it uses the basic sequence comparison cell program seen earlier, and could just as easily use a reset stream as well. Local cost differences are less than 128, so single-byte modulo comparison is used. The main program (not shown) will generate and store the wildcard network in the fixed Charia and Charib systolic streams. The nucleotide sequence is passed through the array in the mobile Char2 stream. The fixed $T_{third}$ stream identifies every third processing element for the conditional minimization. As usual, the
#include "nl.h"
define DEL_COST 1
void
sequence (SStream¹ Char1, SStream¹ Char2, SStream¹ Cost)
{
    Flag f;
    f = (Cost[-1] - Cost[1]) << 1;
    Cost = select (match (Char1, Char2),
                  Cost,
                  select (f, Cost[-1], Cost[1]) + DEL_COST);
}
void
alt_trans (SStream¹ Charia, SStream¹ Chari2, SStream¹ Costa,
          SStream¹ Charib, SStream¹ Costb,
          SStream¹ Third);
{
    sequence (Charia, Chari2, costa);
    sequence (Charib, Chari2, Costb);

    Flag f;
    // Third is a FinalStream, = 0xff for every third PE.
    force_mask (Third := 0);
    r = (Costa.out() - Costb.out()) << 1  // Min mod 256
    Costa.out() = Costb.out() = select (f, Costa.out(), Costb.out());
    force_mask (1);
}

Figure 7.17: NSL amino acid encoding comparison.

movement of all systolic streams is not processed until the entire cell program (in this case, alt_trans()) has been evaluated. A B-SYS cell program for this problem has 16 instructions, requiring only 2.7 times more time than the simple sequence comparison program.

This example highlights the major steps of systolic algorithm development. First, the network algorithm was developed and refined for this particular problem, leading to the simple computational requirements of the sample cell program. Optimization of this program for the B-SYS hardware, making full use of the Systolic Shared Register architecture, lead to a cell program with no overhead: each instruction performs a required operation and there are no instructions devoted solely to data movement.

7.1.5 Protein Matching

As for nucleic acid analysis, sequence comparison is vital for protein analysis. Proteins tend to be shorter than nucleic acids, with lengths in the thousands or tens of thousands of amino acids. This section considers a B-SYS program for the protein scanning algorithm implemented by BioSCAN. It is properly called an alignment or diagonal scan
algorithm since it does not allow insertions or deletions; when comparing two sequences \( A \) and \( B \) of lengths \( m \) and \( n \) (\( m > n \)), the cost of matching \( B \) to a subsequence of \( A \) is determined. The diagonal scan information may be statistically analyzed to determine an appropriate matching between the sequences. Although not as rigorous as full-fledged sequence comparison, this alignment algorithm is a valid filter for databases searches.

The BioSCAN system can align sequences over any 28-character alphabet, but was particularly designed for protein sequence analysis. BioSCAN implements the recurrence:

\[
\begin{align*}
\sigma_{0,0} &= 0 \\
\sigma_{i,0} &= 0 \\
\sigma_{0,j} &= \sigma_{i-1,j-1} + \text{dist}(a_i, b_j) \\
\sigma_{i,j} &= \min(\sigma_{i-1,j-1}, \sigma_{i,j-1}, \sigma_{i-1,j}) + \text{dist}(a_i, b_j).
\end{align*}
\]  

(7.12) (7.13) (7.14) (7.15)

This is a subsequence search: the system is used to find regions in a long string (\( A \)) which are similar to the entire shorter string (\( B \)). Thus, there is no penalty for starting at any position in \( A \) as is reflected by Equation 7.13. However, to find a complete copy of \( B \), the \( \sigma_{0,0} \) initialization values from the simple sequence comparison recurrence are maintained. The region of highest similarity to the \( B \) string is located by minimizing all \( \sigma_{i,n} \) values, since the subsequence alignment is allowed to end at any point within the \( A \) string. In general, \( A \) is a collection of sequences from a database and \( B \) is the sequence of specific interest.

In contrast to nucleic acid sequence comparison, protein sequence comparison does not use a single mutation cost. Some amino acids mutate less readily than others and are assigned correspondingly higher mutation costs. Thus, a few-hundred-element (20 \( \times \) 20) table of mutation costs among the 20 amino acids is required. Biologists most often use the PAM-200 cost table proposed by Dayhoff and others when comparing proteins. These costs are scaled logarithms of the probabilities of point mutations from one amino acid to another. That is, for a pair of amino acids, if it were experimentally determined that the first mutates to the second with some low probability \( p \), a value proportional to \(-\log(p)\) represents the high cost of matching the two amino acids (since a match would require an unlikely mutation to take place). The summing of these logarithmic weights corresponds to the multiplication of probabilities as the diagonal scan is computed.

For the systolic algorithm that assigns characters from one sequence to individual processing elements, each processing element need only store the twenty costs corresponding to a single column of the table. The corresponding character does not need to be stored; the 20 costs are sufficient. Unfortunately, each B-SYS register bank has only 16 registers. The solution to this problem is to couple pairs of functional units to represent each character in the \( B \) sequence, and as it turns out this will produce a very efficient algorithm: each pair of functional units will concurrently locate the correct cost, fully utilizing all functional units and memory banks. If many processing elements are available, the characters can be divided among three or more cells for an even faster algorithm.

An NSL program for protein alignment is displayed in Figures 7.18 and 7.19. The
#include "ns1.h"

void

psequence (SStream Char, SStream Cost,
SStream Table, SStream Base)
{

Register TrialCost (INT, 1, EAST);
Register TrialChar (CHAR, 1);

TrialCost = 0;
TrialChar = Base;

// Search for a match
int i = 0;
do {
    TrialCost = select (TrialChar == Char,
                      Table.in().byte(i), TrialCost);
    TrialChar++;
} while (i < 10);

// Combine costs to give both PEs same cost.
force_mask (Base == 0); {
    TrialCost = TrialCost + TrialCost.opposite_reg();
    TrialCost.opposite_reg() = TrialCost;
} force_mask (1);

// Shift streams twice
Cost += TrialCost;
Cost = Cost;

Char1 = Char1;
Char1 = Char1;
}

Figure 7.18: NSL protein alignment cell program.

program exhibits several new NSL functions. Each pair of adjacent processing elements is assigned one character, and each member of the pair is assigned half of the appropriate cost table. Which set of 10 costs a functional unit is responsible for is indicated by the base stream. The characters are numbered consecutively from 0 to 20, and the base stream places values of 0 and 10 in adjacent functional units. The loop of Figure 7.18 increments the base number and checks for a character match. If they do match, the appropriate cost from the table is stored in the TrialCost register — the byte() function accesses a specific byte from a multi-byte NSL register. Since the character will only match in one of the two neighboring functional units, only one TrialCost is non-zero for each pair of functional units at the end of the loop. The two TrialCost values are added together by the leader of the pair (in this program, the eastern member of the pair) and the result is then copied to the TrialCost register of both members of the pair: since
#include "nsl.h"
const int n = 200;
int weights[200][200];
int fixed_string[n];
int base (const int n)
{
    return (n % 2 == 0) ? 10 : 0;
}
int tablecolumn (const int n)
{
    return (weights[fixed_string[n/20]][n/20]);
}
main(void)
{
    int d[2*n];
    Config::setup (2*n, 16, 8, 2, 1, LINEAR, 8);
    FixStream Base (INT,1);
    Base.source (base, n);
    read_file_into_array ("seq1", fixed_string, n);
    read_file_into_array2 ("cost_table", weights, 20, 20);
    SSStream Seq1 (FAST, NSL_SPEED1, CHAR, NSL_BYTE1);
    Seq1.corre_weave (NOWAVE);
    Seq1.source ("seq2", n);
    // Default source is 0
    SSStream Cost (FAST, NSL_SPEED1, INT, NSL_BYTE1);
    Cost.corre_weave (NOWAVE);
    Cost.sink (d, 2*n, IDENT, 2*n);
    FixStreamTable (INT, 10);
    Table.source (tablecolumn, n);
    store_code ("ProteinSequence", psequence, &Seq1,
                &Cost, &Table, &Base);
    execute_code ("ProteinSequence", &seq1,
                  &Cost, &Table, &Base);
    post_process (d, n);
}

Figure 7.19: NSL protein alignment main program.

135
TrialCost is stored in register $E_2$ (for example) of $F_1$, $F_0$'s TrialCost is stored in $F_1$'s $W_2$ register, referred to by the opposite_reg() function of the Register class. This value is then added to the cost data stream. Both mobile data streams are then shifted twice to pass data to the next pair of processing elements (this would not be required with NSL implementation enhancement number 3 on page 110).

In the main program (Figure 7.19), initialization functions for the base stream and the cost table are declared. Also, weaving is disabled for the mobile cost and sequence streams. Normally, since the streams are referred to after they are set during the shifting process, NSL would weave both streams. Since this algorithm requires the streams to be shifted twice, weaving must be disabled for proper code generation. Finally, as usual, the output data is processed to recover the exact diagonal score from the modulo data.

The cell program for protein alignment has 41 lines, and requires 204 ms per iteration on B-SYS or 25.4 ms per iteration on B-SYS².

### 7.2 Matrix Multiplication

Matrix multiplication and the generalized algebraic path problem are the basis of many problems: transitive closure, all-pairs shortest path, and the dynamic programming recurrence for optimal parenthesization and RNA folding. Multiplication of $n \times n$ matrices is not a problem normally associated with linear systolic arrays; algorithms for square and hexagonal meshes are common, and provide an $O(n)$ execution time on $O(n^2)$ processing elements. Nevertheless, linear systolic arrays are an attractive solution to matrix multiplication because of their fixed I/O requirements. The mesh algorithms, as can be inferred from their asymptotic processing time, require $O(n^2)$ inputs per time step, a potential problem for sequential storage devices. On the other hand, linear systolic arrays have a fixed number of inputs or outputs regardless of the length of the array. Linear systolic solutions to the matrix multiplication problem require either $O(n)$ processing elements, each with $O(n)$ words of memory, or $O(n^2)$ processing elements, each with $O(1)$ words of memory; these bounds are a result of the complex data dependencies which require data streams with $O(n)$ delay element.

This section considers the general $O(n^2)$ matrix multiplication algorithm proposed by Ramakrishnan and Varma which is independent of the size of the input array. For any given value of $n$, it is possible to derive simpler systolic algorithms, one without such complex control, following the methods of Lee and Kedem. The algorithm presented herein, however, does not require remapping to the systolic array for each different input size, though additional processing elements are required. The algorithm will be given with minimal explanation; the reader is referred to the previously cited article for further information.

Figure 7.20 illustrates the basic systolic cell for this algorithm. In addition to the $A$, $B$, and $C$ matrix streams, there are two control streams and a second stream associated with the $A$ matrix. As illustrated, the $B$ and the Cutil streams move at a speed of 2, while the remaining streams move at one processing cell per step. The $A$Valid stream is either equal to the $A$ stream or to zero. When the control signals activate an $A$ value, it is
Figure 7.20: Systolic transitive closure cell.

```c
#include "msl.h"

void
inner (Stream & AValid, Stream & B, Stream & C,
       Stream & A, Stream & Ctrl1, Stream & Ctrl2)
{
    force_mask (Ctrl1 = Ctrl12);
    AValid.in() &= Ctrl1;  // Deactivate A input if required.
    use_mask (FALSE);     // Stop using mask
    C = C | (AValid & B);
    use_mask (TRUE);
    AValid.in() = & & Ctrl1;  // Possibly copy to active stream.
}
```

Figure 7.21: Transitive closure cell program.

transferred to the AValid track, and is later removed when the control signals deactivate
the value. Activation occurs whenever a Ctrl1 and Ctrl2 activate signal meet in the
same processor (FF is used in the NSL implementation), and similar conditions deactivate
the AValid stream (when 00 is present in both control streams). The basic algorithm is:

```c
if (Ctrl1 != 0xff && Ctrl2 != 0xff)
    Copy A to AValid
else if (Ctrl1 != 0x00 && Ctrl2 != 0x00)
    Clear AValid
else
    Output C = C | (A&B)
```

This is illustrated by the cell program of Figure 7.21, with a few slight modifications for
efficiency. NSL converts this cell program into a simple 8-line B-SYS program with two
phases. The main program (Figure 7.22) is straightforward, excepting the routines for
correctly timing the inputs and outputs.

In spite of the simple cell program, the macroscopic data movement of this algorithm

137
is quite complex. A linear array of size
\[ n^2 + (n - 1) \cdot \left( \left\lfloor \frac{n}{2} \right\rfloor + 1 \right) + 2 \]
is required for the algorithm; as mentioned, there are more efficient mappings of matrix multiplication or transitive closure to linear systolic arrays for specific values of \( n \). Lee and Kedem use as an example \( n = 4 \) matrix multiplication which, for their mapping, requires ten processing cells. The Ramakrishnan and Vaman algorithm requires 27 cells when \( n = 4 \), but has the advantage of data movement which is independent of problem size. Ideally, since the major area requirement of this algorithm is for delay elements, a mapping should be chosen which uses as much memory as possible in each processing element to keep the leading coefficient of the \( O(n^2) \) processor requirement as low as possible.

The spacing of inputs and outputs for the co-processor depends on the size of the matrices being multiplied. An inefficient implementation of these equations is displayed in Figure 7.23. Each cell iteration index is broken down into the proper \( i \) and \( j \) values for the matrix indices. The important thing to note is the ease with which the complex algorithm description has been converted into a systolic program. Although further analysis of the algorithm could lower the overhead associated with calculating each input, an NSL program for this application can be generated by copying the algorithm description with minimal change. Thus, using NSL published algorithms can be implemented with
const int n=8;
const int r = n/2;
const int array_size = (n+1) * r + n-1 * 2;
const int maxtime = array_size - 1;
int h[n][n];
int C[n][n];

#define inside(x,y,z) ((x) >= 0) && ((x) < n))

int control (const int t)
{
    if (t<n) return inside(t+(r-1)/2) && inside(t+(r-1)/2) + maxtime;
    // Activation signal
    else if (t>=n) return inside(t+(r-1)/2 + maxtime) && inside(t+(r-1)/2 + maxtime) + maxtime;
    // Deactivation signal
    else
    return 0x01;
    // Don't care signal
}

int control (const int t)
{
    if (t >= 3*n) return inside(t+(3*n) + n) + 0x01;
    // Activation signal
    else if (t<n) return inside(t+(3*n-2)) && inside(t+(3*n-2) + maxtime) + 0x02;
    // Deactivation signal
    else
    return 0x02;
    // The don't care signal
}

int input (const int t)
{
    int j = (t - (n+1)- (r-1)) % n;
    int i = (t - (n+1)- (r-1)) / n - j;
    if ( (j+2) + (n+1)+ (3*n-2) + (j+2)) % inside (j, 0, n) % inside (j, 0, n) % inside (j, 0, n) % inside (j, 0, n) % inside (j, 0, n) % inside (j, 0, n)
    return 0x11;
    else
    return 0;
}

int output (const int t)
{
    int j = (t-1) / 3 * (n+1);
    int i = (t-1) / 3 * (n+1) + 1;
    if ( (j+2) + (n+1)+ (3*n-2) + (j+2)) % inside (j, 0, n) % inside (j, 0, n) % inside (j, 0, n)
    return 0x11;
    else
    return 0;
}

void Conjugate advant int t, const int val)
{
    int j = (array_size - t + 23) % 3*n;
    int i = (t - array_size + j - 2) / 3*n;
    if ( (j+2) + (n+1)+ (3*n-2) + (j+2)) % inside (j, 0, n) % inside (j, 0, n) % inside (j, 0, n)
    O[i][j] = val;
}

Figure 7.23: Transitive closure I/O functions.
<table>
<thead>
<tr>
<th>Application</th>
<th>Lines</th>
</tr>
</thead>
<tbody>
<tr>
<td>8-bit sorting</td>
<td>3</td>
</tr>
<tr>
<td>40-bit setting</td>
<td>15</td>
</tr>
<tr>
<td>8-bit polynomial evaluation</td>
<td>34</td>
</tr>
<tr>
<td>24-bit polynomial evaluation</td>
<td>166</td>
</tr>
<tr>
<td>Sequence comparison</td>
<td>6</td>
</tr>
<tr>
<td>Sequence comparison, reset</td>
<td>7</td>
</tr>
<tr>
<td>Sequence comparison, affine deletion cost</td>
<td>24</td>
</tr>
<tr>
<td>Sequence comparison, affine, reset</td>
<td>27</td>
</tr>
<tr>
<td>Sequence comparison, subsequence location</td>
<td>6</td>
</tr>
<tr>
<td>Protein encoding region, 1 cell per location</td>
<td>16</td>
</tr>
<tr>
<td>Protein encoding region, 2 cells per location</td>
<td>12</td>
</tr>
<tr>
<td>Protein encoding region, reset</td>
<td>18</td>
</tr>
<tr>
<td>Protein encoding region, affine</td>
<td>52</td>
</tr>
<tr>
<td>Protein encoding region, affine, reset</td>
<td>54</td>
</tr>
<tr>
<td>Protein alignment, 2 cells per character</td>
<td>41</td>
</tr>
<tr>
<td>Protein alignment, 3 cells per character</td>
<td>28</td>
</tr>
<tr>
<td>Protein alignment, 4 cells per character</td>
<td>24</td>
</tr>
<tr>
<td>General transitive closure</td>
<td>19</td>
</tr>
<tr>
<td>Data compression</td>
<td>36</td>
</tr>
<tr>
<td>Data decomposition</td>
<td>41</td>
</tr>
</tbody>
</table>

Table 7.2: B-SYS cell program lengths for various applications.

little modification.

7.3 Conclusions

Cell program lengths for several of the algorithms discussed in this chapter are presented in Table 7.2; the data compression and decompression lengths are based on Xu’s B-StM implementation of Storer and Reif’s algorithm. ASCII data compression is a combinatorial problem well suited to the Brown Systolic Array because only small amounts of memory and simple operations are required by the algorithm.

The software control of resources highlighted in the presentation of software fault detection methods (Section 5.3) is also applicable to general programming. In particular, extra processing elements can be readily used to perform protein encoding region searches and protein alignment computations more quickly. Programmable systolic arrays allow the user to decide the allocation of processing elements to a problem in addition to the algorithm to perform.

The Brown Systolic Arrays has proven its worth as a programmable systolic co-processor suitable for sequence comparison and other combinatorial applications. As the examples of this section have illustrated, a wide variety of sequence comparison algorithms can be implemented on the B-SYS system, a task greatly simplified by the New Systolic Language. Algorithms for simple sequence comparison, multiple sequence comparison, affine insertion and deletion costs for the introduction of gaps, alternate

140
transcription, and protein alignment, as well as subsequence and longest common subsequence searches, can be quickly and efficiently implemented on the Brown Systolic Array. Unlike single-purpose systolic co-processors, B-SYS can be programmed for different cost metrics depending on the problem being solved, making it an invaluable tool for experimental sequence analysis. Unlike programmable hardware such as Splash, the generation and modification of B-SYS programs does not require a considerable amount of time and usually does not lower the systolic cell density. As was seen in the performance analysis of many-against-one sequence comparison, the Brown Systolic Array has the best chip-second resource value of all the systems considered, the one potential exception being BioSCAN, a currently unfinished system which will not be generally programmable.
Chapter 8

Conclusions

Very large scale integration and wafer scale integration enable the placement of hundreds and thousands of processing elements on a single chip. Systolic arrays are one of the most efficient means of organizing these million-transistor chips; with their regular replication of simple cells, systolic arrays can outperform even the most powerful traditional supercomputer. This thesis has considered the design, implementation, and use of a programmable systolic array co-processor for combinatorial and other applications.

The design of any hardware system must evolve from a close examination of algorithms, hardware, and programming languages. The examination of systolic algorithms located the class of combinatorial problems as worthy of systolic solution. These problems require only simple operations and small amounts of memory, making them ideal for high-density VLSI solution. The study of systolic hardware found three approaches to systolic parallelism: the single-purpose array, the bit-serial SIMD array, and the systolic multiprocessor. Single-purpose arrays lack general programmability, restricting each one's use to a single hardwired algorithm. Bit-serial arrays generally have excessive memory for combinatorial problems and poorly designed systolic communication. Systolic multiprocessors are difficult to form into massively parallel systolic arrays because of their large processing elements, each with its own program store and instruction sequence. As with the bit-serial arrays, systolic multiprocessors lack a sufficiently concise expression of systolic communication. In the review of systolic programming languages and design systems, it was discovered that no language is particularly suited for systolic co-processor programming. Systolic design systems provide a clean separation of systolic cell function and systolic data movement but are too abstract for general use. Low-level systolic languages force the user to consider irrelevant implementation details of the target machine. Thus, is systolic hardware and programming systems, there existed a gap which this thesis has filled for a specific class of systolic algorithms.

The analysis of existing systolic hardware led to the introduction of the Systolic Shared Register architecture. This architecture combines the efficiency of special-purpose systolic communication with the versatility of a programmable systolic array. The four defining attributes of the SSR architecture are:

- SIMD broadcast instructions for control
• regular interconnections for systolic data flow

• shared registers for communication and computation

• data streams for programming.

Broadcast instructions and regular interconnections simplify design and control, enabling the construction of high-density arrays. Shared registers elegantly implement systolic communication by separating functional units and register banks, breaking apart the traditional definition of a systolic cell. Systolic data streams ease the task of programming any systolic co-processor, combining the neatness of mathematical mapping with the flexibility of a real language. Although most work presented in this thesis concerns linear systolic arrays, it was shown that the SSR architecture provides efficient systolic communication for arrays of any topology.

Rather than leave the SSR design on paper, the Brown Systolic Array was designed, simulated, and fabricated in a 2 micron CMOS process. Its architecture reflects the targeted combinatorial applications, in particular the sequence comparison problems of molecular biologists. Thus, it has an 8-bit word and 16 registers per shared register bank. The 6.8 mm × 6.9 mm, 84-pin, 85000-transistor chip worked on first fabrication; the success of such a complicated single-person VLSI design can be credited both to the simplicity of the SSR architecture and to the exhaustive and rigorous multi-level architecture and design simulation.

As with any VLSI implementation, several of the 24 fabricated chips had faults. The SSR architecture led to a fast test procedure for the B-SYS chips; functional units and register banks were quickly screened for errors before the construction of the 470-element B-SYS prototype system. This systolic co-processor can perform 470-character many-against-one sequence comparison over 80 times faster than its Intel 80386 host, and board modifications could magnify the speedup to 600. An aggressive redesign of the B-SYS chip could place 500 functional units and register banks on a single chip, producing an even more powerful systolic co-processor.

While fault testing is performed as a system is being constructed, fault detection takes place during array operation. For programmable systolic arrays, in particular those of the SSR design, software fault detection is the best general means of defending against transient faults. Fault-tolerant programs with little overhead can be automatically generated with the skewed replicated computation stream and interleaved replicated computation stream methods. Unlike hardware fault detection, these methods require no special circuitry and can be eliminated or strengthened as needed. Unlike algorithmic fault detection, these methods can be used with arbitrary programs.

After a system has been designed, tested, and constructed, thought must turn to its use. This thesis has proposed the New Systolic Language as a general framework for systolic programming. In addition to greatly simplifying B-SYS programming, NSL is a firm foundation for programming all types of systolic co-processors and simulators. Drawing on the merits and problems of existing systems, the New Systolic Language painlessly separates systolic cell function and macroscopic data movement to simplify the task of systolic programming.
With the New Systolic Language as a base, several B-SYS applications have been considered, including simple sequence comparison, sequence comparison with gap penalties, alternate transcription, protein alignment, and transitive closure. An analysis of 4-character sequence comparison on B-SYS and other supercomputers and co-processors lead to the conclusion that B-SYS does satisfy its design goals: it is a programmable systolic co-processor which greatly outperforms all other systems on target applications. Surprisingly, linear systolic arrays such as B-SYS can also perform traditionally 2-dimensional problems such as transitive closure. Although the transitive closure example was not the most efficient use of a linear systolic array, it illustrated the directness of coding a published algorithm in NSL.

Throughout this thesis, many directions for future research have been mentioned. The research area of most concern is the continued extension of the SR architecture and NSL programming language to 2-dimensional systolic arrays. In particular, the implementation of hexagonal- and octagonal-mesh Systolic Shared Register machines is an important goal. Because of the limited I/O requirements of linear systolic arrays, shared registers were ideal for the Brown Systolic Array. The pin limitations of current technology will lead to interesting design tradeoffs when constructing a 2-dimensional VLSI Systolic Shared Register machine. In conjunction with this, an NSL implementation for both linear and mesh systolic arrays should be developed. Such work will bring up questions concerning the efficient mapping of algorithms to general classes of systolic co-processors as well as the concise expression of machine-independent systolic programs.

This thesis has developed a three-pronged approach to hardware design. Critical evaluations of existing algorithms, hardware, and programming languages led to the discovery of many inefficiencies and needless complexities. These were stripped away, producing the architectural requirements for combinatorial algorithms, the Systolic Shared Register architecture, the Brown Systolic Array, and the New Systolic Language. The combination of these three aspects of systolic co-processing forms a solid foundation for the creation of a unified systolic co-processing environment that integrates systolic co-processor algorithm development, programming, simulation, animation, and execution.
Appendix A

B-SIM User’s Guide

The Brown Systolic Array simulator is located in the /pro/cad/syssim directory. This appendix presents an overview of the simulator, though the best documentation of the simulator is in the example programs of the simulator directory.

As illustrated by Figure A.1, there are three major parts to B-SIM, or two without the Tango animation extensions. In the figure, boxes indicate programs or processes, italics represent control and data, and the remaining text refers to file inputs and outputs required by the simulator. The loopsym program processes command line options as well as two program files. The first program file (.init) is executed once at the start of the simulation, and the second (.instx) is executed as many times as is required by the command line arguments. In addition to the size of the array and the length of the simulation, the command line can specify several input files. These are referred to by the read field of a B-SIM instruction, as was seen in Figure 4.4 on page 54. Default values for input files can also be specified on the command line.

The actual simulator is run as a separate UNIX process by the loopsym program, much in the manner that a host machine controls a co-processor board. Inputs and instructions are passed to the simulator while outputs are retrieved from the simulator and sent to files by the loopsym program. Apart from the standard B-SYS instructions, there are several control directives which can be passed to the simulator (they are identified by ‘impossible’ flag addresses). Thus, comments can be passed to the simulator, and the simulator can be instructed to write the array’s entire state to the sym_output file. This file may then be processed by the txf filter, which will produce TDFX snapshots similar to those of Figure 4.5 on page 55. A .convert file can be used to provide encodings of register values as signed integers, unsigned integers, or special symbols.

The separation of simulation and simulation control will simplify the task of creating new front-ends for the simulator and the physical array. An NSI interface to the simulator is an obvious next step.

The Tango version of the simulator uses Field’s message mechanism to pass simulation information to the Tango process. Several additional control directives can be used to shade specific registers or to temporarily halt the Tango animation. A slightly modified version of Tango (/pro/cad/syssim/src/tango/tango) must be used for B-SIM.
animations. The sysim/taort example illustrates the use of Tango in conjunction with the simulator. The Tango program must be executed from the tango directory with keys as an argument before using make to run the simulation. It is advisable to not run a window manager concurrently with the Tango animation.

Simulator execution is controlled with UNIX makefiles, examples of which may be found in the sysim example directories.
Appendix B

B-SYS User’s Guide

The Brown Systolic Array prototype board has been described in Section 5.2.2. This appendix identifies some of the practical issues involved in executing programs on the B-SYS prototype system.

As mentioned, the prototype system is controlled by both the PC, which sends instructions and data to the board, and the HP 16520A pattern generator, which performs instruction clocking. To commence operation, the user must ensure that the pattern generator’s D6 output pod is connected to header RA on the B-SYS board (Figure 5.2 on page 80), the pattern generator’s W6 input on pod D1 is connected to pin P11 on the prototype board (the complemented select line for address SOC), and that the “IBM2” configuration is loaded by the logic analysis system. The pattern generator may then be placed in continuous execution mode. A short pattern generator program watches the address select line and triggers a cycle of clock pulses whenever the third instruction word is accessed. The “IBM2” configuration allocates timing and state analyzer input pods for use with the prototype board, as can be investigated with the EP 16520A menus.

Programs associated with the B-SYS system are located in the C:\BSYS directory on the Intel 80386 PC. Test programs, written in a combination of assembly language and C, are located in the \BSYS\TEST directory, while several more complicated programs can be found in the \BSYS\C directory. The test routines, documented in the file README.DOC, follow the tests described in Section 5.2.

Typically, tests are invoked with the number of processing elements as an argument, and return no output if B-SYS passes the test. If B-SYS fails the test, the results are printed out for fault identification. Often, there are versions for performing tests from both sides of the array. For example, the aa55c and aa55l tests perform the InOut test of Section 5.2 from the right (east) and left (west) sides respectively. In addition to programs for performing the tests previously mentioned, a simple sorting routine is also available. It is suggested that the aa55 and sorting programs be run before any use of the B-SYS prototype board. The former will quickly identify the location of any chip seating errors, while the latter provides a more complete check on the system.

Programs in the style of Figure 7.2 on page 116 may be written using the nopc.h

149
macro definitions (Figure 7.3 on page 117) located in the \BYSYS directory. The `compare program performs modulo sequence comparison with reset and compares performance with the host machine. If the results match, only the algorithm timings are displayed, otherwise the final column of the C dynamic programming matrix and the B-SYS outputs (which should correspond) are presented for analysis.

Although not as efficient as possible, the B-SYS prototype system enables the exploration of the inherent power of massively parallel programmable systolic arrays.
Appendix C

Sequence Comparison on the Connection Machine

This appendix describes the C* Connection Machine program used to generate the CM-2 data used in Section 7.1.2. For 100 x 100-element sequence comparison, the results of Lopresti’s work were generated by performing each of the 100 complete comparisons on a single Connection Machine processing element, effectively turning the Connection Machine into a distributed processor.10,11 Unless the database has as many sequences as the Connection Machine has processing elements, this algorithm will waste much of the vast processing power of the Connection Machine.

It is more efficient to configure the Connection Machine as a mesh of systolic pipelines. For the 90 x 100 case, this will not produce a factor of 100 speedup due to the overhead of passing data between processing elements, however the pipelined algorithm does run over 27 times faster than the distributed version on 16K processing elements (0.17s versus 4.7s).

The driving routine for CM-2 sequence comparison is shown in Figure C.1. It selects a comparison routine based on the length of the sequences: numbers sufficient for the highest possible cost are used. (Since the final minimisation over the database is done in the CM-2, true costs must be maintained in the end processing element of each pipeline, so modulo comparison is not used.)

The Connection Machine features virtual processing elements far beyond the physical capabilities of the machine, thus a large number of pipelines can be processed at once, as is indicated by the computation of the pipes variable in Figure C.1. After an appropriate mesh is specified, the database strings are loaded into a parallel variable and the target string is compared with the database.

As can be seen, the mesh specification is closely tied to the physical hardware: both mesh dimensions must be a power of two and the size of the mesh must be a power-of-two multiple of the number of physical processing elements. Experimentation with various mesh layouts found that excess processing elements should be assigned to excess pipelines: the length of the systolic pipelines should be kept as small as possible. Note that with at least 128 systolic pipes allocated, the program could compare 128 sequences.
<table>
<thead>
<tr>
<th>Length</th>
<th>Time (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.023</td>
</tr>
<tr>
<td>20</td>
<td>0.041</td>
</tr>
<tr>
<td>50</td>
<td>0.090</td>
</tr>
<tr>
<td>100</td>
<td>0.17</td>
</tr>
<tr>
<td>200</td>
<td>0.46</td>
</tr>
<tr>
<td>500</td>
<td>2.32</td>
</tr>
<tr>
<td>1000</td>
<td>9.84</td>
</tr>
</tbody>
</table>

Table C.1: CM-2 100 sequence comparisons on 16384 processing elements (best time).

at about the same speed as 100 sequences. The complex constraints on grid size are a further indication that C^* is a low-level programming language by the definitions of Section 2.3.

The sequence comparison cell program is shown in Figure C.2. The routine is divided into two parts, depending on whether or not the sequence has entered the pipeline. The algorithm closely follows that described previously. (The "<" operation performs minimization.) Experimental results for several sequence lengths are displayed in Table C.1.

There are a few improvements which could speed the algorithm, in particular for long sequences. First, four-bit characters could be used. Since the majority of execution time is taken up by the grid communication, reducing the amount of data flowing between virtual processing elements is vital. This optimization would be best performed in Paris, which allows the specification of arbitrary-length data. In the C^* environment, the introduction of 4-bit characters would be tedious. For sequences with 126 or fewer characters, this would reduce grid communication by 25%.

When comparing long sequences, communication between processing elements is dominated by the transmission of cost data via 32-bit integers. Although it is important for the end processing elements to maintain true values at all times (for the final minimization across the entire database), modulo sequence comparison can still be used. If communication were restricted to 8-bit integers, reconstruction could be performed every sixty-third cell program iteration. Since local cost differences are at most 2, the values \( d_{c} \) and \( d_{a} \) cannot differ by more than 126, enabling modulo reconstruction of the true \( d_{a} \) value from \( d_{c} \) and an 8-bit difference. For long (32,768-character) sequences, such a change would cut communication by 75%, reduce the cell program length, and improve performance by perhaps fourfold.
#include <sim.h>
extern int load_file();
extern void compare_strings();
extern void compare_strings_char();
extern FILE *tgtfile;
extern FILE *dbfile;

void
compare (int n1, int n2, FILE *tgtfile, FILE *dbfile) {
    int pipes, plev, i;
    int best_match, best_cost;
    char *tgt;
    void (*refunction())();
    shape (*shape)[] ManyPipes;

    if (n1 < 128)
        refunction = compare_strings_char;
    else if (n1 < 32768)
        refunction = compare_strings_short;
    else
        refunction = compare_strings;

    /* Must maintain powers of 2 for plev and pipes, and */
    /* their product must be a multiple of the machine size. */
    for (pipes = 2; (pipes < (n1+1)); plev <<= 1);
    for (i = 1; (pipes - i) < physical_size()) < n; i<<=1);
    tgt = (char *) malloc (strlen (char1) + sizeof (char2));
    fgeta (tgt, src1, tgtfile);

    /* communication is much greater across the second axis */
    ManyPipes = allocate_shape (ManyPipes, 2, pipes, 1, CM_nrows, 0, 0, plev, 100, CM_nrows, 0, 0);

    with (ManyPipes) {
        everywhere (char current string2;
            load_file (dbfile, n, n, &string2);
            compare_strings (tgt, n, n, &string2, &best_match, &best_cost);
        )
    }

Figure C.1: CM sequence comparison driving routine

153
void compare_strings (char *tgt, int m, int n, 
    char:current *string1, 
    int *best_match, int *best_cost) 
{
    int i;
    unsigned int:current cost1, cost2, ls1cost1, ls1cost2, fill;
    char:current str1, cfill1; char:current *string1;

    strings = &str1;
    lastcost = &cost1; newcost = &cost2;
    leftcost = &cost1; diagonost = &cost2;

    /* Initial conditions */
    *string1 = 0; *newcost = 1; *lastcost = *leftcost = 0;
    for (i = 0; i < m; i++)
        /* Step input costs */
        swap (diagnost, leftcost, temp);
        swap (lastcost, newcost, temp);

    /* shift costs and characters */
    fill = *i;  
    to_grid_dia (&newcost, *lastcost, &fill, 1, 1);
    cfill = tgt[*j];  
    to_grid_dia (string1, *string1, &cfill, 1, 1);
    /* Take min of left and last and add i for non-match cost. */
    *newcost = (*leftcost < *lastcost);
    *newcost++;
    /* Use diagnost if characters match */
    where (*string1 == *string2)
        newcost = *diagnost;
    }
    for (i = 0; i < m; i++)
    {
        swap (diagnost, leftcost, temp);
        /* shift costs and characters in, ignore first pe */
        to_grid_dia (leftcost, *lastcost, &fill, 1, 1);
        to_grid_dia (string1, *string1, &cfill, 1, 1);
        *newcost = (*leftcost < *lastcost);
        *newcost++;
    where (*string1 == *string2)
        newcost = *diagnost;
    }
    /* laststep newcost now has sequence comparison results. */
    /* locate minimum of all plocs. */
    where (pcoord1) == (m-1) 
    { *best_cost = &newcost;
      where (*newcost == *best_cost) 
          *best_match = (int) pcoord0 (0);
    }
}

Figure C.2: CM sequence comparison cell program.
Bibliography


164


166