Chapter 4

SIMD Parallelism
Table of Contents
4.1 Introduction
4.1.1 The History of the SIMD Architecture
4.1.2 An Analogy - A Manufacturing Process
4.2 The “saxpy” Problem
4.3 A First Attempt at SAXPY Speedup - Pipelined ALUs
4.4 Vector Processing - One Version of SIMD
4.4.1 Pure Vector Processors
4.4.2 Pipelined Vector Processors
4.4.3 An Example - A Matrix/Vector Product
4.5 Beyond Vectors - Parallel SIMD
4.5.1 Using Subsets of Processors - A Prime Number Sieve
4.5.2 The Advantage of Being Connected - The Product of Two Matrices
4.5.3 Working with Databases
4.5.3.1 Searching a List of Numbers
4.5.3.2 Searching a Database
4.6 An Overview of SIMD Processing
4.7 An Example of Vector Architecture - The Cray Family of Machines
4.8 An Example of SIMD Architecture - The Connection Machines
4.9 Summary
4.1 Introduction
In Section 2.3 we discussed the Flynn Taxonomy for Parallel Processing. One of
the first expressions of parallelism was via the SIMD (Single Instruction Multiple Data)
model. Conceptually, this model gives the first step away from the Von Neumann
Architecture discussed in Chapter 1. In one sense, we are still dealing with the classical
model of computing. The program is being processed one instruction at a time in the
sequential order that the instructions appear in the program. The difference is that instead
of one processing unit executing the instruction for one value of the data, several
processors are executing the instruction for different values of the data at the same time.
This is what we referred to as synchronous control in Section 2.4.3 of Chapter 2. This
mode of processing has obvious implications for an increase in processing speed.
As we will see in this chapter, there are two ways to implement the idea of having
multiple processors operating on different pieces of the data in a synchronous manner.
The first is to take the obvious opportunity, which is to process vector operations, such as
vector addition or coordinatewise multiplication, that require no communication between
the processors, i.e. the result in one component is independent of the results in other
components. This type of SIMD parallelism is known as Vector Processing. Many
computer scientists distinguish this form of parallelism from SIMD parallelism. In this
text we will not make this distinction except to note that it exists.
The other form of SIMD parallelism that we will discuss is what we call the Pure
SIMD model of parallelism. This model allows, and its most interesting applications
depend on, communication between processors. This means that one processor can use
the result generated in another (usually neighboring) processor during a previous step in
the calculation. The need for the data to be independent is removed. It also means that
processors can be “shut off” during a series of calculations. The ability for processors to
communicate and be shut off during processing is particularly useful in applications such
as database searches and matrix operations.
4.1.1 The History of the SIMD Architecture
The SIMD architecture is the first of the parallel architectures to be implemented
on a large scale. In the late 1970s the ILLIAC IV was the most powerful computer in the
world and handled many of NASA’s computation intensive problems. It cost 30 million
dollars and was extremely difficult to program. It had 64 processing elements that were
arranged in an 8X8 grid. Each processing element was considerably smaller than the
processing elements of existing serial (Von Neumann) computers. However, each
processor had its own local memory.
Although the ILLIAC IV was never a commercial success, it did have a
tremendous impact on hardware technology and the design of other SIMD computers, such
as the Connection Machine, designed by Danny Hillis in 1985 to support his research in
artificial intelligence at MIT. We will discuss the architecture of this machine later in this
chapter.
On another front, Seymour Cray was designing a numeric supercomputer for
processing vector computations. Cray realized that the jumps that occur during the serial
processing of vectors (such as in the addition of vectors) are a waste of computing time.
He built the world's first supercomputer with multiple ALUs specialized for floating point
vector operations, implementing a simple RISC-like architecture. The CRAY vector
processors are register oriented and implement a basic set of mostly short instructions.
We will discuss this basic architecture in the section of this chapter devoted to the CRAY
series of machines.
4.1.2 An Analogy - A Manufacturing Process
A large confectionery manufacturing company is planning to enter the candy bar
market and introduce a “sampler” to be sold at candy counters and check out lines at
super markets. The sampler will have four candies: a chocolate covered cream, a peanut
cluster, a caramel, and an almond solitaire. The candies will be stacked in a paper tube
separated by a cardboard disk. The company is in the process of designing its production
line for this new product.
[Figure: A single loading station, fed by four bins (chocolate cream, peanut cluster,
caramel, almond solitaire) and a hopper of cardboard disks.]
There are three production designs under consideration. The first, and least
expensive, is to have the paper tube pass under a single loading station that is fed by
four candy holding bins and has a hopper of cardboard disks as part of its assembly.
The tube is filled by first placing a disk in the bottom then inserting the almond solitaire
followed by a cardboard disk and the caramel, then a disk and a peanut cluster, and finally
a cardboard disk, the chocolate cream, and a cardboard disk. The machine is very
efficient. The process of dropping a disk and a candy takes approximately one half
second. Thus a tube can be filled every 2 seconds for a total of 30 tubes a minute and a
production of approximately 43,000 tubes per day. Not really enough for a national
market.
If we make the analogy of the pieces of candy and cardboard separators to data
being processed by a computer, then the first production model described in the preceding
paragraph is similar to the Von Neumann sequential, or SISD, model of computation.
Modern processors may very efficiently process each individual piece of data, but the
sequential nature of the machine makes it inappropriate for processing the large amounts
of data required for many of the modern, grand challenge scale problems being addressed
by scientists today.
Continuing with our candy production analogy, consider an alternative idea for the
production line. It involves a packaging station that is roughly four times as large as the
one in the previous design. It has four feeding lines, one from a bin of each type of candy.
Each line is accompanied by a hopper of disks. This design for the production line is
diagrammed in the following illustration.
[Figure: A four-stage filling station; the tube passes in turn under bins of chocolate
creams, peanut clusters, caramels, and almond solitaires, each with its own hopper of
disks.]
If we assume that the time to move a tube from position to position within the
filling station is negligible, then after filling the first tube in 2 seconds, there are four
tubes within the station and a tube of candy is produced every half second. This gives a
total production of 120 tubes per minute and close to 173,000 per day. This is a more
acceptable production schedule, but it is a more expensive solution to the production
problem.
The above design is analogous to the pipeline design discussed in Chapters One
and Two. The pipeline design is used to process the instruction, “fill a tube with candy.”
However, the machine is made up of four smaller components and the instruction is
broken into its four component pieces: “put in a chocolate cream”; “put in a caramel”;
“put in a peanut cluster”; and “put in an almond solitaire”. Each part of the packaging
machine processes the same piece of the larger instruction over and over again. Thus,
four tubes of candy are being worked on at the same time.
Another possibility for designing the candy production line is to use ten of the less
expensive stations described in our first example and have them on a synchronized
production schedule. By this we mean that all ten machines will simultaneously load a
chocolate cream into a package, followed by all ten machines loading a peanut cluster,
etc. When the tubes are filled, they move out simultaneously to the next stage in the
production process. Thus, ten packages of candy are produced in 2 seconds.
[Figure: Ten single-loading stations (Station 1 through Station 10) operating on a
synchronized production schedule.]
With this design, 300 packages of candy are produced each minute for a total of
432,000 packages per day! This is a significant increase over the previous design, and it
uses only the less expensive stations. It does, on the other hand, complicate the design
and setup problems: decisions have to be made on how to supply the machines and how
to synchronize each step in the production process.
What we have described by the manufacturing analogy is a prologue to our
description of the SIMD (Single Instruction Multiple Data) computing process. We will
of course be processing data, not candy, and we will consider hundreds of processing
stations, not just ten. However, the basic ideas and problems are essentially the same.
4.2 The “saxpy” Problem
In this section, as with much of the rest of this chapter, we will be using two terms
with a great deal of regularity. They are vector and scalar. They are terms used by
professionals working in the area of Numerical Linear Algebra, a large client field for
high performance computing. Without going into all of the ramifications of these terms,
we will have the understanding that a vector is an array of floating point numbers and that
a scalar is a single floating point number. This definition is narrower than the generally
used definition, but it is acceptable for our purposes. Also within this context, we define
a matrix to be a two-dimensional array, or table, of floating point numbers.
When we consider many of the large problems involving applications of
computing, we see that they involve vector and matrix manipulations. For example,
almost all of the “Grand Challenge” scale problems that we discussed in Chapter 2 are of
this type. Prime examples of such problems are forecasting weather (see Figure 2.xx), the
ocean tides, and climatological changes. These problems all involve taking large
amounts of data from many locations, modifying that data by local conditions and passing
it on to the next location to interact with the conditions at that location.
As a simple example of this process, suppose that we have a metal wire that is
being heated on one end. Choose n points at equally spaced distances along the wire. Let
Y be a vector whose ith component contains the temperature of the wire at the ith point at a
given time, and let X be a vector whose ith component gives the temperature of the
surrounding medium, including the points adjacent to the ith point, at the same time. If α
is a factor that depends on the heat conductivity of the wire, the time between
measurements, and the distance between points, then the temperature at the ith point at
the next time a measurement is taken may be given by:

y_{i,t+1} = y_{i,t} + α x_{i,t}
A general description of this process can be expressed within a computer by
vector or array operations. A simple model, such as the above process, can be described
as follows:
Suppose that a vector Y gives the values for a particular condition at
locations i = 0, 1, 2, . . ., n at time t, and a vector X gives the value of the
same condition at the location adjacent to location i at the same time. If a
force of fixed intensity, α, is applied to all locations, the resulting value of
Y at time t + 1 is given by

αX + Y
The computation of the resultant Y is called the saxpy or “scalar a (times) x plus y”
problem. This computation is so prevalent in scientific computations that it became an
important benchmarking standard for evaluating the performance of computing systems.
The sequential coding of this process is very straightforward. For example, in C
the code may be written as:

#include <stdio.h>
#define n 1000

float a, x[n], y[n];

int main(void) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
    return 0;
}
If we eliminate the computations involved with subscripting and look only at the
steps required to compute “c+=a*b” as opposed to “y[i]+=a*x[i]”, the last line of this
code translates into the relatively simple assembly code of the Motorola 68040 chip, as
shown in Figure 4-1.
In Figure 4-1 we see that there are five lines of assembly code devoted to the basic
saxpy operation. If we view each assembly line command, as we did in Figure 1-2, as
being further broken down into machine operations of “decode the present instruction”,
_main:
      link a6,#-12
      fmoves a6@(-4),fp0      |
      fsglmuls a6@(-8),fp0    |
      fmovex fp0,fp1          |  the saxpy computation
      fadds a6@(-12),fp1      |
      fmoves fp1,a6@(-12)     |
L1:
      unlk a6

Figure 4-1: Assembly Code for a Saxpy Operation
“fetch the operands”, and “execute the instruction” that each require one machine cycle
and assume that fetches and stores to memory take two machine cycles, then a single
saxpy computation (or single path through the loop) will take 19 machine cycles to
execute, as is shown in Figure 4-2.
[Figure 4-2: Time for the Saxpy process - a timing chart of the decode, fetch, and
execute steps for the five assembly instructions, with two-cycle loads and stores,
totaling 19 machine cycles.]
This is a highly idealized view of the process, but if we accept this model, we see that our
C code, which must execute 1000 saxpys, will execute in approximately 19,000 machine
cycles.
Since machine cycle times are measured in microseconds (10^-6 seconds) or even
nanoseconds (10^-9 seconds), 19,000 machine cycles does not seem to be a great deal of
time. However, the large problems that we are talking about deal with vectors of sizes
that are measured in the hundreds of thousands, and the program may do millions of such
calculations.
To expand on this idea, recall the discussion of computation speed in Chapter
One. One saxpy computation over a vector of length one thousand will take:
T = 19,000 * C seconds
where C is the length of a machine cycle. Assuming a fairly typical value for C of 100
nanoseconds, we have
T = 19,000 * 100 * 10^-9
  = 1.9 * 10^-3 seconds
  = 1900 microseconds per saxpy

This certainly is not a great deal of time, but in Chapter One it was shown that a typical
Grand Challenge scale problem may require 10 million saxpys of this size or larger. The
time to compute this number of saxpys is:

19,000 seconds, or 5 hours and 17 minutes!
If we consider a problem that requires one billion saxpys of this size (not unheard of as
the demand for more accurate weather forecasting increases), the time required is:
527 hours and 47 minutes or almost 22 days
Now, we are talking about so much time that the result of the computation is useless for
any practical application. If we are going to attack problems that require computations of
this size, then speedup is a serious issue. We must develop a model of computation that
is distinctly different than the Von Neumann or sequential model.
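To make this arithmetic concrete, here is a small C program (a sketch using the chapter's assumed figures of 19,000 cycles per length-1000 saxpy and a 100-nanosecond machine cycle) that reproduces the estimates above:

#include <stdio.h>

int main(void) {
    double cycle = 100e-9;          /* machine cycle time C: 100 nanoseconds */
    double saxpy = 19000.0 * cycle; /* one saxpy pass over a vector of 1000  */

    printf("one saxpy of length 1000: %g seconds\n", saxpy);        /* 1.9e-3 */
    printf("10 million saxpys: %g hours\n", 1e7 * saxpy / 3600.0);  /* ~5.28  */
    printf("one billion saxpys: %g days\n", 1e9 * saxpy / 86400.0); /* ~22    */
    return 0;
}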
4.3 A First Attempt at SAXPY Speedup - Pipelined ALUs
The idea of a pipeline and a pipelined process was introduced in Chapters One
and Two. In fact, in Chapter One a brief mention of a pipelined ALU was made. In this
Section this idea will be covered in more detail. We begin with a description of the ALU.
Machine calculations are performed in a portion of the computer known as the
ALU, the Arithmetic/Logic Unit. Data transfers within this unit are very fast and will be
ignored in our analysis.
ignored in our analysis. The idea of pipelining is to separate the decode, fetch, execute
cycles of the assembly instructions as well as memory transfers into separate ALU units.
This architecture is diagrammed in Figure 4-3.
Fetch Instruction -> Decode Instruction -> Calculate Address -> Fetch Data ->
Execute Instruction -> Store Result

Figure 4-3: A diagram of a pipelined ALU
In Figure 4-3, we separated the fetch and store operations into two steps so that we can
make the idea of what is happening more clear. In order to process a single assembly line
instruction this processor model will take six machine cycles. This is actually slower than
our previous model, but the advantage is gained by the fact that the ALU processing units
operate independently. Thus, when unit six is working on the store portion of instruction
one, unit five is working on instruction two, etc. (Recall the second production design for
the candy factory.) In short, this design allows for six instructions to be in various stages
of execution at the same time. Thus, the first completed instruction requires six machine
cycles. After that time, one instruction will be completed with each machine cycle. In
our example C code, the process will take a total of 6 + 999 = 1005 machine cycles. This
compares to the 19,000 cycles required by a straightforward, non-pipelined saxpy
computation.
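The cycle count follows directly from this argument: the first instruction takes as many cycles as there are stages, and each later instruction finishes one cycle after its predecessor. A minimal sketch of the formula, assuming the six-stage pipeline of Figure 4-3:

#include <stdio.h>

/* Cycles to run n instructions through a pipeline with the given
   number of stages: the first result takes 'stages' cycles, and one
   more result is completed in every cycle after that. */
long pipeline_cycles(long stages, long n) {
    return stages + (n - 1);
}

int main(void) {
    printf("%ld\n", pipeline_cycles(6, 1000));  /* 1005, as in the text */
    return 0;
}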
In Figure 4-4 each complete C-code instruction is shown as a horizontal bar
representing six machine cycles. A new instruction is started after each machine cycle.
At the end of six machine cycles, one C-code instruction is completed with each machine
cycle. This interleaving of machine instructions causes our machine to operate efficiently
and, in the final analysis, more quickly than the machine of the previous model. This
gain, of course, comes at the expense of separating the execution of instructions and
synchronizing the processing of each stage.
[Figure 4-4: Executing the Saxpy C-Code on a Pipelined ALU - each C-code saxpy
instruction is shown as a horizontal bar spanning six machine cycles; a new instruction
starts with each cycle, so after the first six cycles one instruction completes per cycle.]
4.4 Vector Processing - One Version of SIMD
There is not general agreement that Vector Processing is really SIMD Processing.
While we classify both as examples of SIMD Parallelism, we distinguish between the two
in our presentation. This section deals with Vector Processing, i.e. SIMD processing that
does not involve communication between the processing elements, but nonetheless does
its processing with the various processors synchronized to be doing the same operation at
the same time. This is appropriate for vector operations where the order in which the
operations are done on the vectors has no effect on the result of the computation. In
section 4.5 we will discuss what we call the Pure SIMD model for parallelism that allows
us to attack a more general class of problem than the vector operations discussed here.
4.4.1 Pure Vector Processors
While a pipelined ALU has some of the elements of parallelism, a machine built
on this principle is using an SISD architecture, even though more than one data element is
in the ALU at the same time. Only one instruction is completed at the conclusion of each
machine cycle. Because of the efficiency of the ALU, we can think of this architecture,
as we did in Chapter 2, as an example of an MISD architecture. True parallelism,
however, involves the processing of several instructions at the same time. The
difference is that in SISD processing, whether pipelined or not, the data is being
processed as a stream, i.e. at most one data element completes processing at the end of
each machine cycle. In a parallel architecture, it is processed as a wave, with several data
elements completing processing at the end of the same machine cycle.
Our first view of this type of model will be that of a “vector” processor. In this
architecture the ALUs are vector ALUs: they store and operate on vectors of data, as
opposed to single pieces of data. We will refer to vector registers and scalar registers in
discussing the design of this architecture. A scalar register is one which stores a single
floating point number. A vector register is really a collection of registers, or one large
register, that stores an entire vector of floating point numbers.
In addition to performing vector operations (vector add, subtract), the vector
computer must be able to perform scalar operations on vectors and reduce vectors to
scalars. Thus, the vector ALU must have hardware that can perform operations on vector
components, and also do operations on all the components of vectors such as sum
components, take the minimum of the components, etc. Finally, the ALU must allow for
‘mixed’ operations between scalars and vectors. The speed with which these operations
can be performed in the ALU affects the overall efficiency of the processor.
For the saxpy problem, αX + Y, we illustrate in Figure 4-5 a model which
replicates the scalar, α, in a vector so that it is available during the execution of the saxpy
command. In this Figure, we also display the vector registers as collections of floating
point registers.

[Figure 4-5: Executing a Saxpy Operation in a Vector Processor - a vector register
holding copies of α is multiplied componentwise with the vector register holding X, and
the result is added componentwise to the vector register holding Y.]
A coding of this process that is representative of the pure vector machine handling of the
saxpy process is the FORTRAN 90 code.
Y[1:100] = a*X[1:100] + Y[1:100]
Even though scalars may be processed as vectors, as in Figure 4-5, they are written in
FORTRAN 90 as scalar, not vector, elements.
In the FORTRAN 90 language, it is possible to reference whole vectors or
portions of vectors by including the starting index and ending index in brackets following
the name of the vector. Thus, if we wish to reference the elements of Y with indices 20
through 59, we write Y[20:59]. The use of a subset of the indices of a vector is handled
by a machine process. If we are dealing with a vector ALU that has, say, 100 floating
point registers reserved for each vector, the entire vector Y may be loaded into the ALU;
however, a mask (a bit pattern representing a boolean vector) is also created to inform the
store instruction to ignore indices 1 through 19 and 60 through 100. If we have
FORTRAN 90 code that reads as follows
Z[20:59] = Y[20:59]
The mask may have zeros in all indices other than those numbered 20 through 59, and
ones in those indices. If m_i is the value of the ith bit of the mask, the store instruction
corresponds to the boolean operation

z_i = (z_i ∧ m_i′) ∨ (y_i ∧ m_i)
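The masked store can be expressed directly in C. The sketch below applies the boolean rule above one element at a time, using an array of 0/1 mask values rather than a packed bit pattern (an illustrative simplification, not the machine's actual register layout):

#include <stdio.h>

#define N 100

/* Masked store: z[i] keeps its old value where mask[i] is 0 and
   receives y[i] where mask[i] is 1 -- the rule z = (z AND m') OR (y AND m). */
void masked_store(float z[], const float y[], const int mask[], int n) {
    for (int i = 0; i < n; i++)
        if (mask[i])
            z[i] = y[i];
}

int main(void) {
    float y[N], z[N];
    int mask[N] = {0};

    for (int i = 0; i < N; i++) { y[i] = (float)i; z[i] = -1.0f; }
    for (int i = 19; i < 59; i++) mask[i] = 1;  /* 1-based indices 20..59 */

    masked_store(z, y, mask, N);
    printf("z[25] = %g, z[75] = %g\n", z[25], z[75]); /* 25 and -1 */
    return 0;
}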
A different problem is created if the range of the indices exceeds the number of
individual floating point registers that make up each vector register. In this case an
operation may be done in more than one machine cycle, or more than one register may be
used to hold the vector. For example, if Y has 150 elements and the vector registers are
equipped to handle only 100 elements, Y[1:100] may be loaded into one register and
Y[101:150] may be reserved for the next instruction cycle or placed in another register,
possibly an extension register. This determination is made during the compilation of the
instruction and based on the particular architecture of the machine. Of course, provision
is made to handle the indices whose values exceed 100 by using a modular reduction of
the index at the time they are loaded into the register.
As another example of FORTRAN 90 code consider the example of adding two
vectors, X and Y, that have 75 elements each and storing the result in a vector, Z.
Z[1:75] = X[1:75] + Y[1:75]
As another example, suppose that we want the first 25 and the last 25 indices of Z to be
twice the value of the corresponding indices of X and the middle 25 to be the sum of the
corresponding indices in X and Y. This can be done with the following 3 lines of code:
Z[1:25] = 2*X[1:25]
Z[51:75] = 2*X[51:75]
Z[26:50] = X[26:50]+Y[26:50]
A trace of the machine processing of these three lines would be as follows. A
vector register is filled with the scalar 2 replicated in each of its floating point registers.
The vector X is placed in another register together with a mask that blocks any store
instructions for indices 26 through 100. The multiplication is performed and the result is
stored in vector Z using the mask. A similar process is repeated except that the mask now
allows consideration of indices 51 through 75 only. Finally, vectors X and Y are loaded
in registers together with a mask. They are added and stored in Z as controlled by the
mask.
Obviously, there is some setup time required for the vector operations. After
setup, this model will execute the entire saxpy operation, αX + Y, for a vector of size 100
in the same 19 machine cycles that were required by the simple SISD model to do a single
floating point multiply and add. If S is the number of machine cycles to create the masks
and use the mask to store the result, then the time to execute the entire instruction on a
vector of size 100 is 19 + S machine cycles. If we are processing a vector of size 1000 on
our example machine, the time could be as long as 190 + 10S machine cycles depending
on the number of vector registers in the ALU and the number of setup instructions that
need to be repeated with each segment of the vectors.
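Under these idealized counts, the cost of a long vector operation grows with the number of register-sized segments it must be split into. A sketch of the estimate, assuming 100-element vector registers, 19 cycles of arithmetic per segment, and S setup cycles repeated for each segment:

#include <stdio.h>

/* Estimated cycles for a saxpy over a vector of length len on a machine
   whose vector registers hold 'width' elements: each segment costs the
   19-cycle arithmetic pass plus S cycles of mask/setup work. */
long vector_saxpy_cycles(long len, long width, long S) {
    long segments = (len + width - 1) / width;  /* ceiling of len/width */
    return segments * (19 + S);
}

int main(void) {
    printf("%ld\n", vector_saxpy_cycles(100, 100, 0));   /* 19           */
    printf("%ld\n", vector_saxpy_cycles(1000, 100, 5));  /* 190 + 10*5   */
    return 0;
}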
4.4.2 Pipelined Vector Processors
There are two immediate limitations that should come to mind when we consider
a pure vector processor. They were considered briefly when we looked at vectors of
size 1000 on our example machine, which had vector registers made up of 100 floating
point registers. Both limitations involve the size of a vector that can be held in a vector
register. First of all, how is it possible to anticipate the size of the vectors that will be
submitted for the vector operations? Unlike the candy factory analogy, it is not possible
to set a limit on the size of the “vector” that is to be processed or to know what its size
will be ahead of time. Some applications may deal with hundreds of data points while
others may consider thousands or millions. If one is building a general purpose computer,
it is impossible to envision all of its uses.
One solution to the variable and unlimited size problem is to have all vector
operations done on operands that reside in memory. Unfortunately, while this allows for
variable length vectors of almost arbitrary size, it slows down the computation
considerably. Memory access requires a great deal of time as opposed to the time
required for doing operations within the ALU registers.
The other factor is economic. If one is considering high performance computing
needs, then high speed, high accuracy ALUs are necessary to ensure accurate results.
build a machine that includes a significant number of such processing units within each of
the vector registers is very expensive. Generally speaking, those machines that boast a
large number of individual processing units are using processors that are lower in cost
and less efficient. Consequently, these machines are less reliable for numerical computing
that requires a high degree of accuracy. On the other hand, this economic restriction is
becoming less of a concern as the price of quality numerical processing chips is
decreasing.
The compromise imposed by these limitations is the obvious one: use pipelined
vector ALUs. Figure 4-6 illustrates this idea. Extra vector registers are supplied to store
the results of previous instructions, thus saving time by reducing the number of fetch
operations from memory.
The machine implementation of pipelining a vector ALU is not at all as
straightforward as our diagram might suggest. On the other hand, while Figure 4-6 is
devoid of detail and idealizes the operation, the fact is that this arrangement gives a
sensible solution to the speed/size compromise. In future discussions, when we refer to a
vector processor, we may have a pure vector processor in mind, but in effect, we are
really referring to a pipelined vector processor.
Vector Registers feed a vector pipeline whose stages are:
1. Vector Register Read
2. Fetch Instruction
3. Decode & Execute
4. Vector Register Write

Figure 4-6: A Conceptual Diagram of a Pipelined Vector ALU
4.4.3 An Example - Matrix/Vector Product
The successful application of parallel computing to a problem frequently involves
a rethinking of the problem itself and the algorithms that are used to solve the problem.
This rethinking can be illustrated in the following application.
Suppose that we have an m by n matrix, A, and an n-dimensional vector, v:

        | a_11  a_12  ...  a_1n |              | v_1 |
        | a_21  a_22  ...  a_2n |              |  .  |
    A = |   .     .         .   |          v = |  .  |
        |   .     .         .   |              |  .  |
        | a_m1  a_m2  ...  a_mn |              | v_n |
In most cases, we think of a matrix/vector product as an m-dimensional vector whose
elements are made up of the scalar products of the rows of A, regarded as individual
n-dimensional vectors, with the n-dimensional vector, v. This conceptual formulation of
the product is shown below.
        | a_11 v_1 + ... + a_1n v_n |
        |              .            |
   Av = |              .            |
        |              .            |
        | a_m1 v_1 + ... + a_mn v_n |
Written in C code, the product P (= Av) is generated by nested loops.

for (i = 1; i <= m; i++) {
    P[i] = 0;
    for (j = 1; j <= n; j++)
        P[i] += A[i][j] * v[j];
}
It is apparent that mn multiplications and m(n-1) additions are required to generate this
product using the traditional SISD architecture. The question is how can we efficiently
implement this algorithm on a vector processing machine? Part of the answer involves
the way in which matrices are stored. If we use a vector processing oriented language
such as FORTRAN 90, and write the code as
do i=1,m
P[i] = SUM(A[i,1:n] * v[1:n])
end do
The coding appears to be quite efficient, but when it is run, it does not perform as
well as expected. The reason for the lack of efficiency is that FORTRAN 90 stores
matrices in memory in column order.
To see why the above coding of the matrix/vector product usually operates
inefficiently, consider how data is loaded into the ALU. Computer designers make the
assumption that if a program is reading an element of data, it very likely will need to read
other data elements that are stored near by that element in memory. Thus, memory is
organized into clusters of data, called pages. Different computer manufacturers use data
clusters of varying lengths. Let’s choose a fairly typical page length of 128 bytes. This
size page is capable of holding 32 numbers of four bytes in length. Now suppose that we
are executing the above FORTRAN 90 code with a matrix, A, of size 100 by 100. Since
the matrix is stored in column order, A[1,2] will be stored at least three pages away from
A[1,1] within the computer's memory. The same is true of any two consecutively numbered
row elements. Thus, each time through the loop the page containing the next row element
must be located and loaded into memory. This event is called a page fault. One of the
goals of efficient coding is to minimize the number of page faults during the execution of
a program. Obviously, since the choice of A in our example contains 10,000 four byte
numbers, not all page faults can be eliminated.
Since the row elements of the matrix may be several memory pages away from
each other, the time for each fetch cycle is at a maximum, and the total execution time
for the algorithm, as coded above, is relatively long. So how do we handle this new
issue? The answer involves changing the algorithm for evaluating the product.
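The distance between consecutive row elements in a column-major layout can be computed directly. A sketch, assuming 4-byte elements and 128-byte pages as in the text:

#include <stdio.h>

/* Byte offset of A[i][j] (1-based indices) in a column-major m-by-n
   array of 4-byte elements: column j starts after (j-1) full columns. */
long offset(long i, long j, long m) {
    return ((j - 1) * m + (i - 1)) * 4;
}

int main(void) {
    long m = 100;                 /* A is 100 by 100 */
    long page = 128;              /* bytes per page  */

    long d = offset(1, 2, m) - offset(1, 1, m);
    printf("A[1][2] is %ld bytes (%ld pages) past A[1][1]\n",
           d, d / page);          /* 400 bytes: 3 full pages away */
    return 0;
}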
Instead of looking at the matrix/vector product as the scalar product of the rows of
A with the vector, v, look at the product as a linear combination of the columns of A,
that is, as an extended saxpy operation involving the columns of A, as shown in Figure 4-7.
          | a_11 |          | a_12 |                | a_1n |
          | a_21 |          | a_22 |                | a_2n |
Av = v_1  |   .  |  + v_2   |   .  |  + . . . + v_n |   .  |
          |   .  |          |   .  |                |   .  |
          | a_m1 |          | a_m2 |                | a_mn |

Figure 4-7: Computing the Matrix/Vector Product in Several Saxpy Operations
A moment's reflection will show that the result of this operation is the same as the
previous, multiple scalar product operation. The difference is that the result can now be
easily adapted to an efficient FORTRAN 90 program running on a vector processor, as is
shown in Figure 4-8.
P[1:m] = 0
Do j = 1, n
    P[1:m] = v[j]*A[1:m,j] + P[1:m]
End Do

Figure 4-8: Doing the Matrix/Vector Product in Column Order
We assume that in this calculation, each column of A has been stored in a separate vector
register and that the result of each saxpy operation is stored in a register. The net result is
that the product computation requires n vector steps, one saxpy per column, as opposed
to the mn steps required on an SISD architecture. The timing for the individual steps is
minimized because the coding forces the computer to access memory in an efficient way.
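The same column-order idea can be written in plain C. The sketch below forms P = Av as a sum of column saxpys; A is kept in a one-dimensional array in column-major order (an assumption made to make the contiguous column access explicit):

#include <stdio.h>

#define M 3
#define N 2

/* P = A*v as a linear combination of columns: P += v[j] * column j.
   A is stored column-major in a flat array, so A[i][j] is a[j*M + i]. */
void matvec_columns(const double a[], const double v[], double p[]) {
    for (int i = 0; i < M; i++)
        p[i] = 0.0;
    for (int j = 0; j < N; j++)          /* one saxpy per column */
        for (int i = 0; i < M; i++)
            p[i] += v[j] * a[j * M + i]; /* contiguous column access */
}

int main(void) {
    /* A has columns (1,2,3) and (4,5,6); v = (1, 10) */
    double a[M * N] = {1, 2, 3, 4, 5, 6};
    double v[N] = {1, 10};
    double p[M];

    matvec_columns(a, v, p);
    printf("%g %g %g\n", p[0], p[1], p[2]);  /* 41 52 63 */
    return 0;
}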
4.5 Beyond Vectors - Parallel SIMD
The vector processors described above are extremely efficient for many applied
problems that involve vector manipulations. However, there are two important
characteristics that are not implemented by the above design - the ability to work only on
a subset of the array elements without additional manipulation and the ability to
communicate with neighboring elements. True SIMD computing requires both of these
abilities.
The example architecture we will use as a paradigm is an array processor modeled
after the earliest example of such a machine, the ILLIAC IV computer designed at the
University of Illinois (operational in the mid-1970s). While this is quite old in terms of
the lifetime of computing machines, it is still an excellent model of the entire class of
SIMD machines.
Within this model each processing unit has its own processor and memory. There
is a central control unit that broadcasts instructions to each processor. The processors are
then connected to other processors in some fashion. In the ILLIAC IV each processor
was connected to its nearest neighbor as shown in Figure 4-9. This is an example of the
grid topology that is discussed in Section 2.xx.
[Figure 4-9: A Conceptual Diagram of a Pure SIMD Architecture - a Control Unit
broadcasts instructions to a grid of processors, each with its own memory, joined by
nearest-neighbor connections.]
Other possible connections for the processors, in addition to the grid illustrated in Figure
4-9, are cube connected, as in the Connection Machine, and fully connected, as in the
GF11. We will discuss particular SIMD architectures in a later section of this chapter.
With this design it is clear how the vector machine can be extended. If a
particular subset of processors is to be shut off, as is the case when using a conditional
statement, then the Control Unit merely broadcasts a command that sets a flag bit to zero
in all processors contained within the subset and to one in those processors that are to
remain active. This process is similar to the masking process that we discussed in
connection with the vector processors.
If a processor needs to have access to a result generated by another processor, then
the connecting links make this possible. In the grid layout, shown above, it is most
convenient to request the information from an immediate neighbor. Information from
other processors can be obtained, but it is necessary to have a routing program for virtual
connections.
4.5.1 Using Subsets of Processors - A Prime Number Sieve
The classic problem of locating all primes less than a particular number, n, is
solved by an algorithm called the Sieve of Eratosthenes. The basis of this algorithm is
one of those simple ideas that reflect the insight of genius. It is simply the observation
that if a number, n, is not prime then it is the product of positive integers each of which is
less than n. Thus, the algorithm begins with 2 and sets to zero all positions in the list of
consecutive integers from 3 to n that correspond to multiples of 2. Next, the starting
point is moved to the next non zero element in the list, say p, and sets to zero all the
elements between p+1 and n that correspond to multiples of p. The process continues in
this way until it arrives at nth position in the list. If this position is zero, then n is a
composite number. Otherwise, n is prime. In addition to making this determination, the
algorithm has generated a list whose only non zero elements are the prime numbers < n.
#include <stdio.h>
#define n 1000
/* We will find the primes < 1000 */

int main(void) {
    int i, j, k, p[n];
    /* Initialize the array */
    for (i = 0; i < n; i++)
        p[i] = i;
    /* Pick out the primes in the list */
    for (i = 2; i < n; i++)
        /* find the next nonzero element */
        if (p[i]) {
            k = p[i];
            /* Set the multiples of k to zero */
            for (j = i + 1; j < n; j++)
                if (!(p[j] % k))
                    p[j] = 0;
        }
    return 0;
}
Figure 4-10: C-code for the Sieve of Eratosthenes
The sequential form of this basic algorithm is coded in C and displayed in Figure
4-10. This algorithm operates in O(n²) time, plus the time that it takes to print the nonzero
elements of the array.
If we consider a list of integers from 2 to 15 then we can trace the operation of the
algorithm of Figure 4-10. This is done in Figure 4-11.
Notice that there are two conditional statements in the code of the algorithm.
These statements can be looked at as instructions that will turn off some of the
processors. In particular, any processor that does not meet the given condition will not
execute the statements contained within the scope of the conditional statement.
The code in Figure 4-12 assumes that the instruction set of the machine contains
a command that allows a processor to retrieve its own ID number. We will also assume
that these ID numbers start at 0 and increase by 1 as we move from left to right and top
to bottom. In order to have the processors in our machine correspond to the elements of
the array that was used in the sequential formulation of this algorithm, we will test and, if
necessary, change to 0 a variable j that is initially set to ID + 2. (An alternative strategy
is to ignore the first two processors, 0 and 1, and then initialize j to be ID itself.)
Initial array p:
2 3 4 5 6 7 8 9 10 11 12 13 14 15

First nonzero element: 2
After executing the inner loop of the algorithm (12 steps)
Status of array p:     2 3 0 5 0 7 0 9 0 11 0 13 0 15

Next nonzero element: 3 (1 pass of the outer loop)
After executing the inner loop of the algorithm (11 steps)
Status of array p:     2 3 0 5 0 7 0 0 0 11 0 13 0 0

Next nonzero element: 5 (2 passes of the outer loop)
After executing the inner loop of the algorithm (9 steps)
Status of array p:     2 3 0 5 0 7 0 0 0 11 0 13 0 0

Next nonzero element: 7 (2 passes of the outer loop)
After executing the inner loop of the algorithm (7 steps)
Status of array p:     2 3 0 5 0 7 0 0 0 11 0 13 0 0

Next nonzero element: 11 (4 passes of the outer loop)
After executing the inner loop of the algorithm (3 steps)
Status of array p:     2 3 0 5 0 7 0 0 0 11 0 13 0 0

Next nonzero element: 13 (2 passes of the outer loop)
After executing the inner loop of the algorithm (2 steps)
Status of array p:     2 3 0 5 0 7 0 0 0 11 0 13 0 0

Next nonzero element: ** (2 passes of the outer loop)

Figure 4-11: A Trace of the Sieve of Eratosthenes for a Small Value of n
When reading the code in Figure 4-12, keep in mind that the code is broadcast
one step at a time by the Control Unit to each processor. The active processors then
execute that step before receiving the next instruction from the Control Unit.
int j, k;
/* Set the value to be tested */
j = ID + 2;
for (k = 2; k < n; k++)
    /* consider only active processors */
    if (j != 0 && j > k) {
        /* Should the processor remain active? */
        if ((j % k) == 0)
            j = 0;
    }
Figure 4-12: The Code That Is Executed on Each Processor of a SIMD Machine
Of course the Control Unit is not broadcasting C code. It is broadcasting machine
language instructions that will execute the program represented by the code. However,
we can understand what is happening with the sieve algorithm from the C version of the
program. Let’s examine what is happening in two processors of the machine during the
execution of the code in Figure 4-12. Assume that n is 15 and we are looking at the
activity in processors 11 and 13.
             Processor #11    Processor #13
initial j:        13               15
k=2,  j:          13               15
k=3,  j:          13                0  (processor inactive)
k=4,  j:          13
k=5,  j:          13
   (nothing changes in these processors for k = 6 through 12)
k=12, j:          13
k=13, j:          13  (processor inactive)
k=14
k=15

Figure 4-13: The Status of Two Processors During Execution of the Sieve Algorithm
Note that since there is only one loop, this program operates in O(n) time. Each
processor is responsible for only one number, and at the end of the algorithm, when the
processor is reactivated, it will return either a 0 or its ID plus 2. In effect, the array of
processors replaces the innermost loop of the original C code.
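The lockstep behavior of Figure 4-12 can be simulated on a sequential machine by holding one j value per imaginary processor and applying each broadcast step to all of them. A sketch of such a simulation (the array j[] stands in for the local variable j on each of the n-1 processors):

#include <stdio.h>

#define n 15

int main(void) {
    int j[n - 1];  /* j[ID] simulates the local variable on processor ID */

    for (int ID = 0; ID < n - 1; ID++)
        j[ID] = ID + 2;

    /* Each pass of this loop stands for one broadcast value of k;
       the inner loop applies the step to every "processor" at once. */
    for (int k = 2; k < n; k++)
        for (int ID = 0; ID < n - 1; ID++)
            if (j[ID] != 0 && j[ID] > k && j[ID] % k == 0)
                j[ID] = 0;

    for (int ID = 0; ID < n - 1; ID++)  /* survivors are the primes */
        if (j[ID])
            printf("%d ", j[ID]);
    printf("\n");                        /* 2 3 5 7 11 13 */
    return 0;
}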
A final note on this program: it is not as efficient as it might be. Many of the
mod operations in Figure 4-12 are superfluous. For example, any number divisible by 4
is also divisible by 2, and the corresponding processor was turned off when divisibility
by 2 was checked. If we make the further assumption that the Control Unit has hardware
that can quickly communicate with the processors that are “active” (not shut off by the
conditional) and can quickly do operations such as sum and finding minima and maxima,
then there is no need for the loop. We use the value of j contained in the active
processor with the minimum ID number to check the remaining processors. This value of
j must be a prime number (Why?). Using this strategy, the number of steps necessary to
complete the algorithm can be reduced to a number equal to the number of primes < n,
which for large n is slightly larger than n/ln(n).
Assume that the instruction set contains a command similar to the ReduceSum
command shown in Figure 3.xx. Assume that a call to this command is “ReduceMin(j)”
and that it will quickly identify the minimum value of the variable j over all of the
active processors.
Referring back to Figure 4-9, we see that there are two distinctly different types of
computation going on in a machine built according to the SIMD architecture. The first
type is that which is being done in the Control Unit. This unit generates values and
broadcasts them to the Local Processors. The Control Unit also gathers results from the
Local Processors and handles I/O communications with external devices. On the other
hand, the Local Processors do the operations on the data that they receive from the
Control Unit. In order to distinguish between variables that reside in the Control Unit and
those that reside on the Local Processors, we use two new type declarations, Control and
Local. If a variable is designated as being of type Control, then it is defined in the
Control Unit and may be broadcast to all of the processors. A variable designated as
being of type Local may have a value that varies from local processor to local processor.
Its value stays within the processor until it is collected along with the other local values of
the variable via a special command from the Control Unit, such as a ReduceSum or
ReduceMin call.
Figure 4-14 is a coding that takes advantage of the conventions discussed in the
preceding paragraph for executing the Prime Sieve Algorithm.
Control: int a;
    /* a is a variable whose values are generated
       in the Control Unit. Its value may be
       shared with the local processors, but they
       can not change it. */
Local: int j, k;
    /* j and k are variables that reside on the
       local processors and can be manipulated by
       them. Each processor may have a different
       value for j and k. This is the usual case. */

j = ID + 2;      /* Initialize each Local j */
a = 2;           /* Initialize a in the Control Unit */
while (a > 0) {  /* 0 is the default min value
                    for the empty set */
    k = Load(a); /* The value of a is broadcast
                    to all active processors */
    if (j != 0) {       /* Shut off those processors
                           with zero j */
        if (j > k && (j % k) == 0)
            j = 0;      /* j is a multiple of the prime k */
    }
    if (j > k)          /* Note that only those values of j
                           that exceed the current prime
                           are considered */
        a = ReduceMin(j);
}
/* The value of j is either 0 or a prime */
Figure 4-14: A SIMD Coding That Uses the ReduceMin Operator
In Figure 4-15, we show the status of the Control and Local variables during the
execution of this algorithm on 14 processors.
Control Unit                    Processors (the value of j in processors 0 through 13)
     a            0   1   2   3   4   5   6   7   8   9  10  11  12  13
     2            2   3   4   5   6   7   8   9  10  11  12  13  14  15
     3            2   3   0   5   0   7   0   9   0  11   0  13   0  15
     5            2   3   0   5   0   7   0   0   0  11   0  13   0   0
     7            2   3   0   5   0   7   0   0   0  11   0  13   0   0
    11            2   3   0   5   0   7   0   0   0  11   0  13   0   0
    13            2   3   0   5   0   7   0   0   0  11   0  13   0   0
     0            2   3   0   5   0   7   0   0   0  11   0  13   0   0

Figure 4-15: A Trace of the Algorithm of Figure 4-14
While the code in Figure 4-14 eliminates the superfluous mod (%) operations, this
savings must be weighed against the additional cost of the ReduceMin and Load
operations to determine exactly how much improvement is obtained by using these
operators.
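A sequential simulation shows how the ReduceMin strategy prunes the work. Each pass broadcasts the current prime a, knocks out its multiples, and then reduces over the surviving j values that exceed a (a sketch of the logic of Figure 4-14, with the reduction written as an explicit loop):

#include <stdio.h>

#define n 15

/* Simulated ReduceMin over the "processors": the smallest j that is
   larger than k, or 0 if no such value exists (the empty-set default). */
int reduce_min_above(const int j[], int count, int k) {
    int best = 0;
    for (int ID = 0; ID < count; ID++)
        if (j[ID] > k && (best == 0 || j[ID] < best))
            best = j[ID];
    return best;
}

int main(void) {
    int j[n - 1];
    for (int ID = 0; ID < n - 1; ID++)
        j[ID] = ID + 2;

    int a = 2;                               /* first prime, from the "Control Unit" */
    while (a > 0) {
        for (int ID = 0; ID < n - 1; ID++)   /* broadcast step: zero the  */
            if (j[ID] > a && j[ID] % a == 0) /* multiples of a            */
                j[ID] = 0;
        a = reduce_min_above(j, n - 1, a);   /* next prime: one pass each */
    }

    for (int ID = 0; ID < n - 1; ID++)
        if (j[ID])
            printf("%d ", j[ID]);            /* 2 3 5 7 11 13 */
    printf("\n");
    return 0;
}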
4.5.2 The Advantage of Being Connected - The Product of Two Matrices
In the previous section the algorithm depended only on the fact that each
processing unit was connected to the Control Unit. Each individual processor’s actions
depended only on the information that it had or that was passed to it from the Control
Unit. It did not depend directly on the information contained in any other processor. In
the algorithm that we develop in this section it is absolutely necessary that processors be
able to communicate with each other.
In Figure 4-14, we assumed that we could specify variables as being stored either in
the memory of the local processing elements or in that of the Control Unit. Now we add
another command that will also be handled in the preprocessing stage of the algorithm.
This new command will be called “Layout” and will specify the conceptual layout, or
network configuration, of the processing elements. The preprocessor will determine the
necessary mappings and routings of the conceptual design onto the hardware design.
This new command will have the general form

Layout ArrayName [range_1, range_2, . . . , range_n]

The name, ArrayName, may be any legal variable name. The values for each range_i are
the starting and ending values for the ith subscript. Conceptually we have a multiply
subscripted array of processors. For example, if n = 2, and we have an algorithm that will
be dealing with 8 by 12 matrices, we may specify a layout as follows:

Layout MatHolder [1:8, 1:12];
This layout will facilitate the distribution of matrices to the processors. For example, if A
is an 8 by 12 matrix, then the command

el = Distribute(A);

places in the variable el of the processor associated with MatHolder[i, j] the value of the
matrix element A[i, j].
We will also make some conventions concerning the connection of the processors
in this layout statement. The first subscript will be associated with the directions north
and south. Thus, in processor MatHolder[i, j], el.north refers to the value of the element
el stored in MatHolder[i-1, j] and el.south refers to the value of el stored in
MatHolder[i+1, j]. If the value of i-1 is less than the starting value for the first index, or if
i+1 is greater than the ending value of the first index, then the request for the value of
el.north or el.south is ignored. Likewise, the second subscript is associated with the directions
east and west and the third with up and down. If there are more than three dimensions,
we will use up4 and down4, up5 and down5, etc.
This architectural design will be used to speed up the computation of the
product of two matrices. Assume that we have two n by n matrices, A and B. The
normal algorithm for the matrix product of A and B requires on the order of n³ steps.
The code is shown in Figure 4-16.
We will reduce the number of steps required for this process to n steps, but we
will make the classical computational trade-off: time for space. In our case, the tradeoff
is better stated as time for processing elements. We will assume that we have n³
processors available on our system and that they are laid out as a three dimensional array.
Our layout statement will read:

Layout: Array[1:n,1:n,1:n]

Given this layout, we are assuming that each processor is directly connected to six other
processors, two in each of the array dimensions.
int i, j, k;
float A[n][n], B[n][n], C[n][n];

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {   /* Calculate C[i][j] */
        C[i][j] = 0;            /* Initialize */
        for (k = 0; k < n; k++) /* Compute the scalar product of */
            C[i][j] += A[i][k] * B[k][j]; /* row i of A and column j of B */
    }

Figure 4-16: The Sequential Algorithm for The Product of Two Matrices
If we are dealing with two 100 by 100 matrices we will be requesting one million
processors! Obviously, some processors must accommodate more than one element of the
matrices, but the mapping of the matrix elements to the processors is handled during the
preprocessing phase of the program.
The beauty of the matrix product algorithm we will present is in its conception.
Think of the processors as being laid out within an n by n by n cube. At each of the n
vertical levels assume that there is an n by n table. Distribute n copies of the matrix A
and n copies of the matrix B through the cube so that every product term a_ik*b_kj
resides in exactly one processor. Figure 4-17 illustrates this for 2 by 2 matrices.
for 2 by 2 matrices.
a11
A:
a11
a12
b11
a12
b12
b21
b22
B:
a21
a21
a22
b11
a22
b12
b21
b22
Figure 4-17: Illustration of the Distribution of Matrices for the Product
To see the full layout, simply slide the two parts of the figure together, maintaining the
spacing. Notice that in each of the cells there are elements of the form a_ik and b_kj,
legal terms for a matrix product element from the above code. For example, the processor
corresponding to Array[2,1,1] will contain the values of A[1, 2] and B[2, 1].

Recall that the cells each have their own processor. All that remains is to take the
product of the two terms in each cell and then accumulate these products down each
vertical column of cells. This addition requires n steps.
The following is the pseudo C-code for this process in our SIMD paradigm
machine. We assume the following directions for moving data among the processors:
north, south, east, west, up, and down. The compass directions refer to movement on the
horizontal layers, while the up and down refer to movement in the vertical direction. If a
request is made for a value in which there is no connection, a zero is returned.
Layout: Array[1:n,1:n,1:n]
Control: float A[1:n,1:n], B[1:n,1:n];
         float C[1:n,1:n], D[1:n,1:n,1:n];
         int i,j,k;
Local:   float p,p1,p2;
         int i1;

for(i=1;i<=n;i++)         /* Place copies of A so that */
    for(j=1;j<=n;j++)     /* Array[k,i,j] receives A[i,k] */
        for(k=1;k<=n;k++) /* (see Figure 4-17) */
            D[k,i,j]=A[i,k];
p1 = Distribute(D);

for(i=1;i<=n;i++)         /* Place copies of B so that */
    for(j=1;j<=n;j++)     /* Array[k,i,j] receives B[k,j] */
        for(k=1;k<=n;k++) /* (see Figure 4-17) */
            D[k,i,j]=B[k,j];
p2 = Distribute(D);

p=p1*p2;  /* Compute the products of individual elements */
p2=p;     /* Store the product in p2 */
p=0;      /* Reinitialize p */
for(i1=1;i1<=n;i1++){ /* This loop accumulates the */
    p=p2+p;           /* desired sum in Array[n,i,j] */
    p2=p2.north;
}
D=Store(p);  /* Collect the values in the Control Unit */
for(i=1;i<=n;i++)
    for(j=1;j<=n;j++)    /* Store the product in */
        C[i,j]=D[n,i,j]; /* a matrix C */

Figure 4-18: An Efficient Parallel Matrix Product
This code is not designed to run on any particular machine; its commands are designed
to show the action of the machine, not to serve as a language tutorial. Some explanation
is required for the Distribute and Store commands. Each deals with a structure in the
Control Unit that has the same shape as the processor layout. The Distribute command
uses the associations that are made during the preprocessing of the Layout declaration to
distribute the elements of the Control Unit data structure to their corresponding processors.
We assume that each processor receives one element of the data structure. The Store
command is the inverse of the Distribute command in that it moves data from the
processors to the designated Control Unit data structure.
The command “p2=p2.north” replaces the current value of p2 in each processor
with the value of p2 in the processor to the north. As was mentioned above, if there is no
processor to the north of the given processor, p2 is assigned a value of 0. After n steps,
all of the products in a given vertical column have reached the bottom processor of that
column, Array[n, i, j], and have been accumulated into the sum stored in the variable p.
Thus, the value of p in processor Array[n, i, j] is the i,jth element of the product of A and
B.
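The whole scheme can be checked with a small sequential simulation: fill an n by n by n array with the product terms, then “shift and add” along the first dimension exactly as the processors do. The sketch below (a translation of Figure 4-18 into ordinary C, under the distribution Array[k,i,j] = A[i,k]*B[k,j]) verifies the result against the classical product for n = 2:

#include <stdio.h>

#define N 2

int main(void) {
    double A[N][N] = {{1, 2}, {3, 4}};
    double B[N][N] = {{5, 6}, {7, 8}};
    double p2[N][N][N];

    /* Distribution: "processor" [k][i][j] holds the term A[i][k]*B[k][j]. */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                p2[k][i][j] = A[i][k] * B[k][j];

    /* n shift-and-add steps: p accumulates, p2 moves one step "south". */
    double p[N][N][N] = {{{0}}};
    for (int step = 0; step < N; step++) {
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    p[k][i][j] += p2[k][i][j];
        for (int k = N - 1; k > 0; k--)      /* p2 = p2.north */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    p2[k][i][j] = p2[k - 1][i][j];
        for (int i = 0; i < N; i++)          /* no processor to the north */
            for (int j = 0; j < N; j++)
                p2[0][i][j] = 0;
    }

    /* The bottom layer of the first dimension now holds C = A*B. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%g ", p[N - 1][i][j]);   /* 19 22 / 43 50 */
        printf("\n");
    }
    return 0;
}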
In this algorithm, all n³ multiplications are done at one time with the command
“p=p1*p2”. Within the loop, the additions in all processors are done simultaneously with
the command “p=p2+p”. It takes n steps until the bottom row of each layer has the full
sum required for each term. Thus, the multiplication portion of the algorithm operates in
O(n) time. This does not include the setup and storage phases of the algorithm. Let's do
a more complete analysis of the algorithm.
The real work of the algorithm is done by the multiplication and accumulation
steps of Figure 4-18. We repeat these lines, numbered (18) through (24) for reference, in
Figure 4-19.
(18) p=p1*p2;  /* Compute the products of individual elements */
(19) p2=p;     /* Store the product in p2 */
(20) p=0;      /* Reinitialize p */
(21) for(i1=1;i1<=n;i1++){ /* This loop accumulates the */
(22)     p=p2+p;           /* desired sum in */
(23)     p2=p2.north;      /* Array[n,i,j] */
(24) }

Figure 4-19: A Fragment of Figure 4-18
Let's assume that f is the time required to do a floating point operation (add or multiply),
that c is the time required to do a processor-to-processor communication, and that s is the
time required to set up the communications channel. We will also assume that
assignments require 2 machine cycles, denoted by mc. Now we can compute the time for
this matrix product computation.
Step (18):            f
Steps (19) & (20):    4mc
Steps (21) - (23):    s + n*(f + c)
TOTAL:                s + (n+1)*f + n*c + 4mc
On the other hand, if we examine the sequential algorithm in Figure 4-16, we see that n²
assignments are made to initialize the product matrix and 2n³ floating point operations are
required to compute the matrix product. Thus, the time for this computation is:

2n²*mc + 2n³*f
While it is true that s and c may be significantly larger than the time for an in-processor
floating point operation or a machine cycle, it is obvious that for large n the savings of
the algorithm of Figure 4-18 is very considerable. For example, if the communication
speed is as much as ten times slower than the time to do a floating point computation,
then the speedup from using the algorithm in Figure 4-18 on a matrix of size 100 by 100,
over the algorithm in Figure 4-16, is still on the order of one thousand.
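The two cost formulas are easy to compare numerically. A sketch that evaluates them with the sample figures above (n = 100 and c = 10f; the values s = 100f and mc = f are assumptions, since the text leaves them open):

#include <stdio.h>

int main(void) {
    double n = 100.0;
    double f = 1.0;        /* time unit: one floating point operation */
    double c = 10.0 * f;   /* processor-to-processor communication    */
    double s = 100.0 * f;  /* channel setup (assumed value)           */
    double mc = f;         /* one machine cycle (assumed value)       */

    double parallel = s + (n + 1) * f + n * c + 4 * mc;
    double sequential = 2 * n * n * mc + 2 * n * n * n * f;

    printf("speedup = %.0f\n", sequential / parallel);  /* roughly 1700 */
    return 0;
}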
4.5.3 Working with Databases
One of the major applications of computing is the use of a database management
system to store and retrieve information from computer files. For example, the registrar
of your institution has access to a rather large database that keeps a listing of all of the
students, the courses that they have taken, and their grades in those courses. There is
most likely more information in this database, but you can appreciate the need for the
database to be maintained in a timely and accurate fashion and for information from this
database to be easily accessible to some people and inaccessible to others.
The use of parallel processing for database applications is a subject of great
debate in the computing community. While a parallel SIMD machine is very efficient for
searching a database, a bottleneck develops when I/O operations must be performed or
memory must be accessed. On the other hand, SIMD architectures contain a large
number of processors that can each do the relatively simple operations required by a
database search. Thus, several database tables can be searched simultaneously in a very
rapid manner. Since the relations on each table that involve the same database “key”, i.e.
the field of a record that uniquely identifies the record, can be found rapidly, operations
such as a natural join can be performed in a very efficient manner. For example, all
records from different tables pertaining to an individual with a given social security
number can be quickly located and linked together.
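The kernel of such a join is an equality test executed by every processor at once. A sketch of the idea in sequential C (each loop iteration stands for one processor comparing its record's key against the broadcast value; the record layout and field names are made up for the illustration):

#include <stdio.h>

#define P 4  /* number of simulated processors, one record each */

struct record { long key; char name[20]; };

int main(void) {
    struct record table[P] = {
        {1001, "Brown"}, {1002, "Smith"}, {1003, "Jones"}, {1004, "Lee"}
    };
    long broadcast_key = 1002;   /* the key being searched for */

    /* Every "processor" tests its own record in the same step. */
    for (int ID = 0; ID < P; ID++)
        if (table[ID].key == broadcast_key)
            printf("processor %d: matched %s\n", ID, table[ID].name);
    return 0;
}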
Before we look at the possibilities for using parallel machines for database
applications, we will consider a list search routine and then review some definitions from
database systems theory.
4.5.3.1 Searching a List of Numbers
Suppose that we have a list of n numbers that may be keys, i.e. pointers to
additional information that is stored in the system. If the list is sorted, then it can be very
easily searched in O(ln n) time using a binary searching technique. Not only is O (ln n)
the time required for the binary search, it is the theoretical lower bound for any search on
a Von Neumann computing machine. If the list is unsorted, the time to search the list on
a Von Neumann machine is O(n), a much slower result when n is large. We will
illustrate a result for SIMD machines that is, in theory, O(1). In practice, the result is
O(⌈n/p⌉), where n is the number of elements in the list and p is the number of processors
that make up the processing cluster (see Figure 4-9). The term ⌈n/p⌉, however, may easily
be 1, since the usual number of processors, p, for a SIMD machine is quite large. For
example, the CM-2 machine has 16,384 processors available, each of which is capable of
doing simple computations.
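For comparison, the O(ln n) baseline on a sorted list is the familiar binary search, sketched here:

#include <stdio.h>

/* Classical binary search on a sorted array: O(log n) comparisons.
   Returns the index of key, or -1 if it is absent. */
int search(const int a[], int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key)     return mid;
        else if (a[mid] < key) lo = mid + 1;
        else                   hi = mid - 1;
    }
    return -1;
}

int main(void) {
    int a[] = {2, 3, 5, 7, 11, 13};
    printf("%d\n", search(a, 6, 7));   /* 3  */
    printf("%d\n", search(a, 6, 8));   /* -1 */
    return 0;
}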
The search technique can best be understood by conducting the following simple
experiment in your classroom. Assuming that you have fewer than 52 students in your
class, distribute one card from a shuffled deck of cards to each student. You hold the
remainder of the deck. Then ask for a particular card. For example, say, “Who has the
ten of spades?” If one of the students has the ten of spades, she responds, “I do.” At this
point you know where the ten of spades is located. If no student has the ten of spades, no
reply will be made and you conclude that the ten of spades was not among the cards
distributed to the students. In either case, your search was completed by asking one
question. That is, it was done in O(1) time!
The actual process is almost exactly the same as the experiment that you just
conducted. Instead of students in your class, the computer has separate processing
elements. The deck of cards is replaced by the list of numbers or “keys.” Now the host
computer (the one that controls and synchronizes the processing elements) distributes the
list, one number to each processing element. The host sets a location variable to a negative number on the assumption that the processor IDs are in the range of 0 to p-1,
where p is the number of processing elements in our machine.
Now, the search begins. The host broadcasts the number that is the object of the
search to all processors. If a processor does not have the number, it shuts off. The
processor that has the number returns its ID to the host, which stores that ID in the location
variable. The host program either returns the location of the number or the message “Not
Found.” A possible coding for this process is given in Figure 4-20.
main(){
    /* Preprocessing stage, done in the host computer */
    loc = -1;
    for(i = 0; i < n; i++)        /* distribute the list, one element */
        if(i == ID)               /* per processing element           */
            val = list[i];        /* each PE's variable val holds the */
                                  /* ith list element                 */
    broadcast(key);               /* host broadcasts the search key   */
    /* Done in each processing element */
    if(key == val)
        loc = ID;                 /* only the matching PE sets loc    */
    /* Now back in the host */
    if(loc != -1)
        printf("The key is in list element %d\n", loc);
    else
        printf("The key is not in the list\n");
}
Figure 4-20: Pseudo code for the O(1) search
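When n exceeds p, each processing element holds ⌈n/p⌉ list elements and scans its own block after the broadcast, which gives the O(⌈n/p⌉) time quoted above. A minimal sketch of that per-processor step (our illustration; the equal block size m and the name pe_search are assumptions):
int pe_search(const int *block, int m, int key, int ID){
    for(int j = 0; j < m; j++)     /* scan this PE's block of the list */
        if(block[j] == key)
            return ID*m + j;       /* global index of the key          */
    return -1;                     /* key is not in this PE's block    */
}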
4.5.3.2 Searching a Database
The term database is a shortened form of the more descriptive database
management system. It refers to a collection of interrelated data together with programs
that access that data to provide information to a user. Requests for particular pieces of
information are called queries.
If we are considering a database for a large banking system, we would want to
have information about the bank’s customers, the accounts that they have with the bank,
what branch of the bank hosts the account, where the branches are located, and other
information about the bank and its customers. It does not make sense that all of this
information should be stored in one large table where a listing for Mary Brown would
contain Brown’s home address, Brown’s account number, her balance in that account, the
name of the branch of the bank where she does her banking, the address of the branch,
and other related information. Such a scheme would be very costly, inefficient, and prone to errors. For example, if Bill Smith also had an account in the same branch as Mary Brown, all of the information in Bill Smith's record concerning the branch would duplicate that in Mary Brown's record. Worse yet, if Mary had a savings account, a checking account, and a loan at the same branch, all of the information about Mary and the branch would be stored in triplicate! Think of the updating problem if the branch were to move to a new address.
For that reason, data is stored in tables that can be related. In this way there is a
minimal duplication of data and updating is done in a much more efficient way. In the
above example we could store the information in the following tables. Each table is given a name, and under each name are listed the attributes that the table contains.
Customer:
Name
Address
Social Security Number
Account:
Type (Savings, Checking, Loan)
Account Number
Customer Name
Balance
Interest Rate
Host Branch
Branch:
Branch Name
Branch Address
These tables are related by the use of a certain attribute that uniquely identifies each record in the table. For example, Social Security Number will identify a customer; Account Number, an account; and Branch Name, a branch. The relations are also stored as tables:
Customer/Account:
Social Security Number
Account Number
Account/Branch:
Account Number
Branch Name
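As an illustration (ours, not the text's), these five tables might be declared as C record types; the field names simply follow the attribute lists above, and the field widths are arbitrary assumptions.
struct Customer   { char name[40]; char address[80]; char ssn[12]; };
struct Account    { char type[10];          /* Savings, Checking, Loan */
                    long number; char customer[40]; double balance;
                    double rate; char host_branch[40]; };
struct Branch     { char name[40]; char address[80]; };
struct CustAcct   { char ssn[12]; long account; };    /* Customer/Account */
struct AcctBranch { long account; char branch[40]; }; /* Account/Branch   */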
When a new customer comes to the bank, a record is created in the Customer table using information supplied by the customer on the application blank. When that customer opens an account, a record is created in the Account table, the Customer/Account table, and the Account/Branch table. As deposits or withdrawals are
made from an account, the Account table is updated. No other information is accessed.
The same applies to loans and loan payments.
Now let's ask a question, or query, of the database. For example, what are the total assets for Mary Brown in the Chambersburg branch of the bank? To answer this question, we need to know the balance of every account that Mary Brown has in the Chambersburg branch of the bank (she may have accounts in other branches), add up the balances in her savings and checking accounts, and subtract any loan balances from that total.
We assume that the host computer is large enough to store the five tables listed
above. The ith line of each table is the information given while making the ith entry into
that table. Corresponding lines of tables are not usually related. Now, assume that each
processor in the bank’s SIMD computer has enough memory to contain the following
information in this order: the name (nm) from the ith line of the Customer table, the account number (a1) and social security number (ss1) from the ith line of the Customer/Account table, the account number (a2) from the ith line of the Account table, and the branch name (bn) from the Branch table. The variable names for these
five pieces of information are given in parentheses after the attribute names.
Without coding, we describe the search that provides an answer to the query about Mary Brown's assets in the Chambersburg branch. We proceed much as we did in our key
search of section 4.5.3.1. The host broadcasts the name, “Mary Brown,” to all of the
processors. The processor containing that name in its nm variable returns its ID. The ith
line of the Customer Table is examined for Mary Brown’s social security number. This
number is broadcast. The processors containing that number as ss1 return their IDs.
Each of these IDs is processed in order. The appropriate line of the Customer/Account
table is extracted and the Account number is broadcast. The processor that has that
number as its a2 variable returns its ID. The appropriate line of the Account table is
extracted and the Branch attribute is checked to see if it matches “Chambersburg.” If it
does, then the account type attribute is checked. If it is savings or checking, Mary
Brown’s assets are increased by the account balance. If it is a loan, her assets are
decreased by that amount. When all of the account numbers found above have been processed, the value of Mary Brown's assets is returned and the query is answered.
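A rough pseudocode rendering of this search, in the style of Figure 4-20 (our sketch, not the text's; the helper routines broadcast, returned_ID, and responding_IDs are assumed SIMD primitives):
assets = 0.0;
broadcast("Mary Brown");               /* PEs compare against nm        */
i = returned_ID();                     /* line i of the Customer table  */
broadcast(customer[i].ssn);            /* PEs compare against ss1       */
for(each k in responding_IDs()){       /* one ID per account she holds  */
    broadcast(cust_acct[k].account);   /* PEs compare against a2        */
    j = returned_ID();                 /* line j of the Account table   */
    if(account[j].branch == "Chambersburg"){
        if(account[j].type is Savings or Checking)
            assets += account[j].balance;
        else                           /* the account is a loan         */
            assets -= account[j].balance;
    }
}
return assets;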
4.6 An Overview of SIMD Processing
SIMD Processing is inherently synchronous. That is, all operations are carried out
at the same time on all processors and we do not begin the next one until all processors
have completed the operation. It is true that, in general, the class of problems that can be
addressed by SIMD architectures is less general than those that can be addressed by
several independent processors operating asynchronously, that is, in an architecture that we have called MIMD processing. On the other hand, it is exactly the synchronization that makes SIMD an attractive computing alternative. At some point in any parallel program there is a need for processors to share information. This cannot be done until all processors have completed their computational tasks up to that point; in an asynchronous machine, the faster processors must sit idle until the slowest processor finishes its task. In the SIMD model, the synchronization of every computing step eliminates this need to wait.
Another characteristic of SIMD programs is that there is no asynchronous
contention for resources. Either no processor wants a resource or all processors want it.
In the latter case the control processor broadcasts it to all processors. This eliminates
hardware deadlocks since a resource is strictly private or else always shared.
Of course, the major characteristic of SIMD processing is that there is one
(common) instruction memory. In the MIMD paradigm, the entire program must be
downloaded into each processing element. From the programmers’ point of view, this
fact makes the testing and debugging of programs much more difficult within the MIMD
environment. If an error occurs in a SIMD program, many of the standard debugging
techniques can be applied and the offending instruction located. When a program
running on a MIMD architecture fails, it is extremely difficult to determine the offending
instruction since the processors are running independently and it is not clear at what exact
stage each processor is in the computation. A basic debugging technique in this case is to
insert synchronization points to try to narrow down the region of the program that
contains the error. In the SIMD paradigm, these points exist for every instruction.
In any large-scale numerical computation, there are portions that are inherently best suited for SIMD architectures; for those portions, the SIMD design is much more efficient. Parallel computing in the future may consist of a "hybrid machine" made up of both SIMD and MIMD machines. The front end compiler may well decide which parts of a program are best suited for which paradigm and dispatch those parts to the appropriate machine. Thus, there may be a way to avoid making a decision about which machine we
plan to use for our entire program.
4.7 An Example of Vector Architecture - The Cray Family of Machines
This family of vector machines includes the Cray-1, Cray-2, Cray X-MP, and Cray
Y-MP/C90, which were designed and built from the mid-1970s through 1993. The basic design was the brainchild of Seymour Cray, one of the founders of Control Data Corporation who later formed his own company, Cray Research, which produced this
family of machines. The fundamental ideas for the series are found in the Cray-1, which
will be the primary example of this section. This architecture is illustrated in Figure 4-21.
[Figure: block diagram showing main memory (64-bit words) feeding the instruction processor; eight 64-bit vector registers with 64 elements per register; eight 24-bit address (A) registers backed by 64 B holding registers; eight 64-bit scalar (S) registers backed by 64 T holding registers; and twelve pipelined functional units (integer add, integer multiply, Boolean, shift, population count, float add, float multiply, float reciprocal).]
Figure 4-21: A Conceptual Diagram of the Cray-1 Architecture
This architecture consists of one main memory feeding data to the instructional unit and
the vector registers. The instructional unit performs all of the interpretation of
instructions and holds the scalar registers. The ALU consists of 12 independent
functional units that are bidirectionally connected to the registers as indicated in the
above diagram. Note that three units are devoted to vector operations on integer or
Boolean data. Three of the units are shared by the vector and scalar registers for floating
point data. An additional four units are for integer, Boolean, and bookkeeping in the
scalar registers, and two units are used for address calculation by the address registers.
All of the units are pipelined in various stages. There are 2 to 14 stages per unit, with the floating point reciprocal, i.e. computing 1.0/x where x ≠ 0.0, requiring 14 stages.
Memory consists of 64-bit words. In the Cray-1 it consisted of one million words
and was increased to 256 million words in the Cray-2. The Cray-1 memory is accessed
by a one-way memory/data path that can be thought of as an 11-stage pipeline, 64 bits
wide. This bandwidth is rather low and becomes a bottleneck in some calculations. One
example of such a calculation is a memory-to-memory multiply. This operation requires
two memory reads and a memory write. Such an operation, although legal, violates the
basic design assumption that most arithmetic operations will be performed on data
resident in the registers. Unfortunately, higher level language compilers do not usually
make such assumptions. In the Cray X-MP, two extra ports were added to the main
memory, thus allowing three concurrent memory transfers. The Cray Y-MP/C90 has four memory ports, each of which can deliver 128 bits per processor cycle, a cycle time that is itself considerably shorter than in earlier models.
There are no specific instructions in the Cray-1 for reduction operations on vectors (sum the components, find the minimum, etc.). Such operations, including the
scalar product calculation, are of utmost importance in scientific calculations. For this
reason, the design permits one of the input vector registers to be the same as the one that
receives the result. Thus, any vector instruction can be converted into a reduction
operation.
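To make the idea concrete, here is a minimal serial sketch (ours, not the text's) of one standard way a sum reduction can be built out of ordinary vector additions: repeatedly add the top half of the vector into the bottom half, so that log2(n) vector adds leave the total in element 0.
#include <stdio.h>

int main(void){
    float v[64];
    int n = 64, i, half;
    for(i = 0; i < n; i++) v[i] = 1.0f;  /* example data          */
    for(half = n/2; half >= 1; half /= 2)
        for(i = 0; i < half; i++)        /* one "vector add" step */
            v[i] = v[i] + v[i + half];
    printf("sum = %f\n", v[0]);          /* prints 64.000000      */
    return 0;
}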
An important feature of the Cray series is that “chaining” of operations is allowed.
For example, if we have α in register 1, X in register 2, and Y in register 3, we can compute the saxpy operation described in Section 4.3,
αX + Y
without storing the intermediate results in yet a fourth register. As soon as the first result
of the multiply is completed it can go directly to the adder together with the Y elements
from register 3 and begin to be processed. This technique produces a noticeable speedup
of the entire operation. The result is conceptually one long pipeline made up of two
shorter pipelines. Memory-register transfer operations and arithmetic operations can be
chained.
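In scalar C form, the chained operation is simply the following loop; with chaining, each product a*x[i] flows from the multiply pipeline directly into the add pipeline instead of being parked in a fourth register (a sketch of the effect, not Cray assembly):
void saxpy(int n, float a, const float *x, const float *y, float *z){
    for(int i = 0; i < n; i++)
        z[i] = a*x[i] + y[i];   /* multiply result chained into the add */
}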
As a result of chaining the Cray-1 is listed as being able to do 80 million additions
and 80 million multiplications per second after pipeline startup. This figure is for
operations done in registers. When a memory-to-memory vector multiply is performed
the rate drops to one third of the multiply rate or about 27 million multiplications per
second. This rate increases to 70 million per second in the Cray X-MP configured with
one CPU. The X-MP can be configured with up to four CPU’s. The Cray Y-MP/C90 can
be configured with up to 16 processing elements is listed as having a peak rate of 15.2
billion operations per second with an observed rate of 9.69 billion operations per second.
4.8 An Example of SIMD Architecture - The Connection Machines
In our conceptual model of a SIMD machine shown in Figure 4.9, we assumed
that the processors were connected via a two-dimensional grid. The Connection Machine
is cube- (and grid-) connected. It is, unlike the vector processing machine, designed
primarily for data manipulation. In this sense, it is a “symbol cruncher” as opposed to a
“number cruncher.” These machines, built during the decade spanning 1985 - 1994, are the product of Thinking Machines Corporation, a company co-founded by W. Daniel (Danny) Hillis, who designed the CM-1 as part of his Ph.D. thesis at MIT. We will discuss the CM-1 and its successor, the CM-2. The latest machine from Thinking Machines, the CM-5, is a MIMD machine and will not be discussed here. The Connection Machine is
not a complete system. It is designed to be a back end machine with up to four hosts.
The hosts can be either VAXes or Symbolics Lisp machines, or a combination of these.
Programs are compiled on the hosts and then downloaded for execution to the Connection
Machine via a 4 by 4 crossbar switch. The programs are executed “on-the-fly” as they are being downloaded; the Connection Machine has no program storage. The units that receive the instruction stream and drive the processors are called sequencers. The design of the CM-1 is shown in Figure 4-22.
[Figure: up to four host units feed a 4 by 4 crossbar switch; each crossbar output drives a sequencer controlling a quadrant of processing elements (shown as blocks of 8K PEs), with I/O channels connecting the processors to data vaults.]
Figure 4-22: A Conceptual Design of the CM-1 Architecture
Each of the four quadrants has 16K one bit processors that can do a one bit add in 750
nanoseconds and the addition of 32-bit integers at a rate of 2 billion per second. These
processing elements are so small that 16 of them fit on a one square centimeter chip together
with a message router that is connected to the hypercube network and a decoder that
controls the processors and the router. This chip is surrounded by four 16-Kbit static
RAM chips with read/write ports 4 bits wide. Consequently each processor can be
thought of as having its own 4-Kbit memory.
As stated earlier, the Connection Machine is designed for artificial intelligence
applications (semantic networks, forward chaining inference, backward chaining
inference, etc.) and database operations. It was not optimized for high speed calculations.
[Figure: schematic of a single CM-1 processor. A 4-Kbit memory supplies bits A and B to the one-bit ALU, which also reads a flag bit F and produces outputs such as AB+BF+AF; 16 flag bits (status, self address, NEWS, router data, daisy chain) connect the processor to the NEWS flags of its four neighbors, the daisy chain link, and the router input/output bus.]
Figure 4-23: A Schematic Diagram of a CM-1 Processor
The operation of each processor uses locations A and B in memory, and flag registers F and Fdest (flag output can go to the flags of neighboring processors). The processor reads 2
bits from locations A and B and 1 bit from register F. It then performs some ALU
operation on them and stores one output bit in location A and the other in register Fdest.
Examples of ALU instructions are ADD, AND, OR, MOVE, and SWAP. Inputs can be
inverted and additional instructions, such as SUBTRACT, are possible. The ALUs do not really perform logic operations; they look up the proper outputs in a 16-bit register that contains the two output columns of the truth table for the ALU operation being performed. In addition
to this truth table, the microcontroller also sends the processors the A and B addresses, the
register containing the Read flag, the register to receive the Write flag, a Condition Flag
that specifies which flag to consult for synchronization information, a Condition Sense bit
telling whether or not to proceed with the operation, and 2 bits specifying the direction
(NEWS) to move the data during the instruction. This information is contained in 60 bits
and is sent to the processors during each ALU cycle.
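A minimal sketch (ours; the bit layout of the 16-bit table is an assumption) of how such a truth-table ALU can be realized: the three input bits select one of eight rows, and the two 8-bit halves of the control word hold the memory-result and flag-result columns.
unsigned alu(unsigned a, unsigned b, unsigned f,
             unsigned table, unsigned *flag_out){
    unsigned row = (a << 2) | (b << 1) | f;   /* select row 0..7      */
    *flag_out = (table >> (8 + row)) & 1;     /* flag-output column   */
    return (table >> row) & 1;                /* memory-output column */
}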
A clever aspect of the processor design is the use of the 16 1-bit flag registers.
Some of them are used for communication via the grid, the daisy chain that serially links
the 16 processors located on each chip, or the router that links all of the chips together.
One is a read-only NEWS flag that receives flag output from the flag specified in the
instruction. Another is a read-write Router Data Flag that receives data from and sends
data to the message router. The Daisy Chain Flag reads the flag output from the
preceding processor on the daisy chain connection. There is also a Global Flag, the logical NOR of the flag outputs of all of the processors on a chip, used for communication with the host. Other
flags are unspecified so that they can do miscellaneous work such as hold the carry bit in
an ADD operation.
Since the router will need to deliver messages to particular processors, parts of
memory are set aside for the processors' absolute addresses. Other parts of memory hold a status word that includes a bit indicating when a message is ready to be sent to the router, and still other parts are set aside to receive messages from the router. The router
itself also has its buffers in memory.
The NEWS connections and the daisy chain give quick access to a processor’s
neighbors and other processors in the cluster. Connections to any other processors are
handled more slowly through a 12-dimensional hypercube network formed by
connections between the routers (see Figure 2.xx and the following discussion). The
routing mechanism used by the Connection Machine is the “store-and-forward” method
that moves messages through a nearest neighbor protocol. This is not the fastest way
through the hypercube, but it allows routing to be handled automatically. The user does
not need to consider the steering of a message through the network. The routing algorithm consists of twelve steps. The XOR of the source and destination node addresses is computed. If this XOR contains a 1 in the ith dimension, the message is sent in step i to the adjacent node in that dimension; otherwise it does not move.
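A minimal sketch (ours, not the text's) of this dimension-by-dimension routing on the 12-cube:
unsigned route(unsigned src, unsigned dst){
    unsigned node = src, diff = src ^ dst; /* 1-bits mark dimensions to cross */
    for(int d = 0; d < 12; d++)
        if((diff >> d) & 1)
            node ^= 1u << d;  /* step d: hop to the neighbor in dimension d */
    return node;              /* node == dst after at most twelve hops      */
}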
Each router's function is to receive messages from its 16 local processors and prepare them for transmission to one of the 12 routers to which it is directly connected. It is also responsible for receiving messages from each of those routers and either delivering or forwarding them. Each message is 32 bits, plus the
destination address and a return address for a total of approximately 60 bits.
The design of the CM-2 machine is essentially the same as that of the CM-1
except that there is the addition of a floating-point coprocessor with every 32 of the 1-bit
integer processors. This created a collection of 2K (2,048) floating point units and marked a shift in the emphasis of Thinking Machines Corporation from symbolic to numeric
computing. The resulting machine could perform billions of floating point operations per
second on numeric problems favorable to SIMD operation. The newer model 200
spreads each single-precision value across 32 processors so that the floating point value
can be transmitted all at once. Thus, the CM-2 is viewed as a parallel computer
containing 2048 32-bit processors. This change in hardware has made the CM-2 an effective performer on numerical problems. Teams using the CM-2 have won the Gordon
Bell Awards for performance.
4.9 Summary
The SIMD paradigm is not a panacea. There are many important models for
which other paradigms are more appropriate. For example, a medical student training
model that has several body systems working together on a set of symptoms would not be
suited for SIMD processing. Each system has different operating characteristics and does different things from the other systems, though in harmony with them. This example is
best handled in an environment where different processors are modeling the different
systems and sharing information. This is not a SIMD environment.
On the other hand, the SIMD paradigm is extremely appropriate for many
problems that involve linear mathematical calculations or database operations where
several processors can be doing the same operation on different data sets at the same
time. Linear mathematical calculations involve vector and matrix calculations. Database operations involve searching for information that may be contained in several
different tables. This process involves the repetition of the same instructions on different
sets of data. That is, by definition, a SIMD process.
There are two primary architectural models within the class of SIMD machines.
These are the vector processing model, with its vector registers and accurate floating point hardware, and the pure SIMD model, which involves a master Control Unit and several lower cost processors, each having its own memory. The vector processing machines were
clearly designed for doing linear mathematical computations. They are generally
extremely fast and deliver a high degree of accuracy. A fundamental assumption of these
machines is that they will be operating on data already resident in the registers. Operations such as memory-to-memory multiplications create a bottleneck that would force the earlier vector processing machines to operate at less than their top rated speed. This
bottleneck has at least been widened, if not entirely eliminated, by the addition of more
memory ports and reduced machine cycle time in later model vector processing machines.
The pure SIMD machines have the advantage that processing elements can be
turned off when an operation is needed for only a subset of them. Another advantage is
that processing elements can communicate quickly with their neighbors. The disadvantage is that each processing element may handle only small pieces of data, in some models as small as one bit. In the early examples of this type of machine, this meant that any significant numerical calculation would slow down a
process considerably. In addition, the results might not have the accuracy that can be
found on the vector hardware of a vector processing machine. That is because the earliest
models of this type of processor were designed to carry out symbolic operations, for
example to answer questions such as “Is Flipper a Whale?” as opposed to tracking the course of hurricanes or determining tomorrow's high temperature reading.
The addition of floating point hardware to operate on clusters of processors has
made the pure SIMD architecture a much more acceptable one for numerical calculations.
In fact, many large-scale numerical calculations have been successfully handled by
machines built on this paradigm.
Thus, while SIMD may not be the best answer for all problems, there are
problems for which it is the superior design. Almost any problem appropriate for high
performance computing has some parts that would be best handled by a SIMD machine.
The next step in high performance computing may involve networking several machines
operating under different paradigms to run a program by dispatching parts of it to the
appropriate hardware. Certainly, a SIMD machine will be part of this larger, “hybrid”
machine.