Chapter 4  SIMD Parallelism

Table of Contents

4.1 Introduction
    4.1.1 The History of the SIMD Architecture
    4.1.2 An Analogy - A Manufacturing Process
4.2 The "saxpy" Problem
4.3 A First Attempt at SAXPY Speedup - Pipelined ALUs
4.4 Vector Processing - One Version of SIMD
    4.4.1 Pure Vector Processors
    4.4.2 Pipelined Vector Processors
    4.4.3 An Example - A Matrix/Vector Product
4.5 Beyond Vectors - Parallel SIMD
    4.5.1 Using Subsets of Processors - A Prime Number Sieve
    4.5.2 The Advantage of Being Connected - The Product of Two Matrices
    4.5.3 Working with Databases
        4.5.3.1 Searching a List of Numbers
        4.5.3.2 Searching a Database
4.6 An Overview of SIMD Processing
4.7 An Example of Vector Architecture - The Cray Family of Machines
4.8 An Example of SIMD Architecture - The Connection Machines
4.9 Summary

4.1 Introduction

In Section 2.3 we discussed the Flynn Taxonomy for parallel processing. One of the first expressions of parallelism was the SIMD (Single Instruction Multiple Data) model. Conceptually, this model is the first step away from the Von Neumann architecture discussed in Chapter 1. In one sense, we are still dealing with the classical model of computing: the program is processed one instruction at a time, in the sequential order in which the instructions appear in the program. The difference is that instead of one processing unit executing the instruction for one value of the data, several processors execute the instruction for different values of the data at the same time. This is what we referred to as synchronous control in Section 2.4.3 of Chapter 2. This mode of processing has obvious implications for an increase in processing speed.

As we will see in this chapter, there are two ways to implement the idea of having multiple processors operate on different pieces of the data in a synchronous manner. The first is to take the obvious opportunity: process vector operations, such as vector addition or coordinatewise multiplication, that require no communication between the processors, i.e. operations in which the result in one component is independent of the results in the other components. This type of SIMD parallelism is known as Vector Processing. Many computer scientists distinguish this form of parallelism from SIMD parallelism; in this text we will not make the distinction except to note that it exists.

The other form of SIMD parallelism that we will discuss is what we call the Pure SIMD model of parallelism. This model allows, and its most interesting applications depend on, communication between processors. This means that one processor can use the result generated in another (usually neighboring) processor during a previous step in the calculation; the need for the data to be independent is removed. It also means that processors can be "shut off" during a series of calculations. The ability for processors to communicate and to be shut off during processing is particularly useful in applications such as database searches and matrix operations.

4.1.1 The History of the SIMD Architecture

The SIMD architecture was the first of the parallel architectures to be implemented on a large scale. In the late 1970s the ILLIAC IV was one of the most powerful computers in the world and handled many of NASA's computation intensive problems. It cost 30 million dollars and was extremely difficult to program. It had 64 processing elements arranged in an 8 by 8 grid.
Each processing element was considerably smaller than the processing elements of existing serial (Von Neumann) computers; however, each processor had its own local memory. Although the ILLIAC IV was never a commercial success, it had a tremendous impact on hardware technology and on the design of other SIMD computers, such as the Connection Machine, designed by W. Daniel (Danny) Hillis in 1985 to support his research in artificial intelligence at MIT. We will discuss the architecture of this machine later in this chapter.

On another front, Seymour Cray was designing a numeric supercomputer for processing vector computations. Cray realized that the jumps that occur during the serial processing of vectors (such as in the addition of vectors) are a waste of computing time. He built the world's first supercomputer with multiple ALUs specialized for floating point vector operations, implementing a simple RISC-like architecture. The Cray vector processors are register oriented and implement a basic set of mostly short instructions. We will discuss this architecture in the section of this chapter devoted to the Cray series of machines.

4.1.2 An Analogy - A Manufacturing Process

A large confectionery manufacturing company is planning to enter the candy bar market and introduce a "sampler" to be sold at candy counters and checkout lines at supermarkets. The sampler will have four candies: a chocolate covered cream, a peanut cluster, a caramel, and an almond solitaire. The candies will be stacked in a paper tube, separated by cardboard disks. The company is in the process of designing its production line for this new product.

[Figure: a single loading station, fed by four candy bins (chocolate cream, peanut cluster, caramel, almond solitaire) and a hopper of cardboard disks]

There are three production designs under consideration. The first, and least expensive, is to have the paper tube pass under a single loading station that is fed by each of four candy holding bins and has a hopper of cardboard disks as part of its assembly. The tube is filled by first placing a disk in the bottom, then inserting the almond solitaire, followed by a cardboard disk and the caramel, then a disk and a peanut cluster, and finally a cardboard disk, the chocolate cream, and a cardboard disk. The machine is very efficient: the process of dropping a disk and a candy takes approximately one half second. Thus a tube can be filled every 2 seconds, for a total of 30 tubes a minute and a production of approximately 43,000 tubes per day. Not really enough for a national market.

If we make the analogy of the pieces of candy and cardboard separators to data being processed by a computer, then the first production model described in the preceding paragraph is similar to the Von Neumann sequential, or SISD, model of computation. Modern processors may very efficiently process each individual piece of data, but the sequential nature of the machine makes it inappropriate for processing the large amounts of data required by many of the modern, grand challenge scale problems being addressed by scientists today.

Continuing with our candy production analogy, consider an alternative idea for the production line. It involves a packaging station that is roughly four times as large as the one in the previous design. It has four feeding lines, one to a bin of each type of candy, and each line is accompanied by a hopper of disks. This design for the production line is diagrammed in the following illustration.
[Figure: a four-stage filling station with separate feeding lines for the chocolate cream, peanut cluster, caramel, and almond solitaire]

If we assume that the time to move a tube from position to position within the filling station is negligible, then after filling the first tube in 2 seconds, there are four tubes within the station and a tube of candy is produced every half second. This gives a total production of 120 tubes per minute and close to 173,000 per day. This is a more acceptable production schedule, but it is a more expensive solution to the production problem.

The above design is analogous to the pipeline design discussed in Chapters One and Two. The pipeline is used to process the instruction "fill a tube with candy." However, the machine is made up of four smaller components, and the instruction is broken into its four component pieces: "put in a chocolate cream"; "put in a caramel"; "put in a peanut cluster"; and "put in an almond solitaire". Each part of the packaging machine processes the same piece of the larger instruction over and over again. Thus, four tubes of candy are being worked on at the same time.

Another possibility for designing the candy production line is to use ten of the less expensive stations described in our first example and have them on a synchronized production schedule. By this we mean that all ten machines will simultaneously load a chocolate cream into a package, followed by all ten machines loading a peanut cluster, etc. When the tubes are filled, they move out simultaneously to the next stage in the production process. Thus, ten packages of candy are produced every 2 seconds.

[Figure: ten identical loading stations, Station 1 through Station 10, operating in lockstep]

With this design 300 packages of candy are produced each minute, for a total of 432,000 packages per day! This is a significant increase over the previous design, and it uses only the less expensive stations. It does, on the other hand, complicate the design and setup problems: decisions have to be made on how to supply the machines and how to synchronize each step in the production process.

What we have described by the manufacturing analogy is a prologue to our description of the SIMD (Single Instruction Multiple Data) computing process. We will of course be processing data, not candy, and we will consider hundreds of processing stations, not just ten. However, the basic ideas and problems are essentially the same.

4.2 The "saxpy" Problem

In this section, as with much of the rest of this chapter, we will be using two terms with a great deal of regularity: vector and scalar. They are terms used by professionals working in the area of Numerical Linear Algebra, a large client field for high performance computing. Without going into all of the ramifications of these terms, we will take the understanding that a vector is an array of floating point numbers and that a scalar is a single floating point number. This definition is narrower than the generally used definition, but it is acceptable for our purposes. Also within this context, we define a matrix to be a two-dimensional array, or table, of floating point numbers.

When we consider many of the large problems involving applications of computing, we see that they involve vector and matrix manipulations. For example, almost all of the "Grand Challenge" scale problems that we discussed in Chapter 2 are of this type. Prime examples of such problems are forecasting the weather (see Figure 2.xx), the ocean tides, and climatological changes.
These problems all involve taking large amounts of data from many locations, modifying that data by local conditions, and passing it on to the next location to interact with the conditions there. As a simple example of this process, suppose that we have a metal wire that is being heated at one end. Choose n points at equally spaced distances along the wire. Let Y be a vector whose ith component contains the temperature of the wire at the ith point at a given time, and let X be a vector whose ith component gives the temperature of the surrounding medium, including the points adjacent to the ith point, at the same time. If a is a factor that depends on the heat conductivity of the wire, the time between measurements, and the distance between points, then the temperature at the ith point at the next time a measurement is taken may be given by:

    y(i, t+1) = y(i, t) + a * x(i, t)

A general description of this process can be expressed within a computer by vector or array operations. A simple model, such as the above process, can be described as follows: suppose that a vector Y gives the values of a particular condition at locations i = 0, 1, 2, ..., n at time t, and a vector X gives the value of the same condition at the locations adjacent to location i at the same time. If a force of fixed intensity, a, is applied to all locations, the resulting value of Y at time t + 1 is given by

    aX + Y

The computation of the resultant Y is called the saxpy or "scalar a (times) x plus y" problem. This computation is so prevalent in scientific computations that it became an important benchmarking standard for evaluating the performance of computing systems. The sequential coding of this process is very straightforward. For example, in C it may be written as:

    #include <stdio.h>
    #define n 1000

    float a, x[n], y[n];

    int main(void) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
        return 0;
    }

If we eliminate the computations involved with subscripting and look only at the steps required to compute "c += a*b" as opposed to "y[i] += a*x[i]", the last line of this code translates into the relatively simple assembly code of the Motorola 68040 chip shown in Figure 4-1.

    _main:  link     a6,#-12
            fmoves   a6@(-4),fp0      |
            fsglmuls a6@(-8),fp0      |
            fmovex   fp0,fp1          |  the saxpy computation
            fadds    a6@(-12),fp1     |
            fmoves   fp1,a6@(-12)     |
    L1:     unlk     a6

Figure 4-1: Assembly Code for a Saxpy Operation

In Figure 4-1 we see that there are five lines of assembly code devoted to the basic saxpy operation. If we view each assembly command, as we did in Figure 1-2, as being further broken down into the machine operations "decode the present instruction", "fetch the operands", and "execute the instruction", each requiring one machine cycle, and assume that fetches and stores to memory take two machine cycles, then a single saxpy computation (a single pass through the loop) will take 19 machine cycles to execute, as is shown in Figure 4-2.

[Figure 4-2: Time for the Saxpy Process - the decode, fetch, and execute cycles of the five instructions, with the memory fetches of a and b and the store of c each taking two cycles, for a total of 19 machine cycles]

This is a highly idealized view of the process, but if we accept this model, we see that our C code, which must execute 1000 saxpys, will execute in approximately 19,000 machine cycles. Since machine cycle times are measured in microseconds (10^-6 seconds) or even nanoseconds (10^-9 seconds), 19,000 machine cycles does not seem to be a great deal of time. However, the large problems that we are talking about deal with vectors whose sizes are measured in the hundreds of thousands, and a program may do millions of such calculations.
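The 19-cycle figure can be reproduced from the model's assumptions. The small C check below is our own illustration; the accounting of which instructions touch memory is an assumption chosen to match the idealized model just described, not a statement about real 68040 timing.

    #include <stdio.h>

    int main(void) {
        int instructions = 5;   /* assembly lines in Figure 4-1            */
        int base = 3;           /* decode + fetch operands + execute       */
        int mem_ops = 4;        /* the loads of a, b, and c and the store
                                   of c each cost one extra cycle          */
        printf("cycles per saxpy: %d\n", instructions*base + mem_ops);
        return 0;               /* prints 19 */
    }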
To expand on this idea, recall the discussion in Chapter One where the issue of computation speed was raised. One saxpy computation on a vector of length one thousand will take:

    T = 19,000 * C seconds

where C is the length of a machine cycle. Assuming a fairly typical value for C of 100 nanoseconds, we have

    T = 19,000 * 100 * 10^-9 = 1.9 * 10^-3 seconds per saxpy

This certainly is not a great deal of time, but in Chapter One it was shown that a typical Grand Challenge scale problem may require 10 million saxpys of this size or larger. The time to compute this number of saxpys is:

    19,000 seconds, or 5 hours and 17 minutes!

If we consider a problem that requires one billion saxpys of this size (not unheard of as the demand for more accurate weather forecasting increases), the time required is:

    527 hours and 47 minutes, or almost 22 days

Now we are talking about so much time that the result of the computation is useless for any practical application. If we are going to attack problems that require computations of this size, then speedup is a serious issue. We must develop a model of computation that is distinctly different from the Von Neumann or sequential model.

4.3 A First Attempt at SAXPY Speedup - Pipelined ALUs

The idea of a pipeline and a pipelined process was introduced in Chapters One and Two; in fact, in Chapter One a brief mention of a pipelined ALU was made. In this section the idea will be covered in more detail. We begin with a description of the ALU.

Machine calculations are performed in a portion of the computer known as the ALU, the Arithmetic/Logic Unit. Data transfers within this unit are very fast and will be ignored in our analysis. The idea of pipelining is to separate the decode, fetch, and execute cycles of the assembly instructions, as well as the memory transfers, into separate ALU units. This architecture is diagrammed in Figure 4-3.

[Figure 4-3: A diagram of a pipelined ALU with six stages: fetch instruction, decode instruction, calculate address, fetch data, execute instruction, store result]

In Figure 4-3, we separated the fetch and store operations into two steps each so that the idea of what is happening is more clear. In order to process a single assembly instruction, this processor model will take six machine cycles. This is actually slower than our previous model, but an advantage is gained by the fact that the ALU processing units operate independently. Thus, when unit six is working on the store portion of instruction one, unit five is working on instruction two, etc. (Recall the second production design for the candy factory.) In short, this design allows six instructions to be in various stages of execution at the same time. The first completed instruction requires six machine cycles; after that, one instruction is completed with each machine cycle. In our example C code, the process will take a total of 6 + 999 = 1005 machine cycles. This compares to the 19,000 cycles required by a straightforward, non-pipelined saxpy computation.

In Figure 4-4 each complete C-code instruction is shown as a horizontal bar representing six machine cycles. A new instruction is started after each machine cycle. At the end of six machine cycles, one C-code instruction is completed with each machine cycle. This interleaving of machine instructions causes our machine to operate efficiently and, in the final analysis, more quickly than the machine of the previous model.
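The arithmetic behind this comparison generalizes to any pipeline depth. The following C fragment is a minimal sketch of our own (not taken from any particular machine) that computes the cycle counts used above; the stage count and iteration count are assumptions chosen to match the example.

    #include <stdio.h>

    /* Idealized pipeline timing: the first instruction needs one cycle
       per stage; every later instruction completes one cycle after its
       predecessor. */
    long pipelined_cycles(long stages, long instructions) {
        return stages + (instructions - 1);
    }

    int main(void) {
        long n = 1000;                        /* loop iterations of the saxpy code */
        long serial = 19 * n;                 /* 19 cycles per iteration, unpipelined */
        long piped = pipelined_cycles(6, n);  /* six-stage pipeline */
        printf("serial: %ld, pipelined: %ld, speedup: %.1f\n",
               serial, piped, (double)serial / piped);
        return 0;                             /* 19000 vs 1005, speedup 18.9 */
    }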
This gain, of course, comes at the expense of separating the execution of instructions and synchronizing the processing of each stage.

[Figure 4-4: Executing the Saxpy C-Code on a Pipelined ALU - each instruction is a six-cycle horizontal bar; after the first instruction finishes, one instruction completes with every machine cycle]

4.4 Vector Processing - One Version of SIMD

There is no general agreement that Vector Processing is really SIMD processing. While we classify both as examples of SIMD parallelism, we distinguish between the two in our presentation. This section deals with Vector Processing, i.e. SIMD processing that does not involve communication between the processing elements but nonetheless does its processing with the various processors synchronized to be doing the same operation at the same time. This is appropriate for vector operations, where the order in which the operations are done on the vectors has no effect on the result of the computation. In Section 4.5 we will discuss what we call the Pure SIMD model of parallelism, which allows us to attack a more general class of problems than the vector operations discussed here.

4.4.1 Pure Vector Processors

While a pipelined ALU has some of the elements of parallelism, a machine built on this principle is using an SISD architecture, even though more than one data element is in the ALU at the same time: only one instruction is completed at the conclusion of each machine cycle. Because of the efficiency of the ALU, we can think of this architecture, as we did in Chapter 2, as an example of an MISD architecture. True parallelism, however, involves the processing of several instructions at the same time. The difference is that in SISD processing, whether pipelined or not, the data is processed as a stream, i.e. at most one data element completes processing at the end of each machine cycle. In a parallel architecture, the data is processed as a wave: several data elements complete processing at the end of the same machine cycle.

Our first view of this type of model will be that of a "vector" processor. In this architecture the ALUs are vector ALUs: they store and operate on vectors of data, as opposed to single pieces of data. We will refer to vector registers and scalar registers in discussing the design of this architecture. A scalar register is one which stores a single floating point number. A vector register is really a collection of registers, or one large register, that stores an entire vector of floating point numbers.

In addition to performing vector operations (vector add, subtract), the vector computer must be able to perform scalar operations on vectors and reduce vectors to scalars. Thus, the vector ALU must have hardware that can perform operations on vector components and also do operations on all the components of a vector, such as summing the components, taking the minimum of the components, etc. Finally, the ALU must allow for 'mixed' operations between scalars and vectors. The speed with which these operations can be performed in the ALU affects the overall efficiency of the processor.

For the saxpy problem, aX + Y, we illustrate in Figure 4-5 a model which replicates the scalar, a, in a vector so that it is available during the execution of the saxpy command. In this figure, we also display the vector registers as collections of floating point registers.

[Figure 4-5: Executing a Saxpy Operation in a Vector Processor - the scalar a is replicated into one vector register, multiplied componentwise with the register holding X, and added componentwise to the register holding Y]
A coding of this process that is representative of the pure vector machine handling of the saxpy process is the FORTRAN 90 statement

    Y(1:100) = a*X(1:100) + Y(1:100)

Even though scalars may be processed as vectors, as in Figure 4-5, they are written in FORTRAN 90 as scalar, not vector, elements. In the FORTRAN 90 language, it is possible to reference whole vectors or portions of vectors by including the starting index and ending index, separated by a colon, in parentheses following the name of the vector. Thus, if we wish to reference the elements of Y with indices 20 through 59, we write Y(20:59).

The use of a subset of the indices of a vector is handled by a machine process. If we are dealing with a vector ALU that has, say, 100 floating point registers reserved for each vector, the entire vector Y may be loaded into the ALU; however, a mask (a bit pattern representing a boolean vector) is also created to inform the store instruction to ignore indices 1 through 19 and 60 through 100. If we have FORTRAN 90 code that reads

    Z(20:59) = Y(20:59)

the mask will have ones in the indices numbered 20 through 59 and zeros in all other indices. If m_i is the value of the ith bit of the mask, the store instruction corresponds to the boolean operation

    z_i = (z_i AND NOT m_i) OR (y_i AND m_i)

A different problem is created if the range of the indices exceeds the number of individual floating point registers that make up each vector register. In this case an operation may be done in more than one machine cycle, or more than one register may be used to hold the vector. For example, if Y has 150 elements and the vector registers are equipped to handle only 100 elements, Y(1:100) may be loaded into one register and Y(101:150) may be reserved for the next instruction cycle or placed in another register, possibly an extension register. This determination is made during the compilation of the instruction and is based on the particular architecture of the machine. Of course, provision is made to handle the indices whose values exceed 100 by using a modular reduction of the index at the time they are loaded into the register.

As another example of FORTRAN 90 code, consider adding two vectors, X and Y, that have 75 elements each and storing the result in a vector Z:

    Z(1:75) = X(1:75) + Y(1:75)

As a further example, suppose that we want the first 25 and the last 25 indices of Z to be twice the value of the corresponding indices of X, and the middle 25 to be the sum of the corresponding indices of X and Y. This can be done with the following three lines of code:

    Z(1:25) = 2*X(1:25)
    Z(51:75) = 2*X(51:75)
    Z(26:50) = X(26:50) + Y(26:50)

A trace of the machine processing of these three lines would be as follows. A vector register is filled with the scalar 2 replicated in each of its floating point registers. The vector X is placed in another register, together with a mask that blocks any store instructions for indices 26 through 100. The multiplication is performed and the result is stored in vector Z using the mask. A similar process is repeated, except that the mask now allows consideration of indices 51 through 75 only. Finally, vectors X and Y are loaded into registers together with a mask; they are added and stored in Z as controlled by the mask. Obviously, there is some setup time required for the vector operations.
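To make the masked store concrete, here is a minimal C sketch of our own (not the microcode of any real vector machine) of the boolean operation z_i = (z_i AND NOT m_i) OR (y_i AND m_i) applied across a register; the register length and the variable names are assumptions.

    #include <stdio.h>
    #define REG_LEN 100

    /* Masked store: where mask[i] is 1, z[i] receives y[i];
       where mask[i] is 0, z[i] keeps its old value. */
    void masked_store(float z[], const float y[], const unsigned char mask[]) {
        for (int i = 0; i < REG_LEN; i++)
            z[i] = mask[i] ? y[i] : z[i];
    }

    int main(void) {
        float y[REG_LEN], z[REG_LEN] = {0};
        unsigned char mask[REG_LEN] = {0};
        for (int i = 0; i < REG_LEN; i++) y[i] = (float)i;
        for (int i = 19; i < 59; i++) mask[i] = 1;   /* 1-based indices 20..59 */
        masked_store(z, y, mask);
        printf("z[19]=%g z[59]=%g\n", z[19], z[59]); /* prints 19 and 0 */
        return 0;
    }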
After setup, this model will execute the entire saxpy operation, aX + Y, for a vector of size 100 in the same 19 machine cycles that were required by the simple SISD model to do a single floating point multiply and add. If S is the number of machine cycles needed to create the masks and to use them in storing the result, then the time to execute the entire instruction on a vector of size 100 is 19 + S machine cycles. If we are processing a vector of size 1000 on our example machine, the time could be as long as 190 + 10S machine cycles, depending on the number of vector registers in the ALU and the number of setup instructions that must be repeated with each segment of the vectors.

4.4.2 Pipelined Vector Processors

There are two immediate limitations that should come to mind when we consider a pure vector processor. Both were touched upon when we considered vectors of size 1000 on our example machine, whose vector registers were made up of 100 floating point registers, and both involve the size of a vector that can be held in a vector register.

First of all, how is it possible to anticipate the size of the vectors that will be submitted for the vector operations? Unlike the candy factory analogy, it is not possible to set a limit on the size of the "vector" that is to be processed or to know its size ahead of time. Some applications may deal with hundreds of data points, while others may consider thousands or millions. If one is building a general purpose computer, it is impossible to envision all of its uses. One solution to the variable and unlimited size problem is to have all vector operations done on operands that reside in memory. Unfortunately, while this allows for variable length vectors of almost arbitrary size, it slows down the computation considerably: memory access requires a great deal of time compared to the time required for operations within the ALU registers.

The other factor is economic. If one is considering high performance computing needs, then high speed, high accuracy ALUs are necessary to insure accurate results. To build a machine that includes a significant number of such processing units within each of the vector registers is very expensive. Generally speaking, machines that boast a large number of individual processing units use processors that are lower in cost and less efficient. Consequently, these machines are less reliable for numerical computing that requires a high degree of accuracy. On the other hand, this economic restriction is becoming less of a concern as the price of quality numerical processing chips decreases.

The compromise imposed by these limitations is the obvious one: use pipelined vector ALUs. Figure 4-6 illustrates this idea. Extra vector registers are supplied to store the results of previous instructions, saving time by reducing the number of fetch operations from memory. The machine implementation of pipelining a vector ALU is not at all as straightforward as our diagram might suggest. On the other hand, while Figure 4-6 is devoid of detail and idealizes the operation, the fact is that this arrangement gives a sensible solution to the speed/size compromise. In future discussions, when we refer to a vector processor, we may have a pure vector processor in mind, but in effect we are really referring to a pipelined vector processor.

[Figure 4-6: A Conceptual Diagram of a Pipelined Vector ALU - a bank of vector registers feeding a pipeline whose stages are: 1. vector register read; 2. fetch instruction; 3. decode & execute; 4. vector register write]
4.4.3 An Example - A Matrix/Vector Product

The successful application of parallel computing to a problem frequently involves a rethinking of the problem itself and of the algorithms used to solve it. This rethinking can be illustrated in the following application. Suppose that we have an m by n matrix, A, and an n-dimensional vector, v:

    A = | a_11  a_12  ...  a_1n |        v = | v_1 |
        | a_21  a_22  ...  a_2n |            | v_2 |
        |  .     .          .   |            |  .  |
        |  .     .          .   |            |  .  |
        | a_m1  a_m2  ...  a_mn |            | v_n |

In most cases, we think of the matrix/vector product as an m-dimensional vector whose elements are the scalar products of the rows of A, regarded as individual n-dimensional vectors, with the n-dimensional vector v. This conceptual formulation of the product is shown below.

    Av = | a_11 v_1 + ... + a_1n v_n |
         |             .             |
         |             .             |
         | a_m1 v_1 + ... + a_mn v_n |

Written in C code, the product P (= Av) is generated by nested loops:

    for (i = 0; i < m; i++) {        /* compute P[i], the scalar product */
        P[i] = 0;                    /* of row i of A with the vector v  */
        for (j = 0; j < n; j++)
            P[i] += A[i][j] * v[j];
    }

It is apparent that mn multiplications and m(n-1) additions are required to generate this product using the traditional SISD architecture. The question is how we can efficiently implement this algorithm on a vector processing machine. Part of the answer involves the way in which matrices are stored. If we use a vector processing oriented language such as FORTRAN 90 and write the code as

    do i = 1, m
        P(i) = DOT_PRODUCT(A(i,1:n), v(1:n))
    end do

the coding appears to be quite efficient, but when it is run, it does not perform as well as expected. The reason for the lack of efficiency is that FORTRAN 90 stores matrices in memory in column order.

To see why the above coding of the matrix/vector product usually operates inefficiently, consider how data is loaded into the ALU. Computer designers make the assumption that if a program is reading an element of data, it very likely will need to read other data elements that are stored nearby in memory. Thus, memory is organized into clusters of data called pages. Different computer manufacturers use data clusters of varying lengths; let's choose a fairly typical page length of 128 bytes. A page of this size is capable of holding 32 numbers of four bytes each.

Now suppose that we are executing the above FORTRAN 90 code with a matrix, A, of size 100 by 100. Since the matrix is stored in column order, A(1,2) will be stored at least three pages away from A(1,1) in the computer's memory; an entire column of one hundred four-byte numbers separates them. The same is true of any two consecutive elements of a row. Thus, each time through the loop, the page containing the next row element must be located and loaded into memory. This event is called a page fault. One of the goals of efficient coding is to minimize the number of page faults during the execution of a program. Obviously, since our example matrix A contains 10,000 four byte numbers, not all page faults can be eliminated. But because consecutive row elements of the matrix may be several memory pages away from each other, the time for each fetch cycle is at a maximum, and the total execution time for the algorithm, as coded above, is relatively slow.

So how do we handle this new issue? The answer involves changing the algorithm for evaluating the product. Instead of looking at the matrix/vector product as the scalar product of the rows of A with the vector v, look at the product as a linear combination of the columns of A, that is, as an extended saxpy operation involving the columns of A, as shown in Figure 4-7.

    Av = v_1 | a_11 |  +  v_2 | a_12 |  +  . . .  +  v_n | a_1n |
             | a_21 |         | a_22 |                   | a_2n |
             |  .   |         |  .   |                   |  .   |
             | a_m1 |         | a_m2 |                   | a_mn |

Figure 4-7: Computing the Matrix/Vector Product in Several Saxpy Operations

A moment's reflection will show that the result of this operation is the same as the previous, multiple scalar product operation. The difference is that the computation can now be easily adapted to an efficient FORTRAN 90 program running on a vector processor, as is shown in Figure 4-8.

    P(1:m) = 0
    do i = 1, n
        P(1:m) = v(i)*A(1:m,i) + P(1:m)
    end do

Figure 4-8: Doing the Matrix/Vector Product in Column Order

We assume that in this calculation each column of A has been stored in a separate vector register and that the result of each saxpy operation is kept in a register. The net result is that the product computation requires n vector steps, as opposed to the mn steps required on an SISD architecture. The timing of the individual steps is minimized because the coding forces the computer to access memory in an efficient way.
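The same column-order idea can be expressed in C. The sketch below is our own illustration, with A stored column by column in a one-dimensional array to mimic FORTRAN's column-major layout; the array names and sizes are assumptions.

    #define M 100
    #define N 100

    /* a[] holds A column-major: element A(i,j) is a[j*M + i].
       Each pass of the outer loop is one saxpy on a whole column,
       so memory is walked sequentially, page by page. */
    void matvec_colmajor(const float a[], const float v[], float p[]) {
        for (int i = 0; i < M; i++)
            p[i] = 0.0f;
        for (int j = 0; j < N; j++)           /* one column at a time    */
            for (int i = 0; i < M; i++)
                p[i] += v[j] * a[j*M + i];    /* saxpy: p = v_j*col + p  */
    }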
4.5 Beyond Vectors - Parallel SIMD

The vector processors described above are extremely efficient for many applied problems that involve vector manipulations. However, there are two important characteristics that are not implemented by the above design: the ability to work on only a subset of the array elements without additional manipulation, and the ability to communicate with neighboring elements. True SIMD computing requires both of these abilities.

The example architecture we will use as a paradigm is an array processor modeled after the earliest example of such a machine, the ILLIAC IV computer, designed at the University of Illinois and operational in the mid-1970s. While this is quite old in terms of the lifetime of computing machines, it is still an excellent model of the entire class of SIMD machines. Within this model each processing unit has its own processor and memory. There is a central control unit that broadcasts instructions to each processor. The processors are then connected to other processors in some fashion. In the ILLIAC IV each processor was connected to its nearest neighbors, as shown in Figure 4-9. This is an example of the grid topology that is discussed in Section 2.xx.

[Figure 4-9: A Conceptual Diagram of a Pure SIMD Architecture - a control unit broadcasting to a grid of processors, each with its own memory and with connections to its nearest neighbors]

Other possible connections for the processors, in addition to the grid illustrated in Figure 4-9, are cube connected, as in the Connection Machine, and fully connected, as in the GF11. We will discuss particular SIMD architectures in a later section of this chapter.

With this design it is clear how the vector machine can be extended. If a particular subset of processors is to be shut off, as is the case when using a conditional statement, then the Control Unit merely broadcasts a command that sets a flag bit to zero in all processors contained within the subset and to one in those processors that are to remain active. This process is similar to the masking process that we discussed in connection with the vector processors. If a processor needs access to a result generated by another processor, the connecting links make this possible. In the grid layout shown above, it is most convenient to request the information from an immediate neighbor. Information from other processors can be obtained, but it is necessary to have a routing program for virtual connections.
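The flag-bit mechanism can be pictured as a loop inside the Control Unit's broadcast cycle. The following C sketch is a simplified simulation of our own (a real control unit does this in hardware, one broadcast per step); the processor count and structure layout are assumptions.

    #define P 64                    /* number of processing elements */

    struct pe {
        int active;                 /* the flag bit set by broadcasts */
        float local;                /* this processor's data element  */
    };

    /* One broadcast step: every active processor applies the same
       operation to its own local data; inactive processors idle. */
    void broadcast_step(struct pe proc[], float (*op)(float)) {
        for (int i = 0; i < P; i++)
            if (proc[i].active)
                proc[i].local = op(proc[i].local);
    }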
4.5.1 Using Subsets of Processors - A Prime Number Sieve

The classic problem of locating all primes less than or equal to a particular number, n, is solved by an algorithm called the Sieve of Eratosthenes. The basis of this algorithm is one of those simple ideas that reflect the insight of genius. It is simply the observation that if a number is not prime, then it is the product of positive integers each of which is smaller than it. Thus, the algorithm begins with 2 and sets to zero all positions in the list of consecutive integers from 3 to n that correspond to multiples of 2. Next, the starting point is moved to the next nonzero element in the list, say p, and all the elements between p+1 and n that correspond to multiples of p are set to zero. The process continues in this way until it arrives at the nth position in the list. If this position is zero, then n is a composite number; otherwise, n is prime. In addition to making this determination, the algorithm has generated a list whose only nonzero elements are the prime numbers <= n.

    #include <stdio.h>
    #define n 1000              /* We will find the primes < 1000 */

    int i, j, k, p[n];

    int main(void) {
        /* Initialize the array */
        for (i = 0; i < n; i++)
            p[i] = i;
        /* Pick out the primes in the list */
        for (i = 2; i < n; i++)
            /* find the next nonzero element */
            if (p[i]) {
                k = p[i];
                /* Set the multiples of k to zero */
                for (j = i+1; j < n; j++)
                    if (!(p[j] % k))
                        p[j] = 0;
            }
        return 0;
    }

Figure 4-10: C Code for the Sieve of Eratosthenes

The sequential form of this basic algorithm is coded in C and displayed in Figure 4-10. This algorithm operates in O(n^2) time, plus the time that it takes to print the nonzero elements of the array. If we consider a list of the integers from 2 to 15, we can trace the operation of the algorithm of Figure 4-10; this is done in Figure 4-11.

Notice that there are two conditional statements in the code of the algorithm. These statements can be looked at as instructions that will turn off some of the processors. In particular, any processor that does not meet the given condition will not execute the statements contained within the scope of the conditional statement. The code in Figure 4-12 assumes that the instruction set of the machine contains a command that allows a processor to retrieve its own ID number. We will also assume that these ID numbers start at 0 and increase by 1 as we move from left to right and top to bottom. In order to have the processors in our machine correspond to the elements of the array that was used in the sequential formulation of this algorithm, we will test and, if necessary, change to 0 a variable j that is initially set to ID + 2. (An alternative strategy is to ignore the first two processors, 0 and 1, and initialize j to be ID itself.)
    Initial array p:        2  3  4  5  6  7  8  9 10 11 12 13 14 15
    First nonzero element: 2
    After executing the inner loop of the algorithm (12 steps)
    Status of array p:      2  3  0  5  0  7  0  9  0 11  0 13  0 15
    Next nonzero element: 3 (1 pass of the outer loop)
    After executing the inner loop of the algorithm (11 steps)
    Status of array p:      2  3  0  5  0  7  0  0  0 11  0 13  0  0
    Next nonzero element: 5 (2 passes of the outer loop)
    After executing the inner loop of the algorithm (9 steps)
    Status of array p:      2  3  0  5  0  7  0  0  0 11  0 13  0  0
    Next nonzero element: 7 (2 passes of the outer loop)
    After executing the inner loop of the algorithm (7 steps)
    Status of array p:      2  3  0  5  0  7  0  0  0 11  0 13  0  0
    Next nonzero element: 11 (4 passes of the outer loop)
    After executing the inner loop of the algorithm (3 steps)
    Status of array p:      2  3  0  5  0  7  0  0  0 11  0 13  0  0
    Next nonzero element: 13 (2 passes of the outer loop)
    After executing the inner loop of the algorithm (2 steps)
    Status of array p:      2  3  0  5  0  7  0  0  0 11  0 13  0  0
    Next nonzero element: ** (2 passes of the outer loop)

Figure 4-11: A Trace of the Sieve of Eratosthenes for a Small Value of n

When reading the code in Figure 4-12, keep in mind that the code is broadcast one step at a time by the Control Unit to each processor. The active processors then execute that step before receiving the next instruction from the Control Unit.

    int j, k;

    /* Set the value to be tested */
    j = ID + 2;
    for (k = 2; k < n; k++)
        /* consider only active processors */
        if (j != 0 && j > k) {
            /* Should the processor remain active? */
            if ((j % k) == 0)
                j = 0;
        }

Figure 4-12: The Code That Is Executed on Each Processor of a SIMD Machine

Of course the Control Unit is not broadcasting C code; it is broadcasting the machine language instructions that will execute the program represented by the code. However, we can understand what is happening with the sieve algorithm from the C version of the program. Let's examine what is happening in two processors of the machine during the execution of the code in Figure 4-12. Assume that n is 15 and that we are looking at the activity in processors 11 and 13, whose j values start at 13 and 15.

                Processor #11    Processor #13
          j:         13               15
    k=2   j:         13               15
    k=3   j:         13                0   (processor inactive)
    k=4   j:         13
    k=5   j:         13
       Nothing changes in these processors for k = 6 through 12
    k=12  j:         13
    k=13  j:         13   (processor inactive)
    k=14

Figure 4-13: The Status of Two Processors During Execution of the Sieve Algorithm

Note that since there is only one loop, this program operates in O(n) time. Each processor is responsible for only one number, and at the end of the algorithm, when the processor is reactivated, it will return either a 0 or its ID plus 2. In effect, the array of processors has replaced the innermost loop of the original C code.

A final note on this program: it is not as efficient as it might be. Many of the executions of the mod operation in Figure 4-12 are superfluous. For example, any number divisible by 4 is also divisible by 2, and the corresponding processor was turned off when divisibility by 2 was checked. If we make the further assumption that the Control Unit has hardware that can quickly communicate with the processors that are "active" (not shut off by the conditional) and can quickly do operations such as sum and finding minima and maxima, then there is no need for the full loop: we use the value of j contained in the active processor with the minimum ID number to check the remaining processors.
This value of j must be a prime number. (Why?) Using this strategy, the number of steps necessary to complete the algorithm can be reduced to the number of primes <= n, which for large n is only slightly larger than n/ln(n).

Assume that the instruction set contains a command similar to the ReduceSum command shown in Figure 3.xx. Assume that a call to this command is "ReduceMin(j)" and that it will quickly identify the minimum value of the variable j over all of the active processors.

Referring back to Figure 4-9, we see that there are two distinctly different types of computation going on in a machine built according to the SIMD architecture. The first type is that which is done in the Control Unit. This unit generates values and broadcasts them to the local processors. The Control Unit also gathers results from the local processors and handles I/O communications with external devices. The local processors, on the other hand, do the operations on the data that they receive from the Control Unit. In order to distinguish between variables that reside in the Control Unit and those that reside on the local processors, we use two new type declarations, Control and Local. If a variable is designated as being of type Control, then it is defined in the Control Unit and may be broadcast to all of the processors. A variable designated as being of type Local may have a value that varies from local processor to local processor. Its value stays within the processor until it is collected, along with the other local values of the variable, via a special command from the Control Unit, such as a ReduceSum or ReduceMin call.

Figure 4-14 is a coding that takes advantage of the conventions discussed in the preceding paragraph for executing the prime sieve algorithm. (Note that each pass sieves with the current value of a before ReduceMin selects the next one, so that the trace in Figure 4-15 is reproduced.)

    Control: int a;   /* a is a variable whose values are generated in
                         the Control Unit. Its value may be shared with
                         the local processors, but they cannot change it. */
    Local: int j,k;   /* j and k reside on the local processors and can
                         be manipulated by them. Each processor may have
                         a different value for j and k. This is the
                         usual case. */

    j = ID + 2;            /* Initialize each local j */
    a = 2;                 /* Initialize a in the Control Unit */
    while (a > 0) {        /* 0 is the default min value for the empty set */
        k = Load(a);       /* The value of a is broadcast to all
                              active processors */
        if (j != 0) {      /* Shut off those processors with zero j */
            if (j > k && (j % k) == 0)
                j = 0;     /* j is a multiple of k: sieve it out */
        }
        if (j > k)         /* Only values of j that exceed the previous
                              value of a are considered */
            a = ReduceMin(j);
    }
    /* The value of j is either 0 or a prime */

Figure 4-14: A SIMD Coding That Uses the ReduceMin Operator

In Figure 4-15, we show the status of the Control and Local variables during the execution of this algorithm on 14 processors.

    Control Unit     value of j in processors 0 through 13 after each pass
                      #0 #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13
    (initially)        2  3  4  5  6  7  8  9 10 11  12  13  14  15
    a = 2              2  3  0  5  0  7  0  9  0 11   0  13   0  15
    a = 3              2  3  0  5  0  7  0  0  0 11   0  13   0   0
    a = 5              2  3  0  5  0  7  0  0  0 11   0  13   0   0
    a = 7              2  3  0  5  0  7  0  0  0 11   0  13   0   0
    a = 11             2  3  0  5  0  7  0  0  0 11   0  13   0   0
    a = 13             2  3  0  5  0  7  0  0  0 11   0  13   0   0
    a = 0              (the loop terminates; each j is 0 or a prime)

Figure 4-15: A Trace of the Algorithm in Figure 4-14

While the code in Figure 4-14 eliminates the superfluous mod (%) operations, this savings must be weighed against the additional cost of the ReduceMin and Load operations to determine exactly how much time improvement is obtained by using these operators.
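The Control/Local division can be checked on an ordinary sequential machine before worrying about real SIMD hardware. The C sketch below is our own simulation: the array j[] plays the role of the local variables, and the reduce_min() loop stands in for the hardware reduction; all names and the processor count are assumptions.

    #include <stdio.h>
    #define P 14                     /* simulated processing elements */

    static int j[P];

    /* Stand-in for the hardware ReduceMin over active processors:
       the minimum j among processors with j > k, or 0 if none. */
    static int reduce_min(int k) {
        int m = 0;
        for (int id = 0; id < P; id++)
            if (j[id] > k && (m == 0 || j[id] < m))
                m = j[id];
        return m;
    }

    int main(void) {
        for (int id = 0; id < P; id++)
            j[id] = id + 2;              /* Local: j = ID + 2 */
        int a = 2;                       /* Control: the first prime */
        while (a > 0) {
            for (int id = 0; id < P; id++)   /* broadcast: sieve with a */
                if (j[id] > a && j[id] % a == 0)
                    j[id] = 0;
            a = reduce_min(a);           /* next surviving value, 0 if none */
        }
        for (int id = 0; id < P; id++)   /* each j is now 0 or a prime */
            if (j[id]) printf("%d ", j[id]);
        printf("\n");                    /* prints: 2 3 5 7 11 13 */
        return 0;
    }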
4.5.2 The Advantage of Being Connected - The Product of Two Matrices

In the previous section the algorithm depended only on the fact that each processing unit was connected to the Control Unit. Each individual processor's actions depended only on the information that it had or that was passed to it from the Control Unit; it did not depend directly on the information contained in any other processor. In the algorithm that we develop in this section, it is absolutely necessary that processors be able to communicate with each other.

In Figure 4-14, we assumed that we could specify variables as being stored in the memory of the local processing elements or in that of the Control Unit. Now we add another command that will also be handled in the preprocessing stage of the algorithm. This new command will be called "Layout" and will specify the conceptual layout, or network configuration, of the processing elements. The preprocessor will determine the necessary mappings and routings of the conceptual design onto the hardware design. This new command will have the general form

    Layout ArrayName [range1, range2, . . . , rangen]

The name, ArrayName, may be any legal variable name. The values for each rangei are the starting and ending values for the ith subscript. Conceptually, we have a multiply subscripted array of processors. For example, if n = 2 and we have an algorithm that will be dealing with 8 by 12 matrices, we may specify a layout as follows:

    Layout MatHolder [1:8, 1:12];

This layout will facilitate the distribution of matrices to the processors. For example, if A is an 8 by 12 matrix, then the command

    el = Distribute(A);

places in the variable el of the processor associated with MatHolder[i, j] the value of the matrix element A[i, j].

We will also make some conventions concerning the connections of the processors in this layout statement. The first subscript will be associated with the directions north and south. Thus, in processor MatHolder[i, j], el.north refers to the value of the element el stored in MatHolder[i-1, j], and el.south refers to the value of el stored in MatHolder[i+1, j]. If the value of i-1 is less than the starting value of the first index, or if i+1 is greater than its ending value, then the request for the value of el.north or el.south is ignored. Likewise, the second subscript is associated with the directions east and west, and the third with up and down. If there are more than three dimensions, we will use up4 and down4, up5 and down5, etc.

This architectural design will be used to speed up the computation of the product of two matrices. Assume that we have two n by n matrices, A and B. The normal algorithm for the matrix product of A and B requires n^3 steps; the code is shown in Figure 4-16.

    int i, j, k;
    float A[n][n], B[n][n], C[n][n];

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {       /* Calculate C[i][j] */
            C[i][j] = 0;                /* Initialize */
            for (k = 0; k < n; k++)     /* scalar product of row i of A */
                C[i][j] += A[i][k]*B[k][j];   /* and column j of B      */
        }

Figure 4-16: The Sequential Algorithm for the Product of Two Matrices

We will reduce the number of steps required for this process to n steps, but we will make the classical computational trade-off: time for space. In our case, the trade-off is better stated as time for processing elements. We will assume that we have n^3 processors available on our system and that they are laid out as a three dimensional array. Our layout statement will read:

    Layout: Array[1:n,1:n,1:n]

Given this layout, we are assuming that each processor is directly connected to six other processors, two in each of the array dimensions.
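On real hardware the preprocessor must map conceptual coordinates like Array[i,j,k] onto flat processor IDs and compute neighbors. Here is a minimal sketch of one such mapping in C; it is our own illustration, and the row-major ID scheme is an assumption, not a description of any particular machine.

    /* Map the conceptual coordinate (i,j,k), each in 1..n, onto a
       flat processor ID. */
    int proc_id(int i, int j, int k, int n) {
        return ((i-1)*n + (j-1))*n + (k-1);
    }

    /* Return the ID of Array[i-1,j,k], or -1 when i-1 falls off the
       edge of the layout (such requests are simply ignored). */
    int north_of(int i, int j, int k, int n) {
        return (i > 1) ? proc_id(i-1, j, k, n) : -1;
    }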
If we are dealing with two 100 by 100 matrices, we will be requesting one million processors! Obviously, some processors must accommodate more than one element of the matrices, but the mapping of the matrix elements to the processors is handled during the preprocessing phase of the program.

The beauty of the matrix product algorithm we will present is in its conception. Think of the processors as being laid out within an n by n by n cube, so that at each of the n vertical levels there is an n by n table of processors. Distribute n copies of the matrix A and n copies of the matrix B through the cube so that level k of the cube holds column k of A, replicated in the east-west direction, and row k of B, replicated in the north-south direction. The processor at position (i, j) of level k, that is, Array[k, i, j], therefore holds a_ik and b_kj, exactly the two factors of one term of c_ij. Figure 4-17 illustrates this for 2 by 2 matrices.

    A:  a11  a12        B:  b11  b12
        a21  a22            b21  b22

    Level 1:  (a11, b11)  (a11, b12)      Level 2:  (a12, b21)  (a12, b22)
              (a21, b11)  (a21, b12)                (a22, b21)  (a22, b22)

Figure 4-17: Illustration of the Distribution of Matrices for the Product

Notice that each cell holds a pair of the form a_ik and b_kj, a legal term of a matrix product element from the above code. For example, the processor corresponding to Array[2,1,1] will contain the values of A[1,2] and B[2,1]. Recall that the cells each have their own processor. All that remains is to take the product of the two terms in each cell and then add the products down each vertical column of the cube. This addition requires n steps.

The following is the pseudo C-code for this process in our SIMD paradigm machine. We assume the following directions for moving data among the processors: north, south, east, west, up, and down. The compass directions refer to movement within the horizontal layers, while up and down refer to movement in the vertical direction. If a request is made for a value for which there is no connection, a zero is returned.

    Layout: Array[1:n,1:n,1:n]
    Control: float A[1:n,1:n], B[1:n,1:n];
             float C[1:n,1:n], D[1:n,1:n,1:n];
             int i,j,k;
    Local:   float p,p1,p2;
             int i1;

    for(i=1;i<=n;i++)           /* Place A: Array[k,i,j] holds */
      for(j=1;j<=n;j++)         /* A[i,k]; see Figure 4-17     */
        for(k=1;k<=n;k++)
          D[k,i,j]=A[i,k];
    p1 = Distribute(D);
    for(i=1;i<=n;i++)           /* Place B: Array[k,i,j] holds */
      for(j=1;j<=n;j++)         /* B[k,j]; see Figure 4-17     */
        for(k=1;k<=n;k++)
          D[k,i,j]=B[k,j];
    p2 = Distribute(D);
    p=p1*p2;                    /* Compute the product of individual elements */
    p2=p;                       /* Store the product in p2 */
    p=0;                        /* Reinitialize p */
    for(i1=1;i1<=n;i1++){       /* This loop has the effect of     */
      p=p2+p;                   /* accumulating the desired sum in */
      p2=p2.north;              /* Array[n,i,j]                    */
    }
    D=Store(p);                 /* Collect the values in the Control Unit */
    for(i=1;i<=n;i++)
      for(j=1;j<=n;j++)         /* Store the product in */
        C[i,j]=D[n,i,j];        /* a matrix C           */

Figure 4-18: An Efficient Parallel Matrix Product

This code is not designed to run on any particular machine. Its commands are meant to show the action of the machine; it is not a language tutorial. Some explanation is required for the Distribute and Store commands. Each deals with a structure in the Control Unit that has the same shape as the processor layout. The Distribute command uses the associations made during the preprocessing of the Layout declaration to distribute the elements of the Control Unit data structure to their corresponding processors. We assume that each processor contains one element of the data structure. The Store command is the inverse of the Distribute command: it moves data from the processors back into the designated Control Unit data structure.
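Because the pseudo-code is not tied to a real machine, it is worth checking its logic with an ordinary sequential simulation. The C sketch below is our own; the 3D arrays stand in for the processor cube and the loops for the lockstep broadcasts, following the same three phases: replicate, multiply everywhere at once, then shift and add down each column.

    #define N 4

    /* C = A*B via the processor-cube algorithm of Figure 4-18,
       simulated with 0-based indices: cell [k][i][j] holds the term
       A[i][k]*B[k][j], and the terms are summed over k by repeated
       north shifts. */
    void cube_product(const float A[N][N], const float B[N][N], float C[N][N]) {
        static float p[N][N][N], p2[N][N][N];
        for (int k = 0; k < N; k++)           /* multiply, all cells "at once" */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    p2[k][i][j] = A[i][k] * B[k][j];
                    p[k][i][j] = 0.0f;
                }
        for (int step = 0; step < N; step++)  /* n shift-and-add steps */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    for (int k = N-1; k >= 0; k--) {   /* bottom first, so   */
                        p[k][i][j] += p2[k][i][j];     /* the shift is       */
                        p2[k][i][j] = (k > 0) ? p2[k-1][i][j] : 0.0f; /* simultaneous */
                    }
        for (int i = 0; i < N; i++)           /* the bottom level holds C */
            for (int j = 0; j < N; j++)
                C[i][j] = p[N-1][i][j];
    }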
The command "p2=p2.north" replaces the current value of p2 in each processor with the value of p2 in the processor to the north. As was mentioned above, if there is no processor to the north of a given processor, p2 is assigned a value of 0. After n steps, all of the elements in a given vertical column have made it to the bottom of that column, processor Array[n,i,j], and have been accumulated into the sum stored in the variable p. Thus, the value of p in processor Array[n,i,j] is the i,jth element of the product of A and B.

In this algorithm, all n^3 multiplications are done at one time with the command "p=p1*p2". In the loop, n^3 additions are done simultaneously with the command "p=p2+p". It takes n steps until the bottom level of the cube has the full sum required for each term. Thus, the multiplication portion of the algorithm operates in O(n) time. This does not include the setup and storage phases of the algorithm.

Let's do a more complete analysis of the algorithm. The real work of the algorithm is done in the multiplication and accumulation steps of Figure 4-18. We repeat these lines here in Figure 4-19, with line labels.

    (18) p=p1*p2;                /* Compute the product of individual elements */
    (19) p2=p;                   /* Store the product in p2 */
    (20) p=0;                    /* Reinitialize p */
    (21) for(i1=1;i1<=n;i1++){   /* This loop has the effect of     */
    (22)   p=p2+p;               /* accumulating the desired sum in */
    (23)   p2=p2.north;          /* Array[n,i,j]                    */
    (24) }

Figure 4-19: A Fragment of Figure 4-18

Let's assume that f is the time required to do a floating point operation (add or multiply), that c is the time required to do a processor to processor communication, and that s is the time required to set up the communications channel. We will also assume that assignments require 2 machine cycles, denoted by mc. Now we can compute the time for this matrix product computation:

    Step (18):           f
    Steps (19) & (20):   4mc
    Steps (21) - (23):   n*(f + c)
    TOTAL:               (n+1)*f + n*c + 4mc

On the other hand, if we examine the sequential algorithm in Figure 4-16, we see that n^2 assignments are made to initialize the product matrix and 2n^3 floating point operations are required to compute the matrix product. Thus, the time for this computation is:

    2n^2*mc + 2n^3*f

While it is true that s and c may be significantly larger than the time for an in-processor floating point operation or a machine cycle, it is obvious that for large n the savings of the algorithm of Figure 4-18 is very considerable. For example, if a processor to processor communication is as much as ten times slower than a floating point computation, then the speedup from using the algorithm of Figure 4-18 on a matrix of size 100 by 100, as compared with the algorithm of Figure 4-16, is still on the order of one thousand.
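Plugging numbers into these two formulas shows how large the gap is. A quick check in C, using our own arithmetic; the unit costs f = 1, c = 10, mc = 1 are assumptions chosen to match the ten-times-slower example above.

    #include <stdio.h>

    int main(void) {
        double n = 100.0, f = 1.0, c = 10.0, mc = 1.0;
        double t_par = (n + 1)*f + n*c + 4*mc;     /* cube algorithm     */
        double t_seq = 2*n*n*mc + 2*n*n*n*f;       /* sequential product */
        printf("parallel: %.0f, sequential: %.0f, speedup: %.0f\n",
               t_par, t_seq, t_seq / t_par);       /* speedup is ~1800   */
        return 0;
    }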
4.5.3 Working with Databases

One of the major applications of computing is the use of a database management system to store and retrieve information from computer files. For example, the registrar of your institution has access to a rather large database that keeps a listing of all of the students, the courses that they have taken, and their grades in those courses. There is most likely more information in this database, but you can appreciate the need for the database to be maintained in a timely and accurate fashion, and for information from this database to be easily accessible to some people and inaccessible to others.

The use of parallel processing for database applications is a subject of great debate in the computing community. While a parallel SIMD machine is very efficient for searching a database, a bottleneck develops when I/O operations must be performed or memory must be accessed. On the other hand, SIMD architectures contain a large number of processors that can each do the relatively simple operations required by a database search. Thus, several database tables can be searched simultaneously in a very rapid manner. Since the records in each table that involve the same database "key", i.e. the field of a record that uniquely identifies the record, can be found rapidly, operations such as a natural join can be performed very efficiently. For example, all records from different tables pertaining to an individual with a given social security number can be quickly located and linked together. Before we look at the possibilities for using parallel machines for database applications, we will consider a list search routine and then review some definitions from database systems theory.

4.5.3.1 Searching a List of Numbers

Suppose that we have a list of n numbers that may be keys, i.e. pointers to additional information that is stored in the system. If the list is sorted, then it can be searched in O(log n) time using a binary searching technique. Not only is O(log n) the time required for the binary search, it is the theoretical lower bound for any search on a Von Neumann computing machine. If the list is unsorted, the time to search the list on a Von Neumann machine is O(n), a much slower timing result when n is large. We will illustrate a result for SIMD machines that is, in theory, O(1). In practice, the result is O(n/p), where n is the number of elements in the list and p is the number of processors that make up the processing cluster (see Figure 4-9). The term n/p, however, may easily be 1, since the usual number of processors, p, for a SIMD machine is quite large. For example, the CM-2 machine has 16,384 processors available, each of which is capable of doing simple computations.

The search technique can best be understood by conducting the following simple experiment in your classroom. Assuming that you have fewer than 52 students in your class, distribute one card from a shuffled deck of cards to each student; you hold the remainder of the deck. Then ask for a particular card. For example, say, "Who has the ten of spades?" If one of the students has the ten of spades, she responds, "I do." At this point you know where the ten of spades is located. If no student has the ten of spades, no reply will be made, and you conclude that the ten of spades was not among the cards distributed to the students. In either case, your search was completed by asking one question. That is, it was done in O(1) time!

The actual process is almost exactly the same as the experiment that you just conducted. Instead of students in your class, the computer has separate processing elements. The deck of cards is replaced by the list of numbers, or "keys." The host computer (the one that controls and synchronizes the processing elements) distributes the list, one number to each processing element. The host sets a location variable to a negative number, on the assumption that the processor IDs are in the range 0 to p-1, where p is the number of processing elements in our machine. Now, the search begins.
The host broadcasts the number that is the object of the search to all processors. If a processor does not have the number, it shuts off. The processor that has the number returns its ID to the host, which stores that ID in the location variable. The host program then either reports the location of the number or the message "Not Found." A possible coding for this process is given in Figure 4-20.

main(){
    /* The preprocessing stage, done in the host computer */
    loc = -1;
    for(i=0;i<n;i++)            /* Done in each processing element: */
        if (i==ID) val=list[i]; /* PE i assigns the ith list element to its val */
    broadcast(key);             /* The host broadcasts the search key */
    /* In each processing element */
    if(key==val)
        loc=ID;                 /* Only the PE holding the key reports its ID */
    /* Now back in the host */
    if(loc != -1)
        printf("The key is in list element %d\n",loc);
    else
        printf("The key is not in the list\n");
}

Figure 4-20: Pseudo code for the O(1) search

4.5.3.2 Searching a Database

The term database is a shortened form of the more descriptive database management system. It refers to a collection of interrelated data together with programs that access that data to provide information to a user. Requests for particular pieces of information are called queries.

If we are considering a database for a large banking system, we would want to have information about the bank's customers, the accounts that they have with the bank, which branch of the bank hosts each account, where the branches are located, and other information about the bank and its customers. It does not make sense for all of this information to be stored in one large table, where a listing for Mary Brown would contain Brown's home address, Brown's account number, her balance in that account, the name of the branch of the bank where she does her banking, the address of the branch, and other related information. Such a scheme would be very costly, inefficient, and prone to errors. For example, if Bill Smith also had an account in the same branch as Mary Brown, all of the information in Bill Smith's record concerning the branch would duplicate that in Mary Brown's record. Worse yet, if Mary had a savings account, a checking account, and a loan at the same branch, all of the information about Mary and the branch would be stored in triplicate! Think of the updating problem if the branch were to move to a new address.

For that reason, data is stored in tables that can be related. In this way there is minimal duplication of data and updating is done in a much more efficient way. In the above example we could store the information in the following tables. Each table is given a name, and under each table is listed the attributes that it contains.

Customer:
    Name    Address    Social Security Number

Account:
    Type (Savings, Checking, Loan)    Account Number    Customer Name    Balance    Interest Rate    Host Branch

Branch:
    Branch Name    Branch Address

These tables are related by the use of a certain attribute that uniquely identifies each record in the table. For example, Social Security Number will identify a customer; Account Number, an account; and Branch Name, a branch. The relations are also stored as tables:

Customer/Account:
    Social Security Number    Account Number

Account/Branch:
    Account Number    Branch Name

When a new customer comes to the bank, a record is created in the Customer table using information supplied by the customer on the application blank. When that customer opens an account, a record is created in the Account table, the Customer/Account table, and the Account/Branch table.
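To make these layouts concrete, here is one possible set of record declarations for the five tables, sketched in C. The field names and sizes are assumptions made for illustration; an actual database system would choose its own storage format.

/* Illustrative record layouts for the bank's five tables.
   All field names and sizes are assumptions of this sketch. */
typedef struct {
    char name[40];
    char address[80];
    char ssn[12];            /* Social Security Number: the unique key */
} Customer;

typedef struct {
    char type[9];            /* "Savings", "Checking", or "Loan" */
    int  account_number;     /* the unique key */
    char customer_name[40];
    double balance;
    double interest_rate;
    char host_branch[40];
} Account;

typedef struct {
    char branch_name[40];    /* the unique key */
    char branch_address[80];
} Branch;

/* The relations are tables of key pairs linking the tables above. */
typedef struct { char ssn[12]; int account_number; } CustomerAccount;
typedef struct { int account_number; char branch_name[40]; } AccountBranch;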
As deposits or withdrawals are made from an account, the Account table is updated. No other information is accessed. The same applies to loans and loan payments.

Now let's ask a question, or query, of the database. For example, what are the total assets of Mary Brown in the Chambersburg branch of the bank? To answer this question, we need to know the balance of every account that Mary Brown has in the Chambersburg branch of the bank (she may have accounts in other branches), add up the totals in her savings and checking accounts, and subtract the total of any loans.

We assume that the host computer is large enough to store the five tables listed above. The ith line of each table is the information given while making the ith entry into that table; corresponding lines of different tables are not usually related. Now, assume that each processor in the bank's SIMD computer has enough memory to contain the following information, in this order: the name (nm) from the ith line of the Customer table, the account number (a1) and social security number (ss1) from the ith line of the Customer/Account table, the account number (a2) from the ith line of the Account table, and the branch name (bn) from the ith line of the Branch table. The variable names for these five pieces of information are given in parentheses after the attribute names.

Without coding, we describe the search that answers the query about Mary Brown's assets in the Chambersburg branch. We proceed much as we did in our key search of Section 4.5.3.1. The host broadcasts the name "Mary Brown" to all of the processors. The processor containing that name in its nm variable returns its ID. The corresponding line of the Customer table is examined for Mary Brown's social security number. This number is broadcast. The processors containing that value as ss1 return their IDs. Each of these IDs is processed in order: the appropriate line of the Customer/Account table is extracted and the account number is broadcast; the processor that has that number as its a2 variable returns its ID; the appropriate line of the Account table is extracted and the Host Branch attribute is checked to see if it matches "Chambersburg." If it does, then the account Type attribute is checked. If it is savings or checking, Mary Brown's assets are increased by the account balance. If it is a loan, her assets are decreased by that amount. When all of the account numbers found above have been processed, the value of Mary Brown's assets is returned and the query is answered.
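In the same spirit as Figure 4-20, the sketch below restates this search as pseudocode. The helper routines broadcast() and responding_id(), and the host-side table lookups, are assumptions of this sketch standing in for the host/processor interaction described above; nm, ss1, a2, and the other per-processor variables are those just introduced.

/* A pseudocode sketch of the assets query described above. */
assets = 0;
broadcast("Mary Brown");           /* each PE compares the name to its nm */
i = responding_id();               /* the PE whose nm matched returns its ID */
ssn = Customer[i].social_security; /* the host reads line i of the Customer table */
broadcast(ssn);                    /* each PE compares ssn to its ss1 */
for (each responding ID j) {       /* one ID per account that Mary holds */
    acct = CustomerAccount[j].account_number;
    broadcast(acct);               /* each PE compares acct to its a2 */
    k = responding_id();
    if (Account[k].host_branch == "Chambersburg") {
        if (Account[k].type is Savings or Checking)
            assets = assets + Account[k].balance;
        else                       /* a loan decreases her assets */
            assets = assets - Account[k].balance;
    }
}
return assets;                     /* the answer to the query */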
4.6 An Overview of SIMD Processing

SIMD processing is inherently synchronous. That is, all operations are carried out at the same time on all processors, and we do not begin the next operation until all processors have completed the current one. It is true that, in general, the class of problems that can be addressed by SIMD architectures is less general than the class that can be addressed by several independent processors operating asynchronously, that is, in an architecture that we have called MIMD processing. On the other hand, it is exactly the synchronization that makes SIMD an attractive computing alternative. At some point in any program there is a need for processors to share information. This cannot be done until all processors have completed their computational tasks up to that point, so in an asynchronous machine processors must sit idle after they have completed their own tasks until the slowest processor finishes. The synchronization of all computing steps eliminates this need to wait.

Another characteristic of SIMD programs is that there is no asynchronous contention for resources. Either no processor wants a resource or all processors want it. In the latter case the control processor broadcasts it to all processors. This eliminates hardware deadlocks, since a resource is strictly private or else always shared.

Of course, the major characteristic of SIMD processing is that there is one (common) instruction memory. In the MIMD paradigm, the entire program must be downloaded into each processing element. From the programmer's point of view, this fact makes the testing and debugging of programs much more difficult in the MIMD environment. If an error occurs in a SIMD program, many of the standard debugging techniques can be applied and the offending instruction located. When a program running on a MIMD architecture fails, it is extremely difficult to determine the offending instruction, since the processors are running independently and it is not clear at exactly what stage each processor is in the computation. A basic debugging technique in this case is to insert synchronization points to try to narrow down the region of the program that contains the error. In the SIMD paradigm, these points exist for every instruction.

In any large-scale numerical computation, there are portions that are inherently best suited for SIMD architectures, and for those portions the SIMD design is much more efficient. Parallel computing in the future may consist of a "hybrid machine" made up of both SIMD and MIMD machines. The front end compiler may well decide which parts of a program are best suited for which paradigm and dispatch those parts to the appropriate machine. Thus, there may be a way to avoid making a decision about which machine we plan to use for our entire program.

4.7 An Example Vector Architecture - The Cray Family of Machines

This family of vector machines includes the Cray-1, Cray-2, Cray X-MP, and Cray Y-MP/C90, which were designed and built from the mid-1970s through 1993. The basic design was the brainchild of Seymour Cray, one of the founders of Control Data Corporation, who later formed his own company, Cray Research, which produced this family of machines. The fundamental ideas for the series are found in the Cray-1, which will be the primary example of this section. The architecture is illustrated in Figure 4-21.

[Figure 4-21: A Conceptual Diagram of the Cray-1 Architecture. Main memory (64-bit words) feeds the instruction processor and eight vector registers of 64 elements each. Four register sets are shown: A (8 address registers, 24-bit), B (64 address-holding registers), S (8 scalar registers, 64-bit), and T (64 scalar-holding registers). The pipelined functional units include integer add, integer multiply, shift, Boolean, population count, floating add, floating multiply, and floating reciprocal.]

This architecture consists of one main memory feeding data to the instruction unit and the vector registers. The instruction unit performs all of the interpretation of instructions and holds the scalar registers. The ALU consists of 12 independent functional units that are bidirectionally connected to the registers as indicated in the diagram. Note that three units are devoted to vector operations on integer or Boolean data. Three of the units are shared by the vector and scalar registers for floating point data. An additional four units handle integer, Boolean, and bookkeeping operations in the scalar registers, and two units are used for address calculation by the address registers. All of the units are pipelined, with varying numbers of stages.
There are from 2 to 14 stages per unit, with the floating point reciprocal, i.e. computing 1.0/x where x ≠ 0.0, requiring the full 14 stages.

Memory consists of 64-bit words. In the Cray-1 it consisted of one million words; this was increased to 256 million words in the Cray-2. The Cray-1 memory is accessed by a one-way memory/data path that can be thought of as an 11-stage pipeline, 64 bits wide. This bandwidth is rather low and becomes a bottleneck in some calculations. One example of such a calculation is a memory-to-memory multiply. This operation requires two memory reads and a memory write. Such an operation, although legal, violates the basic design assumption that most arithmetic operations will be performed on data resident in the registers. Unfortunately, higher level language compilers do not usually make such assumptions. In the Cray X-MP, two extra ports were added to the main memory, thus allowing three concurrent memory transfers. The Cray Y-MP/C90 has four memory ports that can each deliver 128 bits per processor cycle, and the cycle time itself is considerably shorter than in earlier models.

There are no specific instructions in the Cray-1 for reduction operations on vectors (sum the components, find the minimum, etc.). These operations, such as the one that appears in a scalar product calculation, are of utmost importance in scientific calculations. For this reason, the design permits one of the input vector registers to be the same register that receives the result. Thus, any vector instruction can be converted into a reduction operation.

An important feature of the Cray series is that "chaining" of operations is allowed. For example, if we have the scalar a in register 1, X in register 2, and Y in register 3, we can compute the saxpy operation described in Section 4.3, aX + Y, without storing the intermediate result in yet a fourth register. As soon as the first result of the multiply is completed, it can go directly to the adder, together with the Y elements from register 3, and begin to be processed. This technique produces a noticeable speedup of the entire operation. The result is conceptually one long pipeline made up of two shorter pipelines. Memory-register transfer operations and arithmetic operations can also be chained.
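To make the chained operation concrete, here is the saxpy computation written out as a C loop. This is only a sketch of the computation the hardware performs: on the Cray, a vectorizing compiler maps the multiply and the add onto two pipelined functional units, and chaining lets each product stream directly into the adder.

/* saxpy: y = a*x + y. On the Cray-1, the products a*x[i] flow out of the
   multiply pipeline and directly into the add pipeline (chaining), so the
   two operations behave as one longer pipeline. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}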
As a result of chaining, the Cray-1 is listed as being able to do 80 million additions and 80 million multiplications per second after pipeline startup. This figure is for operations done in registers. When a memory-to-memory vector multiply is performed, the rate drops to one third of the register multiply rate, or about 27 million multiplications per second. This rate increases to 70 million per second in the Cray X-MP configured with one CPU; the X-MP can be configured with up to four CPUs. The Cray Y-MP/C90 can be configured with up to 16 processing elements and is listed as having a peak rate of 15.2 billion operations per second, with an observed rate of 9.69 billion operations per second.

4.8 An Example SIMD Architecture - The Connection Machines

In our conceptual model of a SIMD machine shown in Figure 4-9, we assumed that the processors were connected via a two-dimensional grid. The Connection Machine is cube- (and grid-) connected. It is, unlike the vector processing machines, designed primarily for data manipulation. In this sense, it is a "symbol cruncher" as opposed to a "number cruncher." These machines, built during the decade spanning 1985-1994, are the product of Thinking Machines Corporation, a corporation founded in partnership with Danny Hillis, who designed the CM-1 as part of his Ph.D. thesis at MIT.

We will discuss the CM-1 and its successor, the CM-2. The latest machine from Thinking Machines, the CM-5, is a MIMD machine and will not be discussed here.

The Connection Machine is not a complete system. It is designed to be a back end machine serving up to four hosts. The hosts can be VAXes, Symbolics Lisp machines, or a combination of these. Programs are compiled on the hosts and then downloaded for execution to the Connection Machine via a 4 by 4 crossbar switch. The programs are executed "on-the-fly" as they are being downloaded; the Connection Machine has no program storage. The four execution units that receive these instruction streams are called sequencers. The design of the CM-1 is shown in Figure 4-22.

[Figure 4-22: A Conceptual Design of the CM-1 Architecture. Up to four host units feed a 4 by 4 crossbar switch. Each of the four sequencers drives one quadrant of processing elements, and I/O channels connect the processing elements to the Data Vaults.]

Each of the four quadrants has 16K one-bit processors. Each processor can do a one-bit add in 750 nanoseconds, and the machine can add 32-bit integers at a rate of 2 billion per second. These processing elements are so small that 16 of them fit on a 1 cubic centimeter chip, together with a message router that is connected to the hypercube network and a decoder that controls the processors and the router. This chip is surrounded by four 16-Kbit static RAM chips with read/write ports 4 bits wide. Consequently, each processor can be thought of as having its own 4-Kbit memory.

As stated earlier, the Connection Machine was designed for artificial intelligence applications (semantic networks, forward chaining inference, backward chaining inference, etc.) and database operations. It was not optimized for high speed calculations.

[Figure 4-23: A Schematic Diagram of a CM-1 Processor. The 4-Kbit memory supplies two bits, A and B, to the one-bit ALU along with a flag bit F. The ALU returns one output bit to memory and one to the flag registers (for an add, the sum and the carry AB + BF + AF, respectively). The 16 flag bits connect the processor to the NEWS flags of its four neighbors, the daisy chain link, and the router.]

The operation of each processor uses locations A and B in memory and flag registers F and Fdest (flag output can go to the flags of neighboring processors). The processor reads 2 bits from locations A and B and 1 bit from register F. It then performs some ALU operation on them and stores one output bit in location A and the other in register Fdest. Examples of ALU instructions are ADD, AND, OR, MOVE, and SWAP. Inputs can be inverted, and additional instructions, such as SUBTRACT, are possible. The ALUs do not really perform logic operations. They look up the proper outputs in a 16-bit register that contains the two output columns of the truth table for the ALU operation being performed. In addition to this truth table, the microcontroller also sends the processors the A and B addresses, the register containing the Read flag, the register to receive the Write flag, a Condition Flag that specifies which flag to consult for synchronization information, a Condition Sense bit telling whether or not to proceed with the operation, and 2 bits specifying the direction (NEWS) in which to move data during the instruction. This information is contained in 60 bits and is sent to the processors during each ALU cycle.
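The truth-table mechanism can be sketched in a few lines of C. The bit layout below, with the memory-output column in the low 8 bits of the 16-bit table and the flag-output column in the high 8 bits, is an assumption made for illustration; the actual CM-1 assignment of bits may differ.

#include <stdio.h>
#include <stdint.h>

/* One-bit ALU as a table lookup: the three input bits A, B, F select one
   of 8 rows; the 16-bit table holds the two 8-bit output columns. */
typedef struct { int mem_out; int flag_out; } alu_result;

alu_result alu(uint16_t table, int a, int b, int f)
{
    int row = (a << 2) | (b << 1) | f;      /* one of 8 input combinations */
    alu_result r;
    r.mem_out  = (table >> row) & 1;        /* memory-output column */
    r.flag_out = (table >> (8 + row)) & 1;  /* flag-output column */
    return r;
}

int main(void)
{
    /* Build the table for ADD: mem_out is the sum a^b^f and
       flag_out is the carry ab + bf + af. */
    uint16_t add_table = 0;
    for (int row = 0; row < 8; row++) {
        int a = (row >> 2) & 1, b = (row >> 1) & 1, f = row & 1;
        add_table |= (uint16_t)((a ^ b ^ f) << row);
        add_table |= (uint16_t)(((a & b) | (b & f) | (a & f)) << (8 + row));
    }
    alu_result r = alu(add_table, 1, 1, 0);
    printf("sum=%d carry=%d\n", r.mem_out, r.flag_out);  /* sum=0 carry=1 */
    return 0;
}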
A clever aspect of the processor design is the use of the 16 one-bit flag registers. Some of them are used for communication via the grid, the daisy chain that serially links the 16 processors located on each chip, or the router that links all of the chips together. One is a read-only NEWS flag that receives the flag output of the neighboring processor specified in the instruction. Another is a read-write Router Data Flag that receives data from and sends data to the message router. The Daisy Chain Flag reads the flag output from the preceding processor on the daisy chain connection. There is also a Global Flag, the logical NOR over all of the processors on a chip, for communication with the host. Other flags are unspecified so that they can do miscellaneous work, such as holding the carry bit in an ADD operation.

Since the router needs to deliver messages to particular processors, parts of memory are set aside for the processors' absolute addresses. Another part of memory holds a status word that includes a bit indicating when a message is ready to be sent to the router, and still other parts are set aside to receive messages from the router. The router itself also has its buffers in memory.

The NEWS connections and the daisy chain give quick access to a processor's neighbors and to the other processors in its cluster. Connections to any other processors are handled more slowly through a 12-dimensional hypercube network formed by connections between the routers (see Figure 2.xx and the following discussion). The routing mechanism used by the Connection Machine is the "store-and-forward" method, which moves messages through a nearest neighbor protocol. This is not the fastest way through the hypercube, but it allows routing to be handled automatically; the user does not need to consider the steering of a message through the network. The routing algorithm consists of twelve steps. The XOR of the source and destination addresses is computed. If this XOR contains a 1 in the ith dimension, the message is sent to the adjacent node in that dimension; otherwise it does not move.

A router's function is to receive messages from its 16 local processors and prepare them for transmission to one of the other 12 routers to which it is directly connected. It is also responsible for receiving messages from each of the other routers and either delivering or forwarding them. Each message is 32 bits, plus the destination address and a return address, for a total of approximately 60 bits.
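The route computation itself is simple enough to sketch in C. The function below follows the twelve-step rule just described; the node numbers in the example are arbitrary, and a real router would overlap this logic with its buffering and forwarding duties.

#include <stdio.h>

/* Store-and-forward routing on a 12-dimensional hypercube: at step i the
   message moves along dimension i exactly when bit i of (src XOR dst) is 1. */
void route(unsigned src, unsigned dst)
{
    unsigned node = src;
    unsigned diff = src ^ dst;       /* dimensions still to be corrected */
    for (int i = 0; i < 12; i++) {
        if ((diff >> i) & 1) {
            node ^= 1u << i;         /* hop to the neighbor in dimension i */
            printf("step %2d: forward to router %u\n", i, node);
        }                            /* otherwise the message waits */
    }                                /* node == dst at this point */
}

int main(void)
{
    route(0x025, 0xA31);             /* arbitrary 12-bit router addresses */
    return 0;
}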
The design of the CM-2 machine is essentially the same as that of the CM-1, except for the addition of a floating-point coprocessor for every 32 of the one-bit processors. This created a collection of 2K (2048) floating point units and marked a shift in the emphasis of Thinking Machines Corporation from symbolic to numeric computing. The resulting machine could perform billions of floating point operations per second on numeric problems favorable to SIMD operation. The newer model 200 spreads each single-precision value across 32 processors so that a floating point value can be transmitted all at once. Thus, the CM-2 can be viewed as a parallel computer containing 2048 32-bit processors. This change in hardware made the CM-2 an effective performer on numerical problems; teams using the CM-2 have won Gordon Bell Awards for performance.

4.9 Summary

The SIMD paradigm is not a panacea. There are many important models for which other paradigms are more appropriate. For example, a medical student training model that has several body systems working together on a set of symptoms would not be suited for SIMD processing. Each system has different operating characteristics and is doing different things than the other systems, though in harmony with them. This example is best handled in an environment where different processors model the different systems and share information. This is not a SIMD environment.

On the other hand, the SIMD paradigm is extremely appropriate for many problems that involve linear mathematical calculations or database operations, where several processors can be doing the same operation on different data sets at the same time. Linear mathematical calculations involve vector and matrix calculations. Database operations involve searching for information that may be contained in several different tables. Both processes involve the repetition of the same instructions on different sets of data. That is, by definition, a SIMD process.

There are two primary architectural models within the class of SIMD machines: the vector processing model, with its vector registers and accurate floating point hardware, and the pure SIMD model, which involves a master Control Unit and several lower cost processors, each having its own memory. The vector processing machines were clearly designed for doing linear mathematical computations. They are generally extremely fast and deliver a high degree of accuracy. A fundamental assumption of these machines is that they will be operating on data contained within the registers. Operations such as memory-to-memory multiplications create a bottleneck that forced the earlier vector processing machines to operate at less than their top rated speed. This bottleneck has at least been widened, if not entirely eliminated, by the addition of more memory ports and reduced machine cycle times in later model vector processing machines.

The pure SIMD machines have the advantage that processing elements can be turned off when an operation is needed for only a subset of them. Another advantage is that processing elements can communicate quickly with their neighbors. The disadvantage is that each processing element may handle only small pieces of data, in some models as small as one bit. In the early examples of this type of machine, this meant that any significant numerical calculation would slow a process down considerably. In addition, the results might not have the accuracy that can be found on the hardware of a vector processing machine. That is because the earliest models of this type of processor were designed to carry out symbolic operations, for example to answer questions such as "Is Flipper a whale?", as opposed to tracking the course of hurricanes or determining tomorrow's high temperature reading. The addition of floating point hardware operating on clusters of processors has made the pure SIMD architecture a much more acceptable one for numerical calculations. In fact, many large-scale numerical calculations have been successfully handled by machines built on this paradigm.

Thus, while SIMD may not be the best answer for all problems, there are problems for which it is the superior design. Almost any problem appropriate for high performance computing has some parts that would be best handled by a SIMD machine. The next step in high performance computing may involve networking several machines operating under different paradigms to run a program by dispatching parts of it to the appropriate hardware. Certainly, a SIMD machine will be part of this larger, "hybrid" machine.