4 Models of Parallel Processing

Expand on the taxonomy of parallel processing from Chap. 1:
• Abstract models of shared and distributed memory
• Differences between abstract models and real hardware

Topics in This Chapter
4.1 Development of Early Models
4.2 SIMD versus MIMD Architectures
4.3 Global versus Distributed Memory
4.4 The PRAM Shared-Memory Model
4.5 Distributed-Memory or Graph Models
4.6 Circuit Model and Physical Realizations

4.1 Development of Early Models

Associative processing (AP) was perhaps the earliest form of parallel processing. Associative or content-addressable memories (AMs, CAMs) allow memory cells to be accessed based on their contents rather than their physical locations within the memory array. AM/AP architectures are essentially based on incorporating simple processing logic into the memory array, so as to remove the need for transferring large volumes of data through the limited-bandwidth interface between the memory and the processor (the von Neumann bottleneck).

Early associative memories provided two basic capabilities:
1. Masked search: looking for a particular bit pattern in selected fields of all memory words and marking those for which a match is indicated.
2. Parallel write: storing a given bit pattern into selected fields of all memory words that have been previously marked.

[Figure: a comparand register holds the search pattern, a mask register selects the bit positions to be compared, and a memory array with comparison logic marks the matching words.]

The Flynn-Johnson Classification Revisited

Fig. 4.1 The Flynn-Johnson classification of computer systems. Single or multiple control (instruction) streams are crossed with single or multiple data streams, giving four categories:
• SISD: the "uniprocessor"
• SIMD: the "array processor"
• MISD: rarely used
• MIMD: subdivided further by global versus distributed memory and by communication/synchronization via shared variables versus message passing:
  – GMSV: global memory, shared variables (the "shared-memory multiprocessor")
  – GMMP: global memory, message passing
  – DMSV: distributed memory, shared variables ("distributed shared memory")
  – DMMP: distributed memory, message passing (the "distributed-memory multicomputer")

Fig. 4.2 Multiple instruction streams I1–I5 operating on a single data stream (MISD): data flows in at one end of a chain of processors and out at the other.
• Various transformations are performed on each data item before it is passed on to the next processor(s).
• Successive data items can go through different transformations, because of data-dependent conditional statements.

4.2 SIMD versus MIMD Architectures

Within the SIMD category, two fundamental design choices exist:

1. Synchronous versus asynchronous SIMD: In a SIMD machine, each processor can execute or ignore the instruction being broadcast based on its local state or data-dependent conditions. This leads to some inefficiency in executing conditional computations. For example, an "if-then-else" statement is executed by first enabling the processors for which the condition is satisfied and then flipping the "enable" bit before getting into the "else" part; in each phase, the disabled processors sit idle.
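To make this masked execution concrete, here is a minimal C sketch that emulates it; the four "processors" are array slots stepped in lockstep, and the data values anticipate the four-processor example below. The emulation itself is an illustrative assumption, not part of the original slides.

    #include <stdio.h>

    #define P 4  /* number of SIMD processors */

    int main(void) {
        int x[P] = {1, -3, 2, -4};  /* per-processor data */
        int y[P] = {0, 0, 0, 0};
        int enable[P];

        /* "then" phase: enable exactly the processors where x > 0 */
        for (int p = 0; p < P; p++) enable[p] = (x[p] > 0);
        for (int p = 0; p < P; p++)          /* broadcast: y = y + 2 */
            if (enable[p]) y[p] += 2;        /* disabled processors idle */

        /* "else" phase: flip the enable bits */
        for (int p = 0; p < P; p++) enable[p] = !enable[p];
        for (int p = 0; p < P; p++)          /* broadcast: y = y - 3 */
            if (enable[p]) y[p] -= 3;

        for (int p = 0; p < P; p++)
            printf("P%d: y = %d\n", p, y[p]);
        return 0;
    }

Note that both broadcast instructions occupy a cycle on every processor, enabled or not, so the "if-then-else" costs the sum of both branch lengths; this is exactly the inefficiency that motivates SPMD.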
A possible cure is to use the asynchronous version of SIMD, known as SPMD (single-program, multiple-data), where each processor runs its own copy of the common program. The advantage of SPMD is that in an "if-then-else" computation, each processor spends time only on the relevant branch. The disadvantages include the need for occasional synchronization and the higher complexity of each processor, which must now have a program memory and instruction fetch/decode logic.

SIMD: processing the "if" statement. Consider the following block of code, to be executed on four processors, P0–P3, running in SIMD mode:

    if (x > 0) then y = y + 2; else y = y – 3;

Suppose that the x values in P0–P3 are (1, –3, 2, –4). Here is what happens:

    Broadcast instruction   P0          P1          P2          P3
    y = y + 2               y = y + 2   nothing     y = y + 2   nothing
    y = y – 3               nothing     y = y – 3   nothing     y = y – 3

What one wants to happen, and what the SPMD architecture makes possible, is for each processor to execute only its own branch, in a single step:

    P0          P1          P2          P3
    y = y + 2   y = y – 3   y = y + 2   y = y – 3

2. Custom- versus commodity-chip SIMD: A SIMD machine can be designed based on commodity (off-the-shelf) components or with custom chips. In the first approach, components tend to be inexpensive because of mass production; however, such general-purpose components will likely contain elements that are not needed for a particular design. Custom components generally offer better performance but lead to much higher development costs.

Within the MIMD class, three fundamental issues are:

1. Massively or moderately parallel processing: Is it more cost-effective to build a parallel processor out of a relatively small number of powerful processors or a massive number of very simple processors? Referring to Amdahl's law, the first choice does better on the inherently sequential part of a computation, while the second approach might allow a higher speedup for the parallelizable part.

2. Tightly versus loosely coupled MIMD: Which is the better approach to high-performance computing: specially designed multiprocessors/multicomputers, or a collection of ordinary workstations interconnected by commodity networks?

3. Explicit message passing versus virtual shared memory: Which scheme is better: forcing the users to explicitly specify all messages that must be sent between processors, or allowing them to program in an abstract higher-level model, with the required messages automatically generated by the system software?

4.3 Global versus Distributed Memory

Fig. 4.3 A parallel processor with global memory: p processors are connected to m memory modules through a processor-to-memory network, with a separate processor-to-processor network and parallel I/O. Options for the processor-to-memory network include a crossbar (expensive), bus(es) (a bottleneck), or a multistage interconnection network, MIN (complex).

Global memory may be visualized as being in a central location where all processors can access it with equal ease. Processors access memory through a special processor-to-memory network. This interconnection network must have very low latency, because memory is accessed quite frequently.
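One way to picture this shared-variables style in ordinary code is the following minimal sketch, which uses POSIX threads as the "processors" and a global array as the central memory; the partial-sum task and all names are illustrative assumptions, not from the slides (compile with -pthread).

    #include <pthread.h>
    #include <stdio.h>

    #define P 4         /* number of "processors" (threads) */
    #define N 1000      /* size of the shared data */

    int shared_data[N]; /* global memory, visible to every thread */
    long partial[P];    /* per-processor result slots, also shared */

    void *worker(void *arg) {
        long id = (long)arg;
        long sum = 0;
        /* each processor sums its own contiguous slice */
        for (int i = id * (N / P); i < (id + 1) * (N / P); i++)
            sum += shared_data[i];
        partial[id] = sum;  /* write the result back to global memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        long total = 0;
        for (int i = 0; i < N; i++) shared_data[i] = 1;
        for (long id = 0; id < P; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (long id = 0; id < P; id++) {
            pthread_join(t[id], NULL);  /* synchronization point */
            total += partial[id];
        }
        printf("total = %ld\n", total); /* prints 1000 */
        return 0;
    }

Every access to shared_data conceptually traverses the processor-to-memory network; reducing that traffic is precisely what the private caches discussed below are for.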
A global-memory multiprocessor is characterized by the type and number p of processors, the capacity and number m of memory modules, and the network architecture. The main network options are:
1. Crossbar switch: O(pm) complexity, and thus quite costly for highly parallel systems.
2. Single or multiple buses, the latter with complete or partial connectivity.
3. Multistage interconnection network (MIN).

One approach to reducing the amount of data that must pass through the processor-to-memory interconnection network is to use a private cache memory of reasonable size within each processor. Caches work because of the locality of data access, repeated access to the same data, and the greater efficiency of block, as opposed to word-at-a-time, data transfers. The use of multiple caches gives rise to the cache coherence problem: copies of the same datum in different caches must be kept consistent.

Fig. 4.4 A parallel processor with global memory and processor caches. Challenge: cache coherence.

With a single cache, the write-through policy can keep the two data copies consistent. With multiple caches, example solutions include:
• Do not cache shared data at all, or allow only a single cache copy.
• Do not cache "writeable" shared data, or allow only a single cache copy.
• Use a cache coherence protocol.

In the distributed-memory alternative, a collection of p processors, each with its own private memory, communicates through an interconnection network. The latency of the interconnection network may be less critical, as each processor is likely to access its own local memory most of the time. Because access to data stored in remote memory modules involves considerably more latency than access to the processor's local memory, distributed-memory MIMD machines are sometimes described as nonuniform memory access (NUMA) architectures.

Fig. 4.5 A parallel processor with distributed memory. Some terminology:
• UMA: uniform memory access (global shared memory)
• NUMA: nonuniform memory access (distributed shared memory)
• COMA: cache-only memory architecture

4.4 The PRAM Shared-Memory Model

The theoretical model used for conventional or sequential computers (the SISD class) is known as the random-access machine (RAM). Its parallel counterpart abstracts away the details of the processor-to-memory interconnection network and takes the view that each processor can access any memory location in each machine cycle, independent of what other processors are doing. The problem of multiple processors attempting to write into a common memory location must be resolved in some way.

Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM).
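As an illustration of what this one-cycle access assumption buys, here is a sequential C emulation of a PRAM computing the sum of N values in log2(N) cycles; the reduction scheme is a standard textbook example, offered as an illustrative assumption rather than content from the slides.

    #include <stdio.h>

    #define N 8  /* number of values = number of processors */

    int main(void) {
        int m[N] = {3, 1, 4, 1, 5, 9, 2, 6};  /* the shared memory */

        /* In each PRAM cycle, every active processor p reads two
           locations, adds them, and writes the result back; the inner
           loop emulates what the real machine does simultaneously. */
        for (int s = 1; s < N; s *= 2)          /* log2(N) cycles */
            for (int p = 0; p < N; p += 2 * s)  /* "all processors at once" */
                m[p] = m[p] + m[p + s];         /* read, compute, write */

        printf("sum = %d\n", m[0]);             /* prints 31 */
        return 0;
    }

On a real machine, the two reads in each cycle may target locations arbitrarily far apart, which is exactly what makes the model hard to realize in hardware, as discussed next.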
In the SIMD variant of the PRAM, all processors obey the same instruction in each machine cycle; however, because of indexed and indirect (register-based) addressing, they often execute the operation that is broadcast to them on different data.

In view of the direct and independent access to every memory location allowed for each processor, the PRAM model depicted is highly theoretical. Because memory locations are too numerous to be assigned individual ports on an interconnection network, blocks of memory locations (or modules) would have to share a single network port.

Fig. 4.7 PRAM with some hardware details shown: the processors reach the shared memory through a memory access network and controller, under a common processor control. The PRAM cycle consists of three phases:
1. All processors read memory locations of their choosing.
2. All processors compute one step independently.
3. All processors store results into memory locations of their choosing.

4.5 Distributed-Memory or Graph Models

Given the internal processor and memory structures in each node, a distributed-memory architecture is characterized primarily by the network used to interconnect the nodes. Important parameters of an interconnection network include:
1. Network diameter: the diameter is more important with store-and-forward routing.
2. Bisection (band)width: this is important when nodes communicate with each other in a random fashion.
3. Vertex or node degree: the node degree has a direct effect on the cost of each node, with the effect being more significant for parallel ports containing several wires or when the node is required to communicate over all of its ports at once.

Fig. 4.8 The sea of interconnection networks.

Some Interconnection Networks (Table 4.2)

    Network name(s)           Number of nodes    Network diameter   Bisection width   Node degree   Local links?
    1D mesh (linear array)    k                  k – 1              1                 2             Yes
    1D torus (ring, loop)     k                  k/2                2                 2             Yes
    2D mesh                   k^2                2k – 2             k                 4             Yes
    2D torus (k-ary 2-cube)   k^2                k                  2k                4             Yes(1)
    3D mesh                   k^3                3k – 3             k^2               6             Yes
    3D torus (k-ary 3-cube)   k^3                3k/2               2k^2              6             Yes(1)
    Pyramid                   (4k^2 – 1)/3       2 log2 k           2k                9             No
    Binary tree               2^l – 1            2l – 2             1                 3             No
    4-ary hypertree           2^l (2^(l+1) – 1)  2l                 2^(l+1)           6             No
    Butterfly                 2^l (l + 1)        2l                 2^l               4             No
    Hypercube                 2^l                l                  2^(l–1)           l             No
    Cube-connected cycles     2^l l              2l                 2^(l–1)           3             No
    Shuffle-exchange          2^l                2l – 1             2^(l–1)/l         4 unidir.     No
    De Bruijn                 2^l                l                  2^l / l           4 unidir.     No

    (1) With folded layout.
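As a quick sanity check on these formulas, the small program below evaluates a few of the rows for concrete sizes, chosen so that all three networks have 64 nodes; the program itself is purely illustrative.

    #include <stdio.h>

    int main(void) {
        int k = 8;  /* side length for the 2D mesh and torus */
        int l = 6;  /* dimension of the hypercube */

        printf("2D mesh,   k=%d: nodes=%d, diameter=%d, bisection=%d, degree=4\n",
               k, k * k, 2 * k - 2, k);
        printf("2D torus,  k=%d: nodes=%d, diameter=%d, bisection=%d, degree=4\n",
               k, k * k, k, 2 * k);
        printf("Hypercube, l=%d: nodes=%d, diameter=%d, bisection=%d, degree=%d\n",
               l, 1 << l, l, 1 << (l - 1), l);
        return 0;
    }

For 64 nodes this prints diameters of 14, 8, and 6 and bisection widths of 8, 16, and 32, respectively: the hypercube buys its short diameter and wide bisection at the price of a node degree that grows with network size.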
In the LogP model, the communication architecture of a parallel computer is captured in four parameters:

L (latency): an upper bound on the latency when a small message (of a few words) is sent from an arbitrary source node to an arbitrary destination node.
o (overhead): the length of time during which a processor is dedicated to the transmission or reception of a message and is thus unable to do any other computation.
g (gap): the minimum time that must elapse between consecutive message transmissions or receptions by a single processor.
P: processor multiplicity (p in our notation).

Because a single bus can quickly become a performance bottleneck as the number of processors increases, a variety of multiple-bus architectures are available for reducing bus traffic.

Fig. 4.9 Example of a hierarchical interconnection architecture: low-level clusters of processors, joined at higher levels through bus switches (gateways).

4.6 Circuit Model and Physical Realizations

The most accurate option is to model the machine at the circuit level, so that all computational and signal propagation delays can be taken into account. This is impossible for a complex supercomputer, both because generating and debugging detailed circuit specifications are not much easier than a full-blown implementation and because a circuit simulator would take eons to run the simulation.

A more precise model, particularly if the circuit is to be implemented on a dense VLSI chip, would include the effect of wires, in terms of both the chip area they consume (cost) and the signal propagation delay between and within the interconnected blocks (time).

For the hypercube architecture, which has nonlocal links, the interprocessor wire delays can dominate the intraprocessor delays, thus making the communication step time much larger than that of the mesh- and torus-based architectures.

Fig. 4.10 Intrachip wire delay as a function of wire length.

At times, we can determine bounds on area and wire-length parameters based on network properties, without having to resort to detailed specification and layout with VLSI design tools. For example, in a 2D VLSI implementation, the bisection width of a network yields an asymptotic lower bound on its layout area: if the bisection width is B, the smallest dimension of the chip must be at least Bw, where w is the minimum wire width (including the mandatory interwire spacing).
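To see what this bound implies, the sketch below compares the minimum chip dimension Bw for a 2D mesh (B = sqrt(p)) and a hypercube (B = p/2, per Table 4.2) as the number of processors p grows; the unit wire pitch and the chosen sizes are illustrative (compile with -lm).

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double w = 1.0;  /* minimum wire width plus spacing, arbitrary units */
        for (int l = 4; l <= 12; l += 4) {
            double p = pow(2, l);
            double B_mesh = sqrt(p);  /* bisection of a sqrt(p) x sqrt(p) mesh */
            double B_cube = p / 2;    /* bisection of an l-cube: 2^(l-1) */
            printf("p = %4.0f:  mesh dimension >= %4.0f w,  hypercube dimension >= %4.0f w\n",
                   p, B_mesh * w, B_cube * w);
        }
        return 0;
    }

For p = 4096, the mesh's minimum dimension is 64w while the hypercube's is 2048w: the rich connectivity that makes the hypercube attractive in the graph model translates into rapidly growing area and wire lengths in the circuit model.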
Power consumption of digital circuits is another limiting factor. Power dissipation in modern microprocessors grows almost linearly with the product of die area and clock frequency (both steadily rising) and today stands at a few tens of watts in high-performance designs. Even if modern low-power design methods succeed in reducing this power by an order of magnitude, disposing of the heat generated by 1M such processors is indeed a great challenge.

Pitfalls of Scaling Up (Fig. 4.11)

If the weight of an ant grows by a factor of one trillion, the thickness of its legs must grow by a factor of one million to support the new weight. Picture an ant scaled up in length by a factor of 10^4, from 5 mm to 50 m. What is wrong with the image of such a giant ant on a rampage? Its leg thickness would have to grow from 0.1 mm to 100 m; with mere proportional scaling, the scaled-up ant collapses under its own weight.
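The arithmetic behind the picture, assuming weight scales with volume and a leg's strength with its cross-sectional area, can be written out as a short calculation:

    \[
    \begin{aligned}
    s &= \frac{50\ \text{m}}{5\ \text{mm}} = 10^4
        &&\text{(linear scale factor)}\\
    W' &= s^3 W = 10^{12}\, W
        &&\text{(weight grows with volume: one trillion)}\\
    t'^2 &= 10^{12}\, t^2 \;\Rightarrow\; t' = 10^6\, t
        &&\text{(leg strength grows with area } t^2\text{)}\\
    t' &= 10^6 \times 0.1\ \text{mm} = 100\ \text{m}
        &&\text{(versus only } 10^4 t = 1\ \text{m under uniform scaling)}
    \end{aligned}
    \]

The moral for parallel processing is the same: a design that works at small scale cannot simply be magnified, because its supporting structure, the interconnects and physical packaging, must grow disproportionately.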