Chapter 7 Multicores, Multiprocessors, and Clusters
Copyright © 2009 Elsevier, Inc. All rights reserved.

FIGURE 7.1 Hardware/software categorization and examples of the application perspective on concurrency versus the hardware perspective on parallelism.

FIGURE 7.2 Classic organization of a shared memory multiprocessor.

FIGURE 7.3 The last four levels of a reduction that sums results from each processor, from bottom to top. For all processors whose number i is less than half, add the sum produced by processor number (i + half) to its sum. (A code sketch of this reduction follows Figure 7.8.)

FIGURE 7.4 Classic organization of a multiprocessor with multiple private address spaces, traditionally called a message-passing multiprocessor. Note that unlike the SMP in Figure 7.2, the interconnection network is not between the caches and memory but is instead between processor-memory nodes.

FIGURE 7.5 How four threads use the issue slots of a superscalar processor in different approaches. The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support. The three examples at the bottom show how they would execute running together in three multithreading options. The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of gray and color correspond to four different threads in the multithreading processors. The additional pipeline start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss in throughput.

FIGURE 7.6 Hardware categorization and examples based on number of instruction streams and data streams: SISD, SIMD, MISD, and MIMD.

FIGURE 7.7 Comparing a single core of a Sun UltraSPARC T2 (Niagara 2) to a single Tesla multiprocessor. The T2 core is a single processor and uses hardware-supported multithreading with eight threads. The Tesla multiprocessor contains eight streaming processors and uses hardware-supported multithreading with 24 warps of 32 threads (eight processors times four clock cycles). The T2 can switch every clock cycle, while the Tesla can switch only every two or four clock cycles. One way to compare the two is that the T2 can only multithread the processor over time, while the Tesla can multithread over time and over space; that is, across the eight streaming processors as well as segments of four clock cycles.

FIGURE 7.8 Hardware categorization of processor architectures and examples based on static versus dynamic and ILP versus DLP.
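A minimal C sketch of the reduction in Figure 7.3, assuming the partial sums live in a shared array indexed by processor number and that the processor count is a power of two. The processor count and initial values are made up for illustration, and the parallel steps are simulated sequentially; on a real multiprocessor each iteration of the inner loop would run on its own processor, with barrier synchronization between steps.

```c
#include <stdio.h>

#define P 8   /* number of processors (illustrative) */

int main(void) {
    /* Each processor starts with its own partial sum; values are illustrative. */
    double sum[P];
    for (int i = 0; i < P; i++)
        sum[i] = i + 1.0;                 /* 1 + 2 + ... + 8 = 36 */

    /* Tree reduction from Figure 7.3: at each step, every processor i with
       i < half adds the partial sum of processor i + half to its own sum.
       Here the steps run sequentially; a real machine would run the inner
       loop body on each processor in parallel, separated by a barrier. */
    int half = P;
    while (half > 1) {
        half = half / 2;
        for (int i = 0; i < half; i++)    /* "for all processors i < half" */
            sum[i] = sum[i] + sum[i + half];
        /* barrier synchronization would go here on a real multiprocessor */
    }

    printf("total = %g\n", sum[0]);       /* prints 36 */
    return 0;
}
```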
FIGURE 7.9 Network topologies that have appeared in commercial parallel processors. The colored circles represent switches and the black squares represent processor-memory nodes. Even though a switch has many links, generally only one goes to the processor. The Boolean n-cube topology is an n-dimensional interconnect with 2ⁿ nodes, requiring n links per switch (plus one for the processor) and thus n nearest-neighbor nodes. Frequently, these basic topologies have been supplemented with extra arcs to improve performance and reliability.

FIGURE 7.10 Popular multistage network topologies for eight nodes. The switches in these drawings are simpler than in earlier drawings because the links are unidirectional; data comes in at the bottom and exits out the right link. The switch box in (c) can pass A to C and B to D, or B to C and A to D. The crossbar uses n² switches, where n is the number of processors, while the Omega network uses 2n log₂ n switches, organized into n/2 log₂ n large switch boxes, each of which is logically composed of four of the smaller switches. In this case, the crossbar uses 64 switches versus 12 switch boxes, or 48 switches, in the Omega network. The crossbar, however, can support any combination of messages between processors, while the Omega network cannot. (A worked check of these counts follows Figure 7.14.)

FIGURE 7.11 Examples of parallel benchmarks.

FIGURE 7.12 Arithmetic intensity, specified as the number of floating-point operations to run the program divided by the number of bytes accessed in main memory [Williams, Patterson, 2008]. Some kernels, such as Dense Matrix, have an arithmetic intensity that scales with problem size, but there are many kernels with arithmetic intensities independent of problem size. For kernels in the former case, weak scaling can lead to different results, since it puts much less demand on the memory system.

FIGURE 7.13 Roofline model [Williams, Patterson, 2008]. This example has a peak floating-point performance of 16 GFLOPS/sec and a peak memory bandwidth of 16 GB/sec from the Stream benchmark. (Since Stream is actually four measurements, this line is the average of the four.) The dotted vertical line in color on the left represents Kernel 1, which has an arithmetic intensity of 0.5 FLOPs/byte. It is limited by memory bandwidth to no more than 8 GFLOPS/sec on this Opteron X2. The dotted vertical line to the right represents Kernel 2, which has an arithmetic intensity of 4 FLOPs/byte. It is limited only by computation to 16 GFLOPS/sec. (This data is based on the AMD Opteron X2 (Revision F) using dual cores running at 2 GHz in a dual-socket system.) A small sketch of this bound follows Figure 7.14.

FIGURE 7.14 Roofline models of two generations of Opterons. The Opteron X2 roofline, which is the same as in Figure 7.13, is in black, and the Opteron X4 roofline is in color. The ridge point of the Opteron X4 sits at a higher arithmetic intensity, which means that kernels that were computationally bound on the Opteron X2 could be memory-performance bound on the Opteron X4.
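As a worked check of the switch counts quoted in the Figure 7.10 caption, the short C program below evaluates the crossbar's n² switches and the Omega network's n/2 log₂ n switch boxes (four smaller switches each, so 2n log₂ n switches in total) for n = 8. It is only a sketch of the arithmetic; the variable names are illustrative.

```c
#include <stdio.h>

int main(void) {
    int n = 8;                              /* number of processors, as in Figure 7.10 */

    /* Crossbar: one switch at every row/column crossing. */
    int crossbar_switches = n * n;          /* 64 for n = 8 */

    /* Omega network: log2(n) stages of n/2 large switch boxes,
       each box logically built from four smaller switches. */
    int log2n = 0;
    for (int t = n; t > 1; t >>= 1)
        log2n++;                            /* 3 for n = 8 */
    int omega_boxes    = (n / 2) * log2n;   /* 12 */
    int omega_switches = 4 * omega_boxes;   /* 48, i.e., 2n log2(n) */

    printf("crossbar: %d switches\n", crossbar_switches);
    printf("omega:    %d switch boxes = %d switches\n", omega_boxes, omega_switches);
    return 0;
}
```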
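The bound illustrated by Figure 7.13 is that attainable performance is the smaller of the peak floating-point rate and the peak memory bandwidth multiplied by the kernel's arithmetic intensity. The helper below, whose name is made up for illustration, reproduces the two kernel limits quoted in that caption under the assumed Opteron X2 peaks.

```c
#include <stdio.h>

/* Roofline bound: attainable GFLOPS/sec is the lesser of the compute peak
   and the memory bandwidth times the kernel's arithmetic intensity. */
static double roofline(double peak_gflops, double peak_gb_per_sec,
                       double flops_per_byte) {
    double memory_bound = peak_gb_per_sec * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 16.0;   /* Opteron X2 peak from Figure 7.13 */
    double peak_bw     = 16.0;   /* Stream memory bandwidth, GB/sec  */

    printf("Kernel 1 (0.5 FLOPs/byte): %.1f GFLOPS/sec\n",
           roofline(peak_gflops, peak_bw, 0.5));   /* 8.0: memory bound   */
    printf("Kernel 2 (4.0 FLOPs/byte): %.1f GFLOPS/sec\n",
           roofline(peak_gflops, peak_bw, 4.0));   /* 16.0: compute bound */
    return 0;
}
```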
FIGURE 7.15 Roofline model with ceilings. The top graph shows the computational “ceilings” of 8 GFLOPs/sec if the floating-point operation mix is imbalanced and 2 GFLOPs/sec if the optimizations to increase ILP and SIMD are also missing. The bottom graph shows the memory bandwidth ceilings of 11 GB/sec without software prefetching and 4.8 GB/sec if memory affinity optimizations are also missing. (A sketch combining these ceilings with the roofline bound follows Figure 7.19.)

FIGURE 7.16 Roofline model with ceilings, overlapping areas shaded, and the two kernels from Figure 7.13. Kernels whose arithmetic intensity lands in the blue trapezoid on the right should focus on computation optimizations, and kernels whose arithmetic intensity lands in the gray triangle in the lower left should focus on memory bandwidth optimizations. Those that land in the blue-gray parallelogram in the middle need to worry about both. As Kernel 1 falls in the parallelogram in the middle, try optimizing ILP and SIMD, memory affinity, and software prefetching. Kernel 2 falls in the trapezoid on the right, so try optimizing ILP and SIMD and the balance of floating-point operations.

FIGURE 7.17 Four recent multiprocessors, each using two sockets for the processors. Starting from the upper left-hand corner, the computers are: (a) Intel Xeon e5345 (Clovertown), (b) AMD Opteron X4 2356 (Barcelona), (c) Sun UltraSPARC T2 5140 (Niagara 2), and (d) IBM Cell QS20. Note that the Intel Xeon e5345 (Clovertown) has a separate north bridge chip not found in the other microprocessors.

FIGURE 7.18 Characteristics of the four recent multicores. Although the Xeon e5345 and Opteron X4 have the same speed DRAMs, the Stream benchmark shows that the Opteron X4 has the higher practical memory bandwidth, due to the inefficiencies of the front side bus on the Xeon e5345.

FIGURE 7.19 Roofline models for the multicore multiprocessors in Figure 7.17. The ceilings are the same as in Figure 7.15. Starting from the upper left-hand corner, the computers are: (a) Intel Xeon e5345 (Clovertown), (b) AMD Opteron X4 2356 (Barcelona), (c) Sun UltraSPARC T2 5140 (Niagara 2), and (d) IBM Cell QS20. Note that the ridge points of the four microprocessors intersect the X-axis at arithmetic intensities of 6, 4, 1/3, and 3/4, respectively. The dashed vertical lines are for the two kernels of this section, and the stars mark the performance achieved for these kernels after all the optimizations. SpMV is the pair of dashed vertical lines on the left. It has two lines because its arithmetic intensity improved from 0.166 to 0.255 with register blocking optimizations. LBMHD is the dashed vertical lines on the right. It has a pair of lines in (a) and (b) because a cache optimization skips filling the cache block on a miss when the processor would write new data into the entire block; that optimization increases the arithmetic intensity from 0.70 to 1.07. It is a single line at 0.70 in (c) because the UltraSPARC T2 does not offer the cache optimization, and a single line at 1.07 in (d) because the Cell has a local store loaded by DMA, so the program never fetches unnecessary data as caches do. (A toy calculation of the write-allocate effect follows below.)

FIGURE 7.20 Performance of SpMV on the four multicores.

FIGURE 7.21 Performance of LBMHD on the four multicores.

FIGURE 7.22 Base versus fully optimized performance of the four multicores on the two kernels. Note the high fraction of fully optimized performance delivered by the Sun UltraSPARC T2 (Niagara 2). There is no base performance column for the IBM Cell because there is no way to port the code to the SPEs without caches. While you could run the code on the Power core, it has an order of magnitude lower performance than the SPEs, so we ignore it in this figure.
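Figure 7.15's ceilings can be read as lowering the two sides of the roofline bound from Figure 7.13 when particular optimizations are missing. The sketch below encodes the quoted Opteron X2 numbers (16 GFLOPs/sec and 16 GB/sec for the outer roofline, compute ceilings of 8 and 2 GFLOPs/sec, bandwidth ceilings of 11 and 4.8 GB/sec); the function name and the flag encoding are assumptions for illustration, not the book's code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Roofline with ceilings (Figure 7.15): pick the compute and bandwidth
   ceilings that match the optimizations actually applied, then take the
   usual minimum of the compute limit and bandwidth * arithmetic intensity. */
static double attainable(double flops_per_byte, bool fp_balance,
                         bool ilp_simd, bool prefetch, bool affinity) {
    double compute   = ilp_simd ? (fp_balance ? 16.0 : 8.0) : 2.0;  /* GFLOPs/sec */
    double bandwidth = affinity ? (prefetch ? 16.0 : 11.0) : 4.8;   /* GB/sec     */
    double memory    = bandwidth * flops_per_byte;
    return memory < compute ? memory : compute;
}

int main(void) {
    double ai = 0.5;   /* Kernel 1's arithmetic intensity from Figure 7.13 */
    printf("no optimizations:  %.1f GFLOPs/sec\n",
           attainable(ai, false, false, false, false));   /* 2.0 */
    printf("all optimizations: %.1f GFLOPs/sec\n",
           attainable(ai, true, true, true, true));       /* 8.0 */
    return 0;
}
```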
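The cache optimization described in the Figure 7.19 caption raises arithmetic intensity because a store that will overwrite an entire block no longer forces the block to be read from memory first. The toy calculation below uses made-up flop and byte counts (not LBMHD's measured traffic) purely to show why skipping that fill shrinks the byte count in the FLOPs/byte ratio.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative traffic for one pass of a kernel; the numbers are
       invented, not measured LBMHD counts. */
    double flops       = 1000.0;
    double bytes_read  = 800.0;   /* operands fetched from memory   */
    double bytes_write = 600.0;   /* results written back to memory */

    /* With write-allocate caching, each block that will be written is first
       read from memory, so stores cost read traffic plus write-back traffic. */
    double ai_with_fill = flops / (bytes_read + 2.0 * bytes_write);

    /* If the hardware skips filling blocks the processor will overwrite
       entirely (or a local store is loaded only by explicit DMA, as on Cell),
       that extra read traffic disappears. */
    double ai_no_fill = flops / (bytes_read + bytes_write);

    printf("AI with write-allocate fill: %.2f FLOPs/byte\n", ai_with_fill);  /* 0.50 */
    printf("AI without the fill:         %.2f FLOPs/byte\n", ai_no_fill);    /* 0.71 */
    return 0;
}
```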