Part VI  Implementation Aspects

About This Presentation
This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami. First edition released in Spring 2005.

VI  Implementation Aspects
Study real parallel machines in the MIMD and SIMD classes:
• Examine parallel computers of historical significance
• Learn from modern systems and implementation ideas
• Bracket our knowledge with history and forecasts
Topics in This Part
Chapter 21  Shared-Memory MIMD Machines
Chapter 22  Message-Passing MIMD Machines
Chapter 23  Data-Parallel SIMD Machines
Chapter 24  Past, Present, and Future

21  Shared-Memory MIMD Machines
Learn about practical shared-variable parallel architectures:
• Contrast centralized and distributed shared memories
• Examine research prototypes and production machines
Topics in This Chapter
21.1  Variations in Shared Memory
21.2  MIN-Based BBN Butterfly
21.3  Vector-Parallel Cray Y-MP
21.4  Latency-Tolerant Tera MTA
21.5  CC-NUMA Stanford DASH
21.6  SCI-Based Sequent NUMA-Q

21.1  Variations in Shared Memory
Fig. 21.1  Classification of shared-memory hardware architectures (UMA, CC-UMA, NUMA, CC-NUMA, COMA, according to whether main memory is central or distributed and whether modifiable data exist in a single copy or in multiple copies) and the example systems studied in the rest of this chapter: BBN Butterfly, Cray Y-MP, Tera MTA, Stanford DASH, and Sequent NUMA-Q.

C.mmp: A Multiprocessor of Historical Significance
Fig. 21.2  Organization of the C.mmp multiprocessor: 16 processors, each with a cache and an address map, connected to 16 memory modules through a 16 × 16 crossbar, with a shared clock and bus control reached through per-processor bus interfaces.

Shared-Memory Consistency Models
Varying latencies make each processor's view of the memory different.
Sequential consistency (strictest and most intuitive): mandates that the interleaving of reads and writes be the same from the viewpoint of all processors. This provides the illusion of an FCFS single-port memory.
Processor consistency (laxer): only mandates that writes be observed in the same order by all processors. This allows reads to overtake writes, providing better performance due to out-of-order execution.
Weak consistency: separates ordinary memory accesses from synch accesses that require memory to become consistent. Ordinary read and write accesses can proceed as long as there is no pending synch access, but the latter must wait for all preceding accesses to complete.
Release consistency: similar to weak consistency, but recognizes two synch accesses, called "acquire" and "release", that sandwich protected shared accesses. Ordinary read and write accesses can proceed only when there is no pending acquire access from the same processor, and a release access must wait for all reads and writes to be completed.
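The acquire/release pairing above maps directly onto the acquire and release memory orders of C++11 atomics. The sketch below is an illustration written for these notes (the variable names and values are made up), not an example from the textbook; it shows ordinary accesses being ordered only with respect to the two synchronizing accesses that bracket them.

// Minimal sketch of acquire/release synchronization with C++11 atomics (illustrative only).
#include <atomic>
#include <iostream>
#include <thread>

int shared_data = 0;                    // ordinary (non-synchronizing) variable
std::atomic<bool> ready{false};         // synchronizing variable

void producer() {
    shared_data = 42;                              // ordinary write ...
    ready.store(true, std::memory_order_release);  // ... must complete before the release
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // acquire: later reads cannot move above it
    std::cout << shared_data << '\n';                   // guaranteed to observe 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}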
21.2  MIN-Based BBN Butterfly
Fig. 21.3  Structure of a processing node in the BBN Butterfly: an MC 68000 processor, a processor node controller, and a memory manager, with an EPROM, 1 MB of memory, a daughter-board connection for 3 MB of memory expansion, a switch interface to the network, and connections to I/O boards.
Fig. 21.4  A small 16-node version of the multistage interconnection network of the BBN Butterfly, built of 4 × 4 switches.

21.3  Vector-Parallel Cray Y-MP
Fig. 21.5  Key elements of the Cray Y-MP processor: eight 64-bit vector registers (V0–V7) feeding vector integer units (add, shift, logic, weight/parity) and floating-point units (add, multiply, reciprocal approximation), plus scalar integer units with S and T registers (eight 64-bit registers each), all connected to central memory and input/output. Address registers, address function units, instruction buffers, and control are not shown.

Cray Y-MP's Interconnection Network
Fig. 21.6  The processor-to-memory interconnection network of the Cray Y-MP: eight processors reach 256 interleaved memory banks through 4 × 4 and 8 × 8 switches organized into sections and subsections.

21.4  Latency-Tolerant Tera MTA
Fig. 21.7  The instruction execution pipelines of Tera MTA.

21.5  CC-NUMA Stanford DASH
Fig. 21.8  The architecture of Stanford DASH: processor clusters, each containing processors with I-/D-caches, level-2 caches, main memory, and a directory & network interface, connected by separate request and reply meshes built of wormhole routers.

21.6  SCI-Based Sequent NUMA-Q
Fig. 21.9  The physical placement of Sequent's quad components on a rackmount baseboard (not to scale): four Pentium Pro (P6) processors and memory on a 533-MB/s local bus (64 bits, 66 MHz), an IQ-Link providing the 1-GB/s interquad SCI link, and Orion PCI bridges (OPB) to 4 × 33-MB/s PCI buses carrying 100-MB/s Fibre Channel and LAN connections.
Fig. 21.10  The architecture of Sequent NUMA-Q 2000: quads joined through their IQ-Links into the IQ-Plus SCI ring, with bridges to SCSI peripheral bays, a console, a private Ethernet, and a public LAN.

Details of the IQ-Link in Sequent NUMA-Q
Fig. 21.11  Block diagram of the IQ-Link board: on the bus side, a bus interface controller with P6 bus snooping tags; on the network side, a directory controller with local and remote directory tags, 32 MB of remote cache data, and an interconnect controller with SCI input and output links.
Fig. 21.12  Block diagram of the IQ-Link's interconnect controller: incoming SCI traffic passes through a stripper into request and response receive queues and is disassembled, outgoing traffic is assembled from request and response send queues, and an elastic buffer and a bypass FIFO sit on the SCI path.
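Both DASH and the NUMA-Q IQ-Link rely on a directory that records, for each memory block, which nodes hold copies. The sketch below is a simplified full-bit-vector directory written for these notes; the type names, cluster count, and handler structure are invented, and it is not the actual DASH or SCI protocol.

// Simplified full-bit-vector directory entry kept at a block's home node (illustrative only).
#include <cstdint>
#include <cstdio>

constexpr int kClusters = 16;            // assumed number of clusters

struct DirEntry {
    uint16_t presence = 0;               // bit i set => cluster i caches the block
    bool     dirty    = false;           // true => exactly one cluster holds a modified copy
};

// Home node handles a read miss from cluster c.
void read_miss(DirEntry& d, int c) {
    if (d.dirty) {
        // Retrieve the block from its single owner and write it back to memory;
        // afterwards the owner and the requester both hold clean copies.
        d.dirty = false;
    }
    d.presence |= uint16_t(1) << c;      // record the new sharer
}

// Home node handles a write miss (or ownership upgrade) from cluster c.
void write_miss(DirEntry& d, int c) {
    for (int i = 0; i < kClusters; ++i)
        if (i != c && ((d.presence >> i) & 1)) {
            // Send an invalidation to cluster i (omitted in this sketch).
        }
    d.presence = uint16_t(1) << c;       // requester becomes the sole owner
    d.dirty = true;
}

int main() {
    DirEntry d;
    read_miss(d, 2);                     // cluster 2 gets a shared copy
    write_miss(d, 5);                    // cluster 5 invalidates it and becomes owner
    std::printf("presence = %04x, dirty = %d\n", d.presence, d.dirty);
}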
22  Message-Passing MIMD Machines
Learn about practical message-passing parallel architectures:
• Study mechanisms that support message passing
• Examine research prototypes and production machines
Topics in This Chapter
22.1  Mechanisms for Message Passing
22.2  Reliable Bus-Based Tandem NonStop
22.3  Hypercube-Based nCUBE3
22.4  Fat-Tree-Based Connection Machine 5
22.5  Omega-Network-Based IBM SP2
22.6  Commodity-Driven Berkeley NOW

22.1  Mechanisms for Message Passing
1. Shared-medium networks (buses, LANs; e.g., Ethernet, token-based)
2. Router-based networks (aka direct networks; e.g., Cray T3E, Chaos)
3. Switch-based networks (crossbars, MINs; e.g., Myrinet, IBM SP2)
Fig. 22.2  Example 4 × 4 and 2 × 2 switches used as building blocks for larger networks; a 2 × 2 crosspoint switch can be set to through (straight), crossed (exchange), lower-broadcast, or upper-broadcast states.

Router-Based Networks
Fig. 22.1  The structure of a generic router: input and output channels with link controllers (LC) and message queues (Q), injection and ejection channels to the local node, and a switch governed by routing/arbitration logic.

Case Studies of Message-Passing Machines
Fig. 22.3  Classification of message-passing hardware architectures (by grain size and by shared-medium, router-based, or switch-based network) and the example systems studied in this chapter: Tandem NonStop (bus), Berkeley NOW (LAN), nCUBE3, TMC CM-5, and IBM SP2.

22.2  Reliable Bus-Based Tandem NonStop
Fig. 22.4  One section of the Tandem NonStop Cyclone system: four processor-and-memory units on the Dynabus, each with I/O connections to shared I/O controllers.

High-Level Structure of the Tandem NonStop
Fig. 22.5  Four four-processor sections interconnected by Dynabus+.

22.3  Hypercube-Based nCUBE3
Fig. 22.6  An eight-node nCUBE architecture, with hosts and I/O attached to the hypercube nodes.
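Routing on a hypercube such as nCUBE's is commonly done with dimension-order (e-cube) logic: XOR the current and destination node labels and correct one differing dimension at a time. The function below is a generic sketch written for these notes, not nCUBE's actual router hardware or firmware.

// Dimension-order (e-cube) routing on a d-dimensional hypercube (illustrative sketch).
#include <cstdio>

// Return the neighbor to which a message at node `cur` should be forwarded on its
// way to node `dst`, or `cur` itself if the message has arrived.
unsigned next_hop(unsigned cur, unsigned dst) {
    unsigned diff = cur ^ dst;           // address bits that still differ
    if (diff == 0) return cur;           // arrived
    unsigned dim = diff & ~(diff - 1);   // isolate the lowest differing dimension
    return cur ^ dim;                    // flip that one address bit
}

int main() {
    // Route from node 3 (011) to node 6 (110) on a 3-cube: 011 -> 010 -> 110.
    for (unsigned n = 3; n != 6; n = next_hop(n, 6))
        std::printf("%u -> %u\n", n, next_hop(n, 6));
}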
22.4  Fat-Tree-Based Connection Machine 5
Fig. 22.7  The overall structure of CM-5: up to 16K processing nodes (each a processor with memory) and one or more control processors attached to the data, control, and diagnostic networks, with I/O connections to data vaults, HIPPI or VME interfaces, and graphics output.

Processing Nodes in CM-5
Fig. 22.8  The components of a processing node in CM-5: a SPARC processor, four memory banks each paired with a vector unit (pipelined ALU and register file), memory control, instruction decode, and a bus interface on a 64-bit bus, plus a network interface to the data and control networks.

The Interconnection Network in CM-5
Fig. 22.9  The fat-tree (hyper-tree) data network of CM-5: 64 processor and I/O nodes connected by data routers that are 4 × 2 switches.

22.5  Omega-Network-Based IBM SP2
Fig. 22.10  The architecture of the IBM SP series of systems: compute nodes, host nodes, input/output nodes, and gateways to other systems, each node containing a processor, memory, a network interface, an Ethernet adapter, and a Micro Channel controller with disk and console, all connected both by an Ethernet and by the high-performance switch.

The IBM SP2 Network Interface
Fig. 22.11  The network interface controller of IBM SP2: a Micro Channel interface with left and right DMA engines and 2 × 2 KB FIFO buffers, a 40-MHz i860 processor with 8 MB of DRAM on a 160-MB/s i860 bus, and a memory and switch management unit with input and output FIFOs.

The Interconnection Network of IBM SP2
Fig. 22.12  A section of the high-performance switch network of IBM SP2, built of 4 × 4 crossbars connecting 64 nodes (only 1/4 of the links at the upper level are shown).

22.6  Commodity-Driven Berkeley NOW
NOW aims to build a distributed supercomputer from commercial workstations (high-end PCs), linked via switch-based networks.
Components and challenges for successful deployment of NOWs (aka clusters of workstations, or COWs):
Network interface hardware: System-area network (SAN), a middle ground between LANs and specialized interconnections
Fast communication protocols: Must provide competitive values for the parameters of the LogP model (see Section 4.5)
Distributed file systems: Data files must be distributed among nodes, with no central warehouse or arbitration authority
Global resource management: Berkeley uses GLUnix (global-layer Unix); other systems of comparable functionality include Argonne Globus, the NSCP metacomputing system, Princeton SHRIMP, Rice TreadMarks, Syracuse WWVM, Virginia Legion, and Wisconsin Wind Tunnel

23  Data-Parallel SIMD Machines
Learn about practical implementation of SIMD architectures:
• Assess the successes, failures, and potentials of SIMD
• Examine various SIMD parallel computers, old and new
Topics in This Chapter
23.1  Where Have All the SIMDs Gone?
23.2  The First Supercomputer: ILLIAC IV
23.3  Massively Parallel Goodyear MPP
23.4  Distributed Array Processor (DAP)
23.5  Hypercubic Connection Machine 2
23.6  Multiconnected MasPar MP-2

23.1  Where Have All the SIMDs Gone?
Associative or content-addressable memories/processors constituted early forms of SIMD parallel processing.
Fig. 23.1  Functional view of an associative memory/processor: a control unit broadcasts the comparand, mask, data, and commands to cells 0 through m – 1; each cell sets its tag bit (t0, t1, ..., tm–1) in the response store, and a global tag operations unit combines the tags, with read lines and global operations control & response returning results.
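The basic operation of Fig. 23.1 is a masked comparison carried out by all cells at once: a cell matches if its word agrees with the comparand in every bit position selected by the mask. The loop below is a plain sequential software stand-in for that hardware step, written for these notes with invented names; it realizes the "exact match" search listed on the next slide.

// Software stand-in for a masked exact-match search in an associative memory (illustrative).
#include <cstddef>
#include <cstdint>
#include <vector>

// tags[i] becomes true iff cell i matches the comparand in every bit position where mask = 1.
// In real associative hardware, all cells perform this comparison simultaneously.
std::vector<bool> masked_search(const std::vector<uint32_t>& cells,
                                uint32_t comparand, uint32_t mask) {
    std::vector<bool> tags(cells.size());
    for (std::size_t i = 0; i < cells.size(); ++i)
        tags[i] = ((cells[i] ^ comparand) & mask) == 0;
    return tags;
}

int main() {
    std::vector<uint32_t> cells = {0x12345678, 0xABCD5678, 0x0000FFFF};
    auto tags = masked_search(cells, 0x00005678, 0x0000FFFF);  // match on the low 16 bits
    // tags = {true, true, false}
}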
Search Functions in Associative Devices
Exact match: Locating data based on partial knowledge of contents
Inexact match: Finding numerically or logically proximate values
Membership: Identifying all members of a specified set
Relational: Determining values that are less than, less than or equal, etc.
Interval: Marking items that are between or outside given limits
Extrema: Finding the maximum, minimum, next higher, or next lower
Rank-based: Selecting the kth or the k largest/smallest elements
Ordered retrieval: Repeated max- or min-finding with elimination (sorting)

Mixed-Mode SIMD / MIMD Parallelism
Fig. 23.2  The architecture of Purdue PASM: a system control unit (with control storage) and a memory management system above mid-level controllers, a parallel computation unit of processors 0 through p – 1 joined by an interconnection network, and a memory storage system of paired (A/B) memory modules.

23.2  The First Supercomputer: ILLIAC IV
Fig. 23.3  The ILLIAC IV computer: a B6700 host and a control unit (which fetches data or instructions) drive 64 processors, each with its own memory module; the inter-processor routing network is only partially shown, and main memory is provided by synchronized disks organized into 52 bands of 300 pages each.

23.3  Massively Parallel Goodyear MPP
Fig. 23.4  The architecture of Goodyear MPP: a VAX 11/780 host (with tape, disk, printer, terminal, and network) and a program and data management unit feed the 128 × 128-processor array unit through staging memories and 128-bit input and output interfaces, under an array control unit that issues control and collects status.

Processing Elements in the Goodyear MPP
Fig. 23.5  The single-bit processor of MPP: a full adder (carry C and sum), single-bit registers A, B, G, P, and S, a comparator with mask, a logic unit, a variable-length shift register, and local memory on a data bus, with links to the left and right processors, the NEWS neighbors, and a global OR tree.

23.4  Distributed Array Processor (DAP)
Fig. 23.6  The bit-serial processor of DAP: a full adder with registers A, C, D, Q, and S, multiplexers selecting inputs from the four (NEWS) neighbors, the row/column lines, and the control unit, plus row/column response outputs and a fast shift path from the south neighbor to the north neighbor.

DAP's High-Level Structure
Fig. 23.7  The high-level architecture of the DAP system: a master control unit with program memory drives the processor array; the Q, C, A, and D register planes and an array memory of at least 32K one-bit planes hold the processors' state, with the local memory of processor ij formed by its bit position across the planes; a host workstation is reached through a host interface unit, and fast I/O passes through the D plane.

23.5  Hypercubic Connection Machine 2
Fig. 23.8  The architecture of CM-2: front-end computers attach through bus interfaces and a nexus to sequencers, each controlling 16K bit-serial processors, with data vaults, an I/O system, and a graphic display on the I/O side.

The Simple Processing Elements of CM-2
Fig. 23.9  The bit-serial ALU of CM-2: the three input bits a, b, and c select one bit of the 8-bit f opcode and one bit of the 8-bit g opcode through 8-to-1 multiplexers, yielding f(a, b, c), which goes to memory, and g(a, b, c), which goes to the flags.
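Because the CM-2 ALU's opcode is simply the truth table of the desired three-input Boolean function, it can be modeled in a couple of lines. The code below is a software model written for these notes; the bit ordering inside the opcode is an assumption of this sketch, not a statement about the actual hardware encoding.

// Model of CM-2's 8-to-1 multiplexer ALU: `opcode` is the 8-bit truth table of f(a, b, c).
#include <cassert>

int alu(unsigned opcode, int a, int b, int c) {
    int index = (a << 2) | (b << 1) | c;  // the inputs select one truth-table bit (assumed ordering)
    return (opcode >> index) & 1;
}

int main() {
    // Example: opcode 0xE8 (11101000) encodes the 2-out-of-3 majority function.
    assert(alu(0xE8, 1, 1, 0) == 1);
    assert(alu(0xE8, 0, 1, 0) == 0);
}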
23.6  Multiconnected MasPar MP-2
Fig. 23.10  The architecture of MasPar MP-2: a front end (with Ethernet and a disk array) drives the array control unit, which controls the X-net-connected processor array; an I/O channel controller and memory link the array and the global router to external I/O.

The Interconnection Network of MasPar MP-2
Fig. 23.11  The physical packaging of processor clusters and the 3-stage global router in MasPar MP-2.

The Processing Nodes of MasPar MP-2
Fig. 23.12  Processor architecture in MasPar MP-2: X-net and router connections (to router stages 1 and 3) feed a datapath with flags, a barrel shifter, significand and exponent units, an ALU, and logic on a 32-bit word bus and a bit bus, backed by a register file, memory address and memory data/ECC units to external memory, and instruction broadcast, control, and reduction paths.

24  Past, Present, and Future
Put the state of the art in the context of history and future trends:
• Quest for higher performance, and where it is leading us
• Interplay between technology progress and architecture
Topics in This Chapter
24.1  Milestones in Parallel Processing
24.2  Current Status, Issues, and Debates
24.3  TFLOPS, PFLOPS, and Beyond
24.4  Processor and Memory Technologies
24.5  Interconnection Technologies
24.6  The Future of Parallel Processing

24.1  Milestones in Parallel Processing
1840s  Desirability of computing multiple values at the same time was noted in connection with Babbage's Analytical Engine
1950s  von Neumann's and Holland's cellular computers were proposed
1960s  The ILLIAC IV SIMD machine was designed at the University of Illinois, based on the earlier SOLOMON computer
1970s  Vector machines, with deeply pipelined function units, emerged and quickly dominated the supercomputer market; early experiments with shared-memory multiprocessors, message-passing multicomputers, and gracefully degrading systems paved the way for further progress
1980s  Fairly low-cost and massively parallel supercomputers, using hypercube and related interconnection schemes, became available
1990s  Massively parallel computers with mesh or torus interconnection, and using wormhole routing, became dominant
2000s  Commodity and network-based parallel computing took hold, offering speed, flexibility, and reliability for server farms and other areas

24.2  Current Status, Issues, and Debates
Design choices and key debates, with related user concerns:
Architecture: General- or special-purpose system? SIMD, MIMD, or hybrid? (User concerns: cost per MIPS; scalability, longevity)
Interconnection: Shared-medium, direct, or multistage? Custom or commodity? (User concerns: cost per MB/s; expandability, reliability)
Routing: Oblivious or adaptive? Wormhole, packet, or virtual cut-through? (User concerns: speed, efficiency; overhead, saturation limit)
Programming: Shared memory or message passing? New languages or libraries? (User concerns: ease of programming; software portability)

24.3  TFLOPS, PFLOPS, and Beyond
Fig. 24.1  Milestones in the Accelerated Strategic Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level: ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), and ASCI Purple (100+ TFLOPS, 20 TB), each shown in plan, develop, and use phases over the 1995–2010 calendar years.
24.4  Processor and Memory Technologies
Fig. 24.2  Key parts of the CPU in the Intel Pentium Pro microprocessor: a reservation station dispatches operations over five ports, with ports 2, 3, and 4 dedicated to memory access (address generation units, etc.), port 0 feeding integer execution unit 0 together with the shift, floating-point add, multiply, and divide, and integer divide units, and port 1 feeding integer execution unit 1 and the jump execution unit; results retire through the reorder buffer and retirement register file.

24.5  Interconnection Technologies
Fig. 24.3  Changes in the ratio of a 1-cm wire delay to device switching time as the feature size is reduced from 0.5 to 0.13 μm: the ratio climbs with continued use of Al/oxide, motivating a changeover to lower-resistivity wires and low-dielectric-constant insulators (e.g., Cu/polyimide).

Intermodule and Intersystem Connection Options
Fig. 24.4  Various types of intermodule and intersystem connections, plotted by bandwidth (10^3 to 10^12 b/s) against latency (nanoseconds to hours): processor buses and I/O networks within the same geographic location, then system-area networks (SANs), local-area networks (LANs), metro-area networks (MANs), and geographically distributed wide-area networks (WANs), with bandwidth falling and latency rising as reach grows.
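The trade-off in Fig. 24.4 is often captured by the first-order model transfer time ≈ latency + message size / bandwidth. The snippet below evaluates that model; the latency and bandwidth figures in it are rough orders of magnitude assumed for this sketch, not values read off the figure.

// First-order message transfer-time model: t = latency + bytes / bandwidth (assumed sample values).
#include <cstdio>

double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
    return latency_s + bytes / bandwidth_Bps;
}

int main() {
    const double msg = 64.0 * 1024;  // a 64-KB message
    // Assumed order-of-magnitude parameters: latency in seconds, bandwidth in bytes/s.
    std::printf("SAN: %.6f s\n", transfer_time(5e-6,   100e6, msg));  // system-area network
    std::printf("LAN: %.6f s\n", transfer_time(100e-6,  10e6, msg));  // local-area network
    std::printf("WAN: %.6f s\n", transfer_time(50e-3,    1e6, msg));  // wide-area network
}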
System Interconnection Media
Fig. 24.5  The three commonly used media for computer and network connections: twisted pair (copper core in a plastic insulator), coaxial cable (with an outer conductor), and optical fiber (a silica core carrying light from a source by reflection).

24.6  The Future of Parallel Processing
In the early 1990s, it was unclear whether the TFLOPS performance milestone would be reached with a thousand or so GFLOPS nodes or with a million or so simpler MFLOPS processors. Now, a similar question is in front of us for PFLOPS performance.
Asynchronous design, for its speed and greater energy economy: Because pure asynchronous circuits create some difficulties, designs that are locally synchronous and globally asynchronous are flourishing.
Intelligent memories, to help remove the memory wall: Memory chips have high internal bandwidths but are limited by their pins in how much data they can handle per clock cycle.
Reconfigurable systems, for adapting hardware to application needs: Raw hardware that can be configured via "programming" to adapt to application needs may offer the benefits of special-purpose processing.
Network and grid computing, to allow flexible, ubiquitous access: The resources of many small and large computers can be pooled via a network, offering supercomputing capability to anyone.

Parallel Processing for Low-Power Systems
Figure 25.1 of Parhami's computer architecture textbook: performance trends over the 1980–2010 calendar years, from kIPS to TIPS, contrasting absolute processor performance with general-purpose processor performance per watt and DSP performance per watt.

Peak Performance of Supercomputers
Peak supercomputer performance has grown by roughly a factor of 10 every 5 years, from GFLOPS-class machines (Cray X-MP, Cray 2, TMC CM-2) through the TFLOPS level (TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific) to the Earth Simulator, heading toward PFLOPS.
Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399–403, 2004 [Dong04].

Evolution of Computer Performance/Cost
From "Robots After All," by H. Moravec, CACM, pp. 90–97, October 2003: mental power in four scales.

ABCs of Parallel Processing in One Slide
A  Amdahl's Law (Speedup Formula)
Bad news: sequential overhead will kill you, because
Speedup = T1/Tp ≤ 1/[f + (1 – f)/p] ≤ min(1/f, p)
Morale: For f = 0.1, speedup is at best 10, regardless of peak OPS.
B  Brent's Scheduling Theorem
Good news: optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure
T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
Result: For a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p ≪ T1/T∞), good speedup and efficiency are possible.
C  Cost-Effectiveness Adage
Real news: the most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), the greatest speedup (at what cost?), or the best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.

Gordon Bell Prize for High-Performance Computing
Established in 1988 by Gordon Bell, one of the pioneers in the field.
Awards are given annually in several categories (there are small cash awards, but the prestige of winning is the main motivator).
The two key categories are as follows; others have varied over time.
Peak performance: The entrant must convince the judges that the submitted program runs faster than any other comparable application.
Price/performance: The entrant must show that the performance of the application divided by the list price of the smallest system needed to achieve the reported performance is better than that of any other entry.
Different hardware/software combinations have won the competition. Supercomputer performance improvement has outpaced Moore's law; however, progress shows signs of slowing down due to lack of funding. In the past few years, improvements have been due to faster uniprocessors (2004 NRC report, Getting Up to Speed: The Future of Supercomputing).
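As a quick numeric check of the Amdahl and Brent bounds on the "ABCs" slide above, the snippet below plugs in sample values of f, p, T1, and T∞ (the numbers are arbitrary, chosen only to illustrate the formulas).

// Evaluate Amdahl's speedup bound and Brent's scheduling bounds for sample parameters.
#include <algorithm>
#include <cstdio>

int main() {
    // Amdahl: speedup = T1/Tp <= 1 / (f + (1 - f)/p) <= min(1/f, p)
    double f = 0.1, p = 64;
    double amdahl = 1.0 / (f + (1.0 - f) / p);
    std::printf("Amdahl bound for f = %.1f, p = %.0f: %.2f (cap = %.0f)\n",
                f, p, amdahl, std::min(1.0 / f, p));   // about 8.77, capped by 1/f = 10

    // Brent: T1/p <= Tp <= T1/p + Tinf
    double T1 = 1e6, Tinf = 1e3;
    std::printf("Brent bounds for p = %.0f: %.0f <= Tp <= %.0f\n",
                p, T1 / p, T1 / p + Tinf);             // 15625 <= Tp <= 16625
}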