Optimizing Matrix Multiply - Computer Science Division

CS 267: Introduction to Parallel Machines and Programming Models James Demmel www.cs.berkeley.edu/~demmel/cs267_Spr09 01/28/2009 CS267 Lecture 3 1 Outline • Overview of parallel machines (~hardware) and programming models (~software) • Shared memory • Shared address space • Message passing • Data parallel • Clusters of SMPs • Grid • Parallel machine may or may not be tightly coupled to programming model • Historically, tight coupling • Today, portability is important • Trends in real machines CS267 Lecture 3 01/28/2009 2 A generic parallel architecture Proc Proc Proc Proc Proc Proc Interconnection Network Memory Memory Memory Memory Memory • Where is the memory physically located? • Is it connect directly to processors? • What is the connectivity of the network? 01/28/2009 CS267 Lecture 3 3 Parallel Programming Models • Programming model is made up of the languages and libraries that create an abstract view of the machine • Control • How is parallelism created? • What orderings exist between operations? • Data • What data is private vs. shared? • How is logically shared data accessed or communicated? • Synchronization • What operations can be used to coordinate parallelism? • What are the atomic (indivisible) operations? • Cost • How do we account for the cost of each of the above? 01/28/2009 CS267 Lecture 3 4 Simple Example • Consider applying a function f to the elements of an array A and then computing its sum: n 1  f ( A[i ]) i 0 • Questions: • Where does A live? All in single memory? Partitioned? • What work will be done by each processors? • They need to coordinate to get a single result, how? A = array of all data fA = f(A) s = sum(fA) A: f fA: sum s: 01/28/2009 CS267 Lecture 3 5 Programming Model 1: Shared Memory • Program is a collection of threads of control. • Can be created dynamically, mid-execution, in some languages • Each thread has a set of private variables, e.g., local stack variables • Also a set of shared variables, e.g., static variables, shared common blocks, or global heap. • Threads communicate implicitly by writing and reading shared variables. • Threads coordinate by synchronizing on shared variables Shared memory s s = ... y = ..s ... 01/28/2009 i: 2 i: 5 P0 P1 i: 8 Private memory CS267 Lecture 3 Pn 6 Simple Example • Shared memory strategy: • small number p << n=size(A) processors • attached to single memory • Parallel Decomposition: n 1  f ( A[i ]) i 0 • Each evaluation and each partial sum is a task. • Assign n/p numbers to each of p procs • Each computes independent “private” results and partial sum. • Collect the p partial sums and compute a global sum. Two Classes of Data: • Logically Shared • The original n numbers, the global sum. • Logically Private • The individual function evaluations. • What about the individual partial sums? 01/28/2009 CS267 Lecture 3 7 Shared Memory “Code” for Computing a Sum static int s = 0; Thread 1 for i = 0, n/2-1 s = s + f(A[i]) Thread 2 for i = n/2, n-1 s = s + f(A[i]) • Problem is a race condition on variable s in the program • A race condition or data race occurs when: - two processors (or two threads) access the same variable, and at least one does a write. - The accesses are concurrent (not synchronized) so they could happen simultaneously 01/28/2009 CS267 Lecture 3 8 Shared Memory “Code” for Computing a Sum A= 3 5 f (x) = x2 static int s = 0; Thread 1 …. compute f([A[i]) and put in reg0 reg1 = s reg1 = reg1 + reg0 s = reg1 … 9 0 9 9 Thread 2 … compute f([A[i]) and put in reg0 reg1 = s reg1 = reg1 + reg0 s = reg1 … 25 0 25 25 • Assume A = [3,5], f(x) = x2, and s=0 initially • For this program to work, s should be 32 + 52 = 34 at the end • but it may be 34,9, or 25 • The atomic operations are reads and writes • Never see ½ of one number, but no += operation is not atomic • All computations happen in (private) registers 01/28/2009 CS267 Lecture 3 9 Improved Code for Computing a Sum static int s = 0; static lock lk; Thread 1 Thread 2 local_s1= 0 for i = 0, n/2-1 local_s1 = local_s1 + f(A[i]) lock(lk); s = s + local_s1 unlock(lk); local_s2 = 0 for i = n/2, n-1 local_s2= local_s2 + f(A[i]) lock(lk); s = s +local_s2 unlock(lk); • Since addition is associative, it’s OK to rearrange order • Most computation is on private variables - Sharing frequency is also reduced, which might improve speed - But there is still a race condition on the update of shared s - The race condition can be fixed by adding locks (only one thread can hold a lock at a time; others wait for it) 01/28/2009 CS267 Lecture 3 10 Machine Model 1a: Shared Memory • Processors all connected to a large shared memory. • Typically called Symmetric Multiprocessors (SMPs) • SGI, Sun, HP, Intel, IBM SMPs (nodes of Millennium, SP) • Multicore chips, except that all caches are shared • Difficulty scaling to large numbers of processors • <= 32 processors typical • Advantage: uniform memory access (UMA) • Cost: much cheaper to access data in cache than main memory. P2 P1 $ Pn $ $ bus shared $ memory 01/28/2009 CS267 Lecture 3 Note: $ = cache 11 Problems Scaling Shared Memory Hardware • Why not put more processors on (with larger memory?) • The memory bus becomes a bottleneck • Caches need to be kept coherent • Example from a Parallel Spectral Transform Shallow Water Model (PSTSWM) demonstrates the problem • Experimental results (and slide) from Pat Worley at ORNL • This is an important kernel in atmospheric models • • 99% of the floating point operations are multiplies or adds, which generally run well on all processors But it does sweeps through memory with little reuse of operands, so uses bus and shared memory frequently • These experiments show serial performance, with one “copy” of the code running independently on varying numbers of procs • • 01/28/2009 The best case for shared memory: no sharing But the data doesn’t all fit in the registers/cache CS267 Lecture 3 12 Example: Problem in Scaling Shared Memory • Performance degradation is a “smooth” function of the number of processes. • No shared data between them, so there should be perfect parallelism. • (Code was run for a 18 vertical levels with a range of horizontal sizes.) 01/28/2009 CS267 Lecture 3 From Pat Worley, ORNL 13 Machine Model 1b: Multithreaded Processor • Multiple thread “contexts” without full processors • Memory and some other state is shared • Sun Niagra processor (for servers) • Up to 64 threads all running simultaneously (8 threads x 8 cores) • In addition to sharing memory, they share floating point units • Why? Switch between threads for long-latency memory operations • Cray MTA and Eldorado processors (for HPC) T0 T1 Tn shared $, shared floating point units, etc. Memory 01/28/2009 CS267 Lecture 3 14 Eldorado Processor (logical view) 1 2 i =n . . . 3 i =2 i =1 Programs running in parallel i =n Subproblem A i =3 4 . Su bproblem B . . Serial Code i =1 i =0 Concurrent threads of computation Subproblem A .... Hardware stream s (128) Unused streams Instruction Ready Pool; Pipeline of executing instructions 01/28/2009 CS267 Lecture 3 Source: John Feo, Cray 15 Machine Model 1c: Distributed Shared Memory • Memory is logically shared, but physically distributed • Any processor can access any address in memory • Cache lines (or pages) are passed around machine • SGI Origin is canonical example (+ research machines) • Scales to 512 (SGI Altix (Columbia) at NASA/Ames) • Limitation is cache coherency protocols – how to keep cached copies of the same address consistent P2 P1 $ Pn $ $ network memory memory 01/28/2009 memory CS267 Lecture 3 Cache lines (pages) must be large to amortize overhead  locality still critical to performance 16 Programming Model 2: Message Passing • Program consists of a collection of named processes. • Usually fixed at program startup time • Thread of control plus local address space -- NO shared data. • Logically shared data is partitioned over local processes. • Processes communicate by explicit send/receive pairs • Coordination is implicit in every communication event. • MPI (Message Passing Interface) is the most commonly used SW Private memory s: 12 s: 14 s: 11 receive Pn,s y = ..s ... i: 2 i: 3 P0 P1 i: 1 send P1,s Pn Network 01/28/2009 CS267 Lecture 3 17 Computing s = A[1]+A[2] on each processor ° First possible solution – what could go wrong? Processor 1 xlocal = A[1] send xlocal, proc2 receive xremote, proc2 s = xlocal + xremote Processor 2 xlocal = A[2] send xlocal, proc1 receive xremote, proc1 s = xlocal + xremote ° If send/receive acts like the telephone system? The post office? ° Second possible solution Processor 1 xlocal = A[1] send xlocal, proc2 receive xremote, proc2 s = xlocal + xremote Processor 2 xlocal = A[2] receive xremote, proc1 send xlocal, proc1 s = xlocal + xremote ° What if there are more than 2 processors? 01/28/2009 CS267 Lecture 3 18 MPI – the de facto standard MPI has become the de facto standard for parallel computing using message passing Pros and Cons of standards • MPI created finally a standard for applications development in the HPC community  portability • The MPI standard is a least common denominator building on mid-80s technology, so may discourage innovation Programming Model reflects hardware! “I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere” – HDS 2001 01/28/2009 CS267 Lecture 3 19 Machine Model 2a: Distributed Memory • Cray T3E, IBM SP2 • PC Clusters (Berkeley NOW, Beowulf) • IBM SP-3, Millennium, CITRIS are distributed memory machines, but the nodes are SMPs. • Each processor has its own memory and cache but cannot directly access another processor’s memory. • Each “node” has a Network Interface (NI) for all communication and synchronization. P0 memory NI P1 memory NI Pn ... NI memory interconnect 01/28/2009 CS267 Lecture 3 20 PC Clusters: Contributions of Beowulf • An experiment in parallel computing systems • Established vision of low cost, high end computing • Demonstrated effectiveness of PC clusters for some (not all) classes of applications • Provided networking software • Conveyed findings to broad community (great PR) • Tutorials and book • Design standard to rally community! • Standards beget: books, trained people, software … virtuous cycle Adapted from Gordon Bell, presentation at Salishan 2000 CS267 Lecture 3 01/28/2009 21 Tflop/s and Pflop/s Clusters The following are examples of clusters configured out of separate networks and processor components • 82% of Top 500 (Nov 2008, up from 72% in 2005), • 3 of top 10 • IBM Cell cluster at Los Alamos (Roadrunner) is #1 • 12,960 Cell chips + 6,948 dual-core AMD Opterons; • 129600 cores altogether • 1.45 PFlops peak, 1.1PFlops Linpack, 2.5MWatts • Infiniband connection network • In 2005 Walt Disney Feature Animation (The Hive) was #96 • 1110 Intel Xeons @ 3 GHz • Gigabit Ethernet • For more details use “database/sublist generator” at www.top500.org 01/28/2009 CS267 Lecture 3 22 Machine Model 2b: Internet/Grid Computing • SETI@Home: Running on 500,000 PCs • ~1000 CPU Years per Day • 485,821 CPU Years so far • Sophisticated Data & Signal Processing Analysis • Distributes Datasets from Arecibo Radio Telescope Next StepAllen Telescope Array 01/28/2009 CS267 Lecture 4 23 Programming Model 2a: Global Address Space • Program consists of a collection of named threads. • • • • Usually fixed at program startup time Local and shared data, as in shared memory model But, shared data is partitioned over local processes Cost models says remote data is expensive • Examples: UPC, Titanium, Co-Array Fortran • Global Address Space programming is an intermediate point between message passing and shared memory Shared memory s[0]: 26 s[1]: 32 i: 1 i: 5 P0 P1 s[n]: 27 y = ..s[i] ... 01/28/2009 Private memory CS267 Lecture 3 i: 8 Pn s[myThread] = ... 24 Machine Model 2c: Global Address Space • Cray T3D, T3E, X1, and HP Alphaserver cluster • Clusters built with Quadrics, Myrinet, or Infiniband • The network interface supports RDMA (Remote Direct Memory Access) • NI can directly access memory without interrupting the CPU • One processor can read/write memory with one-sided operations (put/get) • Not just a load/store as on a shared memory machine • Continue computing while waiting for memory op to finish • Remote data is typically not cached locally P0 memory NI P1 memory NI Pn ... memory NI Global address space may be supported in varying degrees interconnect 01/28/2009 CS267 Lecture 3 25 Programming Model 3: Data Parallel • Single thread of control consisting of parallel operations. • Parallel operations applied to all (or a defined subset) of a data structure, usually an array • • • • Communication is implicit in parallel operators Elegant and easy to understand and reason about Coordination is implicit – statements executed synchronously Similar to Matlab language for array operations • Drawbacks: • Not all problems fit this model • Difficult to map onto coarse-grained machines A = array of all data fA = f(A) s = sum(fA) A: f fA: sum s: 01/28/2009 CS267 Lecture 3 26 Machine Model 3a: SIMD System • A large number of (usually) small processors. • A single “control processor” issues each instruction. • Each processor executes the same instruction. • Some processors may be turned off on some instructions. • Originally machines were specialized to scientific computing, few made (CM2, Maspar) • Programming model can be implemented in the compiler • mapping n-fold parallelism to p processors, n >> p, but it’s hard (e.g., HPF) control processor P1 memory NI P1 memory NI P1 memory NI ... P1 memory NI P1 NI memory interconnect 01/28/2009 CS267 Lecture 3 27 Machine Model 3b: Vector Machines • Vector architectures are based on a single processor • Multiple functional units • All performing the same operation • Instructions may specific large amounts of parallelism (e.g., 64way) but hardware executes only a subset in parallel • Historically important • Overtaken by MPPs in the 90s • Re-emerging in recent years • At a large scale in the Earth Simulator (NEC SX6) and Cray X1 • At a small scale in SIMD media extensions to microprocessors • • • SSE, SSE2 (Intel: Pentium/IA64) Altivec (IBM/Motorola/Apple: PowerPC) VIS (Sun: Sparc) • At a larger scale in GPUs • Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to CS267 Lecture 3 28 01/28/2009 Vector Processors • Vector instructions operate on a vector of elements • These are specified as operations on vector registers r1 r2 … vr1 + vr2 + r3 … (logically, performs # elts adds in parallel) … vr3 • A supercomputer vector register holds ~32-64 elts • The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4 • The hardware performs a full vector operation in • #elements-per-vector-register / #pipes vr1 … + 01/28/2009 … vr2 + + ++ + CS267 Lecture 3 (actually, performs #pipes adds in parallel) 29 Cray X1: Parallel Vector Architecture Cray combines several technologies in the X1 • • • • • 12.8 Gflop/s Vector processors (MSP) Shared caches (unusual on earlier vector machines) 4 processor nodes sharing up to 64 GB of memory Single System Image to 4096 Processors Remote put/get between nodes (faster than MPI) 01/28/2009 CS267 Lecture 3 31 Earth Simulator Architecture Parallel Vector Architecture • High speed (vector) processors • High memory bandwidth (vector architecture) • Fast network (new crossbar switch) Rearranging commodity parts can’t match this performance 01/28/2009 CS267 Lecture 3 32 Programming Model 4: Hybrids • These programming models can be mixed • Message passing (MPI) at the top level with shared memory within a node is common • New DARPA HPCS languages mix data parallel and threads in a global address space • Global address space models can (often) call message passing libraries or vice verse • Global address space models can be used in a hybrid mode • Shared memory when it exists in hardware • Communication (done by the runtime system) otherwise • For better or worse…. 01/28/2009 CS267 Lecture 3 33 Machine Model 4: Clusters of SMPs • SMPs are the fastest commodity machine, so use them as a building block for a larger machine with a network • Common names: • CLUMP = Cluster of SMPs • Hierarchical machines, constellations • Many modern machines look like this: • Millennium, IBM SPs, ASCI machines • What is an appropriate programming model #4 ??? • Treat machine as “flat”, always use message passing, even within SMP (simple, but ignores an important part of memory hierarchy). • Shared memory within one SMP, but message passing outside of an SMP. 01/28/2009 CS267 Lecture 3 34 Outline • Overview of parallel machines and programming models • Shared memory • Shared address space • Message passing • Data parallel • Clusters of SMPs • Trends in real machines (www.top500.org) 01/28/2009 CS267 Lecture 3 35 TOP500 - Listing of the 500 most powerful Computers in the World - Yardstick: Rmax from Linpack Ax=b, dense problem - Updated twice a year: Rate TPP performance ISC‘xy in Germany, June xy SC‘xy in USA, November xy Size - All data available from www.top500.org 01/28/2009 CS267 Lecture 3 36 EXTRA SLIDES (TOP 500 FROM NOV 2007) 01/28/2009 CS267 Lecture 3 37 30th List: The TOP10 Manufacturer Computer BlueGene/L Rmax Installation Site Country 478.2 DOE/NNSA/LLNL USA 2007 212,992 167.3 Forschungszentrum Juelich Germany 2007 65,536 USA 2007 14,336 India 2007 14,240 Sweden 2007 13,728 [TF/s] 1 IBM 2 IBM 3 SGI 4 HP Cluster Platform 3000 BL460c 117.9 5 HP Cluster Platform 3000 BL460c 102.8 6 Sandia/Cray 7 Cray 8 IBM 9 Cray 10 IBM eServer Blue Gene JUGENE BlueGene/P Solution SGI Altix ICE 8200 Red Storm Cray XT3 Jaguar Cray XT3/XT4 BGW eServer Blue Gene Franklin Cray XT4 New York Blue eServer Blue Gene 126.9 New Mexico Computing Applications Center Computational Research Laboratories, TATA SONS Swedish Government Agency Year #Cores 102.2 DOE/NNSA/Sandia USA 2006 26,569 101.7 DOE/ORNL USA 2007 23,016 91.29 IBM Thomas Watson USA 2005 40,960 85.37 NERSC/LBNL USA 2007 19,320 82.16 Stony Brook/BNL USA 2007 36,864 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 38 Competition between Manufacturers, Countries and Sites Top500 Status 1st Top500 List in June 1993 at ISC’93 in Mannheim • • • • 29th Top500 List on June 27, 2007 at ISC’07 in Dresden 30th Top500 List on November 13, 2007 at SC07 in Reno 31st Top500 List on June 18, 2008 at ISC’08 in Dresden 32nd Top500 List on November 18, 2008 at SC08 in Austin 33rd Top500 List on June 24, 2009 at ISC’09 in Hamburg Acknowledged by HPC-users, manufacturers and media Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 39 Competition between Manufacturers, Countries and Sites 1st List as of 06/1993 Countries 30th List as of 11/2007 Count Share % Countries Count Share % USA 225 45.00 % USA 283 56.60% Japan 111 22.20 % Japan 20 4.00 % Germany 59 11.80 % Germany 31 6.20 % France 26 5.20 % France 17 3.40 % United Kingdom 25 5.00 % United Kingdom 48 9.60 % Australia 9 1.80 % Australia 1 0.20 % Italy 6 1.20 % Italy 6 1.20 % Netherlands 6 1.20 % Netherlands 6 1.20 % Switzerland 4 0.80 % Switzerland 7 1.40 % Canada 3 0.60 % Canada 5 1.00 % Denmark 3 0.60 % Denmark 1 0.20 % Korea 3 0.60 % Korea 1 0.20 % Others 20 4.00 % China 10 2.00 % Totals 500 100 % India 9 1.80 % Others 55 11.00 % Totals 500 100 % Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 40 Competition between Manufacturers, Countries and Sites Countries/Systems 500 Others China 400 Korea Italy 300 France 200 UK Germany 100 Japan US 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 41 Competition between Manufacturers, Countries and Sites 1st List as of 06/1993 30th List as of 11/2007 Manufacturers Count Share % Manufacturers Count Share % Cray Research 205 41.00 % Cray Inc. 14 2.80 % Fujitsu 69 13.80 % Fujitsu 3 0.60 % Thinking Machines 54 10.80 % Thinking Machines Intel 44 8.80 % Intel Convex 36 7.20 % Hewlett-Packard NEC Kendall Square Research MasPar 32 6.40 % 21 4.20 % — — 18 3.60 % NEC Kendall Square Research MasPar — — Meiko 9 1.80 % Meiko — — Hitachi 6 1.20 % Parsytec 3 0.60 % nCube 3 0.60 % Totals 500 100 % — Hitachi/Fujitsu — 1 0.20 % 166 33.20 % 2 0.40 % 1 0.20 % Parsytec — — nCube — — IBM 232 46.40 % SGI 22 4.40 % Dell 24 4.80 % Others 35 7.00 % Totals 500 100 % Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 42 Competition between Manufacturers, Countries and Sites Manufacturer/Systems 500 others Hitachi NEC Fujitsu Intel TMC HP Sun IBM SGI Cray 400 300 200 100 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 43 Competition between Manufacturers, Countries and Sites Top Sites through 30 Lists Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Site Lawrence Livermore National Laboratory Sandia National Laboratories Los Alamos National Laboratory Government The Earth Simulator Center National Aerospace Laboratory of Japan Oak Ridge National Laboratory NCSA NASA/Ames Research Center/NAS University of Tokyo NERSC/LBNL Pittsburgh Supercomputing Center Semiconductor Company (C) Naval Oceanographic Office (NAVOCEANO) ECMWF ERDC MSRC IBM Thomas J. Watson Research Center Forschungszentrum Juelich (FZJ) Japan Atomic Energy Research Institute Minnesota Supercomputer Center Country United States United States United States United States Japan Japan United States United States United States Japan United States United States United States United States United Kingdom United States United States Germany Japan United States Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 % over time 5.39 3.70 3.41 3.34 1.99 1.70 1.39 1.31 1.25 1.21 1.19 1.15 1.11 1.08 1.02 0.91 0.86 0.84 0.83 0.74 page 44 My Supercomputer Favorite in the Top500 Lists Number of Teraflop/s Systems in the Top500 500 100 10 1 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 45 The 30th List as of November 2007 Processor Architecture/Systems 500 400 SIMD 300 Vector 200 Scalar 100 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 page 46 The 30th List as of November 2007 Operating Systems/Systems 500 Windows 400 Mac OS N/A 300 Mixed 200 BSD Based Linux 100 Unix Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 page 47 The 30th List as of November 2007 Processor Generation/Systems (November 2007) Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 48 The 30th List as of November 2007 Interconnect Family/Systems 500 Others Quadrics 400 Proprietary Fat Tree 300 Infiniband Cray Interconnect Myrinet 200 SP Switch Gigabit Ethernet 100 Crossbar N/A 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 49 The 30th List as of November 2007 Architectures / Systems 500 SIMD 400 Single Proc. 300 SMP 200 Const. Cluster 100 MPP 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 50 Performance Development and Projection Performance Development 6.97 PF/s 1 Pflop/s 478.2 TF/s 100 Tflop/s 10 Tflop/s 1 Tflop/s BlueGene/L NEC 1.167 TF/s N=1 Earth Simulator 5.9 TF/s IBM ASCI White 59.7 GF/s Intel ASCI Red LLNL Sandia 100 Gflop/s Jack‘s Notebook Fujitsu 'NWT' NAL 10 Gflop/s 1 Gflop/s IBM SUM N=500 0.4 GF/s Jack‘s Notebook Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 100 Mflop/s page 51 Performance Development and Projection Performance Projection 1 Eflop/s 1 Pflop/s 100 Pflop/s 10 Pflop/s 1 Pflop/s 100 Tflop/s 10 Tflop/s 1 Tflop/s 100 Gflop/s 10 Gflop/s 1 Gflop/s SUM Jack‘s Notebook 6-8 years N=1 8-10 years N=500 Jack‘s Notebook 100 Mflop/s 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 52 Analysis of TOP500 Data Annual performance growth about a factor of 1.82 Two factors contribute almost equally to the annual total performance growth Processor number grows per year on the average by a factor of 1.30 and the Processor performance grows by 1.40 compared to 1.58 of Moore's Law Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp 1517-1544. Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 53 Bell‘s Law Bell's Law of Computer Class formation was discovered about 1972. It states that technology advances in semiconductors, storage, user interface and networking advance every decade enable a new, usually lower priced computing platform to form. Once formed, each class is maintained as a quite independent industry structure. This explains mainframes, minicomputers, workstations and Personal computers, the web, emerging web services, palm and mobile devices, and ubiquitous interconnected networks. We can expect home and body area networks to follow this path. From Gordon Bell (2007), http://research.microsoft.com/~GBell/Pubs.htm Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 54 Bell‘s Law Bell’s Law states, that: Important classes of computer architectures come in cycles of about 10 years. It takes about a decade for each phase : Early research Early adoption and maturation Prime usage Phase out past its prime Can we use Bell’s Law to classify computer architectures in the TOP500? Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 55 Bell‘s Law Gordon Bell (1972): 10 year cycles for computer classes Computer classes in HPC based on the TOP500: Data Parallel Systems: Vector (Cray Y-MP and X1, NEC SX, …) SIMD (CM-2, …) Custom Scalar Systems: MPP (Cray T3E and XT3, IBM SP, …) Scalar SMPs and Constellations (Cluster of big SMPs) Commodity Cluster: NOW, PC cluster, Blades, … Power-Efficient Systems (BG/L or BG/P as first example of lowpower / embedded systems = potential new class ?) Tsubame with Clearspeed, Roadrunner with Cell ? Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 56 Bell‘s Law Computer Classes / Systems 500 BG/L or BG/P 400 Commodity Cluster 300 200 Custom Scalar 100 Vector/SIMD Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 page 57 Bell‘s Law Computer Classes - refined / Systems 500 BG/L or BG/P Commodity Cluster 400 300 200 SMP/ Constellat. MPP 100 SIMD Vector Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 page 58 Bell‘s Law HPC Computer Classes Class Early Adoption starts: Prime Use starts: Past Prime Usage starts: Data Parallel Systems Mid 70’s Mid 80’s Mid 90’s Custom Scalar Systems Mid 80’s Mid 90’s Mid 2000’s Commodity Cluster Mid 90’s Mid 2000’s Mid 2010’s ??? BG/L or BG/P Mid 2000’s Mid 2010’s ??? Mid 2020’s ??? Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 59 Supercomputing, quo vadis? Countries/Systems (November 2007) Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 60 Supercomputing, quo vadis? Countries/Performance (November 2007) Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 61 Supercomputing, quo vadis? Countries/Systems (November 2007) TOP50 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 62 Supercomputing, quo vadis? Continents/Systems 500 400 Others 300 Asia Europe 200 Americas 100 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 page 63 Supercomputing, quo vadis? Asian Countries / Systems 120 100 80 Others 60 India 40 China 20 Korea, South Japan Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 page 64 Supercomputing, quo vadis? Producing Regions / Systems Others Europe 500 Japan 400 300 USA 200 100 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 0 page 65 Top500, quo vadis? www.top500.org Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 66 Top500, quo vadis? Summary after Fifteen Years of Experience TOP500 proofed itself by correcting the Mannheim Supercomputer Statistics. 150 100 50 Nov 07 Nov 06 Nov 05 Nov 04 Nov 03 Nov 02 Nov 01 Nov 00 Nov 99 Nov 98 Nov 97 0 Nov 96 This smoothes seasonal fluctuation. 200 Nov 95 It’s inventory based. 250 Nov 94 It does not easily allow to track market size 300 Nov 93 It is simplistic, but (or because of it) it gets trends right. Replacement Rate Turn-over very high, so it still reflects recent developments. Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 67 Top500, quo vadis? Motivation for Additional Benchmarks Linpack Benchmark Pros One number Simple to define and rank Allows problem size to change with machine and over time Allowing Competitions Clearly need something more than Linpack HPC Challenge Benchmark and others Cons Emphasizes only “peak” CPU speed and number of CPUs Does not stress local bandwidth Does not stress the network No single number can reflect overall performance Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 68 Top500, quo vadis? Presented at ISC’06 June 27-30 2006 Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 69 Top500, quo vadis? Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 page 70 Top500, quo vadis? http://www.green500.org Green500 Rank 1 MFLOPS/W Site Computer 357.23 Blue Gene/P Solution 2 352.25 3 346.95 Science and Technology Facilities Council - Daresbury Laboratory Max-Planck-Gesellschaft MPI/IPP IBM - Rochester 4 336.21 5 310.93 6 210.56 7 210.56 8 210.56 9 210.56 10 210.56 Forschungszentrum Juelich (FZJ) Oak Ridge National Laboratory Harvard University High Energy Accelerator Research Organization /KEK IBM - Almaden Research Center IBM Research IBM Thomas J. Watson Research Center Blue Gene/P Solution Blue Gene/P Solution Blue Gene/P Solution Blue Gene/P Solution eServer Blue Gene Solution eServer Blue Gene Solution eServer Blue Gene Solution eServer Blue Gene Solution eServer Blue Gene Solution Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007 Total Power (kW) 31.10 TOP500 Rank 121 62.20 40 124.40 24 497.60 2 70.47 41 44.80 170 44.80 171 44.80 172 44.80 173 44.80 174 page 71 Summary • Historically, each parallel machine was unique, along with its programming model and programming language. • It was necessary to throw away software and start over with each new kind of machine. • Now we distinguish the programming model from the underlying machine, so we can write portably correct codes that run on many machines. • MPI now the most portable option, but can be tedious. • Writing portably fast code requires tuning for the architecture. • Algorithm design challenge is to make this process easy. • Example: picking a blocksize, not rewriting whole algorithm. 01/28/2009 CS267 Lecture 4 72 Reading Assignment • Extra reading for today • Cray X1 http://www.sc-conference.org/sc2003/paperpdfs/pap183.pdf • Clusters http://www.mirror.ac.uk/sites/www.beowulf.org/papers/ICPP95/ • "Parallel Computer Architecture: A Hardware/Software Approach" by Culler, Singh, and Gupta, Chapter 1. • Next week: Current high performance architectures • Shared memory (for Monday) • Memory Consistency and Event Ordering in Scalable SharedMemory Multiprocessors, Gharachorloo et al, Proceedings of the International symposium on Computer Architecture, 1990. • Or read about the Altix system on the web (www.sgi.com) • Blue Gene L (for Wednesday) • http://sc-2002.org/paperpdfs/pap.pap207.pdf 01/28/2009 CS267 Lecture 4 73 Open Source Software Model for HPC • Linus's law, named after Linus Torvalds, the creator of Linux, states that "given enough eyeballs, all bugs are shallow". • All source code is “open” • Everyone is a tester • Everything proceeds a lot faster when everyone works on one code (HPC: nothing gets done if resources are scattered) • Software is or should be free (Stallman) • Anyone can support and market the code for any price • Zero cost software attracts users! • Prevents community from losing HPC software (CM5, T3E) 01/28/2009 CS267 Lecture 4 74 Cluster of SMP Approach • A supercomputer is a stretched high-end server • Parallel system is built by assembling nodes that are modest size, commercial, SMP servers – just put more of them together Image from LLNL 01/28/2009 CS267 Lecture 4 75

Optimizing Matrix Multiply - Computer Science Division

Related documents

Products

Support

Optimizing Matrix Multiply - Computer Science Division

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib