High Performance Computing: An Overview
Alan Edelman
Massachusetts Institute of Technology, Applied Mathematics & Computer Science and AI Labs
(Interactive Supercomputing, Chief Science Officer)

• Not said: many powerful computer owners prefer low profiles

Some historical machines
• Earth Simulator was #1, now #30

Moore's Law
• The number of people who point out that "Moore's Law" is dead is doubling every year.
• Feb 2008: NSF requests $20M for "Science and Engineering Beyond Moore's Law"
  – Ten years out, Moore's law itself may be dead
• Moore's law has various forms and versions, never stated by Moore, but roughly a doubling every 18 months to 2 years:
  – Number of transistors (still good for a while!)
  – Computational power (at risk!)
  – Parallelism!

Multicore and manycore chips
• AMD Opteron quad-core 8350 (Sept 2007); eight cores in 2009?
• Intel Clovertown and Dunnington; six cores later in 2008?
• Sun Niagara 2 (block diagram): eight multithreaded UltraSparc cores, each with an FPU and 8 KB D$, joined by a crossbar switch to a 4 MB shared L2 (16-way, 179 GB/s fill, 90 GB/s write-through); four 128b FBDIMM memory controllers to fully buffered DRAM at 42.7 GB/s read, 21.3 GB/s write; 1.4 GHz; 16 cores in 2008?

Accelerators
• IBM Cell Blade (block diagram): two Cell chips, each with a PPE (512 KB L2) and eight SPEs (256 KB local store plus MFC each) on an EIB ring network; BIF/XDR interfaces, 25.6 GB/s XDR DRAM per chip, well under 20 GB/s each direction between chips
• NVIDIA GPU (block diagram): global thread scheduler feeding 16 SMs (16 KB local memory each), paired SMs sharing address units, 8 KB L1, and constant/texture caches; crossbar?? ring?? interconnect; 128 KB L2 constant & texture cache shared across SMs; six 64b DRAM controllers at 86.4 GB/s to 768 MB GDDR3 device DRAM
• SiCortex: teraflops from milliwatts

Software
"Give me software leverage and a supercomputer, and I shall solve the world's problems." (with apologies to Archimedes)

What's wrong with this story?
• I can't get my five-year-old son off my (serial) computer
• I have access to the world's fastest machines and have nothing cool to show him!

Engineers and Scientists (the leading indicators)
• Mostly work in serial (still!), just like my five-year-old
• Those working in parallel go to conferences and show off speedups

Software: MPI (Message Passing Interface)
• Really thought of as the only choice
• Some say the assembler of parallel computing
• Some say it has allowed code to be portable
• Others say it has held back progress and performance

Old Homework (emphasized for effect)
• Download a parallel program from somewhere. Make it work.
• Download another parallel program. Now, ..., make them work together!
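A minimal sketch of what "the assembler of parallel computing" looks like in practice, assuming the mpi4py Python binding (the slides do not name one): even a global sum means spelling out who sends what to whom.

# Minimal sketch (illustrative only): explicit MPI-style message passing
# via mpi4py. Run under an MPI launcher, e.g. mpiexec -n 4 python sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns a piece of the data; the programmer, not the system,
# decides who communicates with whom and when.
local_value = rank * 10.0

if rank == 0:
    total = local_value
    for source in range(1, size):
        total += comm.recv(source=source, tag=0)  # explicit receives
    print("sum over ranks:", total)
else:
    comm.send(local_value, dest=0, tag=0)         # explicit sends

In practice one would call comm.reduce(local_value, op=MPI.SUM, root=0), but the hand-written pattern above is the level at which MPI programs are usually assembled and debugged.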
Apples and Oranges
• A: row-distributed array (or worse)
• B: column-distributed array (or worse)
• C = A + B

MPI Performance vs. Pthreads
• Professional performance study by Sam Williams
• [Bar charts: GFlop/s (0 to 4) for a naïve single thread, autotuned MPI, and autotuned Pthreads across benchmarks (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, Median) on AMD Opteron and Intel Clovertown]
• MPI may introduce speed bumps on current architectures

MPI-Based Libraries
• Typical sentence: "... we enjoy using parallel computing libraries such as ScaLAPACK."
• What else? "... you know, such as ScaLAPACK."
• And ...? "Well, there is ScaLAPACK."
• (PETSc, SuperLU, MUMPS, Trilinos, ...)
• Very few users, still many bugs, immature
• Highly optimized libraries? Yes and no.

Natural Question may not be the most important
• "How do I parallelize x?"
  – First question many students ask
  – The answer is often either fairly obvious or very difficult
  – Can miss the true issues of high performance:
    • These days people are often good at exploiting locality for performance
    • People are not very good at hiding communication and anticipating data movement to avoid bottlenecks
    • People are not very good at interweaving multiple functions to make the best use of resources
  – Usually misses the issue of interoperability:
    • Will my program play nicely with your program?
    • Will my program really run on your machine?

Real Computations have Dependencies (example: FFT)
• Time wasted on the telephone

Modern Approaches
• Allow users to "wrap up" computations into nice packages, often denoted threads
• Express dependencies among threads
• Threads need not be bound to a processor
• Not really new at all: see Arvind, dataflow, etc.
• Industry has not yet caught up with the damage SPMD and MPI have done
• See transactional memories, streaming languages, etc.

Advantages
• Easier on the programmer
• More productivity
• Allows for autotuning
• Can overlap communication with computation

LU Example

Software
"Give me software leverage and a supercomputer, and I shall solve the world's problems." (with apologies to Archimedes)

New Standards for Quality of Computation
• Associative law: (a+b)+c = a+(b+c)
• Not true in roundoff
• Mostly didn't matter in serial
• Parallel computation reorganizes computation
• Lawyers get very upset!
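A tiny Python sketch (values chosen for illustration, not from the slides) makes the associativity point concrete: the same three numbers summed in two groupings give different IEEE double-precision results.

# Floating-point addition is not associative; the grouping changes the roundoff.
# Illustrative values only.
a, b, c = 1e16, -1e16, 1.0

left  = (a + b) + c   # the large terms cancel exactly first, so c survives -> 1.0
right = a + (b + c)   # c is absorbed into b before the cancellation        -> 0.0

print(left, right, left == right)   # prints: 1.0 0.0 False

A parallel reduction over P processors groups the partial sums differently for different P, so bit-identical answers across runs are not guaranteed, which is exactly the concern on the slide above.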