PACT 98
http://www.research.microsoft.com/barc/gbell/pact.ppt

PACT: What architectures? Compilers? Run-time environments? Programming models? … Any apps?
Parallel Architectures and Compilation Techniques, Paris, 14 October 1998
Gordon Bell, Microsoft

Talk plan
- Where are we today?
- History… predicting the future
  – Ancient
  – Strategic Computing Initiative and ASCI
  – Bell Prize since 1987
  – Apps & architecture taxonomy
- Petaflops: when, … how, how much
- New ideas: Grid, Globus, Legion
- Bonus: input to Thursday panel

1998: ISVs, buyers, & users?
- Technical: supers dying; DSM (and SMPs) trying
  – Mainline: user & ISV apps ported to PCs & workstations
  – Supers (legacy code) market lives on ...
  – Vector apps (e.g. ISVs) ported to DSM (& SMP)
  – MPI for custom codes and a few leading-edge ISVs
  – Leading-edge, one-of-a-kind apps: clusters of 16, 256, ... 1000s built from uni, SMP, or DSM nodes
- Commercial: mainframes, SMPs (& DSMs), and clusters are interchangeable (control is the issue)
  – Dbase & TP: SMPs compete with mainframes if central control is an issue, else clusters
  – Data warehousing: may emerge… just a Dbase
  – High-growth web and stream servers: clusters have the advantage

c2000 Architecture Taxonomy
- Mainline "multi" SMP: Xpt-connected SMPs, Xpt-SMP vector, Xpt-multithread (Tera)
- Multicomputers aka clusters … MPP, 16-(64)-10K processors: Xpt-"multi" hybrid, DSM-SCI (commodity), DSM (high bandwidth)
- Built from commodity "multis" & mainline switches, proprietary "multis" & switches, or proprietary DSMs

TOP500 technical systems by vendor (sans PC and mainframe clusters)
[Chart: 0-500 systems, Jun-93 through Jun-98; vendors include CRI, SGI, Convex, IBM, HP, Sun, TMC, Intel, DEC, and other Japanese.]

Parallelism of jobs: 20 weeks of data, March 16 - Aug 2, 1998, on the NCSA Origin cluster; 15,028 jobs, 883,777 CPU-hrs
[Charts: share of jobs and share of CPU time delivered, by job size of 1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, and 65-128 CPUs.]

How are users using the Origin array?
[Chart: CPU-hours delivered by memory per CPU (MB) and number of CPUs.]

National academic community large project requests, September 1998
- Over 5 million NUs requested (vector, DSM, MPP)
- One NU = one XMP processor-hour
- Source: National Resource Allocation Committee

GB's estimate of parallelism in engineering & scientific applications (Gordon's WAG)
[Chart: log(# apps) vs. granularity & degree of coupling (comp./comm.), spanning PCs and WSs through supers to clusters aka MPPs aka multicomputers (scalable multiprocessors): scalar 60%, vector 15%, vector & // 5%, one-of >>// 5%, embarrassingly & perfectly parallel 15%; vector codes are the dusty decks for supers, the parallel end is new or scaled-up apps.]

Application Taxonomy
- Technical
  – General-purpose, non-parallelizable codes (PCs have it!)
  – Vectorizable
  – Vectorizable & //able (supers & small DSMs)
  – Hand-tuned, one-of
  – MPP coarse grain
  – MPP embarrassingly // (clusters of PCs...)
- Commercial: if central control & rich, then IBM or large SMPs, else PC clusters
  – Database
  – Database/TP
  – Web host
  – Stream audio/video
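One way to read the "granularity & degree of coupling (comp./comm.)" axis above is as a computation-to-communication ratio. The following back-of-the-envelope sketch in C is purely illustrative; every constant is a hypothetical placeholder rather than a number from the talk.

```c
/* Illustrative sketch (not from the talk): a back-of-the-envelope
 * computation-to-communication ratio for one subdomain of a 3-D
 * halo-exchange code -- the kind of granularity the
 * "comp./comm." axis refers to.  All constants are hypothetical.
 */
#include <stdio.h>

int main(void)
{
    double n = 100.0;              /* subdomain edge length, cells per side       */
    double flops_per_cell = 50.0;  /* arithmetic done per cell per step           */
    double bytes_per_cell = 8.0;   /* halo data exchanged per face cell           */
    double flops_per_sec = 200e6;  /* assumed per-node compute rate (200 Mflops)  */
    double bytes_per_sec = 100e6;  /* assumed per-node network bandwidth          */

    double comp_time = (n * n * n * flops_per_cell) / flops_per_sec;
    double comm_time = (6.0 * n * n * bytes_per_cell) / bytes_per_sec;

    printf("comp/comm time ratio per step: %.1f\n", comp_time / comm_time);
    /* Compute grows as n^3 while halo traffic grows as n^2, so larger
     * subdomains mean coarser granularity and looser coupling.         */
    return 0;
}
```

The only point of the arithmetic is that per-step computation grows as n^3 while communication grows as n^2, which is why coarse-grained, loosely coupled applications sit at the cluster-friendly end of the chart.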
One-processor performance as % of Linpack
[Chart: per-processor Linpack and application-average performance for the T90, C90, SPP2000, SP2/160, Origin 195, and PCA; application averages (CFD, biomolecular, chemistry, materials, QCD) run roughly 14%-33% of one-processor Linpack.]

10-processor Linpack (Gflops); 10-processor apps (x10); apps as % of 1-P and 10-P Linpack
[Chart: same machines -- T90, C90, SPP, SP2/160, Origin 195, PCA (Gordon's WAG).]

Ancient history

Growth in computational resources used for UK weather forecasting
[Chart: log scale, 1950-2000, from Leo and Mercury through the KDF9, 195, 205, and YMP, heading toward 10T: roughly 10^10 in 50 years, i.e. 1.58^50, about 58% per year.]

Harvard Mark I aka IBM ASCC

"I think there is a world market for maybe five computers."
-- Thomas Watson Senior, Chairman of IBM, 1943

The scientific market is still about that size… 3 computers
- When scientific processing was 100% of the industry, that was a good predictor
- $3 billion: 6 vendors, 7 architectures
- DOE buys 3 very big ($100-$200M) machines every 3-4 years

NCSA cluster of 6 x 128-processor SGI Origins

Our tax dollars at work: ASCI for stockpile stewardship
- Intel/Sandia: 9000 x 1-node PPro
- LLNL/IBM: 512 x 8 PowerPC (SP2)
- LANL/Cray: ?
- Maui Supercomputer Center: 512 x 1 SP2

"LARC doesn't need 30,000 words!" -- von Neumann, 1955
"During the review, someone said: 'von Neumann was right. 30,000 words was too much IF all the users were as skilled as von Neumann ... for ordinary people, 30,000 was barely enough!'" -- Edward Teller, 1995
The memory was approved. Memory solves many problems!

"Parallel processing computer architectures will be in use by 1975."
-- Navy Delphi Panel, 1969

"In Dec. 1995 computers with 1,000 processors will do most of the scientific processing."
-- Danny Hillis, 1990 (1 paper or 1 company)

The Bell-Hillis bet: massive parallelism in 1995
- TMC vs. world-wide supers, compared on applications, petaflops/month, and revenue

Bell-Hillis bet: wasn't paid off!
- My goal was not necessarily to just win the bet!
- Hennessy and Patterson were to evaluate what was really happening…
- Wanted to understand the degree of MPP progress and programmability

DARPA, 1985 Strategic Computing Initiative (SCI)
- "A 50X LISP machine" -- Tom Knight, Symbolics
- "A Teraflops by 1995"
- "A 1,000-node multiprocessor" -- Gordon Bell, Encore
- All of ~20 HPCC projects failed!

SCI (c1980s): the Strategic Computing Initiative funded:
ATT/Columbia (Non-Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Encore, ESL, GE (like Connection Machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (Dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, Thinking Machines (Connection Machine).

Those who gave up their lives in SCI's search for parallelism:
Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC (independent of ETA), Cogent, Culler, Cydrome, Denelcor, Elxsi, ETA, Evans & Sutherland Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, Multiflow, Myrias, Pixar, Prisma, Saxpy, SCS, Supertek (part of Cray), Suprenum (German national effort), Stardent (Ardent + Stellar), Supercomputer Systems Inc., Synapse, Vitec, Vitesse, Wavetracer.

Worlton: the "bandwagon effect" explains massive parallelism
- Bandwagon: a propaganda device by which the purported acceptance of an idea ... is claimed in order to win further public acceptance.
- Pullers: vendors, CS community
- Pushers: funding bureaucrats & the deficit
- Riders: innovators and early adopters
- 4 flat tires: training, system software, applications, and "guideposts"
- Spectators: most users, 3rd-party ISVs

Parallel processing is a constant distance away.
"Our vision ... is a system of millions of hosts… in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects."
-- Grimshaw, Wulf, et al., "Legion," CACM, Jan. 1997

Progress
"Parallelism is a journey."* (*Paul Borrill)

Let us not forget: "The purpose of computing is insight, not numbers." -- R. W. Hamming

Progress 1987-1998

Bell Prize peak Gflops vs. time
[Chart: log scale, 0.1 to 1000 Gflops, 1986-2000.]

Bell Prize: 1000x, 1987-1998
- 1987 Ncube, 1,000 computers: showed that with more memory, apps scaled
- 1987 Cray XMP, 4 proc. @ 200 Mflops/proc.
- 1996 Intel, 9,000 proc. @ 200 Mflops/proc.
- 1998: 600 RAP Gflops Bell Prize
- Parallelism gains
  – 10x in parallelism over the Ncube
  – 2000x in parallelism over the XMP
- Spend 2-4x more
- Cost effectiveness: 5x (ECL to CMOS; SRAM to DRAM)
- Moore's Law = 100x
- Clock: 2-10x; CMOS-ECL speed cross-over

No more 1000x/decade
- We are now (hopefully) only limited by Moore's Law and not limited by memory access
- 1 GF to 10 GF took 2 years
- 10 GF to 100 GF took 3 years
- 100 GF to 1 TF took >5 years
- 2n+1 or 2^(n-1)+1?

Commercial perf/$: $/tpmC vs. time
[Chart: $/tpmC falling from ~$1,000 toward $10, Mar-94 to Jun-97: 250%/year improvement!]

Commercial performance: tpmC vs. time
[Chart: tpmC rising from ~100 toward 100,000, Mar-94 to Jun-97: 250%/year improvement!]

1998 observations vs. 1989 predictions for technical computing
- Got a Tflops PAP 12/1996 vs. 1995. Really impressive progress! (RAP < 1 TF)
- More diversity… results in NO software!
  – Predicted: SIMD, mC, hoped-for scalable SMP
  – Got: supers, mCv, mC, SMP, SMP/DSM; SIMD disappeared
- $3B (un-profitable?) industry; 10 platforms
- PCs and workstations diverted users
- MPP apps DID NOT materialize

Observation: CMOS supers replaced ECL in Japan
- 2.2 Gflops vector units have dual use
  – in traditional mPv supers
  – as the basis for computers in mC
- Software apps are present
- A vector processor out-performs n micros for many scientific apps
- It's memory bandwidth, cache prediction, and inter-communication

Observation: price & performance
- Breaking the $30M barrier increases PAP
- Eliminating "state computers" increased prices, but got fewer, more committed suppliers, less variation, and more focus
- Commodity micros aka Intel are critical to improvement. DEC, IBM, and Sun are ??
- Conjecture: supers and MPPs may be equally cost-effective despite PAP
  – Memory bandwidth determines performance & price
  – "You get what you pay for," aka "there's no free lunch"

Observation: MPPs 1, users <1
- MPPs with relatively low-speed micros and lower memory bandwidth ran over supers, but didn't kill 'em
- Did the U.S. industry enter an abyss?
  – Is crying "unfair trade" hypocritical?
  – Are users denied tools?
  – Are users not "getting with the program"?

Challenge: we must learn to program clusters...
- Cache idiosyncrasies
- Limited memory bandwidth
- Long inter-communication delays
- Very large numbers of computers
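For concreteness about what "programming clusters" means in the message-passing style these slides refer to, here is a minimal MPI sketch in C. It is an illustrative fragment, not code from the talk: each process computes a partial sum over a strided share of the work and a single collective call combines the results.

```c
/* Minimal message-passing sketch in the MPI style referred to above.
 * Each process sums part of a range; MPI_Reduce combines the partial
 * results on rank 0.  Problem and decomposition are illustrative only.
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    long i, n = 1000000;            /* illustrative problem size */
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank handles a strided share of the work (coarse granularity). */
    for (i = rank; i < n; i += nprocs)
        local += 1.0 / (double)(i + 1);

    /* One collective communication step combines the partial results. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes: %f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}
```

Even in this tiny example the cluster issues listed above are visible: the work must be decomposed so that each node mostly computes locally, because every collective step pays the inter-communication delay across all of the (possibly very many) computers.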
Strong recommendation: utilize in situ workstations!
- NoW (Berkeley) set the sort record, did distributed decrypting
- Grid, Globus, Condor and other projects
- Need a "standard" interface and programming model for clusters using "commodity" platforms & fast switches
- Giga- and tera-bit links and switches allow geo-distributed systems
- Each PC in a computational environment should have an additional 1GB/9GB!

"Petaflops by 2010"
-- DOE Accelerated Strategic Computing Initiative (ASCI)

DOE's 1997 "PathForward" Accelerated Strategic Computing Initiative (ASCI)
- 1997: 1-2 Tflops, $100M
- 1999-2001: 10-30 Tflops, $200M??
- 2004: 100 Tflops
- 2010: Petaflops

"When is a Petaflops possible? What price?" -- Gordon Bell, ACM 1997
- Moore's Law: 100x (but how fast can the clock tick?)
- Increase parallelism 10K to 100K: 10x
- Spend more ($100M to $500M): 5x
- Centralize the center, or a fast network: 3x
- Commoditization (competition): 3x

Micro gains at 20, 40, & 60% / year
[Chart: projected ops, 1995-2045, log scale 1.E+6 to 1.E+21: 20%/year reaches teraops, 40%/year petaops, 60%/year exaops. See the compound-growth sketch after the petaflops alternatives below.]

Processor limit: the DRAM gap ("Moore's Law")
[Chart: relative performance, 1980-2000; µProc grows 60%/yr, DRAM 7%/yr, so the processor-memory performance gap grows 50%/year.]
- Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clocks x 4-issue, or 432 instructions
- Caches in the Pentium Pro: 64% of area, 88% of transistors
- (Taken from the Patterson-Keeton talk to SIGMOD)

Five scalabilities
- Size scalable -- designed from a few components, with no bottlenecks
- Generation scaling -- no rewrite/recompile required across generations of computers
- Reliability scaling
- Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
- Problem x machine scalability -- the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer
- Problem x machine space => run time: problem scale, machine scale (#p), and run time imply speedup and efficiency

The Law of Massive Parallelism (mine) is based on application scaling
- There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be unrelated to any other.
- Any parallel problem can be scaled to run efficiently on an arbitrary network of computers, given enough memory and time… but it may be completely impractical.
- Challenge to theoreticians and tool builders: how well will (or won't) an algorithm run?
- Challenge for software and programmers: can a package be scalable & portable? Are there models?
- Challenge to users: do larger scale, faster, longer run times increase problem insight, and not just total flops?
- Challenge to funders (Gordon's WAG): is the cost justified?

Manyflops for manybucks: what are the goals of spending?
- Getting the most flops, independent of how much taxpayers give to spend on computers?
- Building or owning large machines?
- Doing a job (stockpile stewardship)?
- Understanding and publishing about parallelism?
- Making parallelism accessible?
- Forcing other labs to follow?

Petaflops alternatives c2007-14, from the 1994 DOE workshop
- SMP: 400 proc. @ 1 Tflops; 400 TB SRAM; 250K chips
- Cluster: 4-40K proc. @ 10-100 Gflops; 400 TB DRAM; 60K-100K chips
- Active Mem/Grid: 400K proc. @ 1 Gflops; 0.8 TB embedded memory; 4K chips
- Other table entries: 1 ps/result vs. 10-100 ps/result; multi-threading; cache hierarchy; a 10-100 Gflops thread is likely
- No definition of storage, network, or programming model
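The 20/40/60%-per-year projections a few slides back, and the petaflops timelines here, are just compound-growth arithmetic. A small sketch follows; the ~1 Gops starting level in 1995 and the targets are illustrative assumptions, not figures from the talk or the workshop.

```c
/* Compound-growth arithmetic behind the 20/40/60%-per-year projections.
 * Starting level and targets are illustrative assumptions only.
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double start_ops = 1e9;                    /* assume ~1 Gops in 1995 (illustrative) */
    double rates[]   = { 0.20, 0.40, 0.60 };   /* annual growth rates                   */
    double targets[] = { 1e12, 1e15, 1e18 };   /* tera-, peta-, exa-ops                 */
    const char *names[] = { "teraops", "petaops", "exaops" };

    for (int r = 0; r < 3; r++) {
        for (int t = 0; t < 3; t++) {
            /* start * (1+rate)^y >= target  =>  y = log(target/start) / log(1+rate) */
            double years = log(targets[t] / start_ops) / log(1.0 + rates[r]);
            printf("%2.0f%%/yr reaches %-8s in ~%3.0f years (c%4.0f)\n",
                   rates[r] * 100.0, names[t], years, 1995.0 + years);
        }
    }
    return 0;
}
```

Changing the assumed starting level shifts every date by the same amount; the point of the sketch is only how strongly the growth rate in the exponent dominates the outcome.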
Or more parallelism… and use installed machines
- 10,000 nodes in 1998, or a 10x increase
- Assume 100K nodes
- 10 Gflops / 10 GB / 100 GB nodes, i.e. low-end c2010 PCs
- Communication is the first problem… use the network
- Programming is still the major barrier
- Will any problems fit it?

Next, short steps

The Alliance LES NT Supercluster
"Supercomputer performance at mail-order prices" -- Jim Gray, Microsoft
- Andrew Chien, CS UIUC --> UCSD
- Rob Pennington, NCSA
- Myrinet network, HPVM, Fast Messages
- Microsoft NT OS, MPI API
- 192 HP 300 MHz + 64 Compaq 333 MHz

2D Navier-Stokes kernel performance
[Chart: preconditioned conjugate gradient method with a multi-level additive Schwarz Richardson preconditioner; Gflops vs. processors for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, and SPP2000-DSM; sustaining 7 GF on a 128-processor NT cluster. Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD).]

The Grid: Blueprint for a New Computing Infrastructure
- Ian Foster, Carl Kesselman (eds.), Morgan Kaufmann, 1999; published July 1998; ISBN 1-55860-475-8
- 22 chapters by expert authors including Andrew Chien, Jack Dongarra, Tom DeFanti, Andrew Grimshaw, Roch Guerin, Ken Kennedy, Paul Messina, Cliff Neuman, Jon Postel, Larry Smarr, Rick Stevens, Charlie Catlett, John Toole, and many others
- "A source book for the history of the future" -- Vint Cerf
- http://www.mkp.com/grids

The Grid
"Dependable, consistent, pervasive access to [high-end] resources"
- Dependable: can provide performance and functionality guarantees
- Consistent: uniform interfaces to a wide variety of resources
- Pervasive: the ability to "plug in" from anywhere

Alliance Grid technology roadmap: it's not just flops or records/sec
[Diagram: layered roadmap -- user interface (Cave5D, Webflow, Virtual Director, VRML, NetMeeting, H.320/323, Java3D, ActiveX, Java); middleware (CAVERNsoft, workbenches, Tango, RealNetworks, visualization, SCIRun, Habanero, Globus, LDAP, QoS); compute (OpenMP, MPI, HPF, DSM, clusters, HPVM/FM, Condor, JavaGrande, Symera (DCOM)); data (svPablo, XML, SRB, HDF-5, Emerge (Z39.50), SANs, DMF, ODBC); networks (Abilene, vBNS, MREN).]

Globus approach
- Focus on architecture issues
  – Propose a set of core services as basic infrastructure
  – Use them to construct high-level, domain-specific solutions
- Design principles
  – Keep participation cost low
  – Enable local control
  – Support adaptation
[Diagram: applications atop diverse global services, atop core Globus services, atop the local OS.]

Globus Toolkit: core services
- Scheduling (Globus Resource Allocation Manager): low-level scheduler API
- Information (Metacomputing Directory Service): uniform access to structure/state information
- Communications (Nexus): multimethod communication + QoS management
- Security (Globus Security Infrastructure): single sign-on, key management
- Health and status (Heartbeat Monitor)
- Remote file access (Global Access to Secondary Storage)

Summary of some beliefs
- A 1000x increase in PAP has not been accompanied by RAP, insight, infrastructure, and use. What was the PACT/$?
- "The PC World Challenge" is to provide commodity, clustered parallelism to commercial and technical communities -- it only comes true if ISVs believe and act
- The Grid etc., using world-wide resources including in situ PCs, is the new idea

The end
PACT 98: http://www.research.microsoft.com/barc/gbell/pact.ppt