NRC Review Panel on High Performance Computing
11 March 1994
Gordon Bell
© Gordon Bell

Position
Dual use: exploit parallelism with in situ nodes & networks.
Leverage the WS & mP industrial HW/SW/app infrastructure!
No Teraflop before its time -- it's Moore's Law.
It is possible to help fund computing: heuristics from federal funding & use (50 computer systems and 30 years).
Stop "Duel Use" genetic engineering of State Computers:
•10+ years: nil payback, mono use, poor results, & still to come
•the plan for porting apps to monos will also be ineffective -- apps must leverage, be cross-platform & self-sustaining
•let "Challenges" choose apps, not mono-use computers
•"industry" offers better computers & these are jeopardized
•users must be free to choose their computers, not funders
•next-generation State Computers only "approach" industry
•10 Tflops ... why?
Summary recommendations

Principal computing environments circa 1994
[Figure: four networks and four computing "worlds", tied together by >4 interconnect & communication standards: POTS & 3270 terminals, WAN comm. stds., two LAN stds., and proprietary clusters]
•POTS net for switching mainframes, minis, and terminals; wide-area data network for inter-site communication among UNIX servers, workstations & PCs
•'50s IBM & proprietary mainframe world: mainframes, clusters, 3270 (& PC) terminals
•'70s mini (proprietary) world & '90s UNIX mini world: minicomputers with ASCII & PC terminals; UNIX multiprocessor servers operated as traditional minicomputers
•'80s UNIX distributed workstations & servers world: UNIX workstations, NFS servers, compute & database uni- & mP servers
•Late-'80s LAN-PC world: PCs (DOS, Windows, NT) and Novell & NT servers on Token Ring & Ethernet LANs (gateways, bridges, routers, hubs, etc.)

Computing environments circa 2000
[Figure: a universal high-speed data service using ATM (or ??) provides local & global data communication -- the wide-area global ATM network plus Local Area Networks (also 10-100 Mb/s point-to-point Ethernet) -- for terminals, PCs, workstations, & servers]
•NT, Windows & UNIX "person servers" (platforms: x86, PowerPC, SPARC, etc.)
•Legacy mainframe & minicomputer servers & terminals
•Centralized & departmental scalable uni- & mP servers (UNIX & NT): NFS, database, compute, print, & communication servers
•Multicomputers built from multiple simple servers
•TC = TV + PC in the home ... (CATV or ATM)

Beyond Dual & Duel Use Technology: parallelism can & must be free!
HPCS, corporate R&D, and technical users must have the goal of designing, installing, and supporting parallel environments that use and leverage:
•every in situ workstation & multiprocessor server
•as part of the local ... national network.
Parallelism is a capability that all computing environments can & must possess -- not a feature used to segment off "mono-use" computers.
Parallel applications become a way of computing that utilizes existing, zero-cost resources -- not a subsidy for specialized, ad hoc computers.
Apps follow pervasive computing environments.

Computer genetic engineering & species selection has been ineffective
Although problem x machine scalability using SIMD to simulate some physical systems has been demonstrated, given extraordinary resources, the efficacy of larger problems in justifying the cost has not.
Hamming: "The purpose of computing is insight, not numbers."
The "demand side" Challenge users have the problems and should be drivers. ARPA's contractors should re-evaluate their research in light of driving needs. Federally funded "Challenge" apps porting should be to multiple platforms including workstations & compatible, multis that support // environments to insure portability and understand main line cost-effectiveness Continued "supply side"programs aimed at designing, purchasing, supporting, sponsoring, & porting of apps to specialized, State Computers, including programs aimed at 10 Tflops, should be re-directed to networked computing. User must be free to choose and buy any computer, including PCs & WSs, WS Clusters, multiprocessor servers, supercomputers, mainframes, and even highly distributed, coarse grain, data parallel, MPP State computers. 6 © Gordon Bell 10000 The teraflops Cray DARPA Intel $300M 1000 • CM5 $240M CM5 $120M • Intel $55M NEC CM5 $30M 100 Bell Prize • Performance (t) 10 Cray Super $30M 1 1988 1990 1992 1994 7 1996 1998 2000 © Gordon Bell We get no Teraflop before it's time: it's Moore's Law! Flops = f(t,$), not f(t) technology plans e.g. BAA 94-08 ignores $s! All Flops are not equal (peak announced performance-PAP or real app perf. -RAP) FlopsCMOSPAP*< C x 1.6**(1992-t) x $; C = 128 x 10**6 flops / $30,000 FlopsRAP =FlopsPAP x 0.5 for real apps, 1/2 PAP is a great goal Flopssupers = FlopsCMOS x 0.1; improvement of supers 15-40%/year; higher cost is f(need for profitability, lack of subsidies, volume, SRAM) 92'-94': FlopsPAP/$ =4K; Flopssupers/$=500; Flopsvsp/$ =50 M (1.6G@$25) *Assumes primary & secondary memory size & costs scale with time memory = $50/MB in 1992-1994 violates Moore's Law disks = $1/MB in1993, size must continue to increases at 60% / year When does a Teraflop arrive if only $30 million** is spent on a super? 1 TflopCMOS PAP in 1996 (x7.8) with 1 GFlop nodes!!!; or 1997 if RAP 10 TflopCMOS PAP will be reached in 2001 (x78) or 2002 if RAP How do you get a teraflop earlier? **A $60 - $240 million Ultracomputer reduces the time by 1.5 - 4.5 years. 8 © Gordon Bell Funding Heuristics (50 computers & 30 years of hindsight) 1. Demand side works i.e., we need this product/technology for x; Supply side doesn't work! Field of Dreams": build it and they will come. 2. Direct funding of university research resulting in technology and product prototypes that is carried over to startup a company is the most effective. -- provided the right person & team are backed with have a transfer avenue. a. Forest Baskett > Stanford to fund various projects (SGI, SUN, MIPS) b. Transfer to large companies has not been effective c. Government labs... rare, an accident if something emerges 3. A demanding & tolerant customer or user who "buys" products works best to influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN) a. DOE labs have been effective buyers and influencers, "Fernbach policy"; unclear if labs are effective product or apps or process developers b. Universities were effective at influencing computing in timesharing, graphics, workstations, AI workstations, etc. c. ARPA, per se, and its contractors have not demonstrated a need for flops. d. Universities have failed ARPA in defining work that demands HPCS -hence are unlikely to be very helpful as users in the trek to the teraflop. 4. Direct funding of large scale projects" is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN! © Gordon Bell 9 Funding Heuristics-2 5. 
Funding heuristics (50 computers & 30 years of hindsight)
1. The demand side works, i.e., "we need this product/technology for x"; the supply side doesn't work -- "Field of Dreams": build it and they will come.
2. Direct funding of university research that results in technology and product prototypes and is carried over to start up a company is the most effective -- provided the right person & team are backed and have a transfer avenue.
  a. Forest Baskett > Stanford funding of various projects (SGI, SUN, MIPS)
  b. Transfer to large companies has not been effective
  c. Government labs ... rare; an accident if something emerges
3. A demanding & tolerant customer or user who "buys" products works best to influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN).
  a. DOE labs have been effective buyers and influencers (the "Fernbach policy"); it is unclear whether the labs are effective product, apps, or process developers
  b. Universities were effective at influencing computing in timesharing, graphics, workstations, AI workstations, etc.
  c. ARPA, per se, and its contractors have not demonstrated a need for flops.
  d. Universities have failed ARPA in defining work that demands HPCS -- hence they are unlikely to be very helpful as users in the trek to the teraflop.
4. Direct funding of "large scale projects" is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN!

Funding heuristics, continued
5. Funding product development, targeted purchases, and other subsidies to establish "State Companies" in a vibrant and overcrowded market is wasteful, likely to be wrong, and likely to impede computer development (e.g., by having to feed an overpopulated industry). Furthermore, it is likely to have a deleterious effect on a healthy industry (e.g., supercomputers).
  A significantly smaller universe of computing environments is needed. Cray & IBM are givens; SGI is probably the most profitable technical vendor; HP/Convex is likely to be a contender, & others (e.g., DEC) are trying. No state company (Intel, TMC, Tera) is likely to be profitable & hence self-sustaining.
6. University-company collaboration is a new area of government R&D. So far it hasn't worked, nor is it likely to unless the company invests. It appears to be a way to help a company fund marginal people and projects.
7. CRADAs (co-operative research and development agreements) are very closely allied to direct product development and are equally likely to be ineffective.
8. Direct subsidy of software apps, or of porting apps to one platform (e.g., EMI analysis), is a way to keep marginal computers afloat. If the government funds apps, they must be ported cross-platform!
9. Encourage the use of computers across the board, but discourage designs from those who have not used or built a successful computer.

Scalability: the platform of HPCS & why continued funding is unnecessary
Mono-use machines, aka MPPs, have been, are, and will be doomed.
The law of scalability.
Four scalabilities: machine, problem x machine, generation (t), & now spatial.
How do flops, memory size, efficiency & time vary with problem size?
Does insight increase with problem size?
What is the nature of the problems & work for monos?
What about the mapping of problems onto monos?
What about the economics of software to support monos?
What about all the competitive machines, e.g. workstations, workstation clusters, supers, scalable multis, attached processors?

Special, mono-use MPPs are doomed ... no matter how much fedspend!
They are special because they have non-standard nodes & networks -- with no apps. Having failed to evolve into the main line, events have overtaken them. It's special-purpose if it appears only in Dongarra's Table 3.
Flop rate, execution time, and memory size vs. problem size show applicability limited to very large problems that must be scaled up to cover the inherent, high overhead.
Conjecture: a properly used supercomputer will provide greater insight and utility because of its apps and generality -- running more, smaller problems according to a plan produces more insight.
The problem domain is limited & monos now have to compete with:
•supers -- do scalars, fine grain, and work, and have apps
•workstations -- do very long grain, are in situ, and have apps
•workstation clusters -- have identical characteristics and have apps
•low-priced ($2 million) multis -- are superior, i.e., handle shorter grain, and have apps
•scalable multiprocessors -- formed from multis, now in the design stage
Mono-useful (>>//) -- hence illegal, because they are not dual use.
Duel use -- only useful for keeping a high budget intact, e.g., the 10 TF.

The Law of Massive Parallelism is based on application scale
There exists a problem that can be made sufficiently large that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be unrelated to any other problem.
A corollary: any parallel problem can be scaled to run on an arbitrary network of computers, given enough memory and time.
Challenge to theoreticians: how well will an algorithm run?
Challenge for software: can a package be scalable & portable?
Challenge to users: do larger scale, faster, longer run times increase insight into the problem, and not just the flops?
Challenge to HPCC: is the cost justified? If so, let the users do it!
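The law and its corollary are the scaled-speedup argument: grow the problem until communication overhead is hidden. A minimal sketch contrasting fixed-size speedup with Gustafson-style scaled speedup -- the 5% serial/overhead fraction is illustrative, not a figure from the talk -- shows why a large enough problem makes almost any machine look efficient, which is precisely what the challenges above question:

```python
# Fixed-size (Amdahl) vs. scaled (Gustafson) speedup on p processors.
# s is the serial/overhead fraction; 5% here is illustrative only.
def fixed_size_speedup(p, s):
    """Speedup when the problem size is held constant."""
    return 1.0 / (s + (1.0 - s) / p)

def scaled_speedup(p, s):
    """Speedup when the parallel part of the problem grows with p."""
    return s + (1.0 - s) * p

for p in (8, 64, 1024):
    print(f"p={p:5d}  fixed={fixed_size_speedup(p, 0.05):6.1f}  scaled={scaled_speedup(p, 0.05):7.1f}")
# roughly: p=8 -> 5.9 vs 7.7; p=64 -> 15.4 vs 60.9; p=1024 -> 19.6 vs 972.9
```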
Scalabilities
Size-scalable computers are designed from a few components, with no bottleneck component.
Generation-scalable computers can be re-implemented in the next generation of technology with no rewrite or recompile.
Problem x machine scalability -- the ability of a problem, algorithm, or program to exist at a range of sizes so that it can be run efficiently on a given scalable computer. Although large problems allow high flop rates, large problems running longer may not produce more insight.
Spatial scalability -- the ability of a computer to be scaled over a large physical space to use in situ resources.

[Figure: Linpack rate in Gflops vs. matrix order (100 to 100,000, log scale) for a 1K-node CM-5 and a 4-processor NEC SX-3]
[Figure: Linpack solution time vs. matrix order (100 to 100,000, log scale) for the same 1K-node CM-5 and 4-processor SX-3]

GB's estimate of parallelism in engineering & scientific applications
[Figure: log(# of apps) vs. granularity & degree of coupling (computation/communication); the regions span workstations, supers (dusty decks and new or scaled-up apps), scalable multiprocessors, and massive mCs & WSs]
  scalar: 60%
  vector: 15%
  vector mP (<8): 5%
  >>// (massively parallel): 5%
  embarrassingly or perfectly parallel: 15%

MPPs are only for unique, very large scale, data-parallel apps
[Figure: price ($M, 0.01 to 100, log scale) vs. application characterization (scalar | vector | vector mP | data // | embarrassingly // | general-purpose workload | visualization | apps); mono-use >>// machines occupy only the high-priced, data-parallel corner, while supers (s), mPs, and WSs cover the rest of the range at lower prices]

Applicability of various technical computer alternatives
Domain:        scalar  vector  vect. mP  data //  ep & inf. //  gp wrkld  visualiz'n  apps
PC | WS          1       2*      na        na         1            3          1        1
Multi server     1       2       2         1          2            1          na       1
SC & Mfrm        2       1       1         2          3            1          na       1
>>//             1*      2       na        1          1            2          1        from WS
WS clusters      na      3       3         1          2            na         na       na
*Current micros are weak but improving so rapidly that subsequent >>// machines built from them will gain no advantage from node vectorization.

Performance using distributed computers depends on problem & machine granularity
Berkeley's LogP model characterizes granularity and needs to be understood, measured, and used. Three parameters are given in terms of processing ops, plus the machine size:
  L = latency -- delay to communicate between apps
  o = overhead -- time lost sending and receiving messages
  g = gap -- minimum time between messages, i.e., 1 / message-passing rate (bandwidth)
  P = number of processors

Granularity nomograph
[Figure: nomograph relating processor speed (10M to 1G ops/s, from 1993 and 1995 micros up to a C90 and an Ultra), communication latency & synchronization overhead (100 ns to 1 s: WANs ~100 ms, LANs ~1-10 ms, MPPs ~10 us, supers' memory ~100 ns), and the grain length in ops required (fine ~100, medium ~1,000, coarse ~10K, very coarse ~100K-10M)]
[Figure: the same nomograph populated with machines: Cray T3D, VPP 500, C90, Fujitsu VP, a 1993 super, 1993 and 1995 micros, an Ultra]
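The nomographs reduce to simple arithmetic: at a given node speed, each message costs a number of "lost ops", and the grain of work between messages has to dwarf that cost. A minimal sketch -- the efficiency model grain/(grain + lost ops) and the overhead values are my illustration; only the latencies and node speeds echo the nomograph:

```python
# Grain length needed to hold a target efficiency, LogP-style: convert the
# per-message latency L plus overhead o into ops forgone at the node's speed.
# The efficiency model grain / (grain + lost_ops) is a deliberate simplification.
def lost_ops(latency_s, overhead_s, node_ops_per_s):
    """Ops a node could have executed during one message's latency + overhead."""
    return (latency_s + overhead_s) * node_ops_per_s

def min_grain(latency_s, overhead_s, node_ops_per_s, target_eff=0.5):
    """Ops of useful work needed between messages for the target efficiency."""
    cost = lost_ops(latency_s, overhead_s, node_ops_per_s)
    return cost * target_eff / (1.0 - target_eff)

NODE = 100e6   # a ~100 Mops 1993-class microprocessor
print(min_grain(1e-3, 1e-4, NODE))    # LAN (~1 ms):        ~110,000 ops -> very coarse grain
print(min_grain(10e-6, 1e-6, NODE))   # MPP switch (~10 us):  ~1,100 ops -> medium grain
print(min_grain(100e-9, 0.0, NODE))   # supers' memory (~100 ns): ~10 ops -> fine grain works
```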
Economics of packaged software
Platform              Cost per copy ($)   Leverage    # of copies
MPP                   >100K               1           1-10
Minis, mainframes     10-100K             10-100      1,000s
Workstation           1-100K              1-10K       1-100K
PC                    25-500              50K-1M      1-10M
(The mini/mainframe row also covers the evolving high-performance multiprocessor servers.)
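The leverage column is the whole argument: a fixed development or porting cost must be recovered across each platform's unit volume. A minimal sketch -- the $10 million development cost and the representative copy counts are illustrative assumptions, not figures from the talk:

```python
# Per-copy price needed just to recover a fixed development/porting cost
# across each platform's plausible installed base (illustrative numbers).
DEV_COST = 10e6   # assumed cost to build or port a serious application
COPIES = {"MPP": 10, "mini/mainframe": 1_000, "workstation": 100_000, "PC": 5_000_000}

for platform, n in COPIES.items():
    print(f"{platform:15s} ${DEV_COST / n:>12,.0f} per copy to break even")
# MPP: $1,000,000 per copy vs. PC: $2 per copy -- apps follow the volume platforms
```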
Chuck Seitz comments on multicomputers
“I believe that the commercial, medium grained multicomputers aimed at ultra-supercomputer performance have adopted a relatively unprofitable scaling track, and are doomed to extinction. ... they may as Gordon Bell believes be displaced over the next several years by shared memory multiprocessors. ... For loosely coupled computations at which they excel, ultra-super multicomputers will, in any case, be more economically implemented as networks of high-performance workstations connected by high-bandwidth, local area networks...”

Convergence to a single architecture with a single address space that uses distributed, shared memory
limited-scalability (<20) multiprocessors >> scalable multiprocessors
workstations with 1-4 processors >> workstation clusters & scalable multiprocessors
workstation clusters >> scalable multiprocessors
State Computers built as message-passing multicomputers >> scalable multiprocessors

Evolution of scalable multiprocessors, multicomputers, & workstations to shared-memory computers
[Figure: taxonomy of the evolution toward one architecture, circa 1995. Only two structures survive: (1) shared-memory multiprocessors (mP) with uniform or non-uniform memory access, and (2) networked, shared-nothing workstations.]
•Limited-scalability, uniform-memory-access mPs remain the main line: mainframes & supers (Convex, Cray, Fujitsu, Hitachi, IBM, NEC) and bus- or ring-based multis (DEC, Encore, Sequent, Stratus, SGI, SUN, etc.), under competitive pressure from micros.
•Scalable, non-uniform-memory-access multiprocessors (smP): first generation with no caches (Cm* '75, Butterfly '85, Cedar '88); caches for locality (DASH, Convex, Cray T3D, SCI); all-cache architecture (KSR Allcache); next-generation DSM research (e.g., DDM, DASH+) with high-density, multithreaded (>32-64?) processors & coherent switches.
•Scalable multicomputers (smC) built from micros with fast hypercube or grid switches, medium-to-coarse grain (Cosmic Cube, iPSC 1, NCUBE, Transputer-based; Fujitsu, Intel, Meiko, NCUBE, TMC, 1985-1994); experimental fine-grain smC (Mosaic-C, J-machine); the next generation adds DSM and becomes an smP.
•Networked workstations, very coarse grain (Apollo, SUN, HP, etc.), evolve into workstation clusters via special switches & ATM in 1994-1995, on high-bandwidth switches & communication protocols, e.g. ATM.

Re-engineering HPCS
Genetic engineering of computers has not produced a healthy strain that lives for more than one 3-year computer generation; hence no app base can form.
•No inter-generational MPPs exist with compatible networks & nodes.
•All parts of an architecture must scale from generation to generation!
•An architecture must be designed for at least three 3-year generations!
Supporting a DARPA university to learn computer design carries a high price -- the market is only $200 million and the R&D is billions; competition works far better.
The inevitable movement toward standard networks and nodes need not be accelerated; it evolves best by a normal market mechanism driven by users.
Dual use of networks & nodes is the path to wide-scale parallelism, not weird computers:
  networking is free via ATM;
  nodes are free via in situ workstations;
  apps follow pervasive computing environments.
Applicability was small and is getting smaller very fast, with many experienced computer companies entering the market with fine products (e.g. Convex/HP, Cray, DEC, IBM, SGI & SUN) that leverage their R&D and their apps, apps, & apps.
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating the use of weird machines that take away from use, the weaker it becomes.
MPP won: mainstream vendors have adopted multiple CMOS. Stop funding! Environments & apps are needed, but they are unlikely because the market is small.

Recommendations to HPCS
Goal: by 2000, massive parallelism must exist as a by-product that leverages a wide-scale national network & workstation/multi HW/SW nodes.
Dual use, not duel use, of products and technology -- the principle of "elegance": one part serves more than one function; network companies supply the networks; node suppliers provide ordinary workstations/servers with existing apps; this leverages the $30 billion x 10**6 R&D.
Fund high-speed, low-latency networks as a ubiquitous service and the base of all forms of interconnection, from WANs to supercomputers (in addition, some special networks will exist for small-grain problems).
Observe the funding heuristics in future federal program funding scenarios ... eliminate direct or indirect product development and mono-use computers.
Fund Challenges, who in turn fund purchases -- not product development.
Funding or purchase of apps porting must be driven by the Challenges, build on binary-compatible workstation/server apps to leverage the nodes, and be cross-platform based to benefit multiple vendors & have cross-platform use.
Review the effectiveness of State Computers, e.g., need, economics, efficacy:
  each committee member might visit 2-5 sites using a >>// computer;
  review // programming environments & their efficacy in producing & supporting apps.
Eliminate all forms of State Computers & recommend a balanced HPCS program -- nodes & networks, based on the industrial infrastructure:
  stop funding the development of mono computers, including the 10 Tflop;
  it must be acceptable & encouraged to buy any computer for any contract.

Gratis advice for HPCC* & BS*
D. Bailey warns that scientists have almost lost credibility...
Focus on a gigabit NREN with low-overhead connections that will enable multicomputers as a by-product.
Provide many small, scalable computers vs. large, centralized ones.
Encourage (revert to) & support not-so-grand challenges.
Grand Challenges (GCs) need explicit goals & plans -- the disciplines fund & manage them (demand side) ... HPCC will not.
Fund balanced machines & efforts; stop starting Viet Nams.
Drop the funding & directed purchase of state computers.
Revert to university research -> company & product development.
Review the HPCC & GC programs' output ...
*High Performance Cash Conscriptor; Big Spenders

Disclaimer
This talk may appear inflammatory ... i.e. the speaker may have appeared "to flame". It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers ... it just may appear that way.
Scalability: the platform of HPCS
The law of scalability.
Three kinds: machine, problem x machine, & generation (t).
How do flops, memory size, efficiency & time vary with problem size?
What is the nature of the problems & work for these computers?
What about the mapping of problems onto the machines?