Goal of the Committee
The goal of the committee is to assess the status of supercomputing in the United States, including the characteristics of relevant systems and architecture research in government, industry, and academia, and the characteristics of the relevant market. The committee will examine key elements of context -- the history of supercomputing, the erosion of research investment, the needs of government agencies for supercomputing capabilities -- and assess options for progress. Key historical or causal factors will be identified. The committee will examine the changing nature of problems demanding supercomputing (e.g., weapons design, molecular modeling and simulation, cryptanalysis, bioinformatics, climate modeling) and the implications for systems design. It will seek to understand the role of national security in the supercomputer market and the long-term federal interest in supercomputing.

NRC-CSTB Future of Supercomputing Committee, 22 May 2003
(Also draws on the NRC Brooks-Sutherland Committee, 11 March 1994, and the NRC OSTP Report, 18 August 1984)
Gordon Bell, Microsoft Bay Area Research Center, San Francisco

Outline
• Community re-centric computing vs. computer centers
• Background: where we are and how we got here. Performance(t). Hardware trends and questions.
• If we didn't have general purpose computation centers, would we invent them?
• Atkins Report: past and future concerns... be careful what we wish for
• Appendices: NRC Brooks-Sutherland '94 comments; CISE (gbell) concerns re. supercomputing centers ('87); CISE (gbell) //ism goals ('87); NRC Report '84 comments
• Bottom line, independent of the question: it has been and always will be the software and apps, stupid! And now it's the data, too!

Community re-Centric Computing...
• Goal: enable technical communities to create their own computing environments for personal, data, and program collaboration and distribution.
• Design based on technology trends, especially networking, apps program maintenance, databases, and providing web and other services.
• Many alternative styles and locations are possible:
– Service from existing centers, including many state centers
– Software vendors could be encouraged to supply apps services
– NCAR-style centers built around data and apps
– Instrument-based databases, both central and distributed when multiple viewpoints create the whole
– Wholly distributed among many individual groups

Community re-Centric Computing: time for a major change
Community centric:
• Community is responsible
– Planned and budgeted as resources
– Responsible for its infrastructure
– Apps are community centric
– Computing is integral to science/engineering
• In sync with technologies
– 1-3 Tflops/$M and 1-3 PBytes/$M to buy smallish Tflops and PBytes
• New scalables are "centers" fast
– Community can afford
– Dedicated to a community
– Program, data, and database centric
– May be aligned with instruments or other community activities
• Output = web service; an entire community demands real-time web service
Centers centric:
• Center is responsible
– Computing is "free" to users
– Provides a vast service array for all
– Runs and supports all apps
– Computing grant disconnected
• Counter to technology directions
– More costly; large centers operate at a dis-economy of scale
• Based on unique, fast computers
– Center can only afford
– Divvies cycles among all communities
– Cycles centric, but politically difficult to maintain the highest power vs. more centers
– Data is shipped to centers, requiring expensive, fast networking
• Output = diffuse among general purpose centers; are centers ready or able to support on-demand, real-time web services?
(A rough cost comparison using these numbers is sketched below.)
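To make the buy-vs.-center contrast above concrete, here is a minimal back-of-the-envelope sketch in Python. Only the 1-3 Tflops/$M purchase figure comes from the slide; the budget, utilization, and center overhead multiplier are illustrative assumptions.

```python
# Back-of-the-envelope: what a community gets by buying a cluster vs. using a shared center.
# Only the ~1-3 Tflops/$M purchase figure comes from the slide; everything else is assumed.

budget_m = 1.0                 # community budget, $M/year (assumption)
tflops_per_m = 2.0             # mid-range of the slide's 1-3 Tflops/$M purchase price
cluster_utilization = 0.7      # fraction of owned cycles the community actually uses (assumption)

owned_tflops = budget_m * tflops_per_m * cluster_utilization

# Shared center: assume the same budget buys an allocation on a larger machine, but
# operations, staff, and diseconomy of scale raise the effective price of delivered cycles.
center_overhead = 2.5          # assumed cost multiplier for delivered center cycles
center_tflops = budget_m * tflops_per_m / center_overhead

print(f"Community-owned cluster: ~{owned_tflops:.1f} sustained Tflops for ${budget_m}M/yr")
print(f"Shared-center allocation: ~{center_tflops:.1f} Tflops for the same budget (assumed overheads)")
```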
Background: scalability at last
• Q: How did we get to scalable computing and parallel processing?
• A: Scalability evolved from a two-decade-old vision and plan starting at DARPA and NSF, now picked up by DOE and the rest of the world.
• Q: What should be done now?
• A: Realize that scalability, the web, and now web services change everything. Redesign to get with the program!
• Q: Why do you seem to want to de-center?
• A: Besides the fact that user demand has been and is totally de-coupled from supply, I believe the technology doesn't necessarily support users or their mission, and that centers are potentially inefficient compared with a more distributed approach.
(Photo: Steve Squires and Gordon Bell at our "Cray" at the start of DARPA's SCI program, c. 1984. Twenty years later, clusters of killer micros have become the single standard.)

Lost in the search for parallelism
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Cogent, Convex (> HP), Cray Computer, Cray Research (> SGI > Cray), Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Encore, Elexsi, ETA Systems, Evans and Sutherland Computer, Exa, Flexible, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories (searching again), MasPar, Meiko, Multiflow, Myrias, Numerix, Pixar, Parsytec, nCube, Prisma, Pyramid, Ridge, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Tera (> Cray Company), Thinking Machines, Vitesse Electronics, Wavetracer

A brief, simplified history of HPC
1. The Cray formula (smPv) evolves for Fortran, 1960-2002 (US: 1960-1990).
2. 1978: VAXen threaten computer centers...
3. NSF response: the Lax Report. Create 7 Cray centers, 1982.
4. SCI: DARPA searches for parallelism using killer micros.
5. Scalability found: "bet the farm" on clusters of micros. Users "adapt": MPI, the lowest-common-denominator programming model, is found. >1995. Result: EVERYONE gets to re-write their code!!
6. Beowulf clusters form by adopting PCs and Linus's Linux to create the cluster standard (in spite of funders). >1995.
7. ASCI: DOE gets petaflops clusters, creating an "arms" race!
8. "Do-it-yourself" Beowulfs negate computer centers, since everything is a cluster and shared power is nil. >2000.
9. 1997-2002: Tell Fujitsu and NEC to get "in step"!
10. High-speed nets enable peer-to-peer and the Grid or TeraGrid.
11. Atkins Report: spend $1B/year, form more and larger centers, and connect them as a single center...
The Virtuous Economic Cycle drives the PC industry... and Beowulf
[Cycle diagram: standards attract suppliers (greater availability at lower cost) and attract users, who create apps, tools, and training, which in turn reinforce the standards.]

Lessons from Beowulf
An experiment in parallel computing systems, '92.
• Established a vision: low cost, high end computing
• Demonstrated the effectiveness of PC clusters for some (not all) classes of applications
• Provided networking software
• Provided cluster management tools
• Conveyed findings to the broad community through tutorials and the book
• Provided a design standard to rally the community!
• Standards beget books, trained people, software... a virtuous cycle that allowed apps to form
• An industry began to form beyond a research project
(Courtesy Thomas Sterling, Caltech.)

Technology: peta-bytes, -flops, -bps
We get no technology before its time.
• Moore's Law 2004-2012: 40x. The big surprise will be the 64-bit micro.
• 2004: O(100) processors = 300 GF peak (PAP), $100K
– 3 TF/$M, with no diseconomy of scale for large systems
– 1 PF => $330M, but 330K processors; other paths exist
• Storage: 1-10 TB disks; 100-1000 disks
• Internet II killer app -- NOT the TeraGrid
– Access Grid, new methods of communication
– Response time to provide web services

Performance metrics (t)
[Chart, 1987-2009: RAP (GF), processor count, cost ($M), density (Gb/in²), and flops/$ over time, with growth-rate annotations of 60%, 100%, and 110% per year; ES (Earth Simulator) marked.]

Computing Laws
Perf (PAP) = c x 1.6**(t-1992); c = 128 GF / $300M. The '94 prediction was c = 128 GF / $30M.
[Chart: flops (PAP) per $M, 1992-2012, on a log scale from 1e8 to 1e16, for a GB peak projection and $30M, $100M, and $300M supers.]

Performance (TF) vs. cost ($M) of non-central and centrally distributed systems
[Chart: performance (0.01-100 TF) vs. cost ($0.1-100M), comparing centers (old-style supers, at purchase and as delivered) with non-central systems.]

National Semiconductor Technology Roadmap (size)
[Chart, 1995-2010: memory size (MBytes/chip), microprocessor Mtransistors/chip, and line width (0.35 µm and below); 1 Gbit memory parts marked.]

Disk Density Explosion
• Magnetic disk recording density (bits per mm²) grew at 25% per year from 1975 until 1989.
• Since 1989 it has grown at 60-70% per year.
• Since 1998 it has grown at >100% per year, and this rate will continue into 2003.
• Factors causing the accelerated growth: improvements in head and media technology; improvements in signal-processing electronics; lower head flying heights.
(Courtesy Richie Lary. A compounding check of these rates follows the Disk/Tape Cost Convergence slide below.)

National Storage Roadmap 2000
[Chart: areal density projections, with trend lines annotated 100x/decade (=100%/year) and ~10x/decade (=60%/year).]

Computing Laws: Disk / Tape Cost Convergence
[Chart, 1/01-1/05: 5400 RPM ATA disk retail price vs. SDLT tape cartridge price, on a $0.00-$3.00 scale.]
• A 3½" ATA disk could cost less than an SDLT cartridge in 2004, if disk manufacturers maintain the 3½", multi-platter form factor.
• The volumetric density of disk will exceed tape in 2001.
• A "Big Box of ATA Disks" could be cheaper than a tape library of equivalent size in 2001.
(Courtesy of Richard Lary.)
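As a quick compounding check on the disk areal-density growth rates quoted above (25%/year to 1989, 60-70%/year through 1998, >100%/year after), here is a minimal Python sketch; treating the slide's ranges as exact annual rates (65% and 100% as representatives) is my assumption.

```python
# Compound the areal-density growth rates quoted on the "Disk Density Explosion" slide.
# The per-era rates come from the slide; picking single representative rates is an assumption.
eras = [
    (1975, 1989, 0.25),   # 25%/year
    (1989, 1998, 0.65),   # "60-70%/year" -- using 65% as a midpoint
    (1998, 2003, 1.00),   # ">100%/year" -- using 100% as a floor
]

density = 1.0  # relative areal density, normalized to 1975
for start, end, rate in eras:
    density *= (1.0 + rate) ** (end - start)
    print(f"{end}: ~{density:,.0f}x the 1975 areal density "
          f"(growing {rate:.0%}/year since {start})")
```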
Disk Capacity / Performance Imbalance
[Chart, 1992-2001: relative disk capacity vs. disk performance on a log scale.]
• Capacity growth is outpacing performance growth: roughly 140x in 9 years (73%/yr) for capacity vs. 3x in 9 years (13%/yr) for performance.
• The difference must be made up by better caching and load balancing.
• Actual disk capacity may be capped by the market; a shift to smaller disks is already happening for high-speed disks.
(Courtesy of Richard Lary.)

Re-Centering to Community Centers
• There is little rational support for general purpose centers:
– Scalability changes the architecture of the entire Cyberinfrastructure
– No need to have a computer bigger than the largest parallel app
– They aren't super
– The world is substantially data driven, not cycles driven
– Demand is de-coupled from supply planning, payment, or services
• Scientific/engineering computing has to be the responsibility of each of its communities:
– Communities form around instruments, programs, databases, etc.
– Output is a web service for the entire community

Grand Challenge (c2002) problems become desktop (c2004) tractable
• I don't buy the problem-growth mantra: 2x resolution costs >2**4 in work, i.e. years of machine growth (see the arithmetic sketch after the "Review the bidding" slide below).
• Radical machines will come from the low-cost, 64-bit explosion.
• Today's desktop has trumped and will increasingly trump yesteryear's super, simply due to the explosion in memory size.
• Pittsburgh Alpha: 3D MRI skeleton computing/viewing becomes a desktop problem given a large memory.
• Tony Jamieson: "I can model an entire 747 on my laptop!"

Centers aren't very super...
• Pittsburgh: 6; NCAR: 10; LSU: 17; Buffalo: 22; FSU: 38; San Diego: 52; NCSA: 82; Cornell: 88; Utah: 89.
• 17 universities worldwide are in the top 100.
• Massive upgrade is continuously required:
– Large memories: the machines aren't balanced and haven't been. Bandwidth of 1 Byte/flop vs. 24 Bytes/flop.
– File storage > databases
• Since center systems have >4-year lives, they start out obsolete and overpriced... and then get worse.

Centers: The role going forward
• The US builds scalable clusters, NOT supercomputers:
– Scalables are 1 to n commodity PCs that anyone can assemble.
– Unlike the "Crays", all nodes are equal. Use is allocated in small clusters.
– Problem parallelism, short of embarrassingly (∞) parallel apps, has been elusive (limited to ~1K).
– There is no advantage to having a computer larger than a parallelizable program.
• User computation can be acquired and managed effectively:
– Computation is divvied up in small clusters, e.g. 128 nodes, that individual groups can acquire and manage effectively.
• The basic hardware evolution doesn't especially favor centers:
– 64-bit architecture. 512 Mb chips x 32/DIMM = 2 GB, x 4/system = 8 GB; systems will go >>16 GB. (A center's machine will be obsolete by the memory/balance rules.)
– 3-year timeframe: 1 TB disks at $0.20/GB.
– Last-mile communication costs are not decreasing in a way that favors centers or grids.

Review the bidding
• 1984: The Japanese are coming. CMOS and killer micros. Build // machines.
– 40+ computers were built and failed, based on CMOS and/or micros
– No attention to software or apps
• 1994: Parallelism and Grand Challenges
– Converged to Linux clusters (constellations: nodes with >1 processor) and MPI
– No noteworthy middleware emerged to aid apps and replace Fortran; e.g., HPF failed
– Whatever happened to the Grand Challenges??
• 2004: The TeraGrid has potential as a massive computer and massive research project
– We have, and will continue to have, the computer clusters to reach a <$300M petaflop
– Massive review and re-architecture of centers and their function is needed
– Instrument/program/data/community centric (CERN, Fermi, NCAR, Celera)
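The "2x resolution => >2**4 more work" remark on the Grand Challenge slide above is easy to make concrete. A minimal sketch, assuming a 3D mesh plus a time dimension and the roughly 60%/year flops-per-dollar growth used elsewhere in the talk (the exact growth rate and dimensionality are my assumptions):

```python
import math

# Doubling resolution in a 3D, time-stepped simulation multiplies the work by
# 2 (x) * 2 (y) * 2 (z) * 2 (time steps) = 2**4 = 16x.
work_factor = 2 ** 4

# At ~60%/year growth in flops per dollar (the rate used elsewhere in the talk),
# how long until the same budget absorbs a 16x larger problem?
annual_growth = 0.60  # assumption: treat the slide's 60%/year as exact
years = math.log(work_factor) / math.log(1 + annual_growth)

print(f"2x resolution => {work_factor}x the work")
print(f"At {annual_growth:.0%}/year, that is ~{years:.1f} years of machine growth")
```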
Recommendations
• Give careful advice on the Atkins Report. (It is just the kind of plan that is likely to fly.)
• Community centric computing:
– Community/instrument/data/program centric (Celera, CERN, NCAR)
• A small number of things to report:
– Forget about hardware for now... it's scalables. The die has been cast.
– Support training, apps, and any research that eases apps development.
– Databases represent the biggest gain. Don't grep, just reference it.

The End

Atkins Report: Be careful of what you ask for
• Suggestions (gbell):
– Centers should be re-centered in light of data versus flops
– Overall re-architecture based on user need and technology
– A highly distributed structure aligned with users, who plan their own facilities
– Very skeptical of "gridized" projects, e.g. TeraGrid, GGF
– Training in the use of databases is needed! It will yield more productivity than another generation of computers.
• The Atkins Report:
– A $1.02 billion per year recommendation to fund research, buy software, and spend $600M to build and maintain more centers that are certain to be obsolete and non-productive.

Summary to Atkins Report (2/15/02 15:00, gbell)
• Same old concerns: "I don't have as many flops as users at the national labs." Many facilities should be distributed, with build-it-yourself Beowulf clusters to get extraordinary cycles and bytes.
• Centers need to be re-centered; see Bell & Gray, "What's Next in High Performance Computing?", Comm. ACM, Feb. 2002, pp. 91-95.
• Scientific computing needs re-architecting based on networking, communication, computation, and storage.
• Centrality versus distribution depends on costs and the nature of the work, e.g. instrumentation that generates lots of data. (The last-mile problem is significant.)
– A Fedex'd hard drive is low cost: the cost of the hard drive is less than the network cost. The net is very expensive! (A bandwidth sketch follows this slide.)
– Centers' flops and bytes are expensive; distributed is likely to be less so.
– Many sciences need to be reformulated as distributed computing/database problems.
• Network costs (last mile) are a disgrace. A $1 billion boondoggle with NGI, Internet II.
• Grid funding is not in line with the COTS or IETF model. Another very large software project!
• Give funding to scientists in joint grants with tool builders; e.g., the web came from a user.
• Database technology is not understood by users and computer scientists:
– Training, tool funding, and combined efforts are needed, especially when data is large and distributed.
– Equipment, problems, etc. are dramatically outstripping capabilities!
• It is still about software, especially in light of scalable computers that require reformulation into a new // programming model.
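The "Fedex'd hard drive" point above is easy to quantify. A minimal sketch comparing overnight shipment of disks with pushing the same data through a last-mile link; the 2.5 TB "box of ATA disks" and the 24-hour shipment are illustrative assumptions, and 50 Mbps is the low end of the 50-150 Mbps last-mile range noted in the closing notes of this deck.

```python
# Effective bandwidth of shipping disks overnight vs. pushing the same data through a
# last-mile link. The 2.5 TB data volume and 24-hour shipment are assumptions; 50 Mbps
# is the low end of the last-mile range mentioned later in the talk.

data_tb = 2.5          # e.g., ten 250 GB ATA drives (assumption)
ship_hours = 24        # assumed overnight, door to door
link_mbps = 50

megabits = data_tb * 8e6                      # decimal TB -> megabits
ship_mbps = megabits / (ship_hours * 3600)
net_days = megabits / link_mbps / 3600 / 24   # fully dedicated link

print(f"Shipping {data_tb} TB overnight ~ {ship_mbps:.0f} Mbps effective bandwidth")
print(f"The same data over a {link_mbps} Mbps last-mile link: ~{net_days:.1f} days, dedicated")
```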
Atkins Report: the critical challenges
1) Build real synergy between computer and information science research and development and its use in science and engineering research and education;
2) capture the cyberinfrastructure commonalities across science and engineering disciplines;
3) use cyberinfrastructure to empower and enable, not impede, collaboration across science and engineering disciplines;
4) exploit technologies being developed commercially, apply them to research applications, and feed new approaches from the scientific realm back into the larger world;
5) engage social scientists to work constructively with other scientists and technologists.

Atkins Report: Be careful of what you ask for
1. Fundamental research to create advanced cyberinfrastructure ($60M);
2. research on the application of cyberinfrastructure to specific fields of science and engineering research ($100M);
3. acquisition and development of production quality software for cyberinfrastructure and supported applications ($200M);
4. provisioning and operations (including computational centers, data repositories, digital libraries, networking, and application support) ($600M);
5. archives for software ($60M).

NRC Review Panel on High Performance Computing, 11 March 1994
Gordon Bell

Position
• Dual use: exploit parallelism with in situ nodes and networks. Leverage the workstation and mP industrial HW/SW/app infrastructure!
• No teraflop before its time -- it's Moore's Law.
• It is possible to help fund computing: heuristics from federal funding and use (50 computer systems and 30 years).
• Stop "duel use", the genetic engineering of State Computers:
– 10+ years: nil payback, mono use, poor results, and more still to come
– The plan for porting apps to monos will also be ineffective -- apps must leverage, be cross-platform, and be self-sustaining
– Let the "Challenges" choose apps, not mono-use computers
– "Industry" offers better computers, and these are jeopardized
– Users must be free to choose their computers, not funders
– Next-generation State Computers only "approach" industry
– 10 Tflops... why?
• Summary recommendations follow.

Principal Computing Environments circa 1994
[Diagram: four-plus networks and worlds to support -- (1) the '50s IBM and proprietary mainframe world: mainframes, clusters, and 3270 (& PC) terminals on a POTS/3270 switching network; (2) the '70s proprietary mini world, by the '90s the UNIX mini world: minicomputers and UNIX multiprocessor servers operated as traditional minicomputers, with ASCII and PC terminals; (3) the '80s UNIX distributed workstations & servers world: UNIX workstations, NFS servers, and compute and database uni- and mP servers on Ethernet and token-ring LANs (gateways, bridges, routers, hubs); (4) the late-'80s LAN-PC world: PCs (DOS, Windows, NT) with Novell and NT servers. A wide-area data network provides inter-site communication. More than four interconnect and communication standards: POTS and 3270 terminals, WAN standards, two LAN standards, and proprietary clusters.]
Computing Environments circa 2000
[Diagram: NT, Windows, and UNIX personal machines and servers on a wide-area, global ATM network; local and global data communication via ATM (plus 10-100 Mb/s point-to-point Ethernet) and LANs for terminals, PCs, workstations, and servers; "TC = TV + PC" home access via CATV or ATM; legacy mainframes and minicomputers with their terminals surviving as servers; centralized and departmental (UNIX and NT) scalable uni- and mP servers -- NFS, database, compute, print, and communication servers -- built from multiple simple nodes; platforms include x86, PowerPC, SPARC, etc.; a universal high-speed data service using ATM or a successor.]

Beyond Dual & "Duel" Use Technology: Parallelism can and must be free!
HPCS, corporate R&D, and technical users must have the goal of designing, installing, and supporting parallel environments that use and leverage:
• every in situ workstation and multiprocessor server,
• as part of the local... national network.
Parallelism is a capability that all computing environments can and must possess -- not a feature used to segment "mono use" computers.
Parallel applications become a way of computing that utilizes existing, zero-cost resources -- not a subsidy for specialized, ad hoc computers.
Apps follow pervasive computing environments.

Computer genetic engineering and species selection has been ineffective
Although Problem x Machine scalability using SIMD for simulating some physical systems has been demonstrated, given extraordinary resources, the efficacy of larger problems in justifying cost-effectiveness has not. Hamming: "The purpose of computing is insight, not numbers."
The "demand side" Challenge users have the problems and should be the drivers. ARPA's contractors should re-evaluate their research in light of driving needs.
Federally funded "Challenge" apps porting should target multiple platforms, including workstations and compatible multis that support // environments, to ensure portability and to understand mainline cost-effectiveness.
Continued "supply side" programs aimed at designing, purchasing, supporting, sponsoring, and porting apps to specialized State Computers, including programs aimed at 10 Tflops, should be re-directed to networked computing.
Users must be free to choose and buy any computer, including PCs and workstations, workstation clusters, multiprocessor servers, supercomputers, mainframes, and even highly distributed, coarse-grain, data-parallel MPP State Computers.

The teraflops
[Chart, 1988-2000: performance(t) on a log scale from 1 to 10,000, showing Bell Prize results and the DARPA teraflops goal; machines marked include Cray supers ($30M), CM5 at $30M, $120M, and $240M, Intel at $55M and $300M, and NEC.]

We get no teraflop before it's time: it's Moore's Law!
Flops = f(t, $), not f(t): technology plans, e.g. BAA 94-08, ignore the dollars!
All flops are not equal (peak announced performance, PAP, vs. real app performance, RAP):
• Flops(CMOS, PAP)* < C x 1.6**(t-1992) x $, where C = 128 x 10**6 flops / $30,000
• Flops(RAP) = Flops(PAP) x 0.5 for real apps; 1/2 of PAP is a great goal
• Flops(supers) = Flops(CMOS) x 0.1; supers improve 15-40%/year; their higher cost is f(need for profitability, lack of subsidies, volume, SRAM)
• '92-'94: Flops(PAP)/$ = 4K; Flops(supers)/$ = 500; Flops(vsp)/$ = 50M (1.6 GF @ $25)
*Assumes primary and secondary memory size and costs scale with time: memory at $50/MB in 1992-1994 violates Moore's Law; disks at $1/MB in 1993, with size continuing to increase at 60%/year.
When does a teraflop arrive if only $30 million** is spent on a super?
• 1 Tflop CMOS PAP in 1996 (x7.8), with 1 GFlop nodes!!! -- or 1997 if RAP.
• 10 Tflops CMOS PAP will be reached in 2001 (x78), or 2002 if RAP.
How do you get a teraflop earlier?
**A $60-$240 million Ultracomputer reduces the time by 1.5-4.5 years.
(The sketch below runs these numbers.)
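A minimal Python check of the slide's arithmetic, using only the constants given above (C = 128 Mflops per $30,000 in 1992, 60%/year growth, a $30M budget); the solve-for-year step is the only thing added.

```python
import math

# Flops(CMOS, PAP) < C * 1.6**(t - 1992) * dollars, from the slide.
C = 128e6 / 30_000          # 128 Mflops per $30,000 in 1992
growth = 1.6                # 60%/year improvement
budget = 30e6               # $30M super

def peak_flops(year):
    return C * growth ** (year - 1992) * budget

def year_for(target_flops):
    # Invert the formula: the year t at which C * 1.6**(t-1992) * budget reaches the target.
    return 1992 + math.log(target_flops / (C * budget)) / math.log(growth)

print(f"1992: {peak_flops(1992):.3g} flops PAP for $30M")        # ~128 Gflops
print(f"1 Tflop (PAP) arrives ~{year_for(1e12):.1f}")            # ~1996 (x7.8 over 1992)
print(f"10 Tflops (PAP) arrives ~{year_for(1e13):.1f}")          # ~2001 (x78)
```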
Re-engineering HPCS
Genetic engineering of computers has not produced a healthy strain that lives more than one 3-year computer generation. Hence no app base can form.
• No inter-generational MPPs exist with compatible networks and nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!
It is a high price to support a "DARPA U." to learn computer design -- the market is only $200 million and the R&D is billions -- competition works far better.
The inevitable movement to standard networks and nodes cannot and need not be accelerated; these best evolve by a normal market mechanism driven by users.
Dual use of networks and nodes is the path to wide-scale parallelism, not weird computers:
• Networking is free via ATM
• Nodes are free via in situ workstations
• Apps follow pervasive computing environments
Applicability was small and is getting smaller very fast, with many experienced computer companies entering the market with fine products (e.g. Convex/HP, Cray, DEC, IBM, SGI, and Sun) that leverage their R&D and their apps, apps, apps. // environments and apps are needed, but are unlikely because the market is small.
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating the use of weird machines that take away from use, the weaker it becomes.
MPP won; mainstream vendors have adopted multiple CMOS. Stop funding!

Recommendations to HPCS
Goal: by 2000, massive parallelism must exist as a by-product that leverages a wide-scale national network and workstation/multi HW/SW nodes.
• Dual use, not "duel use", of products and technology -- the principle of "elegance": one part serves more than one function. Network companies supply networks; node suppliers use ordinary workstations/servers with existing apps; this leverages $30 billion x 10**6 of R&D.
• Fund high-speed, low-latency networks as a ubiquitous service and the base of all forms of interconnection, from WANs to supercomputers. (In addition, some special networks will exist for small-grain problems.)
• Observe heuristics in future federal program funding scenarios... eliminate direct or indirect product development and mono-use computers.
• Fund Challenges, which in turn fund purchases, not product development.
• Funding or purchase of apps porting must be driven by the Challenges, must build on binary-compatible workstation/server apps to leverage nodes, and must be cross-platform, to benefit multiple vendors and to have cross-platform use.
• Review the effectiveness of State Computers: need, economics, efficacy. Each committee member might visit 2-5 sites using a >>// computer. Review // programming environments and their efficacy in producing and supporting apps.
• Eliminate all forms of State Computers and recommend a balanced HPCS program: nodes and networks, based on industrial infrastructure. Stop funding the development of mono computers, including the 10 Tflops machine. It must be acceptable, and encouraged, to buy any computer for any contract.

Gratis advice for HPCC* & BS*
• D. Bailey warns that scientists have almost lost credibility...
• Focus on a Gigabit NREN with low-overhead connections that will enable multicomputers as a by-product.
• Provide many small, scalable computers vs. large, centralized ones.
• Encourage (revert to) and support not-so-grand challenges.
• Grand Challenges (GCs) need explicit goals and plans -- the disciplines should fund and manage them (demand side)... HPCC will not.
• Fund balanced machines and efforts; stop starting "Viet Nams" (efforts that are rat holes you can't get out of).
• Drop the funding and directed purchase of State Computers.
• Revert to university research -> company and product development.
• Review the HPCC and GC programs' output...
*High Performance Cash Conscriptor; Big Spenders

Scalability: The Platform of HPCS
• The law of scalability
• Three kinds: machine, problem x machine, and generation (t)
• How do flops, memory size, efficiency, and time vary with problem size?
• What is the nature of the problems and work for the computers?
• What about the mapping of problems onto the machines?

Disclaimer
This talk may appear inflammatory... i.e. the speaker may have appeared "to flame". It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers... it just may appear that way.

Backups

Funding Heuristics (50 computers and 30 years of hindsight)
1. The demand side works, i.e. "we need this product/technology for x"; the supply side doesn't work! "Field of Dreams": build it and they will come.
2. Direct funding of university research resulting in technology and product prototypes that are carried over to start up a company is the most effective -- provided the right person and team are backed and have a transfer avenue.
a. Forest Baskett > Stanford, funding various projects (SGI, SUN, MIPS)
b. Transfer to large companies has not been effective
c. Government labs... rare; an accident if something emerges
3. A demanding and tolerant customer or user who "buys" products works best to influence and evolve products (e.g. CDC, Cray, DEC, IBM, SGI, SUN).
a. DOE labs have been effective buyers and influencers (the "Fernbach policy"); it is unclear if the labs are effective product, apps, or process developers.
b. Universities were effective at influencing computing in timesharing, graphics, workstations, AI workstations, etc.
c. ARPA, per se, and its contractors have not demonstrated a need for flops.
d. Universities have failed ARPA in defining work that demands HPCS -- hence they are unlikely to be very helpful as users in the trek to the teraflop.
4. Direct funding of large-scale projects is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN!

Funding Heuristics (2)
5. Funding product development, targeted purchases, and other subsidies to establish "State Companies" in a vibrant and overcrowded market is wasteful, likely to be wrong, and likely to impede computer development (e.g. by having to feed an overpopulated industry). Furthermore, it is likely to have a deleterious effect on a healthy industry (e.g. supercomputers). A significantly smaller universe of computing environments is needed. Cray and IBM are givens; SGI is probably the most profitable technical vendor; HP/Convex is likely to be a contender, and others (e.g. DEC) are trying. No state company (Intel, TMC, Tera) is likely to be profitable and hence self-sustaining.
6. University-company collaboration is a new area of government R&D. So far it hasn't worked, nor is it likely to unless the company invests. It appears to be a way to help a company fund marginal people and projects.
7. CRADAs (co-operative research and development agreements) are very closely allied to direct product development and are equally likely to be ineffective.
8. Direct subsidy of software apps, or of porting apps to one platform (e.g. EMI analysis), is a way to keep marginal computers afloat.
If government funds apps, they must be ported cross-platform!
9. Encourage the use of computers across the board, but discourage designs from those who have not used or built a successful computer.

Scalability: The Platform of HPCS, and why continued funding is unnecessary
• Mono-use computers, aka MPPs, have been, are, and will be doomed.
• The law of scalability.
• Four scalabilities: machine, problem x machine, generation (t), and now spatial.
• How do flops, memory size, efficiency, and time vary with problem size? Does insight increase with problem size?
• What is the nature of the problems and work for monos? What about the mapping of problems onto monos?
• What about the economics of software to support monos?
• What about all the competitive machines, e.g. workstations, workstation clusters, supers, scalable multis, attached processors?

Special, mono-use MPPs are doomed... no matter how much fedspend!
• Special, because they have non-standard nodes and networks -- with no apps. Having not evolved to become mainline, events have overtaken them. It's special purpose if it's only in Dongarra's Table 3.
• Flop rate, execution time, and memory size vs. problem size show limited applicability: only very large scale problems, scaled to cover the inherent, high overhead, fit.
• Conjecture: a properly used supercomputer will provide greater insight and utility because of the apps and generality -- running more, smaller-sized problems with a plan produces more insight.
• The problem domain is limited, and now they have to compete with:
– supers -- which do scalars, fine grain, and real work, and have apps
– workstations -- which do very long grain, are in situ, and have apps
– workstation clusters -- which have identical characteristics and have apps
– low-priced (~$2 million) multis -- which are superior, i.e. shorter grain, and have apps
– scalable multiprocessors -- formed from multis, now in the design stage
• Mono useful (>>//) -- hence illegal, because they are not dual use. "Duel use" -- only useful to keep a high budget intact, e.g. the 10 TF machine.

The Law of Massive Parallelism is based on application scale
• There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, and work -- but this problem may be unrelated to any other problem.
• Corollary: any parallel problem can be scaled to run on an arbitrary network of computers, given enough memory and time. (A scaled-efficiency sketch appears after the next slide.)
• Challenge to theoreticians: how well will an algorithm run?
• Challenge for software: can a package be scalable and portable?
• Challenge to users: do larger scale, faster, longer run times increase problem insight, and not just flops?
• Challenge to HPCC: is the cost justified? If so, let the users do it!

Scalabilities
• Size-scalable computers are designed from a few components, with no bottleneck component.
• Generation-scalable computers can be implemented with the next generation of technology with no rewrite/recompile.
• Problem x machine scalability: the ability of a problem, algorithm, or program to exist at a range of sizes so that it can be run efficiently on a given scalable computer. Although large-scale problems allow high flops, large problems running longer may not produce more insight.
• Spatial scalability: the ability of a computer to be scaled over a large physical space, to use in situ resources.
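To make the "scale the problem until it runs efficiently" argument concrete, here is a minimal sketch using a generic fixed-overhead model (a serial portion plus a per-step communication cost that grows with the machine); the model form and all of its numbers are illustrative assumptions, not measurements from the talk.

```python
import math

def efficiency(p, grain_ops, comm_ops=2_000, serial_ops=500):
    """Scaled (Gustafson-style) efficiency: each of p processors does grain_ops of useful
    work per step, plus an assumed per-step synchronization cost growing ~log2(p)."""
    overhead = serial_ops + comm_ops * math.log2(p)
    return grain_ops / (grain_ops + overhead)

# Growing the grain (problem size per processor) rescues efficiency on any machine size;
# small grains remain hopeless no matter what.
for p in (16, 256, 1024):
    for grain_ops in (1_000, 100_000):   # useful ops per processor per step (assumed)
        e = efficiency(p, grain_ops)
        print(f"p={p:5d}  grain={grain_ops:7d} ops/step  efficiency={e:.2f}")
```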
[Chart: Linpack rate in Gflops vs. matrix order, roughly 1-100 Gflops over orders 100-100,000, comparing a 1K-processor CM5 with a 4-processor SX-3.]
[Chart: Linpack solution time vs. matrix order, roughly 1-1000 seconds over orders 100-100,000, for the same machines.]

GB's estimate of parallelism in engineering and scientific applications
[Chart: log(# of apps) vs. granularity and degree of coupling (computation/communication), spanning workstations, supers, scalable multiprocessors, and massive mCs & WSs; covers dusty decks, new apps, and apps scaled up for supers; the split is roughly scalar 60%, vector 15%, vector mP (<8) 5%, >>// 5%, and embarrassingly/perfectly parallel 15%.]

MPPs are only for unique, very large scale, data-parallel, $M apps
[Chart: application characterization (scalar | vector | vector mP | data // | embarrassingly // | general purpose workload | visualization | apps) vs. cost ($0.01M-$100M), locating supers (s), >>// machines, multiprocessors (mP), and workstations (WS); the mono-use >>// region covers only the data-parallel band.]

Applicability of various technical computer alternatives
[Table: applicability ratings (1-3 or n/a) of PC/WS, multiprocessor servers, supercomputers & mainframes, >>// machines, and WS clusters across the domains scalar, vector, vector mP, data //, embarrassingly & informally //, general purpose workload, visualization, and apps.]
*Current micros are weak but improving rapidly, such that subsequent >>// machines that use them will have no advantage for node vectorization.

Performance using distributed computers depends on problem and machine granularity
Berkeley's LogP model characterizes granularity and needs to be understood, measured, and used. Three parameters are given in terms of processing ops:
• L = latency -- the delay to communicate between apps
• o = overhead -- the time lost transmitting a message
• g = gap -- the minimum time between messages, i.e. 1 / the message-passing rate (bandwidth)
plus P = the number of processors.
(A small cost-model sketch appears after the Economics of Packaged Software slide below.)

Granularity Nomograph (WANs & LANs)
[Nomograph: grain length (100 ops to very coarse) vs. processor speed (10M-1G ops/s) vs. communication latency and synchronization overhead (100 ns to 1 s); regions for WANs (~100 ms) and LANs (~1-10 ms) at very coarse to coarse grain and for MPPs at medium grain; 1993 and 1995 micros, C90, and Ultra marked.]

Granularity Nomograph (MPPs and supers)
[Nomograph: the same axes, with 1993/1995 micros, a 1993 super, Ultra, Cray T3D, VPP 500, and C90 marked; supercomputer memory latencies (~100 ns) fall in the fine-grain region.]

Economics of Packaged Software
• MPP: platform cost >$100K; leverage 1; 1-10 copies
• Minis, mainframes (also evolving high-performance multiprocessor servers): cost $10-100K; leverage 10-100; 1000s of copies
• Workstation: cost $1-100K; leverage 1-10K; 1K-100K copies
• PC: cost $25-500; leverage 50K-1M; 1-10M copies
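Following up the LogP discussion above, a minimal sketch of how the model's parameters translate into a break-even grain size; the parameter values are illustrative assumptions (expressed, as on the slide, in processor operations), not measured machine numbers.

```python
# LogP-style estimate: time for one compute/communicate step per processor, expressed in
# processor operations as on the granularity slides. L (latency), o (overhead), and
# g (gap) values below are illustrative assumptions, not benchmarks.

def step_time(grain_ops, L, o, g):
    """Send one message, receive one, and compute grain_ops of useful work.
    Each message costs o at sender and receiver, is L in flight, and messages
    cannot be issued faster than one per g operations."""
    comm = 2 * o + L + max(g - grain_ops, 0)
    return grain_ops + comm

def efficiency(grain_ops, L, o, g):
    return grain_ops / step_time(grain_ops, L, o, g)

# Assumed settings: a LAN of workstations vs. a tightly coupled MPP-class network.
for name, (L, o, g) in {"LAN cluster": (50_000, 5_000, 10_000),
                        "MPP network": (500, 50, 100)}.items():
    for grain in (1_000, 100_000, 10_000_000):
        print(f"{name:12s} grain={grain:10,d} ops  efficiency={efficiency(grain, L, o, g):.2f}")
```

The numbers echo the nomographs: networked workstations need very coarse grains to run efficiently, while MPP-class networks tolerate medium grains.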
Chuck Seitz comments on multicomputers
"I believe that the commercial, medium-grained multicomputers aimed at ultra-supercomputer performance have adopted a relatively unprofitable scaling track, and are doomed to extinction. ... they may, as Gordon Bell believes, be displaced over the next several years by shared memory multiprocessors. ... For loosely coupled computations at which they excel, ultra-super multicomputers will, in any case, be more economically implemented as networks of high-performance workstations connected by high-bandwidth, local area networks..."

Convergence to a single architecture, with a single address space, using distributed shared memory
• Limited-scalability (<20 processor) multiprocessors >> scalable multiprocessors
• Workstations with 1-4 processors >> workstation clusters and scalable multiprocessors
• Workstation clusters >> scalable multiprocessors
• State Computers built as message-passing multicomputers >> scalable multiprocessors

Evolution of scalable multiprocessors, multicomputers, and workstations to shared-memory computers
[Diagram: only two basic structures exist -- (1) shared-memory multiprocessors (mP), with uniform or non-uniform memory access, and (2) networked, shared-nothing workstations. Limited-scalability, uniform-memory-access mPs (bus- and ring-based multis: DEC, Encore, Sequent, Stratus, SGI, SUN, etc.; mainframes and supers: Convex, Cray, Fujitsu, Hitachi, IBM, NEC) continue as the main line. Scalable, non-uniform-memory-access multiprocessors (smP) add caching for locality (DASH, Convex, Cray T3D, SCI; the all-cache KSR AllCache; experimental: Cm* '75, Butterfly '85, Cedar '88; next generation research: DDM, DASH+). Scalable multicomputers (smC) with hypercube or grid interconnects (Cosmic Cube, iPSC/1, NCUBE, Transputer-based; Fujitsu, Intel, Meiko, NCUBE, TMC, 1985-1994; fine-grain: Mosaic-C, J-machine) and coarse-grain workstation clusters (via special switches and ATM, 1994-95) converge -- driven by competition, high-density multithreaded micros, coherent caches, and high-bandwidth switches and communication protocols -- toward distributed-shared-memory smPs by about 1995.]
Re. Centers Funding -- August 1987 gbell memo to E. Bloch
A fundamentally broken system!
1. Status quo. NSF funds the centers, as we do now, in competition with computer science... use is completely decoupled from the supply... If I make the decision on the trade-off, it will not favor the centers.
2. Central facility. NSF funds ASC as an NSF central facility. This allows the Director, who has purview over all facilities and research, to make the trade-offs across the foundation.
3. NSF Directorate use taxation. NSF funds it via some combination of the directorates on a taxed basis.
The overall budget is set by the ADs. DASC would present the options and administer the program.
4. Directorate-based centers. The centers (all or in part) are "given" to the research directorates. NCAR provides an excellent model. Engineering might also operate a facility. I see great economy, increased quality, and effectiveness coming through specialization of programs, databases, and support.
5. Co-pay. In order to differentially charge for all the upgrades... a tax would be levied on the various allocation awards. Such a tax would be nominal (e.g. 5%) in order to deal with the infinite appetite for new hardware and software. This would also allow other agencies who use the computers to help pay.
6. Manufacturer support. Somehow, I don't see this changing for a long time. A change would require knowing something about the power of the machines so that manufacturers could compete to provide lower costs. BTW: Erich Bloch and I visited Cray Research and succeeded in getting their assistance.
7. Make the centers larger to share support costs. Manufacturers or service providers could contract with the centers to "run" facilities. This would reduce our costs somewhat on a per-machine basis.
8. Fewer physical centers. While we could keep the number of centers constant, greater economy of scale would be created by locating machines in a central facility... LANL and LLNL each run 8 Crays so as to share operators, mass storage, and other hardware and software support. With decent networks, multiple centers are even less important.
9. Simply have fewer centers, but with increasing power. This is the sole argument for centers!
10. Maintain centers at their current or constant core levels for some specified period. Each center would be totally responsible for upgrades, etc., and for its own ultimate fate.
11. Free-market mechanism. Provide grant money for users to buy time. This might cost more, because I am sure we currently get free rides from, e.g., Berkeley, Michigan, Texas, and the increasing number of institutions providing service.

GB Interview as CISE AD, July 1987
• We, together with our program advisory committees, have described the need for basic work in parallel processing to exploit both the research challenge and the plethora of parallel-processing machines that are available and emerging. We believe NSF's role is to sponsor a wide range of software research about these machines.
• This research includes basic computational models more suited to parallelism, new algorithms, standardized primitives (a small number) for addition to the standard programming languages, new languages based on parallel-computation primitives rather than extensions to sequential languages, and new applications that exploit parallelism.
• Three approaches to parallelism are clearly here now:

Bell CISE Interview, July 1987
• First, vector processing has become primitive in supercomputers and mini-supercomputers. In becoming so, it has created a revolution in scientific applications. Unfortunately, computer science and engineering departments are not part of the revolution in scientific computation that is occurring as a result of the availability of vectors. New texts and curricula are needed.
• Second, message-passing models of computation can be used now on workstation clusters, on the various multicomputers such as the Hypercube and VAX clusters, and on shared-memory multiprocessors (from supercomputers to multiple microprocessors).
The Unix pipes mechanism may be acceptable as a programming model, but it has to be an awful lot faster for use in problems where medium-grain parallelism occurs. A remote procedure-call mechanism may be required for control.
• Third, microtasking of a single process using shared-memory multiprocessors must also be used independently. On shared-memory multiprocessors, both mechanisms would be provided and used in forms appropriate to the algorithms and applications. Of course, other forms of parallelism will be used as well, because it is relatively easy to build large, useful SIMD [single-instruction, multiple-data] machines.

Q: What performance do you expect from parallelism in the next decade?
A: Our goal is a factor of 100 in the performance of computing, not counting vectors, within the decade, and a factor of 10 within five years. I think 10 will be easy because it is inherently there in most applications right now. The hardware will clearly be there if the software can support it or the users can use it. Many researchers think this goal is aiming too low; they think it should be a factor of 1 million within 15 years. However, I am skeptical that anything much beyond our goal is achievable in this time period. Still, a factor of 1 million may be possible through SIMD. The reasoning behind the NSF goals is that we have parallel machines now, and on the near horizon, that can actually achieve these levels of performance. Virtually all new computer systems support parallelism in some form (such as vector processing or clusters of computers). However, this quiet revolution demands a major update of computer science, from textbooks and curricula to applications research.

Bell Prize initiated...

NRC/OSTP Report, 18 August 1984
• Summary:
– Pare the poor projects; fund proven researchers.
– Understand the range of technologies required, and especially the Japanese position; also vector processors.
• Heuristics for the program:
– Apply current multiprocessors and multicomputers now.
– Fund software and applications for //ism starting now...

...the report greatly underestimates the position and underlying strength of the Japanese in regard to supercomputers. The report fails to make a substantive case about the U.S. position, based on actual data, in all the technologies from chips (where the Japanese clearly lead) to software engineering. The numbers used for present and projected performance appear to be wildly optimistic, with no real underlying experimental basis. A near-term future based on parallelism other than evolving pipelining is probably not realistic.
The report continues the tradition of recommending that funding science is good, and that in addition everything be funded. The conclusion to continue investing in small-scale fundamental research, without prioritization across the levels of integration or kinds of projects, would seem to be of little value to decision makers. For example, the specific knowledge that we badly need in order to exploit parallelism is not addressed, nor is the issue of how we go about getting this knowledge.
My own belief is that small-scale research around a single researcher is the only style of work we understand or are effective with. This may not get us very far in supercomputers. Infrastructure is more important than wild, new computer structures if the "one professor" research model is to be useful in the supercomputer effort.
While this is useful for generating small startup companies, it also generates basic ideas for improving the Japanese state of the art. This occurs because the Japanese excel at transferring knowledge from the world's research laboratories into their products, and because the U.S. has a declining technological base of product and process (manufacturing) engineering.
The problem of organizing experimental research in the many projects requiring a small laboratory (a Cray-style lab of 40 or so) to actually build supercomputer prototypes isn't addressed; these larger projects have been uniformly disastrous, and the transfer to non-Japanese products negligible. Surprisingly, no one asked Seymour Cray whether there was anything he wanted in order to stay ahead...

1. Narrow the choice of architectures that are to be pursued. There are simply too many poor ones, and too few that can be adequately staffed.
2. Fund only competent, full-time efforts where people have proven ability to build... systems. These projects should be carried out by full-time people, not researchers servicing multiple contracts and doing consulting.
3. New entrants must demonstrate competence by actually building something!
4. Have competitive proposals and projects. If something is really an important area to fund..., then have two projects with... information exchange.
5. Fund balanced hardware/software/systems applications. Doing architectures without user involvement (or understanding) is sure to produce useless toys.
6. Recognize the various types of projects and what the various organizational structures are likely to be able to produce. A strong infrastructure, from chips to systems, supporting individual researchers will continue to produce interesting results. These projects are not more than a dozen people, because professors don't work for or with other professors well.
7. There are many existing multicomputers and multiprocessors that could be delivered to universities to understand parallelism before we go off to build... It is essential to get the Cray X-MP alongside the Fujitsu machine to understand... parallelism associated with multiple processors and pipelines.
8. Build "technology transfer mechanisms" in up front. Transfer doesn't happen automatically.
9. Monitor the progress associated with "the transfer".

Residue

NSF TeraGrid c2003

Notes with Jim
• Applications software and its development is still the no. 1 problem.
• 64-bit addressing will change everything.
• Many machines are used for their large memory.
• Centers will always use all available time: cycle bottom-feeders.
• Allocation is still central, and a bad idea.
• Not big enough centers.
• You can't really architect or recommend a plan unless you have some notion of the needs and costs!
• There is no handle on communication costs, especially for the last mile, where it's 50-150 Mbps, not fiber (10 Gbps). Two orders of magnitude low...
• Beowulf happened as an embarrassment to funders, not because of them.
• Walk through: 7, then 2, now 3 centers. A center is a $50M/year expense when you upgrade!
• NSF: the tools development is questionable. It is part of the business, but I feel very, very uncomfortable developing tools.
• Centers should be functionally specialized around communities and databases.
• Planning, budgets, and allocation must be with the disciplines. People vs. machines.
• TeraGrid problem: having not solved the clusters problem, they move to a larger problem.
• File oriented vs. database (HPSS).