Supercomputers(t)
Gordon Bell, Bay Area Research Center, Microsoft Corp.
http://research.microsoft.com/users/gbell
Photos courtesy of The Computer Museum History Center (http://www.computerhistory.org)
Please only copy with credit. Copyright G. Bell & TCM History Center.

Supercomputer
- The largest computer at a given time
- Technical use, for science and engineering calculations
- Large government defense, weather, and aero laboratories are the first buyers
- Price is no object
- Market size is 3-5 machines

Growth in Computational Resources Used for UK Weather Forecasting
[Chart: machine power versus year, 1950-2000, on a log scale from 10 to 10 T. Labeled machines include Leo, Mercury, KDF9, 195, 205, and YMP. Overall growth is a factor of 10^10 in 50 years, i.e. 1.58^50, roughly 58% per year.]

What a difference 25 years and spending >10x more makes!
- Artist's view of the 40 Tflops ESRDC, c2002
- LLNL 150 Mflops machine room, c1978

Harvard Mark I, aka IBM ASCC [photo]

"I think there is a world market for maybe five computers."
-- Thomas Watson Senior, Chairman of IBM, 1943

The scientific market is still about that size... 3 computers
- When scientific processing was 100% of the industry, it was a good predictor
- $3 billion: 6 vendors, 7 architectures
- DOE buys 3 very big ($100-$200 M) machines every 3-4 years

Supercomputer price(t)
  Time    $M     Structure                     Example
  1950    1      mainframes                    many...
  1960    3      instruction //sm mainframe    IBM / CDC
  1970    10     SMP, pipelining               7600 / Cray 1
  1980    30     vectors; SCI                  "Crays"
  1990    250    MIMDs: mC, SMP, DSM           "Crays" / MPP
  2000    1,000  ASCI, COTS MPP                Grid, Legion

Supercomputing: speed at any price, using parallelism
- Intra-processor
  - Memory overlap & instruction lookahead
  - Functional parallelism (2-4)
  - Pipelining (10)
  - SIMD a la ILLIAC: a 2-D array of 64 PEs, versus vectors
  - Wide instruction word (2-4)
  - MTA (10-20)
- MIMD: processor replication
  - SMP (4-64)
  - Distributed Shared Memory SMPs (100)
- MIMD: computer replication
  - Multicomputers aka MPP aka clusters (10K)
  - Grid: 100K

High performance architectures timeline
[Timeline chart, 1950-2000: device generations run from vacuum tubes through transistors, MSI (minis), micros, RISC, and the "killer micro" nMicros. Processor overlap and lookahead lead into the Cray era (6600, 7600, Cray 1, X, Y, C, T), with the vector and SMP lines converging; mainframes lead to the "multis" and then DSM (KSR, SGI); clusters run from Tandem and VAX through IBM Sysplex and UNIX; MPP (if n>1,000) covers Ncube, Intel, and IBM, followed by NOW and Grid, with networks of n>10,000.]

High performance architectures timeline (programming)
[Timeline chart, 1950-2000: sequential programming dominates until SIMD and vector machines force parallelization; parallel programming arrives with multicomputers and ultracomputers (10x in price, 100x in parallelism), then the MPP era, 10x MPP, NOW, VLC, and Grid built from "in situ" resources.]
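The growth rate on the UK weather-forecasting chart a few slides back is easy to verify; the short Python calculation below (added here only as an illustration, using Python as a calculator) confirms that a factor of 10^10 over 50 years is about 1.58x, or 58%, per year.

    # A factor of 10^10 spread over 50 years:
    annual_factor = 1e10 ** (1 / 50)
    print(annual_factor)      # ~1.585, i.e. ~58% growth per year
    print(1.58 ** 50)         # ~8.6e9, close to the 10^10 total growth on the chart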
Time line of HPCC contributions
[Two timeline charts, 1955-2010, summarize the same history:]
- Processors: IBM (Stretch, 360, 370; interleaving, overlap, instruction lookahead); CDC/Cray supers (1604, 6600, 7600, Cray 1, vector); DEC minis (PDP-8, PDP-11, VAX, Alpha); Intel micros (8008, 8086/8, 286, 386, 486, Ppro, P2/P3, Merced); RISC and "the killer micros" (MIPS/PowerPC/SPARC); VLIW (Cydrome & Multiflow, both gone); SIMD (Illiac IV, CM1, CM2, MasPar, all gone); multi-threaded architecture (Denelcor, Tera MTA).
- Multiprocessors: cabinet-sized mainframe SMPs (Burroughs B5000, Univac, IBM, Sun, etc.), then the "multis" (Sequent, Encore, etc.), then SMP on a chip; vector SMPs (SMPv) from Cray, NEC, Fujitsu, and Hitachi (XMP, YMP, ..., NEC SX-1...5); Distributed Shared Memory and shared-address multicomputers (KSR, BBN, Sun NUMA, SGI/Cray T3D, T3E, Origin NUMA).
- Multicomputers aka clusters aka MPP: clusters of minis or mainframes (Tandem, VAX clusters, IBM Sysplex, UNIX clusters); MPPs from Intel (iPSC 1, iPSC 2, Paragon, Delta, the 1 Tf and 2 Tf machines), Thinking Machines (CM1, 2, 5), and IBM (SP1, SP2); Caltech/Ncube; workstation clusters (Beowulf, UC Berkeley NOW), leading to the worldwide Grid.

Lehmer UC/Berkeley pre-computer number sieves [photo]

ENIAC, c1946 [photo]

Manchester: the first computer. Baby, Mark I, and Atlas [photo]

von Neumann computers: the RAND Johnniac [photo]

Gene Amdahl's dissertation and first computer [photo]

IBM [photo]

IBM Stretch c1961 & 360/91 c1965 consoles! [photo]

IBM Terabit Photodigital Store, c1967 [photo]

STC: terabytes of storage, c1999 [photo]

Amdahl (aka Fujitsu) version of the 360, c1975 [photo]

IBM ASCI Red @ LLNL [photo]

CDC, ETA, Cray Research, Cray Computer [photo]

Seymour Cray, 1925-1996 [photo]

Circuits and Packaging, Plumbing (bits and atoms) & Parallelism... plus Programming and Problems
- Packaging, including heat removal
- High-level bit plumbing: getting the bits from I/O into memory, through a processor, and back to memory and I/O
- Parallelism
- Programming: O/S and compiler
- Problems being solved

Seymour Cray computers
- 1951: ERA 1103 control circuits
- 1957: Sperry Rand NTDS; to CDC
- 1959: Little Character, to test transistor circuits
- 1960: CDC 1604 (3600, 3800) & 160/160A
- 1964: CDC 6600 (6xxx series)
- 1969: CDC 7600

Cray Research, Cray Computer Corp., and SRC Computer Corp.
- 1976: Cray 1... (1/M, 1/S, XMP, YMP, C90, T90)
- 1985: Cray 2; Cray Computer from Cray Research; GaAs Cray 3 (1993), Cray 4
- 1999: SRC Company large-scale, shared-memory multiprocessor using x86 microprocessors
Cray contributions...
- Creative and productive during his entire career, 1951-1996
- Creator and undisputed designer of supers from the c1960 1604 to the Cray 1, 1S, 1M (c1977), the basis for the SMP vector line: XMP, YMP, T90, C90, 2, 3
- Circuits, packaging, and cooling
- "The mini" as a peripheral computer
- Use of I/O computers, or of the main processor interrupted for I/O, versus I/O processors aka IBM channels

Cray contributions (continued)
- Multi-threaded processor (the 6600 PPUs)
- CDC 6600 functional parallelism leading to RISC... software control
- Pipelining in the 7600, leading to...
- Use of vector registers: adopted by 10+ companies; the mainstream for technical computing
- Established the template for vector supercomputer architecture
- SRC Company use of x86 micros (c1996) that could lead to the largest SMP?

"Cray" clock speed (MHz), number of processors, and peak power (Mflops)
[Chart: log scale from 0.1 to 1,000,000 versus year, 1960-2000.]

CDC 1604 & 6600 [photo]

CDC 7600: pipelining [photo]

CDC 8600 prototype: SMP, scalar, discrete circuits; failed to achieve its clock speed [photo]

CDC STAR... ETA10 [photo]

CDC 7600 & Cray 1 at Livermore [photo: Cray 1, CDC 7600, disks]

Cray 1 #6 from LLNL, located at The Computer Museum History Center, Moffett Field [photo]

Cray 1: 150 kW MG set & heat exchanger [photo]

Cray XMP/4 processor, c1984 [photo]

Cray 2 from NERSC/LBL [photo]

Cray 3, c1995: 500 MHz processor, 32 modules, 1K GaAs ICs per module, 8 processors [photo]

c1970: Beginning the search for parallelism
- SIMDs: Illiac IV
- CDC STAR
- Cray 1

Illiac IV: the first SIMD, c1970s [photo]

SCI (Strategic Computing Initiative): funded by DARPA and aimed at a teraflops!
- The era of state computers and many efforts to build high-speed computers... led to HPCC
- Thinking Machines, Intel supers, Cray T3 series

Minisupercomputers: a market whose time never came.
- Alliant, Convex, Ardent + Stellar = Stardent = 0

Cydrome and Multiflow: prelude to the wide-word parallelism in Merced
- Minisupers with VLIW attack the market
- Like the minisupers, they are repelled
- It's software, software, and software
- Was it a basically good idea that will now work as Merced?

MasPar...
- A less costly CM-1/2, done in silicon chips
- It is repelled; SIMD is the fatal flaw

Thinking Machines: [photo]

Thinking Machines: CM1 & CM5, c1983-1993 [photo]
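The vector machines (Cray) and SIMD machines (Illiac IV, CM-1/2, MasPar) covered above share one programming idea: a single operation applied across a whole array instead of an explicit element-by-element loop. The small Python/NumPy sketch below is added only to illustrate that model (NumPy stands in for the vector or SIMD hardware; it is not from the deck):

    import numpy as np

    n = 1_000_000
    a = 2.5
    x = np.random.rand(n)
    y = np.random.rand(n)

    # Scalar view: one element per instruction, as on a conventional processor.
    z_scalar = np.empty(n)
    for i in range(n):
        z_scalar[i] = a * x[i] + y[i]

    # Data-parallel view: one expression over the whole array -- the model behind
    # vector registers (Cray XMP/YMP) and SIMD processor arrays (Illiac IV, CM, MasPar).
    z_vector = a * x + y

    assert np.allclose(z_scalar, z_vector)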
"In Dec. 1995, computers with 1,000 processors will do most of the scientific processing."
-- Danny Hillis, 1990 (1 paper or 1 company)

The Bell-Hillis Bet: Massive Parallelism in 1995
[Table: TMC versus world-wide supers, compared on three measures: applications, petaflops/month, and revenue.]

Bell-Hillis bet: wasn't paid off!
- My goal was not necessarily just to win the bet!
- Hennessy and Patterson were to evaluate what was really happening...
- Wanted to understand the degree of MPP progress and programmability

KSR 1: the first commercial DSM; NUMA (non-uniform memory access), aka COMA (cache-only memory architecture) [photo]

SCI (c1980s): the Strategic Computing Initiative funded
ATT/Columbia (Non-Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Encore, ESL, GE (like the Connection Machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, Thinking Machines (Connection Machine)

Those who gave their lives in the search for parallelism
Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC, Chen Systems, CHOPP, Cogent, Convex (now HP), Culler, Cray Computers, Cydrome, Denelcor, Elexsi, ETA, E&S Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, KSR, MasPar, Multiflow, Myrias, Ncube, Pixar, Prisma, SAXPY, SCS, SDSA, Supertek (now Cray), Suprenum, Stardent (Ardent + Stellar), Supercomputer Systems Inc., Synapse, Thinking Machines, Vitec, Vitesse, Wavetracer.

NCSA cluster of 8 x 128-processor SGI Origins, c1999 [photo]

Humble beginning: in 1981... would you have predicted this would be the basis of supers? [photo]

Intel's iPSC 1 & Touchstone Delta [photo]

Intel Sandia cluster, 9K PII: 1.8 TF [photo]

GB with the NT, Compaq, HP cluster [photo]

The Alliance LES NT Supercluster
"Supercomputer performance at mail-order prices" -- Jim Gray, Microsoft
- Andrew Chien, CS UIUC --> UCSD
- Rob Pennington, NCSA
- Myrinet network, HPVM, Fast Messages
- Microsoft NT OS, MPI API
- 192 HP 300 MHz + 64 Compaq 333 MHz processors

Our Tax Dollars At Work: ASCI for Stockpile Stewardship
- Intel/Sandia: 9,000 x 1-node PPro
- LLNL/IBM: 512 x 8 PowerPC (SP2)
- LANL/Cray: 6,144 CPUs
- Maui Supercomputer Center: 512 x 1 SP2
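The peak ratings quoted on these ASCI slides are essentially processor count times per-processor peak. The quick Python check below is an added illustration; the ~200 Mflop/s and ~500 Mflop/s per-processor peaks are my assumptions, not figures from the deck:

    # Intel/Sandia: ~9,000 Pentium Pro / Pentium II class processors,
    # assumed ~200 Mflop/s peak each.
    print(9_000 * 200e6 / 1e12)   # ~1.8 Tflop/s, matching the "9K PII: 1.8 TF" slide

    # ASCI Blue Mountain (next slide): 6,144 CPUs, assumed ~500 Mflop/s peak each.
    print(6_144 * 500e6 / 1e12)   # ~3.1 Tflop/s, matching the quoted 3.1 Tflops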
ASCI Blue Mountain: 3.1 Tflops SGI Origin 2000
- 12,000 sq. ft. of floor space
- 1.6 MW of power
- 530 tons of cooling
- 384 cabinets to house 6,144 CPUs with 1,536 GB of memory (32 GB / 128 CPUs)
- 48 cabinets for metarouters
- 96 cabinets for 76 TB of RAID disks
- 36 x HIPPI-800 switch cluster interconnect; 9 cabinets for the 36 HIPPI switches
- About 348 miles of fiber cable

Half of the SGI ASCI computer at LASL, c1999 [photo]

LASL ASCI cluster interconnect
- 18 separate networks: 18 16x16 crossbar switches
- 6 groups of 8 computers each
[Diagram of the 18 switches and 6 computer groups.]

LASL ASCI cluster interconnect [photo]

3 TeraOps makes a difference!
- ASCI Blue Mountain MCNP simulation: 1 mm resolution (256 x 256 x 250), 100 million particles, 2 hours on 6,144 CPUs
- Typical MCNP BNCT simulation: 1 cm resolution (21 x 21 x 25), 1 million particles, 1 hour on a 200 MHz PC

LLNL architecture
[Diagram of the three-sector SP system.]
- System parameters: 3.89 Tflop/s peak, 2.6 TB of memory, 62.5 TB of global disk
- Each SP sector has 488 Silver nodes and 24 HPGN links; the sectors (S, Y, K) are connected by the HPGN plus HiPPI and FDDI
- Sector S: 2.5 GB/node of memory, 24.5 TB of global disk, 8.3 TB of local disk
- Sectors Y and K: 1.5 GB/node of memory, 20.5 TB of global disk, 4.4 TB of local disk
- SST achieved >1.2 Tflop/s on sPPM, on a problem >70x larger than ever solved before!

I/O hardware architecture
[Diagram of a 488-node IBM SP sector: 56 GPFS servers and 432 Silver compute nodes on the system data and control networks, with 24 SP links to the second-level switch and separate SP first-level switches.]
- Each SST sector: local and global I/O file systems; 2.2 GB/s global and 3.66 GB/s local I/O performance
- Full-system mode: application launch over the full 1,464 Silver nodes; 1,048 MPI/us tasks or 2,048 MPI/IP tasks; high-speed, low-latency communication; a single STDIO interface

Fujitsu VPP5000 multicomputer (not available in the U.S.)
- Computing node speed: 9.6 Gflops vector, 1.2 Gflops scalar
- Primary memory: 4-16 GB
- Memory bandwidth: 76 GB/s (9.6 x 64 Gb/s)
- Inter-processor communication: 1.6 GB/s, non-blocking, with global addressing among all nodes
- I/O: 3 GB/s to SCSI, HIPPI, Gigabit Ethernet, etc.
- 1-128 computers deliver up to 1.22 Tflops

NEC SX-5: clustered SMPv (not available in the U.S.)
- SMPv computing nodes: 4-8 processors per computer; processor PAP: 8 Gflops
- Memory
- I/O speed
- Cluster

NEC supers [photo]
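Two of the numbers above are easy to sanity-check with back-of-the-envelope arithmetic. The Python below is an added illustration; the particle-throughput framing and the "one 8-byte word per flop" balance assumption are mine, not claims from the deck:

    # MCNP comparison ("3 TeraOps makes a difference!"):
    fine_cells   = 256 * 256 * 250        # 1 mm mesh: 16,384,000 cells
    coarse_cells = 21 * 21 * 25           # 1 cm mesh: 11,025 cells
    print(fine_cells / coarse_cells)      # ~1,486x more cells
    super_rate = 100e6 / 2                # 100 M particles in 2 hours on 6,144 CPUs
    pc_rate    = 1e6 / 1                  # 1 M particles in 1 hour on a 200 MHz PC
    print(super_rate / pc_rate)           # 50x the particle throughput, on a ~1,500x finer mesh

    # Fujitsu VPP5000 memory bandwidth: 9.6 Gflop/s vector speed at one 8-byte word per flop.
    print(9.6e9 * 8 / 1e9)                # 76.8 GB/s, consistent with the quoted 76 GB/s (9.6 x 64 Gb/s)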
High performance COTS
- Raceway and RACE++ busses
  - ANSI standardized
  - Mapped memory, message passing, "planned direct" transfers
  - Circuit switched; the basic bus interface unit is a 6-port (8-port) bidirectional switch at 40 MB/s (66 MB/s) per port
  - Scales to 4,000 processors
- Skychannel
  - ANSI standardized
  - 320 MB/s; the crossbar backplane supports up to 1.6 GB/s of non-blocking throughput
  - Heart of an Air Force $3M / 256 Gflops system

Mercury & Sky Computers (# & $)
- Rugged system with 10 modules: ~$100K; ~$1K per lb
- Scalable to several thousand processors; ~1-10 Gflops per cubic foot
- 10 9U boards x 4 PPC750s: 440 SPECfp95 in 1 cubic foot (18.5 x 8 x 10.75 in.)
- Sky 384-processor signal processor: #20 on the 'Top 500', $3M
[Photos: Mercury VME Platinum System; Sky PPC daughtercard.]

Brookhaven/Columbia QCD machine, c1999 (1999 Bell Prize for performance/$) [photo]

Brookhaven/Columbia QCD board [photo]

HT-MT: What's 0.55? c1999 [photo]

HT-MT...
- Mechanical: cooling and signals
- Chips: design tools, fabrication
- Chips: memory, PIM
- Architecture: MTA on steroids
- Storage material

HTMT challenges the heuristics for a successful computer
- The Mead 11-year rule: the time between lab appearance and commercial use
- Requires >2 breakthroughs
- The team's first computer or super
- It's government funded... albeit at a university