Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research

Substantial progress has been made towards the goal of building supercomputers by composing arrays of commodity processors, disks, and networks into a cluster that provides a single system image. True, vector supers are still 10x faster than commodity processors on certain floating-point computations, but they cost disproportionately more. Indeed, the highest-performance computations are now performed by processor arrays. In the broader context of business and internet computing, processor arrays long ago surpassed mainframe performance, and at a tiny fraction of the cost. This talk first reviews this history and describes the current landscape of scaleable servers in the commercial, internet, and scientific segments. The talk then discusses the Achilles heels of scaleable systems: programming tools and system management. There has been relatively little progress in either area, which suggests some important directions for computer systems research.

Outline
• Scaleability: MAPS
• Scaleup has limits; scaleout for really big jobs
• Two generic kinds of computing: many little & few big
• Many little has a credible programming model – tp, web, mail, file server, … all based on RPC
• Few big has had marginal success (best is DSS)
• Rivers and objects

Scaleability: Scale Up and Scale Out
• Grow up with SMP: a 4xP6 is now the standard SMP super server
• Grow out with a cluster: the cluster is built from inexpensive parts
• Personal system, to departmental server, to cluster of PCs

Key Technologies
• Hardware
– commodity processors
– nUMA
– smart storage
– SAN/VIA
• Software
– directory services
– security domains
– process/data migration
– load balancing
– fault tolerance
– RPC/objects
– streams/rivers

MAPS: The Problems
• Manageability: N machines are N times harder to manage
• Availability: N machines fail N times more often
• Programmability: N machines are 2N times harder to program
• Scaleability: N machines cost N times more but do little more work

Manageability
• Goal: systems manage themselves; N systems are as easy to manage as one
• Some progress:
– distributed name servers (transparent naming)
– distributed security
– auto-cooling of disks
– auto scheduling and load balancing
– global event log (reporting)
– automation of most routine tasks
• Still very hard and application-specific

Availability
• Redundancy allows failover/migration (of processes, disks, and links)
• Good progress on the technology (theory and practice)
• Migration is also good for load balancing
• The transaction concept helps exception handling

Programmability & Scaleability
• That's what the rest of this talk is about
• Success on embarrassingly parallel jobs: file server, mail, transactions, web, crypto
• Limited success on "batch": relational DBMSs, PVM, …
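A concrete reading of the Availability bullets above (redundancy plus failover, with transactions making retry safe), as a minimal C++ sketch. This is illustrative code only: the replica names and the fake request are invented, and a library RPC stub would replace the lambda.

// Redundancy plus retry lets a client fail over from a dead replica to a
// live one; if the request is all-or-nothing, retrying it is safe.
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical request type: returns a reply string or throws on failure.
using Request = std::function<std::string(const std::string&)>;

// Try each replica in turn; the first one that answers wins.
std::string CallWithFailover(const std::vector<std::string>& replicas,
                             const Request& request) {
    for (const auto& server : replicas) {
        try {
            return request(server);          // RPC to this replica
        } catch (const std::exception& e) {  // node or link failure: try the next one
            std::cerr << "failover: " << server << " failed: " << e.what() << "\n";
        }
    }
    throw std::runtime_error("all replicas failed");
}

int main() {
    std::vector<std::string> replicas = {"nodeA", "nodeB", "nodeC"};
    // Stand-in for a real RPC: pretend nodeA is down.
    auto request = [](const std::string& server) -> std::string {
        if (server == "nodeA") throw std::runtime_error("connection refused");
        return "ok from " + server;
    };
    std::cout << CallWithFailover(replicas, request) << "\n";
}

The retry is only safe if the request is idempotent or transactional (all or nothing), which is the sense in which the transaction concept helps exception handling.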
Outline
• Scaleability: MAPS
• Scaleup has limits; scaleout for really big jobs
• Two generic kinds of computing: many little & few big
• Many little has a credible programming model – tp, web, mail, file server, … all based on RPC
• Few big has had marginal success (best is DSS)
• Rivers and objects

Scaleup Has Limits (chart courtesy of Catharine Van Ingen)
[Chart: Mflop/s per $K versus Mflop/s for LANL Loki (P6 Linux), the NAS Expanded Linux Cluster, Cray T3E, IBM SP, SGI Origin 2000 (195), Sun Ultra Enterprise 4000, and UCB NOW]
• PCs are slow: ~30 Mflops, bus/memory ~200 MBps, IO ~100 MBps
• Supers are ~10x PCs: ~300 Mflops, bus/memory ~2 GBps, IO ~1 GBps
• Vector supers are ~10x supers: ~3 GFlops, bus/memory ~20 GBps, IO ~1 GBps

Loki: Pentium Clusters for Science (http://loki-www.lanl.gov/)
• 16 Pentium Pro processors x 5 Fast Ethernet interfaces + 2 GB RAM + 50 GB disk + 2 Fast Ethernet switches + Linux = 1.2 real Gflops for $63,000 (the 1996 price)
• The Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
• Scientists want cheap MIPS.

Your Tax Dollars At Work: ASCI for Stockpile Stewardship
• Intel/Sandia: 9000 x 1-node PPro
• LLNL/IBM: 512 x 8 PowerPC (SP2)
• LANL/Cray: ?
• Maui Supercomputer Center: 512 x 1 SP2

TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA)
[Chart: number of TOP500 systems by vendor (CRI, SGI, IBM, Convex, HP, Sun, TMC, Intel, DEC, Japanese vector machines, other), June 1993 to June 1998]
TOP500 reports: http://www.netlib.org/benchmark/top500.html

NCSA Super Cluster (http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html)
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II CPUs, 2,096 disks, SAN
• Compaq + HP + Myricom + Windows NT
• A supercomputer for $3M
• Classic Fortran/MPI programming
• DCOM programming model

A Variety of Discipline Codes: Single-Processor Performance, Origin vs. T3E, nUMA vs. UMA (courtesy of Larry Smarr, NCSA)
[Chart: single-processor MFLOPS (0 to 160) on the Origin and the T3E for QMC, RIEMANN, Laplace, QCD, PPM, PIMC, and ZEUS]

Basket of Applications: Average Performance as a Percentage of Linpack Performance (courtesy of Larry Smarr, NCSA)
[Chart: Linpack versus application-average MFLOPS for the T90, C90, SPP2000, SP2-160, Origin 195, and PCA; application averages range from 14% to 33% of Linpack. Application codes: CFD, biomolecular, chemistry, materials, QCD]

Observations
• Uniprocessor RAP << PAP: real application performance << peak advertised performance
• Growth has slowed (Bell Prize):
– 1987: 0.5 GFLOPS
– 1988: 1.0 GFLOPS (1 year later)
– 1990: 14 GFLOPS (2 years later)
– 1994: 140 GFLOPS (4 years later)
– 1998: 604 GFLOPS
– xxx: 1 TFLOPS (5 years?)
• Time gap ≈ 2^(N-1) or 2N-1 years, where N = log10(performance) - 9
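As a rough check (the arithmetic below is not from the talk), take N = log10(performance in FLOPS) - 9 and compare the two candidate expressions for the time gap against the Bell Prize milestones:

\[
N = \log_{10}(\text{performance in FLOPS}) - 9
\]
\[
\begin{array}{lccc}
\text{milestone} & N & 2^{N-1} & 2N-1 \\
\hline
10~\text{GFLOPS} & 1 & 1 & 1 \\
100~\text{GFLOPS} & 2 & 2 & 3 \\
1~\text{TFLOPS} & 3 & 4 & 5
\end{array}
\]

Either reading captures the slide's point: with observed gaps of roughly 1, 2, 4, and perhaps 5 years, each additional factor of ten in delivered performance is taking longer to reach.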
"Commercial" Clusters
• 16-node cluster: 64 CPUs, 2 TB of disk, decision support
• 45-node cluster: 140 CPUs, 14 GB DRAM, 4 TB of RAID disk, OLTP (Debit Credit) at 1 billion transactions per day (14 ktps)
• Oracle/NT: 27,383 tpmC at $71.50/tpmC on 24 CPUs (4 x 6) and 384 disks (2.7 TB)

Microsoft.com: ~150 x 4-CPU nodes
[Diagram: the microsoft.com site: staging, replication, feeder, reporting, and SQL consolidation servers in Building 11 and MOSWest; banks of 4xP5 and 4xP6 web and SQL servers (256 MB to 1 GB RAM, 12 to 160 GB disk, roughly $25K to $83K each) serving www, home, premium, register, msid, search, support, activex, cdm, and FTP/HTTP download; FDDI rings, switched Ethernet, and primary/secondary gigaswitches connecting to the Internet over 13 DS3 links (45 Mb/s each) and 2 OC3 links (100 Mb/s each); plus European and Japanese data centers]

The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha CPUs
• 10 GB DRAM
• 324 x 9.2 GB StorageWorks disks (3 TB raw, 2.4 TB of RAID5)
• STK 9710 tape robot (4 TB)
• Windows NT 4 EE, SQL Server 7.0

TerraServer: Lots of Web Hits
              Total     Average   Peak
  Hits        913 m     10.3 m    29 m
  Queries     735 m      8.0 m    18 m
  Images      359 m      3.0 m     9 m
  Page views  405 m      5.0 m     9 m
• 1 TB: the largest SQL database on the Web
• 99.95% uptime since 1 July 1998; no downtime in August
• No NT failures (ever); most downtime is for SQL software upgrades

HotMail: ~400 computers

Outline
• Scaleability: MAPS
• Scaleup has limits; scaleout for really big jobs
• Two generic kinds of computing: many little & few big
• Many little has a credible programming model – tp, web, mail, file server, … all based on RPC
• Few big has had marginal success (best is DSS)
• Rivers and objects
Two Generic Kinds of Computing
• Many little
– embarrassingly parallel
– fits the RPC model
– fits the partitioned data and computation model
– random placement works OK
– OLTP, file server, email, web, …
• Few big
– sometimes not obviously parallel
– does not fit the RPC model (BIG RPCs)
– scientific computing, simulation, data mining, …

Many Little Programming Model
• Many small requests
• Route requests to the data
• Encapsulate data with procedures (objects)
• Three-tier computing
• RPC is a convenient and appropriate model
• Transactions are a big help in error handling
• Auto-partition (e.g. hash the data and the computation)
• Works fine: software CyberBricks

Object-Oriented Programming: Parallelism From Many Little Jobs
• Gives location transparency
• The TP monitor / ORB / web server multiplexes clients to servers
• Enables distribution
• Exploits embarrassingly parallel apps (transactions)
• HTTP and RPC (DCOM, CORBA, RMI, IIOP, …) are the basis

Few Big Programming Model
• Finding parallelism is hard
– pipelines are short (3x to 6x speedup)
• Spreading objects/data is easy, but getting locality is HARD
• Mapping a big job onto a cluster is hard
• Scheduling is hard: coarse-grained (job) and fine-grained (co-schedule)
• Fault tolerance is hard

Kinds of Parallel Execution
• Pipeline: chain ordinary sequential programs, the output of one stage feeding the input of the next
• Partition: split the inputs N ways, run a copy of the same sequential program on each partition, and merge the outputs M ways

Why Parallel Access To Data?
• At 10 MB/s it takes 1.2 days to scan 1 terabyte
• With 1,000-way parallelism it is a 100-second scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel

Why Are Relational Operators Successful for Parallelism?
• The relational data model gives uniform operators on uniform data streams, closed under composition
• Each operator consumes one or two input streams; each stream is a uniform collection of data
• Sequential data in and out: pure dataflow
• Partitioning some operators (e.g. aggregates, non-equi-join, sort, …) requires innovation
• The result: AUTOMATIC PARALLELISM

Database Systems "Hide" Parallelism
• Automate system management via tools
– data placement
– data organization (indexing)
– periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
– duplexing & failover
– transactions
• Automatic parallelism
– among transactions (locking)
– within a transaction (parallel execution)

SQL: a Non-Procedural Programming Language
• SQL is a functional programming language that describes the answer set
• The optimizer picks the best execution plan:
– the dataflow web (pipelines)
– the degree of parallelism (partitioning)
– other execution parameters (process placement, memory, …)
[Diagram: execution planning: GUI, schema, optimizer, plan, executors, monitor]

Rivers: Partitioned Execution
• Partitioned data gives NATURAL parallelism: spread the computation and IO among processors
[Diagram: a table partitioned A...E, F...J, K...N, O...S, T...Z, with one Count operator per partition feeding a final count]

N x M Way Parallelism
• N inputs, M outputs, no bottlenecks
• Partitioned data, and partitioned and pipelined data flows
[Diagram: five Join/Sort pipelines over partitions A...E, F...J, K...N, O...S, T...Z feeding three Merge operators]
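The partition-and-merge shape above is easy to see in miniature. The following sketch is illustrative code only (std::thread stands in for cluster nodes, and all names are invented): it hash-partitions a small table, lets each "node" compute a partial count over its own partition in parallel, and then merges the partial results, the same split / compute / merge pattern as the partitioned Count and Sort/Merge plans sketched above.

// Partitioned execution in miniature: hash-partition the input, run one
// worker per partition in parallel, then merge the partial results.
// std::thread stands in for cluster nodes; a real system ships the work
// to where the data lives instead of shipping the data.
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <thread>
#include <vector>

int main() {
    const std::size_t kPartitions = 4;
    std::vector<std::string> table = {"apple", "berry", "apple", "cherry",
                                      "berry", "apple", "date",  "cherry"};

    // 1. Partition: route each record to a partition by hashing its key.
    std::vector<std::vector<std::string>> partition(kPartitions);
    for (const auto& row : table)
        partition[std::hash<std::string>{}(row) % kPartitions].push_back(row);

    // 2. Parallel partial aggregation: each "node" counts its own partition.
    std::vector<std::map<std::string, int>> partial(kPartitions);
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < kPartitions; ++p)
        workers.emplace_back([&, p] {
            for (const auto& row : partition[p]) ++partial[p][row];
        });
    for (auto& w : workers) w.join();

    // 3. Merge: combine the partial counts into the final answer.
    std::map<std::string, int> total;
    for (const auto& m : partial)
        for (const auto& [key, count] : m) total[key] += count;

    for (const auto& [key, count] : total)
        std::cout << key << ": " << count << "\n";
}

In a real parallel database the interesting part is that the optimizer, not the programmer, chooses the partitioning and the degree of parallelism; the sketch only shows the execution shape.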
Automatic Parallel Object-Relational DB
Example query:

  Select image
  from landsat
  where date between 1970 and 1990
    and overlaps(location, :Rockies)
    and snow_cover(image) > .7;

• The landsat table holds (date, location, image) rows with temporal, spatial, and image attributes (e.g. 1/2/72 … 4/8/95, 33N 120W … 34N 120W)
• Assign one process per processor/disk:
– find the images with the right date and location
– analyze each image; if it is 70% snow, return it
• The date, location, and image tests run in parallel; the answer is the set of matching images

Data Rivers: Split + Merge Streams
• N producers and M consumers connected by N x M data streams (the river)
• Producers add records to the river; consumers consume records from the river
• Purely sequential programming: the river does flow control and buffering, and does the partitioning and merging of data records
• River = the split/merge in Gamma = the Exchange operator in Volcano / SQL Server

Generalization: Object-Oriented Rivers
• Rivers transport a subclass of record-set (a stream of objects)
– record type and partitioning are part of the subclass
• Node transformers are data pumps
– an object with river inputs and outputs
– late-bound to the record type
• Programming becomes dataflow programming: specify the pipelines
• The compiler/scheduler does the data partitioning and "transformer" placement

NT Cluster Sort as a Prototype
• Uses data generation and sort as the prototypical app
• The "hello world" of distributed processing
• Goal: easy install and execute

PennySort
• Hardware: 266 MHz Intel PPro, 64 MB SDRAM (10 ns), dual Fujitsu DMA 3.2 GB EIDE disks
• Software: NT Workstation 4.3, NT 5 sort
• Performance: sorts 15 M 100-byte records (~1.5 GB), disk to disk, in 820 seconds elapsed (404 seconds of CPU time)

PennySort Machine ($1,107)
[Chart: cost breakdown: CPU 32%, disk 25%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%]

Remote Install
• Add a registry entry on each remote node: RegConnectRegistry(), then RegCreateKeyEx()

Cluster Startup / Execution
• Set up the MULTI_QI and COSERVERINFO structs
• Call CoCreateInstanceEx() to create the remote objects
• Retrieve each remote object handle from its MULTI_QI struct
• Invoke methods (Sort()) as usual

Cluster Sort: Conceptual Model
• Multiple data sources, multiple data destinations, multiple nodes
• Disks -> sockets -> disk -> disk
[Diagram: records labelled A, B, and C scattered across three input nodes are split and shipped so that all the A's land on one node, the B's on another, and the C's on a third]

Summary
• Clusters of hardware CyberBricks
– all nodes are very intelligent; processing migrates to where the power is
– disk, network, and display controllers have full-blown operating systems
– send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them
– the computer is a federated distributed system
• Software CyberBricks
– a standard way to interconnect intelligent nodes
– needs an execution model: partition & pipeline (RPC and rivers)
– needs parallelism
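To make the river model concrete, here is a toy in-process sketch (illustrative only; the River class and its Put/Get/Close methods are invented, and in-memory queues stand in for the SAN). Producers Put records into the river, the river owns the routing, buffering, and end-of-stream handling, and consumers simply Get records, so both ends read as purely sequential code, which is the point of the split/merge river described earlier.

// A data river in miniature: N producers split records across M consumer
// streams; the river owns the queues, buffering, and end-of-stream handling,
// so producer and consumer code stays purely sequential.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

class River {
public:
    explicit River(std::size_t consumers) : queues_(consumers) {}

    // Producers call this; the river routes the record to one consumer stream.
    void Put(const std::string& record) {
        std::size_t q = std::hash<std::string>{}(record) % queues_.size();
        std::lock_guard<std::mutex> lock(mutex_);
        queues_[q].push(record);
        ready_.notify_all();
    }

    // Called once, after all producers are done.
    void Close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        ready_.notify_all();
    }

    // Consumers call this; returns false when their stream is drained.
    bool Get(std::size_t consumer, std::string* record) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [&] { return !queues_[consumer].empty() || closed_; });
        if (queues_[consumer].empty()) return false;
        *record = queues_[consumer].front();
        queues_[consumer].pop();
        return true;
    }

private:
    std::vector<std::queue<std::string>> queues_;
    std::mutex mutex_;
    std::condition_variable ready_;
    bool closed_ = false;
};

int main() {
    const std::size_t kProducers = 2, kConsumers = 3;
    River river(kConsumers);

    std::vector<std::thread> producers, consumers;
    for (std::size_t p = 0; p < kProducers; ++p)   // N producers, plain loops
        producers.emplace_back([&, p] {
            for (int i = 0; i < 5; ++i)
                river.Put("record-" + std::to_string(p) + "-" + std::to_string(i));
        });
    for (std::size_t c = 0; c < kConsumers; ++c)   // M consumers, plain loops
        consumers.emplace_back([&, c] {
            std::string record;
            while (river.Get(c, &record))          // output lines may interleave
                std::cout << "consumer " << c << " got " << record << "\n";
        });

    for (auto& t : producers) t.join();
    river.Close();                                  // signal end of stream
    for (auto& t : consumers) t.join();
}

A production river would also partition streams across machines, apply back-pressure, and merge ordered streams; the sketch only shows that the plumbing, not the application code, owns the parallelism.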
end

What I'm Doing
• TerraServer: a photo of the planet on the web
– a database (not a file system)
– 1 TB now, 15 PB in 10 years
– http://www.TerraServer.microsoft.com/
• Sloan Digital Sky Survey: a picture of the universe
– just getting started; cyberbricks for astronomers
– http://www.sdss.org/
• Sorting:
– one-node PennySort (http://research.microsoft.com/barc/SortBenchmark/)
– multi-node: NT Cluster Sort (shows off the SAN and DCOM)

What I'm Doing
• NT Clusters:
– failover: fault tolerance within a cluster
– NT Cluster Sort: a balanced IO, CPU, and network benchmark
– AlwaysUp: geographical fault tolerance
• RAGS: random testing of SQL systems – a bug finder
• Telepresence
– working with Gordon Bell on "the killer app"
– FileCast and PowerCast
– Cyberversity (an international, on-demand, free university)

Outline
• Scaleability: MAPS
• Scaleup has limits; scaleout for really big jobs
• Two generic kinds of computing: many little & few big
• Many little has a credible programming model – tp, web, file server, mail, … all based on RPC
• Few big has had marginal success (best is DSS)
• Rivers and objects

The Bricks of Cyberspace
• 4 B PC's (1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps net; B = G)
• Cost: $1,000
• Come with: NT, a DBMS, a high-speed net, system management, GUI / OOUI, tools
• Compatible with everyone else
• CyberBricks

Super Server: the 4T Machine
• An array of 1,000 4B machines: 1 billion instructions per second, 1 billion bytes of DRAM, 10 billion bytes of disk, and 1 Bbps of communication lines per node, plus a 1 TB tape robot
• A few megabucks
• The challenge: manageability, programmability, security, availability, scaleability, affordability, as easy as a single system
[Diagram: a CyberBrick (a 4B machine) drawn with a CPU, 50 GB of disc, and 5 GB of RAM]
• Future servers are CLUSTERS of processors and discs
• Distributed database techniques make clusters work

Cluster Vision: Buying Computers by the Slice
• Rack & stack: mail-order components, plugged into the cluster
• Modular growth without limits: grow by adding small modules
• Fault tolerance: spare modules mask failures
• Parallel execution & data search: use multiple processors and disks
• Clients and servers made from the same stuff
• Inexpensive: built with commodity CyberBricks

Nostalgia: Behemoth in the Basement
• Today's PC is yesterday's supercomputer
• Can use LOTS of them
• The main apps have changed: scientific, then commercial, then web
– web & transaction servers
– data mining, web farming

Technology Drivers: Disks
• Disks are on track for 100x in 10 years: a 2 TB 3.5-inch drive
• Shrunk to 1 inch, that is 200 GB
• Does disk replace tape?
• The disk is a supercomputer!