Scaleable Computing
Jim Gray
Microsoft Corporation
Gray@Microsoft.com

Thesis: Scaleable Servers
  Scaleable Servers
    Commodity hardware allows new applications
    New applications need huge servers
    Clients and servers are built of the same “stuff”:
      commodity software and commodity hardware
  Servers should be able to
    Scale up (grow node by adding CPUs, disks, networks)
    Scale out (grow by adding nodes)
    Scale down (can start small)
  Key software technologies
    Objects, Transactions, Clusters, Parallelism

1987: 256 tps Benchmark
  14 M$ computer (Tandem)
  A dozen people
  False floor, 2 rooms of machines
  A 32 node processor array
  Simulate 25,600 clients
  A 40 GB disk array (80 drives)
  Staff: admin expert, hardware experts, network expert, manager, performance expert, DB expert, auditor, OS expert

1988: DB2 + CICS Mainframe 65 tps
  IBM 4391
  Simulated network of 800 clients
  2 M$ computer
  Staff of 6 to do benchmark
  2 x 3725 network controllers
  Refrigerator-sized CPU
  16 GB disk farm (4 x 8 x .5 GB)

1997: 10 years later, 1 Person and 1 box = 1250 tps
  1 Breadbox ~ 5x 1987 machine room
  23 GB is hand-held
  One person does all the work: hardware expert, OS expert, net expert, DB expert, app expert
  Cost/tps is 1,000x less: 25 micro dollars per transaction
  4 x 200 MHz cpu
  1/2 GB DRAM
  12 x 4 GB disk
  3 x 7 x 4 GB disk arrays

What Happened?
  Moore’s law: Things get 4x better every 3 years (applies to computers, storage, and networks)
  New Economics: Commodity
    class           price ($/mips)   software (k$/year)
    mainframe       10,000           100
    minicomputer    100              10
    microcomputer   10               1
  GUI: Human - computer tradeoff
    optimize for people, not computers

What Happens Next
  Last 10 years: 1000x improvement
  Next 10 years: ????
  [Chart: time axis 1985, 1995, 2005]
  Today: text and image servers are free
    25 m$/hit => advertising pays for them
  Future: video, audio, … servers are free
  “You ain’t seen nothing yet!”

Kinds Of Information Processing
                Point-to-point        Broadcast
  Immediate     Conversation, Money   Lecture, Concert    Network
  Timeshifted   Mail                  Book, Newspaper     Database
  It’s ALL going electronic
  Immediate is being stored for analysis (so ALL database)
  Analysis and automatic processing are being added

Why Put Everything In Cyberspace?
  Low rent: min $/byte
  Shrinks time: now or later
  Shrinks space: here or there
  Automate processing: knowbots
  Immediate OR time-delayed
  Point-to-point OR broadcast
  [Diagram labels: Network, Database; Locate, Process, Analyze, Summarize]

Magnetic Storage Cheaper Than Paper
  File cabinet:
    cabinet (four drawer)     250$
    paper (24,000 sheets)     250$
    space (2x3 @ 10$/ft^2)    180$
    total                     700$   3¢/sheet
  Disk:
    disk (4 GB)               800$
    ASCII: 2 mil pages        0.04¢/sheet (80x cheaper)
    Image: 200,000 pages      0.4¢/sheet (8x cheaper)
  Store everything on disk

Databases: Information at Your Fingertips™, Information Network™, Knowledge Navigator™
  All information will be in an online database (somewhere)
  You might record everything you
    Read: 10MB/day, 400 GB/lifetime (eight tapes today)
    Hear: 400MB/day, 16 TB/lifetime (three tapes/year today)
    See: 1MB/s, 40GB/day, 1.6 PB/lifetime (maybe someday)

Database Store ALL Data Types
  The old world:
    Millions of objects
    100-byte objects
    People
      Name    Address
      David   NY
      Mike    Berk
      Won     Austin
  The new world:
    Billions of objects
    Big objects (1 MB)
    Objects have behavior (methods)
    People
      Name    Address   Papers   Picture   Voice
      David   NY
      Mike    Berk
      Won     Austin
  Paperless office
  Library of Congress online
  All information online
    Entertainment
    Publishing
    Business
  WWW and Internet

Billions Of Clients
  Every device will be “intelligent”
  Doors, rooms, cars…
  Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers
  All clients networked to servers
    May be nomadic or on-demand
  Fast clients want faster servers
  Servers provide
    Shared Data
    Control
    Coordination
    Communication
  [Diagram labels: Clients (mobile clients, fixed clients); Servers (server, super server)]

Thesis: Many little beat few big
  [Diagram: price tiers $1 million, $100 K, $10 K; classes Mainframe, Mini, Micro, Nano, Pico Processor; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; label "1 MM 3"]
  Storage hierarchy:
    10 pico-second ram         1 MB
    10 nano-second ram         100 MB
    10 microsecond ram         10 GB
    10 millisecond disc        1 TB
    10 second tape archive     100 TB
  Smoking, hairy golf ball
  How to connect the many little parts?
  How to program the many little parts?
  Fault tolerance?
  1 M SPECmarks, 1TFLOP
  10^6 clocks to bulk ram
  Event-horizon on chip
  VM reincarnated
  Multiprogram cache, On-Chip SMP

Future Super Server: 4T Machine
  Array of 1,000 4B machines
    1 bps processors
    1 BB DRAM
    10 BB disks
    1 Bbps comm lines
    1 TB tape robot
  A few megabucks
  Challenge:
    Manageability
    Programmability
    Security
    Availability
    Scaleability
    Affordability
  As easy as a single system
  Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc
  Future servers are CLUSTERS of processors, discs
  Distributed database techniques make clusters work

Performance = Storage Accesses not Instructions Executed
  In the “old days” we counted instructions and IO’s
  Now we count memory references
  Processors wait most of the time
  70 MIPS
  “real” apps have worse Icache misses, so run at 60 MIPS if well tuned, 20 MIPS if not
  [Chart: Where the time goes: clock ticks used by AlphaSort components; segments: Disc Wait, Sort, OS, Memory Wait, B-Cache Data Miss, I-Cache Miss, D-Cache Miss]

Storage Latency: How Far Away is the Data?
  Storage                  Clock Ticks   Distance      Time
  Registers                1             My Head       1 min
  On Chip Cache            2             This Room
  On Board Cache           10            This Campus   10 min
  Memory                   100           Sacramento    1.5 hr
  Disk                     10^6          Pluto         2 Years
  Tape / Optical Robot     10^9          Andromeda     2,000 Years

The Hardware Is In Place…
  And then a miracle occurs?
  SNAP: scaleable network and platforms
  Commodity-distributed OS built on:
    Commodity platforms
    Commodity network interconnect
  Enables parallel applications

Thesis: Scaleable Servers
  Scaleable Servers
    Commodity hardware allows new applications
    New applications need huge servers
    Clients and servers are built of the same “stuff”:
      commodity software and commodity hardware
  Servers should be able to
    Scale up (grow node by adding CPUs, disks, networks)
    Scale out (grow by adding nodes)
    Scale down (can start small)
  Key software technologies
    Objects, Transactions, Clusters, Parallelism

Scaleable Servers: BOTH SMP And Cluster
  Grow up with SMP; 4xP6 is now standard
  Grow out with cluster
  Cluster has inexpensive parts
  [Diagram: personal system, departmental server, SMP super server; cluster of PCs]

SMPs Have Advantages
  Single system image: easier to manage, easier to program
    threads in shared memory, disk, Net
  4x SMP is commodity
  Software capable of 16x
  Problems:
    >4 not commodity
    Scale-down problem (starter systems expensive)
    There is a BIGGEST one
  [Diagram: personal system, departmental server, SMP super server]

Building the Largest Node
  There is a biggest node (size grows over time)
  Today, with NT, it is probably 1TB
  We are building it (with help from DEC and SPIN2)
    1 TB GeoSpatial SQL Server database (1.4 TB of disks = 320 drives)
    30K BTU, 8 KVA, 1.5 metric tons
  Will put it on the Web as a demo app: the 1-TB home page, www.SQL.1TB.com
    10 meter image of the ENTIRE PLANET
    2 meter image of interesting parts (2% of land)
    One pixel per meter = 500 TB uncompressed
    Better resolution in US (courtesy of USGS)
  [Diagram: 1-TB SQL Server DB; satellite and aerial photos; support files]

What’s TeraByte?
  1 Terabyte:
    1,000,000,000 business letters      150 miles of book shelf
    100,000,000 book pages              15 miles of book shelf
    50,000,000 FAX images               7 miles of book shelf
    10,000,000 TV pictures (mpeg)       10 days of video
    4,000 LandSat images                16 earth images (100m)
    100,000,000 web pages               10 copies of the web HTML
  Library of Congress (in ASCII) is 25 TB
  1980: $200 million of disc (10,000 discs), $5 million of tape silo (10,000 tapes)
  1997: 200 k$ of magnetic disc (48 discs), 30 k$ nearline tape (20 tapes)
  Terror Byte!

TB DB User Interface

TPC-C Web-Based Benchmarks
  Client is a Web browser (7,500 of them!)
  Submits Order, Invoice, Query to server via Web page interface
  Web server translates to DB
  SQL does DB work
  Net: easy to implement, performance is GREAT!
  [Diagram: browser, HTTP, IIS = Web, ODBC, SQL]

TPC-C Shows How Far SMPs Have Come
  Performance is amazing:
    2,000 users is the min!
    30,000 users on a 4x12 alpha cluster (Oracle)
  Peak Performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
  Best Price/Perf: 6,712 tpmC @ $65/tpmC (MS SQL/DEC/Intel)
  Graphs show UNIX high price & diseconomy of scaleup
  [Chart: tpmC & Price Performance (only "best" data shown for each vendor); x-axis: tpmC, 0 to 20,000; y-axis: $/tpmC, 0 to 400; series: DB2, Informix, MS SQL Server, Oracle, Sybase]

TPC-C SMP Performance
  SMPs do offer speedup, but 4x P6 is better than some 18x MIPSco
  [Charts: tpmC vs CPUs; panels: SUN Scaleability, SQL Server; x-axis: CPUs, 0 to 20; y-axis: tpmC, 0 to 20,000]

The TPC-C Revolution Shows How Far NT and SQL Server Have Come
  MS SQL Server: Economy of Scale & Low Price
  Better Economy of scale on Windows NT
  Recent Microsoft SQL Server benchmarks are Web-based
  [Chart: tpmC and $/tpmC; x-axis: Performance tpmC, 0 to 8,000; y-axis: Price $/tpmC, $0 to $250; series: DB2, Informix, Microsoft, Oracle, Sybase]

What Happens To Prices?
  No expensive UNIX front end (20$/tpmC)
  No expensive TP monitor software (10$/tpmC)
  => 65$/tpmC
  [Chart: TPC Price/tpmC, broken down by processor, disk, software, net; configurations: Informix on SNI, Oracle on DEC Unix, Oracle on Compaq/NT, Sybase on Compaq/NT, Microsoft on Compaq with Visigenic, Microsoft on HP with Visigenic, Microsoft on Intergraph with IIS, Microsoft on Compaq with IIS]

Grow UP and OUT
  Cluster:
    • a collection of nodes
    • as easy to program and manage as a single node
  [Diagram: personal system, departmental server, SMP super server; targets: 1 Terabyte DB, 1 billion transactions per day]

Clusters Have Advantages
  Clients and servers made from the same stuff
  Inexpensive: Built with commodity components
  Fault tolerance: Spare modules mask failures
  Modular growth: Grow by adding small modules
  Unlimited growth: no biggest one

Windows NT clusters
  Key goals:
    Easy: to install, manage, program
    Reliable: better than a single node
    Scaleable: added parts add power
  Microsoft & 60 vendors defining NT clusters
    Almost all big hardware and software vendors involved
  No special hardware needed, but it may help
  Enables
    Commodity fault-tolerance
    Commodity parallelism (data mining, virtual reality…)
    Also great for workgroups!
  Initial: two-node failover
    Beta testing since December 96
    SAP, Microsoft, Oracle giving demos.
    File, print, Internet, mail, DB, other services
    Easy to manage
    Each node can be 4x (or more) SMP
  Next (NT5): “Wolfpack” is modest size cluster
    About 16 nodes (so 64 to 128 CPUs)
    No hard limit, algorithms designed to go further

SQL Server Failover Using “Wolfpack” Windows NT Clusters
  Each server “owns” half the database
  When one fails…
    The other server takes over the shared disks
    Recovers the database and serves it
  [Diagram: servers A and B, each with private disks, connected by shared SCSI disk strings; clients]

Billion Transactions per Day Project
  Building a 20-node Windows NT Cluster (with help from Intel)
  > 800 disks
  All commodity parts
  Using SQL Server & DTC distributed transactions
  Each node has 1/20th of the DB
  Each node does 1/20th of the work
  15% of the transactions are “distributed”

How Much Is 1 Billion Transactions Per Day?
  1 Btpd = 11,574 tps (transactions per second)
         ~ 700,000 tpm (transactions/minute)
  AT&T: 185 million calls (peak day worldwide)
  Visa ~20 M tpd
    400 M customers
    250,000 ATMs worldwide
    7 billion transactions / year (card+cheque) in 1994
  [Chart: millions of transactions per day (Mtpd), log scale 0.1 to 1,000; bars: NYSE, BofA, Visa, AT&T, 1 Btpd]

Parallelism: The OTHER aspect of clusters
  Clusters of machines allow two kinds of parallelism
    Many little jobs: online transaction processing
      TPC-A, B, C…
    A few big jobs: data search and analysis
      TPC-D, DSS, OLAP
  Both give automatic parallelism

Kinds of Parallel Execution
  Pipeline: Any Sequential Program feeding Any Sequential Program
  Partition: Any Sequential Program replicated
    outputs split N ways
    inputs merge M ways
  Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

Data Rivers: Split + Merge Streams
  N producers, M consumers, N x M data streams
  Producers add records to the river
  Consumers consume records from the river
  Purely sequential programming
  River does flow control and buffering
    does partition and merge of data records
  River = Split/Merge in Gamma = Exchange operator in Volcano.
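The data river described above can be illustrated with a minimal, purely sequential sketch. It assumes integer records and a caller-supplied partition function; the names `river`, `producer`, and `partition` are hypothetical and not taken from the talk, and the flow control and buffering that a real river performs are left out.

```python
from typing import Callable, Iterable, Iterator, List


def river(producers: Iterable[Iterable[int]],
          m: int,
          partition: Callable[[int], int]) -> List[List[int]]:
    """Split every producer's output M ways and merge it per consumer.

    Illustrative sketch only: it drains each producer in turn instead of
    running producers and consumers concurrently.
    """
    streams: List[List[int]] = [[] for _ in range(m)]
    for prod in producers:                 # N producers
        for record in prod:                # each one is a plain sequential program
            # split: the partition function picks the consumer stream
            streams[partition(record) % m].append(record)
    return streams                         # merge: one stream per consumer


def producer(start: int, stop: int) -> Iterator[int]:
    """A sequential program that simply emits records."""
    for i in range(start, stop):
        yield i


# N = 3 producers, M = 2 consumers, so 3 x 2 logical data streams.
producers = [producer(0, 5), producer(5, 10), producer(10, 15)]
streams = river(producers, 2, partition=lambda r: r)

# Each consumer is also a sequential program; here it just sums its records.
totals = [sum(s) for s in streams]
print(totals)
```

Neither the producers nor the consumers know how many peers exist; only the river sees N and M, which is the property that lets ordinary sequential code run in parallel.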
Partitioned Execution
  Spreads computation and IO among processors
  [Diagram: a table partitioned A...E, F...J, K...N, O...S, T...Z, with one Count operator per partition feeding a final Count]
  Partitioned data gives NATURAL parallelism

N x M way Parallelism
  [Diagram: partitions A...E, F...J, K...N, O...S, T...Z, each feeding a Join and a Sort, merged by three Merge operators]
  N inputs, M outputs, no bottlenecks.
  Partitioned Data
  Partitioned and Pipelined Data Flows

The Parallel Law Of Computing
  Grosch's Law: 2x $ is 4x performance
    1 MIPS         1 $
    1,000 MIPS     32 $   (.03 $/MIPS)
  Parallel Law: 2x $ is 2x performance
    1 MIPS         1 $
    1,000 MIPS     1,000 $
  Needs: Linear speedup and linear scale-up
  Not always possible

Thesis: Scaleable Servers
  Scaleable Servers
    Commodity hardware allows new applications
    New applications need huge servers
    Clients and servers are built of the same “stuff”:
      commodity software and commodity hardware
  Servers should be able to
    Scale up (grow node by adding CPUs, disks, networks)
    Scale out (grow by adding nodes)
    Scale down (can start small)
  Key software technologies
    Objects, Transactions, Clusters, Parallelism

The BIG Picture: Components and transactions
  Software modules are objects
  Object Request Broker (a.k.a., Transaction Processing Monitor) connects objects (clients to servers)
  Standard interfaces allow software plug-ins
  Transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated
  [Diagram label: Object Request Broker]

ActiveX and COM
  COM is Microsoft model, engine inside OLE
    ALL Microsoft software is based on COM (ActiveX)
  CORBA + OpenDoc is equivalent
  Heated debate over which is best
  Both share same key goals:
    Encapsulation: hide implementation
    Polymorphism: generic operations, key to GUI and reuse
    Versioning: allow upgrades
    Transparency: local/remote
    Security: invocation can be remote
    Shrink-wrap: minimal inheritance
    Automation:
    easy
  COM now managed by the Open Group

Linking And Embedding
  Objects are data modules; transactions are execution modules
  Link: pointer to object somewhere else
    Think URL in Internet
  Embed: bytes are here
  Objects may be active; can call back to subscribers

Objects Meet Databases
  The basis for universal data servers, access, & integration
  Object-oriented (COM oriented) programming interface to data
  Breaks DBMS into components
  Anything can be a data source
  Optimization/navigation “on top of” other data sources
  A way to componentize a DBMS
  Makes an RDBMS an O-R DBMS (assumes optimizer understands objects)
  [Diagram: DBMS engine over data sources: Database, Spreadsheet, Photos, Mail, Map, Document]

The Pattern: Three Tier Computing
  Clients do presentation, gather input
  Clients do some workflow (Xscript)
  Clients send high-level requests to ORB (Object Request Broker)
  ORB dispatches workflows and business objects -- proxies for client, orchestrate flows & queues
  Server-side workflow scripts call on distributed business objects to execute task
  [Diagram tiers: Presentation, workflow, Business Objects, Database]

The Three Tiers
  Web Client: HTML, VB, Java plug-ins, VBScript, JavaScript
  Internet: HTTP + DCOM
  Middleware: ORB, TP Monitor, Web Server...; VB or Java Script Engine; Object server Pool; VB or Java Virt Machine
  Object & Data server: DCOM (oleDB, ODBC,...)
  Legacy Gateways: IBM

Why Did Everyone Go To Three-Tier?
  Manageability
    Business rules must be with data
    Middleware operations tools
  Performance (scaleability)
    Server resources are precious
    ORB dispatches requests to server pools
  Technology & Physics
    Put UI processing near user
    Put shared data processing near shared data
  [Diagram tiers: Presentation, workflow, Business Objects, Database]

What Middleware Does
  ORB, TP Monitor, Workflow Mgr, Web Server
  Registers transaction programs: workflow and business objects (DLLs)
  Pre-allocates server pools
  Provides server execution environment
  Dynamically checks authority (request-level security)
  Does parameter binding
  Dispatches requests to servers
    parameter binding
    load balancing
  Provides Queues
  Operator interface

Server Side Objects: Easy Server-Side Execution
  Give simple execution environment
  Object gets
    start
    invoke
    shutdown
  Everything else is automatic
  Drag & Drop Business Objects
  [Diagram: a server with Network, Receiver, Queue, Connections, Context, Security, Thread Pool, Service logic, Synchronization, Shared Data, Configuration Management]

A new programming paradigm
  Develop object on the desktop
  Better yet: download them from the Net
  Script work flows as method invocations
  All on desktop
  Then, move work flows and objects to server(s)
  Gives desktop development, three-tier deployment
  Software Cyberbricks

Transactions Coordinate Components (ACID)
  Transaction properties
    Atomic: all or nothing
    Consistent: old and new values
    Isolated: automatic locking or versioning
    Durable: once committed, effects survive
  Transactions are built into modern OSs
    MVS/TM, Tandem TMF, VMS DEC-DTM, NT-DTC

Transactions & Objects
  Application requests transaction identifier (XID)
  XID flows with method invocations
  Object Managers join (enlist) in transaction
  Distributed Transaction Manager coordinates commit/abort

Distributed Transactions Enable Huge Throughput
  Each node capable of 7 KtpmC (7,000 active users!)
  Can add nodes to cluster (to support 100,000 users)
  Transactions coordinate nodes
  ORB / TP monitor spreads work among nodes

Distributed Transactions Enable Huge DBs
  Distributed database technology spreads data among nodes
  Transaction processing technology manages nodes

Thesis: Scaleable Servers
  Scaleable Servers
    Built from Cyberbricks
    Allow new applications
  Servers should be able to
    Scale up, out, down
  Key software technologies
    Clusters (ties the hardware together)
    Parallelism (uses the independent cpus, stores, wires)
    Objects (software CyberBricks)
    Transactions (masks errors)

Computer Industry Laws (Rules of thumb)
  Metcalfe’s law
  Moore’s first law
  Bell’s computer classes (7 price tiers)
  Bell’s platform evolution
  Bell’s platform economics
  Bill’s law
  Software economics
  Grove’s law
  Moore’s second law
  Is info-demand infinite?
  The death of Grosch’s law

Metcalfe’s Law
  $\text{Network Utility} = \text{Users}^2$
  How many connections can it make?
    1 user: no utility
    100,000 users: a few contacts
    1 million users: many on Net
    1 billion users: everyone on Net
  That is why the Internet is so “hot”
  Exponential benefit

Moore’s First Law
  XXX doubles every 18 months: 60% increase per year
    Micro processor speeds
    Chip density
    Magnetic disk density
    Communications bandwidth
      WAN bandwidth approaching LANs
  Exponential growth:
    The past does not matter
    10x here, 10x there, soon you’re talking REAL change
  PC costs decline faster than any other platform
    Volume and learning curves
    PCs will be the building bricks of all future
  [Chart: 1 chip memory size (2 MB to 32 MB); x-axis: 1970 to 2000; y-axis: 8KB to 1GB; bits per chip: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M]

Bumps In The Moore’s Law Road
  DRAM:
    1988: United States anti-dumping rules
    1993-1995: ?price flat
  Magnetic disk:
    1965-1989: 10x/decade
    1989-1996: 4x/3year! 100X/decade
  [Charts: $/MB of DRAM and $/MB of DISK, log scale, 1970 to 2000]

Gordon Bell’s 1975 VAX Planning Model... He Didn’t Believe It!
  $\text{System Price} = 5 \times 3 \times 0.04 \times \text{memory size} / 1.26^{(t-1972)}$
    5x: Memory is 20% of cost
    3x: DEC markup
    .04x: $ per byte
  He didn’t believe the projection: $500 machine
  He couldn’t comprehend the implications
  [Chart: system price in K$, log scale 0.01 K$ to 100,000 K$, 1960 to 2000; curves for memory sizes 16 KB, 64 KB, 256 KB, 1 MB, 8 MB]

Gordon Bell’s Processing, Memories, And Comm: 100 Years
  [Chart: log scale 1.E+00 to 1.E+18, 1947 to 2047; series: Processing, Pri. Mem., Sec. Mem., POTS (bps), Backbone]

Gordon Bell’s Seven Price Tiers
  10$:           wrist watch computers
  100$:          pocket/palm computers
  1,000$:        portable computers
  10,000$:       personal computers (desktop)
  100,000$:      departmental computers (closet)
  1,000,000$:    site computers (glass house)
  10,000,000$:   regional computers (glass castle)
  Super server: costs more than $100,000
  “Mainframe”: costs more than $1 million
  Must be an array of processors, disks, tapes, comm ports

Bell’s Evolution Of Computer Classes
  Technology enables two evolutionary paths:
    1. constant performance, decreasing cost
    2. constant price, increasing performance
  [Chart: log price vs time; classes: Mainframes (central), Minis (dep’t.), WSs, PCs (personals), ??]
  1.26 = 2x/3 yrs -- 10x/decade; 1/1.26 = .8
  1.6 = 4x/3 yrs -- 100x/decade; 1/1.6 = .62

Gordon Bell’s Platform Economics
  Traditional computers: custom or semi-custom, high-tech and high-touch
  New computers: high-tech and no-touch
  [Chart: Price (K$), Volume (K), and Application price by computer type (Mainframe, WS, Browser); log scale 0.01 to 100,000]

Software Economics
  An engineer costs about $150,000/year
  R&D gets [5%…15%] of budget
  Need [$3 million…$1 million] revenue per engineer
  Revenue breakdown (pie charts):
    Company     Revenue        Profit   R&D   SG&A   Tax   Product and Service
    Microsoft   $9 billion     24%      16%   34%    13%   13%
    Intel       $16 billion    22%      8%    11%    12%   47%
    IBM         $72 billion    6%       8%    22%    5%    59%
    Oracle      $3 billion     15%      9%    43%    7%    26%

Software Economics: Bill’s Law
  $\text{Price} = \dfrac{\text{Fixed\_Cost}}{\text{Units}} + \text{Marginal\_Cost}$
  Bill Joy’s law (Sun): don’t write software for less than 100,000 platforms
    @$10 million engineering expense, $1,000 price
  Bill Gates’s law: don’t write software for less than 1,000,000 platforms
    @$10 engineering expense, $100 price
  Examples:
    UNIX versus Windows NT: $3,500 versus $500
    Oracle versus SQL-Server: $100,000 versus $6,000
    No spreadsheet or presentation pack on UNIX/VMS/...
  Commoditization of base software and hardware

Grove’s Law: The New Computer Industry
  Horizontal integration is new structure
  Each layer picks best from lower layer
  Desktop (C/S) market
    1991: 50%
    1995: 75%
  Function          Example
  Operation         AT&T
  Integration       EDS
  Applications      SAP
  Middleware        Oracle
  Baseware          Microsoft
  Systems           Compaq
  Silicon & Oxide   Intel & Seagate

Moore’s Second Law
  The cost of fab lines doubles every generation (three years)
  Money limit hard to imagine:
    $10-billion line
    $20-billion line
    $40-billion line
  Physical limit
    Quantum effects at 0.25 micron now
    0.05 micron seems hard: 12 years, three generations
    Lithography: need Xray below 0.13 micron
  [Chart: $million per fab line, log scale $1 to $10,000, by year 1960 to 2000]

Constant Dollars Versus Constant Work
  Constant work:
    One SuperServer can do all the world’s computations
  Constant dollars:
    The world spends 10% on information processing
    Computers are moving from 5% penetration to 50%
    $300 billion to $3 trillion
    We have the patent on the byte and algorithm

Crossing The Chasm
                 Old technology                     New technology
  New market     Hard: product finds customers      No product, no customers
  Old market     Boring, competitive, slow growth   Hard: customers find product