Scaleable Computing
Jim Gray
Researcher
US-WAT MSR San Francisco
Microsoft Corporation
Gray@Microsoft.com

Outline
  Why scaleable servers?
  Problems and solutions for scaleable servers
    How Internet Information Server revolutionizes OLTP
    “Wolfpack” Windows NT® clusters for scaleability, availability, manageability
    ActiveX™ object model as structuring principle
    OLE DB (DAO) for data sources
    MTX as a new programming paradigm
    MTX as a server
    Distributed transactions to coordinate components
    “Falcon” queues for asynchronous processing

Kinds Of Information Processing
                            Point-to-point         Broadcast
  Immediate (Network)       Conversation, Money    Lecture, Concert
  Timeshifted (Database)    Mail                   Book, Newspaper
  It’s ALL going electronic
  Immediate is being stored for analysis (so ALL database)
  Analysis and automatic processing are being added

Why Put Everything In Cyberspace?
  Low rent: min $/byte
  Shrinks time: now or later
  Shrinks space: here or there
  Automate processing: knowbots
  Point-to-point OR broadcast
  Immediate OR time-delayed
  [Diagram labels: Network; Database; Locate, Process, Analyze, Summarize]

Magnetic Storage Cheaper Than Paper
  File cabinet:
    cabinet (four drawer)     $250
    paper (24,000 sheets)     $250
    space (2x3 @ $10/ft2)     $180
    total                     $700    3¢/sheet
  Disk:
    disk (4 GB)               $800
    ASCII: 2 mil pages        0.04¢/sheet (80x cheaper)
    Image: 200,000 pages      0.4¢/sheet (8x cheaper)
  Store everything on disk

Databases
  Information at Your Fingertips™
  Information Network™
  Knowledge Navigator™
  All information will be in an online database (somewhere)
  You might record everything you
    Read: 10 MB/day, 400 GB/lifetime (eight tapes today)
    Hear: 400 MB/day, 16 TB/lifetime (three tapes/year today)
    See: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)

Database Store ALL Data Types
  The old world:
    Millions of objects
    100-byte objects
    People
      Name     Address
      David    NY
      Mike     Berk
      Won      Austin
  The new world:
    Billions of objects
    Big objects (1 MB)
    Objects have behavior (methods)
    People
      Name     Address    Papers    Picture    Voice
      David    NY
      Mike     Berk
      Won      Austin

  Paperless office
  Library of Congress online
  All information online
  Entertainment
  Publishing
  Business
  WWW and Internet

Billions Of Clients
  Every device will be “intelligent”
  Doors, rooms, cars…
  Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers
  All clients networked to servers
    May be nomadic or on-demand
  Fast clients want faster servers
  Servers provide
    Shared data
    Control
    Coordination
    Communication
  [Diagram labels: Clients (mobile clients, fixed clients); Servers (server, super server)]

Conclusion
  Commodity hardware allows new applications
  New applications need huge servers
  Ideally, clients and servers are built of the same “stuff”
  Servers should be built from
    Commodity software and
    Commodity hardware
  Servers should be able to
    Scale up (grow by adding CPUs, disks, networks)
    Scale down (can start small)

Scaleable Systems: BOTH SMP And Cluster
  Grow up with SMP; 4xP6 is now standard
  Grow out with cluster
  Cluster has inexpensive parts
  [Diagram labels: personal system, departmental server, SMP super server; cluster of PCs]

SMPs Have Advantages
  Single system image: easier to manage, easier to program
    threads in shared memory, disk, Net
  4x SMP is commodity
  Software capable of 16x
  Problems:
    >4 not commodity
    Scale-down problem (starter systems expensive)
    There is a BIGGEST one
  [Diagram labels: personal system, departmental server, SMP super server]

The TPC-C Revolution Shows How Far SMPs Have Come
  Performance is amazing:
    2,000 users is the min!
    30,000 users on a 4x12 alpha cluster (Oracle)
  Prices dropping fast
  [Chart: vendors' tpmC and $/tpmC. Price ($/tpmC, $0 to $500) versus performance (tpmC, 0 to 35,000) for DB2, Informix, Microsoft, Oracle, Sybase. Annotations: “UNIX dis-economy of scale”, “Informix on NT”, “Better”]

TPC-C Web-Based Benchmarks
  SQL Server executes, returns
  Web server builds HTML page
  Sends it to client via HTTP
  6,750 transactions/minute C (tpmC) on 4xP6
  Net: Internet server performance is GREAT!
  [Diagram labels: HTTP; IIS = Web; ODBC; SQL]

What Happens To Prices?
  No expensive UNIX front end ($20/tpmC)
  No expensive TP monitor software ($10/tpmC)
  => $81/tpmC
  [Chart: TPC price/tpmC (scale 0 to 100), broken down into processor, disk, software, and net, for eight configurations: Informix on SNI; Oracle on DEC Unix; Oracle on Compaq/NT; Sybase on Compaq/NT; Microsoft on Compaq with Visigenic; Microsoft on HP with Visigenic; Microsoft on Intergraph with IIS; Microsoft on Compaq with IIS]

Scaleable Systems: Clusters Scale Beyond Largest SMP
  [Diagram labels: personal system, departmental server, SMP super server; cluster of PCs]

Clusters Have Advantages
  Clients and servers made from the same stuff
  Inexpensive: built with commodity components
  Fault tolerance: spare modules mask failures
  Modular growth: grow by adding small modules
  Unlimited growth: no biggest one

Parallelism: The OTHER aspect of clusters
  Clusters of machines allow two kinds of parallelism
    Many little jobs: online transaction processing (TPC-A, B, C…)
    A few big jobs: data search and analysis (TPC-D, DSS, OLAP)
  Both give automatic parallelism

Thesis: Many little beat few big
  [Diagram: machine classes mainframe, mini, micro, nano, pico processor, on a price scale of $1 million, $100 K, $10 K; pico processor is 1 mm^3; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"]
  [Storage hierarchy in the diagram:
    1 MB      10 pico-second ram
    100 MB    10 nano-second ram
    10 GB     10 microsecond ram
    1 TB      10 millisecond disc
    100 TB    10 second tape archive]
  How to connect the many little parts?
  How to program the many little parts?
  Fault tolerance?
  Smoking, hairy golf ball
    1 M SPECmarks, 1 TFLOP
    10^6 clocks to bulk ram
    Event-horizon on chip
    VM reincarnated
    Multiprogram cache, on-chip SMP

Future Super Server: 4T Machine
  Array of 1,000 4B machines
    1 Bips processors
    1 BB DRAM
    10 BB disks
    1 Bbps comm lines
    1 TB tape robot
  A few megabucks
  Challenge:
    Manageability
    Programmability
    Security
    Availability
    Scaleability
    Affordability
  As easy as a single system
  [Diagram: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB disc]
  Future servers are CLUSTERS of processors, discs
  Distributed database techniques make clusters work

The Hardware Is In Place… And then a miracle occurs?
  SNAP: scaleable network and platforms
  Commodity-distributed OS built on:
    Commodity platforms
    Commodity network interconnect
  Enables parallel applications

Two Scaleability Projects: 1-TB DB and 1 billion TPD
  Grow UP and grow OUT
  1 Terabyte DB
  1 billion transactions per day
  [Diagram labels: personal system, departmental server, SMP super server]

Building The Biggest Node
  There is a biggest node (size grows over time)
  Today, with Windows NT, it is probably 1 TB
  We are building it (with help from DEC and SPOT)
    1 TB GeoSpatial SQL Server database (1.4 TB of disks = 280 drives)
    30K BTU, 8 KVA, 1.5 metric tons
  We plan to put it on the Web as a demonstration application
    It will hold satellite images of the entire planet
    One pixel per 10 meters
    Better resolution in U.S. (courtesy of USGS)

What’s A TeraByte?
  1 Terabyte:
    1,000,000,000 business letters     150 miles of book shelf
    100,000,000 book pages             15 miles of book shelf
    50,000,000 FAX images              7 miles of book shelf
    10,000,000 TV pictures (mpeg)      10 days of video
    4,000 LandSat images               16 earth images (100m)
  Library of Congress (in ASCII) is 25 TB
  1980: $200 million of disc (10,000 discs)
        $5 million of tape silo (10,000 tapes)
  1996: $200,000 of magnetic disc (120 discs)
        $50,000 nearline tape (50 tapes)
  Terror Byte!
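
The figures on the terabyte slide can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original deck: it assumes a decimal terabyte (10^12 bytes), which the slide does not state, and the names `TERABYTE`, `items_per_terabyte`, and `implied_bytes` are illustrative only. It derives the average item size implied by each count, plus the change in disc price and per-disc capacity between the 1980 and 1996 rows.

```python
# Assumption (not stated on the slide): 1 TB = 10**12 bytes.
TERABYTE = 10**12

# Item counts per terabyte, copied from the slide.
items_per_terabyte = {
    "business letter": 1_000_000_000,
    "book page": 100_000_000,
    "FAX image": 50_000_000,
    "TV picture (mpeg)": 10_000_000,
    "LandSat image": 4_000,
}

# Average size in bytes that each count implies.
implied_bytes = {
    name: TERABYTE // count for name, count in items_per_terabyte.items()
}

# Cost of one terabyte of disc: $200 million (1980) versus $200,000 (1996).
disc_price_ratio = 200_000_000 // 200_000

# Capacity per disc implied by the disc counts: 10,000 discs versus 120 discs.
mb_per_disc_1980 = TERABYTE // 10_000 // 10**6
gb_per_disc_1996 = TERABYTE / 120 / 10**9

for name, size in implied_bytes.items():
    print(f"{name}: about {size:,} bytes each")
print(f"disc price per TB fell by a factor of {disc_price_ratio:,}")
print(f"1980: {mb_per_disc_1980} MB per disc; 1996: {gb_per_disc_1996:.1f} GB per disc")
```

Under that assumption a business letter comes out at about 1 KB and a LandSat image at about 250 MB, and the disc price per terabyte drops by three orders of magnitude over the 16 years.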
What The 1-Billion TPD Project Is Doing
  Building a 20-node Windows NT Cluster (with help from Intel)
  All commodity parts
  Using SQL Server & DTC distributed transactions
  Each node has 1/20th of the DB
  Each node does 1/20th of the work
  15% of the transactions are “distributed”
  Uses the “Viper” distributed transaction coordinator

How Much Is 1 Billion Transactions Per Day?
  1 Btpd = 11,574 tps (transactions per second)
         ~ 700,000 tpm (transactions/minute)
  AT&T: 185 million calls (peak day worldwide)
  Visa ~20 M tpd
    400 M customers
    250,000 ATMs worldwide
    7 billion transactions / year (card+cheque) in 1994
  [Chart: millions of transactions per day (Mtpd), log scale from 0.1 to 1,000: NYSE, BofA, Visa, AT&T, 1 Btpd]

“Wolfpack” Windows NT Clusters
  The great hope
  Tandem, Teradata, VAX clusters are proprietary
  Microsoft & 60 vendors defining Windows NT Clusters
    Code name “Wolfpack”
    Almost all big hardware and software vendors involved
  No special hardware needed, but it may help

“Wolfpack” clusters
  Key goals:
    Easy: to install, manage, program
    Reliable: more reliable than single node
    Scaleable: added parts add throughput
  Initial “Wolfpack” is two-node failover
    Each node can be 4x (or more) SMP
    File, print, Internet, mail, DB, other services
    Easy to manage
  Next (NT5) “Wolfpack” is modest size cluster
    About 16 nodes (so 64 to 128 CPUs)
    No hard limit, algorithms designed to go further

The BIG Picture: Components and transactions
  Software modules are objects
  Object Request Broker (a.k.a., Transaction Processing Monitor) connects objects (clients to servers)
  Standard interfaces allow software plug-ins
  Transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated
  [Diagram label: Object Request Broker]

Component Object Model
  COM is Microsoft model, engine inside OLE
  ALL Microsoft software is based on COM (ActiveX)
  CORBA + OpenDoc is equivalent
  Heated debate over which is best
  Both share same key goals:
    Encapsulation: hide implementation
    Polymorphism: generic operations, key to GUI and reuse
    Versioning: allow upgrades
    Transparency: local/remote
    Security: invocation can be remote
    Shrink-wrap: minimal inheritance
    Automation: easy
  COM now managed by the Open Group

OLE DB: Objects Meet Databases
  The basis for universal data servers, access, & integration
  OLE DB: object-oriented (COM oriented) programming interface to data
  Breaks DBMS into components
  Anything can be a data source
  Optimization/navigation “on top of” other data sources
  A way to componentize a DBMS
  Makes an RDBMS an O-R DBMS (assumes optimizer understands objects)
  [Diagram: DBMS engine over data sources: database, spreadsheet, photos, mail, map, document]

Transactions Coordinate Components (ACID)
  Programmer’s view: bracket a collection of actions
  A simple failure model
  Only two outcomes:
    Begin()        Begin()        Begin()
    action         action         action
    action         action         action
    action         action         action
    action         Rollback()     Fail!
    Commit()                      Rollback()
    Success!             Failure!
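
The Begin / action / Commit-or-Rollback bracket on the transactions slide can be sketched in a few lines. This is a minimal illustration, not the deck's own code: it uses Python's standard `sqlite3` module with an in-memory database as a stand-in for SQL Server, and the `account` table, the `transfer` function, and the balances are hypothetical. The point is the failure model: either every action inside the bracket takes effect, or none does.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so the explicit
# BEGIN / COMMIT / ROLLBACK statements below are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('David', 100), ('Mike', 50)")


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing unit."""
    conn.execute("BEGIN")
    try:
        # action, action: both updates belong to the same bracket.
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("COMMIT")      # Success: all actions become durable.
        return True
    except Exception:
        conn.execute("ROLLBACK")    # Failure: none of the actions survive.
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


ok = transfer(conn, "David", "Mike", 30)       # commits
failed = transfer(conn, "David", "Mike", 500)  # overdraws, so it rolls back
print(ok, failed, balances(conn))
```

The second transfer performs both updates before it detects the overdraft, yet the rollback leaves the balances exactly as the first transfer committed them.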
Distributed Transactions Enable Huge Throughput
  Each node capable of 7 KtpmC (7,000 active users!)
  Can add nodes to cluster (to support 100,000 users)
  Transactions coordinate nodes
  ORB / TP monitor spreads work among nodes

Distributed Transactions Enable Huge DBs
  Distributed database technology spreads data among nodes
  Transaction processing technology manages nodes

Microsoft Transaction Server: A new programming paradigm
  Develop your ActiveX object on the desktop
  Better yet: download them from the Net
  Script your work flows as invocations of ActiveX objects
  All on desktop
  Then, move work flows and objects to server(s)
  Gives desktop development, three-tier deployment
  [Diagram, design and development phase: presentation layer, workflow layer, application objects, database layer]
  [Diagram, deployment phase: client and server(s); presentation layer, workflow layer, application objects, database layer; MTX execution environment]

MTX Provides Server-Side Execution Environment
  Accepts ActiveX objects
  Manages bindings (it’s an ORB)
  Efficient (pre-bound servers)
  Manages thread pools
  Manages security
  Includes transaction services
  Provides operator interface
  GUI administrative interface
  [Diagram: structure of a scaleable server. Labels: Clients, Network, Receiver, Queue, Connections, Context, Security, Thread Pool, Service logic, Synchronization, Shared Data, Configuration, Management. Annotations: scheduling and load balancing; deadlocks and starvation; directory registration, congestion and flow control; authentication; object handles; national language]

MTX Also Coordinates And Interoperates
  Coordinates distributed transactions
  Client application:
    Begin dist tran:
      Update sales
      Update inventory
      Update warranty
    Commit
  [Diagram: three Windows NT Servers, each with a DTC: SQL Server (Sales), SQL Server (Inventory), Other DBMS (Warranty)]

MTX Also Coordinates And Interoperates
  Interoperates with Internet and with legacy systems
  [Diagram labels: Browser/client; Windows NT Server 4.0; Internet Information Server; MTx; ActiveX Components; SNA Server; LU6.2; CICS/MVS; OLETX; XA; SQL Server; Other DBMS]

“Falcon” Queue Management
  Asynchronous transaction processing
  Many tasks are time-shifted
  “Falcon” gives a QUEUE mechanism
    Message-oriented middleware
    Decouples client from server
    Server works on priority queues
  [Diagram labels: Client, Network, Server, Database]
                            Point-to-point         Broadcast
  Immediate                 conversation, money    lecture, concert
  Time shifted              mail                   book, newspaper
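
The queueing model on the “Falcon” Queue Management slide (client decoupled from server, server working through priority queues) can be sketched with Python's standard library. This is only an in-process analogy, not the Falcon API, which the slides do not show: `client_submit`, `server_loop`, the request strings, and the `STOP` sentinel are all hypothetical names. The client enqueues work and returns at once; the server starts later and drains the queue in priority order.

```python
import itertools
import queue
import threading

STOP = object()                   # sentinel telling the server to shut down
_sequence = itertools.count()     # tie-breaker so payloads are never compared
requests = queue.PriorityQueue()  # lowest priority number is served first
processed = []


def client_submit(priority, body):
    """Client side: enqueue a request and return at once.

    The server does not need to be running yet (time-shifted processing).
    """
    requests.put((priority, next(_sequence), body))


def server_loop():
    """Server side: take requests off the queue in priority order."""
    while True:
        _priority, _seq, body = requests.get()
        if body is STOP:
            return
        processed.append(body)


# The client submits work while no server is running.
client_submit(2, "print monthly statement")
client_submit(1, "debit account")
client_submit(3, "archive log")
client_submit(float("inf"), STOP)  # sorts after every real request

# Later, the server starts and works through the backlog.
server = threading.Thread(target=server_loop)
server.start()
server.join()
print(processed)
```

Requests are handled in priority order rather than arrival order, and neither side ever waits on the other directly; a real message queue would add durability and transactional delivery on top of this.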