Scaleable WindowsNT?
Jim Gray
Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray

Outline
• What is Scalability?
• Why does Microsoft care about ScaleUp?
• Current ScaleUp Status?
• NT5 & SQL7 & Exchange

Scale Up and Scale Out
• Grow Up with SMP: 4xP6 is now standard
• Grow Out with Cluster: Cluster has inexpensive parts
[Figure labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]

Billions Of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers
• All clients networked to servers
» May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
» Shared Data
» Control
» Coordination
» Communication
[Figure labels: Clients (Mobile clients, Fixed clients); Servers (Server, Super server)]

Thesis
Many little beat few big
[Figure labels: price points $1 million, $100 K, $10 K; processor classes Mainframe, Mini, Micro, Nano, Pico Processor; 1 MM 3]
[Figure labels: storage hierarchy: 1 MB of 10 pico-second ram; 100 MB of 10 nano-second ram; 10 GB of 10 microsecond ram; 1 TB of 10 millisecond disc; 100 TB of 10 second tape archive; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"]
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
• 1 M SPECmarks, 1 TFLOP
• 10^6 clocks to bulk ram
• Event-horizon on chip
• VM reincarnated
• Multiprogram cache, On-Chip SMP

Outline
• What is Scalability
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7 & Exchange

Scalability
• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes
[Figure labels: 1 billion transactions; 100 million web hits; 4 terabytes of data; 1.8 million mail messages]

“Commercial” NT Clusters
• 16-node Tandem Cluster
» 64 cpus
» 2 TB of disk
» Decision support
• 45-node Compaq Cluster
» 140 cpus
» 14 GB DRAM
» 4 TB RAID disk
» OLTP (Debit Credit)
• 1 B tpd (14 k tps)

Tandem Oracle/NT
• 27,383 tpmC
• 71.50 $/tpmC
• 4 x 6 cpus
• 384 disks = 2.7 TB
[Photo: 24 cpu, 384 disks (= 2.7 TB)]

Billion Transactions per Day Project
• Built a 45-node Windows NT Cluster (with help from Intel & Compaq), > 900 disks
• All off-the-shelf parts
• Using SQL Server & DTC distributed transactions
• DebitCredit Transaction
• Each node has 1/20th of the DB
• Each node does 1/20th of the work
• 15% of the transactions are “distributed”

Billion Transactions Per Day Hardware
• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps Switched Ethernet
• 140 cpu, 13 GB, 3 TB.

Type                                  nodes                     CPUs    DRAM         ctlrs   disks                      RAID space
Workflow MTS                          20 Compaq Proliant 2500   20x 2   20x 128 MB   20x 1   20x 1                      20x 2 GB
SQL Server                            20 Compaq Proliant 5000   20x 4   20x 512 MB   20x 4   20x (36x4.2GB, 7x9.1GB)    20x 130 GB
Distributed Transaction Coordinator   5 Compaq Proliant 5000    5x 4    5x 256 MB    5x 1    5x 3                       5x 8 GB
TOTAL                                 45                        140     13 GB        105     895                        3 TB

How Much Is 1 Billion Tpd?
[Chart: Mtpd, Millions of Transactions Per Day]
• 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute)
• ATT
» 185 million calls per peak day (worldwide)
• Visa ~20 million tpd
» 400 million customers
» 250K ATMs worldwide
» 7 billion transactions (card+cheque) in 1994
• New York Stock Exchange
» 600,000 tpd
• Bank of America
» 20 million tpd checks cleared (more than any other bank)
» 1.4 million tpd ATM transactions
• Worldwide Airlines Reservations: 250 Mtpd
[Chart bars: 1 Btpd, Visa, ATT, BofA, NYSE]

Infinite, Ubiquitous Scaling
Redefining the rules

            Per Sec   Per Min   Per Day
10K TPC     166       10,000    14,400,000
1 BTPD      11,574    694,444   1,000,000,000
1.4 BTPD    16,204    972,222   1,400,000,000

• All Shipping Products!
[Diagram labels: IIS, MTS, COM / ActiveX, six SQL nodes]

Microsoft.com: ~150x4 nodes
[Diagram: microsoft.com site topology. Server groups: www.microsoft.com, home.microsoft.com, premium.microsoft.com, register.microsoft.com, register.msn.com, search.microsoft.com, support.microsoft.com, msid.msn.com, cdm.microsoft.com, activex.microsoft.com, FTP.microsoft.com, FTP and HTTP download servers, live SQL servers, SQL consolidators, SQL reporting, and staging servers (Building 11, DMZ, IDC).]
[Diagram: typical configuration 4xP6 (some 4xP5), 256 MB to 1 GB RAM, 12 to 160 GB HD; average cost $25K to $83K per node.]
[Diagram: networks: FDDI rings MIS1 to MIS4, switched Ethernet, primary and secondary Gigaswitch, routers; Internet links 13 DS3 (45 Mb/sec each), 2 OC3, 2 Ethernet (100 Mb/sec each); Japan Data Center, European Data Center, MOSWest admin LAN.]

NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A Super Computer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model

TPC C Improved Fast (250%/year!)
• 40% hardware, 100% software, 100% PC Technology
[Charts: $/tpmC vs time; tpmC vs time; each marked 250 %/year improvement]
[Chart axes: Date, Jan-93 to Jul-98]

Windows NT Versus UNIX
[Chart: tpmC vs Time, NT and Unix, Jan-95 to Jan-97; tpmC axis 0 to 35,000]

Economy Of Scale
[Chart: Transactions/k$ By Vendor; tpmC/k$ (0 to 25) against tpmC (0 to 40,000); series Microsoft/NT, Oracle/Unix, Sybase/Unix, Informix/Unix, DB2/Unix]

Microsoft TerraServer: Scaleup to Big Databases
• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M Sq Km from USGS (1 meter resolution)
» 1 M Sq Km from Russian Space agency (2 m)
• On the web (world’s largest atlas)
• Sell images with commerce server.

Microsoft TerraServer Background
• Earth is 500 Tera-meters square
» USA is 10 tm2
• 100 TM2 land in 70ºN to 70ºS
• We have pictures of 6% of it
» 3 tsm from USGS
» 2 tsm from Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB.
• Slice into 10 KB chunks
• Store chunks in DB
• Navigate with
» Encarta™ Atlas (globe, gazetteer)
» StreetsPlus™ in the USA
• Someday
» multi-spectral image
» of everywhere
» once a day / hour
[Figure labels: 1.8x1.2 km2 tile; 10x15 km2 thumbnail; 20x30 km2 browse image; 40x60 km2 jump image]

Demo
• navigate by coverage map to White House
• Download image
• buy imagery from USGS
• navigate by name to Venice
• buy SPIN2 image & Kodak photo
• Pop out to Expedia street map of Venice
• Mention that DB will double in next 18 months (2x USGS, 2X SPIN2)

The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks Disks
» 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0

Software
[Diagram labels: Web Client (Java Viewer, browser, HTML); The Internet; TerraServer Web Site (Internet Information Server 4.0, Active Server Pages, MTS, Terra-Server Stored Procedures, SQL Server 7, TerraServer DB); Automap Server (Microsoft Automap ActiveX Server, Internet Info Server 4.0); Image Provider Site(s) (Internet Information Server 4.0, Microsoft Site Server EE, Image Delivery Application, SQL Server 7); Image Server]

Image Delivery and Load
• Incremental load of 4 more TB in next 18 months
[Diagram labels: DLT Tape, “tar”, NT DoJob, \Drop’N’, LoadMgr, LoadMgr DB, Wait 4 Load, Backup, ESA, Alpha Server 4100 (two), 100mbit EtherSwitch, 60 4.3 GB Drives, ImgCutter, \Images]
[Load job steps: 10: ImgCutter; 20: Partition; 30: ThumbImg; 40: BrowseImg; 45: JumpImg; 50: TileImg; 55: Meta Data; 60: Tile Meta; 70: Img Meta; 80: Update Place]
[Diagram labels: Enterprise Storage Array (3 x 108 9.1 GB Drives); STK DLT Tape Library; Alpha Server 8400]

TerraServer: A Real “World” Example

71 days      Total      Average    Peak
Hits         728.45m    10.26m     29.27m
Queries      565.09m    7.96m      17.76m
Images       212.02m    2.99m      9.23m
PageViews    376.29m    5.30m      9.20m

• Largest DB on the Web
• 1.3 TB
• 99.95% uptime since July 1
• No downtime, period, in August
• 70% of downtime for SQL software upgrades

NT Clusters (Wolfpack)
• Scale DOWN to PDA: WindowsCE
• Scale UP an SMP: TerraServer
• Scale OUT with a cluster of machines
• Single-system image
» Naming
» Protection/security
» Management/load balance
• Fault tolerance
» “Wolfpack”
• Hot pluggable hardware & software

Symmetric Virtual Server Failover Example
[Diagram labels: Browser; Server 1, Server 2; Web site; Database; Web site files; Database files]

Windows NT 5 (scalability features)
• Better SMP support
• Clusters:
» 16x packs (fault tolerant clusters)
» 100x mobs: arrays for manageability
» SAN/VIA support
• 64 bit addressing for data
» Apps like SQL, Oracle, will use it for data
» 64 bit API to NT comes later (in lab now).
• Remote management (scripting and DCOM)
• Active Directory
• Veritas volume manager
• Many 3rd party HSMs
• Batch support

Microsoft SQL Server 7.0
• Fixes the famous performance bugs
» dynamic record locking
» online backup, quick recovery…
• 64 bit addressing buffer pool
• SMP parallelism and better SMP support
• Built in OLAP (cubes and MOLAP)
• Scale down to Win9x
• Improved management interfaces
• Data transform services (for warehouses)

Outline
• What is Scalability
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7

end
Other slides would be interesting, but...

Interesting “other slides”
No time for them but...
• How much information is there?
• IO bandwidth in the Intel world
• Intelligent disks
• SAN/VIA
• NT Cluster Sort

Some Tera-Byte Databases
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
» 15 PB by 2007
• Federal Clearing house: images of checks
» 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
» 10 Exabytes (???!!)

Info Capture
• You can record everything you see or hear or read.
• What would you do with it?
• How would you organize & analyze it?
» Video: 8 PB per lifetime (10 GBph)
» Audio: 30 TB (10 KBps)
» Read or write: 8 GB (words)
[Scale labels: Kilo: A letter; Mega: A novel; Giga: A Movie; Tera: Library of Congress (text); Peta: LoC (image); Exa: All Disks; Zetta: All Tapes; Yotta]
See: http://www.lesk.com/mlesk/ksg97/ksg.html

Michael Lesk’s Points
www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious Resource: Human attention
• Auto-Summarization and Auto-Search will be key enabling technologies.

PAP (peak advertised performance) vs RAP (real application performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Diagram labels: peak rates along the data path: System Bus 422 MBps; PCI 133 MBps; SCSI 40 MBps; Disk 10-15 MBps; measured 7.2 MB/s at each stage up through File System Buffers and Application Data]

PAP vs RAP
• Reads are easy, writes are hard
• Async write can match WCE.
[Diagram labels: peak / measured rates: 422 MBps / 142 MBps; PCI 133 MBps / 72 MBps; SCSI 40 MBps / 31 MBps; Disks 10-15 MBps / 9 MBps; File System; Application Data]

Bottleneck Analysis
• NTFS Read/Write 12 disk, 4 SCSI, 2 PCI (not measured, we had only one PCI bus available, 2nd one was “internal”)
» ~ 120 MBps Unbuffered read
» ~ 80 MBps Unbuffered write
» ~ 40 MBps Buffered read
» ~ 35 MBps Buffered write
[Diagram labels: Adapter ~30 MBps; PCI ~70 MBps; Memory Read/Write ~150 MBps]

Year 2002 Disks
• Big disk (10 $/GB)
» 3”
» 100 GB
» 150 kaps (k accesses per second)
» 20 MBps sequential
• Small disk (20 $/GB)
» 3”
» 4 GB
» 100 kaps
» 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)

How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
» CORBA? DCOM? IIOP? RMI?
» One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Diagram labels: two protocol stacks, each with Applications; RPC, streams, datagrams; VIAL/VIPL; joined by Wire(s)]

SAN: Standard Interconnect
• LAN faster than memory bus?
• 1 GBps links in lab.
[Figure labels: Gbps Ethernet: 110 MBps; PCI 32: 70 MBps; UW Scsi: 40 MBps]
• 300$ port cost soon
• Port is computer
[Figure labels: FW scsi: 20 MBps; scsi: 5 MBps]

PennySort
• Hardware
» 266 MHz Intel PPro
» 64 MB SDRAM (10ns)
» Dual Fujitsu DMA 3.2GB EIDE
• Software
» NT workstation 4.3
» NT 5 sort
• Performance
» sort 15 M 100-byte records (~1.5 GB)
» Disk to disk
» elapsed time 820 sec
• cpu time = 404 sec
[Pie chart: PennySort Machine (1107$): cpu 32%; Disk 25%; Other 22%; board 13%; Network, Video, floppy 9%; Memory 8%; Cabinet + Assembly 7%; Software 6%]

Cluster Sort
Conceptual Model
• Multiple Data Sources
• Multiple Data Destinations
• Multiple nodes
• Disks -> Sockets -> Disk -> Disk
[Diagram: sources A, B, C each hold a mix of AAA, BBB, CCC records; each destination ends up holding only AAA, only BBB, or only CCC records]

Cluster Install & Execute
• If this is to be used by others, it must be:
» Easy to install
» Easy to execute
• Installations of distributed systems take time and can be tedious. (AM2, GluGuard)
• Parallel Remote execution is non-trivial. (GLUnix, LSF)
How do we keep this “simple” and “built-in” to NTClusterSort?

Remote Install
• Add Registry entry to each remote node.
» RegConnectRegistry()
» RegCreateKeyEx()

Cluster Execution
• Setup:
» MULTI_QI struct
» COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve remote object handle from MULTI_QI struct
• Invoke methods as usual
[Diagram labels: MULTI_QI and COSERVERINFO feed CoCreateInstanceEx(); one HANDLE per node; Sort() invoked on each]
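
The Cluster Sort conceptual model above (several sources that each hold a mix of keys, several destinations that each receive one key range) can be sketched in a few lines. This is a single-process Python illustration under stated assumptions, not NTClusterSort itself: the names `partition_of` and `cluster_sort`, the splitter keys, and the in-memory lists standing in for disks and sockets are all hypothetical.

```python
from bisect import bisect_right


def partition_of(key, splitters):
    """Return the index of the destination node that owns `key`.

    `splitters` is a sorted list of boundary keys (an assumption of this
    sketch); node i receives keys k with splitters[i-1] <= k < splitters[i].
    """
    return bisect_right(splitters, key)


def cluster_sort(sources, splitters):
    """Simulate a partitioned sort over len(splitters) + 1 destinations."""
    destinations = [[] for _ in range(len(splitters) + 1)]
    # Phase 1: each source scans its records and "sends" every record to
    # the destination owning its key range (a list append stands in for
    # a socket write).
    for records in sources:
        for record in records:
            destinations[partition_of(record, splitters)].append(record)
    # Phase 2: each destination sorts what it received, independently of
    # the others.
    return [sorted(received) for received in destinations]


# Three sources, each holding a mix of AAA, BBB and CCC records.
sources = [
    ["AAA", "BBB", "CCC"],
    ["CCC", "AAA", "BBB"],
    ["BBB", "CCC", "AAA"],
]
result = cluster_sort(sources, splitters=["B", "C"])
for node, records in enumerate(result):
    print(node, records)
```

Because the key ranges do not overlap, concatenating the destination outputs in node order gives a globally sorted result with no merge step; choosing splitters that balance the load is the part this sketch leaves out.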