Scaleability
Jim Gray, Gray@Microsoft.com
(with help from Gordon Bell, George Spix, and Catharine van Ingen)

Course schedule:
         Mon        Tue           Wed           Thu              Fri
  9:00   Overview   TP mons       Log           Files & Buffers  B-tree
 11:00   Faults     Lock Theory   ResMgr        COM+             Access Paths
  1:30   Tolerance  Lock Techniq  CICS & Inet   Corba            Groupware
  3:30   T Models   Queues        Adv TM        Replication      Benchmark
  7:00   Party      Workflow                    Cyberbrick       Party

A peta-op business app?
• P&G and friends pay for the web (as they paid for broadcast television)
  – no new money, but given Moore's law, traditional advertising revenues can pay for all of our connectivity: voice, video, data…
    (presuming we figure out how to, and are allowed to, let them brand the experience)
• Advertisers pay for impressions and for the ability to analyze them
• A terabyte sort a minute, heading toward one a second
• Bisection bandwidth of ~20 GB/s, heading toward ~200 GB/s (see the sketch below)
• Today's portals are really a tera-op business app

Scaleability: Scale Up and Scale Out
• Grow up with SMP: a 4xP6 SMP super server is now standard
• Grow out with clusters: a cluster is built from inexpensive parts
  – personal system → departmental server → cluster of PCs

There'll be Billions of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous

Billions of Clients Need Millions of Servers
• All clients are networked to servers
  – may be nomadic or on-demand
• Fast clients want faster servers
• Servers provide shared data, control, coordination, communication
• Mobile clients and fixed clients; servers and super servers

Thesis: Many little beat few big
[Figure: the price/size spectrum — mainframe (~$1 M), mini (~$100 K), micro (~$10 K); disc form factors 14", 9", 5.25", 3.5", 2.5", 1.8" — shrinking toward the pico processor, a “smoking, hairy golf ball”: 1 M SPECmarks, 1 TFLOP; 10 ps RAM (1 MB), 10 ns RAM (100 MB), 10 µs RAM (10 GB), 10 ms disc (1 TB), 10 s tape archive (100 TB); 10^6 clocks to bulk RAM; event horizon on chip; VM reincarnated; multi-program cache, on-chip SMP]
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance & management?
• 4 B PCs (1 Bips, 0.1 GB DRAM, 10 GB disk, 1 Gbps net; B = G)

The Bricks of Cyberspace
• Cost $1,000
• Come with
  – NT
  – DBMS
  – High-speed net
  – System management
  – GUI / OOUI
  – Tools
• Compatible with everyone else
• CyberBricks

Computers shrink to a point
• Disks: 100x in 10 years; a 2 TB 3.5” drive
• Shrunk to 1”, that is 200 GB
• The disk is a super computer!
• This is already true of printers and “terminals”
• Prefixes: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
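A back-of-envelope check of the “terabyte sort a minute” and “~20 GB/s bisection bandwidth” figures above; a minimal sketch assuming a one-pass sort (read once, write once) with roughly half the data crossing the machine's bisection during the shuffle — the pass count and shuffle fraction are assumptions, not the slide's.

# What does "a terabyte sort a minute" imply for IO and bisection bandwidth?
# Assumptions (mine): one read pass + one write pass; about half the data
# crosses the bisection while being shuffled to its destination node.
TB = 1e12                       # bytes
seconds = 60.0                  # one terabyte per minute
io_rate = 2 * TB / seconds      # aggregate read + write bandwidth needed
bisection = 0.5 * TB / seconds  # shuffle traffic crossing the bisection

print(f"aggregate IO  ~ {io_rate / 1e9:.0f} GB/s")    # ~33 GB/s
print(f"bisection bw  ~ {bisection / 1e9:.0f} GB/s")  # ~8 GB/s, same order as the ~20 GB/s above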
Super Server: the 4T Machine
• An array of 1,000 4B machines
  – 1 B ips processors, 1 BB of DRAM, 10 BB of disk, 1 Bbps comm lines, a 1 TB tape robot
  – a few megabucks
• The CyberBrick is a 4B machine (CPU, 50 GB disc, 5 GB RAM)
• Challenge: manageability, programmability, security, availability, scaleability, affordability — as easy as a single system
• Future servers are CLUSTERS of processors and discs
• Distributed database techniques make clusters work

Cluster Vision: Buying Computers by the Slice
• Rack & stack: mail-order components, plug them into the cluster
• Modular growth without limits: grow by adding small modules
• Fault tolerance: spare modules mask failures
• Parallel execution & data search: use multiple processors and disks
• Clients and servers made from the same stuff: inexpensive, built with commodity CyberBricks

Systems 30 Years Ago
• A megabuck per million instructions per second (MIPS)
• A megabuck per megabyte
• A sys admin & a data admin per megabuck

Disks of 30 Years Ago
• 10 MB
• Failed every few weeks

1988: IBM DB2 + CICS Mainframe, 65 tps
• IBM 4391
• Simulated network of 800 clients
• $2 M computer
• Staff of 6 to do the benchmark
• 2 x 3725 network controllers
• Refrigerator-sized CPU
• 16 GB disk farm: 4 x 8 x 0.5 GB

1987: Tandem Mini @ 256 tps
• $14 M computer (Tandem)
• A dozen people ($1.8 M/y): hardware experts, performance expert, network expert, DB expert, OS expert, admin expert, auditor, manager
• False floor, 2 rooms of machines
• 32-node processor array
• 40 GB disk array (80 drives)
• Simulated 25,600 clients

1997: 9 years later — 1 person and 1 box = 1,250 tps
• 1 breadbox, ~5x the 1987 machine room
• 23 GB is hand-held
• One person does all the work (hardware, OS, net, DB, and app expert)
• Cost/tps is 100,000x less: 5 micro-dollars per transaction
• 4 x 200 MHz cpus, 1/2 GB DRAM, 12 x 4 GB disks + 3 x 7 x 4 GB disk arrays

What Happened? Where did the 100,000x come from? (see the sketch below)
• Moore's law: 100X (at most)
• Software improvements: 10X (at most)
• Commodity pricing: 100X (at least)
• Total: 100,000X
• The 100x from commodity:
  – DBMS was $100 K to start; now $1 K to start
  – an IBM 390 MIPS is $7.5 K today; an Intel MIPS is $10 today
  – commodity disk is $50/GB vs $1,500/GB
  – …

Web & Server Farms: Server Consolidation per Square Foot
• http://www.exodus.com (charges by Mbps times sqft)

 per sqft     SGI O2K   UE10K   DELL 6350   Cray T3E   IBM SP2   PoPC
 cpus           2.1       4.7      7.0         4.7        5.0     13.3
 specint       29.0      60.5    132.7        79.3       72.3    253.3
 ram            4.1       4.7      7.0         0.6        5.0      6.8
 disks          1.3       0.5      5.2         0.0        2.5     13.3

 (standard package, full height, fully populated, 3.5” disks)
• HP, DELL, and Compaq are trading places for the rack-mount lead
• PoPC: Celeron NLX shoeboxes — 1,000 nodes in 48 (24x2) sq ft, $650 K from Arrow (3-yr warranty!), on-chip at-speed L2
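A small sketch of the arithmetic behind the 100,000x claim above, using only numbers quoted on these slides (the 1987 Tandem and 1988 IBM systems, and the three improvement factors); the $/tps figures are simple division.

# Cost per tps for the 1987/1988 systems above, plus the claimed factors multiplied out.
tandem_1987 = 14_000_000 / 256   # ~ $54,700 per tps
ibm_1988 = 2_000_000 / 65        # ~ $30,800 per tps
moore, software, commodity = 100, 10, 100   # "at most / at most / at least"

print(f"1987 Tandem : ${tandem_1987:,.0f} per tps")
print(f"1988 IBM    : ${ibm_1988:,.0f} per tps")
print(f"claimed improvement: {moore * software * commodity:,}x (Moore x software x commodity)")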
Application Taxonomy
• Technical
  – general-purpose, non-parallelizable codes: PCs have it!
  – vectorizable, and vectorizable & parallelizable: supers & small DSMs
  – hand-tuned, one-of-a-kind and coarse-grain codes: MPP
  – embarrassingly parallel: clusters of PCs
• Commercial (database, database/TP, web host, streaming audio/video)
  – if central control & rich, then IBM or large SMPs; else PC clusters

Performance grows 10x every 5 years, 100x every 10 (1,000x in 20 for supercomputers) — except memory & IO bandwidth.

Peta scale with traditional balance
                                     2000                        2010
 1 PIPS of processing (10^15 ips)    10^6 cpus @ 10^9 ips        10^4 cpus @ 10^11 ips
 10 PB of DRAM                       10^8 chips @ 10^7 bytes     10^6 chips @ 10^9 bytes
 10 PBps memory bandwidth
 1 PBps IO bandwidth                 10^8 disks @ 10^7 Bps       10^7 disks @ 10^8 Bps
 100 PB of disk storage              10^5 disks @ 10^10 B        10^3 disks @ 10^12 B
 10 EB of tape storage               10^7 tapes @ 10^10 B        10^5 tapes @ 10^12 B

“I think there is a world market for maybe five computers.”
— Thomas Watson Senior, Chairman of IBM, 1943

Microsoft.com: ~150 x 4-cpu nodes: a crowd
[Diagram: the www.microsoft.com server farm — roughly 150 four-way P5/P6 servers (typically 256–512 MB RAM, 12–160 GB disk, average cost $25 K–$83 K each) behind routers, switched Ethernet, and FDDI rings at the main Internet data center, the MOSWest admin site, and the European and Japanese data centers. Roles include www, home, premium, register, msid.msn.com, search, support, activex, cdm, FTP/HTTP download servers, live SQL Servers, SQL consolidators and reporting, and staging/replication servers (Building 11 staging, feeder LANs). Internet connectivity: 13 DS3s (45 Mb/s each), 2 OC3s, 100 Mb/s Ethernets, primary and secondary Gigaswitches.]

HotMail (a year ago): ~400 computers — a crowd (now 2x bigger)

DB Clusters (crowds)
• 16-node cluster
  – 64 cpus
  – 2 TB of disk
  – decision support
• 45-node cluster
  – 140 cpus
  – 14 GB DRAM
  – 4 TB RAID disk
  – OLTP (debit/credit)
  – 1 B tpd (14 k tps; see the sketch below)
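The OLTP cluster above quotes 1 B transactions per day as 14 k tps; a minimal sketch of that conversion (the peak-to-average factor is an assumption added to reconcile the two figures, not something stated on the slide).

# Transactions per day vs transactions per second for the OLTP cluster above.
tpd = 1e9
avg_tps = tpd / 86_400                     # ~11,600 tps if the load were perfectly uniform
print(f"uniform rate         : {avg_tps:,.0f} tps")
# The slide's 14 k tps is a bit higher; a modest (assumed) peak factor closes the gap.
print(f"with 1.2x peak factor: {avg_tps * 1.2:,.0f} tps")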
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8 x 400 MHz Alpha cpus
• 10 GB DRAM
• 324 x 9.2 GB StorageWorks disks
  – 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• Windows NT 4 EE, SQL Server 7.0

TerraServer: Lots of Web Hits
[Chart: daily hit, DB query, and image counts by date]

               Total     Average (daily)    Peak
 Hits         1,065 m        8.1 m          29 m
 Queries        877 m        6.7 m          18 m
 Images         742 m        5.6 m          15 m
 Page Views     170 m        1.3 m          6.6 m
 Users          6.4 m         48 k          76 k
 Sessions        10 m         77 k          125 k

• A billion web hits!
• 1 TB, the largest SQL DB on the Web
• 100 queries/sec average, 1,000 queries/sec peak
• 877 M SQL queries so far

TerraServer Availability
• Operating for 13 months
• Unscheduled outage: 2.9 hrs
• Scheduled outage: 2.0 hrs (software upgrades)
• Availability: 99.93% overall up
• No NT failures (ever)
• One SQL 7 Beta 2 bug
• One major operator-assisted outage

Backup / Restore
• Configuration: StorageTek TimberWolf 9710, DEC StorageWorks UltraSCSI RAID-5 array, Legato NetWorker PowerEdition 4.4a, Windows NT Server Enterprise Edition 4.0
• Performance:
   NTFS logical volumes                2
   Bytes backed up                     1.2 TB
   Total time                          7.25 hours
   Tapes consumed                      27 tapes
   Tape drives                         10 drives
   Data throughput                     168 GB/hour
   Average throughput per device       16.8 GB/hour (4.97 MB/sec)

Windows NT Versus UNIX
[Charts: best tpmC results on an SMP vs time (linear and semi-log), UNIX vs NT, Jan-95 through Jan-00]
• The semi-log plot shows a 3x (~2-year) lead by UNIX
• Does not show the Oracle/Alpha cluster at 100,000 tpmC
• All these numbers are off-scale huge (40,000 active users?)

TPC-C Improvements (MS SQL)
• 250%/year improvement on price; 100%/year on performance
  – roughly 40% from hardware, 100% from software, 100% from PC technology
• The bottleneck is the 3 GB address space
[Charts: $/tpmC vs time and tpmC vs time, Jan-94 through Dec-98]

UNIX (dis)Economy of Scale
[Chart: “bang for the buck”, tpmC/k$ vs tpmC, for MS SQL Server, Sybase, Oracle, and Informix — two different pricing regimes; late-1998 prices]

TPC Price/tpmC
[Chart: price/tpmC broken down by processor, disk, software, net, and total/10 for the three systems below]
• Sequent/Oracle: 89 k tpmC @ $170/tpmC
• Sun/Oracle: 52 k tpmC @ $134/tpmC
• HP + NT4 + MS SQL: 16.2 k tpmC @ $33/tpmC

Storage Latency: How far away is the data? (see the sketch below)
 Registers              1 clock tick     my head       1 min
 On-chip cache          2                this room
 On-board cache         10               this resort   10 min
 Memory                 100              Los Angeles   1.5 hr
 Disk                   10^6             Pluto         2 years
 Tape / optical robot   10^9             Andromeda     2,000 years

Thesis: Performance = Storage Accesses, not Instructions Executed
• In the “old days” we counted instructions and IOs
• Now we count memory references
• Processors wait most of the time

Where the time goes: clock ticks used by AlphaSort components
[Chart: time split among Sort, OS, Disc Wait, Memory Wait, B-Cache Data Miss, I-Cache Miss, D-Cache Miss]

Storage Hierarchy (10 levels)
• Registers; cache L1, L2
• Main (1, 2, 3 if NUMA)
• Disk (1 cached, 2)
• Tape (1 mounted, 2)

Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: size vs speed (typical system bytes, 10^0–10^15) and price vs speed ($/MB), for cache, main, disc, and online/nearline/offline tape, against access times from 10^-9 to 10^3 seconds]
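The “how far away is the data” slide above scales storage latency by letting one clock tick equal one minute of human time; a minimal sketch that reproduces that scaling (the tick counts are the slide's, the time formatting is mine).

# "Storage latency: how far away is the data?" -- scale clock ticks so that
# 1 tick = 1 minute of human time (the slide's analogy).
levels = {                      # clock ticks to reach each level (from the slide)
    "registers": 1,
    "on-chip cache": 2,
    "on-board cache": 10,
    "memory": 100,
    "disk": 10**6,
    "tape/optical robot": 10**9,
}
for name, ticks in levels.items():
    minutes = ticks             # 1 tick == 1 minute
    if minutes < 60:
        human = f"{minutes:.0f} min"
    elif minutes < 60 * 24 * 365:
        human = f"{minutes / 60:.1f} hr"
    else:
        human = f"{minutes / (60 * 24 * 365):,.0f} years"
    print(f"{name:18s} {ticks:>12,d} ticks  ~ {human}")
# memory ~ 1.7 hr, disk ~ 2 years, tape robot ~ 2,000 years -- matching the slide.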
Meta-Message: Technology Ratios Are Important
• If everything gets faster & cheaper at the same rate, THEN nothing really changes
• Things getting MUCH BETTER:
  – communication speed & cost: 1,000x
  – processor speed & cost: 100x
  – storage size & cost: 100x
• Things staying about the same:
  – speed of light (more or less constant)
  – people (10x more expensive)
  – storage speed (only 10x better)

Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
• DRAM/disk price ratio: 100:1 to 10:10 to 50:1
[Charts: disk accesses/second, access time (ms), bandwidth (MB/s), and capacity (GB) vs time, 1980–2000; storage price ($/MB) vs time, 1980–2000]

The Pico Processor
• 1 M SPECmarks
• 10 ps RAM: a megabyte
• 10 ns RAM: 10 gigabytes
• 10 µs RAM: 1 terabyte
• 10 ms disc: 100 terabytes
• 10 s tape archive: 100 petabytes — terror bytes!
• 10^6 clocks per fault to bulk RAM
• Event horizon on chip; VM reincarnated; multi-program cache

Bottleneck Analysis
• Drawn to linear scale:
  – disk R/W ~9 MBps
  – memory read/write ~150 MBps; MemCopy ~50 MBps
  – theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits

Bottleneck Analysis: NTFS Read/Write
• 18 Ultra 3 SCSI disks on 4 strings (2x4 and 2x5), 3 x 64-bit PCI
• ~155 MBps unbuffered read (175 raw); ~95 MBps unbuffered write
• Good, but 10x down from our UNIX brethren (SGI, Sun)
• Per adapter ~70 MBps; PCI ~110 MBps; memory read/write ~250 MBps

PennySort
• Hardware: 266 MHz Intel PPro, 64 MB SDRAM (10 ns), dual Fujitsu DMA 3.2 GB EIDE disks
• Software: NT Workstation 4.3, NT 5 sort
• Performance: sorts 15 M 100-byte records (~1.5 GB), disk to disk, in 820 sec elapsed (404 sec cpu)
• PennySort machine ($1,107) cost breakdown: cpu 32%, disk 25%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%, other 22%

Penny Sort Ground Rules (http://research.microsoft.com/barc/SortBenchmark)
• How much can you sort for a penny?
  – hardware and software cost, depreciated over 3 years
  – a $1 M system gets about 1 second; a $1 K system gets about 1,000 seconds
  – Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is 100-byte records (random data); the key is the first 10 bytes
• Must create the output file and fill it with a sorted version of the input file
• Daytona (product) and Indy (special-purpose) categories

How Good is NT5 Sort?
• CPU and IO are not overlapped; the system should be able to sort 2x more
• RAM has spare capacity
• Disk is space-saturated (1.5 GB in, 1.5 GB out on a 3 GB drive); need an extra 3 GB drive or a >6 GB drive
[Chart: cost split among fixed, CPU, disk, and RAM]

Sandia/Compaq/ServerNet/NT Sort
• Sorted 1.1 terabytes (13 billion records) in 47 minutes
• 68 nodes (dual 450 MHz processors), 543 disks, $1.5 M
• 1.2 GBps network RAP (2.8 GBps PAP)
• 5.2 GBps of disk RAP (same as PAP)
• (RAP = real application performance, PAP = peak advertised performance)
[Diagram: each Compaq ProLiant 1850R node has 2 cpus, 512 MB SDRAM, a dual-ported ServerNet-I PCI NIC into the X and Y fabrics through 6-port ServerNet-I crossbar switches, and 4 SCSI busses with 2 data disks each; the 72-node, 48-switch ServerNet-I topology deployed at Sandia National Labs has 10 bidirectional X-fabric and 14 bidirectional Y-fabric bisection links]
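A minimal sketch of the Penny Sort time budget above: a penny buys 0.01/price of a system depreciated over three years, which is where the 946,080 constant in the rule comes from; the $1,107 example is the PennySort machine on the earlier slide.

# Penny Sort time budget: a penny's worth of a system depreciated over 3 years.
def penny_seconds(system_price_dollars):
    three_years = 3 * 365 * 24 * 3600                    # 94,608,000 seconds
    return 0.01 / system_price_dollars * three_years     # = 946,080 / price

print(penny_seconds(1_000_000))   # ~0.95 s  ("a 1 M$ system gets about 1 second")
print(penny_seconds(1_000))       # ~946 s   ("a 1 K$ system gets about 1,000 seconds")
print(penny_seconds(1_107))       # ~855 s   budget for the $1,107 PennySort machine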
SP sort
• 2–4 GBps of IO throughout the run
[Chart: GPFS read, GPFS write, local read, and local write rates (GB/s, 0–4) vs elapsed time (0–900 seconds)]
• 488 nodes in 55 racks: 432 compute nodes (37 racks) + 56 storage nodes (18 racks)
• 1,952 processors, 732 GB RAM, 2,168 disks
• The 56 storage nodes manage 1,680 4 GB disks: 336 4+P twin-tail RAID5 arrays (30/node)
• Compute rack: 16 nodes, each with 4 x 332 MHz PowerPC 604e, 1.5 GB RAM, one 32x33 PCI bus, a 9 GB SCSI disk, and a 150 MBps full-duplex SP switch link
• Storage rack: 8 nodes, each with 4 x 332 MHz PowerPC 604e, 1.5 GB RAM, three 32x33 PCI busses, 30 x 4 GB SCSI disks (4+1 RAID5), and a 150 MBps full-duplex SP switch link

Progress on Sorting: NT now leads in both price and performance
• Speedup comes from Moore's law: 40%/year
• Processor/disk/network arrays: 60%/year (this is a software speedup)
[Charts: records sorted per second doubles every year (Bitton M68000, Kitsuregawa hardware sorter, Tandem, Intel Hypercube, Sequent, Cray YMP, IBM 3090, IBM RS6000, NOW, Alpha, Ordinal+SGI, NT/PennySort, Compaq/NT, Sandia/Compaq/NT, SPsort/IBM; 1985–2000); GB sorted per dollar also doubles every year]

Recent Results
• NOW Sort: 9 GB on a cluster of 100 UltraSparcs in 1 minute
• MilleniumSort: 16-node Dell NT cluster: 100 MB in 1.18 sec (Datamation)
• Tandem/Sandia sort: 68-cpu ServerNet, 1 TB in 47 minutes
• IBM SPsort: 408 nodes, 1,952 cpus, 2,168 disks: 17.6 minutes = 1,057 sec
  (all for 1/3 of a 94 M$ machine; the slice price is 64 k$ for 4 cpus, 2 GB RAM, 6 x 9 GB disks + interconnect)

Data Gravity: Processing Moves to Transducers
• Move processing to the data sources
• Move to where the power (and sheet metal) is
• Put a processor in the modem, the display, the microphones (speech recognition) & cameras (vision), and the storage (data storage and analysis)
• The system is “distributed” (a cluster/mob)

SAN: Standard Interconnect
• LAN faster than the memory bus?
• 1 GBps links in the lab
• $100 port cost soon
• The port is the computer
• Winsock: 110 MBps (10% cpu utilization at each end)
• Gbps SAN: 110 MBps; PCI: 70 MBps; UW SCSI: 40 MBps; FW SCSI: 20 MBps; SCSI: 5 MBps

Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has an execution environment (applications, services, DBMS, file system, RPC…, SAN driver, disk driver, OS kernel)

Standard Storage Metrics
• Capacity:
  – RAM: MB and $/MB — today at 10 MB & 100 $/MB
  – Disk: GB and $/GB — today at 10 GB and 200 $/GB
  – Tape: TB and $/TB — today at 0.1 TB and 25 k$/TB (nearline)
• Access time (latency):
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate

New Storage Metrics: Kaps, Maps, SCAN (see the sketch and table below)
• Kaps: how many kilobyte objects served per second
  – the file server, transaction-processing metric
  – this is the OLD metric
• Maps: how many megabyte objects served per second
  – the multimedia metric
• SCAN: how long it takes to scan all the data
  – the data mining and utility metric
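A minimal sketch computing Kaps, Maps, and scan time from a device's latency, bandwidth, and capacity, the way the table that follows does (the formula — one object per latency + transfer time — is the obvious reading of the definitions above; treat it as an approximation, not the talk's exact methodology).

# Kaps / Maps / scan-time from (capacity, latency, bandwidth), per the
# definitions above: objects served per second = 1 / (latency + size/bandwidth).
def metrics(capacity_gb, latency_s, bandwidth_mbps):
    bw = bandwidth_mbps * 1e6                  # bytes/sec
    kaps = 1.0 / (latency_s + 1e3 / bw)        # 1 KB objects per second
    maps = 1.0 / (latency_s + 1e6 / bw)        # 1 MB objects per second
    scan = capacity_gb * 1e9 / bw              # seconds to scan the whole unit
    return kaps, maps, scan

# Rough 1998 devices (values from the table that follows):
for name, cap, lat, bw in [("DRAM", 1, 1e-7, 500),
                           ("disk", 18, 1e-2, 15),
                           ("tape robot", 14 * 35, 30.0, 7)]:
    k, m, s = metrics(cap, lat, bw)
    print(f"{name:10s}  Kaps={k:10.3g}  Maps={m:8.3g}  scan={s:8.0f} s")
# Reproduces the table: DRAM ~5e5 Kaps / 2 s scan, disk ~100 Kaps / 13 Maps /
# 1200 s scan, tape robot ~0.03 Kaps / 70,000 s scan.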
Good 1998 devices packaged in a system
(source: http://www.tpc.org/results/individual_results/Dell/dell.6100.9801.es.pdf)

                        DRAM       DISK     TAPE robot
 Unit capacity (GB)       1          18       14 x 35
 Unit price ($)         4,000       500       10,000
 $/GB                   4,000        28         20
 Latency (s)            1.E-7      1.E-2      3.E+1
 Bandwidth (MBps)        500         15          7
 Kaps                   5.E+5      1.E+2      3.E-2
 Maps                   5.E+2      13.04      3.E-2
 Scan time of unit (s)    2         1200      70,000
 $/Kaps                 9.E-11     5.E-8      3.E-3
 $/Maps                 8.E-8      4.E-7      3.E-3
 $/TBscan               $0.08      $0.35      $211

Maps, SCANs
• At 10 MB/s it takes 1.2 days to scan 1 terabyte
• 1,000-way parallel: a 100-second scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel — use many little devices in parallel

The 1 TB disc card
• An array of discs on a 14” card, usable as 100 discs, 1 striped disc, 10 fault-tolerant discs, …etc.
• LOTS of accesses/second and lots of bandwidth
• Life is cheap, it's the accessories that cost ya
• Processors are cheap, it's the peripherals that cost ya (a 10 k$ disc card)

Tape Farms, Not Mainframe Silos
• A 10 K$ tape robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps, scan in 27 hours
• 100 robots: 1 M$, 50 TB, 50 $/GB, 3 K Maps, 27-hr scan
• Many independent tape robots (like a disc farm)

The Cost Myth: Optical and Tape
• Myth: optical is cheap — 200 $/platter, 2 GB/platter => 100 $/GB (2x cheaper than disc)
• Myth: tape is cheap — 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc)
• Cost: tape needs a robot (10 k$ … 3 m$) for 10 … 1,000 tapes (at 20 GB each) => 20 $/GB … 200 $/GB (1x…10x cheaper than disc)
• Cost: optical needs a robot (100 k$); 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than magnetic disc)
• Robots have poor access times
• Not good for the Library of Congress (25 TB)
• Data motel: data checks in, but it never checks out!

The Access Time Myth
• The myth: seek or pick time dominates
• The reality: (1) queuing dominates, (2) transfer dominates for BLOBs, (3) disk seeks are often short
• Implication: many cheap servers beat one fast, expensive server
[Figure: access time decomposed into wait + transfer + rotate + seek]
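The “access time myth” slide says queuing dominates; a minimal M/M/1-style sketch (my choice of model, not the talk's) showing how waiting time blows up with utilization even while seek + rotate + transfer stay constant.

# Why "queuing dominates": mean response time of a single-server (M/M/1) queue
# grows as service_time / (1 - utilization).  The 12 ms per-I/O service time
# (seek + rotate + transfer) is an assumed, nominal figure.
service = 0.012   # seconds per I/O
for utilization in (0.1, 0.5, 0.8, 0.9, 0.95):
    response = service / (1 - utilization)
    waiting = response - service
    print(f"util={utilization:4.0%}  response={response * 1000:6.1f} ms  "
          f"(waiting {waiting * 1000:5.1f} ms vs {service * 1000:.0f} ms of service)")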
What To Do About HIGH Availability
• You need a remote MIRRORED site (>100 feet or >100 miles away) to tolerate environmental failures (power, net, fire, flood) and operations failures
• Replicate changes across the net
• Fail over servers across the net (some distance): clients reconnect, state changes flow to the mirror server
• Tolerates: operations errors, Heisenbugs, …
• Allows: software upgrades, site moves, fires, …

Scaleup Has Limits (chart courtesy of Catharine Van Ingen)
[Chart: Mflop/s/$K vs Mflop/s for LANL Loki P6 Linux, NAS expanded Linux cluster, Cray T3E, IBM SP, SGI Origin 2000-195, Sun Ultra Enterprise 4000, UCB NOW]
• Vector supers ~ 10x supers: ~3 Gflops/cpu, bus/memory ~ 20 GBps, IO ~ 1 GBps
• Supers ~ 10x PCs: ~300 Mflops/cpu, bus/memory ~ 2 GBps, IO ~ 1 GBps
• PCs are slow: ~30 Mflops/cpu, bus/memory ~ 200 MBps, IO ~ 100 MBps

TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA)
[Chart: number of TOP500 systems by vendor (CRI, SGI, IBM, Convex, HP, Sun, TMC, DEC, Intel, Japanese vector machines, other), Jun-93 through Jun-98]
• TOP500 reports: http://www.netlib.org/benchmark/top500.html

NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + Windows NT
• A supercomputer for 3 M$
• Classic Fortran/MPI programming
• DCOM programming model

Avalon: Alpha Clusters for Science
http://cnls.lanl.gov/avalon/
• 140 Alpha processors (533 MHz) x 256 MB + 3 GB disk + Fast Ethernet switches
  = 45 GB RAM, 550 GB disk + Linux…
  = 10 real Gflops for $313,000
  => 34 real Mflops/k$ (150 Mflops/k$ on the benchmark)
• The Beowulf project is the parent: http://www.cacr.caltech.edu/beowulf/naegling.html — 114 nodes at ~2 k$/node
• Scientists want cheap mips

Your Tax Dollars At Work: ASCI for Stockpile Stewardship
• Intel/Sandia: 9,000 x 1-node PPro
• LLNL/IBM: 512 x 8 PowerPC (SP2)
• LANL/Cray: ?
• Maui Supercomputer Center: 512 x 1 SP2

Observations
• Uniprocessor RAP << PAP (real application performance << peak advertised performance)
• Growth has slowed (Bell Prize):
  – 1987: 0.5 GFLOPS
  – 1988: 1.0 GFLOPS (1 year)
  – 1990: 14 GFLOPS (2 years)
  – 1994: 140 GFLOPS (4 years)
  – 1997: 604 GFLOPS
  – 1998: 1600 G__OPS (4 years)

Two Generic Kinds of Computing
• Many little
  – embarrassingly parallel
  – fit the RPC model
  – fit the partitioned data and computation model
  – random access works OK
  – OLTP, file server, email, web, …
• Few big
  – sometimes not obviously parallel
  – do not fit the RPC model (BIG rpcs)
  – scientific, simulation, data mining, …

Many Little Programming Model
• many small requests
• route requests to data (see the sketch below)
• encapsulate data with procedures (objects)
• three-tier computing
• RPC is a convenient/appropriate model
• Transactions are a big help in error handling
• Auto-partition (e.g. hash data and computation)
• Works fine
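A minimal sketch of the “many little” model just described: hash-partition the data across nodes and route each small request to the node that owns the key (three-tier / RPC style). The node count and record layout are illustrative assumptions.

# "Many little": route each small request to the node that owns the data.
# Data and computation are hash-partitioned, so routing is a hash of the key.
N_NODES = 8
nodes = [dict() for _ in range(N_NODES)]     # each node encapsulates its partition

def owner(key):
    return hash(key) % N_NODES               # auto-partition by hashing the key

def put(key, value):                          # a "small request", routed to the data
    nodes[owner(key)][key] = value

def get(key):
    return nodes[owner(key)].get(key)

put("account:17", 100.0)
put("account:42", 250.0)
print(get("account:42"), "held by node", owner("account:42"))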
Software CyberBricks: Parallelism From Many Little Jobs (Object-Oriented Programming)
• Gives location transparency
• The TP monitor / ORB / web server multiplexes clients to servers
• Enables distribution
• Exploits embarrassingly parallel apps (transactions)
• HTTP and RPC (DCOM, CORBA, RMI, IIOP, …) are the basis

Few Big Programming Model
• Finding parallelism is hard
  – pipelines are short (3x…6x speedup)
• Spreading objects/data is easy, but getting locality is HARD
• Mapping a big job onto the cluster is hard
• Scheduling is hard: coarse-grained (job) and fine-grained (co-schedule)
• Fault tolerance is hard

Kinds of Parallel Execution
• Pipeline: any sequential program feeds the next sequential program
• Partition: inputs split N ways, outputs merge M ways; each partition runs an ordinary sequential program

Why Parallel Access To Data?
• At 10 MB/s it takes 1.2 days to scan 1 terabyte
• 1,000-way parallel: a 100-second scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel

Why Are Relational Operators So Successful for Parallelism?
• The relational data model: uniform operators on uniform data streams, closed under composition
• Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data
• Sequential data in and out: pure dataflow
• Partitioning gives AUTOMATIC PARALLELISM; some operators (e.g. aggregates, non-equi-join, sort, …) require innovation

Database Systems “Hide” Parallelism
• Automate system management via tools
  – data placement
  – data organization (indexing)
  – periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  – duplex & failover
  – transactions
• Automatic parallelism
  – among transactions (locking)
  – within a transaction (parallel execution)

SQL: a Non-Procedural Programming Language
• SQL is a functional language that describes the answer set
• The optimizer picks the best execution plan
  – the dataflow web (pipeline)
  – the degree of parallelism (partitioning)
  – other execution parameters (process placement, memory, …)
• Execution planning: the GUI and schema feed the optimizer, which produces a plan run by the executors over rivers, under a monitor

Partitioned Execution
• Spreads computation and IO among processors
• Example: a Count over a table partitioned A…E, F…J, K…N, O…S, T…Z runs one Count per partition and then combines the counts
• Partitioned data gives NATURAL parallelism

N x M Way Parallelism
• Example: per-partition Sorts and Joins over A…E, F…J, K…N, O…S, T…Z feeding Merges
• N inputs, M outputs, no bottlenecks
• Partitioned data; partitioned and pipelined data flows

Automatic Parallel Object-Relational DB
• Example query:
  select image
  from landsat
  where date between 1970 and 1990
    and overlaps(location, :Rockies)
    and snow_cover(image) > .7;
• The landsat table has date, location, and image columns (temporal, spatial, and image data)
• Assign one process per processor/disk:
  – find the images with the right date & location
  – analyze each image; if it is 70% snow, return it
• The date, location, and image tests run in parallel; the answer is the set of matching images

Data Rivers: Split + Merge Streams
• A river connects N producers to M consumers with N x M data streams
• Producers add records to the river; consumers consume records from the river
• Purely sequential programming: the river does flow control and buffering, and does the partitioning and merging of data records
• River = the Split/Merge in Gamma = the Exchange operator in Volcano / SQL Server
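A minimal single-process sketch of a data river (split + merge): N producers range-partition records to M consumer streams, and each side is purely sequential code. Plain lists stand in for the flow-controlled river; the key ranges and record shape are illustrative assumptions.

# Data river sketch: N producers split records across M consumers; each side is
# purely sequential code, the "river" (here, per-consumer lists) does the routing.
import string

N_PRODUCERS, M_CONSUMERS = 3, 4
river = [[] for _ in range(M_CONSUMERS)]          # one stream per consumer

def route(record):
    """Range-partition on the first letter (A..Z split into M buckets)."""
    return string.ascii_uppercase.index(record[0].upper()) * M_CONSUMERS // 26

def producer(records):                             # sequential producer code
    for r in records:
        river[route(r)].append(r)

def consumer(i):                                   # sequential consumer code
    return sorted(river[i])                        # e.g. sort its partition

inputs = [["delta", "alpha", "zulu"], ["bravo", "yankee"], ["mike", "charlie"]]
for part in inputs[:N_PRODUCERS]:
    producer(part)
print([consumer(i) for i in range(M_CONSUMERS)])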
Generalization: Object-Oriented Rivers
• Rivers transport a sub-class of record-set (= a stream of objects)
  – the record type and partitioning are part of the subclass
• Node transformers are data pumps
  – an object with river inputs and outputs
  – do late binding to the record type
• Programming becomes dataflow programming: specify the pipelines
• The compiler/scheduler does the data partitioning and “transformer” placement

NT Cluster Sort as a Prototype
• Uses data generation and sort as a prototypical app
• The “hello world” of distributed processing
• Goal: easy install & execute

Remote Install
• Add a Registry entry to each remote node: RegConnectRegistry(), RegCreateKeyEx()

Cluster Startup & Execution
• Setup: MULTI_QI struct, COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve the remote object handle from the MULTI_QI struct
• Invoke methods (Sort()) as usual

Cluster Sort: Conceptual Model
• Multiple data sources, multiple data destinations, multiple nodes
• Disks -> sockets -> disk -> disk
• Each source node splits its records by key range (A…, B…, C…) and streams each range to the node that owns it; each destination node collects and stores its range

How Do They Talk to Each Other?
• Each node has an OS and local resources: a federation
• Each node does not completely trust the others
• Nodes use RPC to talk to each other: CORBA? DCOM? IIOP? RMI? One or all of the above
• Huge leverage in the high-level interfaces
• Same old distributed-system story
• The stack: applications over RPC / streams / datagrams, over VIA (VIAL/VIPL), over the wire(s)
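A minimal sketch of the “federation of nodes talking over RPC” idea above, using Python's standard xmlrpc library rather than the talk's DCOM machinery; the node port and the sort_partition method are hypothetical illustrations, not part of the original prototype.

# Two "nodes" talking over RPC: a server node exports sort_partition(), and a
# client node calls it through a proxy stub (analogous to a DCOM/CORBA proxy).
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def sort_partition(records):
    """Remote method: sort the records this node was handed."""
    return sorted(records)

def start_node(port):
    server = SimpleXMLRPCServer(("localhost", port), allow_none=True, logRequests=False)
    server.register_function(sort_partition)
    threading.Thread(target=server.serve_forever, daemon=True).start()

if __name__ == "__main__":
    start_node(9001)                                   # pretend this is a remote cluster node
    node = ServerProxy("http://localhost:9001")        # client-side stub for the remote node
    print(node.sort_partition(["ccc", "aaa", "bbb"]))  # ['aaa', 'bbb', 'ccc']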