What Happens When Processing, Storage, and Bandwidth Are Free and Infinite?
Jim Gray, Microsoft Research

Outline
Clusters of Hardware CyberBricks
– All nodes are very intelligent
– Processing migrates to where the power is
  • Disk, network, and display controllers have full-blown operating systems
  • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
  • The computer is a federated distributed system
Software CyberBricks
– A standard way to interconnect intelligent nodes
– Needs an execution model
– Needs parallelism

When Computers & Communication are Free
The traditional computer industry is a 0 B$/year business.
All the costs are in
– Content (good)
– System management (bad)
• A vendor claims it costs 8 $/MB/year to manage disk storage.
  – => a WebTV (1 GB drive) costs 8,000 $/year to manage!
  – => a 10 PB database costs 80 billion $/year to manage!
• Automatic management is ESSENTIAL.
In the meantime….
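As a quick sanity check on the management-cost claim above, here is the arithmetic behind the two figures (the $8/MB/year rate is the vendor number quoted on the slide; the rest is simple multiplication):

```python
# Back-of-the-envelope check of the storage-management claims above.
rate = 8.0                      # dollars per MB per year (the vendor figure quoted in the talk)

webtv_mb = 1 * 1024             # a 1 GB WebTV drive, in MB
print(webtv_mb * rate)          # ~8,200 $/year: roughly the 8,000 $/year on the slide

db_mb = 10 * 1024**3            # a 10 PB database, in MB (1 PB = 2**30 MB)
print(db_mb * rate / 1e9)       # ~86 billion $/year: roughly the 80 B$/year on the slide
```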
1980 Rule of Thumb
You need a systems programmer per MIPS.
You need a data administrator per 10 GB.

One Person per MegaBuck
– 1 breadbox ~ 5x a 1987 machine room
– 48 GB is hand-held
– One person does all the work (hardware, OS, network, DB, and app expert in one)
– Cost/tps is 1,000x less: 25 micro-dollars per transaction
– A megabuck buys 40 of these!!!
[Figure: a 4 x 200 MHz cpu box with 1/2 GB DRAM, 12 x 4 GB disks, and 3 x 7 x 4 GB disk arrays]

All God's Children Have Clusters!
Buying computing by the slice:
– People are buying computers by the dozens
– Computers only cost 1 k$/slice!
– Clustering them together

A cluster is a cluster is a cluster
It's so natural, even mainframes cluster!
Looking closer at usage patterns, a few models emerge.
Looking closer at sites, you see
– hierarchies
– bunches
– functional specialization

"Commercial" NT Clusters
16-node Tandem cluster
– 64 cpus
– 2 TB of disk
– Decision support
45-node Compaq cluster
– 140 cpus
– 14 GB DRAM
– 4 TB RAID disk
– OLTP (Debit Credit)
  • 1 B tpd (14 k tps)

Tandem Oracle/NT
– 27,383 tpmC
– 71.50 $/tpmC
– 4 x 6 cpus
– 384 disks = 2.7 TB

Microsoft.com: ~150x4 nodes
[Figure: the Microsoft.Com site. A Building 11 internal LAN (log processing, internal WWW, staging, FTP/download, and SQLNet feeder servers); the live DMZ site (www, home, premium, register, search, support, msid, activex, cdm, and FTP/download servers plus live SQL servers on FDDI rings, behind routers, primary and secondary Gigaswitches, and 13 DS3 + 2 OC3 Internet links); SQL consolidation, reporting, and staging at MOSWest; and European and Japanese data centers. A typical node is a 4-cpu P5 or P6 with 256 MB to 1 GB of RAM and 12 to 180 GB of disk, costing $24K to $128K. All servers in Building 11 are accessible from corpnet.]

HotMail: ~400 computers

Inktomi (HotBot), WebTV: > 200 nodes
Inktomi: ~250 UltraSparcs
– Crawl the web
– Index the crawled web and save the index
– Return search results on demand
– Track ads and click-throughs
– ACID vs BASE (Basically Available, Soft state, Eventually consistent)
WebTV: ~200 UltraSparcs
– Render pages, provide email
– ~4 Network Appliance NFS file servers
– A large Oracle application tracking customers

Loki: Pentium Clusters for Science
http://loki-www.lanl.gov/
16 Pentium Pro processors x 5 Fast Ethernet interfaces
+ 2 GB RAM + 50 GB disk
+ 2 Fast Ethernet switches
+ Linux……
= 1.2 real Gflops for $63,000 (but that is the 1996 price)
The Beowulf project is similar:
http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
Scientists want cheap MIPS.
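For scale, the price/performance implied by the Loki figures above (my arithmetic from the slide's totals, nothing more):

```python
# Price/performance of the Loki cluster quoted above (arithmetic from the slide's totals).
price = 63_000          # dollars, 1996
gflops = 1.2            # sustained ("real") Gflops
print(price / gflops)   # ~52,500 $ per sustained Gflop
print(price / 16)       # ~3,900 $ per Pentium Pro node, including network and disk
```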
Your Tax Dollars At Work: ASCI for Stockpile Stewardship
– Intel/Sandia: 9000 x 1-node Pentium Pro
– LLNL/IBM: 512 x 8 PowerPC (SP2)
– LANL/Cray: ?
– Maui Supercomputer Center: 512 x 1 SP2

Berkeley NOW (Network Of Workstations) Project
http://now.cs.berkeley.edu/
105 nodes
– Sun UltraSparc 170, 128 MB, 2 x 2 GB disk
– Myrinet interconnect (2 x 160 MBps per node)
– SBus (30 MBps) limited
GLUnix layer above Solaris
Applications: Inktomi (HotBot search), NAS Parallel Benchmarks, crypto cracker, sort (9 GB per second)

Wisconsin COW
40 UltraSparcs
– 64 MB + 2 x 2 GB disk + Myrinet
– SunOS
Used as a compute engine

Andrew Chien's JBOB
http://www-csag.cs.uiuc.edu/individual/achien.html
48 nodes
– 36 HP Kayak boxes: 2 x PII, 128 MB, 1 disk
– 10 Compaq Workstation 6000s: 2 x PII, 128 MB, 1 disk
– 32 Myrinet-connected & 16 ServerNet-connected
Operational, all running NT

NCSA Cluster
The National Center for Supercomputing Applications, University of Illinois @ Urbana
– 500 Pentium cpus, 2k disks, SAN
– Compaq + HP + Myricom
– A supercomputer for 3 M$
– Classic Fortran/MPI programming
– NT + DCOM programming model

The Bricks of Cyberspace: 4B PCs (1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps net; B = G)
Cost 1,000 $
Come with
– NT
– DBMS
– High-speed net
– System management
– GUI / OOUI
– Tools
Compatible with everyone else.
CyberBricks.

Super Server: 4T Machine
An array of 1,000 4B machines, each with
– a 1 Bips processor
– 1 billion bytes of DRAM
– 10 billion bytes of disk
– a 1 Gbps comm line
plus a 1 TB tape robot, for a few megabucks.
[Figure: the array is built from CyberBricks, each drawn as a 4B machine with CPU, 5 GB RAM, and 50 GB of disk]
Challenges: manageability, programmability, security, availability, scalability, affordability: as easy as a single system.
Future servers are CLUSTERS of processors and discs.
Distributed database techniques make clusters work.
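Multiplying out the array above gives the "teras" behind the 4T name (my arithmetic; the per-brick numbers are the ones on the slide):

```python
# Aggregate capacity of the 4T machine sketched above (arithmetic only; per-brick
# figures from the slide: 1 Bips, 1 B bytes of DRAM, 10 B bytes of disk, 1 Gbps links).
bricks = 1_000
print(bricks * 1e9)      # 1e12 instructions/sec: a tera-op processor array
print(bricks * 1e9)      # 1e12 bytes of DRAM:    a terabyte of main memory
print(bricks * 10e9)     # 1e13 bytes of disk:    ten terabytes of disk
print(bricks * 1e9)      # 1e12 bits/sec of links: a terabit of communication
```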
Cluster Vision: Buying Computers by the Slice
Rack & stack
– Mail-order components
– Plug them into the cluster
Modular growth without limits
– Grow by adding small modules
Fault tolerance
– Spare modules mask failures
Parallel execution & data search
– Use multiple processors and disks
Clients and servers made from the same stuff
– Inexpensive: built with commodity CyberBricks

Nostalgia: Behemoth in the Basement
Today's PC is yesterday's supercomputer, and we can use LOTS of them.
The main apps have changed: scientific -> commercial -> web
– Web & transaction servers
– Data mining, web farming

SMP -> nUMA: BIG FAT SERVERS
Directory-based caching lets you build large SMPs; every vendor is building a HUGE SMP
– 256-way
– 3x slower remote memory
– 8-level memory hierarchy
  • L1, L2 cache
  • DRAM
  • remote DRAM (3, 6, 9, … times farther)
  • disk cache
  • disk
  • tape cache
  • tape
Needs
– 64-bit addressing
– a nUMA-sensitive OS (not clear who will do it)
– or a hypervisor, like IBM LSF or Stanford Disco (www-flash.stanford.edu/Hive/papers.html)
You get an expensive cluster-in-a-box with a very fast network.

Great Debate: Shared What?
– Shared memory (SMP): easy to program, difficult to build, difficult to scale (SGI, Sun, Sequent)
– Shared disk: (VMScluster, Sysplex)
– Shared nothing (network): hard to program, easy to build, easy to scale (Tandem, Teradata, SP2, NT)
NUMA blurs the distinction, but has its own problems.

Thesis: Many Little Beat Few Big
[Figure: the price/size spectrum from mainframe ($1 MM) through mini ($100 K) and micro ($10 K) down to nano and pico processors, and the storage hierarchy from 10-picosecond RAM (1 MB) and 10-nanosecond RAM (100 MB) through 10-microsecond RAM (10 GB) and 10-millisecond disk (1 TB, in 14"/9"/5.25"/3.5"/2.5"/1.8" form factors) to a 10-second tape archive (100 TB). The future processor is a "smoking, hairy golf ball": 1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, the event horizon on chip, VM reincarnated, a multi-program cache, on-chip SMP.]
How to connect the many little parts?
How to program the many little parts?
Fault tolerance?

A Hypothetical Question: Taking Things to the Limit
Moore's law, 100x per decade:
– Exa-instructions per second in 30 years
– Exa-bit memory chips
– Exa-byte disks
Gilder's law of the telecosm, 3x more bandwidth per year: 60,000x per decade!
– 40 Gbps per fiber today

Gilder's Telecosm Law: 3x bandwidth/year for 25 more years
Today:
– 10 Gbps per channel
– 4 channels per fiber: 40 Gbps
– 32 fibers/bundle = 1.2 Tbps/bundle
In the lab: 3 Tbps/fiber (400 x WDM)
In theory: 25 Tbps per fiber
1 Tbps = the 1996 USA WAN bisection bandwidth; 1 fiber = 25 Tbps.

Networking: BIG changes coming!
Technology
– 10 GBps buses "now"
– 1 Gbps links "now"
– 1 Tbps links in 10 years
– Fast & cheap switches
– Standard interconnects, processor-processor and processor-device (the device is a processor)
– Deregulation WILL work
CHALLENGE: reduce the software tax on messages
– Today: 30 K instructions + 10 instructions/byte
– Goal: 1 K instructions + .01 instructions/byte
Best bet: SAN/VIA
– Smart NICs someday
– A special protocol
– User-level network IO (like disk IO)
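To see what the "software tax" above means in practice, here is the per-message cost for a hypothetical 8 KB transfer (the instruction-count coefficients are the slide's; the message size is an assumed example):

```python
# Software cost of moving one 8 KB message, using the slide's coefficients.
size = 8 * 1024                         # bytes (an assumed example size)

today = 30_000 + 10 * size              # 30 K instructions + 10 instructions/byte
goal  =  1_000 + 0.01 * size            #  1 K instructions + .01 instructions/byte

print(today)                            # ~112,000 instructions per message
print(goal)                             # ~1,100 instructions per message
print(today / goal)                     # ~100x less cpu per message at the goal
```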
What if Networking Were as Cheap as Disk IO?
TCP/IP: on Unix/NT, 100% of a cpu @ 40 MBps.
Disk: on Unix/NT, 8% of a cpu @ 40 MBps.
Why the difference?
– For TCP/IP the host does the packetizing, checksums, and flow control, with small buffers.
– For disk the host bus adapter does the SCSI packetizing, checksums, and flow control, and the data moves by DMA.

The Promise of SAN/VIA: 10x Better in 2 Years
Today:
– wires are 10 MBps (100 Mbps Ethernet)
– ~20 MBps of tcp/ip saturates 2 cpus
– round-trip latency is ~300 µs
In two years:
– wires are 100 MBps (1 Gbps Ethernet, ServerNet, …)
– tcp/ip at ~100 MBps using 10% of each processor
– round-trip latency is 20 µs
This works in the lab today; it assumes the app uses the zero-copy Winsock2 API. See http://www.viarch.org/
[Chart: bandwidth, latency, and overhead, now vs. soon]

Functionally Specialized Cards
Storage, network, and display cards each carry a P-mips processor, M MB of DRAM, and an ASIC.
– Today: P = 20 mips, M = 2 MB
– In a few years: P = 200 mips, M = 64 MB

It's Already True of Printers: Peripheral = CyberBrick
You buy a printer, and you get
– several network interfaces
– a PostScript engine
  • cpu, memory, software, a spooler (soon)
– and… a print engine.

System On A Chip
Integrate processing with memory on one chip
– the chip is 75% memory now
– a 1 MB cache >> 1960s supercomputers
– a 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM, … projects abound
Integrate networking with processing on one chip
– the system bus is a kind of network
– ATM, Fibre Channel, Ethernet, … logic on chip
– direct IO (no intermediate bus)
Functionally specialized cards shrink to a chip.

All Device Controllers Will Be Cray-1's
TODAY
– a disk controller is a 10-mips RISC engine with 2 MB DRAM
– a NIC has similar power
SOON
– they will become 100-mips systems with 100 MB DRAM
– they are nodes in a federation (you could run Oracle on NT in the disk controller)
Advantages
– uniform programming model
– great tools
– security
– economics (CyberBricks)
– move computation to the data (minimize traffic)
[Figure: central processor & memory and intelligent device controllers on a terabyte backplane]

Tera Byte Backplane
With a terabyte interconnect and supercomputer adapters, processing is incidental to
– networking
– storage
– UI
The disk controller / NIC is
– faster than the device
– close to the device
– able to borrow the device's package & power
So use the idle capacity for computation: run the app in the device.

Implications
Conventional
– Offload device handling to the NIC/HBA
– Higher-level protocols: I2O, NASD, VIA, …
– SMP and cluster parallelism is important.
Radical
– Move the app to the NIC/device controller
– Higher-higher-level protocols: CORBA / DCOM
– Cluster parallelism is VERY important.

How Do They Talk to Each Other?
Each node has an OS and its own local resources: a federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other
– CORBA? DCOM? IIOP? RMI? HTTP?
– One, or all, of the above.
Huge leverage in high-level interfaces.
Same old distributed-system story.
[Figure: two nodes, each a stack of applications over RPC / streams / datagrams over VIAL/VIPL, joined by the wire(s)]
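Here is a minimal sketch of that idea, not code from the talk: a storage brick modeled as a node that answers a high-level request over RPC instead of serving raw blocks. The XML-RPC transport and the DiskNode, query, and port names are illustrative stand-ins for whichever of CORBA/DCOM/HTTP a real federation would use.

```python
# A sketch of the "federation of intelligent nodes" idea: the disk controller is a node
# that accepts a high-level request (a crude key lookup standing in for SQL) over RPC.
import threading
import time
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

class DiskNode:
    """An 'intelligent' storage brick: it owns its data and runs the filter itself."""
    def __init__(self):
        self.table = {"cust1": {"name": "Ann", "balance": 10},
                      "cust2": {"name": "Bob", "balance": 20}}
    def query(self, key):
        # computation moves to the data: only the answer crosses the wire
        return self.table.get(key, {})

def serve(node, port):
    server = SimpleXMLRPCServer(("localhost", port), logRequests=False, allow_none=True)
    server.register_instance(node)
    server.serve_forever()

if __name__ == "__main__":
    threading.Thread(target=serve, args=(DiskNode(), 8099), daemon=True).start()
    time.sleep(0.5)                              # give the server a moment to start
    proxy = ServerProxy("http://localhost:8099", allow_none=True)
    print(proxy.query("cust2"))                  # {'name': 'Bob', 'balance': 20}
```

The point is the interface level: the client ships a question, not block addresses, so the work runs next to the data.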
Restatement
The huge clusters we saw are prototypes for this:
– a federation of functionally specialized nodes
– each node shrinks to a "point" device with embedded processing
– each node / device is autonomous
– each talks a high-level protocol

Outline
Clusters of Hardware CyberBricks
– All nodes are very intelligent
– Processing migrates to where the power is
  • Disk, network, and display controllers have full-blown operating systems
  • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
  • The computer is a federated distributed system
Software CyberBricks
– A standard way to interconnect intelligent nodes
– Needs an execution model
– Needs parallelism

Software CyberBricks: Objects!
It's a zoo. Objects and 3-tier computing (transactions)
– give natural distribution & parallelism
– give remote management!
– TP & Web: dispatch RPCs to a pool of object servers
– components are a 1 B$ business today!
We need a parallel & distributed computing model.

The COMponent Promise
Objects are Software CyberBricks
– a productivity breakthrough (plug-ins)
– a manageability breakthrough (modules)
Microsoft: DCOM + ActiveX. IBM/Sun/Oracle/Netscape: CORBA + Java Beans.
Both promise
– parallel distributed execution
– centralized management of a distributed system
Both camps share key goals:
– Encapsulation: hide the implementation
– Polymorphism: generic operations, key to GUIs and reuse
– Uniform naming
– Discovery: finding a service
– Fault handling: transactions
– Versioning: allow upgrades
– Transparency: local/remote
– Security: who has authority
– Shrink-wrap: minimal inheritance
– Automation: easy

History and Alphabet Soup
[Figure: a 1985-1995 timeline: the Open Software Foundation (OSF) and its DCE, UNIX International and X/Open merging into the Open Group, the Object Management Group's (OMG) CORBA on Solaris, and COM on NT]
Microsoft DCOM is based on OSF-DCE technology; DCOM and ActiveX extend it.

The OLE-COM Experience
The Macintosh had Publish & Subscribe.
PowerPoint needed graphs, so MS Graph was plugged in as a component.
Office adopted OLE: one graph program for all of Office.
The Internet arrived: URLs are object references, so Office was Web-enabled right away!
Office 97 is smaller than Office 95 because of shared components.
It works!!

Linking And Embedding
Objects are data modules; transactions are execution modules.
– Link: a pointer to an object somewhere else (think of a URL on the Internet)
– Embed: the bytes are here
Objects may be active; they can call back to subscribers.

The BIG Picture: Components and Transactions
Software modules are objects.
An Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers).
Standard interfaces allow software plug-ins.
A transaction ties the execution of a "job" into an atomic unit: all-or-nothing, durable, isolated.

Object Request Broker (ORB)
– Orchestrates RPC
– Registers servers
– Manages pools of servers
– Connects clients to servers
– Does naming and request-level authorization
– Provides transaction coordination
– Direct and queued invocation
Old names: Transaction Processing Monitor, web server, NetWare.

The OO Points So Far
Objects are software CyberBricks.
Object interconnect standards are emerging.
CyberBricks become federated systems.
Next points:
– put processing close to the data
– do parallel processing

Three Tier Computing
– Clients do presentation and gather input
– Clients do some workflow (Xscript)
– Clients send high-level requests to the ORB
– The ORB dispatches workflows and business objects -- proxies for the client that orchestrate flows & queues
– Server-side workflow scripts call on distributed business objects to execute the task
[Figure: presentation and workflow tiers over application objects over the database]
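A minimal sketch of that request flow, with assumed names (Broker, AccountObject, DataTier) standing in for a real ORB / TP monitor, business object, and database tier; a production broker would add the naming, authorization, and transaction coordination listed on the ORB slide above.

```python
# A sketch of the three-tier flow: client -> request broker -> pool of business objects -> data tier.
from concurrent.futures import ThreadPoolExecutor

class DataTier:                         # tier 3: the shared data
    def __init__(self):
        self.balances = {"ann": 100, "bob": 50}

class AccountObject:                    # tier 2: a business object near the data
    def __init__(self, db):
        self.db = db
    def debit_credit(self, src, dst, amount):
        # a real system would run this inside a transaction (all-or-nothing, durable, isolated)
        self.db.balances[src] -= amount
        self.db.balances[dst] += amount
        return self.db.balances[src], self.db.balances[dst]

class Broker:                           # the ORB / TP monitor: names servers and dispatches requests
    def __init__(self, servers, pool_size=4):
        self.servers = servers
        self.pool = ThreadPoolExecutor(pool_size)
    def submit(self, interface, method, *args):
        obj = self.servers[interface]                # naming: find a server for the interface
        return self.pool.submit(getattr(obj, method), *args)

if __name__ == "__main__":
    broker = Broker({"Account": AccountObject(DataTier())})
    fut = broker.submit("Account", "debit_credit", "ann", "bob", 25)   # tier 1: a thin-client request
    print(fut.result())                 # (75, 75)
```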
Transaction Processing Evolution to Three Tier
Intelligence migrated to the clients:
– Mainframe batch processing (centralized, punched cards)
– Dumb terminals & remote job entry (green-screen 3270)
– TP monitor: intelligent terminals, database backends
– Workflow systems, object request brokers, application generators

Web Evolution to Three Tier
Intelligence migrated to the clients (as in TP):
– Character-mode clients, smart servers (archie, gopher, WAIS, green screen)
– GUI browsers and web file servers (Mosaic)
– GUI plug-ins, web dispatchers, CGI
– Smart clients and web dispatchers (ORBs): pools of app servers (ISAPI, Viper), workflow scripts at client & server (Netscape & IE)

PC Evolution to Three Tier
Intelligence migrated to the server:
– Stand-alone PC (centralized)
– PC + file & print server: a message per disk I/O
– PC + database server: a message per SQL statement
– PC + app server: a message per transaction (ActiveX client, ORB, ActiveX server, Xscript)

Why Did Everyone Go To Three-Tier?
Manageability
– Business rules must be with the data
– Middleware operations tools
Performance (scalability)
– Server resources are precious
– The ORB dispatches requests to server pools
Technology & physics
– Put UI processing near the user
– Put shared-data processing near the shared data
– Minimize data moves
– Encapsulate / modularity
[Figure: presentation and workflow tiers over application objects over the database]

The OO Points So Far
Objects are software CyberBricks.
Object interconnect standards are emerging.
CyberBricks become federated systems.
Put processing close to the data.
Next point:
– do parallel processing

Kinds of Parallel Execution
– Pipeline: the output of one ordinary sequential program streams into the next.
– Partition: inputs are split N ways, each part is processed by a copy of an ordinary sequential program, and the outputs are merged M ways.

Object Oriented Programming: Parallelism From Many Little Jobs
– Gives location transparency
– The ORB / web server / TP monitor multiplexes clients to servers
– Enables distribution
– Exploits embarrassingly parallel apps (transactions)
– HTTP and RPC (DCOM, CORBA, RMI, IIOP, …) are the basis

Why Parallel Access To Data?
– At 10 MB/s it takes 1.2 days to scan 1 terabyte.
– 1,000-way parallel: a 100-second scan.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
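Checking the scan times quoted above (arithmetic only; the 10 MB/s rate and 1 TB table size are the slide's):

```python
# The scan times quoted above, checked.
terabyte = 1e12                  # bytes
rate = 10e6                      # 10 MB/s for one disk stream

one_stream = terabyte / rate     # 100,000 s
print(one_stream / 86_400)       # ~1.2 days for a single sequential scan

print(one_stream / 1_000)        # 100 s for a 1,000-way parallel scan
```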
Why Are Relational Operators So Successful for Parallelism?
The relational data model: uniform operators on uniform data streams, closed under composition.
– Each operator consumes 1 or 2 input streams
– Each stream is a uniform collection of data
– Sequential data in and out: pure dataflow
– Partitioning some operators (e.g. aggregates, non-equi-join, sort, ..) requires innovation
=> AUTOMATIC PARALLELISM

Database Systems "Hide" Parallelism
Automate system management via tools
– data placement
– data organization (indexing)
– periodic tasks (dump / recover / reorganize)
Automatic fault tolerance
– duplex & failover
– transactions
Automatic parallelism
– among transactions (locking)
– within a transaction (parallel execution)

SQL: a Non-Procedural Programming Language
SQL is a functional programming language: it describes the answer set.
The optimizer picks the best execution plan
– the data-flow web (pipelining)
– the degree of parallelism (partitioning)
– other execution parameters (process placement, memory, ...)
[Figure: the GUI and schema feed the optimizer, which produces a plan; executors run the plan over data rivers, with execution planning and monitoring around them]

Automatic Parallel Object-Relational DB
  select image
  from   landsat
  where  date between 1970 and 1990
    and  overlaps(location, :Rockies)
    and  snow_cover(image) > .7;
[Figure: the Landsat table (date, location, image) with temporal, spatial, and image predicates]
Assign one process per processor/disk:
– find the images with the right date & location
– analyze each image; if it is 70% snow, return it
The answer: the matching dates, locations, & images.

Data Rivers: Split + Merge Streams
N producers and M consumers share N x M data streams: the river.
– Producers add records to the river; consumers consume records from the river.
– Purely sequential programming.
– The river does flow control and buffering, and does the partitioning and merging of data records.
River = Split/Merge in Gamma = the Exchange operator in Volcano. (A small code sketch of a river follows the Summary below.)

Partitioned Execution
Spreads computation and IO among processors.
[Figure: a table partitioned A…E, F…J, K…N, O…S, T…Z, with a Count operator on each partition feeding a combined Count]
Partitioned data gives NATURAL parallelism.

N x M Way Parallelism
[Figure: five partitions (A…E through T…Z) feed five Join operators, whose outputs are repartitioned into five Sorts and merged by three Merge operators]
N inputs, M outputs, no bottlenecks: partitioned data, and partitioned and pipelined data flows.

Main Message
Technology trends give
– many processors and storage units
– inexpensively
To analyze large quantities of data
– sequential (regular) access patterns are 100x faster than random access
– parallelism is 1000x faster (it trades time for money)
– relational systems show many parallel algorithms

Summary
Clusters of Hardware CyberBricks
– All nodes are very intelligent
– Processing migrates to where the power is
  • Disk, network, and display controllers have full-blown operating systems
  • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
  • The computer is a federated distributed system
Software CyberBricks
– A standard way to interconnect intelligent nodes
– Needs an execution model
– Needs parallelism
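As a closing illustration of the data-river and partitioned-execution slides above, here is a small sketch of a river in code (my own construction, not Gamma or Volcano): N producers add records, the river partitions them by key across M consumer queues with bounded buffers for flow control, each consumer runs an ordinary sequential count over its partition, and the partial counts are merged at the end.

```python
# A toy data river: split (partition) on the way in, merge (combine) on the way out.
import queue
import threading

N_PRODUCERS, M_CONSUMERS = 3, 2
DONE = object()                                   # end-of-stream marker

class River:
    """Split/merge plumbing: partitions records across consumer queues and buffers them."""
    def __init__(self, m):
        self.queues = [queue.Queue(maxsize=100) for _ in range(m)]   # buffering & flow control
    def put(self, record):
        key = record[0]
        self.queues[hash(key) % len(self.queues)].put(record)        # split: partition by key
    def close(self):
        for q in self.queues:                                        # one end-of-stream marker per consumer
            q.put(DONE)

def producer(river, pid):
    for i in range(5):                                               # an ordinary sequential producer
        river.put((f"key-{pid}-{i}", pid))

def consumer(river, cid, results):
    count = 0
    while True:                                                      # an ordinary sequential consumer
        rec = river.queues[cid].get()
        if rec is DONE:
            break
        count += 1
    results[cid] = count                                             # a partitioned COUNT

if __name__ == "__main__":
    river, results = River(M_CONSUMERS), {}
    consumers = [threading.Thread(target=consumer, args=(river, c, results)) for c in range(M_CONSUMERS)]
    producers = [threading.Thread(target=producer, args=(river, p)) for p in range(N_PRODUCERS)]
    for t in consumers + producers:
        t.start()
    for t in producers:
        t.join()
    river.close()                                                    # all producers done: close the river
    for t in consumers:
        t.join()
    print(results, "total:", sum(results.values()))                  # merge: total = 15
```

The producers and consumers contain no parallel code at all; the river supplies the split, merge, buffering, and flow control, which is exactly the division of labor the Gamma and Volcano designs advocate.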