What Happens When Processing, Storage, and Bandwidth Are Free and Infinite?
Jim Gray, Microsoft Research

Outline
– Hardware CyberBricks: all nodes are very intelligent
– Software CyberBricks: a standard way to interconnect intelligent nodes
– What next? Processing migrates to where the power is
  • Disk, network, and display controllers have a full-blown OS
  • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
  • The computer is a federated distributed system.

A Hypothetical Question: Taking Things to the Limit
– Moore's law, 100x per decade:
  • Exa-instructions per second in 30 years
  • Exa-bit memory chips
  • Exa-byte disks
– Gilder's Law of the Telecosm, 3x more bandwidth per year (60,000x per decade!):
  • 40 Gbps per fiber today

Grove's Law
– Link bandwidth doubles every 100 years!
– Not much has happened to telephones lately
– Still twisted pair

Gilder's Telecosm Law: 3x bandwidth/year for 25 more years
– Today:
  • 10 Gbps per channel
  • 4 channels per fiber: 40 Gbps
  • 32 fibers/bundle = 1.2 Tbps/bundle
– In the lab: 3 Tbps/fiber (400 x WDM)
– In theory: 25 Tbps per fiber
– 1 Tbps = USA 1996 WAN bisection bandwidth; 1 fiber = 25 Tbps

Thesis: Many Little Beat Few Big
(Chart: the price/size spectrum from the $1 million, 14" mainframe through mini, micro, and nano down to pico processors at 1.8", set against the storage hierarchy: 1 MB of 10-picosecond RAM, 100 MB of 10-nanosecond RAM, 10 GB of 10-microsecond RAM, 1 TB of 10-millisecond disc, 100 TB of 10-second tape archive.)
– How to connect the many little parts?
– How to program the many little parts?
– Fault tolerance?
– The future processor is a "smoking, hairy golf ball": 1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event horizon on chip, VM reincarnated, multi-program cache, on-chip SMP.

Year 2000 4B Machine
The Year 2000 commodity PC, 1,000 $:
– 1 Billion Instructions/Sec (1 BIPS processor)
– 0.1 Billion Bytes RAM
– 1 Billion Bits/s Net
– 10 Billion Bytes Disk (10 GB)
– Billion-pixel display (3000 x 3000 x 24)

4B PCs: The Bricks of Cyberspace
– Cost 1,000 $
– Come with:
  • OS (NT, POSIX, ...)
  • DBMS
  • High-speed net
  • System management
  • GUI / OOUI
  • Tools
– Compatible with everyone else: CyberBricks

Super Server: 4T Machine
– An array of 1,000 4B machines:
  • 1 BIPS processors
  • 1 B bytes DRAM
  • 10 B bytes of disk
  • 1 Bbps comm lines
  • 1 TB tape robot
– A few megabucks
– Challenge: manageability, programmability, security, availability, scaleability, affordability; as easy as a single system
– Future servers are CLUSTERS of processors and discs
– Distributed database techniques make clusters work

Functionally Specialized Cards
(Storage, network, and display ASICs, each with a P-mips processor and M MB of DRAM.)
– Today: P = 50 mips, M = 2 MB
– In a few years: P = 200 mips, M = 64 MB

It's Already True of Printers: Peripheral = CyberBrick
You buy a printer, you get:
– several network interfaces
– a PostScript engine
  • cpu, memory, software, a spooler (soon)
– and... a print engine.

System On A Chip
– Integrate processing with memory on one chip
  • the chip is 75% memory now
  • 1 MB cache >> 1960 supercomputers
  • a 256 Mb memory chip is 32 MB!
  • IRAM, CRAM, PIM, ... projects abound
– Integrate networking with processing on one chip
  • the system bus is a kind of network
  • ATM, FiberChannel, Ethernet, ... logic on chip
  • direct IO (no intermediate bus)
– Functionally specialized cards shrink to a chip.

All Device Controllers Will Be Cray 1's
– TODAY: the disk controller is a 10-mips RISC engine with 2 MB DRAM; the NIC is similar power.
– SOON: they will become 100-mips systems with 100 MB DRAM.
– They are nodes in a federation (you could run Oracle on NT in the disk controller).
– Advantages:
  • Uniform programming model
  • Great tools
  • Security
  • Economics (CyberBricks)
  • Move computation to data (minimize traffic)
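The "nodes in a federation" idea above is easy to prototype. Below is a minimal, hypothetical sketch in Python (the class and method names, the data, and the predicate protocol are invented for illustration; no real controller or RPC library is implied): an "intelligent disk controller" accepts a high-level request instead of raw block reads, so only matching records cross the interconnect.

```python
# Hypothetical sketch: an "intelligent disk" node that accepts high-level
# requests (a predicate) and returns only matching records, so computation
# moves to the data instead of raw blocks moving to the host.

from typing import Callable, Iterable

class SmartDiskController:
    """Stand-in for a disk controller running a full OS and a query engine."""

    def __init__(self, records: Iterable[dict]):
        self._records = list(records)   # pretend this is data on the platters

    def select(self, predicate: Callable[[dict], bool]) -> list:
        # The filter runs *inside* the device; only results cross the backplane.
        return [r for r in self._records if predicate(r)]

# Host side: a federation of device nodes, each addressed like an RPC server.
disks = [
    SmartDiskController([{"id": i, "temp": 20 + i % 15} for i in range(0, 1000)]),
    SmartDiskController([{"id": i, "temp": 20 + i % 15} for i in range(1000, 2000)]),
]

# A "SELECT * WHERE temp > 30" pushed down to every device controller.
hot = [row for d in disks for row in d.select(lambda r: r["temp"] > 30)]
print(len(hot), "matching records returned; the rest never left the devices")
```

In a real federation the `select` call would be an RPC (CORBA, DCOM, or similar) rather than a local method call, but the economics are the same: the traffic is proportional to the answer, not to the data scanned.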
With Tera Byte Interconnect and Super Computer Adapters
(Figure: device controllers and the central processor & memory all hang off a Tera Byte Backplane.)
– Processing is incidental to networking, storage, and UI
– The disk controller/NIC is:
  • faster than the device
  • close to the device
  • able to borrow the device's package & power
– So use the idle capacity for computation: run the app in the device.

Implications
– Conventional:
  • Offload device handling to the NIC/HBA
  • Higher-level protocols: I2O, NASD, VIA, ...
  • SMP and cluster parallelism is important.
– Radical:
  • Move the app to the NIC/device controller
  • Higher-higher level protocols: CORBA / DCOM
  • Cluster parallelism is VERY important.

How Do They Talk to Each Other?
– Each node has an OS
– Each node has local resources: a federation.
– Each node does not completely trust the others.
– Nodes use RPC to talk to each other: CORBA? DCOM? IIOP? RMI? One or all of the above.
– Huge leverage in high-level interfaces.
– Same old distributed-system story.
(Figure: applications on each node talk via RPC, streams, or datagrams, layered on VIAL/VIPL over the wire(s).)

Objects! It's a Zoo
– ORBs, COM, CORBA, ...
– Object-relational databases
– Objects and 3-tier computing

History and Alphabet Soup
(Timeline: 1985 Open Software Foundation (OSF); 1990 X/Open, UNIX International; 1995 CORBA (Object Management Group, OMG), Solaris, Open Group, OSF DCE, NT COM.)
– Microsoft DCOM is based on OSF-DCE technology
– DCOM and ActiveX extend it

The Promise
– Objects are software CyberBricks:
  • productivity breakthrough (plug-ins)
  • manageability breakthrough (modules)
– Microsoft promises Cairo: distributed objects; secure, transparent, fast invocation
– IBM/Sun/Oracle/Netscape promise CORBA + OpenDoc + Java Beans + ...
– All will deliver; customers can pick the best one
– Both camps share key goals:
  • Encapsulation: hide implementation
  • Polymorphism: generic ops, key to GUI and reuse
  • Uniform naming
  • Discovery: finding a service
  • Fault handling: transactions
  • Versioning: allow upgrades
  • Transparency: local/remote
  • Security: who has authority
  • Shrink-wrap: minimal inheritance
  • Automation: easy

The OLE-COM Experience
– Macintosh had Publish & Subscribe
– PowerPoint needed graphs: plugged MS Graph in as a component.
– Office adopted OLE: one graph program for all of Office
– The Internet arrived: URLs are object references, so Office was Web-enabled right away!
– Office 97 is smaller than Office 95 because of shared components
– It works!!

Linking And Embedding
– Objects are data modules; transactions are execution modules
– Link: a pointer to an object somewhere else (think URL on the Internet)
– Embed: the bytes are here
– Objects may be active; they can call back to subscribers

Objects Meet Databases: the basis for universal data servers, access, & integration
– Object-oriented (COM-oriented) interface to data
– Breaks the DBMS into components
– Anything can be a data source (database, spreadsheet, photos, mail, map, document)
– Optimization/navigation "on top of" other data sources
– Makes an RDBMS an O-R DBMS, assuming the optimizer understands objects

The BIG Picture: Components and Transactions
– Software modules are objects
– An Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers)
– Standard interfaces allow software plug-ins
– A transaction ties execution of a "job" into an atomic unit: all-or-nothing, durable, isolated
– ActiveX components are a 250 M$/year business.
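A tiny sketch of the plug-in idea behind the OLE/COM experience, in Python rather than COM (the `Component` interface and the graph/map classes are hypothetical stand-ins, not real COM interfaces): the container depends only on the interface, so any component that implements it can be embedded.

```python
# Hypothetical sketch of the component plug-in idea: the container (an "Office"
# document) depends only on the Component interface, so a graph, a map, or any
# future plug-in can be embedded without changing the container.

from abc import ABC, abstractmethod

class Component(ABC):
    @abstractmethod
    def render(self) -> str: ...

class GraphComponent(Component):          # stands in for an MS Graph-like plug-in
    def __init__(self, points): self.points = points
    def render(self) -> str: return f"graph of {len(self.points)} points"

class MapComponent(Component):            # some other vendor's plug-in
    def __init__(self, city): self.city = city
    def render(self) -> str: return f"map of {self.city}"

class Document:
    def __init__(self): self.embedded = []        # "embed: the bytes are here"
    def embed(self, c: Component): self.embedded.append(c)
    def render(self) -> str:
        return "\n".join(c.render() for c in self.embedded)

doc = Document()
doc.embed(GraphComponent([(0, 1), (1, 3), (2, 2)]))
doc.embed(MapComponent("Sacramento"))
print(doc.render())   # the container never needed to know the concrete types
```

Linking, in this picture, would store a reference (a URL-like name resolved at render time) instead of appending the object itself.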
Object Request Broker (ORB)
– Orchestrates RPC
– Registers servers
– Manages pools of servers
– Connects clients to servers
– Does naming and request-level authorization
– Provides transaction coordination
– Direct and queued invocation
– Old names: Transaction Processing Monitor, Web server, NetWare

The OO Points So Far
– Objects are software CyberBricks
– Object interconnect standards are emerging
– CyberBricks become federated systems.
– Next points:
  • put processing close to data
  • do parallel processing.

Three Tier Computing
– Clients do presentation and gather input
– Clients do some workflow (Xscript)
– Clients send high-level requests to the ORB
– The ORB dispatches workflows and business objects: proxies for the client that orchestrate flows & queues
– Server-side workflow scripts call on distributed business objects to execute the task
(Figure: presentation, workflow, business objects, database.)

The Three Tiers
– Web client: HTML, VB or Java plug-ins, VBScript, JavaScript
– Middleware: VB or Java script engine, object server pool, VB or Java virtual machine; TP Monitor, Web server, ORB
– Object & data server: DCOM (OLE DB, ODBC, ...), IBM legacy gateways
– The tiers are connected by Internet protocols: HTTP + DCOM

Transaction Processing Evolution to Three Tier: intelligence migrated to clients
– Mainframe + cards: batch processing (centralized)
– Dumb terminals & remote job entry: the green-screen 3270
– TP Monitor: intelligent terminals, database backends
– Workflow systems, Object Request Brokers, application generators: server + ORB, active clients

Web Evolution to Three Tier: intelligence migrated to clients (like TP)
– Character-mode clients, smart servers: WAIS, archie, gopher (green screen)
– GUI browsers (Mosaic): Web file servers
– GUI plug-ins: Web dispatchers, CGI
– Smart clients (NS & IE, Active): Web dispatcher (ORB), pools of app servers (ISAPI, Viper), workflow scripts at client & server

PC Evolution to Three Tier: intelligence migrated to the server
– Stand-alone PC (centralized)
– PC + file & print server: a message per I/O (disk I/O request/reply)
– PC + database server: a message per SQL statement
– PC + app server: a message per transaction
– ActiveX client, ORB, ActiveX server, Xscript

Why Did Everyone Go To Three-Tier?
– Manageability:
  • business rules must be with the data
  • middleware operations tools
– Performance (scaleability):
  • server resources are precious
  • the ORB dispatches requests to server pools
– Technology & physics:
  • put UI processing near the user
  • put shared-data processing near the shared data
  • minimize data moves
  • encapsulate / modularity

Why Put Business Objects at the Server?
– DAD's raw data (the self-service store):
  • customer comes to the store
  • takes what he wants
  • fills out an invoice
  • leaves money for the goods
  • easy to build
  • no clerks
– MOM's business objects (the clerked store):
  • customer comes to the store with a list
  • gives the list to a clerk
  • clerk gets the goods, makes the invoice
  • customer pays the clerk, gets the goods
  • easy to manage
  • clerks control access
  • encapsulation

The OO Points So Far
– Objects are software CyberBricks
– Object interconnect standards are emerging
– CyberBricks become federated systems.
– Put processing close to data
– Next point: do parallel processing.

Parallelism: the OTHER Half of Super-Servers
– Clusters of machines allow two kinds of parallelism:
  • many little jobs: online transaction processing (TPC A, B, C, ...)
  • a few big jobs: data search & analysis (TPC D, DSS, OLAP)
– Both give automatic parallelism

Why Parallel Access To Data?
– At 10 MB/s it takes 1.2 days to scan 1 terabyte
– 1,000-way parallel: a 100-second scan.
– Parallelism: divide a big problem into many smaller ones to be solved in parallel.
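The scan arithmetic above is easy to check. Here is the calculation spelled out; the bandwidth and data-size numbers come from the slide, the helper function is only illustrative.

```python
# Worked example of the "Why Parallel Access To Data?" arithmetic:
# how long a full scan takes at a given per-stream bandwidth and degree of parallelism.

TB = 1_000_000  # megabytes in a terabyte (decimal, as the slide uses)

def scan_seconds(data_mb: float, mb_per_sec: float, degree: int = 1) -> float:
    """Time to scan data_mb at mb_per_sec per stream with `degree` parallel streams."""
    return data_mb / (mb_per_sec * degree)

serial = scan_seconds(1 * TB, 10)            # one 10 MB/s stream
parallel = scan_seconds(1 * TB, 10, 1000)    # 1,000 streams in parallel

print(f"serial:   {serial / 86_400:.1f} days")    # ~1.2 days
print(f"parallel: {parallel:.0f} seconds")        # ~100 seconds
```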
Kinds of Parallel Execution
– Pipeline parallelism: one sequential program feeds its output to the next.
– Partition parallelism: outputs split N ways, inputs merge M ways; the same sequential program runs on each partition.

Why Are Relational Operators Successful for Parallelism?
– The relational data model: uniform operators on uniform data streams, closed under composition
– Each operator consumes 1 or 2 input streams
– Each stream is a uniform collection of data
– Sequential data in and out: pure dataflow
– Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation
– AUTOMATIC PARALLELISM

Database Systems "Hide" Parallelism
– Automate system management via tools:
  • data placement
  • data organization (indexing)
  • periodic tasks (dump / recover / reorganize)
– Automatic fault tolerance:
  • duplex & failover
  • transactions
– Automatic parallelism:
  • among transactions (locking)
  • within a transaction (parallel execution)

SQL: a Non-Procedural Programming Language
– SQL is a functional programming language: it describes the answer set.
– The optimizer picks the best execution plan:
  • the dataflow web (pipeline),
  • the degree of parallelism (partitioning),
  • other execution parameters (process placement, memory, ...)
(Figure: GUI and schema feed the optimizer, which produces a plan; executors and rivers carry out the plan under execution planning and monitoring.)

Automatic Data Partitioning
Split a SQL table across a subset of nodes & disks. Partitioning within the set (see the sketch at the end of this group of slides):
– Range (A...E, F...J, K...N, O...S, T...Z): good for equijoins, range queries, group-by
– Hash: good for equijoins
– Round robin: good to spread load
Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.

N x M Way Parallelism
(Figure: sort and join operators run on each of the A...E through T...Z partitions; their outputs merge into the result streams.)
– N inputs, M outputs, no bottlenecks.

Parallel Objects?
How does all this DB parallelism connect to hardware/software CyberBricks?
– To scale to large client sets:
  • need lots of independent parallel execution
  • this comes for free from the ORB.
– To scale to large data sets:
  • need intra-program parallelism (like parallel DBs)
  • requires some invention.

Outline (revisited)
– Hardware CyberBricks: all nodes are very intelligent
– Software CyberBricks: a standard way to interconnect intelligent nodes
– What next? Processing migrates to where the power is
  • Disk, network, and display controllers have a full-blown OS
  • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
  • The computer is a federated distributed system.
  • Parallel execution is important

MORE SLIDES, but there is only so much time. Too bad.

The Disk Farm On a Card
The 100 GB disc card: a 14" card carrying an array of discs. It can be used as:
– 100 discs
– 1 striped disc
– 10 fault-tolerant discs
– ... etc
LOTS of accesses/second and bandwidth.
Life is cheap, it's the accessories that cost ya. Processors are cheap; it's the peripherals that cost ya (a 10 k$ disc card).

Parallelism: Performance is the Goal
– The goal is 'good' performance: trade time for money.
– Law 1: a parallel system should be faster than the serial system.
– Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.
– Parallel DBMSs obey these laws.
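As referenced in the Automatic Data Partitioning slide above, here is a small Python sketch (the table, keys, and query predicate are invented for illustration, and no particular DBMS is implied) of hash-partitioning a table across "nodes" and running the same sequential scan on every partition in parallel, the setup under which Law 2's near-linear speedup is plausible.

```python
# Hypothetical sketch: hash-partition a table across "nodes", then run the same
# sequential program on every partition in parallel and merge the results
# (partition parallelism: split N ways, merge M ways).

from concurrent.futures import ProcessPoolExecutor

def hash_partition(rows, n_nodes):
    """Assign each row to a node by hashing its key (good for equijoins)."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row["key"]) % n_nodes].append(row)
    return parts

def scan(partition):
    """The same sequential program runs on each partition (a filter here)."""
    return [r for r in partition if r["value"] > 900]

def parallel_scan(rows, n_nodes):
    parts = hash_partition(rows, n_nodes)
    with ProcessPoolExecutor(max_workers=n_nodes) as pool:
        results = pool.map(scan, parts)            # N scans run in parallel
    return [r for part in results for r in part]   # merge the output streams

if __name__ == "__main__":
    table = [{"key": i, "value": i % 1000} for i in range(1_000_000)]
    print(len(parallel_scan(table, n_nodes=8)), "rows qualified")
```

Range and round-robin partitioning differ only in how `hash_partition` assigns rows; the scan and the merge stay the same, which is why the parallelism can be automatic.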
Success Stories
– Online transaction processing (many little jobs):
  • SQL systems support 50 k tpm-C on commodity hardware (44 CPUs, 600 disks, 2 nodes)
– Batch, decision support, and utility (a few big jobs, parallelism inside):
  • scan data at 100 MB/s
  • linear scaleup to 1,000 processors

The New Law of Computing
– Grosch's Law: 2x the money buys 4x the performance (1 MIPS for 1 $; 1,000 MIPS for 32 $, i.e. 0.03 $/MIPS)
– The Parallel Law: 2x the money buys 2x the performance (1 MIPS for 1 $; 1,000 MIPS for 1,000 $)
– It needs linear speedup and linear scaleup; not always possible

Clusters Being Built
– Teradata: 1,000 nodes (30 k$/slice)
– Tandem, VMScluster: 150 nodes (100 k$/slice)
– Intel: 9,000 nodes @ 55 M$ (6 k$/slice)
– Teradata, Tandem, DEC moving to NT + low slice price
– IBM: 512-node ASCI @ 100 M$ (200 k$/slice)
– PC clusters (bare-handed) at dozens of nodes: web servers (MSN, PointCast, ...), DB servers
– THE KEY TECHNOLOGY HERE IS THE APPS:
  • apps distribute data
  • apps distribute execution

BOTH SMP and Cluster?
– Grow up with SMP: 4 x P6 is now standard (the SMP super server)
– Grow out with clusters: clusters have inexpensive parts (personal system, departmental server, cluster of PCs)

Clusters Have Advantages
– Clients and servers are made from the same stuff.
– Inexpensive: built with commodity components
– Fault tolerance: spare modules mask failures
– Modular growth: grow by adding small modules

Meta-Message: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate, THEN nothing really changes.
– Things getting MUCH BETTER:
  • communication speed & cost: 1,000x
  • processor speed & cost: 100x
  • storage size & cost: 100x
– Things staying about the same:
  • speed of light (more or less constant)
  • people (10x more expensive)
  • storage speed (only 10x better)

Storage Ratios Changed
– 10x better access time
– 10x more bandwidth
– 4,000x lower media price
– DRAM/disk price ratio went from 100:1 to 10:1 to 50:1
(Charts: disk accesses/second vs. time, disk performance vs. time (seeks/second and bandwidth in MB/s), and storage price vs. time (megabytes per kilo-dollar), 1980-2000.)

Performance = Storage Accesses, not Instructions Executed
– In the "old days" we counted instructions and I/Os
– Now we count memory references; processors wait most of the time
(Chart: where the clock ticks used by AlphaSort go: disc wait, sort, OS, memory wait, B-cache data miss, I-cache miss, D-cache miss.)
– A 70-MIPS processor runs "real" apps (which have worse I-cache misses) at 60 MIPS if well tuned, 20 MIPS if not

Storage Latency: How Far Away Is the Data?
– 1 clock tick: registers (my head, 1 min)
– 2 clock ticks: on-chip cache (this room)
– 10 clock ticks: on-board cache (this campus, 10 min)
– 100 clock ticks: memory (Sacramento, 1.5 hr)
– 10^6 clock ticks: disk (Pluto, 2 years)
– 10^9 clock ticks: tape/optical robot (Andromeda, 2,000 years)

Tape Farms for Tertiary Storage, Not Mainframe Silos
– One 10 k$ tape robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps; a scan takes 27 hours
– 100 independent tape robots (like a disc farm): 1 M$, 50 TB, 50 $/GB, 3 K Maps, many scans in parallel

The Metrics: Disk and Tape Farms Win
(Chart comparing GB/k$, Kaps, Maps, and scans/day for a 1,000x disc farm, an STC tape robot with 6,000 tapes and 8 readers, and a 100x DLT tape farm.)
– Data motel: data checks in, but it never checks out.

Tape & Optical: Beware of the Media Myth
– Optical media are cheap: 200 $/platter at 2 GB/platter => 100 $/GB (2x cheaper than disc)
– Tape media are cheap: 30 $/tape at 20 GB/tape => 1.5 $/GB (100x cheaper than disc).
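The media-myth arithmetic above, and the system-cost correction on the next slide, is just division; a small sketch makes the comparison explicit. The prices are the ones on the slides; the helper function itself is only illustrative.

```python
# Worked example of the media-myth arithmetic: $/GB for the media alone versus
# $/GB once the robot that must house the media is included.

def dollars_per_gb(media_price, media_gb, count=1, robot_price=0.0):
    """Cost per GB of `count` media units plus an optional robot to hold them."""
    return (robot_price + count * media_price) / (count * media_gb)

print("optical media :", dollars_per_gb(200, 2), "$/GB")      # 100 $/GB
print("tape media    :", dollars_per_gb(30, 20), "$/GB")      # 1.5 $/GB

# Add a 10 k$ robot holding 14 tapes and the *system* cost per GB jumps
# from 1.5 $/GB to tens of $/GB, which is the point of the next slide.
print("tape system   :", round(dollars_per_gb(30, 20, 14, 10_000), 1), "$/GB")
```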
Tape & Optical Reality: Media Are 10% of System Cost
– Tape needs a robot (10 k$ ... 3 M$) holding 10 ... 1,000 tapes (at 20 GB each) => 20 $/GB ... 200 $/GB (1x ... 10x cheaper than disc)
– Optical needs a robot (100 k$); 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than magnetic disc)
– Robots have poor access times
– Not good for the Library of Congress (25 TB)
– Data motel: data checks in but it never checks out!

The Access Time Myth
– The myth: seek or pick time dominates
– The reality: (1) queuing dominates, (2) transfer dominates for BLOBs, (3) disk seeks are often short
– Implication: many cheap servers are better than one fast, expensive server:
  • shorter queues
  • parallel transfer
  • lower cost/access and cost/byte
– This is now obvious for disk arrays; it will be obvious for tape arrays
(Figure: an access decomposed into wait, transfer, rotate, and seek time.)

Billions Of Clients
– Every device will be "intelligent": doors, rooms, cars, ...
– Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers
– All clients are networked to servers; they may be nomadic or on-demand
– Fast clients want faster servers
– Servers provide shared data, control, coordination, and communication
(Figure: mobile and fixed clients talking to servers and a super server.)

1987: 256 tps Benchmark
– A 14 M$ computer (Tandem)
– A dozen people: admin expert, hardware experts, network expert, manager, performance expert, DB expert, OS expert, auditor
– A false floor and 2 rooms of machines
– A 32-node processor array
– A 40 GB disk array (80 drives)
– Simulated 25,600 clients

1988: DB2 + CICS Mainframe, 65 tps
– IBM 4391, a 2 M$ computer
– Simulated network of 800 clients
– A staff of 6 to do the benchmark
– 2 x 3725 network controllers
– Refrigerator-sized CPU
– 16 GB disk farm: 4 x 8 x 0.5 GB

1997: 10 Years Later, 1 Person and 1 Box = 1,250 tps
– 1 breadbox: ~5x the 1987 machine room
– 23 GB is hand-held
– One person does all the work (hardware, OS, net, DB, and app expert)
– Cost/tps is 1,000x less: 25 micro-dollars per transaction
– 4 x 200 MHz CPUs, 1/2 GB DRAM, 12 x 4 GB disks (3 x 7 x 4 GB disk arrays)

What Happened?
– Moore's law: things get 4x better every 3 years (applies to computers, storage, and networks)
– New economics, by commodity class:
  • mainframe: 10,000 $/mips, software 100 k$/year
  • minicomputer: 100 $/mips, software 10 k$/year
  • microcomputer: 10 $/mips, software 1 k$/year
– GUI: the human-computer tradeoff: optimize for people, not computers

What Happens Next
– The last 10 years (1985-1995): 1,000x improvement. The next 10 years (to 2005): ????
– Today: text and image servers are free (25 m$/hit, so advertising pays for them)
– Future: video, audio, ... servers are free
– "You ain't seen nothing yet!"

Smart Cards
– Then (1979): the Bull CP8 two-chip card, first public demonstration 1979
– Now (1997): EMV card with dynamic authentication (EMV = Europay, MasterCard, Visa standard); door key, vending machines, photocopiers
– Courtesy of Dennis Roberson, NCR.

Smart Card Memory Capacity
– 16 KB today, but growing super-exponentially
(Chart: memory size in bits vs. year, 1990-2004, rising from about 3 K and 10 K bits ("you are here") toward 1 M and 300 M bits.)
– Applications: cards will be able to store data (e.g. medical), books, movies, ... money
– Source: PIN/Card-Tech; courtesy of Dennis Roberson, NCR.