Data Centric Computing
Jim Gray, Microsoft Research
Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA
(Title graphic: storage scale from Kilo to Yotta.)

Put Everything in Future (Disk) Controllers
(it's not "if", it's "when?")
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA
Acknowledgements: Dave Patterson explained this to me long ago.
Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.

First Disk, 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24" disks
• 1200 rpm
• 100 ms access
• 35 k$/year rent
• Included computer & accounting software (tubes, not transistors)

(Photo slide: "1.6 meters", "10 years later".)

Disk Evolution
• Capacity: 100x in 10 years
  – 1 TB 3.5" drive in 2005
  – 20 GB as 1" micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is a super computer!

Disks are becoming computers
• Smart drives
• Camera with micro-drive
• Replay / Tivo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
(Diagram: Applications; Web, DBMS, Files; OS; Disk Ctlr + 1 GHz cpu + 1 GB RAM; Comm: Infiniband, Ethernet, radio…)

Data Gravity: Processing Moves to Transducers
(smart displays, microphones, printers, NICs, disks)
• Processing is decentralized
  – Moving to data sources
  – Moving to power sources
  – Moving to sheet metal
• Today: ASIC with P = 50 mips, M = 2 MB
• In a few years: ASIC with P = 500 mips, M = 256 MB
(Diagram: Storage, Network, Display, each with its own ASIC. ? The end of computers ?)

It's Already True of Printers: Peripheral = CyberBrick
• You buy a printer
• You get
  – several network interfaces
  – a Postscript engine (cpu, memory, software, a spooler (soon))
  – and… a print engine.

The (absurd?) consequences of Moore's Law
• 1 GB RAM chips
• MAD at 200 Gbpsi
• Drives shrink one quantum
• 10 GBps SANs are ubiquitous
• 1 bips cpus for 10$
• 10 bips cpus at the high end
• 256-way NUMA?
• Huge main memories
  – now: 500 MB - 64 GB memories
  – then: 10 GB - 1 TB memories
• Huge disks
  – now: 20-200 GB 3.5" disks
  – then: 0.1-1 TB disks
• Petabyte storage farms
  – (that you can't back up or restore)
• Disks >> tapes
  – "small" disks: one platter, one inch, 10 GB
• SAN convergence
  – 1 GBps point-to-point is easy

The Absurd Design?
• Further segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl's laws: bus: 10 B/ips, io: 1 b/ips
(Diagram: Processors ~ 1 Tips, RAM ~ 1 TB, Disks ~ 100 TB.)

What's a Balanced System? (40+ disk arms / cpu)
(Diagram: System Bus, PCI Bus, PCI Bus.)

Observations re TPC-C, TPC-H systems
• More than ½ the hardware cost is in disks
• Most of the mips are in the disk controllers
• 20 mips/arm is enough for TPC-C
• 50 mips/arm is enough for TPC-H
• Need 128 MB to 256 MB per arm
• Ref:
  – Gray & Shenoy: "Rules of Thumb…"
  – Keeton, Riedel, Uysal PhD theses

? The end of computers ?
When each disk has 1 bips, there is no need for a 'cpu'.

Implications
• Conventional
  – Offload device handling to NIC/HBA
  – Higher-level protocols: I2O, NASD, VIA, IP, TCP…
  – SMP and cluster parallelism is important.
  (Central Processor & Memory)
• Radical
  – Move the app to the NIC/device controller
  – Higher-higher-level protocols: CORBA / COM+
  – Cluster parallelism is VERY important.
  (Terabyte/s backplane)
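The balanced-system slides above invite a quick back-of-the-envelope check. The sketch below (Python) uses the slide's ratio of roughly 1 bit of I/O per instruction per second; the 1 byte of memory per instruction per second and the ~3 MB/s of useful throughput per disk arm are my assumptions, not figures stated on the slides.

```python
# Back-of-the-envelope check of the balanced-system rule above.
IO_BITS_PER_IPS = 1            # slide: "io: 1 b/ips"
MEM_BYTES_PER_IPS = 1          # classic Amdahl memory rule (assumed)
BYTES_PER_S_PER_ARM = 3e6      # assumed ~3 MB/s useful throughput per disk arm

def balance(cpu_ips):
    """Return (I/O bytes/s, RAM bytes, disk arms) needed to keep cpu_ips busy."""
    io_bytes_per_s = cpu_ips * IO_BITS_PER_IPS / 8
    ram_bytes = cpu_ips * MEM_BYTES_PER_IPS
    arms = io_bytes_per_s / BYTES_PER_S_PER_ARM
    return io_bytes_per_s, ram_bytes, arms

io, ram, arms = balance(1e9)   # a 1 bips processor
print(f"I/O: {io/1e6:.0f} MB/s, RAM: {ram/1e9:.0f} GB, arms: {arms:.0f}")
# -> about 125 MB/s of I/O, ~1 GB of RAM, and ~40 disk arms per 1 bips cpu,
#    which is where the "40+ disk arms / cpu" figure comes from.
```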
Interim Step: Shared Logic
• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2 x Gbps Ethernet
• General-purpose OS (except NetApp)
• 10 k$/TB to 50 k$/TB
• Shared:
  – sheet metal
  – power
  – support/config
  – security
  – network ports
• Examples: Snap™ ~1 TB (12 x 80 GB) NAS; NetApp™ ~0.5 TB (8 x 70 GB) NAS; Maxtor™ ~2 TB (12 x 160 GB) NAS

Gordon Bell's Seven Price Tiers
• 10$: wrist-watch computers
• 100$: pocket/palm computers
• 1,000$: portable computers
• 10,000$: personal computers (desktop)
• 100,000$: departmental computers (closet)
• 1,000,000$: site computers (glass house)
• 10,000,000$: regional computers (glass castle)
Super-Server: costs more than 100,000$.
"Mainframe": costs more than 1 M$; must be an array of processors, disks, tapes, comm ports.

Bell's Evolution of Computer Classes
Technology enables two evolutionary paths:
1. constant performance, decreasing cost
2. constant price, increasing performance
(Graph: log price vs. time; Mainframes (central), Minis (dep't.), WSs, PCs (personals).)
1.26 = 2x / 3 yrs = 10x / decade; 1/1.26 = 0.8
1.6 = 4x / 3 yrs = 100x / decade; 1/1.6 = 0.62

NAS vs SAN
• Network Attached Storage
  – File servers
  – Database servers
  – Application servers
  – High-level interfaces are better
  – (it's a slippery slope, as Novell showed)
• Storage Area Network
  – A lower life form
  – Block server: get block / put block
  – Wrong abstraction level (too low level)
  – Security is VERY hard to understand (who can read that disk block?)
  – SCSI and iSCSI are popular.

How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: a federation
• Each node does not completely trust the others
• Nodes use RPC to talk to each other
  – WebServices/SOAP? CORBA? COM+? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed-system story.
(Diagram: two nodes, each with Applications; RPC, streams, datagrams; SIO; connected by a SAN.)

The Slippery Slope
(From Nothing = Sector Server to Everything = App Server.)
• If you add function to the server,
• then you add more function to the server.
• Function gravitates to data.

Why Not a Sector Server? (let's get physical!)
• Good idea: that's what we have today.
• But
  – cache added for performance
  – sector remap added for fault tolerance
  – error reporting and diagnostics added
  – SCSI commands (reserve, …) are growing
  – sharing is problematic (space mgmt, security, …)
• Slipping down the slope to a 2-D block server.

Why Not a 1-D Block Server? Put A LITTLE on the Disk Server
• Tried and true design
  – HSC - VAX cluster
  – EMC
  – IBM Sysplex (3980?)
• But look inside
  – Has a cache
  – Has space management
  – Has error reporting & management
  – Has RAID 0, 1, 2, 3, 4, 5, 10, 50, …
  – Has locking
  – Has remote replication
  – Has an OS
  – Security is problematic
  – The low-level interface moves too many bytes

Why Not a 2-D Block Server? Put A LITTLE on the Disk Server
• Tried and true design
  – Cedar -> NFS
  – file server, cache, space, …
  – Open file means many fewer messages
• Grows to have
  – Directories + naming
  – Authentication + access control
  – RAID 0, 1, 2, 3, 4, 5, 10, 50, …
  – Locking
  – Backup/restore/admin
  – Cooperative caching with the client

Why Not a File Server? Put a Little on the 2-D Block Server
• Tried and true design
  – NetWare, Windows, Linux, NetApp, Cobalt, SNAP, … WebDAV
• Yes, but look at NetWare
  – The file interface grew
  – It became an app server (mail, DB, web, …)
  – NetWare had a primitive OS
  – Hard to program, so it optimized the wrong thing
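To make the "abstraction level" argument of the last few slides concrete, here is a minimal sketch of the two interfaces. Everything in it is illustrative: the class and method names are hypothetical, not taken from any real SAN or NAS product.

```python
# Illustrative interfaces only: hypothetical names, meant to show why a block
# server is "too low level" while a file server can enforce policy.
from abc import ABC, abstractmethod

class BlockServer(ABC):
    """SAN-style: the server sees opaque sectors. It cannot tell which user
    or which file a block belongs to, so access control, space management,
    and caching policy must all live in the client."""
    @abstractmethod
    def get_block(self, lba: int) -> bytes: ...
    @abstractmethod
    def put_block(self, lba: int, data: bytes) -> None: ...

class FileServer(ABC):
    """NAS-style: the server understands names, users, and byte ranges, so it
    can check permissions at open time, manage its own space, and answer a
    whole read in one round trip instead of many per-block requests."""
    @abstractmethod
    def open(self, path: str, user: str, mode: str) -> int: ...
    @abstractmethod
    def read(self, handle: int, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, handle: int, offset: int, data: bytes) -> None: ...
```

The point is simply that the block server never learns who is reading which file, while the file server knows enough to enforce security and to move far fewer messages and bytes.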
Why Not Everything? Allow Everything on the Disk Server (thin clients)
• Tried and true design
  – Mainframes, minis, …
  – Web servers, …
• Encapsulates data
• Minimizes data moves
• Scaleable
• It is where everyone ends up.
• All the arguments against it are short-term.

The Slippery Slope (revisited)
(From Nothing = Sector Server to Everything = App Server: function gravitates to data.)

Disk = Node
• has magnetic storage (1 TB?)
• has processor & DRAM
• has SAN attachment
• has an execution environment
(Stack diagram: Applications; Services; DBMS; RPC, …; File System; SAN driver; Disk driver; OS Kernel.)

Hardware (slide courtesy of Brewster Kahle, @ Archive.org)
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives
• $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB
• 3 weeks from ordering to operational

Disk as Tape (slide courtesy of Brewster Kahle, @ Archive.org)
• Tape is unreliable, specialized, slow, low density, not improving fast, and expensive.
• Using removable hard drives to replace tape's function has been successful.
• When a "tape" is needed, the drive is put in a machine and it is online; no need to copy from tape before it is used.
• Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.

Disk As Tape: What format?
• Today I send NTFS/SQL disks.
• But that is not a good format for Linux.
• Solution: ship NFS/CIFS/ODBC servers (not disks).
• Plug the "disk" into the LAN.
  – DHCP, then file or DB server via a standard interface
  – Web Service in the long term

Some Questions
• Will the disk folks deliver?
• What is the product?
• How do I manage 1,000 nodes (disks)?
• How do I program 1,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?

Will the disk folks deliver? Maybe!
(Chart: total hard-drive unit shipments, 1986-2001, in millions of units. Source: DiskTrend/IDC.)
Not a pretty picture (lately).

Most Disks are Personal
• 85% of disks are desktop/mobile (not SCSI).
• Personal media is AT LEAST 50% of the problem.
• How to manage your shoebox of:
  – documents
  – voicemail
  – photos
  – music
  – videos
(Chart: one personal collection: Music 6.9 GB (1.8 K files, 180 CDs); My Books 98 MB; Working 2.3 GB (432 folders, 2.9 K files); Archive 5.1 GB (477 folders, 18.7 K files); Mail 0.7 GB (43 K messages); Video 2.6 GB (10 hours, low-res). Totals: 27.1 K files plus 42 K mail messages by number, 17.7 GB by size.)

What is the Product? (see the next section on media management)
• Concept: plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• "PC"
• File server appliance
• Data archive/interchange appliance
• Web appliance
• Email appliance
• Application appliance
• Router appliance
(Diagram: an appliance brick with just power and network connections.)

Auto Manage Storage
• 1980 rule of thumb:
  – a DataAdmin per 10 GB, a SysAdmin per mips
• 2000 rule of thumb:
  – a DataAdmin per 5 TB
  – a SysAdmin per 100 clones (varies with the app)
• Problem:
  – 5 TB is 50 k$ today, 5 k$ in a few years.
  – Admin cost >> storage cost!!!!
• Challenge:
  – Automate ALL storage admin tasks.
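To see why "Admin cost >> storage cost", a rough calculation helps. The sketch below (Python) uses the slide's ratios; the fully-burdened admin salary is an assumed figure for illustration, not a number from the talk.

```python
# Back-of-the-envelope: why admin cost dominates storage cost.
ADMIN_SALARY = 100_000         # $/year, assumed fully-burdened DataAdmin cost
TB_PER_ADMIN = 5               # 2000 rule of thumb from the slide
DOLLARS_PER_TB_TODAY = 10_000  # 5 TB ~ 50 k$ today (slide)
DOLLARS_PER_TB_SOON = 1_000    # 5 TB ~ 5 k$ "in a few years" (slide)

farm_tb = 1_000                # a petabyte farm
admins = farm_tb / TB_PER_ADMIN
admin_cost_per_year = admins * ADMIN_SALARY
print(f"{admins:.0f} admins -> {admin_cost_per_year/1e6:.0f} M$/year to manage")
print(f"hardware: {farm_tb*DOLLARS_PER_TB_TODAY/1e6:.0f} M$ today, "
      f"{farm_tb*DOLLARS_PER_TB_SOON/1e6:.0f} M$ soon")
# -> 200 admins, ~20 M$/year of people vs. 10 M$ (soon 1 M$) of disks:
#    the people quickly cost more than the storage they manage.
```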
How do I manage 1,000 nodes?
• You can't manage 1,000 x (for any x).
• They manage themselves.
  – You manage the exceptional exceptions.
• Auto manage
  – Plug & play hardware
  – Auto load-balance & placement of storage & processing
  – Simple parallel programming model
  – Fault masking
• Some positive signs: few admins at
  – Google: 10 k nodes, 2 PB
  – Yahoo!: ? nodes, 0.3 PB
  – Hotmail: 10 k nodes, 0.3 PB

How do I program 1,000 nodes?
• You can't program 1,000 x (for any x).
• They program themselves.
  – You write embarrassingly parallel programs.
  – Examples: SQL, Web, Google, Inktomi, HotMail, …
  – PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto-parallelism is ESSENTIAL.

Plug & Play Software
• RPC is standardizing: (SOAP/HTTP, COM+, RMI/IIOP)
  – Gives huge TOOL LEVERAGE
  – Solves the hard problems: naming, security, directory service, operations, …
• Commoditized programming environments
  – FreeBSD, Linux, Solaris, … + tools
  – NetWare + tools
  – WinCE, WinNT, … + tools
  – JavaOS + tools
• Apps gravitate to data.
• A general-purpose OS on a dedicated controller can run apps.

It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?): a geo-plex.
• Scrub it continuously (look for errors).
• On failure,
  – use the other copy until the failure is repaired,
  – refresh the lost copy from the safe copy.
• Can organize the two copies differently (e.g., one by time, one by space).

CyberBricks
• Disks are becoming supercomputers.
• Each disk will be a file server, then a SOAP server.
• Multi-disk bricks are transitional.
• Long term, a brick will have an OS per disk.
• Systems will be built from bricks.
• There will also be
  – Network bricks
  – Display bricks
  – Camera bricks
  – …

Data Centric Computing
Jim Gray, Microsoft Research
Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA
(Title graphic: storage scale from Kilo to Yotta.)

Communications Excitement!! (slide borrowed from Craig Mundie)
• Point-to-Point, Immediate: conversation, money
• Point-to-Point, Time-Shifted: mail
• Broadcast, Immediate: lecture, concert
• Broadcast, Time-Shifted: book, newspaper
Net Work + Data Base:
• It's ALL going electronic.
• Information is being stored for analysis (so it is ALL database).
• Analysis & automatic processing are being added.

Information Excitement!
• But comm just carries information.
• The real value added is
  – information capture & rendering: speech, vision, graphics, animation, …
  – information storage & retrieval
  – information analysis

Information At Your Fingertips
• All information will be in an online database (somewhere).
• You might record everything you
  – read: 10 MB/day, 400 GB/lifetime (5 disks today)
  – hear: 400 MB/day, 16 TB/lifetime (2 disks/year today)
  – see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (150 disks/year, maybe someday)
• Data storage, organization, and analysis are the challenge.
  – text, speech, sound, vision, graphics, spatial, time, …
• Information at Your Fingertips:
  – Make it easy to capture.
  – Make it easy to store & organize & analyze.
  – Make it easy to present & access.
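A quick sanity check of the recording rates above. The lifetime length is an assumption inferred from the slide's own numbers, which are mutually consistent at roughly 40,000 days (about 110 years).

```python
# Quick check of the "record everything" arithmetic above.
LIFETIME_DAYS = 40_000            # assumed; implied by the slide's figures

bytes_per_day = {                 # daily rates from the slide
    "read": 10e6,                 # 10 MB/day
    "hear": 400e6,                # 400 MB/day
    "see":  40e9,                 # 1 MB/s for ~11 waking hours, ~40 GB/day
}
for name, per_day in bytes_per_day.items():
    lifetime_gb = per_day * LIFETIME_DAYS / 1e9
    print(f"{name}: {lifetime_gb:,.0f} GB per lifetime")
# read -> ~400 GB, hear -> ~16,000 GB (16 TB), see -> ~1,600,000 GB (1.6 PB),
# matching the slide's 400 GB / 16 TB / 1.6 PB figures.
```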
How much information is there?
• Soon everything can be recorded and indexed.
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are the key technologies.
• See Mike Lesk, "How much information is there?": http://www.lesk.com/mlesk/ksg97/ksg.html
• See Lyman & Varian, "How much information?": http://www.sims.berkeley.edu/research/projects/how-much-info/
(Graphic: the storage scale from Kilo to Yotta, annotated with reference points: a book, a photo, a movie, all LoC books (words), all books multimedia, and everything ever recorded. The small-prefix side runs milli, micro, nano, pico, femto, atto, zepto, yocto.)

Why Put Everything in Cyberspace?
• Low rent: minimum $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automates processing: knowbots
• Point-to-point OR broadcast
• Immediate OR time-delayed
• Locate, process, analyze, summarize

Disk Storage Cheaper than Paper
• File cabinet:
  – cabinet (4 drawer): 250$
  – paper (24,000 sheets): 250$
  – space (2x3 @ 10$/ft2): 180$
  – total: 700$, about 3¢/sheet
• Disk:
  – disk (160 GB): ~300$
  – ASCII: 100 M pages, 0.0001¢/sheet (10,000x cheaper)
  – Image: 1 M photos, 0.03¢/sheet (100x cheaper)
• Store everything on disk.

Gordon Bell's MainBrain™: Digitize Everything. A BIG shoebox?
• Scans: 20 k "pages" (TIFF @ 300 dpi), 1 GB
• Music: 2 k "tracks", 7 GB
• Photos: 13 k images, 2 GB
• Video: 10 hrs, 3 GB
• Docs: 3 k (ppt, word, ..), 2 GB
• Mail: 50 k messages, 1 GB
• Total: 16 GB

Gary Starkweather
• Scan EVERYTHING: 400 dpi TIFF
• 70 k "pages" ~ 14 GB
• OCR all scans (98% OCR recognition accuracy)
• All indexed (5-second access to anything)
• All on his laptop.

• Q: What happens when the personal terabyte arrives?
• A: Things will run SLOWLY… unless we add good software.

Summary
• Disks will morph into appliances.
• The main barriers to this happening:
  – lack of cool apps
  – the cost of information management
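A closing back-of-the-envelope note on the "Disk Storage Cheaper than Paper" slide a few slides back. The per-page sizes for ASCII and image pages below are assumptions of mine, chosen only to reproduce the slide's page counts.

```python
# Paper vs. disk cost per page, following the slide's numbers.
# Assumed page sizes: ~1.5 KB per ASCII page, ~150 KB per scanned image page.
paper_total = 250 + 250 + 180            # cabinet + paper + floor space ($)
paper_sheets = 24_000
disk_price = 300                         # 160 GB drive ($)
disk_bytes = 160e9

ascii_pages = disk_bytes / 1.5e3         # ~100 million pages
image_pages = disk_bytes / 150e3         # ~1 million photos

print(f"paper: {100*paper_total/paper_sheets:.1f} cents/sheet")
print(f"disk, ASCII: {100*disk_price/ascii_pages:.5f} cents/page")
print(f"disk, image: {100*disk_price/image_pages:.3f} cents/page")
# paper ~ 3 cents/sheet; ASCII ~ 0.0003 cents/page; image ~ 0.03 cents/page.
# The orders of magnitude (roughly 10,000x and 100x cheaper) match the slide.
```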