Storage Bricks
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002
Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.

First Disk, 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24" disks
• 1200 rpm
• 100 ms access
• 35 k$/y rent
• Included computer & accounting software (tubes, not transistors)

10 Years Later
• (photo; 1.6 meters)

Disk Evolution
(chart scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta)
• Capacity: 100x in 10 years
– 1 TB 3.5" drive in 2005
– 20 GB 1" micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is a supercomputer!

Disks Are Becoming Computers
• Smart drives
• Camera with micro-drive
• Replay / TiVo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
• The stack: applications (Web, DBMS, files) on an OS on a disk controller with a 1 GHz CPU and 1 GB RAM
• Comm: InfiniBand, Ethernet, radio…

Data Gravity: Processing Moves to Transducers
• Smart displays, microphones, printers, NICs, disks
• Processing is decentralizing: moving to data sources, moving to power sources, moving to sheet metal
• Today: ASIC with P = 50 MIPS, M = 2 MB; in a few years: P = 500 MIPS, M = 256 MB
• Storage, network, and display each get their own ASIC. The end of computers?

It's Already True of Printers: Peripheral = CyberBrick
• You buy a printer; you get:
– several network interfaces
– a PostScript engine (CPU, memory, software, a spooler (soon))
– and… a print engine.

The Absurd Design?
• Segregate processing from storage
• Poor locality; much useless data movement
• Amdahl's laws: bus: 10 B/ips; I/O: 1 b/ips
• Processors ~1 Tips, RAM ~1 TB, Disks ~100 TB

The "Absurd" Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It's a tape!
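The "absurd disk" figures above follow from simple arithmetic. A sketch checking the slide's numbers (1 TB capacity, 100 MB/s sequential bandwidth, roughly 200 random accesses per second per arm):

```python
# Back-of-the-envelope check of the "absurd disk" slide figures.

capacity_mb = 1_000_000   # 1 TB drive
seq_mb_per_s = 100        # sequential bandwidth
random_aps = 200          # random accesses/second (~5 ms per access)

scan_hours = capacity_mb / seq_mb_per_s / 3600
aps_per_gb = random_aps / (capacity_mb / 1000)

print(f"full scan: {scan_hours:.1f} hours")            # ~2.8 hours ("2.5 hr" on the slide)
print(f"random access rate: {aps_per_gb} aps/GB")      # 0.2 aps/GB = 1 aps per 5 GB
```

A full sequential scan takes hours, and each gigabyte sees only a fraction of an access per second, which is why the slide calls the drive "a tape".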
• Optimizations:
– Reduce management costs
– Caching
– Sequential 100x faster than random
• 200 $, 100 MB/s, 200 Kaps, 1 TB

Disk = Node
• Magnetic storage (1 TB)
• Processor + RAM + LAN
• Management interface (HTTP + SOAP)
• Application execution environment
• Applications:
– File
– DB2/Oracle/SQL
– Notes/Exchange/TeamServer
– SAP/Siebel/…
– QuickBooks/TiVo/PC…
• The stack: applications, services, DBMS, RPC, file system, LAN driver, disk driver, OS kernel

Implications
• Conventional (central processor & memory):
– Offload device handling to NIC/HBA
– Higher-level protocols: I2O, NASD, VIA, IP, TCP…
– SMP and cluster parallelism is important.
• Radical (Terabyte/s backplane):
– Move the app to the NIC/device controller
– Higher-higher-level protocols: SOAP/DCOM/RMI…
– Cluster parallelism is VERY important.

Intermediate Step: Shared Logic
• Brick with 8-12 disk drives
• 200 MIPS/arm (or more)
• 2 x Gbps Ethernet
• General-purpose OS
• 10 k$/TB to 50 k$/TB
• Shared:
– Sheet metal
– Power
– Support/config
– Security
– Network ports
• Examples: Snap NAS ~1 TB (12 x 80 GB); NetApp NAS ~0.5 TB (8 x 70 GB); Maxtor NAS ~2 TB (12 x 160 GB)
• These bricks could run applications (e.g., SQL or Mail or…)

Example (slide courtesy of Brewster Kahle, @ Archive.org)
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives
• $4k/TB (street); 2.5 processors/TB; 1 GB RAM/TB
• JIT storage & processing: 3 weeks from order to deploy

What if Disk Replaces Tape? How Does It Work?
• Backup/restore:
– RAID (among the federation)
– Snapshot copies (in most OSs)
– Remote replicas (standard in DBMSs and file systems)
• Archive: use the "cold" 95% of disk space
• Interchange: send computers, not disks.

It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online: a geo-plex
• Scrub it continuously (look for errors)
• On failure:
– use the other copy until the failure is repaired,
– refresh the lost copy from the safe copy.
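The "12 days" restore claim is worth checking; a sketch using the slide's 1 GB/s restore bandwidth:

```python
# How long does it take to restore a petabyte at 1 GB/s?

petabyte_gb = 1_000_000     # 1 PB expressed in GB
restore_gb_per_s = 1        # 1 GB/s restore bandwidth (the slide's figure)

seconds = petabyte_gb / restore_gb_per_s
days = seconds / 86_400     # seconds per day
print(f"restore time: {days:.1f} days")   # ~11.6 days ("12 days" on the slide)
```

This is why the slide recommends keeping a second online copy (a geo-plex) rather than planning to restore from scratch.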
• The two copies can be organized differently (e.g., one by time, one by space)

Archive to Disk: 100 TB for 0.5 M$, Plus 1.5 "Free" Petabytes
• If you have 100 TB active, you need 10,000 mirrored disk arms (see TPC-C)
• So you have 1.6 PB of (mirrored) storage (160 GB drives)
• Use the "empty" 95% for archive storage
• No extra space or power cost
• Very fast access (milliseconds vs. hours)
• Snapshots are read-only (software enforced)
• Makes admin easy (saves people costs)

Disk as Tape: Archive (slide courtesy of Brewster Kahle, @ Archive.org)
• Tape is unreliable, specialized, slow, low-density, not improving fast, and expensive
• Using removable hard drives to replace tape's function has been successful
• When a "tape" is needed, the drive is put in a machine and it is online; there is no need to copy from tape before it is used
• Portable, durable, fast, dense; media cost = raw tapes. Unknown longevity: suspected good.

Disk as Tape: Interchange
• Tape interchange is frustrating (often unreadable)
• Beyond 1-10 GB, send media, not data
– FTP takes too long (an hour per GB)
– Bandwidth is still very expensive (1 $/GB)
• Writing a DVD is not much faster than the Internet
• New technology could change this: a 100 GB DVD @ 10 MBps would be competitive
• Write a 1 TB disk in 2.5 hrs (at 100 MBps)
• But how does interchange work?

Disk as Tape Interchange: What Format?
• Today I send 160 GB NTFS/SQL disks
• But that is not a good format for Linux/DB2 users
• Solution: ship NFS/CIFS/ODBC servers (not disks)
• Plug the "disk" into the LAN:
– DHCP, then file or DB server via a standard interface
– "pull" the data from the server.

Some Questions
• What is the product?
• How do I manage 10,000 nodes (disks)?
• How do I program 10,000 nodes (disks)?
• How does RAID work?
• How do I back up a PB?
• How do I restore a PB?

What Is the Product?
• Concept: plug it in and it works!
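The archive-to-disk arithmetic above can be sketched as follows. The 10,000-arm count is the slide's assumption (driven by TPC-C-style access rates, not capacity), and the "cold 95%" figure is likewise taken from the deck:

```python
# Arithmetic behind "100 TB active + ~1.5 'free' petabytes".

arms = 10_000        # mirrored disk arms needed for 100 TB active (slide's TPC-C figure)
drive_gb = 160       # 160 GB drives
cold_fraction = 0.95 # the slide's "empty 95%"

total_pb = arms * drive_gb / 1_000_000   # raw (mirrored) capacity
archive_pb = total_pb * cold_fraction    # the cold space usable for archive

print(f"total: {total_pb:.1f} PB")       # 1.6 PB
print(f"archive space: {archive_pb:.1f} PB")  # ~1.5 PB
```

The point is that the arms you must buy for performance come with an order of magnitude more capacity than the active data needs, so the archive space is effectively free.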
• Music/video/photo appliance (home)
• Game appliance
• "PC"
• File server appliance
• Data archive/interchange appliance
• Web server appliance
• DB server
• eMail appliance
• Application appliance
• (The only plugs: network and power)

How Does Scale-Out Work?
• Files: well-known designs
– Rooted tree partitioned across nodes
– Automatic cooling (migration)
– Mirrors or chained declustering
– Snapshots for backup/archive
• Databases: well-known designs
– Partitioning and remote replication, similar to files
– Distributed query processing
• Applications (hypothetical):
– Must be designed as mobile objects
– Middleware provides an object migration system
– Objects externalize methods to migrate (== backup/restore/archive)
– Web services seem to have the key ideas (XML representation)
– Example: an eMail object is a mailbox

Auto-Manage Storage
• 1980 rule of thumb: a DataAdmin per 10 GB, a SysAdmin per MIPS
• 2000 rule of thumb:
– A DataAdmin per 5 TB
– A SysAdmin per 100 clones (varies with the app)
• Problem:
– 5 TB is 50 k$ today, 5 k$ in a few years
– Admin cost >> storage cost!
• Challenge: automate ALL storage admin tasks

Admin: TB per Admin and "Guessed" $/TB
(does not include the cost of the application; overhead, not "substance")
• Google: 1 : 100 TB, 5 k$/TB/y
• Yahoo!: 1 : 50 TB, 20 k$/TB/y
• DB: 1 : 5 TB, 60 k$/TB/y
• Wall St.: 1 : 1 TB, 400 k$/TB/y (reported)
• Hardware is the dominant cost only at Google.
• How can we "waste" hardware to save people cost?

How Do I Manage 10,000 Nodes?
• You can't manage 10,000 x (for any x).
• They manage themselves; you manage the exceptional exceptions.
• Auto-manage:
– Plug & play hardware
– Auto load-balancing & placement of storage & processing
– A simple parallel programming model
– Fault masking

How Do I Program 10,000 Nodes?
• You can't program 10,000 x (for any x).
• They program themselves: you write embarrassingly parallel programs.
• Examples: SQL, Web, Google, Inktomi, HotMail, …
• PVM and MPI prove it must be automatic (unless you have a PhD)!
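The "mirrors or chained declustering" option on the scale-out slide can be illustrated with a minimal placement sketch: each partition's primary lives on node i and its backup on the next node around the ring, so when a node fails its successor serves its partitions (and dynamic rebalancing can then shift load further along the chain). Node and partition numbering here is hypothetical, not from the deck:

```python
# Chained-declustering placement sketch: primary on node i,
# backup on the ring successor (i + 1) mod n.

def place(partition: int, n_nodes: int) -> tuple[int, int]:
    """Return (primary_node, backup_node) for a partition."""
    primary = partition % n_nodes
    backup = (primary + 1) % n_nodes
    return primary, backup

def serving_node(failed: int, partition: int, n_nodes: int) -> int:
    """Which node serves reads for a partition while `failed` is down?"""
    primary, backup = place(partition, n_nodes)
    return backup if primary == failed else primary

n = 4
print([place(p, n) for p in range(n)])
# [(0, 1), (1, 2), (2, 3), (3, 0)]
print([serving_node(0, p, n) for p in range(n)])
# [1, 1, 2, 3]  -- node 0's partition shifts to node 1; the rest are untouched
```

Because backups chain around the ring rather than pairing nodes, no single mirror partner has to absorb a failed node's entire load once rebalancing kicks in.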
• Auto-parallelism is ESSENTIAL

Summary
• Disks will become supercomputers, so:
– Lots of computing to optimize the arm
– The app can sit close to the data (better modularity, locality)
– Storage appliances (self-organizing)
• The arm/capacity tradeoff: "waste" space to save accesses
– Compression (saves bandwidth)
– Mirrors
– Online backup/restore
– Online archive (vault to other drives, or to a geo-plex if possible)
• It is not that disks replace tapes: storage appliances replace tapes.
• Self-organizing storage servers (file systems); prototypes of this software exist.
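The mirror-and-scrub discipline that runs through the deck ("scrub it continuously", "refresh the lost copy from the safe copy") can be sketched in a few lines. This is a toy in-memory model standing in for two geo-plex replicas; a real scrubber would compare against stored checksums to decide which copy is bad, whereas here replica A is simply assumed authoritative:

```python
import hashlib

def checksum(block: bytes) -> str:
    """Content hash of one storage block."""
    return hashlib.sha256(block).hexdigest()

def scrub(replica_a: list, replica_b: list) -> list:
    """Compare per-block checksums across two replicas; repair
    disagreeing blocks in B from A. Returns repaired indices."""
    repaired = []
    for i, (a_blk, b_blk) in enumerate(zip(replica_a, replica_b)):
        if checksum(a_blk) != checksum(b_blk):
            # Assumption for this sketch: replica A holds the safe copy.
            replica_b[i] = a_blk
            repaired.append(i)
    return repaired

good = [b"block0", b"block1", b"block2"]
damaged = [b"block0", b"corrupt", b"block2"]
print(scrub(good, damaged))     # [1]
print(damaged == good)          # True -- lost copy refreshed from the safe copy
```

Run continuously over both copies, a loop like this is what turns "two copies online" into a durable archive.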