The 5 Minute Rule
Jim Gray
Microsoft Research
Gray@Microsoft.com
http://www.Research.Microsoft.com/~Gray/talks

Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18
[Slide marker on this scale: "today, we are here"]

Storage Hierarchy (9 levels)
• Cache 1, 2
• Main (1, 2, 3 if NUMA)
• Disk (1 (cached), 2)
• Tape (1 (mounted), 2)

Meta-Message: Technology Ratios Are Important
• If everything gets faster & cheaper at the same rate THEN nothing really changes.
• Things getting MUCH BETTER:
  – communication speed & cost 1,000x
  – processor speed & cost 100x
  – storage size & cost 100x
• Things staying about the same:
  – speed of light (more or less constant)
  – people (10x more expensive)
  – storage speed (only 10x better)

Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Figure: two log-log plots against access time (seconds, 10^-9 to 10^3). "Size vs Speed" plots typical system size (bytes); "Price vs Speed" plots $/MB. Levels shown: cache, main, secondary (disc), online tape, nearline tape, offline tape.]

Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
• DRAM/DISK 100:1 to 10:10 to 50:1
[Figure: three charts over 1980 to 2000. "Disk Performance vs Time": access time (ms) and bandwidth (MB/s). "Disk Performance vs Time (accesses/second & Capacity)": accesses per second and disk capacity (GB). "Storage Price vs Time": $/MB.]

Thesis: Performance = Storage Accesses, not Instructions Executed
• In the "old days" we counted instructions and IOs
• Now we count memory references
• Processors wait most of the time
[Figure: "Where the time goes: clock ticks used by AlphaSort components". Segments: Sort, Disc Wait, OS, Memory Wait, B-Cache Data Miss, I-Cache Miss, D-Cache Miss.]

The Pico Processor
• 1 mm^3 Pico Processor, 1 M SPECmarks
• Memory hierarchy around it:
  – 10 pico-second RAM: megabyte
  – 10 nano-second RAM: 10 gigabyte
  – 10 microsecond RAM: 1 terabyte
  – 10 millisecond disc: 100 terabyte
  – 10 second tape archive: 100 petabyte
• 10^6 clocks per fault to bulk RAM
• Event-horizon on chip
• VM reincarnated
• Multi-program cache
• Terror Bytes!

Storage Latency: How Far Away is the Data?

Level                 Relative latency   Analogy       Time
Registers             1                  My Head       1 min
On Chip Cache         2                  This Room
On Board Cache        10                 This Campus   10 min
Memory                100                Sacramento    1.5 hr
Disk                  10^6               Pluto         2 Years
Tape/Optical Robot    10^9               Andromeda     2,000 Years

The Five Minute Rule
• Trade DRAM for disk accesses
• Cost of an access (DriveCost / Access_per_second)
• Cost of a DRAM page ($/MB / pages_per_MB)
• Break even has two terms: a technology term and an economic term

\[
\text{BreakEvenReferenceInterval} =
\frac{\text{PagesPerMBofDRAM}}{\text{AccessPerSecondPerDisk}}
\times
\frac{\text{PricePerDiskDrive}}{\text{PricePerMBofDRAM}}
\]

• Grew page size to compensate for changing ratios.
• Still at 5 minutes for random access, 1 minute for sequential

Shows Best Index Page Size ~16KB

Index Page Utility vs Page Size and Index Element Size

Entry size   Page size (KB):  2      4      8      16     32     64     128
16 B                          0.64   0.72   0.78   0.82   0.79   0.69   0.54
32 B                          0.54   0.62   0.69   0.73   0.71   0.63   0.50
64 B                          0.44   0.53   0.60   0.64   0.64   0.57   0.45
128 B                         0.34   0.43   0.51   0.56   0.56   0.51   0.41

Index Page Utility vs Page Size and Disk Performance

Disk rate    Page size (KB):  2      4      8      16     32     64     128
40 MB/s                       0.65   0.74   0.83   0.91   0.97   0.99   0.94
10 MB/s                       0.64   0.72   0.78   0.82   0.79   0.69   0.54
5 MB/s                        0.62   0.69   0.73   0.71   0.63   0.50   0.34
3 MB/s                        0.51   0.56   0.58   0.54   0.46   0.34   0.22
1 MB/s                        0.40   0.44   0.44   0.41   0.33   0.24   0.16

Standard Storage Metrics
• Capacity:
  – RAM: MB and $/MB: today at 10 MB and 100 $/MB
  – Disk: GB and $/GB: today at 10 GB and 200 $/GB
  – Tape: TB and $/TB: today at .1 TB and 25 k$/TB (nearline)
• Access time (latency):
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate:
  – RAM: 1 GB/s
  – Disk: 5 MB/s (arrays can go to 1 GB/s)
  – Tape: 5 MB/s (striping is problematic)

New Storage Metrics: Kaps, Maps, SCAN?
• Kaps: how many kilobyte objects served per second
  – The file server, transaction processing metric
  – This is the OLD metric.
• Maps: how many megabyte objects served per second
  – The multimedia metric
• SCAN: how long to scan all the data
  – The data mining and utility metric
• And:
  – Kaps/$, Maps/$, TBscan/$

For the Record (good 1998 devices packaged in system)
http://www.tpc.org/results/individual_results/Dell/dell.6100.9801.es.pdf

                      DRAM      DISK      TAPE robot
Unit capacity (GB)    1         9         35 X 14
Unit price $          5000      900       10000
$/GB                  5000      100       20
Latency (s)           1.E-7     1.E-2     3.E+1
Bandwidth (MB/s)      500       5         5
Kaps                  5.E+5     1.E+2     3.E-2
Maps                  5.E+2     4.76      3.E-2
Scan time (s/TB)      2         1800      98000
$/Kaps                1.E-10    1.E-7     3.E-3
$/Maps                1.E-7     2.E-6     3.E-3
$/TBscan              $0.11     $2        $296

How To Get Lots of Maps, SCANs
• Parallelism: use many little devices in parallel
  – At 10 MB/s: 1.2 days to scan 1 Terabyte
  – 1,000 x parallel: 100 seconds to SCAN
  – Parallelism: divide a big problem into many smaller ones to be solved in parallel.
• Beware of the media myth
• Beware of the access time myth

The Disk Farm On a Card
• The 100GB disc card: an array of discs
• Can be used as:
  – 100 discs
  – 1 striped disc
  – 10 fault tolerant discs
  – ....etc
• LOTS of accesses/second and bandwidth
• Life is cheap, it's the accessories that cost ya.
• Processors are cheap, it's the peripherals that cost ya (a 10k$ disc card).
[Figure: a 14" card carrying the disc array.]

Tape Farms for Tertiary Storage, Not Mainframe Silos
• One robot: 10K$, 14 tapes, 500 GB, 5 MB/s, 20$/GB, 30 Maps
• Scan in 27 hours.
• Many independent tape robots (like a disc farm)
• 100 robots: 1M$, 50TB, 50$/GB, 3K Maps, 27 hr Scan

The Metrics: Disk and Tape Farms Win
[Figure: log-scale bar chart (0.01 to 1,000,000) of GB/K$, Kaps, Maps, and SCANS/Day for three systems: 1000 x Disc Farm; STC Tape Robot (6,000 tapes, 8 readers); 100x DLT Tape Farm.]
Data Motel: data checks in, but it never checks out.

Tape & Optical: Beware of the Media Myth
• Optical is cheap: 200 $/platter, 2 GB/platter => 100$/GB (2x cheaper than disc)
• Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc)

Tape & Optical Reality: Media is 10% of System Cost
• Tape needs a robot (10 k$ ... 3 M$), 10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB (1x…10x cheaper than disc)
• Optical needs a robot (100 k$), 100 platters = 200GB (TODAY) => 400 $/GB (more expensive than mag disc)
• Robots have poor access times
• Not good for Library of Congress (25TB)
• Data motel: data checks in but it never checks out!

The Access Time Myth
• The Myth: seek or pick time dominates
• The reality:
  (1) Queuing dominates
  (2) Transfer dominates BLOBs
  (3) Disk seeks often short
• Implication: many cheap servers better than one fast expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
• This is now obvious for disk arrays
• This will be obvious for tape arrays
[Figure: access time broken into Wait, Seek, Rotate, and Transfer segments.]
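
Two pieces of arithmetic from the slides, the break-even reference interval (The Five Minute Rule) and the scan time under parallelism (How To Get Lots of Maps, SCANs), can be checked with a minimal Python sketch. The function and parameter names are hypothetical, chosen for illustration. The disk inputs (900 $, about 100 accesses per second) come from the "For the Record" table; the DRAM price of 5 $/MB approximates that table's 5000 $/GB, and the 8 KB page size (128 pages per MB) is an assumption that the slides do not state.

```python
def break_even_interval_seconds(pages_per_mb_dram: float,
                                accesses_per_second_per_disk: float,
                                price_per_disk_drive: float,
                                price_per_mb_dram: float) -> float:
    """Break-even reference interval: technology term times economic term."""
    technology_term = pages_per_mb_dram / accesses_per_second_per_disk
    economic_term = price_per_disk_drive / price_per_mb_dram
    return technology_term * economic_term


def scan_time_seconds(total_bytes: float,
                      bytes_per_second_per_device: float,
                      devices: int = 1) -> float:
    """Time to scan the data when it is spread evenly over `devices` devices."""
    return total_bytes / (bytes_per_second_per_device * devices)


if __name__ == "__main__":
    # Assumed 8 KB pages: 1024 KB / 8 KB = 128 pages per MB of DRAM.
    interval = break_even_interval_seconds(
        pages_per_mb_dram=128,
        accesses_per_second_per_disk=100,
        price_per_disk_drive=900,
        price_per_mb_dram=5,
    )
    print(f"break-even interval: {interval:.0f} s ({interval / 60:.1f} min)")

    # 1 Terabyte at 10 MB/s, on one device and then on 1,000 devices.
    one_tb = 1e12
    ten_mb_per_s = 1e7
    serial = scan_time_seconds(one_tb, ten_mb_per_s)
    parallel = scan_time_seconds(one_tb, ten_mb_per_s, devices=1000)
    print(f"serial scan: {serial / 86400:.1f} days")
    print(f"1,000-way parallel scan: {parallel:.0f} s")
```

Under these assumed inputs the break-even interval comes out at a few minutes, the same order as the rule's five minutes, and the scan figures match the slide: about 1.2 days on one device and 100 seconds across 1,000. Doubling the page size halves the technology term, which is one way to read "grew page size to compensate for changing ratios".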