The 5 Minute Rule
Jim Gray
Microsoft Research
Gray@Microsoft.com
http://www.Research.Microsoft.com/~Gray/talks
Kilo  10^3
Mega  10^6
Giga  10^9
Tera  10^12
Peta  10^15
Exa   10^18
[Figure: scale of prefixes, with a marker reading "today, we are here".]
Storage Hierarchy (9 levels)
• Cache 1, 2
• Main (1, 2, 3 if NUMA)
• Disk (1 (cached), 2)
• Tape (1 (mounted), 2)
Meta-Message:
Technology Ratios Are Important
• If everything gets faster & cheaper at the same rate, THEN nothing really changes.
• Things getting MUCH BETTER:
– communication speed & cost 1,000x
– processor speed & cost 100x
– storage size & cost 100x
• Things staying about the same
– speed of light (more or less constant)
– people (10x more expensive)
– storage speed (only 10x better)
Today’s Storage Hierarchy:
Speed & Capacity vs Cost Tradeoffs
[Figure: two log-log panels, each plotted against Access Time (seconds) from 10^-9 to 10^3.
 Panel "Size vs Speed": Typical System (bytes).
 Panel "Price vs Speed": $/MB.
 Levels shown in both panels: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape.]
Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
• DRAM/DISK price ratio: 100:1 to 10:1 to 50:1
[Figure: three log-scale charts covering 1980 to 2000.
 "Disk Performance vs Time (accesses/second & Capacity)": Accesses per Second and Disk Capacity (GB) by Year.
 "Disk Performance vs Time": bandwidth (MB/s) and access time (ms) by Year.
 "Storage Price vs Time": $/MB by Year.]
Thesis: Performance = Storage Accesses, not Instructions Executed
• In the “old days” we counted instructions and IO’s
• Now we count memory references
• Processors wait most of the time
[Figure, "Where the time goes": clock ticks used by AlphaSort components. Categories shown: Sort, Disc Wait, OS, Memory Wait, B-Cache Data Miss, I-Cache Miss, D-Cache Miss.]
The Pico Processor
• 1 mm^3 Pico Processor: 1 M SPECmarks
• Storage hierarchy behind it:
– 10 picosecond RAM: megabyte
– 10 nanosecond RAM: 10 gigabyte
– 10 microsecond RAM: 1 terabyte
– 10 millisecond disc: 100 terabyte
– 10 second tape archive: 100 petabyte
• 10^6 clocks/fault to bulk RAM
• Event-horizon on chip
• VM reincarnated
• Multi-program cache
• Terror Bytes!
Storage Latency: How Far Away is the Data?

Level                 Relative latency   As far away as   Travel time
Registers             1                  My Head          1 min
On Chip Cache         2                  This Room
On Board Cache        10                 This Campus      10 min
Memory                100                Sacramento       1.5 hr
Disk                  10^6               Pluto            2 Years
Tape/Optical Robot    10^9               Andromeda        2,000 Years
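The travel times on the latency slide appear to follow from treating one register access as one minute. The sketch below works through that arithmetic. The one-unit-equals-one-minute mapping and the helper name `human_scale` are assumptions made for illustration; the slide does not state them.

```python
# Assumption (not stated on the slide): one unit of relative latency
# is mapped to one minute of human travel time.
MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 60 * 24 * 365

# Relative latencies as listed on the slide.
RELATIVE_LATENCY = {
    "Registers": 1,
    "On Chip Cache": 2,
    "On Board Cache": 10,
    "Memory": 100,
    "Disk": 10**6,
    "Tape/Optical Robot": 10**9,
}


def human_scale(units):
    """Convert relative latency units to a readable travel time."""
    minutes = units  # one unit == one minute under the assumption above
    if minutes < MINUTES_PER_HOUR:
        return f"{minutes:g} min"
    if minutes < MINUTES_PER_YEAR:
        return f"{minutes / MINUTES_PER_HOUR:.1f} hr"
    return f"{minutes / MINUTES_PER_YEAR:,.0f} years"


for level, units in RELATIVE_LATENCY.items():
    print(f"{level:20s} {units:>13,d}  {human_scale(units)}")
```

Under this assumption memory lands at about 1.7 hours, disk at about 2 years, and the tape robot at about 1,900 years, close to the rounded figures on the slide.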
The Five Minute Rule
• Trade DRAM for Disk Accesses
• Cost of an access (DriveCost / Access_per_second)
• Cost of a DRAM page ( $/MB / pages_per_MB)
• Break even has two terms: a technology term and an economic term
\[
\text{BreakEvenReferenceInterval} = \frac{\text{PagesPerMBofDRAM}}{\text{AccessPerSecondPerDisk}} \times \frac{\text{PricePerDiskDrive}}{\text{PricePerMBofDRAM}}
\]
• Grew page size to compensate for changing ratios.
• Still at 5 minutes for random, 1 minute for sequential
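The break-even interval is the product of the technology term (pages per MB of DRAM over accesses per second per disk) and the economic term (price per disk drive over price per MB of DRAM). The sketch below evaluates it with the DRAM and disk figures from the "For the Record" slide: a 900$ drive delivering about 100 accesses per second, and DRAM at 5000$/GB. The 8 KB page size and the conversion of 1 GB to 1,000 MB are assumptions made here for illustration, not values given in the talk.

```python
def break_even_interval_seconds(pages_per_mb_dram,
                                accesses_per_second_per_disk,
                                price_per_disk_drive,
                                price_per_mb_dram):
    """Break-even reference interval: keep a page in DRAM if it is
    re-referenced more often than this, otherwise read it from disk."""
    technology_term = pages_per_mb_dram / accesses_per_second_per_disk
    economic_term = price_per_disk_drive / price_per_mb_dram
    return technology_term * economic_term


PAGE_SIZE_KB = 8                        # assumption: 8 KB pages
pages_per_mb = 1024 // PAGE_SIZE_KB     # 128 pages per MB
dram_price_per_mb = 5000 / 1000         # 5000 $/GB, assuming 1 GB = 1,000 MB

interval = break_even_interval_seconds(
    pages_per_mb_dram=pages_per_mb,
    accesses_per_second_per_disk=100,   # disk Kaps from the slide
    price_per_disk_drive=900,           # disk unit price from the slide
    price_per_mb_dram=dram_price_per_mb,
)
print(f"break-even: {interval:.0f} s ({interval / 60:.1f} min)")
```

With these inputs the interval is about 230 seconds, a little under four minutes, which is the same order as the five-minute figure on the slide. A smaller assumed page size raises the interval in proportion, which is why page size was grown to compensate for the changing ratios.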
Shows Best Index Page Size ~16KB
[Figure: two charts of index page utility against page size (2 to 128 KB), one curve per disk transfer rate and one curve per index entry size. The plotted values are tabulated below.]

Index Page Utility vs Page Size and Disk Performance
Page Size (KB)   2      4      8      16     32     64     128
40 MB/s          0.65   0.74   0.83   0.91   0.97   0.99   0.94
10 MB/s          0.64   0.72   0.78   0.82   0.79   0.69   0.54
5 MB/s           0.62   0.69   0.73   0.71   0.63   0.50   0.34
3 MB/s           0.51   0.56   0.58   0.54   0.46   0.34   0.22
1 MB/s           0.40   0.44   0.44   0.41   0.33   0.24   0.16

Index Page Utility vs Page Size and Index Element Size
Page Size (KB)   2      4      8      16     32     64     128
16 B             0.64   0.72   0.78   0.82   0.79   0.69   0.54
32 B             0.54   0.62   0.69   0.73   0.71   0.63   0.50
64 B             0.44   0.53   0.60   0.64   0.64   0.57   0.45
128 B            0.34   0.43   0.51   0.56   0.56   0.51   0.41
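The "~16KB" claim can be checked by picking, for each curve, the page size with the highest utility. The sketch below does that over the utility values listed on the slide. The dictionary layout and the helper name `best_page_size_kb` are illustrative choices, and ties are resolved toward the smaller page size.

```python
PAGE_SIZES_KB = [2, 4, 8, 16, 32, 64, 128]

# Utility values as listed on the slide, one row per disk transfer rate.
UTILITY_BY_DISK_BANDWIDTH = {
    "40 MB/s": [0.65, 0.74, 0.83, 0.91, 0.97, 0.99, 0.94],
    "10 MB/s": [0.64, 0.72, 0.78, 0.82, 0.79, 0.69, 0.54],
    "5 MB/s":  [0.62, 0.69, 0.73, 0.71, 0.63, 0.50, 0.34],
    "3 MB/s":  [0.51, 0.56, 0.58, 0.54, 0.46, 0.34, 0.22],
    "1 MB/s":  [0.40, 0.44, 0.44, 0.41, 0.33, 0.24, 0.16],
}

# Utility values as listed on the slide, one row per index entry size.
UTILITY_BY_ENTRY_SIZE = {
    "16 B":  [0.64, 0.72, 0.78, 0.82, 0.79, 0.69, 0.54],
    "32 B":  [0.54, 0.62, 0.69, 0.73, 0.71, 0.63, 0.50],
    "64 B":  [0.44, 0.53, 0.60, 0.64, 0.64, 0.57, 0.45],
    "128 B": [0.34, 0.43, 0.51, 0.56, 0.56, 0.51, 0.41],
}


def best_page_size_kb(utilities):
    """Return the page size with the highest utility (first one on a tie)."""
    best_index = max(range(len(utilities)), key=utilities.__getitem__)
    return PAGE_SIZES_KB[best_index]


for table in (UTILITY_BY_DISK_BANDWIDTH, UTILITY_BY_ENTRY_SIZE):
    for label, utilities in table.items():
        print(f"{label:8s} best page size: {best_page_size_kb(utilities)} KB")
```

Every entry-size curve peaks at 16 KB, and so does the 10 MB/s curve. Slower disks favor smaller pages and the 40 MB/s curve favors 64 KB, so the 16 KB figure reads as a compromise around the 10 MB/s case.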
Standard Storage Metrics
• Capacity:
– RAM: MB and $/MB: today at 10MB & 100$/MB
– Disk: GB and $/GB: today at 10 GB and 200$/GB
– Tape: TB and $/TB: today at .1TB and 25k$/TB (nearline)
• Access time (latency)
– RAM: 100 ns
– Disk: 10 ms
– Tape: 30 second pick, 30 second position
• Transfer rate
– RAM: 1 GB/s
– Disk: 5 MB/s (arrays can go to 1GB/s)
– Tape: 5 MB/s (striping is problematic)
New Storage Metrics:
Kaps, Maps, SCAN?
• Kaps: How many kilobyte objects served per second
– The file server, transaction processing metric
– This is the OLD metric.
• Maps: How many megabyte objects served per second
– The Multi-Media metric
• SCAN: How long to scan all the data
– the data mining and utility metric
• And
– Kaps/$, Maps/$, TBscan/$
For the Record
(good 1998 devices packaged in system)
http://www.tpc.org/results/individual_results/Dell/dell.6100.9801.es.pdf

                      DRAM      DISK      TAPE robot
Unit capacity (GB)    1         9         35 X 14
Unit price $          5000      900       10000
$/GB                  5000      100       20
Latency (s)           1.E-7     1.E-2     3.E+1
Bandwidth (MB/s)      500       5         5
Kaps                  5.E+5     1.E+2     3.E-2
Maps                  5.E+2     4.76      3.E-2
Scan time (s/TB)      2         1800      98000
$/Kaps                1.E-10    1.E-7     3.E-3
$/Maps                1.E-7     2.E-6     3.E-3
$/TBscan              $0.11     $2        $296
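The Kaps, Maps, and scan-time figures for the 1998 devices can be approximated from latency and bandwidth alone. The sketch below assumes a simple model that is not spelled out in the talk: each object costs one latency plus its transfer time, and scanning a unit takes its capacity divided by its bandwidth. It also assumes decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB). The function names are illustrative.

```python
def objects_per_second(latency_s, bandwidth_mb_per_s, object_mb):
    """Objects served per second when every access pays one latency
    plus the transfer time of the object (assumed model)."""
    return 1.0 / (latency_s + object_mb / bandwidth_mb_per_s)


def kaps(latency_s, bandwidth_mb_per_s):
    """Kilobyte objects served per second."""
    return objects_per_second(latency_s, bandwidth_mb_per_s, object_mb=0.001)


def maps(latency_s, bandwidth_mb_per_s):
    """Megabyte objects served per second."""
    return objects_per_second(latency_s, bandwidth_mb_per_s, object_mb=1.0)


def unit_scan_seconds(capacity_gb, bandwidth_mb_per_s):
    """Time to read one whole unit sequentially."""
    return capacity_gb * 1000 / bandwidth_mb_per_s


# (capacity GB, latency s, bandwidth MB/s) as listed on the slide.
DEVICES = {
    "DRAM": (1, 1e-7, 500),
    "DISK": (9, 1e-2, 5),
    "TAPE robot": (35 * 14, 3e1, 5),
}

for name, (capacity_gb, latency_s, bandwidth) in DEVICES.items():
    print(f"{name:10s} Kaps={kaps(latency_s, bandwidth):.3g} "
          f"Maps={maps(latency_s, bandwidth):.3g} "
          f"scan={unit_scan_seconds(capacity_gb, bandwidth):.0f} s")
```

Under this model the disk comes out at about 98 Kaps and 4.76 Maps, matching the slide's 1.E+2 and 4.76. The scan times of 2, 1,800, and 98,000 seconds also match the slide's scan-time row, which suggests those figures are seconds to scan one whole unit.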
How To Get Lots of Maps, SCANs
• parallelism: use many little devices in parallel
– At 10 MB/s: 1.2 days to scan 1 Terabyte
– 1,000 x parallel: 100 seconds to SCAN 1 Terabyte
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
• Beware of the media myth
• Beware of the access time myth
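The scan figures on this slide follow from dividing the data size by the aggregate bandwidth. A minimal sketch of that arithmetic, assuming decimal units (1 Terabyte = 10^12 bytes, 10 MB/s = 10^7 bytes per second) and perfectly even partitioning across devices; the function name is illustrative:

```python
def scan_seconds(data_bytes, bytes_per_second_per_device, devices=1):
    """Time to scan all the data when it is split evenly over `devices`
    that each read their share at the same rate (idealized)."""
    return data_bytes / (bytes_per_second_per_device * devices)


TERABYTE = 10**12
RATE = 10 * 10**6          # 10 MB/s per device
SECONDS_PER_DAY = 86_400

serial = scan_seconds(TERABYTE, RATE)
parallel = scan_seconds(TERABYTE, RATE, devices=1000)

print(f"1 device:      {serial / SECONDS_PER_DAY:.1f} days")
print(f"1,000 devices: {parallel:.0f} seconds")
```

One device needs 100,000 seconds, about 1.2 days, and 1,000 devices in parallel need 100 seconds. Real systems lose some of this to skew and coordination, which the sketch ignores.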
The Disk Farm On a Card
The 100GB disc card
An array of discs
[Figure: the disc card, labeled 14"]
Can be used as
– 100 discs
– 1 striped disc
– 10 Fault Tolerant discs
– ....etc
LOTS of accesses/second and bandwidth
Life is cheap, it’s the accessories that cost ya.
Processors are cheap, it’s the peripherals that cost ya (a 10k$ disc card).
Tape Farms for Tertiary Storage
Not Mainframe Silos
One 10K$ robot:
– 14 tapes
– 500 GB
– 5 MB/s
– 20$/GB
– 30 Maps
– Scan in 27 hours
100 robots (1M$):
– 50TB
– 50$/GB
– 3K Maps
– 27 hr Scan
Many independent tape robots (like a disc farm)
The Metrics:
Disk and Tape Farms Win
Data Motel: Data checks in, but it never checks out
[Figure: log-scale chart (0.01 to 1,000,000) of GB/K$, Kaps, Maps, and SCANS/Day for three configurations: 1000 x Disc Farm; STC Tape Robot (6,000 tapes, 8 readers); 100x DLT Tape Farm.]
Tape & Optical:
Beware of the Media Myth
Optical is cheap: 200 $/platter, 2 GB/platter
=> 100$/GB (2x cheaper than disc)
Tape is cheap: 30 $/tape, 20 GB/tape
=> 1.5 $/GB (100x cheaper than disc).
Tape & Optical Reality:
Media is 10% of System Cost
Tape needs a robot (10 k$ ... 3 M$ )
10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB
(1x…10x cheaper than disc)
Optical needs a robot (100 k$ )
100 platters = 200GB ( TODAY ) => 400 $/GB
( more expensive than mag disc )
Robots have poor access times
Not good for Library of Congress (25TB)
Data motel: data checks in but it never checks out!
The Access Time Myth
The Myth: seek or pick time dominates
The reality: (1) Queuing dominates
(2) Transfer dominates BLOBs
(3) Disk seeks often short
Implication: many cheap servers are better than one fast expensive server
– shorter queues
– parallel transfer
– lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays
[Figure: access time breakdown into Seek, Rotate, Transfer, and Wait components.]