Storage: Alternate Futures
Jim Gray
Microsoft Research
http://Research.Microsoft.com/~Gray/talks
IBM Almaden, 1 December 1999
[Title graphic: the capacity scale — Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
1
Acknowledgments: Thank You!!
• Dave Patterson:
– Convinced me that processors are moving to the
devices.
• Kim Keeton and Erik Riedel
– Showed that many useful subtasks can be done by
disk-processors, and quantified execution interval
• Remzi Arpaci-Dusseau
– Re-validated Amdahl's laws
2
Outline
• The Surprise-Free Future (5 years)
  – 500 mips cpus for 10$
  – 1 Gb RAM chips
  – MAD at 50 Gbpsi
  – 10 GBps SANs are ubiquitous
  – 1 GBps WANs are ubiquitous
• Some consequences
  – Absurd (?) consequences.
  – Auto-manage storage
  – Raid10 replaces Raid5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”
3
The Surprise-free Storage Future
• 1 Gb RAM chips
• MAD at 50 Gbpsi
• Drives shrink one quantum
• Standard IO
• 10 GBps SANs are ubiquitous
• 1 Gbps WANs are ubiquitous
• 5 bips cpus for 1K$, and 500 mips cpus for 10$
4
1 Gb RAM Chips
• Moving to 256 Mb chips now
• 1Gb will be “standard” in 5 years,
4 Gb will be premium product.
• Note:
  – 256 Mb chip = 32 MB: the smallest memory today
  – 1 Gb chip = 128 MB: the smallest memory then
5
System On A Chip
• Integrate Processing with memory on one chip
– chip is 75% memory now
– 1 MB cache >> 1960 supercomputers
– 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM,… projects abound
• Integrate Networking with processing on one chip
– system bus is a kind of network
– ATM, FiberChannel, Ethernet,.. Logic on chip.
– Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
6
500 mips System On A Chip
for 10$
• 486 now 7$
• 233 MHz ARM for 10$, system on a chip
  http://www.cirrus.com/news/products99/news-product14.html
• AMD/Celeron 266 ~ 30$
• In 5 years, today’s leading edge will be
  – System on chip (cpu, cache, mem ctlr, multiple IO)
  – Low cost
  – Low-power
  – Have integrated IO
• High end is 5 BIPS cpus
7
Standard IO in 5 Years
• Replace PCI with something better
• Probably will still need a mezzanine bus standard
• Multiple serial links directly from processor
• Fast (10 GBps/link) for a few meters
• System Area Networks (SANS) ubiquitous
(VIA morphs to SIO?)
8
Ubiquitous 10 GBps SANs
in 5 years
• 1 Gbps Ethernet is a reality now.
  – Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,…
• 10 Gbps ×4 WDM deployed now (OC192) ≈ 1 GBps
  – 3 Tbps WDM working in the lab
• In 5 years, expect 10x; progress is astonishing
• Gilder’s law: bandwidth grows 3x/year
  http://www.forbes.com/asap/97/0407/090.htm
[Chart: network bandwidth over time — roughly 5, 20, 40, 80, 120 MBps (1 Gbps)]
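A quick back-of-envelope check of the two growth claims above (a sketch; the 3x/year and 10x figures are the slide's own numbers):

```python
# Gilder's law (~3x/year) vs. the slide's conservative "expect 10x in 5 years".
gilder_5yr = 3 ** 5          # 243x if bandwidth really triples every year
slide_5yr = 10               # the deliberately conservative planning number

print(f"Gilder: {gilder_5yr}x in 5 years; the slide assumes only {slide_5yr}x")
```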
9
Thin Clients mean HUGE servers
• AOL hosting customer pictures
• Hotmail allows 5 MB/user, 50 M users
• Web sites offer electronic vaulting for SOHO.
• IntelliMirror: replicate client state on server
• Terminal server: timesharing returns
• …. Many more.
10
Remember Your Roots?
11
MAD at 50 Gbpsi
• MAD: Magnetic Areal Density:
  – 3-10 Gbpsi in products
  – 28 Gbpsi in the lab
  – 50 Gbpsi = paramagnetic limit, but… people have ideas.
• Capacity: rises 10x in 5 years (conservative)
• Bandwidth: rises 4x in 5 years (density + rpm)
• Disk: 50 GB to 500 GB
  – 60-80 MBps
  – 1k$/TB
  – 15 minute to 3 hour scan time.
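A small worked example of where the scan-time bullet comes from (a sketch using the slide's projected numbers; the rounding is mine):

```python
# Scan time = capacity / sequential bandwidth.
capacity_gb = 500                      # projected drive: 50 GB -> 500 GB
for bandwidth_mb_s in (60, 80):        # projected 60-80 MB/s
    hours = capacity_gb * 1000 / bandwidth_mb_s / 3600
    print(f"{bandwidth_mb_s} MB/s -> {hours:.1f} hour scan")
# ~2 hours either way: hence the multi-hour scan times the next slides worry about.
```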
12
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
[Diagram: a 1 TB drive delivering 100 MB/s and 200 Kaps]
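Why that 1 TB drive looks "absurd" — a sketch using the slide's numbers:

```python
capacity_gb = 1000
bandwidth_mb_s = 100
accesses_per_sec = 200          # "200 Kaps": ~200 kilobyte-object accesses/sec

scan_hours = capacity_gb * 1000 / bandwidth_mb_s / 3600
print(f"Scan time: {scan_hours:.1f} hours")                     # ~2.8 hr
print(f"Accesses per GB: {accesses_per_sec / capacity_gb:.1f} "
      f"(i.e. 1 aps per {capacity_gb / accesses_per_sec:.0f} GB)")  # 1 aps / 5 GB
```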
13
Disk vs Tape
• Disk
  – 47 GB
  – 15 MBps
  – 5 ms seek time
  – 3 ms rotate latency
  – 9$/GB for drive
  – 3$/GB for ctlrs/cabinet
  – 4 TB/rack
• Tape
  – 40 GB
  – 5 MBps
  – 30 sec pick time
  – Many minute seek time
  – 5$/GB for media
  – 10$/GB for drive+library
  – 10 TB/rack
The price advantage of tape is narrowing, and
the performance advantage of disk is growing
[Photo, with guesstimates: CERN, 200 TB of 3480 tapes; 2 columns = 50 GB; rack = 1 TB = 20 drives]
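A rough restore-time comparison for the CERN-scale example (a sketch: the per-drive rates are the slide's, the drive counts are illustrative assumptions of mine):

```python
data_tb = 200                          # the CERN-scale archive above

tape_drives, tape_rate = 20, 5         # assume 20 tape drives at 5 MB/s each
disk_drives, disk_rate = 200, 15       # assume 200 disks at 15 MB/s each

def days(tb, drives, mb_s):
    return tb * 1e6 / (drives * mb_s) / 86400

print(f"Tape: {days(data_tb, tape_drives, tape_rate):.0f} days")   # ~23 days
print(f"Disk: {days(data_tb, disk_drives, disk_rate):.1f} days")   # under a day
```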
14
Standard Storage Metrics
• Capacity:
  – RAM: MB and $/MB: today at 512 MB and 3$/MB
  – Disk: GB and $/GB: today at 50 GB and 10$/GB
  – Tape: TB and $/TB: today at 50 GB and 12k$/TB (nearline)
• Access time (latency)
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate
  – RAM: 1 GB/s
  – Disk: 15 MB/s - - - arrays can go to 1 GB/s
  – Tape: 5 MB/s - - - striping is problematic, but “works”
15
New Storage Metrics:
Kaps, Maps, SCAN?
• Kaps: How many kilobyte objects served per second
– The file server, transaction processing metric
– This is the OLD metric.
• Maps: How many megabyte objects served per second
– The Multi-Media metric
• SCAN: How long to scan all the data
– the data mining and utility metric
• And
– Kaps/$, Maps/$, TBscan/$
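A sketch of the three metrics for a hypothetical drive (the parameters are illustrative; the metric definitions are the slide's):

```python
capacity_gb  = 50
seq_mb_s     = 15       # sequential transfer rate
avg_access_s = 0.010    # ~10 ms per random access (seek + rotate)

# Kaps: kilobyte objects served per second -- dominated by access time.
kaps = 1 / (avg_access_s + 0.001 / seq_mb_s)

# Maps: megabyte objects served per second -- access plus a 1 MB transfer.
maps = 1 / (avg_access_s + 1 / seq_mb_s)

# SCAN: time to read every byte sequentially.
scan_minutes = capacity_gb * 1000 / seq_mb_s / 60

print(f"Kaps ≈ {kaps:.0f}, Maps ≈ {maps:.1f}, SCAN ≈ {scan_minutes:.0f} minutes")
```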
16
The Access Time Myth
• The Myth: seek or pick time dominates
• The reality:
  (1) Queuing dominates
  (2) Transfer dominates BLOBs
  (3) Disk seeks are often short
• Implication: many cheap servers are better than one fast, expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
• This is obvious for disk arrays
• It is even more obvious for tape arrays
[Pie charts: per-request time split among Wait, Seek, Rotate, Transfer]
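A sketch of the single-request service-time breakdown (illustrative numbers: the 5 ms seek and 3 ms rotate from the Disk-vs-Tape slide, 15 MB/s transfer):

```python
seek_s, rotate_s, rate_mb_s = 0.005, 0.003, 15

for size_kb in (8, 1024, 65536):                 # 8 KB, 1 MB, 64 MB objects
    transfer_s = size_kb / 1024 / rate_mb_s
    total_s = seek_s + rotate_s + transfer_s
    print(f"{size_kb:>6} KB: transfer is {100 * transfer_s / total_s:.0f}% of service time")
# Small objects: seek + rotate dominate; BLOBs: transfer dominates -- and
# queueing on a busy device can dwarf both.
```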
19
Storage Ratios Changed
• DRAM/disk media price ratio changed
  – 1970-1990: 100:1
  – 1990-1995: 10:1
  – 1995-1997: 50:1
  – today: ~0.1$/MB disk vs 3$/MB DRAM = 30:1
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
[Charts: Storage Price vs Time (MB per kilo-dollar, 1980-2000); Disk accesses/second vs Time; Disk Performance vs Time (Capacity in GB, seeks per second, bandwidth in MB/s, 1980-2000)]
20
Data on Disk
Can Move to RAM in 8 years
[Chart: Storage Price vs Time (MB per kilo-dollar, 1980-2000) — today’s ~30:1 DRAM/disk price gap closes in about 6 years at historical rates]
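A sketch of the crossover arithmetic behind the slide title (the 30:1 gap is the slide's; the ~100x-per-decade DRAM price improvement is my assumption):

```python
import math

price_gap = 30                           # DRAM ~30x more expensive per MB than disk today
improvement_per_year = 10 ** (2 / 10)    # assume ~100x per decade, i.e. ~1.58x per year

years = math.log(price_gap) / math.log(improvement_per_year)
print(f"{years:.0f} years")              # ~7 years, consistent with the 6-8 year figure here
```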
21
Outline
• The Surprise-Free Future (5 years)
  – 500 mips cpus for 10$
  – 1 Gb RAM chips
  – MAD at 50 Gbpsi
  – 10 GBps SANs are ubiquitous
  – 1 GBps WANs are ubiquitous
• Some consequences
  – Absurd (?) consequences.
  – Auto-manage storage
  – Raid10 replaces Raid5
  – Disc-packs
  – Disk is the archive media of choice
• A surprising future?
  – Disks (and other useful things) become supercomputers.
  – Apps run “in the disk”.
22
The (absurd?) consequences
• 1 GB RAM chips
• MAD at 50 Gbpsi
• Drives shrink one quantum
• 10 GBps SANs are ubiquitous
• 500 mips cpus for 10$
• 5 bips cpus at high end
• 256-way NUMA?
• Huge main memories
  – now: 500 MB - 64 GB memories
  – then: 10 GB - 1 TB memories
• Huge disks
  – now: 5-50 GB 3.5” disks
  – then: 50-500 GB disks
• Petabyte storage farms
  – (that you can’t back up or restore).
• Disks >> tapes
  – “Small” disks: one platter, one inch, 10 GB
• SAN convergence
  – 1 GBps point to point is easy
23
The Absurd? Consequences
• Further segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips; io: 1 b/ips
[Diagram: Processors (~1 Tips) — RAM Memory (~1 TB) — Disks (~100 TB)]
24
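A sketch applying the Amdahl ratios above to the ~1 Tips system in the diagram (the ratios and sizes are the slide's; the arithmetic is mine):

```python
instructions_per_sec = 1e12          # ~1 Tips of aggregate processing

bus_bytes_per_ips = 10               # Amdahl: ~10 bytes of memory-bus traffic per instruction
io_bits_per_ips   = 1                # Amdahl: ~1 bit of IO per instruction

bus_bw = instructions_per_sec * bus_bytes_per_ips        # 10 TB/s of memory bandwidth
io_bw  = instructions_per_sec * io_bits_per_ips / 8      # 125 GB/s of IO bandwidth

print(f"Memory bus: {bus_bw / 1e12:.0f} TB/s,  IO: {io_bw / 1e9:.0f} GB/s")
# Moving all of that between separate storage and processing boxes is the
# "useless data movement" this slide objects to.
```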
Storage Latency:
How Far Away is the Data?
[Diagram: the storage hierarchy, latency in clock ticks, with a distance analogy]
  Tape/Optical Robot   10^9   Andromeda (2,000 years)
  Disk                 10^6   Pluto (2 years)
  Memory               100    Olympia (1.5 hr)
  On Board Cache       10     This Hotel (10 min)
  On Chip Cache        2      This Room (1 min)
  Registers            1      My Head
25
Consequences
• AutoManage Storage
• Sixpacks (for arm-limited apps)
• Raid5 -> Raid10
• Disk-to-disk backup
• Smart disks
26
Auto Manage Storage
• 1980 rule of thumb:
– A DataAdmin per 10GB, SysAdmin per mips
• 2000 rule of thumb
– A DataAdmin per 5TB
– SysAdmin per 100 clones (varies with app).
• Problem:
– 5TB is 60k$ today, 10k$ in a few years.
– Admin cost >> storage cost???
• Challenge:
– Automate ALL storage admin tasks
27
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
[Diagram: a 1 TB drive delivering 100 MB/s and 200 Kaps]
28
Extreme case: 1TB disk:
Alternatives
• Use all the heads in parallel
  – Scan in 30 minutes
  – Still one Kaps/5GB
  [Diagram: 1 TB, 500 MB/s, 200 Kaps]
• Use one platter per arm
  – Share power/sheetmetal
  – Scan in 30 minutes
  – One Kaps per GB
  [Diagram: multiple 200 GB units, 500 MB/s, 1,000 Kaps]
29
Drives shrink (1.8”, 1”)
• 150 kaps for 500 GB is VERY cold data
• 3 GB/platter today, 30 GB/platter in 5 years.
• Most disks are ½ full
• TPC benchmarks use 9 GB drives (need arms or bandwidth).
• One solution: smaller form factor
  – More arms per GB
  – More arms per rack
  – More arms per Watt
30
Prediction: 6-packs
• One way or another, when disks get huge
– Will be packaged as multiple arms
– Parallel heads give bandwidth
– Independent arms give bandwidth & aps
• Package shares power, package, interfaces…
31
Stripes, Mirrors, Parity (RAID 0,1, 5)
• RAID 0: Stripes
  – bandwidth
  [layout across 3 disks: 0,3,6,.. | 1,4,7,.. | 2,5,8,..]
• RAID 1: Mirrors, Shadows,…
  – Fault tolerance
  – Reads faster, writes 2x slower
  [layout: 0,1,2,.. | 0,1,2,..]
• RAID 5: Parity
  – Fault tolerance
  – Reads faster
  – Writes 4x or 6x slower.
  [layout: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..]
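A minimal sketch of the parity idea behind RAID 5 (not any product's layout): the parity block is the XOR of the data blocks in a stripe, so any single lost block can be rebuilt from the survivors.

```python
import functools, operator

def xor_blocks(blocks):
    # XOR corresponding bytes of all blocks in the stripe.
    return bytes(functools.reduce(operator.xor, col) for col in zip(*blocks))

d0, d1 = b"\x10\x20\x30\x40", b"\x01\x02\x03\x04"   # two data blocks
parity = xor_blocks([d0, d1])

# Lose d1; rebuild it from d0 and the parity block:
assert xor_blocks([d0, parity]) == d1

# This is also why small writes cost extra IOs: updating one data block means
# reading the old data and old parity, recomputing, and writing both back
# (the "4 logical IOs" on the next slide).
```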
32
RAID 10 (stripes of mirrors) Wins
“wastes space, saves arms”
• RAID 5:
  – Performance: 225 reads/sec, 70 writes/sec
  – Write = 4 logical IOs, 2 seeks + 1.7 rotates
  – SAVES SPACE
  – Performance degrades on failure
• RAID 1:
  – Performance: 250 reads/sec, 100 writes/sec
  – Write = 2 logical IOs, 2 seeks + 0.7 rotates
  – SAVES ARMS
  – Performance improves on failure
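A sketch of the write penalty behind "saves arms" (the 4-IO vs 2-IO costs are the slide's; the arm budget is an illustrative assumption):

```python
arm_ops_per_sec = 200                 # illustrative random-IO budget across the array's arms

raid5_write_rate  = arm_ops_per_sec / 4   # each small RAID-5 write burns ~4 arm operations
raid10_write_rate = arm_ops_per_sec / 2   # a mirrored write burns only 2

print(raid5_write_rate, raid10_write_rate)  # RAID 10 gets ~2x the small-write rate from the same arms
```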
33
The Storage Rack
Today
• 140 arms
• 4 TB
• 24 racks
• 24 storage processors (6+1 in rack)
• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports = 500 MBps IO
34
Storage Rack in 5 years?
• 140 arms
• 50 TB
• 24 racks
• 24 storage processors (6+1 in rack)
• Disks = 14 GBps IO
• Controllers = 5 GBps IO
• Ports = 1 GBps IO
• My suggestion: move the processors into the storage racks.
35
It’s hard to archive a PetaByte
It takes a LONG time to restore it.
• Store it in two (or more) places online (on disk?).
• Scrub it continuously (look for errors).
• On failure, refresh the lost copy from the safe copy.
• Can organize the two copies differently
  (e.g.: one by time, one by space)
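A minimal sketch of the continuous-scrub idea (the names, checksum choice, and layout are mine, not a described implementation): keep two online copies, walk them in the background, and repair a bad block from the other copy.

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

def scrub(copy_a: list, copy_b: list, known_good: list):
    """Walk both copies; refresh any block that no longer matches its recorded checksum."""
    for i, good in enumerate(known_good):
        if checksum(copy_a[i]) != good:      # copy A is damaged at block i
            copy_a[i] = copy_b[i]            # refresh from the safe copy
        if checksum(copy_b[i]) != good:      # and vice versa
            copy_b[i] = copy_a[i]

# Run scrub() continuously in the background; a full pass is bounded by the
# SCAN time of the disks, not by tape pick and seek times.
```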
36
Crazy Disk Ideas
• Disk Farm on a card: surface-mount disks
• Disk (magnetic store) on a chip
  (micro machines in silicon)
• Full apps (e.g. SAP, Exchange/Notes,…) in the disk controller
  (a processor with 128 MB DRAM)
Reference: Clayton M. Christensen, The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail. ISBN: 0875845851
37
The Disk Farm On a Card
• The 500 GB disc card
• An array of discs
• Can be used as
  – 100 discs
  – 1 striped disc
  – 50 fault-tolerant discs
  – ....etc
• LOTS of accesses/second & bandwidth
[Diagram: a 14" card]
38
Functionally Specialized Cards
• Storage
• Network
• Display
[Each card: an ASIC plus a P-mips processor and M MB of DRAM]
  – Today: P = 50 mips, M = 2 MB
  – In a few years: P = 200 mips, M = 64 MB
39
Data Gravity
Processing Moves to Transducers
• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition)
& cameras (vision)
– Storage: Data storage and analysis
40
It’s Already True of Printers
Peripheral = CyberBrick
• You buy a printer
• You get
  – several network interfaces
  – a Postscript engine
    • cpu,
    • memory,
    • software,
    • a spooler (soon)
  – and… a print engine.
41
Disks Become Supercomputers
• 100x in 10 years: 2 TB 3.5” drive
• Shrink to 1” is 200 GB
• Disk replaces tape?
• Disk is super computer!
42
All Device Controllers will be Cray 1’s
• TODAY
  – Disk controller is a 10 mips risc engine with 2 MB DRAM
  – NIC is similar power
• SOON
  – Will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation
  (can run Oracle on NT in the disk controller).
• Advantages
  – Uniform programming model
  – Great tools
  – Security
  – Economics (cyberbricks)
  – Move computation to data (minimize traffic)
[Diagram: Central Processor & Memory linked to smart devices by a Tera Byte Backplane]
43
With Tera Byte Interconnect
and Super Computer Adapters
• Processing is incidental to
  – Networking
  – Storage
  – UI
• Disk Controller/NIC is
  – faster than the device
  – close to the device
  – can borrow device package & power
• So use idle capacity for computation.
• Run the app in the device.
• Both Kim Keeton’s (UCB) and Erik Riedel’s (CMU) theses investigate this and show the benefits of this approach.
[Diagram: Tera Byte Backplane]
44
Implications
Conventional
• Offload device handling to NIC/HBA
• Higher-level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and cluster parallelism is important.
Radical
• Move the app to the NIC/device controller
• Higher-higher level protocols: CORBA / COM+.
• Cluster parallelism is VERY important.
[Diagram: Central Processor & Memory; Tera Byte Backplane]
45
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  – CORBA? COM+? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Diagram: on each node, Applications sit on RPC / streams / datagrams (?), over SIO, connected by a SAN]
46
Basic Argument for x-Disks
• Future disk controller is a super-computer.
– 1 bips processor
– 128 MB dram
– 100 GB disk plus one arm
• Connects to SAN via high-level protocols
– RPC, HTTP, DCOM, Kerberos, Directory Services,….
– Commands are RPCs
– management, security,….
– Services file/web/db/… requests
– Managed by general-purpose OS with good dev environment
• Move apps to disk to save data movement
– need programming environment in controller
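A minimal sketch of the idea that the controller speaks high-level requests ("give me this object") rather than "read sector N"; the names, framing, and HTTP choice are my assumptions, not the talk's design:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

OBJECTS = {"/hello.txt": b"hello from the disk\n"}   # stand-in for on-platter data

class DiskRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The "disk" resolves and serves the object itself; no block traffic
        # crosses the SAN, only the request and the result.
        body = OBJECTS.get(self.path)
        self.send_response(200 if body else 404)
        self.end_headers()
        if body:
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), DiskRequestHandler).serve_forever()
```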
47
The Slippery Slope
Nothing =
Sector Server
• If you add function to server
• Then you
add more function to server
• Function gravitates to data.
Everything =
App Server
48
Why Not a Sector Server?
(let’s get physical!)
• Good idea, that’s what we have today.
• But
  – cache added for performance
  – sector remap added for fault tolerance
  – error reporting and diagnostics added
  – SCSI commands (reserve,…) are growing
  – sharing is problematic (space mgmt, security,…)
• Slipping down the slope to a 2-D block server
49
Why Not a 1-D Block Server?
Put A LITTLE on the Disk Server
• Tried and true design
– HSC - VAX cluster
– EMC
– IBM Sysplex (3980?)
• But look inside
– Has a cache
– Has space management
– Has error reporting & management
– Has RAID 0, 1, 2, 3, 4, 5, 10, 50,…
– Has locking
– Has remote replication
– Has an OS
– Security is problematic
– Low-level interface moves too many bytes
50
Why Not a 2-D Block Server?
Put A LITTLE on the Disk Server
• Tried and true design
– Cedar -> NFS
– file server, cache, space,..
– Open file is many fewer msgs
• Grows to have
– Directories + Naming
– Authentication + access control
– RAID 0, 1, 2, 3, 4, 5, 10, 50,…
– Locking
– Backup/restore/admin
– Cooperative caching with client
• File Servers are a BIG hit: NetWare™
– SNAP! is my favorite today
51
Why Not a File Server?
Put a Little on the Disk Server
• Tried and true design
– Auspex, NetApp, ...
– Netware
• Yes, but look at NetWare
– File interface gives you app invocation interface
– Became an app server
• Mail, DB, Web,….
– Netware had a primitive OS
• Hard to program, so optimized wrong thing
52
Why Not Everything?
Allow Everything on Disk Server
(thin client’s)
• Tried and true design
– Mainframes, Minis, ...
– Web servers,…
– Encapsulates data
– Minimizes data moves
– Scaleable
• It is where everyone ends up.
• All the arguments against are short-term.
53
The Slippery Slope
Nothing =
Sector Server
• If you add function to server
• Then you
add more function to server
• Function gravitates to data.
Everything =
App Server
54
Outline
• The Surprise-Free Future (5 years)
– Astonishing hardware progress.
• Some consequences
– Absurd (?) consequences.
– Auto-manage storage
– Raid10 replaces Raid5
– Disc-packs
– Disk is the archive media of choice
• A surprising future?
– Disks (and other useful things) become supercomputers.
– Apps run “in the disk”
55