Data Centric Computing Jim Gray Yotta Zetta

Data Centric Computing
Jim Gray
Microsoft Research
Research.Microsoft.com/~Gray/talks
FAST 2002
Monterey, CA, 14 Oct 1999
Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
Put Everything
in Future (Disk) Controllers
(it’s not “if”, it’s “when?”)
Acknowledgements:
Dave Patterson explained this to me long ago
Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments
First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50x24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer &
accounting software
(tubes not transistors)
1.6 meters
10 years later
Disk Evolution
• Capacity: 100x in 10 years
  1 TB 3.5” drive in 2005
  20 GB as 1” micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is super computer!
Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
Disks are becoming computers
• Smart drives
• Camera with micro-drive
• Replay / Tivo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
Applications: Web, DBMS, Files
OS
Disk Ctlr + 1 GHz cpu + 1 GB RAM
Comm: Infiniband, Ethernet, radio…
Data Gravity: Processing Moves to Transducers
• smart displays, microphones, printers, NICs, disks
• Processing decentralized
  – Moving to data sources
  – Moving to power sources
  – Moving to sheet metal
• Storage, Network, and Display each get an ASIC
  – Today: P = 50 mips, M = 2 MB
  – In a few years: P = 500 mips, M = 256 MB
? The end of computers ?
It’s Already True of Printers
Peripheral = CyberBrick
• You buy a printer
• You get
  – several network interfaces
  – a PostScript engine
    • cpu,
    • memory,
    • software,
    • a spooler (soon)
  – and… a print engine.
The (absurd?) consequences of Moore’s Law
• 1 GB RAM chips
• MAD at 200 Gbpsi
• Drives shrink one quantum
• 10 GBps SANs are ubiquitous
• 1 bips cpus for 10$
• 10 bips cpus at high end
• 256 way NUMA?
• Huge main memories:
  now: 500 MB - 64 GB memories
  then: 10 GB - 1 TB memories
• Huge disks
  now: 20-200 GB 3.5” disks
  then: .1 - 1 TB disks
• Petabyte storage farms
  – (that you can’t back up or restore).
• Disks >> tapes
  – “Small” disks: one platter, one inch, 10 GB
• SAN convergence
  1 GBps point to point is easy
The Absurd Design?
• Further segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips
Processors ~ 1 Tips
RAM ~ 1 TB
Disks ~ 100 TB
What’s a Balanced System?
(40+ disk arms / cpu)
[Figure labels: System Bus, PCI Bus, PCI Bus]
Observations re TPC C, H systems
• More than ½ the hardware cost is in disks
• Most of the mips are in the disk controllers
• 20 mips/arm is enough for tpcC
• 50 mips/arm is enough for tpcH
• Need 128MB to 256MB/arm
• Ref:
  – Gray & Shenoy: “Rules of Thumb…”
  – Keeton, Riedel, Uysal, PhD thesis.
? The end of computers ?
When each disk has 1bips,
no need for ‘cpu’
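
A small sketch of the arithmetic behind this claim, using only the per-arm figures quoted on the TPC slide (20 mips/arm for tpcC, 50 mips/arm for tpcH). The function name `headroom` is made up for illustration; 1 bips is taken as 1,000 mips.

```python
# Per-arm processing requirements quoted on the TPC observations slide.
MIPS_PER_ARM = {"tpcC": 20, "tpcH": 50}


def headroom(disk_cpu_mips: float, workload: str) -> float:
    """How many times the disk's own processor exceeds the per-arm need."""
    return disk_cpu_mips / MIPS_PER_ARM[workload]


if __name__ == "__main__":
    one_bips = 1000  # 1 bips expressed in mips
    for workload in MIPS_PER_ARM:
        factor = headroom(one_bips, workload)
        print(f"{workload}: a 1 bips disk has {factor:.0f}x the required mips/arm")
```

Under these figures a 1 bips processor on each disk has 20x to 50x the cycles the benchmark needs per arm, which is the sense in which a separate ‘cpu’ becomes optional.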
Implications
Conventional
• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.
Radical
• Move app to NIC/device controller
• higher-higher level protocols: CORBA / COM+.
• Cluster parallelism is VERY important.
[Figure labels: Central Processor & Memory; Terabyte/s Backplane]
Interim Step: Shared Logic
• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2xGbpsEthernet
• General purpose OS (except NetApp)
• 10k$/TB to 50k$/TB
• Shared
  – Sheet metal
  – Power
  – Support/Config
  – Security
  – Network ports
Snap™: ~1 TB, 12x80GB NAS
NetApp™: ~.5 TB, 8x70GB NAS
Maxtor™: ~2 TB, 12x160GB NAS
Gordon Bell’s Seven Price Tiers
10$: wrist watch computers
100$: pocket/palm computers
1,000$: portable computers
10,000$: personal computers (desktop)
100,000$: departmental computers (closet)
1,000,000$: site computers (glass house)
10,000,000$: regional computers (glass castle)
Super-Server: Costs more than 100,000 $
“Mainframe”: Costs more than 1M$
Must be an array of processors, disks, tapes, comm ports
Bell’s Evolution of Computer Classes
Technology enables two evolutionary paths:
1. constant performance, decreasing cost
2. constant price, increasing performance
[Figure: Log Price vs Time for Mainframes (central), Minis (dep’t.), WSs, PCs (personals)]
1.26x/yr = 2x/3 yrs = 10x/decade; 1/1.26 = .8
1.6x/yr = 4x/3 yrs = 100x/decade; 1/1.6 = .62
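
The two growth rates can be checked by compounding them, as in the short sketch below. The helper name `compound` is made up for illustration; the rates and horizons are the ones on the slide.

```python
def compound(rate_per_year: float, years: int) -> float:
    """Total growth factor after compounding an annual rate for `years` years."""
    return rate_per_year ** years


if __name__ == "__main__":
    for rate in (1.26, 1.6):
        print(
            f"{rate}x/yr: {compound(rate, 3):.2f}x in 3 yrs, "
            f"{compound(rate, 10):.1f}x in 10 yrs, "
            f"price factor for constant performance {1 / rate:.2f}/yr"
        )
```

Compounding 1.26 gives roughly 2x in three years and 10x in a decade; compounding 1.6 gives roughly 4x and 110x, which the slide rounds to 100x.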
NAS vs SAN
• Network Attached Storage
  – File servers
  – Database servers
  – Application servers
  – (it’s a slippery slope: as Novell showed)
• Storage Area Network
  – A lower life form
  – Block server: get block / put block
  – Wrong abstraction level (too low level)
  – Security is VERY hard to understand.
    • (who can read that disk block?)
High level interfaces are better
SCSI and iSCSI are popular.
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  – WebServices/SOAP? CORBA? COM+? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Figure labels: two nodes, each with Applications, ?, RPC, streams, datagrams, SIO; joined by SAN]
The Slippery Slope
Nothing = Sector Server
• If you add function to server
• Then you add more function to server
• Function gravitates to data.
Everything = App Server
Why Not a Sector Server?
(let’s get physical!)
• Good idea, that’s what we have today.
• But
  – cache added for performance
  – Sector remap added for fault tolerance
  – error reporting and diagnostics added
  – SCSI commands (reserve, ...) are growing
  – Sharing problematic (space mgmt, security,…)
• Slipping down the slope to a 2-D block server
Why Not a 1-D Block Server?
Put A LITTLE on the Disk Server
• Tried and true design
  – HSC - VAX cluster
  – EMC
  – IBM Sysplex (3980?)
• But look inside
  – Has a cache
  – Has space management
  – Has error reporting & management
  – Has RAID 0, 1, 2, 3, 4, 5, 10, 50,…
  – Has locking
  – Has remote replication
  – Has an OS
  – Security is problematic
  – Low-level interface moves too many bytes
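
The last bullet (“Low-level interface moves too many bytes”) can be illustrated with a toy model. This is a sketch under stated assumptions, not a real storage API: `DiskServer`, `get_block`, `count_matching`, the 4096-byte block size, and the 8-byte reply are all invented for the example.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed block size, for illustration only


@dataclass
class DiskServer:
    """Toy disk server that tracks how many bytes it sends to clients."""
    blocks: list
    bytes_sent: int = 0

    def get_block(self, i: int) -> bytes:
        """Low-level interface: ship the whole block to the client."""
        self.bytes_sent += len(self.blocks[i])
        return self.blocks[i]

    def count_matching(self, needle: bytes) -> int:
        """Higher-level interface: scan next to the data, ship only the answer."""
        n = sum(block.count(needle) for block in self.blocks)
        self.bytes_sent += 8  # assume the count travels as one 8-byte integer
        return n


def client_side_count(server: DiskServer, needle: bytes) -> int:
    """Client pulls every block over the wire and scans it locally."""
    return sum(server.get_block(i).count(needle) for i in range(len(server.blocks)))


def make_blocks(n: int) -> list:
    """n blocks of filler bytes, each ending in one '!' marker."""
    return [b"x" * (BLOCK_SIZE - 1) + b"!" for _ in range(n)]


if __name__ == "__main__":
    low = DiskServer(make_blocks(100))
    high = DiskServer(make_blocks(100))
    print(client_side_count(low, b"!"), "matches,", low.bytes_sent, "bytes moved")
    print(high.count_matching(b"!"), "matches,", high.bytes_sent, "bytes moved")
```

Both paths return the same count, but the get-block path moves every byte of every block while the server-side path moves only the answer. That gap is the pull that drags function toward the data.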
Why Not a 2-D Block Server?
Put A LITTLE on the Disk Server
• Tried and true design
  – Cedar -> NFS
  – file server, cache, space,..
  – Open file is many fewer msgs
• Grows to have
  – Directories + Naming
  – Authentication + access control
  – RAID 0, 1, 2, 3, 4, 5, 10, 50,…
  – Locking
  – Backup/restore/admin
  – Cooperative caching with client
Why Not a File Server?
Put a Little on the 2-D Block Server
• Tried and true design
  – NetWare, Windows, Linux, NetApp, Cobalt, SNAP,... WebDAV
• Yes, but look at NetWare
  – File interface grew
  – Became an app server
    • Mail, DB, Web,….
  – NetWare had a primitive OS
    • Hard to program, so optimized wrong thing
Why Not Everything?
Allow Everything on Disk Server
(thin clients)
• Tried and true design
  – Mainframes, Minis, ...
  – Web servers,…
  – Encapsulates data
  – Minimizes data moves
  – Scalable
• It is where everyone ends up.
• All the arguments against are short-term.
The Slippery Slope
Nothing = Sector Server
• If you add function to server
• Then you add more function to server
• Function gravitates to data.
Everything = App Server
Disk = Node
• has magnetic storage (1TB?)
• has processor & DRAM
• has SAN attachment
• has execution environment
[Figure: software stack: Applications; Services; DBMS; RPC, ...; File System; SAN driver; Disk driver; OS Kernel]
Hardware
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320MB RAM, 3u high, 4 100GB IDE Drives
• $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB
• 3 weeks from ordering to operational
Slide courtesy of Brewster Kahle, @ Archive.org
Disk as Tape
• Tape is unreliable, specialized,
slow, low density, not improving fast,
and expensive
• Using removable hard drives to replace tape’s
function has been successful
• When a “tape” is needed, the drive is put in a
machine and it is online. No need to copy from
tape before it is used.
• Portable, durable, fast, media cost = raw tapes,
dense. Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle, @ Archive.org
Disk As Tape: What format?
• Today I send NTFS/SQL disks.
• But that is not a good format for Linux.
• Solution: Ship NFS/CIFS/ODBC servers (not disks)
• Plug “disk” into LAN.
  – DHCP then file or DB server via standard interface.
  – Web Service in long term
Some Questions
• Will the disk folks deliver?
• What is the product?
• How do I manage 1,000 nodes (disks)?
• How do I program 1,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?
Will the disk folks deliver? Maybe!
[Figure: Total Hard Drive Unit Shipments, units in millions (0 to 250), by year, 1986 to 2001. Source: DiskTrend/IDC]
Not a pretty picture (lately)
Most Disks are Personal
• 85% of disks are desktop/mobile (not SCSI)
• Personal media is AT LEAST 50% of the problem.
• How to manage your shoebox of:
  – Documents
  – Voicemail
  – Photos
  – Music
  – Videos
[Figure: one personal shoebox: 17.7 GB (by size); 27.1K files & 42K .msg (by number)]
  Music: 6.9 GB, 1.8K files, 180 CDs
  My Books: 98 MB
  Working: 2.3 GB, 432 folders, 2.9K files
  Archive: 5.1 GB, 477 folders, 18.7K files
  Mail: .7 GB, 43K msgs
  Video: 2.6 GB, 10 hours, low res
  File types: .doc/html, .gif, .jpg, .pdf, .tif, .xls, .ppt
What is the Product?
(see next section on media management)
• Concept: Plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• “PC”
• File server appliance
• Data archive/interchange appliance
• Web appliance
• Email appliance
• Application appliance
• Router appliance
[Figure labels: power, network]
Auto Manage Storage
• 1980 rule of thumb:
– A DataAdmin per 10GB, SysAdmin per mips
• 2000 rule of thumb
– A DataAdmin per 5TB
– SysAdmin per 100 clones (varies with app).
• Problem:
– 5TB is 50k$ today, 5k$ in a few years.
– Admin cost >> storage cost !!!!
• Challenge:
– Automate ALL storage admin tasks
How do I manage 1,000 nodes?
• You can’t manage 1,000 x (for any x).
• They manage themselves.
  – You manage exceptional exceptions.
• Auto Manage
  – Plug & Play hardware
  – Auto-load balance & placement storage & processing
  – Simple parallel programming model
  – Fault masking
• Some positive signs:
  – Few admins at
    Google: 10k nodes, 2 PB
    Yahoo!: ? nodes, 0.3 PB
    Hotmail: 10k nodes, 0.3 PB
How do I program 1,000 nodes?
• You can’t program 1,000 x (for any x).
• They program themselves.
– You write embarrassingly parallel programs
– Examples: SQL, Web, Google, Inktomi, HotMail,….
– PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto Parallelism is ESSENTIAL
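
A minimal sketch of what “embarrassingly parallel” means here: the same function runs on each data partition, and the partial results are merged afterward. Threads stand in for nodes, and the word-count task and names (`count_words`, `parallel_word_count`) are assumptions chosen for illustration, not something the talk prescribes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Work done independently on one node's partition of the data."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def parallel_word_count(partitions, max_workers=8):
    """Run count_words on every partition, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_words, partitions):
            total.update(partial)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    partitions = [["a b", "a"], ["b c"], []]
    print(parallel_word_count(partitions))
```

The programmer writes only the per-partition function and the merge; the fan-out is mechanical, which is why it can be automated.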
Plug & Play Software
• RPC is standardizing: (SOAP/HTTP, COM+, RMI/IIOP)
  – Gives huge TOOL LEVERAGE
  – Solves the hard problems:
    • naming,
    • security,
    • directory service,
    • operations,...
• Commoditized programming environments
  – FreeBSD, Linux, Solaris,…+ tools
  – NetWare + tools
  – WinCE, WinNT,…+ tools
  – JavaOS + tools
• Apps gravitate to data.
• General purpose OS on dedicated ctlr can run apps.
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1GBps it takes 12 days!
• Store it in two (or more) places online (on disk?).
A geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the two copies differently
(e.g.: one by time, one by space)
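
Two small sketches for this slide. The first reproduces the restore-time figure (1 PB at 1 GBps). The second shows one possible shape for “scrub continuously, refresh lost copy from safe copy” with two in-memory copies. The checksum scheme (SHA-256 per block) and the names `restore_days` and `scrub` are assumptions for illustration; the slide does not specify a mechanism.

```python
import hashlib


def restore_days(size_bytes: float, bytes_per_sec: float) -> float:
    """Days needed to stream size_bytes at a sustained bytes_per_sec."""
    return size_bytes / bytes_per_sec / 86_400


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scrub(primary: dict, replica: dict, checksums: dict) -> list:
    """Check each block in both copies; repair a bad copy from the good one.

    Returns the ids of blocks that were repaired. Blocks bad in both copies
    are left alone (a real system would raise an alarm for those).
    """
    repaired = []
    for block_id, expected in checksums.items():
        ok_primary = digest(primary.get(block_id, b"")) == expected
        ok_replica = digest(replica.get(block_id, b"")) == expected
        if ok_primary and not ok_replica:
            replica[block_id] = primary[block_id]
            repaired.append(block_id)
        elif ok_replica and not ok_primary:
            primary[block_id] = replica[block_id]
            repaired.append(block_id)
    return repaired


if __name__ == "__main__":
    print(f"1 PB at 1 GBps: {restore_days(1e15, 1e9):.1f} days")
    data = {0: b"aaa", 1: b"bbb"}
    sums = {k: digest(v) for k, v in data.items()}
    site_a, site_b = dict(data), dict(data)
    site_b[1] = b"bXb"  # simulate silent corruption at one site
    print("repaired blocks:", scrub(site_a, site_b, sums))
```

A petabyte at 1 GBps is 10^6 seconds, about 11.6 days, which the slide rounds to 12. That is the argument for keeping a second online copy rather than planning on a restore.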
CyberBricks
• Disks are becoming supercomputers.
• Each disk will be a file server then SOAP server
• Multi-disk bricks are transitional
• Long-term brick will have OS per disk.
• Systems will be built from bricks.
• There will also be
  – Network Bricks
  – Display Bricks
  – Camera Bricks
  – ….
Data Centric Computing
Communications Excitement!!
Immediate, Point-to-Point: conversation, money
Immediate, Broadcast: lecture, concert
Time Shifted, Point-to-Point: mail
Time Shifted, Broadcast: book, newspaper
Network + DataBase
It’s ALL going electronic
Information is being stored for analysis (so ALL database)
Analysis & Automatic Processing are being added
Slide borrowed from Craig Mundie
Information Excitement!
• But comm just carries information
• Real value added is
– information capture & render
speech, vision, graphics, animation, …
– information storage & retrieval
– information analysis
Information At Your Fingertips
• All information will be in an online database (somewhere)
• You might record everything you
– read: 10MB/day, 400 GB/lifetime (5 disks today)
– hear: 400MB/day, 16 TB/lifetime (2 disks/year today)
– see: 1MB/s, 40GB/day, 1.6 PB/lifetime (150 disks/year maybe someday)
• Data storage, organization, and analysis is the challenge.
• text, speech, sound, vision, graphics, spatial, time…
• Information at Your Fingertips
– Make it easy to capture
– Make it easy to store & organize & analyze
– Make it easy to present & access
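
The lifetime totals on this slide follow from the daily rates if a lifetime is taken as 40,000 days. That figure is an inference from the slide's own ratios (400 GB / 10 MB per day), not something stated on it. A sketch of the arithmetic, with hypothetical names:

```python
# Assumption: a "lifetime" of 40,000 days, inferred from the slide's ratios.
LIFETIME_DAYS = 40_000

# Daily capture rates from the slide, in megabytes per day.
RATES_MB_PER_DAY = {"read": 10, "hear": 400, "see": 40_000}


def lifetime_bytes(mb_per_day: int, days: int = LIFETIME_DAYS) -> int:
    """Total bytes captured over `days` at a steady daily rate."""
    return mb_per_day * 1_000_000 * days


if __name__ == "__main__":
    for activity, rate in RATES_MB_PER_DAY.items():
        total = lifetime_bytes(rate)
        print(f"{activity}: {rate} MB/day -> {total / 1e12:g} TB per lifetime")
```

This gives 0.4 TB, 16 TB, and 1,600 TB (1.6 PB), matching the slide's three totals.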
How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, anomaly detection are key technologies
See Mike Lesk: How much information is there:
http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information
http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale from Kilo to Yotta (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) marking A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded!]
24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
Why Put Everything in Cyberspace?
• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots
Point-to-Point OR Broadcast
Immediate OR Time Delayed
Locate, Process, Analyze, Summarize
Disk Storage Cheaper than Paper
• File Cabinet:
    cabinet (4 drawer)       250$
    paper (24,000 sheets)    250$
    space (2x3 @ 10$/ft2)    180$
    total                    700$
    3 ¢/sheet
• Disk:
    disk (160 GB)            300$
    ASCII: 100 m pages       0.0001 ¢/sheet (10,000x cheaper)
    Image: 1 m photos        0.03 ¢/sheet (100x cheaper)
• Store everything on disk
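
A sketch that recomputes the per-sheet costs from the slide's own totals (700$ for 24,000 sheets; 300$ for a 160 GB disk holding 100 m ASCII pages or 1 m photos). The helper name `cents_per_item` is made up for illustration.

```python
def cents_per_item(total_dollars: float, items: int) -> float:
    """Cost per stored item in cents."""
    return total_dollars * 100 / items


paper = cents_per_item(700, 24_000)            # file cabinet, per sheet
ascii_page = cents_per_item(300, 100_000_000)  # disk, per ASCII page
image = cents_per_item(300, 1_000_000)         # disk, per photo

if __name__ == "__main__":
    print(f"paper: {paper:.2f} cents/sheet")
    print(f"disk, ASCII: {ascii_page:.4f} cents/page ({paper / ascii_page:,.0f}x cheaper)")
    print(f"disk, image: {image:.2f} cents/photo ({paper / image:,.0f}x cheaper)")
```

From these inputs the ASCII case works out to about 0.0003 ¢ per page rather than the 0.0001 ¢ printed on the slide; the roughly 10,000x and 100x ratios hold either way.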
Gordon Bell’s MainBrain™
Digitize Everything
A BIG shoebox?
• Scans: 20 k “pages” tiff @ 300 dpi, 1 GB
• Music: 2 k “tracks”, 7 GB
• Photos: 13 k images, 2 GB
• Video: 10 hrs, 3 GB
• Docs: 3 k (ppt, word,..), 2 GB
• Mail: 50 k messages, 1 GB
Total: 16 GB
Gary Starkweather
• Scan EVERYTHING
• 400 dpi TIFF
• 70k “pages” ~ 14GB
• OCR all scans (98% OCR recognition accuracy)
• All indexed (5 second access to anything)
• All on his laptop.
• Q: What happens when the personal terabyte arrives?
• A: Things will run SLOWLY…. unless we add good software
Summary
• Disks will morph to appliances
• Main barriers to this happening
– Lack of Cool Apps
– Cost of Information management