Store Everything Online In A Database

Jim Gray
Microsoft Research
Gray@Microsoft.com
http://research.microsoft.com/~gray/talks
http://research.microsoft.com/~gray/talks/Gray_GriPhyN.ppt
Outline
•Store Everything
•Online (Disk not Tape)
•In a Database
How Much is Everything?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies
See Mike Lesk: How much information is there:
http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information
http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale of data sizes running Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, marked with A Photo, A Book, Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded!]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
Storage capacity beating Moore’s law
• Disk TB growth: 112%/y
• Moore's Law: 58.7%/y
• 3 k$/TB today (raw disk)
• 1k$/TB by end of 2002
Source: 1998 Disk Trend (Jim Porter)
http://www.disktrend.com/pdf/portrpkg.pdf.
[Chart: Disk TB Shipped per Year, 1988 to 2000, log scale from 1E+3 to 1E+7 with an ExaByte marker, plotted against the Moore's law trend]
Growth rates:
– Moore's law: 58.70% /year
– Revenue: 7.47%
– TB growth: 112.30% (since 1993)
– Price decline: 50.70% (since 1993)
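As a rough check on what these growth rates mean, a compound annual growth rate r implies a doubling time of ln 2 / ln(1 + r) years. The sketch below applies that formula to the two rates quoted on the slide; the function name is illustrative and not from the talk.

```python
import math

def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a compound annual growth rate.

    annual_growth is a fraction: 0.587 means 58.7% per year.
    """
    return math.log(2) / math.log(1 + annual_growth)

# Rates quoted on the slide.
moore_years = doubling_time_years(0.587)    # Moore's law, 58.7%/year
disk_years = doubling_time_years(1.123)     # disk TB shipped, 112.3%/year

print(f"Moore's law doubles every {moore_years * 12:.0f} months")
print(f"Disk TB shipped doubles every {disk_years * 12:.0f} months")
```

Under these rates, shipped disk capacity doubles in roughly 11 months against roughly 18 months for Moore's law.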
Outline
•Store Everything
•Online (Disk not Tape)
•In a Database
Online Data
• Can build 1PB of NAS disk for 5M$ today
• Can SCAN (read or write) entire PB in 3 hours.
• Operate it as a data pump: continuous sequential
scan
• Can deliver 1PB for 1M$ over Internet
– Access charge is 300$/Mbps bulk rate
• Need to Geoplex data (store it in two places).
• Need to filter/process data near the source,
– To minimize network costs.
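The 1M$ delivery figure can be reproduced from the access charge if the 300$/Mbps rate is read as a monthly charge. That reading is an assumption, as is the 30-day month; the slide does not state the billing period.

```python
PETABYTE_BITS = 8 * 10**15            # 1 PB = 10^15 bytes
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumed 30-day billing month
DOLLARS_PER_MBPS_MONTH = 300          # slide's bulk rate, assumed to be per month

def delivery_cost_dollars(bits: int) -> float:
    """Cost of moving `bits` over a link billed per Mbps-month."""
    bits_per_mbps_month = 10**6 * SECONDS_PER_MONTH
    mbps_months = bits / bits_per_mbps_month
    return mbps_months * DOLLARS_PER_MBPS_MONTH

cost = delivery_cost_dollars(PETABYTE_BITS)
print(f"Shipping 1 PB costs about {cost / 1e6:.2f} M$")
```

Under those assumptions the result lands just under 1M$, which is why filtering near the source matters: every byte not shipped is money saved.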
The “Absurd” Disk
[Figure: a 1 TB disk with 100 MB/s bandwidth and 200 Kaps]
• 2.5 hr scan time (poor sequential access)
• 1 access per second / 5 GB (VERY cold data)
• It’s a tape!
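The two ratios on this slide follow directly from the disk's capacity, bandwidth, and access rate. A minimal sketch, assuming "200 Kaps" means 200 accesses per second and using decimal units:

```python
CAPACITY_BYTES = 10**12         # 1 TB
BANDWIDTH_BYTES_PER_SEC = 10**8 # 100 MB/s sequential
ACCESSES_PER_SEC = 200          # assumed reading of "200 Kaps"

# Time to read the whole disk sequentially, in hours.
scan_hours = CAPACITY_BYTES / BANDWIDTH_BYTES_PER_SEC / 3600

# Capacity that has to share each access per second, in GB.
gb_per_access_per_sec = (CAPACITY_BYTES / 10**9) / ACCESSES_PER_SEC

print(f"Full scan: {scan_hours:.1f} hours")
print(f"1 access per second per {gb_per_access_per_sec:.0f} GB")
```

The scan works out to a little under three hours, in line with the slide's rounded 2.5 hr, and the access density is exactly one access per second per 5 GB.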
Disk vs Tape
Tape
– 40 GB
– 10 MBps
– 10 sec pick time
– 30-120 second seek time
– 2$/GB for media
– 8$/GB for drive+library
– 10 TB/rack
– 1 week scan
Disk
– 80 GB
– 35 MBps
– 5 ms seek time
– 3 ms rotate latency
– 3$/GB for drive
– 2$/GB for ctlrs/cabinet
– 15 TB/rack
– 1 hour scan
Guesstimates
– Cern: 200 TB
– 3480 tapes
– 2 col = 50GB
– Rack = 1 TB = 12 drives
The price advantage of disk is growing;
the performance advantage of disk is huge!
At 10K$/TB, disk is competitive with nearline tape.
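A quick check of the closing claim, summing the per-GB figures as they appear to be grouped on this slide (tape: media plus drive+library; disk: drive plus ctlrs/cabinet). The grouping is read off the two-column layout, and the dictionary names are illustrative.

```python
# Per-GB component costs in dollars, as grouped on the slide (assumed grouping).
tape_dollars_per_gb = {"media": 2, "drive+library": 8}
disk_dollars_per_gb = {"drive": 3, "ctlrs/cabinet": 2}

def dollars_per_tb(components: dict) -> int:
    """Total cost per TB, taking 1 TB = 1000 GB."""
    return sum(components.values()) * 1000

tape_cost = dollars_per_tb(tape_dollars_per_gb)
disk_cost = dollars_per_tb(disk_dollars_per_gb)

print(f"Tape: {tape_cost} $/TB")
print(f"Disk: {disk_cost} $/TB")
```

With this grouping, nearline tape comes to 10K$/TB, so disk at or below that price is competitive on cost before its performance advantage is counted.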
Building a Petabyte Disk Store
• Cadillac ~ 500k$/TB = 500M$/PB
plus FC switches plus… 800M$/PB
• TPC-C SANs (Brand PC 18GB/…) 60 M$/PB
• Brand PC local SCSI 20M$/PB
• Do it yourself ATA 5M$/PB
[Chart: bar chart, axis 0 to 500, with bars for Premium, SAN, Dell/3ware, DIY]
Cheap Storage and/or Balanced System
• Low cost storage
(2 x 3k$ servers) 5K$/TB
2x (800 MHz, 256 MB + 8x80GB disks + 100MbE)
raid5 costs 6K$/TB
• Balanced server (5k$/.64 TB)
– 2x800 MHz (2k$)
– 512 MB
– 8 x 80 GB drives (2K$)
– Gbps Ethernet + switch (300$/port)
– 9k$/TB 18K$/mirrored TB
Next step in the Evolution
• Disks become supercomputers
– Controller will have 1bips, 1 GB ram, 1 GBps net
– And a disk arm.
• Disks will run full-blown app/web/db/os stack
• Distributed computing
• Processors migrate to transducers.
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1GBps it takes 12 days!
• Store it in two (or more) places online (on disk?).
A geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the two copies differently
(e.g.: one by time, one by space)
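The 12-day figure and the scrub-and-repair loop can both be sketched in a few lines. This is a toy model under stated assumptions, not the talk's design: the `Site` class, the SHA-256 checksums, and the in-memory blocks are all hypothetical stand-ins for two disk farms.

```python
import hashlib

def restore_days(size_bytes: float, bytes_per_sec: float) -> float:
    """Days needed to copy size_bytes at a sustained rate."""
    return size_bytes / bytes_per_sec / 86400

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Site:
    """One online copy: block id -> (data, checksum recorded at write time)."""
    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data: bytes) -> None:
        self.blocks[block_id] = (data, checksum(data))

    def is_healthy(self, block_id) -> bool:
        entry = self.blocks.get(block_id)
        return entry is not None and checksum(entry[0]) == entry[1]

def scrub(primary: Site, replica: Site) -> list:
    """Verify every block at both sites and refresh a bad copy from the good one.

    Returns the ids of repaired blocks. Blocks bad at both sites are left alone.
    """
    repaired = []
    for block_id in sorted(set(primary.blocks) | set(replica.blocks)):
        ok_primary = primary.is_healthy(block_id)
        ok_replica = replica.is_healthy(block_id)
        if ok_primary and not ok_replica:
            replica.write(block_id, primary.blocks[block_id][0])
            repaired.append(block_id)
        elif ok_replica and not ok_primary:
            primary.write(block_id, replica.blocks[block_id][0])
            repaired.append(block_id)
    return repaired

site_a, site_b = Site(), Site()
for i, payload in enumerate([b"alpha", b"beta", b"gamma"]):
    site_a.write(i, payload)
    site_b.write(i, payload)

# Simulate silent corruption of one block at the second site.
site_b.blocks[1] = (b"bxta", site_b.blocks[1][1])

repaired = scrub(site_a, site_b)
days = restore_days(10**15, 10**9)   # 1 PB at 1 GBps
print("repaired blocks:", repaired)
print(f"full restore: {days:.1f} days")
```

Repairing block by block during a continuous scrub avoids the multi-day full restore that the arithmetic above implies.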
Outline
•Store Everything
•Online (Disk not Tape)
•In a Database
Why Not file = object + GREP?
• It works if you have thousands of objects
(and you know them all)
• But hard to search millions/billions/trillions
with GREP
• Hard to put all attributes in file name.
– Minimal metadata
• Hard to do chunking right.
• Hard to pivot on space/time/version/attributes.
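For contrast, here is a minimal sketch of what pivoting on attributes looks like once the metadata lives in a table rather than in file names. It uses Python's built-in sqlite3 module; the `objects` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE objects (
        id       INTEGER PRIMARY KEY,
        region   TEXT,      -- space
        observed TEXT,      -- time (ISO date)
        version  INTEGER,
        payload  BLOB
    )
""")
# An index lets the engine find rows by attribute without scanning every object.
conn.execute("CREATE INDEX idx_region_time ON objects (region, observed)")

rows = [
    ("north", "2001-01-05", 1, b"tile-a"),
    ("north", "2001-02-11", 2, b"tile-b"),
    ("south", "2001-01-20", 1, b"tile-c"),
    ("south", "2001-03-02", 2, b"tile-d"),
]
conn.executemany(
    "INSERT INTO objects (region, observed, version, payload) VALUES (?, ?, ?, ?)",
    rows,
)

# Pivot by space.
by_region = conn.execute(
    "SELECT region, COUNT(*) FROM objects GROUP BY region ORDER BY region"
).fetchall()

# Pivot by version and time with no change to how the data is stored.
recent_v2 = conn.execute(
    "SELECT region, observed FROM objects "
    "WHERE version = 2 AND observed >= '2001-02-01' ORDER BY observed"
).fetchall()

print(by_region)
print(recent_v2)
```

Each new question is one more query; with file names and GREP, each new pivot would need a new naming scheme or a full scan.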
The Reality:
it’s build vs buy
• If you use a file system
you will eventually build a database system:
– metadata,
– query,
– parallel ops,
– security,
– reorganize,
– recovery,
– distributed,
– replication, ….
OK: so I’ll put lots of objects in a file
Do It Yourself Database
• Good news:
– Your implementation will be
10x faster than the general purpose one,
easier to understand and use than the general purpose one.
• Bad news:
– It will cost 10x more to build and maintain
– Someday you will get bored maintaining/evolving it
– It will lack some killer features:
• Parallel search
• Self-describing via metadata
• SQL, XML, …
• Replication
• Online update – reorganization
• Chunking is problematic (what granularity, how to aggregate)
Top 10 reasons to put Everything in a DB
1. Someone else writes the million lines of code
2. Captures data and Metadata,
3. Standard interfaces give tools and quick learning
4. Allows Schema Evolution without breaking old apps
5. Index and Pivot on multiple attributes
space-time-attribute-version….
6. Parallel terabyte searches in seconds or minutes
7. Moves processing & search close to the disk arm
(moves fewer bytes: questons return datons).
8. Chunking is easier (can aggregate chunks at server).
9. Automatic geo-replication
10. Online update and reorganization.
11. Security
12. If you pick the right vendor, ten years from now, there will
be software that can read the data.
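The schema-evolution point can be illustrated with a small sketch: an old application's query keeps working after a column is added for a new one. This uses Python's built-in sqlite3 module; the `images` table, the `resolution_m` column, and the `old_app_names` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO images (name) VALUES ('tile-1')")

def old_app_names(db) -> list:
    """Query written before the schema changed; it names only the columns it needs."""
    return [row[0] for row in db.execute("SELECT name FROM images ORDER BY id")]

before = old_app_names(conn)

# Evolve the schema: add an attribute with a default for existing rows.
conn.execute("ALTER TABLE images ADD COLUMN resolution_m REAL DEFAULT 1.0")
conn.execute("INSERT INTO images (name, resolution_m) VALUES ('tile-2', 0.5)")

after = old_app_names(conn)          # the old query still runs unchanged
new_view = conn.execute(
    "SELECT name, resolution_m FROM images ORDER BY id"
).fetchall()

print(before, after)
print(new_view)
```

The old code never sees the new column, and the new code sees a sensible default on rows written before the change.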
DB Centric Examples
• TerraServer
– All images and all data in the database (chunked as small tiles).
– www.TerraServer.Microsoft.com/
– http://research.microsoft.com/~gray/Papers/MSR_TR_99_29_TerraServer.doc
• SkyServer & Virtual Sky
– Both image and semantic data in a relational store.
– Parallel search & NonProcedural access are important.
– http://research.microsoft.com/~gray/Papers/MS_TR_99_30_Sloan_Digital_Sky_Survey.doc
– http://dart.pha.jhu.edu/sdss/getMosaic.asp?Z=1&A=1&T=4&H=1&S=10&M=30
– http://virtualsky.org/servlet/Page?F=3&RA=16h+10m+1.0s&DE=%2B0d+42m+45s&T=4&P=12&S=10&X=5096&Y=4121&W=4&Z=1&tile.2.1.x=55&tile.2.1.y=20
OK… Why don’t they use our stuff?
• Wrong metaphor: HDF with hyper-slab is
better match.
• Impedance mismatch: getting stuff in/out of
DB is too hard
• We sold them OODBs and they did not
work (unreliable, poor performance, no
tools).
• …
So, why will the future be different?
• They have MUCH more data (10^8 files?)
• Java / C# eases impedance mismatch:
rowsets == ragged arrays.
• Tools are better
– Optimizers are better
– CPU and disk parallelism actually works now
– Statistical packages are better.
Outline
•Store Everything
•Online (Disk not Tape)
•In a Database
But… The title of the talk was…
“The Future of
Distributed Database Systems”
Nobody wants to share his database.
Blocks, files, and tables are the wrong abstraction for networks.
(too low level)
“Objects are the right abstraction”
So, UDDI / WSDL / SOAP is the solution (not SQL)
XML is the wire format, XLANG is the workflow protocol,
Query will be in there somewhere.
DDB technology GREAT in a Cluster
• Uniform architecture
• Trust among nodes
• High bandwidth-low latency communication
• Programs have single system image
• Queries run in parallel
• Global optimizer does query decomposition
But in a Distributed System
• Heterogeneous architecture makes query
planning much harder
• No trust
• Communication is slow and expensive
(minimize it).
• Higher level abstraction to minimize
round trips
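The round-trip point can be made concrete with a toy model. `RemoteCatalog` below is a hypothetical stand-in for a remote service: it counts calls instead of using a real network, and compares a low-level one-item interface with a higher-level batched one.

```python
class RemoteCatalog:
    """Stand-in for a remote service; each method call models one round trip."""

    def __init__(self, data: dict):
        self._data = data
        self.round_trips = 0

    def get(self, key):
        """Low-level interface: one item per round trip."""
        self.round_trips += 1
        return self._data[key]

    def get_many(self, keys):
        """Higher-level interface: one request carries the whole question."""
        self.round_trips += 1
        return {key: self._data[key] for key in keys}

catalog = RemoteCatalog({f"obj{i}": i * i for i in range(100)})
keys = [f"obj{i}" for i in range(100)]

# Chatty client: one call per object.
chatty = {key: catalog.get(key) for key in keys}
chatty_trips = catalog.round_trips

# Batched client: same answer, one call.
catalog.round_trips = 0
batched = catalog.get_many(keys)
batched_trips = catalog.round_trips

print(chatty_trips, batched_trips, chatty == batched)
```

When each round trip costs wide-area latency, the gap between 100 trips and 1 dominates everything else, which is the case for object-level or query-level interfaces over block or row fetches.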
DDB the Trust Issue
Client/Server Groceries
• Clerks serve Customers
• Take order, fill order, fill out
invoice, collect money.
• Overhead: staff, training,
rules,…
DDB Grocery
• Customers serve themselves
• Follow the rules posted on the door
• No Overhead, no staff!