Store Everything Online In A Database

Jim Gray, Microsoft Research
Gray@Microsoft.com
http://research.microsoft.com/~gray/talks
http://research.microsoft.com/~gray/talks/Science_Data_Centers.ppt

Outline
• Store Everything
• Online (Disk not Tape)
• In a Database

How Much is Everything?
• Soon everything can be recorded and indexed.
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies.
[Figure: the byte scale from kilo to yotta, placing a book, a photo, a movie, all LoC books (words), all books as multimedia, and everything ever recorded.]
See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/

Storage capacity beating Moore's law
• 3 k$/TB today (raw disk); 1 k$/TB by end of 2002.
• Moore's law: 58.7%/year.
• Disk TB growth: 112.3%/year (since 1993).
• Revenue growth: 7.47%/year; price decline: 50.7%/year (since 1993).
[Figure: disk TB shipped per year, 1988–2000, log scale from 1,000 TB to 10 million TB (an exabyte). Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

Outline
• Store Everything
• Online (Disk not Tape)
• In a Database

Online Data
• Can build 1 PB of NAS disk for 5 M$ today.
• Can SCAN (read or write) the entire PB in 3 hours.
• Operate it as a data pump: continuous sequential scan.
• Can deliver 1 PB for 1 M$ over the Internet
  – access charge is 300 $/Mbps bulk rate.
• Need to geoplex the data (store it in two places).
• Need to filter/process data near the source,
  – to minimize network costs.
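A quick back-of-the-envelope check of the 3-hour petabyte scan. The drive count and per-drive bandwidth below are assumptions (roughly a thousand commodity drives read in parallel at ~100 MB/s each), not figures from a specific product:

```python
# Back-of-the-envelope: how long does it take to scan a petabyte?
# Assumes ~1000 drives scanned in parallel at ~100 MB/s each
# (illustrative figures consistent with the talk's disk numbers).
PB = 10**15                      # one petabyte, in bytes
drives = 1000                    # ~1 TB per drive
seq_bw = 100 * 10**6             # ~100 MB/s sequential per drive

scan_seconds = PB / (drives * seq_bw)
scan_hours = scan_seconds / 3600
print(round(scan_hours, 1))      # prints 2.8 -- "scan a PB in 3 hours"

# The same petabyte read through a single drive is hopeless:
whole_pb_one_drive_days = PB / seq_bw / 86400   # about 116 days
```

The point of the "data pump" bullet falls out of the arithmetic: the scan is only fast because every disk arm streams sequentially at once.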
The "Absurd" Disk (1 TB, 100 MB/s, 200 Kaps)
• 2.5 hr scan time (poor sequential access)
• 1 access per second per 5 GB (VERY cold data)
• It's a tape!

Disk vs Tape (guesstimates)

               Tape                        Disk
  Capacity     40 GB                       80 GB
  Bandwidth    10 MBps                     35 MBps
  Latency      10 s pick + 30–120 s seek   5 ms seek + 3 ms rotate
  Cost         2 $/GB media,               3 $/GB drive,
               8 $/GB drive+library        2 $/GB ctlrs/cabinet
  Density      10 TB/rack                  15 TB/rack
  Scan time    1 week                      1 hour

(CERN: 200 TB; 3480 tapes, 2 columns = 50 GB; a rack of 12 drives = 1 TB.)
The price advantage of disk is growing; the performance advantage of disk is huge!
At 10 k$/TB, disk is competitive with nearline tape.

Building a Petabyte Disk Store
• Cadillac (EMC): ~500 k$/TB = 500 M$/PB, plus FC switches plus… ≈ 800 M$/PB
• TPC-C SANs (brand PC, 18 GB drives): 60 M$/PB
• Brand PC, local SCSI: 20 M$/PB
• Do it yourself, ATA: 5 M$/PB
[Chart: cost per PB for EMC SAN, Dell/3ware, and DIY configurations.]

Cheap Storage and/or Balanced System
• Low-cost storage (2 × 3 k$ servers): 5 k$/TB
  – 2 × (800 MHz, 256 MB RAM + 8 × 80 GB disks + 100 MbE)
  – RAID5 costs 6 k$/TB
• Balanced server (5 k$ per 0.64 TB): 9 k$/TB, 18 k$/mirrored TB
  – 2 × 800 MHz CPUs (2 k$)
  – 512 MB RAM
  – 8 × 80 GB drives (2 k$)
  – Gbps Ethernet + switch (300 $/port)

Next Step in the Evolution
• Disks become supercomputers
  – The controller will have 1 bips of processing, 1 GB of RAM, a 1 GBps network,
  – and a disk arm.
• Disks will run a full-blown app/web/db/OS stack.
• Distributed computing.
• Processors migrate to the transducers.

It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?):
  – a geo-plex.
• Scrub it continuously (look for errors).
• On failure,
  – use the other copy until the failure is repaired,
  – refresh the lost copy from the safe copy.
• Can organize the two copies differently (e.g., one by time, one by space).

Outline
• Store Everything
• Online (Disk not Tape)
• In a Database

Why Not file = object + GREP?
• It works if you have thousands of objects (and you know them all).
• But it is hard to search millions/billions/trillions of objects with GREP.
• Hard to put all attributes in the file name
  – minimal metadata.
• Hard to do chunking right.
• Hard to pivot on space/time/version/attributes.

The Reality: it's build vs buy
• If you use a file system, you will eventually build a database system:
  – metadata,
  – query,
  – parallel ops,
  – security,
  – reorganize,
  – recovery,
  – distributed,
  – replication, ….

OK: so I'll put lots of objects in a file
(a Do It Yourself Database)
• Good news:
  – Your implementation will be 10x faster than the general-purpose one,
  – and easier to understand and use than the general-purpose one.
• Bad news:
  – It will cost 10x more to build and maintain.
  – Someday you will get bored maintaining/evolving it.
  – It will lack some killer features:
    • parallel search
    • self-describing via metadata
    • SQL, XML, …
    • replication
    • online update and reorganization
    • chunking is problematic (what granularity? how to aggregate?)

Top 10 Reasons to Put Everything in a DB
1. Someone else writes the million lines of code.
2. Captures data and metadata.
3. Standard interfaces give tools and quick learning.
4. Allows schema evolution without breaking old apps.
5. Index and pivot on multiple attributes: space, time, attribute, version, ….
6. Parallel terabyte searches in seconds or minutes.
7. Moves processing & search close to the disk arm (moves fewer bytes: send questions, get back answers).
8. Chunking is easier (can aggregate chunks at the server).
9. Automatic geo-replication.
10. Online update and reorganization.
11. Security.
12. If you pick the right vendor, ten years from now there will be software that can read the data.

DB-Centric Examples
• TerraServer
  – All images and all data in the database (chunked as small tiles). www.TerraServer.Microsoft.com/
  – http://research.microsoft.com/~gray/Papers/MSR_TR_99_29_TerraServer.doc
• SkyServer & Virtual Sky
  – Both image and semantic data in a relational store.
  – Parallel search & non-procedural access are important.
  – http://research.microsoft.com/~gray/Papers/MS_TR_99_30_Sloan_Digital_Sky_Survey.doc
  – http://dart.pha.jhu.edu/sdss/getMosaic.asp?Z=1&A=1&T=4&H=1&S=10&M=30
  – http://virtualsky.org/servlet/Page?F=3&RA=16h+10m+1.0s&DE=%2B0d+42m+45s&T=4&P=12&S=10&X=5096&Y=4121&W=4&Z=1&tile.2.1.x=55&tile.2.1.y=20

Outline
• Store Everything
• Online (Disk not Tape)
• In a Database
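The TerraServer example above stores imagery "chunked as small tiles" in the database. A minimal sketch of that idea, using an in-memory SQLite table; the schema and names here are illustrative assumptions, not TerraServer's actual design:

```python
import sqlite3

# Sketch of tile-based chunking: imagery is cut into small fixed-size
# tiles stored as rows keyed by (zoom level, grid x, grid y), so any
# rectangular region becomes one indexed range query.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tile (
    level  INTEGER,          -- zoom / resolution level
    tx     INTEGER,          -- tile column in the grid
    ty     INTEGER,          -- tile row in the grid
    pixels BLOB,             -- the tile's image bytes
    PRIMARY KEY (level, tx, ty))""")

def put_tile(level, tx, ty, pixels):
    db.execute("INSERT INTO tile VALUES (?,?,?,?)", (level, tx, ty, pixels))

def get_region(level, x0, y0, x1, y1):
    """Fetch all tiles covering a rectangle -- a single indexed scan."""
    return db.execute(
        "SELECT tx, ty, pixels FROM tile "
        "WHERE level=? AND tx BETWEEN ? AND ? AND ty BETWEEN ? AND ?",
        (level, x0, x1, y0, y1)).fetchall()

# Demo: a 4x4 grid of dummy tiles at level 3.
for tx in range(4):
    for ty in range(4):
        put_tile(3, tx, ty, b"\x00" * 16)
print(len(get_region(3, 1, 1, 2, 2)))   # prints 4: tiles covering a 2x2 window
```

This is the "chunking is easier" point from the list above: the server picks the tile granularity, the index does the pivoting on space and resolution, and clients aggregate tiles into mosaics.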