Building Petabyte Data Stores
Jim Gray, Microsoft Research
research.microsoft.com/~Gray

The Asilomar Report on Database Research
Phil Bernstein, Michael Brodie, Stefano Ceri, David DeWitt, Mike Franklin, Hector Garcia-Molina, Jim Gray, Jerry Held, Joe Hellerstein, H. V. Jagadish, Michael Lesk, Dave Maier, Jeff Naughton, Hamid Pirahesh, Mike Stonebraker, and Jeff Ullman -- September 1998
"… the field needs to radically broaden its research focus to attack the issues of capturing, storing, analyzing, and presenting the vast array of online data."
"… broadening the definition of database management to embrace all the content of the Web and other online data stores, and rethinking our fundamental assumptions in light of technology shifts."
"… encouraging more speculative and long-range work, moving conferences to a poster format, and publishing all research literature on the Web."
http://research.microsoft.com/~gray/Asilomar_DB_98.html

So, How Are We Doing?
• Capture, store, analyze, and present terabytes?
• Making web data accessible?
• Publishing on the web (CoRR?)
• Posters and workshops vs. conferences and journals?

Outline
• Technology:
  – 1 M$/PB: store everything online (twice!)
• End-to-end high-speed networks
  – Gigabit to the desktop
• So: you can store everything, anywhere in the world, online everywhere
• Research driven by apps:
  – TerraServer
  – National Virtual Astronomy Observatory

How Much Information Is There?
• Soon everything can be recorded and indexed
• Most data will never be seen by humans
• Precious resource: human attention
• Auto-summarization and auto-search are the key technologies
[Figure: the information scale from kilo (10^3) through mega, giga, tera, peta, exa, and zetta to yotta (10^24). Roughly: a book is megabytes, a photo or movie is giga- to terabytes, all Library of Congress books (as words) are terabytes, all books in multimedia form are petabytes, and "everything recorded" approaches yottabytes. Source: www.lesk.com/mlesk/ksg97/ksg.html]

Trends: ops/s/$ Had Three Growth Phases
• 1890-1945: mechanical and relay, 7-year doubling
• 1945-1985: tube and transistor, 2.3-year doubling
• 1985-2000: microprocessor, 1.0-year doubling
[Chart: ops per second per dollar, 1880-2000, on a log scale from 1E-6 to 1E+9, showing doubling times of roughly 7.5, 2.3, and 1.0 years across the three eras.]

Storage Capacity Beating Moore's Law
• ~4 k$/TB today (raw disk)
• Disk TB shipped per year is growing 112%/year, versus a Moore's-law rate of 58.7%/year
  (1998 Disk Trend, Jim Porter, http://www.disktrend.com/pdf/portrpkg.pdf)
[Chart: disk TB shipped per year, 1988-2000, on a log scale from 1E+3 to 1E+7, heading toward an exabyte per year.]
  – Moore's law:     58.7%/year
  – Revenue growth:   7.47%/year
  – TB growth:      112.3%/year (since 1993)
  – Price decline:   50.7%/year (since 1993)

Cheap Storage and/or Balanced System
• Low-cost storage: two 3 k$ servers, about 6 k$/TB
  – 2 x (800 MHz CPU, 256 MB RAM, 8 x 80 GB disks, 100 MbE)
• Balanced server: 5 k$ per 0.64 TB
  – 2 x 800 MHz CPUs (2 k$)
  – 512 MB RAM
  – 8 x 80 GB drives (2.4 k$)
  – Gbps Ethernet + switch (500 $/port)
  – about 10 k$/TB, 20 k$/TB RAIDed

Hot Swap Drives for Archive or Data Interchange
• 35 MBps write, so N drives can write N x 80 GB in about 40 minutes
• 80 GB shipped overnight = ~3 MB/second per drive, at 19.95 $/night

The "Absurd" Disk
• 1 TB capacity, 100 MB/s transfer, 200 Kaps (KB-object accesses per second)
• 2.5-hour scan time (poor sequential access)
• 1 access per second per 5 GB (VERY cold data)
• It's a tape!
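A quick check of the "absurd disk" arithmetic, using only the numbers on the slide above:

$$ \frac{1\ \text{TB}}{100\ \text{MB/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{h}, \qquad \frac{1\ \text{TB}}{200\ \text{accesses/s}} = 5\ \text{GB per (access per second)}. $$

That is the roughly 2.5-hour scan time and the "1 access per second per 5 GB" figure: as capacity outruns bandwidth and seek rates, a big disk starts to behave like a tape.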
Disk vs Tape
• Disk:
  – 80 GB, 35 MBps
  – 5 ms seek time, 3 ms rotational latency
  – 4 $/GB for the drive, 3 $/GB for controllers/cabinet
  – 4 TB/rack
  – 1 hour scan
• Tape:
  – 40 GB, 10 MBps
  – 10 second pick time, 30-120 second seek time
  – 2 $/GB for media, 8 $/GB for drive + library
  – 10 TB/rack
  – 1 week scan
• (Figure notes, guesstimates: CERN: 200 TB on 3480 tapes; 2 columns = 50 GB; rack = 1 TB = 12 drives.)
• The price advantage of tape is narrowing, and the performance advantage of disk is growing.
• At 10 k$/TB, disk is competitive with nearline tape.

It's Hard to Archive a Petabyte
(It takes a LONG time to restore it.)
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?): a geo-plex
• Scrub it continuously (look for errors)
• On failure:
  – use the other copy until the failure is repaired,
  – refresh the lost copy from the safe copy.
• The two copies can be organized differently (e.g., one by time, one by space)

Next Step in the Evolution
• Disks become supercomputers
  – The controller will have 1 bips (a billion instructions per second), 1 GB of RAM, a 1 GBps network link,
  – and a disk arm.
• Disks will run a full-blown app/web/db/OS stack
• Distributed computing
• Processors migrate to the transducers

Terabyte (Petabyte) Processing Requires Parallelism
• Parallelism: use many little devices in parallel
• At 10 MB/s, scanning 1 terabyte takes 1.2 days
• 1,000-way parallel: a 100-second scan
• Use 100 processors and 1,000 disks

Parallelism Must Be Automatic
• There are thousands of MPI programmers.
• There are hundreds of millions of people using parallel database search.
• Parallel programming is HARD!
• Find design patterns and automate them.
• Data search/mining has parallel design patterns.

Gilder's Law: 3x Bandwidth/Year for 25 More Years
• Today:
  – 10 Gbps per channel
  – 4 channels per fiber: 40 Gbps
  – 32 fibers/bundle = 1.2 Tbps/bundle
• In the lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber
• 1 Tbps = the 1996 bisection bandwidth of the entire US WAN
• Aggregate bandwidth doubles every 8 months!

Sense of Scale
• How fat is your pipe? The fattest pipe on the Microsoft campus is the WAN!
  – 300 MBps: OC48 (= G2), or a memcpy()
  – 94 MBps: coast to coast
  – 90 MBps: PCI bus
  – 20 MBps: disk / ATM / OC3
[Map: the coast-to-coast HSCC (High Speed Connectivity Consortium, DARPA) path: Microsoft, the University of Washington, and the Pacific Northwest Gigapop in Redmond/Seattle, WA, over Qwest via San Francisco and New York to the Information Sciences Institute in Arlington, VA; 5,626 km, 10 hops.]

Outline
• Technology:
  – 1 M$/PB: store everything online (twice!)
• End-to-end high-speed networks
  – Gigabit to the desktop
• So: you can store everything, anywhere in the world, online everywhere
• Research driven by apps:
  – TerraServer
  – National Virtual Astronomy Observatory

Interesting Apps
• EOS/DIS
• TerraServer
• Sloan Digital Sky Survey
(Scale: kilo 10^3, mega 10^6, giga 10^9, tera 10^12, peta 10^15, exa 10^18; "today, we are here.")

The Challenge -- EOS/DIS
• Antarctica is melting -- 77% of the world's fresh water is liberated
  – sea level rises 70 meters
  – Chico and Memphis become beach-front property
  – New York, Washington, SF, LA, London, and Paris go under water
• Let's study it!

Mission to Planet Earth
• EOS: Earth Observing System (17 B$ => 10 B$)
  – 50 instruments on 10 satellites, 1999-2003
  – Landsat (added later)
• EOS DIS: Data Information System
  – 3-5 MB/s raw, 30-50 MB/s processed
  – 4 TB/day
  – 15 PB by year 2007

The Process Flow
• Data arrives and is pre-processed.
  – Instrument data is calibrated, gridded, and averaged
  – Geophysical data is derived
• Users ask for stored data OR ask to analyze and combine data.
• The pull-push split can be made dynamically.
[Diagram: incoming data feeds push processing; other data is pulled and processed on demand.]
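A back-of-envelope check, using only the figures quoted on these slides, that the EOS DIS ingest and archive numbers (and the earlier petabyte-restore time) hang together:

$$ 50\ \text{MB/s} \times 86{,}400\ \text{s/day} \approx 4.3\ \text{TB/day}, \qquad 4\ \text{TB/day} \times 365\ \text{days} \times 10\ \text{yr} \approx 15\ \text{PB}, \qquad \frac{1\ \text{PB}}{1\ \text{GB/s}} = 10^{6}\ \text{s} \approx 12\ \text{days}. $$

So the processed stream fills the archive to the projected 15 PB over roughly a decade, and restoring even one petabyte of it over a 1 GBps link takes weeks, which is the argument for the duplexed, continuously scrubbed geo-plex.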
Key Architecture Features
• 2+N data center design
• Scalable OR-DBMS
• Emphasize pull vs. push processing
• Storage hierarchy
• Data pump
• Just-in-time acquisition

2+N Data Center Design
• Duplex the archive (for fault tolerance)
• Let anyone build an extract (the +N)
• Partition data by time and by space (store 2 or 4 ways)
• Each partition is a free-standing OR-DBMS (similar to the Tandem and Teradata designs)
• Clients and partitions interact via standard protocols
  – HTTP+XML, the data pump

Data Pump
• Some queries require reading ALL the data (for reprocessing)
• Each data center scans ALL the data every 2 days
  – Data rate: 10 PB/day = 10 TB/node/day = 120 MB/s per node
• Compute small jobs on demand:
  – less than 100 M disk accesses
  – less than 100 TeraOps
  – (less than 30-minute response time)
• For BIG JOBS, scan the entire 15 PB database
• Queries (and extracts) "snoop" this data pump

Just-in-Time Acquisition
• Hardware prices decline 20%-40% per year
• So buy at the last moment
• Buy the best product that day: commodity
• Depreciate over 3 years so the facility stays fresh
  – (after 3 years, the cost is about 23% of the original)
[Chart: EOS DIS disk storage size and cost, 1994-2008, with curves for 30%-60% annual price declines. Assuming a 40% decline per year, the data need grows to about 2 PB at a cost of roughly 100 M$; with a 60% annual decline the cost peaks near 10 M$.]

Problems
• Management (and HSM)
• Design and metadata
• Ingest
• Data discovery, search, and analysis
• Auto-parallelism
• Reorganize / reprocess

What This System Taught Me
• Traditional storage metrics:
  – KAPS: KB objects accessed per second
  – $/GB: storage cost
• New metrics:
  – MAPS: megabyte objects accessed per second
  – SCANS: time to scan the archive
• Admin cost dominates (!!)
• Auto-parallelism is essential.

Outline
• Technology:
  – 1 M$/PB: store everything online (twice!)
• End-to-end high-speed networks
  – Gigabit to the desktop
• So: you can store everything, anywhere in the world, online everywhere
• Research driven by apps:
  – TerraServer
  – National Virtual Astronomy Observatory

Microsoft TerraServer: http://TerraServer.Microsoft.com/
• Build a multi-TB SQL Server database
• The data must be
  – about 1 TB
  – unencumbered
  – interesting to everyone everywhere
  – and not offensive to anyone anywhere
• The data:
  – 1.5 M place names from the Encarta World Atlas
  – 7 M sq km of USGS DOQ imagery (1-meter resolution)
  – 10 M sq km of USGS topo maps (2 m)
  – 1 M sq km from the Russian Space Agency (2 m)
• Loaded
• On the web (the world's largest atlas)
• Sell images with Commerce Server

Background
• The Earth's surface is about 500 tera-square-meters (Tm^2)
  – the USA is about 10 Tm^2
• About 100 Tm^2 of land lies between 70°N and 70°S
• We have pictures of about 9% of it
  – 7 Tm^2 from USGS
  – 1 Tm^2 from the Russian Space Agency
• Someday: a multi-spectral image of everywhere, once a day / hour
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks (200 x 200 pixels)
• Store the chunks in the database
• Navigate with the Encarta™ Atlas (globe, gazetteer)
• Image pyramid: 0.2 x 0.2 km^2 tiles, plus 0.4 x 0.4, 0.8 x 0.8, and 1.6 x 1.6 km^2 images
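The bullets above say the imagery is cut into roughly 10 KB, 200 x 200-pixel chunks, stored in the database, and addressed through an image pyramid. A minimal sketch of what such a tile table and its key might look like (hypothetical table, column names, and values; not the actual TerraServer schema):

-- Hypothetical tile store: one row per 200 x 200-pixel JPEG chunk (~10 KB).
-- Tiles are keyed by theme (photo / topo / relief), pyramid resolution
-- level, and the tile's X/Y position in that level's grid.
CREATE TABLE Tile (
    ThemeId    tinyint  NOT NULL,   -- e.g. 1 = aerial photo, 2 = topo map
    TileScale  tinyint  NOT NULL,   -- pyramid resolution level
    TileX      int      NOT NULL,   -- column in the tile grid at this level
    TileY      int      NOT NULL,   -- row in the tile grid at this level
    TileData   image    NOT NULL,   -- the ~10 KB JPEG blob
    CONSTRAINT PK_Tile PRIMARY KEY (ThemeId, TileScale, TileY, TileX)
);

-- Fetch a 3 x 3 block of tiles to paint a browser viewport
-- (hypothetical theme, scale, and grid coordinates).
SELECT TileX, TileY, TileData
FROM   Tile
WHERE  ThemeId = 1 AND TileScale = 12
  AND  TileX BETWEEN 4010 AND 4012
  AND  TileY BETWEEN 2250 AND 2252;

Keying tiles by (theme, level, Y, X) keeps one layer and zoom level clustered together, so painting a viewport touches only a handful of adjacent rows.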
TerraServer 4.0 Configuration
• 3 active database servers (Compaq ProLiant 8500s) plus a passive spare:
  – SQL\Inst1: topo and relief data
  – SQL\Inst2: aerial imagery
  – SQL\Inst3: aerial imagery
• Web servers: 8 two-processor "Photon" Compaq DL360s
• Logical volume structure:
  – One rack per database; all volumes triple-mirrored (3x)
  – Metadata on 15k-rpm 18.2 GB drives (MetaData volume: 101 GB)
  – Image data on 10k-rpm 72.8 GB drives (Image1-Image4: 339 GB each)
  – 2 spare volumes allocated per cluster
  – 6 additional 339 GB volumes (2 per DB server) to be added by year end
[Diagram: storage controllers and drive shelves (volume letters E through V) attached to the three active Compaq 8500 database servers and the passive server.]

Database size by file group:
  File group     Rows (millions)   Total size   Data size   Index size
  Admin                      1         0 GB       0.1 GB        0 GB
  Gazetteer                 17         5 GB         1 GB        3 GB
  Image                    254     2,237 GB     2,220 GB       17 GB
  Meta                     254        70 GB        53 GB       17 GB
  Search                    46        10 GB         5 GB        5 GB
  Grand total              572     2,322 GB     2,280 GB       42 GB

TerraServer 4.0 Schema
[Diagram: roughly 30 tables (Imagery, ImageMeta, SourceMeta, Image Source, Image Type, Pyramid, MediaFile, Media, NoImage, Place Name, AltPlace, State Name, AltState, Country Name, AltCountry, Famous Place, Famous Category, Feature Type, Small PlaceName, External Link, External Geo, External Group, Image Search, Search Job, Search Dest, Search Job Log, Scale Job, Load Job, JobQueue, JobSystem, TerraServer, Terra Database, …) grouped into Imagery, Gazetteer, Search, Admin, and Load Management areas.]

The BAD OLD Load Process
[Diagram: imagery arrived on DLT tape ("tar"), was staged through \Drop'N' and \Images shares on AlphaServer 4100s running the ImgCutter, with an Enterprise Storage Array, an STC DLT tape library, and racks of 9.1 GB and 4.3 GB drives, then loaded over a 100 Mbit Ethernet switch into an AlphaServer 8400 (DoJob, LoadMgr, wait-for-load, backup).]
• Load pipeline steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place, …

Load Process Today
[Diagram: image files are read from mounted media into a 450 GB staging area at the Executive Briefing Center in Redmond, WA (Terra Cutter and Terra Scale on a Compaq ProLiant 8500), scheduled through Active Server Pages and a loading/scheduling system, and pushed over the corporate network to the three 2 TB SQL Server databases (stored procedures) at the Internet Data Center in Tukwila, WA, with remote management via Terminal Server.]

After a Year
• 15 TB of raw data, 3 billion records
• 2.3 billion hits
• 2.0 billion DB queries
• 1.7 billion images sent (2 TB of downloads)
• 368 million page views
• 99.93% DB availability
• The 4th design is now online
• Built and operated by a team of 4 people
[Chart: TerraServer daily traffic, June 22, 1998 through June 22, 1999: sessions, hits, page views, DB queries, and images served per day, with peaks in the tens of millions.]
[Chart: up time versus down time (total hours versus hours:minutes of downtime), broken out by operations, scheduled maintenance, and hardware + software failures.]

TerraServer.Microsoft.NET: A Web Service
• Before .NET: a web browser fetches HTML pages and image tiles over the Internet from the TerraServer web site, which in turn queries the TerraServer SQL database.
• With .NET: an application program calls the TerraServer Web Service over the Internet, and the service queries the TerraServer SQL database. Methods include:
  – GetAreaByPoint, GetAreaByRect
  – GetPlaceListByName, GetPlaceListByRect
  – GetTileMetaByLonLatPt, GetTileMetaByTileId
  – GetTile
  – ConvertLonLatToNearestPlace, ConvertPlaceToLonLatPt
  – …
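Two quick numbers derived from the "After a Year" statistics above, just to give a sense of scale (these only re-derive the slide's own figures):

$$ (1 - 0.9993) \times 8{,}760\ \text{h/yr} \approx 6\ \text{hours of DB downtime per year}, \qquad \frac{2.3\times 10^{9}\ \text{hits/yr}}{3.15\times 10^{7}\ \text{s/yr}} \approx 73\ \text{hits per second on average}. $$

A 20 M-hit day, roughly the scale of the traffic chart's upper ticks, corresponds to over 200 hits per second.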
TerraServer Recent/Current Effort
• Added USGS topographic maps (4 TB)
• High availability (4-node cluster with failover)
• Integrated with Encarta Online
• The other 25% of the US DOQs (photos)
• Adding digital elevation maps
• Open architecture: publish SOAP interfaces
• Adding multi-layer maps (with UC Berkeley)
• Geo-spatial extension to SQL Server

Outline
• Technology:
  – 1 M$/PB: store everything online (twice!)
• End-to-end high-speed networks
  – Gigabit to the desktop
• So: you can store everything, anywhere in the world, online everywhere
• Research driven by apps:
  – TerraServer
  – National Virtual Astronomy Observatory

Astronomy Is Changing (and so are other sciences)
• Astronomers have a few PB of data
• It doubles every 2 years.
• Data is public after 2 years.
• So: everyone has half the data
• Some people have 5% more "private data"
• So it is a nearly level playing field:
  – most accessible data is public.

(inter)National Virtual Observatory
• Almost all astronomy datasets will be online
• Some are big (>>10 TB)
• The total is a few petabytes
• Bigger datasets are coming
• The data is "public"
• Scientists can mine these datasets
• The computer science challenge:
  – organize these datasets
  – provide easy access to them.

The Sloan Digital Sky Survey (slides by Alex Szalay)
• A project run by the Astrophysical Research Consortium (ARC):
  The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study
• Funding: the Sloan Foundation, NSF, DOE, NASA
• Goal: create a detailed multicolor map of the Northern Sky over 5 years, with a budget of approximately $80M
• Data size: 40 TB raw, 1 TB processed

Features of the SDSS
• Special 2.5 m telescope, located at Apache Point, NM
  – 3-degree field of view
  – zero-distortion focal plane
• Two surveys in one:
  – photometric survey in 5 bands
  – spectroscopic redshift survey
• Huge CCD mosaic
  – 30 CCDs, 2K x 2K (imaging)
  – 22 CCDs, 2K x 400 (astrometry)
• Two high-resolution spectrographs
  – 2 x 320 fibers, 3 arcsec diameter
  – R = 2000 resolution with 4096 pixels
  – spectral coverage from 3900 Å to 9200 Å
• Automated data reduction
  – over 70 man-years of development effort (Fermilab + collaboration scientists)
• Very high data volume
  – expect over 40 TB of raw data
  – about 3 TB of processed catalogs
  – data made available to the public

Scientific Motivation
• Create the ultimate map of the Universe: the Cosmic Genome Project!
• Study the distribution of galaxies:
  – What is the origin of fluctuations?
  – What is the topology of the distribution?
• Measure the global properties of the Universe:
  – How much dark matter is there?
• A local census of the galaxy population:
  – How did galaxies form?
• Find the most distant objects in the Universe:
  – What are the highest quasar redshifts?

Cosmology Primer
• The Universe is expanding: the galaxies move away from us, and their spectral lines are redshifted.
• The fate of the universe depends on the balance between gravity and the expansion velocity:
  – Hubble's law: v = H0 · r
  – Ω = density / critical density; if Ω < 1, the Universe expands forever
• Most of the mass in the Universe is dark matter, and it may be cold (CDM).
• The spatial distribution of galaxies is correlated, due to small ripples in the early Universe: P(k), the power spectrum.

The 'Naught' Problem
• What are the global parameters of the Universe?
  – H0, the Hubble constant:         55-75 km/s/Mpc
  – Ω0, the density parameter:       0.25-1
  – Λ0, the cosmological constant:   0-0.7
• Their values are still quite uncertain today…
• Goal: measure these parameters with an accuracy of a few percent. High-precision cosmology!
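To make Hubble's law concrete, a purely illustrative example; the numbers are assumptions, not SDSS results: H0 is taken as 70 km/s/Mpc (inside the 55-75 range above), and the low-redshift approximation v ≈ cz is used.

$$ z = 0.1 \;\Rightarrow\; v \approx c\,z \approx 30{,}000\ \text{km/s}, \qquad r \approx \frac{v}{H_0} \approx \frac{30{,}000\ \text{km/s}}{70\ \text{km/s/Mpc}} \approx 430\ \text{Mpc}. $$

This is the sense in which the spectroscopic survey's measured redshifts become distances, and hence a three-dimensional map.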
The Cosmic Genome Project
• The SDSS will create the ultimate map of the Universe, with much more detail than any measurement before it.
[Figure: redshift-survey slices compared: Gregory and Thompson 1978; deLapparent, Geller, and Huchra 1986; daCosta et al. 1995; SDSS Collaboration 2002.]

Area and Size of Redshift Surveys
[Chart: number of objects (10^3 to 10^9) versus survey volume in Mpc^3 (10^4 to 10^11) for past and planned surveys: CfA+SSRS, SAPM, QDOT, LCRS, 2dF and 2dFR, and the SDSS samples (main, red, absorption-line, photo-z) in the upper right.]

The Spectroscopic Survey
• Measure the redshifts of objects => distance
• SDSS redshift survey:
  – 1 million galaxies
  – 100,000 quasars
  – 100,000 stars
• Two high-throughput spectrographs
  – spectral range 3900-9200 Å
  – 640 spectra simultaneously
  – R = 2000 resolution
• Automated reduction of spectra
• Very high sampling density and completeness
• Objects in other catalogs are also targeted

First Light Images
• Telescope: first light May 9th, 1998
• Equatorial scans

The First Stripes
• Camera: 5-color imaging of >100 square degrees
• Multiple scans across the same fields
• Photometric limits as expected
• (Image: NGC 6070)

The First Quasars
• Three of the four highest-redshift quasars have been found in the first SDSS test data!

SDSS Data Products
  Product                                            Count            Size
  Object catalog (measured parameters)               >10^8 objects    400 GB
  Redshift catalog (measured parameters)             10^6 objects       2 GB
  Atlas images (5-color cutouts)                     >10^9 objects    1.5 TB
  Spectra (in one-dimensional form)                  10^6              60 GB
  Derived catalogs (clusters, QSO absorption lines)                    60 GB
  4x4-pixel all-sky map (heavily compressed)         5 x 10^5           1 TB
• All raw data is saved in a tape vault at Fermilab

Parallel Query Implementation
• Getting 200 MBps/node through SQL today
• = 4 GB/s on a 20-node cluster
[Diagram: a user interface and analysis engine drive a master SX engine over a DBMS federation; slave DBMS nodes, each with RAID storage, execute the query in parallel.]

Who Will Be Using the Archive?
• Power users
  – sophisticated, with lots of resources
  – research is centered around the archive data
  – a moderate number of very intensive queries
  – mostly statistical, with large output sizes
• The general astronomy public
  – frequent but casual lookup of objects/regions
  – the archive helps their research but is not central to it
  – a large number of small queries
  – a lot of cross-identification requests
• The wide public
  – browsing a 'Virtual Telescope' can have large public appeal
  – needs special packaging
  – could be a very large number of requests

How Will the Data Be Analyzed?
• The data are inherently multidimensional
  => positions, colors, size, redshift
• Improved classifications result in complex N-dimensional volumes
  => complex constraints, not ranges
• Spatial relations will be investigated
  => nearest neighbors
  => other objects within a radius
• Data mining: finding the 'needle in the haystack'
  => separate the typical from the rare
  => recognize patterns in the data
• Output sizes can be prohibitively large for intermediate files
  => import output directly into analysis tools

Different Kind of Spatial Data
• All objects lie on the surface of the celestial sphere
  – Position a point by 2 spherical angles (RA, Dec)
  – Position by Cartesian {x, y, z} makes it easier to search 'within 1 arc-minute'
• A hierarchy of spherical triangles is used for indexing
  – the SDSS tree is 5 levels deep: 8192 triangles
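The SDSS work uses the Hierarchical Triangular Mesh for spatial indexing, but the "Cartesian {x, y, z}" bullet above is easy to illustrate on its own: if each object stores its unit vector, a "within 1 arc-minute" search becomes a simple dot-product predicate. The sketch below uses hypothetical table and column names (and an arbitrary search center), not the actual SDSS schema:

-- Hypothetical object table carrying both spherical coordinates
-- (degrees) and the corresponding unit vector.
CREATE TABLE SkyObject (
    objId  bigint  NOT NULL PRIMARY KEY,
    ra     float   NOT NULL,   -- right ascension, degrees
    decl   float   NOT NULL,   -- declination, degrees
    cx     float   NOT NULL,   -- cos(decl) * cos(ra)
    cy     float   NOT NULL,   -- cos(decl) * sin(ra)
    cz     float   NOT NULL    -- sin(decl)
);

-- Everything within 1 arc-minute of an (arbitrary) search center:
-- two unit vectors are within angle theta of each other exactly when
-- their dot product exceeds cos(theta).
DECLARE @ra float, @dec float, @x float, @y float, @z float, @cosTheta float;
SET @ra  = 185.0;                          -- hypothetical center, degrees
SET @dec = 2.5;
SET @x = COS(RADIANS(@dec)) * COS(RADIANS(@ra));
SET @y = COS(RADIANS(@dec)) * SIN(RADIANS(@ra));
SET @z = SIN(RADIANS(@dec));
SET @cosTheta = COS(RADIANS(1.0 / 60.0));  -- 1 arc-minute

SELECT objId, ra, decl
FROM   SkyObject
WHERE  cx * @x + cy * @y + cz * @z > @cosTheta;

Per row this costs only three multiplies, two adds, and a compare; in practice a spatial index such as the HTM triangle id would be used to avoid scanning the whole table.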
Experiment with Relational DBMS
• See if SQL's good indexing and scanning compensates for poor object support.
• Leverage fast/big/cheap commodity hardware.
• Ported a 40 GB sample database (from an SDSS sample scan) to SQL Server 2000
• Building a public web site and data server

20 Astronomy Queries
• Implemented a spatial access extension to SQL (HTM)
• Implemented the 20 astronomy queries in SQL (see the paper for details)
• 15 M rows, 378 columns, 30 GB; it can be scanned in 8 minutes (disk-IO limited)
• Many queries run in seconds
• Create covering indexes on the queried columns
• Create a 'Neighbors' table listing objects within 1 arc-minute (5 neighbors on average) for spatial joins
• Install some more disks!

Query to Find Gravitational Lenses
• Find all objects within 1 arc-minute of each other whose colors are very similar (the u-g, g-r, and r-i colors agree to within 0.05 magnitudes).

select count(*)
from sxTag T, sxTag U, neighbors N
where T.UObj_id = N.UObj_id
  and U.UObj_id = N.neighbor_UObj_id
  and N.UObj_id < N.neighbor_UObj_id            -- no duplicate pairs
  and T.u>0 and T.g>0 and T.r>0 and T.i>0
  and U.u>0 and U.g>0 and U.r>0 and U.i>0
  and ABS((T.u-T.g)-(U.u-U.g)) < 0.05           -- similar colors
  and ABS((T.g-T.r)-(U.g-U.r)) < 0.05
  and ABS((T.r-T.i)-(U.r-U.i)) < 0.05

• Finds 5,223 objects; executes in 6 minutes.

SQL Results So Far
• Have run 17 of the 20 queries so far.
• Most queries are IO-bound, scanning at 80 MB/sec on 4 disks in 6 minutes (at the PCI bus limit).
• Covering indexes reduce execution to under 30 seconds.
• It is common to want grid distributions, e.g.:

select convert(int, ra*30)/30.0,     -- RA bucket (1/30 degree = 2 arcmin)
       convert(int, dec*30)/30.0,    -- Dec bucket
       count(*)                      -- bucket count
from Galaxies
where (u-g) > 1 and r < 21.5
group by convert(int, ra*30)/30.0, convert(int, dec*30)/30.0

Distribution of Galaxies
[Chart: galaxy density in 2-arcmin cells (counts per cell, up to about 30) over a patch of sky, RA roughly 212° to 217°, Dec roughly -1.2° to +1.2°.]

Outline
• Technology:
  – 1 M$/PB: store everything online (twice!)
• End-to-end high-speed networks
  – Gigabit to the desktop
• So: you can store everything, anywhere in the world, online everywhere
• Research driven by apps:
  – TerraServer
  – National Virtual Astronomy Observatory
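As a closing sketch of the two tuning ideas mentioned on the "20 Astronomy Queries" and "SQL Results" slides (covering indexes and the precomputed 'Neighbors' table): the DDL below is only illustrative; the column choices are hypothetical, modeled on the sxTag/neighbors names used in the lens query rather than taken from the project's actual schema.

-- Covering-index sketch: the lens query touches only the object id and the
-- u, g, r, i magnitudes of sxTag, so an index carrying exactly those columns
-- can answer it without touching the wide (~378-column) base table.
CREATE INDEX ix_sxTag_colors ON sxTag (UObj_id, u, g, r, i);

-- 'Neighbors' table sketch: precompute every pair of objects within
-- 1 arc-minute (about 5 neighbors per object on average, per the slide),
-- so spatial self-joins become ordinary equijoins.
CREATE TABLE neighbors (
    UObj_id           bigint  NOT NULL,   -- an object
    neighbor_UObj_id  bigint  NOT NULL,   -- another object within 1 arc-minute
    distance_arcmin   float   NULL,       -- optional: the angular separation
    CONSTRAINT PK_neighbors PRIMARY KEY (UObj_id, neighbor_UObj_id)
);

With an index covering the queried columns, the deck reports execution dropping from roughly 6-minute scans to under 30 seconds, and the neighbors table turns the "within 1 arc-minute" condition into a plain join.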