SPIN-2 Tom Barclay Jim Gray, Don Slutz, Greg Smith, many others Microsoft Research Scaleup - Big Database Build a 1 TB SQL Server database Data must be – – – – 1 TB Unencumbered Interesting to everyone everywhere And not offensive to anyone anywhere Loaded – 1.1 M place names from Encarta World Atlas – 1 M Sq Km from USGS (1 meter resolution) – 2 M Sq Km from Russian Space agency (2 m) Will be on web (world’s largest atlas) Sell images with commerce server. What’s a Terabyte? 1 Terabyte 1,000,000,000 business letters 100,000,000 book pages 50,000,000 FAX images 10,000,000 TV pictures (mpeg) 4,000 LandSat images 150 miles of book shelf 15 miles of book shelf 7 miles of book shelf 10 days of video 16 earth images (100m) Library of Congress (in ASCII) is 25 TB 1980: 200 M$ of disc 5 M$ of tape silo 1998: 100 k$ of magnetic disc 50 K$ nearline tape Terror Byte !! 10,000 discs 10,000 tapes 60 discs 30 tapes 3 Some Other Terror-Byte Databases TerraServer Sloan Digital Sky Survey: – 40 TB raw, 2 TB cooked – EOS/DIS (picture of planet each week) – 15 PB by 2007 Federal Mega Giga Reserve Clearing house: images of checks Tera – 15 PB by 2006 (7 year history) Nuclear Kilo Stockpile Stewardship Program – 10 Exabytes (???!!) Peta Exa Zetta Yotta TerraServer is: “A shameless advertisement of WNT and SQL Server Scalability” An on-line demo and sales tool directed at IT customers and ISVs A test of the Sphinx VLDB features: – Load performance – Online Backup/Restore – Query Performance A “cool 90s app” – Image and Text data – Web-lication – Electronic Commerce Application Requirements BIG —1 TB of data. PUBLIC — available on the world wide web. INTERESTING — to a wide audience ACCESSIBLE — using standard browsers (IE, Netscape) REAL — a real application (users can buy imagery) FREE —cannot require NDA or money to access FAST — impress customers for BackOffice, StorageWorks EASY — Inexpensive to develop, deploy, and maintain Project Partners Motivation Demo scope & quality of Spin-2 imagery Open new markets SPIN-2 for imagery sales Demo DEC Alpha & StorageWorks™ Scalability Recognized as superior h/w vendor Distribute DOQs to a wider audience Lower cost of distribution Demo Scalability of NT & SQL Server Database & App UI Coverage: Range from 70ºN to 70ºS 35% U.S., 1% outside U.S. Source Imagery: – 3.5 TB 1sq meter/pixel Aerial (USGS 60,000 46Mb B&W- 151Mb Color IR files) – 700 GB 1.56 meter/pixelSatellite (Spin-2 2400 300 Mb B&W) Display Imagery: 80 m 225 x 150 pixel images, 1.6 m x 3 sub-sampled views Nav Tools: – 1.5 m place names – “Click-on” Coverage map – Expedia & Virtual Globe map Concept: User navigates an ‘almost seamless’ image of earth 225x150m tile 1.8x1.2km 8m browse 1.8x1.2km 16m thumbnail 1.8x1.2km 32m “city view” TerraServer Demo Intranet Beta Sites: – http://terraweb1 – http://terraweb2 Internal Beta Schedule – Mon April 27 - June 23 What Microsoft & DEC Contribute Microsoft’s contribution: – – – – Build an “internet UI” Design the app and the database Slice & Dice & Load the data. Build “electronic stores” for USGS’ for Aerial Images to operate to sell & distribute images – Run a “robust”web site 18 months Digital contribution: – Provide high-performance processors – provide high capacity, reliable storage. – Provide technical advice World’s Largest PC! – 324 disks (2.4 TB) – 8 x 440 mhz Alpha CPU – 10 GB RAM Site Configuration StorageTek Enterprise Storage Array Alpha 8400 9 HSZ70 Ultra-SCSI Dual redundant Controllers (8x440) 10GB Ram 324 9.1 Seagate Disks 6 DLT7000 Quantum Drives FWD SCSI Compaq 5500 4x200mhz Web Servers Compaq 5500 4x200mhz Web Servers To the Web Software Web Client Image Server Active Server Pages Internet Information Server 4.0 Java Viewer broswer MTS Terra-Server Stored Procedures HTML The Internet Internet Info Server 4.0 Sphinx (SQL Server) Microsoft Automap ActiveX Server Terra-Server DB Automap Server Terra-Server Web Site Internet Information Server 4.0 Microsoft Site Server EE Image Delivery SQL Server Application 7 Image Provider Site(s) “Chopped” How We Did It big images into small “tiles” – Sub-sampled tiles to create zoom levels – Tile sizes map to Lat/Lon system – Unique ID assigned to each Tile location (Z-transform of lat/long or UTM) – Unique ID clusters adjacent tiles onto the same database & index pages Wrote – – – – Load Management program Runs image cutting job Loads meta and image data into SQL Multiple Loaders can run in parallel Web Active Server Page controls load process USGS Editing Process 1 “QUAD” DOQ Photo (3.75’ x 3.75’) 1 2 3 4 7 5 8 6 9 10 11 12 13 16 14 15 17 18 Quad Cut 3x6 Jump, Thumb-nails & Browse Images 1 Degree Latitude 1 Quadrangle (7.5’ x 7.5’) 1 9 DOQQ Origin Point 1 Degree Longitude 8 64 DOQ Tiles Spin-2 Image Editing Process 48 x 96 cells per sq degree Image aligned to left corner of grid system Non-image squares (all white) are discarded Cut Images are extracted SubSample Jump 32m Thumb 16m 8m Tiles are cut 5x5, scrambled output Jpeg Browse Spin-2 Meta Data Semi-colon delimited fields, ASCII encoding 1 records per line 1Field File name (of image) City1 State1 Country Number of Rows Number of Columns Shooting Height Height of Sun Date of survey (mm/dd/yyyy) Time of survey (GMT) (hr:mn:ss) Upper Left Latitude Upper Left Longitude Lower Right Latitude Lower Right Longitude Camera System1 Pixel size1 Copyright1 is not required, if not present, then a blank field is present Database Design and Load Build a 1 TB (2**40B) SQL Server Database Database includes – Gazetteer data for searching – Image data pyramid and metadata Load the Database – – – – – Chop the big images into tiles BCP data and metadata in Allow for restart and undo of loads Create indexes Check consistency of the data Keep it Simple, no Tricks, Test the Scaling The Image Pyramid Zooming in on the Washington Monument 1:1 1:1 64:1 Jump image 1 pixel = 32x32 m2 Dithered Browse image 1 pixel = 16x16 m2 Dithered Thumb image 1 pixel = 8x8 m2 USGS Tile image DOQ of Washington Monument 1 pixel = 1 sq meter ‘Logical’ Schema Country PlaceType State Place Image Data & Meta Data Lat/Long (U/ZGridId) Theme Meta Information TileLog ImgMeta TileMeta FeatureType Gazetteer Star schema Index on • image, place, type • image, state, type • image, state, country, type • image, place, state, type • image, place, country, type all lookups are fast Jump Img BrowseImg Thumb Img TileImg Lookup by UGrid or ZGrid ID plus resolution Lookups are fast. Indices are in DRAM (auto-magically by SQL) SQL manages all the tiles and indices Images are brought in on demand Gazetteer Design Classic Snowflake Schema Top 10 Hint to RE for Cursor Select PlaceGrid Place CountrySearch AlternateName CountryID GazSrcID 1148 Country CountryID CountryName UNcode 264 StateSerach AlternateName CountryID StateID FreatureID GazSrcID State StateID CountryID StateName 1083 PlaceID ImageFlag AlternateName Name CountryID StateID TypeID GazSourcID Latitude Longitude UGridID ZGridID DOQdate SPIN2date 3776 1,089,897 ZGridID BestPlaceName XDistance YDistrance 50,000,000 FeatureType TypeID Description 13 GazetteerSource GazSrcID Description 1 Image Data Design Image pyramid stored in DBMS (250 M recs) OriginalMetaData ImageMeta OrigMetaID SrcID ImageSource Agency SourcePhotoID SourcePhotoDate SourceDEMDate MetaDataDate ProductionSystem ProductionDate DataFileSize Compression HeaderBytes … 80 other fields ImgMetaID OrigMetaID ImgStatus ImgDate ImgTypeID JumpPixHeight JumpPixWidth BrowsePixHeight BrowsePixWidth ThumbPixWidth ThumbPixHeight CutCol CutRow MidLat MidLong NELat NELong NWLat NWLong SELat SELong SWLat SWLong UGridID UTMZone XUtmID YUtmID XGridID YGridID ZGridID 650 k SPIN2 2 M USGS ImgSource ImgType ImgTypeID ImgFileDesc ImgFileExt MimeStr SrcID SrcName SrcTblName SrcDescription GridSysID ImgTypeID 4 2 650 k SPIN2 2 M USGS Pick Log UGridHits Name Description Link PickDate URL Time <extensive list of action parameters URL UGridID ZTileGridID count 10 TileMeta xxx xxx Jump Browse Thumb Tile UGridID ZGridID ZTileGridID ImgData ImgDate ImgTypeID ImgMetaID SrcID EncryptKey File Name UGridID ZGridID ZTileGridID ImgData ImgDate ImgTypeID ImgMetaID SrcID EncryptKey File Name UGridID ZGridID ZTileGridID ImgData ImgDate ImgTypeID ImgMetaID SrcID EncryptKey File Name UGridID ZGridID ZTileGridID ImgData 1 ImgDate ImgTypeID ImgMetaID SrcID EncryptKey File Name .65 M SPIN2 1.5 M USGS .65 M SPIN2 1.5 M USGS .65 M SPIN2 1.5 M USGS 16 M SPIN2 96 M USGS ImgMetaID OrigMetaID SrcID ImgStatus ImgDate ImgTypeID TilePixHeight TilePixWidth CutCol CutRow MidLat MidLong NELat NELong NWLat NWLong SELat SELong SWLat SWLong UGridID UTMZone XUtmID YUtmID XGridID YGridID ZGridID 16 M SPIN2 96 M USGS TerraServer File Group Design Make 28 RAID5 sets from 324 disks Each raid set has 11 disks (16 spare drives) Make 4 595GB NT volumes Each striped over 7 Raid sets on 7 controllers Create 26 20,000MB files on F:, 27 on G: DB is File Group of 53 files (1.011TB) F: G: H: I: Physical Database 53 Files. 20,000MB each 16,960,000 extents 135,680,000 pages Separate tables for DOQ, Spin ‘Themes’ Each image stored in column of type ‘image’ All tile images in one (big) table A number of indexes too TerraServer Tables USGS DOQ Data – 48,000 DOQQ images (45-55mb / image) – Creates 864,000 Jump, Thumb, & Browse images (3.5 m rows) – Creates 55.3 m Tile images (110.6 m rows) SPIN-2 Data – 3200 278 MB images (approximate size) – Creates 620,800 Jump, Thumb, & Browse images (2.5 m rows) – Creates 15.5 m Tile images (31 m rows) Gazetteer Data – 1.1 m named places (Encarta World Atlas) – 45 m cell names Total Rows = 193.7 M The Loading Process Includes Cutting Images, building BCP files, BCP meta data, BCP image data First Load 1/97-5/97 for Scalability Day – 190 GB actual image data, 800 GB duplicates – Pre-beta Sphinx Second Load 12/97-4/98 for Web Server – 750 GB actual image data, all images recut Image Preperation and Load DLT Tape DLT Tape “tar” NT DoJob \Drop’N’ LoadMgr DB Wait 4 Load Backup LoadMgr LoadMgr ESA Alpha Server 4100 100mbit EtherSwitch 60 4.3 GB Drives Alpha Server 4100 ImgCutter \Drop’N’ \Images Enterprise Storage Array STC DLT Tape Library 108 9.1 GB Drives 108 9.1 GB Drives 108 9.1 GB Drives Alpha Server 8400 10: ImgCutter 20: Partition 30: ThumbImg 40: BrowseImg 45: JumpImg 50: TileImg 55: Meta Data 60: Tile Meta 70: Img Meta 80: Update Place ... *.IMD & *.JPG Pre-Process Data NT Backup Read *.IMD files Generate Ids Generate ZLatLong Sort by ZLatLong Image Meta Tile Meta Load Thumb Img Load Browse Img Load Tile Img Read Image Meta Read Image Data BCP into ImgTbl Read Image Meta Read Image Data BCP into ImgTbl Read Tile Meta Read Tile Data BCP into TileTbl “SRC”ThumbImg ThumbImgId int ImgMetaId int ZLatLong int SrcId int ImgTypeId int PixWidth int PixHeight int ImgData Blob “SRC”BrowseImg BrowseImgId int ImgMetaId int ZLatLong int SrcId int ImgTypeId int PixWidth int PixHeight int ImgData Blob Meta & Image Load Process “SRC”TileImg TileImgId TileMetaId ZLatLong SrcId ImgTypeId PixWidth PixHeight ImgData int int int int int int int Blob Load Tile Meta Load Img Meta Read Image Meta BCP into TileMeta Read Image Meta BCP into TileMeta TileMeta TileMetaId ImgMetaId OrigMetaId SrcId ImgTypeId XGridId YGridId Hemisphere Continent xxLat xxLong ZLatLong int int int int int int int smallint smallint smallint smallint int ImgMeta ImgMetaId int OrigMetaId int SrcId int ImgTypeId int XGridId int YGridId int ImgDate Date Hemisphere smallint Continent smallint xxLat smallint xxLong smallint ZLatLong int MetaStr vchar(255) The Load Manager A Workflow System. Manages Job ‘Steps’. Built as an SQL Database App. Collects Stats. Would use Data Transformation Services today Load Statistics 601 DOQ Jobs, 818 Spin Jobs – Each job does 3 meta BCP, 4 Image BCP steps 5676 Image BCP Steps – 106 million total images loaded – 546 GB total. 5.4 KB avg image size For Tile Images (96% of the database) – avg 68,000 images/step. max 757,000 – avg 33 minutes/step. max 596 – total time 796 hours (33 days) System Maintenance: Backup & Recovery Industrial – – – – Strength High Performance Online Backups Simple, Error Free Media Handling Minimal Recovery Time Project Phases & Characteristics Load Phase – Ongoing Massive Data Loads – Updates to Fix Errors in Meta-Data – Backups at Key Milestones Deployed – – – – 7 x 24 Some Updates to Existing Data Small Loads as More Data Arrives Infrequent Large Loads SQL Server 7.0 Backup/Restore Fast Online Features Backup Under Load – Minimal Impact Just the Data Backup Part of the Database Minimize Recovery Time – Differential Backups, Log Backups – Restore Only Damaged Files Backup ISVs Address Limitations Legato NetWorker™ Computer Associates ArcServe™ Seagate Backup Exec™ Others… These Products support SQL Server 6.5 None Support SQL Server 7.0 yet. Deployed 6/98... ISV Supports SQL Server 7.0 High Performance Backup API ISV Supports Full Range of SQL Server 7.0 Backup/Restore Features Tape Library SQL Server Backup API Backup Software Backup API Performance Throughput NUL PIPE (no write) Backup API (no write) 0 20 40 60 80 100 120 50 60 MB/sec Avg CPU Usage NUL PIPE (no write) Backup API (no write) 0 10 20 30 Percent 40 Verifying Backup/Restore Minimal Risk Restore to a Separate System at DECWest – Early Problems with Unreadable Tapes TerraServer Test System Another Terabyte of Disk! TerraServer Backup/Restore Factoids Backup/Restore Rate 200 GB/Hr (57 MB/sec) Time Required for Full Database Backup: 5 Hours Number 36 of DLT Tape Cartridges: Other Details Active Server pages – faster and easier than DB stored procedures. Commerce Server is interesting – Images the Inventory no SKU, millions of them – USGS built their own they are very smart, but it is easy masquerade as a credit-card reader. The earth is a geoid, and Every Geographer has a coordinate system (or two). Tapes are still a nightmare. Everyone is a UI expert. Thank You! SPIN-2 Microsoft BackOffice