SPIN-2 Tom Barclay Jim Gray, Don Slutz, Greg Smith, many others Microsoft Research

advertisement
SPIN-2
Tom Barclay
Jim Gray, Don Slutz, Greg Smith, many others
Microsoft Research
Scaleup - Big Database


Build a 1 TB SQL Server database
Data must be
–
–
–
–

1 TB
Unencumbered
Interesting to everyone everywhere
And not offensive to anyone anywhere
Loaded
– 1.1 M place names from Encarta World Atlas
– 1 M Sq Km from USGS (1 meter resolution)
– 2 M Sq Km from Russian Space agency (2 m)


Will be on web (world’s largest atlas)
Sell images with commerce server.
What’s a Terabyte?
1 Terabyte
1,000,000,000 business letters
100,000,000 book pages
50,000,000 FAX images
10,000,000 TV pictures (mpeg)
4,000 LandSat images
150 miles of book shelf
15 miles of book shelf
7 miles of book shelf
10 days of video
16 earth images (100m)
Library of Congress (in ASCII) is 25 TB
1980: 200 M$ of disc
5 M$ of tape silo
1998: 100 k$ of magnetic disc
50 K$ nearline tape
Terror Byte !!
10,000 discs
10,000 tapes
60 discs
30 tapes
3
Some Other Terror-Byte Databases
 TerraServer
 Sloan
Digital Sky Survey:
– 40 TB raw, 2 TB cooked
– EOS/DIS (picture of planet each week)
– 15 PB by 2007
 Federal
Mega
Giga
Reserve Clearing house: images of checks
Tera
– 15 PB by 2006 (7 year history)
 Nuclear
Kilo
Stockpile Stewardship Program
– 10 Exabytes (???!!)
Peta
Exa
Zetta
Yotta
TerraServer is:
“A shameless advertisement of WNT and SQL Server Scalability”
An on-line demo and sales tool directed at IT
customers and ISVs
A test of the Sphinx VLDB features:
– Load performance
– Online Backup/Restore
– Query Performance
A “cool 90s app”
– Image and Text data
– Web-lication
– Electronic Commerce
Application Requirements

BIG —1 TB of data.

PUBLIC — available on the world wide web.

INTERESTING — to a wide audience

ACCESSIBLE — using standard browsers (IE, Netscape)

REAL — a real application (users can buy imagery)

FREE —cannot require NDA or money to access

FAST — impress customers for BackOffice, StorageWorks

EASY — Inexpensive to develop, deploy, and maintain
Project Partners
Motivation
Demo scope & quality
of Spin-2 imagery
Open new markets SPIN-2
for imagery sales
Demo DEC Alpha
& StorageWorks™
Scalability
Recognized as superior h/w vendor
Distribute DOQs to a
wider audience
Lower cost of
distribution
Demo Scalability
of NT &
SQL Server
Database & App UI


Coverage: Range from 70ºN to 70ºS
35% U.S., 1% outside U.S.
Source Imagery:
– 3.5 TB 1sq meter/pixel Aerial (USGS 60,000 46Mb B&W- 151Mb Color IR files)
– 700 GB 1.56 meter/pixelSatellite (Spin-2 2400 300 Mb B&W)


Display Imagery: 80 m 225 x 150 pixel
images, 1.6 m x 3 sub-sampled views
Nav Tools:
– 1.5 m place names
– “Click-on” Coverage map
– Expedia & Virtual Globe map

Concept: User navigates
an ‘almost seamless’
image of earth
225x150m tile
1.8x1.2km 8m browse
1.8x1.2km 16m thumbnail
1.8x1.2km 32m “city view”
TerraServer Demo
 Intranet
Beta Sites:
– http://terraweb1
– http://terraweb2
 Internal
Beta
– Mon April 27 - May 30
 Whitepaper:
– terraserver.doc
What Microsoft & DEC Contribute
 Microsoft’s
contribution:
–
–
–
–
Build an “internet UI”
Design the app and the database
Slice & Dice & Load the data.
Build “electronic stores” for USGS’ for Aerial
Images to operate to sell & distribute images
– Run a “robust”web site 18 months
 Digital
contribution:
– Provide high-performance processors
– provide high capacity, reliable storage.
– Provide technical advice

World’s Largest PC!
– 324 disks (2.4 TB)
– 8 x 440 mhz Alpha CPU
– 10 GB RAM
Site Configuration
StorageTek
Enterprise Storage Array
Alpha
8400
9 HSZ70 Ultra-SCSI
Dual redundant Controllers
(8x440)
10GB
Ram
324 9.1 Seagate Disks
6 DLT7000
Quantum Drives
FWD SCSI
Compaq
5500
4x200mhz
Web
Servers
Compaq
5500
4x200mhz
Web
Servers
To the Web
Software
Web Client
Image
Server
Active Server Pages
Internet
Information
Server 4.0
Java
Viewer
broswer
MTS
Terra-Server
Stored Procedures
HTML
The Internet
Internet Info
Server 4.0
Sphinx
(SQL Server)
Microsoft Automap
ActiveX Server
Terra-Server DB
Automap Server
Terra-Server Web Site
Internet Information
Server 4.0
Microsoft
Site Server EE
Image Delivery SQL Server
Application
7
Image Provider Site(s)
TerraServer E-Commerce
Microsoft does not collect or share in the revenues generated
by TerraServer image sales!

USGS Store
– Built by USGS
– MSCS V2.0 Based
– Standard Shopping Basket
approach
– Purchase Digital Ortho Quads
used by MS to build
TerraServer
– Pricing subject to quantity
– Image you were viewing given
away for free (public domain
data)

Spin-2 Store
–
–
–
–
Microsoft SP built
MSCS V2.0 Based
Moving to V3.0 for RTW
Buy Small, Medium, Large
Digital image
– Can get Photographic print thru
SPIN-2 relationship with Kodak
– Digital images are “sized” to
make photographic prints look
good
TerraServer Schedule

Load Data 12/15/97 - 4/5/98
– Public beta with Aerial Images 1/3/98 - 6/24/98

Data & App Fine Tuning 4/10/98 - 5/31/98
–
–
–
–
Remove SPIN-2 Duplicates
Find all missing images
Implement on-line update program
Fix final app bugs
Move to the IDC: 6/1/98 - 6/10/98
 Launch at Federal Enterprise Day 6/24/98

– http://terraserver.microsoft.com
 “Chopped”
How We Did It
big images into small “tiles”
– Sub-sampled tiles to create zoom levels
– Tile sizes map to Lat/Lon system
– Unique ID assigned to each Tile location
 (Z-transform
of lat/long or UTM)
– Unique ID clusters adjacent tiles onto the same database
& index pages
 Wrote
–
–
–
–
Load Management program
Runs image cutting job
Loads meta and image data into SQL
Multiple Loaders can run in parallel
Web Active Server Page controls load process
USGS Editing Process
1 “QUAD”
DOQ Photo
(3.75’ x 3.75’)
1
2
3
4
7
5
8
6
9
10
11 12
13
16
14 15
17 18
Quad Cut 3x6
Jump, Thumb-nails &
Browse Images
1 Degree Latitude
1 Quadrangle
(7.5’ x 7.5’)
1
9
DOQQ Origin Point
1 Degree Longitude
8
64
DOQ
Tiles
Spin-2 Image Editing Process
48 x 96 cells per sq degree
Image aligned to left
corner of grid system
Non-image squares (all
white) are discarded
Cut Images are extracted
SubSample
Jump
32m
Thumb
16m
8m
Tiles are cut
5x5, scrambled
output Jpeg
Browse
Database Design and Load


Build a 1 TB (2**40B) SQL Server Database
Database includes
– Gazetteer data for searching
– Image data pyramid and metadata

Load the Database
–
–
–
–
–

Chop the big images into tiles
BCP data and metadata in
Allow for restart and undo of loads
Create indexes
Check consistency of the data
Keep it Simple, no Tricks, Test the Scaling
The Image Pyramid
 Zooming
in on the Washington Monument
1:1
1:1
64:1
Jump image
1 pixel =
32x32 m2
Dithered Browse image
1 pixel = 16x16 m2
Dithered Thumb image
1 pixel = 8x8 m2
USGS Tile image
DOQ of Washington Monument
1 pixel = 1 sq meter
‘Logical’ Schema
Country
PlaceType
State
Place
Image Data &
Meta Data
Lat/Long
(U/ZGridId)
Theme Meta
Information
TileLog
ImgMeta
TileMeta
FeatureType
Gazetteer
Star schema
Index on
• image, place, type
• image, state, type
• image, state, country, type
• image, place, state, type
• image, place, country, type
all lookups are fast
Jump Img
BrowseImg
Thumb Img
TileImg
Lookup by UGrid or ZGrid ID plus resolution
Lookups are fast.
Indices are in DRAM (auto-magically by SQL)
SQL manages all the tiles and indices
Images are brought in on demand
Gazetteer Design
 Classic
Snowflake Schema
 Top 10 Hint to RE for Cursor Select
PlaceGrid
Place
CountrySearch
AlternateName
CountryID
GazSrcID
1148
Country
CountryID
CountryName
UNcode
264
StateSerach
AlternateName
CountryID
StateID
FreatureID
GazSrcID
State
StateID
CountryID
StateName
1083
PlaceID
ImageFlag
AlternateName
Name
CountryID
StateID
TypeID
GazSourcID
Latitude
Longitude
UGridID
ZGridID
DOQdate
SPIN2date
3776
1,089,897
ZGridID
BestPlaceName
XDistance
YDistrance
50,000,000
FeatureType
TypeID
Description
13
GazetteerSource
GazSrcID
Description
1
Image Data Design
 Image
pyramid stored in DBMS (250 M recs)
OriginalMetaData
ImageMeta
OrigMetaID
SrcID
ImageSource
Agency
SourcePhotoID
SourcePhotoDate
SourceDEMDate
MetaDataDate
ProductionSystem
ProductionDate
DataFileSize
Compression
HeaderBytes
…
80 other fields
ImgMetaID
OrigMetaID
ImgStatus
ImgDate
ImgTypeID
JumpPixHeight
JumpPixWidth
BrowsePixHeight
BrowsePixWidth
ThumbPixWidth
ThumbPixHeight
CutCol
CutRow
MidLat
MidLong
NELat
NELong
NWLat
NWLong
SELat
SELong
SWLat
SWLong
UGridID
UTMZone
XUtmID
YUtmID
XGridID
YGridID
ZGridID
650 k SPIN2
2 M USGS
ImgSource
ImgType
ImgTypeID
ImgFileDesc
ImgFileExt
MimeStr
SrcID
SrcName
SrcTblName
SrcDescription
GridSysID
ImgTypeID
4
2
650 k SPIN2
2 M USGS
Pick
Log
UGridHits
Name
Description
Link
PickDate
URL
Time
<extensive
list of action
parameters
URL
UGridID
ZTileGridID
count
10
TileMeta
xxx
xxx
Jump
Browse
Thumb
Tile
UGridID
ZGridID
ZTileGridID
ImgData
ImgDate
ImgTypeID
ImgMetaID
SrcID
EncryptKey
File Name
UGridID
ZGridID
ZTileGridID
ImgData
ImgDate
ImgTypeID
ImgMetaID
SrcID
EncryptKey
File Name
UGridID
ZGridID
ZTileGridID
ImgData
ImgDate
ImgTypeID
ImgMetaID
SrcID
EncryptKey
File Name
UGridID
ZGridID
ZTileGridID
ImgData
1
ImgDate
ImgTypeID
ImgMetaID
SrcID
EncryptKey
File Name
.65 M SPIN2
1.5 M USGS
.65 M SPIN2
1.5 M USGS
.65 M SPIN2
1.5 M USGS
16 M SPIN2
96 M USGS
ImgMetaID
OrigMetaID
SrcID
ImgStatus
ImgDate
ImgTypeID
TilePixHeight
TilePixWidth
CutCol
CutRow
MidLat
MidLong
NELat
NELong
NWLat
NWLong
SELat
SELong
SWLat
SWLong
UGridID
UTMZone
XUtmID
YUtmID
XGridID
YGridID
ZGridID
16 M SPIN2
96 M USGS
TerraServer File Group Design
 Make
28 RAID5 sets from 324 disks
Each raid set has 11 disks (16 spare drives)
 Make
4 595GB NT volumes
Each striped over 7 Raid sets on 7 controllers
 Create
26 20,000MB files on F:, 27 on G:
 DB is File Group of 53 files (1.011TB)
F:
G:
H:
I:
Physical Database
 53
Files. 20,000MB each
 16,960,000 extents
 135,680,000 pages
 Separate tables for DOQ, Spin ‘Themes’
 Each image stored in column of type ‘image’
 All tile images in one (big) table
 A number of indexes too
TerraServer Tables

USGS DOQ Data
– 48,000 DOQQ images (45-55mb / image)
– Creates 864,000 Jump, Thumb, & Browse images (3.5 m rows)
– Creates 55.3 m Tile images (110.6 m rows)

SPIN-2 Data
– 3200 278 MB images (approximate size)
– Creates 620,800 Jump, Thumb, & Browse images (2.5 m rows)
– Creates 15.5 m Tile images (31 m rows)

Gazetteer Data
– 1.1 m named places (Encarta World Atlas)
– 45 m cell names

Total Rows = 193.7 M
The Loading Process
 Includes
Cutting Images, building BCP
files, BCP meta data, BCP image data
 First Load 1/97-5/97 for Scalability Day
– 190 GB actual image data, 800 GB duplicates
– Pre-beta Sphinx
 Second
Load 12/97-4/98 for Web Server
– 750 GB actual image data, all images recut
Image Preperation and Load
DLT
Tape
DLT
Tape
“tar”
NT
DoJob
\Drop’N’
LoadMgr
DB
Wait 4
Load
Backup
LoadMgr
LoadMgr
ESA
Alpha
Server
4100
100mbit
EtherSwitch
60
4.3 GB
Drives
Alpha
Server
4100
ImgCutter
\Drop’N’
\Images
Enterprise Storage Array
STC
DLT
Tape
Library
108
9.1 GB
Drives
108
9.1 GB
Drives
108
9.1 GB
Drives
Alpha
Server
8400
10: ImgCutter
20: Partition
30: ThumbImg
40: BrowseImg
45: JumpImg
50: TileImg
55: Meta Data
60: Tile Meta
70: Img Meta
80: Update Place
...
*.IMD & *.JPG
Pre-Process Data
NT Backup
Read *.IMD files
Generate Ids
Generate ZLatLong
Sort by ZLatLong
Image Meta
Tile Meta
Load Thumb Img
Load Browse Img
Load Tile Img
Read Image Meta
Read Image Data
BCP into ImgTbl
Read Image Meta
Read Image Data
BCP into ImgTbl
Read Tile Meta
Read Tile Data
BCP into TileTbl
“SRC”ThumbImg
ThumbImgId int
ImgMetaId int
ZLatLong
int
SrcId
int
ImgTypeId
int
PixWidth
int
PixHeight
int
ImgData
Blob
“SRC”BrowseImg
BrowseImgId int
ImgMetaId
int
ZLatLong
int
SrcId
int
ImgTypeId
int
PixWidth
int
PixHeight
int
ImgData
Blob
Meta & Image
Load Process
“SRC”TileImg
TileImgId
TileMetaId
ZLatLong
SrcId
ImgTypeId
PixWidth
PixHeight
ImgData
int
int
int
int
int
int
int
Blob
Load Tile Meta
Load Img Meta
Read Image Meta
BCP into TileMeta
Read Image Meta
BCP into TileMeta
TileMeta
TileMetaId
ImgMetaId
OrigMetaId
SrcId
ImgTypeId
XGridId
YGridId
Hemisphere
Continent
xxLat
xxLong
ZLatLong
int
int
int
int
int
int
int
smallint
smallint
smallint
smallint
int
ImgMeta
ImgMetaId
int
OrigMetaId
int
SrcId
int
ImgTypeId
int
XGridId
int
YGridId
int
ImgDate
Date
Hemisphere smallint
Continent
smallint
xxLat
smallint
xxLong
smallint
ZLatLong
int
MetaStr vchar(255)
The Load Manager
A
Workflow System. Manages Job ‘Steps’.
 Built as an SQL Database App. Collects Stats.
 Would use Data Transformation Services today
Load Statistics
 601
DOQ Jobs, 818 Spin Jobs
– Each job does 3 meta BCP, 4 Image BCP steps
 5676
Image BCP Steps
– 106 million total images loaded
– 546 GB total. 5.4 KB avg image size
 For
Tile Images (96% of the database)
– avg 68,000 images/step. max 757,000
– avg 33 minutes/step. max 596
– total time 796 hours (33 days)
Total mB Loaded Each Day
 Bottleneck
varied: image supply, preprocess,
network traffic, BCP.
TerraServer Load - mB/Day
30000
mB Loaded
25000
20000
15000
10000
5000
0
11/26/97
12/16/97
1/5/98
1/25/98
2/14/98
Day
3/6/98
3/26/98
4/15/98
Average BCP Rate Each Day
 Rate
measured just over BCP calls
 Single BCP stream. Client sometimes remote
TerraServer Avg BCP rate by day
BCP Rate (kB/sec)
6000
5000
4000
3000
2000
1000
0
11/26/97
12/16/97
1/5/98
1/25/98
2/14/98
Day
3/6/98
3/26/98
4/15/98
BCP rates for Tile Images
 BCP
improvements with new builds
 There were also higher rates (10-13 mB/sec)
for local clients
BCP rate kB/sec
TerraServer TileImg BCP rates (below 6000 kB/sec)
4500
4000
3500
3000
2500
2000
1500
1000
500
0
0
200
400
600
800
1000
Time (Hours since 12/11/97)
1200
1400
1600
System Maintenance:
Backup &
Recovery
 Industrial
–
–
–
–
Strength
High Performance
Online Backups
Simple, Error Free Media Handling
Minimal Recovery Time
Project Phases & Characteristics
 Load
Phase
– Ongoing Massive Data Loads
– Updates to Fix Errors in Meta-Data
– Backups at Key Milestones
 Deployed
–
–
–
–
7 x 24
Some Updates to Existing Data
Small Loads as More Data Arrives
Infrequent Large Loads
SQL Server 7.0 Backup/Restore

Fast
 Online
Features
Backup Under Load
– Minimal Impact
 Just
the Data
 Backup Part of the Database
 Minimize Recovery Time
– Differential Backups, Log Backups
– Restore Only Damaged Files
SQL Server 7.0 Backup/Restore
Limitations
 No
Tape Robot Support
 Limited
Media Management
 Doesn’t Back Up the Whole NT Platform
Backup ISVs Address
Limitations
 Legato
NetWorker™
 Computer Associates ArcServe™
 Seagate Backup Exec™
 Others…
These Products support SQL Server 6.5
None Support SQL Server 7.0 yet.
Load Phase
3/97 - 6/98
 SQL
Server Backup Not Used
 Backup the Database as a Set of NT files
 Shut Down SQL Server During Backups
Tape Library
Backup
Software
Deployed
6/98...
 ISV
Supports SQL Server 7.0 High
Performance Backup API
 ISV Supports Full Range of SQL Server 7.0
Backup/Restore Features
Tape Library
SQL
Server
Backup API
Backup
Software
Backup API Performance
Throughput
NUL
PIPE (no write)
Backup API (no write)
0
20
40
60
80
100
120
50
60
MB/sec
Avg CPU Usage
NUL
PIPE (no write)
Backup API (no write)
0
10
20
30
Percent
40
Verifying Backup/Restore
 Minimal
Risk
Restore to a Separate
System at DECWest
– Early Problems with Unreadable Tapes
TerraServer
Test System
Another Terabyte of Disk!
TerraServer
Backup/Restore
Factoids
 Backup/Restore
Rate
200 GB/Hr (57 MB/sec)
 Time
Required for Full Database Backup:
5 Hours
 Number
36
of DLT Tape Cartridges:
Other Details

Active Server pages
– faster and easier than DB stored procedures.

Commerce Server is interesting
– Images the Inventory
 no SKU,
 millions of them
– USGS built their own
 they are very smart, but it is easy
 masquerade as a credit-card reader.
The earth is a geoid, and
 Every Geographer has a coordinate system (or two).
 Tapes are still a nightmare.
 Everyone is a UI expert.

Thank You!
SPIN-2
Microsoft
BackOffice
Download