Building Multi-Petabyte
Online Databases
Jim Gray
Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
• Research driven by apps:
– EOS/DIS
– TerraServer
– National Virtual Astronomy Observatory.
Reality Check
• Good news
– In the limit, processing & storage & network is free
– Processing & network is infinitely fast
• Bad news
– Most of us live in the present.
– People are getting more expensive:
management/programming cost exceeds hardware cost.
– Speed of light not improving.
– WAN prices have not changed much in last 8 years.
How Much Information
Is there?
• Soon everything can be
recorded and indexed
• Most data will never be seen by humans
• Precious resource:
human attention
– Auto-summarization and auto-search are the key technologies.
[Chart (www.lesk.com/mlesk/ksg97/ksg.html): how much information there is, on a kilo-to-yotta scale; in descending order: everything recorded(!), all books, multimedia, all LoC books (words), a movie, a photo, a book. Prefixes: kilo 10^3 up through yotta 10^24; downward: milli, micro, nano, pico, femto, atto, zepto, yocto]
Trends:
ops/s/$ Had Three Growth Phases
– 1890-1945: mechanical, relay; 7-year doubling
– 1945-1985: tube, transistor,…; 2.3-year doubling
– 1985-2000: microprocessor; 1.0-year doubling
[Chart: ops per second per dollar, 1880-2000 (log scale, 1.E-06 to 1.E+09); doubling every 7.5 years, then every 2.3 years, then every 1.0 years]
Sort Speedup
• Performance doubled every year for the last 15 years.
• But now it is 100s or 1,000s of processors and disks
• Got 40%/y (70x) from technology and
60%/y (1,000x) from parallelism (partition and pipeline)
• See http://research.microsoft.com/barc/SortBenchmark/
[Charts: sort records/second vs time and GB sorted per dollar, 1985-2000, both doubling every year; systems include Bitton M68000, Kitsuregawa hardware sorter, Tandem, Intel HyperCube, Sequent, Cray YMP, IBM 3090, IBM RS6000, Alpha, Ordinal+SGI, Compaq/NT, NT/PennySort, Sandia/Compaq/NT, SPsort/IBM]
Terabyte (Petabyte) Processing
Requires Parallelism
Parallelism: use many little devices in parallel.
• At 10 MB/s, one disk takes 1.2 days to scan 1 terabyte.
• 1,000x parallel (100 processors & 1,000 disks): the scan takes 100 seconds (arithmetic sketched below).
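The slide's numbers as a back-of-the-envelope Python check (the 10 MB/s per-disk rate and the 1,000-way spread are taken from the slide):

```python
# Scan-time arithmetic from this slide: 1 TB at 10 MB/s, serial vs 1,000-way parallel.
TB = 1e12            # bytes
disk_rate = 10e6     # bytes/second per disk (the slide's 10 MB/s)

serial_seconds = TB / disk_rate                # one disk scans alone
parallel_seconds = TB / (disk_rate * 1000)     # 1,000 disks, data perfectly partitioned

print(f"serial:   {serial_seconds / 86400:.1f} days")   # ~1.2 days
print(f"parallel: {parallel_seconds:.0f} seconds")      # 100 seconds
```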
Parallelism Must Be Automatic
• There are thousands of MPI programmers.
• There are hundreds-of-millions of people using
parallel database search.
• Parallel programming is HARD!
• Find design patterns and automate them.
• Data search/mining has parallel design patterns.
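To make such a design pattern concrete, a hedged sketch (Python; not the talk's code, and it assumes the data is already split into partitions): the caller sees a single search() call and the partition parallelism stays hidden.

```python
# Sketch of the partition design pattern: a parallel search hidden
# behind an ordinary-looking call. Names and layout are illustrative.
from multiprocessing import Pool

def scan_partition(args):
    partition, predicate = args
    return [row for row in partition if predicate(row)]

def divisible_by_97(x):          # predicates must be top-level to be picklable
    return x % 97 == 0

def search(partitions, predicate, workers=4):
    # The caller never mentions processes or disks: parallelism is automatic.
    with Pool(workers) as pool:
        parts = pool.map(scan_partition, [(p, predicate) for p in partitions])
    return [row for part in parts for row in part]

if __name__ == "__main__":
    partitions = [list(range(i, i + 1000)) for i in range(0, 8000, 1000)]
    print(len(search(partitions, divisible_by_97)))   # 83 hits in 0..7999
```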
Storage capacity
beating Moore’s law
• 4 k$/TB today (raw disk)
• Disk TB shipped per year grows 112%/y; Moore's Law is 58.7%/y
[Chart: disk TB shipped per year, 1988-2000, log scale 1E+3 to 1E+7 (an exabyte/year in sight); source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

Moore's law:     58.7%/year
Revenue:         7.47%/year
TB growth:       112.3%/year (since 1993)
Price decline:   50.7%/year (since 1993)
Cheap Storage
• 4 k$/TB disks (16 x 60 GB disks @ 210$ each)
[Charts: price ($) vs raw disk unit size (GB) for SCSI and IDE drives, with linear fits (y = 15.895x + 13.446; y = 13.322x - 1.4332; y = 5.7156x + 47.857; y = 3.0635x + 40.542), and raw k$/TB vs disk unit size for SCSI and IDE]
• 240 GB, 2 k$ (now); 320 GB by year end
– 4 x 60 GB IDE disks (2 hot pluggable) (1,100$)
– SCSI-IDE bridge (200$)
– Box: 500 MHz CPU, 256 MB RAM, fan, power, Ethernet (700$)
– (1,100$ + 200$ + 700$ = 2 k$)
• Or 8 disks/box:
480 GB for ~3 k$ (or 300 GB RAID)
Hot Swap Drives for Archive or
Data Interchange
• 35 MBps write
(so can write N x 60 GB in 40 minutes)
• 60 GB/overnight
= ~N x 2 MB/second
@ 19.95$/night
Cheap Storage or Balanced System
• Low-cost storage (2 x 1.5 k$ servers): 7 k$/TB
– 2 x (1 k$ system + 8 x 60 GB disks + 100 Mb Ethernet)
• Balanced server (7 k$ / 0.64 TB):
– 2 x 800 MHz CPUs (2 k$)
– 256 MB RAM (200$)
– 8 x 80 GB drives (2.8 k$)
– Gbps Ethernet + switch (1.5 k$)
– 11 k$/TB, 22 k$/RAIDed TB
The “Absurd” Disk
• 2.5 hr scan time
(poor sequential access)
• 1 access per second per 5 GB
(VERY cold data)
• It's a tape!
[Diagram: 1 TB disk, 100 MB/s, 200 KAPS]
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?).
A geo-plex
• Scrub it continuously (look for errors; a sketch follows below)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the two copies differently
(e.g.: one by time, one by space)
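A minimal sketch of that scrub-and-repair loop (Python), assuming per-block checksums recorded at write time; the block granularity and checksum choice are illustrative, not the talk's design:

```python
# Sketch of the scrub loop: each block carries a checksum recorded at write
# time; a damaged copy is refreshed from the copy that still matches.
import hashlib

def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def scrub(copy_a, copy_b, expected):
    """copy_a/copy_b: block lists of the two online copies; expected: write-time checksums."""
    for i, want in enumerate(expected):
        ok_a, ok_b = digest(copy_a[i]) == want, digest(copy_b[i]) == want
        if ok_a and not ok_b:
            copy_b[i] = copy_a[i]          # refresh lost copy from safe copy
        elif ok_b and not ok_a:
            copy_a[i] = copy_b[i]
        elif not (ok_a or ok_b):
            raise IOError(f"block {i}: both copies damaged")

blocks = [b"block-0", b"block-1"]
a, b = list(blocks), list(blocks)
b[1] = b"bit rot"                          # simulate a latent error in one copy
scrub(a, b, [digest(x) for x in blocks])
assert b[1] == b"block-1"                  # the geo-plex healed itself
```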
Disk vs Tape
             Disk                       Tape
Capacity     80 GB                      40 GB
Bandwidth    35 MBps                    10 MBps
Latency      5 ms seek + 3 ms rotate    10 s pick + 30-120 s seek
Price        4$/GB for drive,           2$/GB for media,
             3$/GB for ctlrs/cabinet    8$/GB for drive+library
Density      4 TB/rack                  10 TB/rack
Scan time    1 hour                     1 week

Guestimates: CERN stores 200 TB on 3480 tapes (2 columns = 50 GB; 1 rack = 1 TB = 12 drives).
The price advantage of tape is narrowing, and
the performance advantage of disk is growing
At 10K$/TB, disk is competitive with nearline tape.
Gilder’s Law:
3x bandwidth/year for 25 more years
• Today:
– 10 Gbps per channel
– 4 channels per fiber: 40 Gbps
– 32 fibers/bundle = 1.2 Tbps/bundle
• In lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber (1 fiber = 25 Tbps)
• 1 Tbps = USA 1996 WAN bisection bandwidth
• Aggregate bandwidth doubles every 8 months!
Sense of scale
• How fat is your pipe?
• Fattest pipe on MS
campus is the WAN!
• 94 MBps coast to coast
• 300 MBps OC48 = G2 (or memcpy()!)
• 90 MBps PCI
• 20 MBps disk / ATM / OC3
[Map: Redmond/Seattle, WA to Arlington, VA via San Francisco, CA and New York; 5,626 km, 10 hops; Microsoft, University of Washington, Pacific Northwest Gigapop, Qwest, Information Sciences Institute, HSCC (high speed connectivity consortium), DARPA]
Networking
• WANs are getting faster than LANs:
G8 = OC192 = 10 Gbps is "standard"
• Link bandwidth improves 4x per 3 years
• Speed of light (60 ms round trip in US)
• Software stacks have always been the problem:
Time = SenderCPU + ReceiverCPU + bytes/bandwidth
The sender/receiver CPU terms have been the problem (worked example below).
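A worked instance of this cost model (Python; the per-byte CPU costs are hypothetical placeholders, tuned only to land near the ~250 µs figure on the next slide):

```python
# The slide's cost model: Time = SenderCPU + ReceiverCPU + bytes/bandwidth.
def send_time(nbytes, bandwidth_Bps, sender_cpu_s_per_byte, receiver_cpu_s_per_byte):
    return (nbytes * sender_cpu_s_per_byte
            + nbytes * receiver_cpu_s_per_byte
            + nbytes / bandwidth_Bps)

# Hypothetical overheads: 100 Mbps Ethernet (12.5 MB/s) with heavy tcp/ip stacks.
t = send_time(1024, bandwidth_Bps=12.5e6,
              sender_cpu_s_per_byte=80e-9, receiver_cpu_s_per_byte=80e-9)
print(f"{t * 1e6:.0f} us to send 1 KB")    # ~246 us: cpu cost dominates wire time
```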
The Promise of SAN/VIA/Infiniband
10x in 2 years
• Yesterday:
– 10 MBps (100 Mbps Ethernet)
– ~20 MBps tcp/ip saturates 2 cpus
– round-trip latency ~250 µs
• Now:
– Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,…
– Fast user-level communication:
• tcp/ip ~ 100 MBps, 10% cpu
• round-trip latency is 15 µs
• 1.6 Gbps demoed on a WAN
http://www.ViArch.org/
[Chart: time (µs) to send 1 KB, stacked as transmit + receiver cpu + sender cpu, for 100 Mbps Ethernet, Gbps Ethernet, and SAN]
What’s a Balanced System?
[Diagram: a balanced system; CPUs and memory on the system bus, devices on PCI buses]
Rules of Thumb in Data Engineering
• Moore's law -> an address bit per 18 months.
• Storage grows 100x/decade (except 1000x last decade!)
• Disk data of 10 years ago now fits in RAM (iso-price).
• Device bandwidth grows 10x/decade, so need parallelism
• RAM:disk:tape price is 1:10:30, going to 1:10:10
• Amdahl's speedup law: S/(S+P)
• Amdahl's IO law: a bit of IO per instruction per second
(~1 TBps for 10 teraOPs: 50,000 disks, 100 M$)
• Amdahl's memory law: a byte per instruction per second (going to 10)
(1 TB RAM per TOP: 1 TeraDollars)
• PetaOps anyone?
• Gilder's law: aggregate bandwidth doubles every 8 months.
• 5 Minute rule: cache disk data that is reused in 5 minutes (sketched below).
• Web rule: cache everything!
http://research.Microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
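The 5-minute rule is simple price arithmetic; a sketch (Python) using roughly the numbers from the 1997 "Five-Minute Rule Ten Years Later" note (treat the prices as assumptions):

```python
# The five-minute rule as arithmetic: cache a page in RAM if it is re-read
# more often than the break-even interval.
def break_even_seconds(pages_per_mb_ram, accesses_per_sec_per_disk,
                       price_per_disk, price_per_mb_ram):
    return (pages_per_mb_ram / accesses_per_sec_per_disk) * \
           (price_per_disk / price_per_mb_ram)

# Assumed late-1990s numbers: 8 KB pages (128/MB), 64 accesses/s
# per 2,000$ drive, 15$/MB RAM.
t = break_even_seconds(128, 64, 2000, 15)
print(f"break-even interval = {t / 60:.0f} minutes")   # about 5 minutes
```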
Scalability: Up and Out
• "Scale Up"
– Use "big iron" (SMP)
– Cluster into packs for availability
• "Scale Out": clones & partitions (toy sketch below)
– Use commodity servers
– Add clones & partitions as needed
[Diagram: scale up vs scale out]
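A toy illustration of the two styles (Python; the routing rules and names are generic assumptions, not the talk's code): requests can go to any clone, but must go to the partition that owns the data.

```python
# Toy routing for the two styles: clones are interchangeable (round-robin);
# partitions own disjoint key ranges (route by key).
import itertools

clones = itertools.cycle(["web1", "web2", "web3"])   # identical, stateless
partitions = {"a-m": "db1", "n-z": "db2"}            # each owns part of the data

def route_clone():
    return next(clones)                              # any clone will do

def route_partition(key: str):
    return partitions["a-m" if key[0].lower() <= "m" else "n-z"]

print(route_clone(), route_partition("gray"))        # web1 db1
```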
An Architecture for Internet Services?
• Need to be able to add capacity
– New processing
– New storage
– New networking
• Need continuous service
– Online change of all components (hardware and software)
– Multiple service sites
– Multiple network providers
• Need great development tools
– Change the application several times per year.
– Add new services several times per year.
Premise: Each Site is a Farm
• Buy computing by the slice (brick):
– Rack of servers + disks.
• Grow by adding slices
– Spread data and computation to new slices
• Two styles:
– Clones: anonymous servers
– Parts+Packs: Partitions fail over within a pack
• In both cases, remote farm for disaster recovery
Everyone scales out

[Diagram: The Microsoft.Com Site (MidYear98a.vsd, 12/15/97). Building 11 (internal WWW, staging servers, log processing, live SQL servers, accessible from corpnet), MOSWest admin LAN, and FDDI rings MIS1-MIS4 hosting www.microsoft.com, home.microsoft.com, register.microsoft.com / register.msn.com, search.microsoft.com, support.microsoft.com, premium.microsoft.com, msid.msn.com, activex.microsoft.com, cdm.microsoft.com, and FTP/HTTP download servers; primary and secondary gigaswitches and routers feed the Internet via 13 DS3 links (45 Mb/s each) and 2 OC3 links (100 Mb/s each) over switched Ethernet; European and Japan data centers plus SQL consolidators, SQL reporting, and DMZ/IDC staging servers; a typical node is a 4xP5/P6 with 256 MB-1 GB RAM, 12-180 GB disk, at $24K-$128K, with FY98 forecasts per cluster]
What’s the Brick?
• 1M$/slice
– IBM S390?
– Sun E 10,000?
• 100 K$/slice
– HPUX/AIX/Solaris/IRIX/EMC
• 10 K$/slice
– Utel / Wintel 4x
• 1 K$/slice
– Beowulf / Wintel 1x
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
• Research driven by apps:
– EOS/DIS
– TerraServer
– National Virtual Astronomy Observatory.
Interesting Apps
• EOS/DIS
• TerraServer
• Sloan Digital Sky Survey
[Scale bar: kilo 10^3, mega 10^6, giga 10^9, tera 10^12, peta 10^15, exa 10^18, with a "today, we are here" marker]
The Challenge -- EOS/DIS
• Antarctica is melting -- 77% of fresh water liberated
– sea level rises 70 meters
– Chico & Memphis are beach-front property
– New York, Washington, SF, LA, London, Paris
• Let’s study it! Mission to Planet Earth
• EOS: Earth Observing System (17B$ => 10B$)
– 50 instruments on 10 satellites 1999-2003
– Landsat (added later)
• EOS DIS: Data Information System:
– 3-5 MB/s raw, 30-50 MB/s processed.
– 4 TB/day,
– 15 PB by year 2007
Designing EOS/DIS
• Expect that millions will use the system (online)
Three user categories:
– NASA 500 -- funded by NASA to do science
– Global Change 10 k - other earth scientists
– Internet 500 M - everyone else
High school students
Grain speculators
Environmental Impact Reports
New applications
=> discovery & access must be automatic
• Allow anyone to set up a peer-node (DAAC & SCF)
• Design for ad hoc queries, not standard data products:
design for pull vs push;
the computation demand is enormous
(pull:push is 100:1)
Key Architecture Features
• 2+N data center design
• Scaleable OR-DBMS
• Emphasize Pull vs Push processing
• Storage hierarchy
• Data Pump
• Just-in-time acquisition
Obvious Point:
EOS/DIS will be a cluster of SMPs
• It needs 16 PB storage
= 200K disks in current technology
= 400K tapes in current technology
• It needs 100 TeraOps of processing
= 100K processors (current technology)
and ~ 100 Terabytes of DRAM
• Startup requirements are 10x smaller
– smaller data rate
– almost no re-processing work
2+N data center design
• Duplex the archive (for fault tolerance)
• Let anyone build an extract (the +N)
• Partition data by time and by space (store 2 or 4 ways).
• Each partition is a free-standing OR-DBMS
(similar to Tandem, Teradata designs).
• Clients and partitions interact
via standard protocols
– HTTP + XML
Data Pump
• Some queries require reading ALL the data
(for reprocessing)
• Each Data Center scans ALL the data every 2 days.
– Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
• Compute on demand for small jobs:
– less than 100 M disk accesses
– less than 100 TeraOps
– less than 30 minute response time
• For BIG JOBS, scan the entire 15 PB database
• Queries (and extracts) "snoop" this data pump (schematic below).
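A schematic of the snooping idea (Python; the names are illustrative): queries attach to one shared sequential scan instead of each seeking through the archive independently.

```python
# Sketch: a data pump. One sequential scan of the archive is shared by
# all registered queries; each query sees every block exactly once.
class DataPump:
    def __init__(self):
        self.queries = []                 # (predicate, results) pairs

    def register(self, predicate):
        results = []
        self.queries.append((predicate, results))
        return results

    def scan(self, archive):
        for block in archive:             # single pass over ALL the data
            for predicate, results in self.queries:
                if predicate(block):
                    results.append(block)

pump = DataPump()
hot = pump.register(lambda b: b["temp"] > 30)
wet = pump.register(lambda b: b["rain"] > 10)
pump.scan([{"temp": 35, "rain": 2}, {"temp": 20, "rain": 12}])
print(len(hot), len(wet))                 # 1 1
```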
Problems
• Management (and HSM)
• Design and Meta-data
• Ingest
• Data discovery, search, and analysis
• Auto Parallelism
• reorg-reprocess
What this system taught me
• Traditional storage metrics
– KAPS: KB objects accessed per second
– $/GB: Storage cost
• New metrics:
– MAPS: megabyte objects accessed per second
– SCANS: Time to scan the archive
– Admin cost dominates (!!)
– Auto parallelism is essential.
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
• Research driven by apps:
– EOS/DIS
– TerraServer
– National Virtual Astronomy Observatory.
Microsoft TerraServer:
http://TerraServer.Microsoft.com/
• Build a multi-TB SQL Server database
• Data must be
– 1 TB
– Unencumbered
– Interesting to everyone everywhere
– And not offensive to anyone anywhere
• The data:
– 1.5 M place names from Encarta World Atlas
– 7 M sq km USGS DOQ (1-meter resolution)
– 10 M sq km USGS topos (2 m)
– 1 M sq km from Russian Space Agency (2 m)
• Loaded
• On the web (world's largest atlas)
• Sell images with commerce server.
Microsoft TerraServer Background
• Earth is 500 tera-meters square
– USA is 10 tm²
• 100 tm² of land between 70ºN and 70ºS
• We have pictures of 6% of it
– 3 tsm from USGS
– 2 tsm from Russian Space Agency
• Someday:
– multi-spectral image
– of everywhere
– once a day / hour
• Compress 5:1 (JPEG) to 1.5 TB.
• Slice into 10 KB chunks; store chunks in DB
• Navigate with
– Encarta™ Atlas
• globe
• gazetteer
– StreetsPlus™ in the USA
[Image pyramid: .2x.2 km² tile, .4x.4 km² image, .8x.8 km² image, 1.6x1.6 km² image; tile addressing sketched below]
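One way to address tiles in such a pyramid is a quadtree-style key; a sketch (Python) under generic assumptions, not TerraServer's actual naming scheme:

```python
# Sketch: quadtree-style addressing for an image pyramid whose tile edge
# doubles per level (0.2, 0.4, 0.8, 1.6 km). Keys are (level, row, col).
BASE_EDGE_KM = 0.2

def tile_key(x_km: float, y_km: float, level: int):
    edge = BASE_EDGE_KM * (2 ** level)      # 0.2 km at level 0 ... 1.6 km at level 3
    return (level, int(y_km // edge), int(x_km // edge))

def parent(key):
    level, row, col = key                   # four level-n tiles share one parent
    return (level + 1, row // 2, col // 2)

k = tile_key(123.45, 678.90, 0)
print(k, parent(k))                         # a base tile and its coarser parent
```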
USGS Digital Ortho Quads (DOQ)
• US Geological Survey
• 6 terabytes
(14 TB raw, but there is redundancy)
• Most data not yet published
• Based on a CRADA
– TerraServer makes data available.
[Map: USGS "DOQ", 1x1 meter, 6 TB, continental US; new data coming]
Russian Space Agency (SovInformSputnik)
SPIN-2 (Aerial Images is the worldwide distributor)
• 1.5-meter geo-rectified imagery of (almost) anywhere
• Almost equal-area projection
• De-classified satellite photos (from 200 km)
• More data coming (1 m)
• Selling imagery on the Internet.
• Putting 2 tm² onto Microsoft TerraServer.
Hardware
[Diagram: Internet via DS3 to web servers and map site servers on a 100 Mbps Ethernet switch, fed by SPIN-2; 6 TB database server: AlphaServer 8400 4x400, 10 GB RAM, 324 StorageWorks disks, 10-drive tape library (STC TimberWolf DLT7000)]
Software
[Diagram: web clients (HTML browser, Java viewer) connect over the Internet to the TerraServer web site (Internet Information Server 4.0 with Active Server Pages, Microsoft Site Server EE, MTS), which calls TerraServer stored procedures in SQL Server 7 (TerraServer DB) and the Microsoft Automap ActiveX server (Automap server); an image delivery application (IIS 4.0 + SQL Server 7) runs at the image provider site(s)]
BAD OLD Load
[Diagram: DLT tapes ("tar") dropped via \Drop'N' DoJob onto AlphaServer 4100s (ImgCutter writes to \Drop'N'\Images), through a 100 Mbit EtherSwitch into an AlphaServer 8400 with an Enterprise Storage Array (3 x 108 x 9.1 GB drives plus 60 x 4.3 GB drives) and an STC DLT tape library; a LoadMgr DB schedules wait-for-load and backup. Load steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place, ...]
New Image Load and Update
[Diagram: DLT tape ("tar") feeds an image cutter (dither; build image pyramid from base image); TerraLoader merges via ODBC transactions into the TerraServer SQL DBMS; a cut & load scheduling system (Active Server Pages) drives the flow from a metadata load DB]
After a Year:
• 2 TB of data, 750 M records
• 2.3 billion hits
• 2.0 billion DB queries
• 1.7 billion images sent (2 TB of download)
• 368 million page views
• 99.93% DB availability
• 3rd design now online
• Built and operated by a team of 4 people
[Chart: TerraServer daily traffic, Jun 22, 1998 thru June 22, 1999: sessions, hits, page views, DB queries, images; up to ~30 M per day]
[Chart: TerraServer up time vs down time over 8,640 total hours (down time in hours:minutes), with downtime split among operations, scheduled maintenance, and HW+software]
TerraServer Current Effort
• Added USGS topographic maps (4 TB)
• The other 25% of the US DOQs (photos)
• Adding digital elevation maps
• Integrated with Encarta Online
• Open architecture: publish XML and C# interfaces.
• Adding multi-layer maps (with UC Berkeley)
• High availability (4-node cluster with failover)
• Geo-spatial extension to SQL Server
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
• Research driven by apps:
– EOS/DIS
– TerraServer
– National Virtual Astronomy Observatory.
(inter) National Virtual Observatory
• Almost all astronomy datasets will be online
• Some are big (>>10 TB)
• Total is a few petabytes
• Bigger datasets coming
• Data is "public"
• Scientists can mine these datasets
• Computer Science challenge:
– Organize these datasets
– Provide easy access to them.
The Sloan Digital Sky Survey
A project run by the Astrophysical Research Consortium (ARC)
The University of Chicago
Princeton University
The Johns Hopkins University
The University of Washington
Fermi National Accelerator Laboratory
US Naval Observatory
The Japanese Participation Group
The Institute for Advanced Study
SLOAN Foundation, NSF, DOE, NASA
Goal: To create a detailed multicolor map of the Northern Sky
over 5 years, with a budget of approximately $80M
Data Size: 40 TB raw, 1 TB processed
Scientific Motivation
Create the ultimate map of the Universe:
– The Cosmic Genome Project!
Study the distribution of galaxies:
– What is the origin of fluctuations?
– What is the topology of the distribution?
Measure the global properties of the Universe:
– How much dark matter is there?
Local census of the galaxy population:
– How did galaxies form?
Find the most distant objects in the Universe:
– What are the highest quasar redshifts?
The ‘Naught’ Problem
What are the global parameters of the Universe?
– H0, the Hubble constant: 55-75 km/s/Mpc
– Ω0, the density parameter: 0.25-1
– λ0, the cosmological constant: 0-0.7
Their values are still quite uncertain today...
Goal: measure these parameters with an accuracy of a few percent
High Precision Cosmology!
The Cosmic Genome Project
The SDSS will create the ultimate map
of the Universe, with much more detail
than any other measurement before
[Figure: redshift survey slices compared: Gregory and Thompson 1978; deLapparent, Geller and Huchra 1986; daCosta et al. 1995; SDSS Collaboration 2002]
The Spectroscopic Survey
Measure redshifts of objects → distance
SDSS Redshift Survey:
1 million galaxies
100,000 quasars
100,000 stars
Two high throughput spectrographs
spectral range 3900-9200 Å.
640 spectra simultaneously.
R=2000 resolution.
Automated reduction of spectra
Very high sampling density and completeness
Objects in other catalogs also targeted
The First Quasars
Three of the four highest redshift
quasars have been found in the
first SDSS test data!
SDSS Data Products
– Object catalog (parameters of >10^8 objects): 400 GB
– Redshift catalog (parameters of 10^6 objects): 2 GB
– Atlas images (5-color cutouts of >10^9 objects): 1.5 TB
– Spectra (10^6, in one-dimensional form): 60 GB
– Derived catalogs (clusters, QSO absorption lines): 60 GB
– 4x4 pixel all-sky map (heavily compressed, 5 x 10^5): 1 TB
All raw data saved in a tape vault at Fermilab
Concept of the SDSS Archive
[Diagram: Operational Archive (raw + processed data) feeds the Science Archive (products accessible to users), which interoperates with other archives]
Parallel Query Implementation
• Getting 200 MBps/node through SQL today
• = 4 GB/s on a 20-node cluster (sketch below)
[Diagram: user interface and analysis engine talk to a master SX engine over a DBMS federation of slave DBMS nodes, each on its own RAID storage]
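A hedged sketch of the master/federation pattern in this diagram (Python, with in-memory sqlite3 standing in for each slave DBMS; all names are illustrative):

```python
# Sketch: a master fans a query out to per-partition DBMS "slaves"
# (sqlite3 in-memory databases here) and unions the results.
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_slave(rows):
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE obj (id INTEGER, mag REAL)")
    db.executemany("INSERT INTO obj VALUES (?, ?)", rows)
    return db

slaves = [make_slave([(i, i * 0.1) for i in range(p, 1000, 4)])
          for p in range(4)]                       # 4 data partitions

def run(sql):
    with ThreadPoolExecutor() as pool:             # query all slaves in parallel
        parts = pool.map(lambda db: db.execute(sql).fetchall(), slaves)
    return [row for part in parts for row in part]

print(len(run("SELECT * FROM obj WHERE mag < 10")))   # 100 rows, from all partitions
```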
What we have been doing with SDSS
• Helping move the data to SQL
– Database design
– Data loading
– Spatial data access method
[Chart: color magnitude diff/ratio distribution; counts (1 to 10^7, log scale) vs magnitude diff/ratio (-30 to 30) for u-g, g-r, r-i, i-z]
• Experimenting with queries on a 4 M object DB
– 20 questions like "find gravitational lens candidates"
– Queries use parallelism; most run in a few seconds (auto parallel)
– Some run in hours (neighbors within 1 arcsec; see the zone sketch below)
– EASY to ask questions.
• Helping with an "outreach" website: SkyServer
• Personal goal:
Try data mining techniques to "re-discover" Astronomy
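The slow case above is the all-pairs neighbor search; a common remedy is a zone/bucket index so each object is compared only against nearby declination zones. A toy sketch (Python; a generic zone illustration, not the SDSS spatial access method):

```python
# Sketch: neighbors within eps degrees via declination zones, so each object
# is compared only to objects in its own and adjacent zones (flat-sky approx).
from collections import defaultdict
from math import hypot

def neighbors(objs, eps=1.0 / 3600):            # 1 arcsec, in degrees
    zones = defaultdict(list)
    for ra, dec in objs:
        zones[int(dec // eps)].append((ra, dec))
    pairs = set()
    for z, members in zones.items():
        for p in members:
            for dz in (-1, 0, 1):               # own zone plus neighbors
                for q in zones.get(z + dz, []):
                    if p < q and hypot(p[0] - q[0], p[1] - q[1]) <= eps:
                        pairs.add((p, q))
    return pairs

close = neighbors([(10.0, 20.0), (10.0, 20.0 + 0.5 / 3600), (11.0, 20.0)])
print(close)    # the one pair separated by 0.5 arcsec
```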
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
• Research driven by apps:
– EOS/DIS
– TerraServer
– National Virtual Astronomy Observatory.