Microsoft Research
SKYSERVER
Jim Gray
Distinguished Engineer
Microsoft Research
San Francisco
Microsoft Research
Organization goal:
Advance state of the art
More than 700 staff, 55 areas
Labs in US, Europe, Asia
Internationally recognized teams
University organizational model
Open research environment
Close ties to universities
Close working relations with development.
My Research Goal
Information at your fingertips
Bring all scientific literature and data online
Focus on large database issues,
and scalable servers.
[Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations supply facts; questions go in and answers come out.]
World Wide Telescope
Premise: Most Astronomy data is online
The Internet is the world’s best telescope
It has data on every part of the sky
In every measured spectral band:
As deep as the best instruments
It is up when you are up.
The “seeing” is always great
(no working at night, no clouds, no moons, no...).
It’s a smart telescope:
links data with literature.
SkyServer.SDSS.org
Built with Johns Hopkins U.
A modern archive
Raw data in file servers
Catalog data (derived objects) in Database
10 billion records, 2 TB
Also used for education
150 hours of online Astronomy
Interesting things
Based on Web Services
Spatial data search
Cloned by other surveys
(a design template)
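A minimal sketch of what a spatial search over catalog data can look like: a cone search that returns every object within a given angular radius of a sky position. The function names and the three-row catalog are hypothetical, made up for illustration, and the linear scan is a simplification; a production archive would use a spatial index rather than test every row.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (ra, dec) points in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form stays accurate for the small separations typical of a cone search.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(objects, ra, dec, radius_deg):
    """Return the objects lying within radius_deg of the position (ra, dec)."""
    return [o for o in objects
            if angular_separation_deg(o["ra"], o["dec"], ra, dec) <= radius_deg]

# Hypothetical catalog rows, not real survey data.
catalog = [
    {"id": 1, "ra": 185.00, "dec": 0.00},
    {"id": 2, "ra": 185.01, "dec": 0.01},
    {"id": 3, "ra": 190.00, "dec": 5.00},
]
hits = cone_search(catalog, ra=185.0, dec=0.0, radius_deg=0.05)
```

Here `hits` holds objects 1 and 2; object 3 lies several degrees away and is filtered out.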
Service Oriented Architecture
Data Federations of Web Services
Massive datasets live near their owners:
Near instrument software pipeline, apps
Near data knowledge and curation
Each Archive publishes a web service
Schema: documents the data
Methods on objects (queries)
Uniform access to multiple Archives
A common global schema
Scientists get “personalized” extracts
SkyQuery Structure
Each SkyNode publishes:
Schema Web Service
Data Query Web Service
Portal:
Plans Query (2 phase)
Integrates answers
Is itself a web service
[Diagram: SkyQuery Portal connected to Image Cutout, SDSS, FIRST, 2MASS, and INT]
Federation:
SkyQuery.Net
Combines 15 archives
Send query to portal,
portal joins data from archives.
Problem: want to do multi-step data analysis
(not just single query).
Solution: Allow personal databases on portal
Problem: some queries are monsters
Solution: “batch scheduler” on portal server,
Deposits answer in personal db.
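The portal pattern above can be sketched in a few lines. This is a toy model under stated assumptions, not SkyQuery's actual interface: the `SkyNode` and `Portal` classes, their method names, the join on a shared `obj_id` (a real cross-match would be spatial), and all row values are hypothetical.

```python
from collections import deque

class SkyNode:
    """Hypothetical stand-in for one archive's web service."""
    def __init__(self, name, rows):
        self.name = name
        self._rows = rows  # list of dicts, each with an "obj_id"

    def schema(self):
        # Schema service: lists the columns this archive exposes.
        return sorted({key for row in self._rows for key in row})

    def count(self, predicate):
        # Lets the portal estimate result sizes while planning.
        return sum(1 for row in self._rows if predicate(row))

    def query(self, predicate):
        # Data query service: returns the matching rows.
        return [row for row in self._rows if predicate(row)]

class Portal:
    def __init__(self, nodes):
        self.nodes = nodes
        self.personal_db = {}  # user -> {table name -> rows}
        self.queue = deque()   # pending batch jobs

    def federated_query(self, predicate, key="obj_id"):
        # Phase 1: count matches per node and visit the smallest result first.
        plan = sorted(self.nodes, key=lambda n: n.count(predicate))
        # Phase 2: fetch rows node by node, keeping only keys seen in every node.
        joined = None
        for node in plan:
            rows = {r[key]: r for r in node.query(predicate)}
            if joined is None:
                joined = {k: dict(r) for k, r in rows.items()}
            else:
                joined = {k: {**joined[k], **rows[k]} for k in joined if k in rows}
        return sorted((joined or {}).values(), key=lambda r: r[key])

    def submit(self, user, table, predicate):
        # Large queries are queued instead of run interactively.
        self.queue.append((user, table, predicate))

    def run_batch(self):
        # The batch scheduler deposits each answer in the user's personal database.
        while self.queue:
            user, table, predicate = self.queue.popleft()
            self.personal_db.setdefault(user, {})[table] = self.federated_query(predicate)

# Made-up rows for illustration only.
sdss = SkyNode("SDSS", [{"obj_id": 1, "dec": 1.0, "r_mag": 17.2},
                        {"obj_id": 2, "dec": -3.0, "r_mag": 18.9},
                        {"obj_id": 3, "dec": 2.0, "r_mag": 16.0}])
first = SkyNode("FIRST", [{"obj_id": 1, "dec": 1.0, "flux_mjy": 3.4},
                          {"obj_id": 2, "dec": -3.0, "flux_mjy": 1.1}])
portal = Portal([sdss, first])
portal.submit("alice", "north_matches", lambda r: r["dec"] > 0)
portal.run_batch()
```

After `run_batch()`, the table `north_matches` in the personal database for `alice` holds the single object present in both archives with `dec > 0`, ready for a follow-up analysis step.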
Current Status: CERN → Pasadena
Multi-stream TCP/IP 7.1 Gbps ~900 MBps
New speed record @ http://ultralight.caltech.edu/lsr-winhec/
Single-stream TCP/IP 6.5 Gbps ~800 MBps
File Transfer Speed
~450 MBps
[Chart: transfer speed in Mbps (axis 0 to 7,000) by year, 2000 to 2005]
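The slide quotes each network rate twice, once in gigabits per second and once in megabytes per second. A short check of that conversion, assuming decimal units (1 Gbps = 1000 Mbps) and 8 bits per byte; the function name is illustrative.

```python
def gbps_to_mbytes_per_s(gbps):
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8

multi_stream = gbps_to_mbytes_per_s(7.1)   # about 887.5, quoted as ~900 MBps
single_stream = gbps_to_mbytes_per_s(6.5)  # about 812.5, quoted as ~800 MBps

# Share of the multi-stream network rate reached by the ~450 MBps file transfer.
file_transfer_share = 450 / multi_stream
```

Under these assumptions the file transfer figure comes to roughly half of the multi-stream network rate.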
Challenge: Move Data from CERN
to Remote Centers @ 1GBps
• Disk-to-Disk
• gigabyte/second data rates
• 80TB/day
• 30 petabytes by 2008
• 1 exabyte by 2014
[Diagram: tiered data distribution from the Experiment (~PBps) through a Filter (~1 GBps) to CERN, then Tier 1 (~5 GBps; INP3, RAL, INFN, FNAL, …), Tier 2 (~1 GBps), Tier 3 institutes with a physics data cache (~1 GBps), and Tier 4 workstations (.1 GBps)]
Graphics courtesy of Harvey Newman @ Caltech
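A quick check of how the daily volume relates to the 1 GBps target, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes). The function names are illustrative, and the second calculation is a back-of-envelope estimate rather than a figure from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_rate_gbytes_per_s(terabytes_per_day):
    """Average rate in GBps needed to move the given daily volume."""
    return terabytes_per_day * 1e12 / SECONDS_PER_DAY / 1e9

def days_to_move(petabytes, gbytes_per_s):
    """Days needed to move a volume at a constant rate."""
    seconds = petabytes * 1e15 / (gbytes_per_s * 1e9)
    return seconds / SECONDS_PER_DAY

rate = sustained_rate_gbytes_per_s(80)  # a little under 1 GBps
days = days_to_move(30, 1.0)            # close to a year at a constant 1 GBps
```

So 80TB/day corresponds to a sustained rate just under 1 GBps, and moving 30 petabytes at that rate would take most of a year.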
Summary
Microsoft Research is active inside and outside
Microsoft.
World Wide Telescope is coming
Exemplifies service oriented architecture
Built with web services and databases
Has interesting spatial database algorithms
10 Gbps networking is coming,
x64 is coming,
and we are investing to make them real.
Details on my website:
http://research.microsoft.com/~Gray
© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only.
MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.