Managing Data for the World Wide Telescope aka: The Virtual Observatory Jim Gray

advertisement
Managing Data for
the World Wide Telescope
aka: The Virtual Observatory
Jim Gray
Alex Szalay
SLAC Data Management Workshop
1
The Evolution of Science
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions.
• Computational Science
– Simulate analytical model
– Validate model and makes predictions
• Data Exploration Science
Data captured by instruments
Or data generated by simulator
– Processed by software
– Placed in a database / files
– Scientist analyzes database / files
2
Information Avalanche
• In science, industry, government,….
– better observational instruments and
– and, better simulations
producing a data avalanche
Image courtesy
C. Meneveau & A. Szalay @ JHU
• Examples
– BaBar: Grows 1TB/day
2/3 simulation Information
1/3 observational Information
– CERN: LHC will generate 1GB/s .~10 PB/y
– VLBA (NRAO) generates 1GB/s today
– Pixar: 100 TB/Movie
BaBar, Stanford
P&E Gene Sequencer From
http://www.genome.uci.edu/
• New emphasis on informatics:
– Capturing, Organizing,
Summarizing, Analyzing, Visualizing
3
Space Telescope
The Big Picture
Experiments &
Instruments
Other Archives
Literature
questions
facts
facts
?
answers
Simulations
The Big Problems
•
•
•
•
•
•
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it
How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling
4
FTP - GREP
• Download (FTP and GREP) are not adequate
–
–
–
–
You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years.
• Oh!, and 1PB ~3,000 disks
• At some point we need
indices to limit search
parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
– Bring the analysis to the data!
5
The Speed Problem
• Many users want to search the whole DB
ad hoc queries, often combinatorial
• Want ~ 1 minute response
• Brute force (parallel search):
– 1 disk = 50MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
– 1,000x less equipment: 1M$/PB
• Pre-compute answer
– No one knows how do it for all questions.
6
Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N2, likelihood techniques N3
• As data and computers grow at same rate,
we can only keep up with N logN
• A way out?
– Relax notion of optimal
(data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Combination of statistics & computer science
7
Analysis and Databases
• Much statistical analysis deals with
–
–
–
–
–
–
–
–
–
Creating uniform samples –
data filtering
Assembling relevant subsets
Estimating completeness
censoring bad data
Counting and building histograms
Generating Monte-Carlo subsets
Likelihood calculations
Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to
8
Mohamed.
Organization & Algorithms
• Use of clever data structures (trees, cubes):
–
–
–
–
Up-front creation cost, but only N logN access cost
Large speedup during the analysis
Tree-codes for correlations (A. Moore et al 2001)
Data Cubes for OLAP (all vendors)
• Fast, approximate heuristic algorithms
– No need to be more accurate than cosmic variance
– Fast CMB analysis by Szapudi et al (2001)
• N logN instead of N3 => 1 day instead of 10 million years
• Take cost of computation into account
– Controlled level of accuracy
– Best result in a given time, given our computing resources
9
World Wide Telescope
Virtual Observatory
http://www.ivoa.net/
• Premise:
Most data is (or could be online)
• The Internet is the world’s best telescope:
–
–
–
–
It has data on every part of the sky
In every measured spectral band: optical, x-ray, radio..
As deep as the best instruments (2 years ago).
It is up when you are up.
The “seeing” is always great
(no working at night, no clouds no moons no..).
– It’s a smart telescope:
links objects and data
to literature on them.
10
Why Astronomy?
• Community has lots of data
• Data is real and well documented
– High-dimensional (with confidence intervals)
– Spatial, temporal
•
Diverse and distributed
– Many different instruments from
many different places and
many different times
• Community wants to share/cross compare
– Can freely share data and algorithms.
– “DataMining, Not Data MINE!!” Mark Ellisman, UCSD
• They are well organized
• Community is small and homogeneous
• No commercial or privacy concerns
– All the problems are technical or social.
11
The WWT Components
• Data Sources
– Literature
– Archives
• Unified Definitions
– Units,
– Semantics/Concepts/Metrics,
Representations,
– Provenance
• Object model
• Classes and methods
• Portals
12
Data Sources
• Literature online and cross indexed
– Simbad, ADS, NED,
http://simbad.u-strasbg.fr/Simbad, http://adswww.harvard.edu/, http://nedwww.ipac.caltech.edu/
• Many curated archives online
– FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizeR,…
– Typically files with English meta-data and some programs
• Groups, Researchers, Amateurs Publish
– Datasets online in various formats
– Data publications are ephemeral (may disappear)
– Many have unknown provenance
• Documentation varies; some good and some none. 13
Unified Definitions
• Universal Content Definitions
http://vizier.u-strasbg.fr/doc/UCD.htx
– Collated all table heads from all the literature
– 100,000 terms reduced to ~1,500
– Rough consensus that this is the right thing.
– Refinement in progress as people use UCDs
• Defines
– Units:
• gram, radian, second, janski...
– Semantic Concepts / Metrics
• Std error, Chi2 fit, magnitude, flux @ passband, velocity,
14
Provenance
• Most data will be derived.
• To do science,
need to trace derived data back to source.
• So programs and inputs must be registered.
• Must be able to re-run them.
• Example: Space Telescope Calibrated Data
– Run on demand
– Can specify software version (to get old answers)
• Scientific Data Provenance and Curation are
largely unsolved problems
(some ideas but no science).
15
Object Model
Your
• General acceptance of XML
program
• Recent acceptance of XML Schema
(XSD over DTD)
Web
Server
• Wait-and-See about SOAP/WSDL/…
– “ Web Services are just Corba with angle
brackets.”
– FTP is good enough for me.
• Personal opinion:
– Web Services are much more than
“Corba + <>”
– Huge focus on interop
– Huge focus on integrated tools
Your
program
Data
• But the community says “Show me!” In your
address
– Many technologists convinced,
space
but not yet the astronomers
Web
Service
16
Classes and Methods
Your
program
• First Class: VO table
http://www.us-vo.org/VOTable/
– Represents an answer set in XML
Web
Service
Data
In your
address
space
• Defined by an XML Schema (XSD)
• Metadata (in terms of UCDs)
• Data representation (numbers and text)
– First method
• Cone Search: Get objects in this cone
http://voservices.org/cone/
17
Other Classes
Your
program
• Space-Time class
– http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf
• Image Class (returns pixels)
– SdssCutout
– Simple Image Access Protocol
Web
Service
Data
In your
address
space
http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf
– HyperAtlas
http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
• Spectral
– Simple Spectral Access Protocol
– 500K spectra available at http://voservices.net/wave
• Query Services
– ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/
– And http://SkyQuery.Net
• Registry:
– see below
18
The Registry
• UDDI seemed inappropriate
– Complex
– Irrelevant questions
– Relevant questions missing
• Evolved Dublin Core
– Represent Datasets, Services, Portals
– Needs to be machine readable
– Federation (DNS model)
– Push & Pull: register then harvest
• http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg
19
Demo
• SkyServer:
– navigator showing cutout web service
– List: showing many calls and variant use.
• SkyQuery:
– Show integration of various archives.
– Explain spatial join xMatch operator.
20
SkyServer.SDSS.org
• A modern Astronomy archive
– Raw Pixel data lives in file servers
– Catalog data (derived objects) lives in Database
– Online query to any and all
• Also used for education
– 150 hours of online Astronomy
– Implicitly teaches data analysis
• Interesting things
–
–
–
–
–
–
Spatial data search
Client query interface via Java Applet
Query interface via Emacs
Popular
Cloned by other surveys (a template design)
Web services are core of it.
21
SkyQuery
A Prototype WWT
• Started with SDSS data and schema
• Imported12 other datasets
into that spine schema.
(a day per dataset plus load time)
• Unified them with a portal
• Implicit spatial join among the datasets.
• All built on Web Services
– Pure XML
– Pure SOAP
– Used .NET toolkit
22
Federation: SkyQuery.Net
• Combine 4 archives initially
• Added 9 more
• Send query to portal,
portal joins data from archives.
• Problem: want to do multi-step data analysis
(not just single query).
• Solution: Allow personal databases on portal
• Problem: some queries are monsters
• Solution: “batch schedule” on portal server,
Deposits answer in personal database.
23
SkyQuery Structure
• Portal is
• Each SkyNode publishes
– Plans Query (2 phase) – Schema Web Service
– Integrates answers
– Database Web Service
– Is a web service
Image
Cutout
SDSS
INT
SkyQuery
Portal
FIRST
2MASS
24
MyDB
http://skyservice.pha.jhu.edu/devel/casjobs/
• Portal allows federation of data but…
• Intermediate results may be large.
• Intermediate results
feed into next analysis step.
• Sending them back-and-forth to client is
costly and sometimes infeasible.
• Solution: create a working DB for client at
Portal: MyDB
25
MyDB
http://skyservice.pha.jhu.edu/devel/casjobs/
• Anyone can create a personal DB at
SkyServer portal.
– It is about 100 MB
– It is private
•
•
•
•
•
Simple queries done immediately
Complex queries done by batch scheduler
All queries can create/read/write MyDB tables
Very popular with “serious” users.
MyDB will be sharable with by a group.
26
Open SkyQuery
• SkyQuery being adopted by AstroGrid as
reference implementation for OGSA-DAI
(Open Grid Services Architecture, Data Access and Integration).
• SkyNode basic archive object
http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode
• SkyQuery Language (VoQL) is evolving.
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL
27
The WWT Components
Outline
What we learned
• Data Sources
• Astro is a community of 10,000
• Homogenous & Cooperative
• If you can’t do it for Astro,
do not bother with 3M bio-info.
• Agreement
– Literature
– Archives
• Unified Definitions
– Units,
– Semantics/Concepts/Metrics,
Representations,
– Provenance
•
•
•
•
– Takes time
– Takes endless meetings
• Big problems are non-technical
Object model
– Legacy is a big problem.
Classes and methods
• Plumbing and tools are there
Portals
But…
WWT is a poster child for
– What is the object model?
the Data Grid.
– What do you want to save?
– How document provenance?
28
Download