Science In An Exponential World
Alexander Szalay, JHU
Jim Gray, Microsoft Research
Evolving Science
A thousand years ago:
Science was empirical
Describing natural phenomena
Last few hundred years:
A theoretical branch
Using models and generalizations
Last few decades:
A computational branch
Simulating complex phenomena
Today: data exploration (e-science)
Synthesizing theory, experiment, and computation with advanced data management and statistics → new algorithms!
Exponential World of Data
Astronomers have a few hundred TB now
1 pixel (byte) / sq arc second ~ 4 TB
Multi-spectral, temporal, … → 1 PB
They mine it looking for new (kinds of) objects or more of interesting ones (quasars), density variations in multi-D space, spatial and parametric correlations
Data doubles every year
Same access for everyone
[Chart: total data volume (log scale, 0.1 to 1000) vs. year, 1970-2000, CCDs vs. glass photographic plates]
The Challenges
Exponential data growth: distributed collections, soon Petabytes
New analysis paradigm: data federations, move analysis to the data
New publishing paradigm: scientists are publishers and curators
[Diagram: Data Collection → Discovery and Analysis → Publishing]
Publishing Data

Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists
Exponential growth
Projects last at least 3-5 years
Data sent upwards only at the end of the project
Data will never be centralized
More responsibility on projects
Becoming Publishers and Curators
Data will reside with projects
Analyses must be close to the data
Making Discoveries
Where are discoveries made?
At the edges and boundaries
Going deeper, collecting more data, using more dimensions
Metcalfe’s law
Utility of computer networks grows as the number of possible connections: O(N²)
Federating data
A federation of N archives has utility O(N²)
Possibilities for new discoveries grow as O(N²)
Data Access Is Hitting a Wall
FTP and GREP are not adequate
You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years
You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (= 1 $/GB)
… 1 TB takes 2 days and 1K$
… 1 PB takes 3 years and 1M$
Oh, and 1 PB is ~4,000 disks
At some point you need:
indices to limit the search
parallel data search and analysis
This is where databases can help
If there is too much data to move around, take the analysis to the data!
Do all data manipulations in the database
Build custom procedures and functions in the database (a sketch follows below)
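As a minimal sketch of what "analysis next to the data" looks like (the table, column, and index names here are hypothetical, not the actual SDSS schema), an indexed in-database search replaces FTPing terabytes of flat files and GREPping them:

```sql
-- Hypothetical catalog table; names do not match the real SDSS schema.
CREATE TABLE PhotoObj (
    objID  BIGINT PRIMARY KEY,
    ra     FLOAT,                              -- right ascension (degrees)
    [dec]  FLOAT,                              -- declination (degrees)
    u REAL, g REAL, r REAL, i REAL, z REAL     -- magnitudes in the 5 SDSS bands
);

-- A B-tree index on the r magnitude lets the query below seek into a
-- narrow magnitude slice instead of scanning the whole table.
CREATE INDEX idx_PhotoObj_r ON PhotoObj (r);

-- Color-cut search for blue, quasar-like candidates, run where the data lives.
SELECT objID, ra, [dec], u - g AS ug, g - r AS gr
FROM   PhotoObj
WHERE  r BETWEEN 18.0 AND 20.0      -- indexed magnitude slice
  AND  u - g < 0.4                  -- color cuts applied inside the slice
  AND  g - r < 0.2;
```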
Next-Generation Data Analysis
Looking for:
Needles in haystacks – the Higgs particle
Haystacks: dark matter, dark energy
Needles are easier than haystacks
‘Optimal’ statistics have poor scaling
Correlation functions are N², likelihood techniques N³
For large data sets the main errors are not statistical
As data and computers grow with Moore’s Law, we can only keep up with N log N
Take the cost of computation into account
Controlled level of accuracy
Best result in a given time, given our computing resources
Requires a combination of statistics and computer science
New algorithms (see the scaling sketch below)
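To make the scaling concrete (the estimator shown is the standard pair-count form, used here only as an illustration):

\[
\hat{\xi}(r) \;\propto\; \frac{DD(r)}{RR(r)} - 1,
\qquad
DD(r)\ \text{requires}\ \binom{N}{2} = \frac{N(N-1)}{2} = O(N^2)\ \text{pair distances.}
\]

Doubling the data quadruples the cost of the naive estimator, while tree- or grid-based approximations bring it down to roughly O(N log N) at a controlled loss of accuracy, which is what keeps the analysis in step with Moore's Law growth.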
Our E-Science Projects
Sloan Digital Sky Survey/ SkyServer
Virtual Observatory
Wireless Sensor Networks
Analyzing Large Numerical Simulations
Fast Spatial Search Techniques
Commonalities
Web services
Analysis inside the database!
Why Is Astronomy Special?
Especially attractive to the general public
It has no commercial value – “worthless!” (Jim Gray)
No privacy concerns; freely share results with others
Great for experimenting with algorithms
It is real and well documented
High-dimensional (with confidence intervals)
Spatial, temporal
Diverse and distributed
Many different instruments from many different places and many different times → Virtual Observatory
The questions are interesting
There is a lot of it (soon Petabytes)
Features of the SDSS
Goal
Create the most detailed map
of the Northern sky in 5 years
“The Cosmic Genome Project”
Two surveys in one
Photometric survey in 5 bands
Spectroscopic redshift survey
Automated data reduction
150 man-years of development
Very high data volume
40 TB of raw data
5 TB processed catalogs
Data is public
The University of Chicago
Princeton University
The Johns Hopkins University
The University of Washington
New Mexico State University
Fermi National Accelerator Laboratory
US Naval Observatory
The Japanese Participation Group
The Institute for Advanced Study
Max Planck Inst, Heidelberg
Sloan Foundation, NSF, DOE, NASA
The Imaging Survey
Drift scan of 10,000 square degrees
24k x 1M pixel “panoramic” images in 5 colors – broad-band filters (u, g, r, i, z)
2.5 Terapixels of images
The Spectroscopic Survey
Expanding universe
Redshift = distance
SDSS redshift survey
1 million galaxies
100,000 quasars
100,000 stars
Two high throughput spectrographs
Spectral range 3900-9200 Å
640 spectra simultaneously
R=2000 resolution, 1.3 Å
Features
Automated reduction of spectra
Very high sampling density
and completeness
SkyServer
Sloan Digital Sky Survey: Pixels + Objects
About 500 attributes per “object”, 400M objects
Spectra for 1M objects
Currently 2.4 TB, fully public
Prototype eScience lab
Moving analysis to the data
Fast searches: color, spatial
Visual tools
Join 2.5 Terapixels of pixels with objects
Prototype in data publishing
160 million web hits in 5 years
http://skyserver.sdss.org/
[Chart: web hits/mo and SQL queries/mo, 2001/7 – 2004/7, log scale 1.E+04 – 1.E+07]
The SkyServer Experience
Sloan Digital Sky Survey: Pixels + Objects
About 500 attributes per “object”, 400M objects
Currently 2.4 TB, fully public
Prototype eScience lab (800 users)
Moving analysis to the data
Fast searches: color, spatial
Visual tools
Join pixels with objects
Prototype in data publishing
180 million web hits in 5 years
930,000 distinct users
http://skyserver.sdss.org/

SkyServer Traffic
[Chart: web hits/mo and SQL queries/mo, 2001/7 – 2004/7, log scale 1.E+04 – 1.E+07]
Public Data Release Versions
June 2001: EDR (Early Data Release)
July 2003: DR1
Contains 30% of the final data
150 million photo objects
3 versions of the data: Target, Best, Runs
Total catalog volume 5 TB
Published releases served ‘forever’
EDR, DR1, DR2, …, now at DR5
Next: include e-mail archives, annotations
[Diagram: successive releases EDR, DR1, DR2, DR3, … all kept online side by side]
O(N²) – only possible because of Moore’s Law!
Spatial Information For Users
What surveys covered this part of the sky?
What is the common area of these surveys?
Is this point in the survey?
Give me all objects in this region
Cross-match these two catalogs
Give me the cumulative counts over areas
Compute fast spherical transforms of densities
Interpolate sparsely sampled functions
Spatial Queries In SQL
Regions and convexes
Boolean algebra of spherical polygons
Indexing using spherical quadtrees (Hierarchical Triangular Mesh)
Fast spatial joins of billions of points (zone algorithm)
All implemented in T-SQL and C#, running inside SQL Server 2005 (a sketch of the zone join follows)
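A minimal sketch of the zone-algorithm cross-match, assuming hypothetical tables CatalogA and CatalogB whose rows carry a precomputed zoneID = FLOOR(([dec] + 90) / zoneHeight) (with the zone height at least the match radius) and the unit vector (cx, cy, cz) of their position; the production code additionally handles RA wrap-around and the poles:

```sql
-- Zone-based spatial join: candidate pairs are restricted by declination zone
-- and a coarse RA window, then verified with an exact angular-distance test.
-- Table and column names are hypothetical, not the SkyServer schema.
DECLARE @radius FLOAT;
SET @radius = 10.0 / 3600.0;   -- match radius: 10 arcsec, expressed in degrees

SELECT a.objID AS objA, b.objID AS objB
FROM   CatalogA AS a
JOIN   CatalogB AS b
  ON   b.zoneID BETWEEN a.zoneID - 1 AND a.zoneID + 1                   -- neighbouring dec zones
 AND   b.ra BETWEEN a.ra - @radius / COS(RADIANS(a.[dec]))              -- coarse RA window
               AND  a.ra + @radius / COS(RADIANS(a.[dec]))
WHERE  a.cx * b.cx + a.cy * b.cy + a.cz * b.cz > COS(RADIANS(@radius)); -- exact test: angle < radius
```

Because every predicate is a plain range condition, SQL Server can drive the join from ordinary B-tree indexes on (zoneID, ra), which is what makes joins of billions of points feasible inside the database.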
Things Can Get Complex
[Diagram: two overlapping regions A and B]
Green area: A ∩ (B − ε) should find B if it contains an A and is not masked
Yellow area: A ∩ (B ± ε) is an edge case; it may find B if it contains an A
Simulations
Cosmological simulations have 10⁹ particles and produce over 30 TB of data (Millennium)
Build up dark matter halos
Track the merging history of halos
Use it to assign star formation history
Combination with spectral synthesis
Realistic distribution of galaxy types
Too few realizations (now 50)
Hard to analyze the data afterwards → need a DB (a sketch of a merger-history query follows)
What is the best way to compare to real data?
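A minimal sketch of why a database helps here, using a hypothetical halo table (haloID, descendantId, snapNum, mass) rather than the actual Millennium schema: walking the full merger history of one present-day halo becomes a single recursive query instead of a custom post-processing program over 30 TB of snapshot files.

```sql
-- Recursive walk over a hypothetical merger tree: start from one z = 0 halo
-- and collect every progenitor that eventually merged into it.
WITH Progenitors AS (
    SELECT haloID, descendantId, snapNum, mass
    FROM   Halo
    WHERE  haloID = 123456789            -- hypothetical root halo at the final snapshot
    UNION ALL
    SELECT h.haloID, h.descendantId, h.snapNum, h.mass
    FROM   Halo AS h
    JOIN   Progenitors AS p ON h.descendantId = p.haloID
)
SELECT   snapNum,
         COUNT(*)  AS nProgenitors,      -- how many pieces existed at each epoch
         SUM(mass) AS totalMass          -- mass already assembled by that epoch
FROM     Progenitors
GROUP BY snapNum
ORDER BY snapNum;
```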
Trends

CMB Surveys
1990 COBE        1,000
2000 Boomerang   10,000
2002 CBI         50,000
2003 WMAP        1 Million
2008 Planck      10 Million

Time Domain
QUEST
SDSS Extension survey
Dark Energy Camera
PanStarrs: 1 PB by 2007
LSST: 100 PB by 2020

Angular Galaxy Surveys
1970 Lick     1M
1990 APM      2M
2005 SDSS     200M
2008 VISTA    1000M
2012 LSST     3000M

Galaxy Redshift Surveys
1986 CfA      3,500
1996 LCRS     23,000
2003 2dF      250,000
2005 SDSS     750,000

Petabytes/year by the end of the decade…
Exploration Of Turbulence
We can finally “put it all together”
Large scale range, scale ratio O(1,000)
Three-dimensional in space
Time evolution and Lagrangian approach (follow the flow)
Unique turbulence database
We are creating a database of O(2,000) consecutive snapshots of a 1024³ turbulence simulation, close to 100 Terabytes
Treat it as an experiment (a sketch of a subcube query follows)
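"Treating it as an experiment" means pulling small space-time subsets out of the stored snapshots on demand instead of re-running the simulation. A minimal sketch, assuming a hypothetical layout with one row per grid point per stored snapshot (not the actual schema of the turbulence database):

```sql
-- Extract a small space-time subcube of the velocity field from a
-- hypothetical Velocity(snapNum, ix, iy, iz, vx, vy, vz) table.
SELECT snapNum, ix, iy, iz, vx, vy, vz
FROM   Velocity
WHERE  snapNum BETWEEN 100 AND 110      -- a short window of consecutive snapshots
  AND  ix BETWEEN 500 AND 515           -- a 16x16x16 spatial subcube
  AND  iy BETWEEN 500 AND 515
  AND  iz BETWEEN 500 AND 515;
```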
Wireless Sensor Networks
Will use 200 wireless (Intel) sensors, monitoring:
Air temperature, moisture
Soil temperature, moisture, at two depths at least (5 cm, 20 cm)
Light (intensity, composition)
Gases (O2, CO2, CH4, …)
Long-term continuous data
Small (hidden) and affordable (many)
Less disturbance
>200 million measurements/year
Collaboration with Microsoft
Complex database of sensor data and samples
Current Sensor Database
Using the sensor deployment at JHU (Szlavecz talk)
10 motes × 5 months = 8M data points
SQL Server 2005 database
Adapted from astronomy: NVO + SkyServer
Started with “20 queries”
Rich metadata stored in the database
Data access via web services
Graphical interface
DataCube under construction in collaboration with Stuart Ozer (a multidimensional summary of the data); a sketch of a typical query follows
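A “20 queries”-style example against a hypothetical measurement table (moteID, sensorType, depthCm, timeUtc, value); the real deployment's schema may differ:

```sql
-- Daily soil-temperature summary at 5 cm depth for one month, per mote.
SELECT moteID,
       CONVERT(VARCHAR(10), timeUtc, 120) AS day,   -- yyyy-mm-dd
       AVG(value) AS avgSoilTemp,
       MIN(value) AS minSoilTemp,
       MAX(value) AS maxSoilTemp
FROM   Measurement
WHERE  sensorType = 'soil_temperature'
  AND  depthCm = 5
  AND  timeUtc >= '2006-04-01' AND timeUtc < '2006-05-01'
GROUP BY moteID, CONVERT(VARCHAR(10), timeUtc, 120)
ORDER BY moteID, day;
```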
The Big Picture
[Diagram: Experiments and Instruments, Other Archives, Literature, and Simulations exchange facts; questions go in, answers come out]
The Big Problems
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist with others?
Query and Visualization tools
Support/training
Performance
Execute queries in a minute
Batch query scheduling
Summary
Data growing exponentially
Requires a new model
Having more data makes it harder to extract knowledge
Information at your fingertips
Students see the same data as professionals
More data coming: Petabytes/year by 2010
Need scalable solutions
Move analysis to the data!
Same thing happening in all sciences
High energy physics, genomics/proteomics,
medical imaging, oceanography…
E-Science: An emerging new branch of science
We need multiple skills in a world of increasing specialization…
Microsoft Computational Science Workshop
at the Johns Hopkins University, Oct 13-15, 2006