Data Explosion:
Science with Terabytes
Alex Szalay, JHU
and Jim Gray, Microsoft Research
Living in an Exponential World
• Astronomers have a few hundred TB now
  – 1 pixel (byte) / sq arc second ~ 4 TB
  – Multi-spectral, temporal, … → 1 PB
• They mine it looking for
  – new (kinds of) objects, or
  – more of the interesting ones (quasars),
  – density variations in 400-D space
  – correlations in 400-D space
• Data doubles every year
• Data is public after 1 year
• So, 50% of the data is public
• Same access for everyone
[Chart: total data volume (TB, log scale from 0.1 to 1000) vs. year, 1970–2000, for CCDs vs. glass (photographic plates)]
The Challenges
• Exponential data growth: distributed collections, soon Petabytes
• New analysis paradigm: data federations, move analysis to the data
• New publishing paradigm: scientists are publishers and curators
[Diagram: the lifecycle Data Collection → Discovery and Analysis → Publishing]
New Science: Data Exploration
• Data growing exponentially in many different areas
– Publishing so much data requires a new model
• Multiple challenges for different communities
– publishing, data mining, data visualization, digital libraries, education, a web-services “poster child”
• Information at your fingertips:
– Students see the same data as professional astronomers
• More data coming: Petabytes/year by 2010
– We need scalable solutions
– Move analysis to the data!
• Same thing happening in all sciences
– High energy physics, genomics, cancer research,
medical imaging, oceanography, remote sensing, …
• Data Exploration: an emerging new branch of science
• Currently has no owner…
Advances at JHU
• Designed and built the science archive for the SDSS
– Currently 2 Terabytes, soon to reach 3 TB
– Built fast spatial search library
– Created novel pipeline for data loading
– Built the SkyServer, a public access website for SDSS, with over 45M web hits and millions of free-form SQL queries
• Built the first web-services used in science
– SkyQuery, ImgCutout, various visualization tools
• Leading the Virtual Observatory effort
• Heavy involvement in Grid Computing
• Exploring other areas
Collaborative Projects
• Sloan Digital Sky Survey (11 institutions)
• National Virtual Observatory (17 institutions)
• International Virtual Observatory Alliance (14 countries)
• Grid For Physics Networks (10 institutions)
• Wireless sensors for Soil Biodiversity (BES, Intel, UCB)
• Digital Libraries (JHU, Cornell, Harvard, Edinburgh)
• Hydrodynamic Turbulence (JHU Engineering)
• Informal exchanges with NCBI
Directions
We understand how to mine a few terabytes. Directions:
1. We built an environment: our tools now enable new breakthroughs in astrophysics
2. Open collaborations beyond astrophysics (turbulence, sensor-driven biodiversity, bioinformatics, digital libraries, education, …)
3. Attack problems at the 100 Terabyte scale, and prepare for the Petabytes of tomorrow
The JHU Core Group
Faculty
• Alex Szalay
• Ethan Vishniac
• Charles Meneveau
Graduate Students
• Tanu Malik
• Adrian Pope
Postdoctoral Fellows
• Tamas Budavari
Research Staff
• George Fekete
• Vivek Haridas
• Nolan Li
• Will O’Mullane
• Maria Nieto-Santisteban
• Jordan Raddick
• Anirudha Thakar
• Jan Vandenberg
Examples
I. Astrophysics inside the database
II. Technology sharing in other areas
III. Beyond Terabytes
I. Astrophysics in the DB
• Studies of galaxy clustering
– Budavari, Pope, Szapudi
• Spectro Service: Publishing spectral data
– Budavari, Dobos
• Cluster finding with a parallel DB-oriented
workflow system
– Nieto-Santisteban, Malik, Thakar, Annis, Sekhri
• Complex spatial computations inside the DB
– Fekete, Gray, Szalay
• Visual tools with the DB
– ImgCutout (Nieto), Geometry viewer (Szalay),
Mirage+SQL (Carlisle)
The SDSS Photo-z Sample
• Full sample: 50M; mr < 21: 15M; 10 stripes: 10M; 0.1 < z < 0.3 and Mr < -20: 2.2M
[Diagram: the 2.2M galaxies split into luminosity bins -20 > Mr > -21, -21 > Mr > -22, -21 > Mr > -23, -22 > Mr > -23 (1182k, 931k, 662k, …) and further into redshift slices of 127k–343k galaxies each]
The Analysis
• eSpICE: I. Szapudi, S. Colombi and S. Prunet
• Integrated with the database by T. Budavari
• Extremely fast processing:
– 1 stripe with about 1 million galaxies is processed in 3 mins
– Usual figure was 10 min for 10,000 galaxies => 70 days
• Each stripe processed separately for each cut
• 2D angular correlation function computed
• w(): average with rejection of
pixels along the scan
– Correlations due to flat field vector
– Unavoidable for drift scan
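For reference, the statistic being estimated here is the standard two-point angular correlation function of the projected galaxy overdensity δ_g (a textbook definition, not anything specific to eSpICE):

    w(\theta) = \langle \delta_g(\hat{n}_1)\,\delta_g(\hat{n}_2) \rangle ,
    \qquad \hat{n}_1 \cdot \hat{n}_2 = \cos\theta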
Angular Power Spectrum
• Use photometric redshifts for LRGs
• Create thin redshift slices and analyze angular clustering
• From characteristic features (baryon bumps, etc.) we obtain the angular diameter distance → Dark Energy
• Healpix pixelization in the database
• Each “redshift slice” is generated in 2 minutes
• Using SpICE over 160,000 pixels in N^1.7 time
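For context, w(θ) and the angular power spectrum C_ℓ carry the same information and are related by the standard Legendre expansion, so features measured in one appear in the other:

    w(\theta) = \sum_{\ell} \frac{2\ell + 1}{4\pi}\, C_\ell\, P_\ell(\cos\theta)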
Large Scale Power Spectrum
• Goal: measure cosmological parameters:
– Cosmological constant or Dark Energy?
• Karhunen-Loeve technique
– Subdivide slices into about 5K-15K cells
– Compute the correlation matrix of galaxy counts among cells from the fiducial P(k) + noise model
– Diagonalize the matrix
– Expand the data over the KL basis
– Iterate over parameter values: compute the new correlation matrix, invert it, then compute the log likelihood
Vogeley and Szalay (1996)
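In outline (this is the standard KL likelihood machinery; the specific fiducial P(k) and noise model are those referenced above), the parameters θ are found by maximizing the Gaussian likelihood of the KL-projected cell counts x:

    \ln\mathcal{L}(\theta) = -\tfrac{1}{2}\left[\ln\det C(\theta) + x^{T} C(\theta)^{-1} x\right] + \mathrm{const},
    \qquad C(\theta) = S(\theta) + N

where S(θ) is the model covariance of the galaxy counts and N is the noise term.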
[Figure: likelihood contours in the Ωm h – Ωb/Ωm plane, SDSS vs. WMAP]
SDSS only: Ωm h = 0.26 ± 0.04, Ωb/Ωm = 0.29 ± 0.07
With Ωb = 0.047 ± 0.006 (WMAP): Ωm h = 0.21 ± 0.03, Ωb/Ωm = 0.16 ± 0.03
SDSS: Pope et al. (2004); WMAP: Verde et al. (2003), Spergel et al. (2003)
Numerical Effort
• Most of the time is spent in data manipulation
• Fast spatial searches over data and Monte Carlo samples (SQL)
• Diagonalization of 20K x 20K matrices
• Inversions of a few 100K 5K x 5K matrices
• Has the potential to constrain Dark Energy
• Accuracy enabled by large data set sizes
• But: new kinds of problems:
  – Errors driven by the systematics, not by sample size
  – Scaling of the analysis algorithms is critical!
• Monte Carlo realizations with a few 100M points in SQL (a rough sketch follows below)
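A rough sketch of how such a Monte Carlo sample can be generated directly in SQL (hypothetical: a Numbers table holding the integers 1..N is assumed to exist; the per-row RAND(CHECKSUM(NEWID())) idiom forces a fresh random value for every row):

    -- Uniform random points on the sphere: ra uniform in [0, 360),
    -- sin(dec) uniform in [-1, 1].
    SELECT n.i AS pointID,
           360.0 * RAND(CHECKSUM(NEWID()))                      AS ra,
           DEGREES(ASIN(2.0 * RAND(CHECKSUM(NEWID())) - 1.0))   AS dec
    INTO   RandomPoints
    FROM   Numbers AS n
    WHERE  n.i <= 100000000;    -- ~100M points, as quoted above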
Cluster Finding
Five main steps (Annis et al. 2002)
1. Get Galaxy List
fieldPrep: Extracts from the main data set the measurements of interest.
2. Filter
brgSearch: Calculates the unweighted BCG likelihood for each galaxy
(unweighted by galaxy count) and discards unlikely galaxies.
3. Check Neighbors
bcgSearch: Weights BCG likelihood with the number of neighbors.
4. Pick Most Likely
bcgCoalesce: Determines whether a galaxy is the most likely galaxy in the
neighborhood to be the center of the cluster.
5. Discard Bogus
getCatalog: Removes suspicious results and produces and stores the final
cluster catalog.
SQL Server Cluster
Applying a zone strategy, the partition P is split homogeneously among 3 servers:
• S1 provides a 1 deg buffer on top
• S2 provides a 1 deg buffer on top and bottom
• S3 provides a 1 deg buffer on bottom
[Diagram: P split into sub-partitions P1, P2, P3, native to Servers 1, 2, and 3]
Total duplicated data = 4 x 13 deg². Total duplicated work (objects processed more than once) = 2 x 11 deg².
Maximum time spent by the thickest partition = 2h 15’ (other 2 servers ~ 1h 50’).
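As a rough illustration of the zone idea (not the project's actual code: the Galaxy table and its objID, zoneID, ra, dec columns are invented names, and the zone height is assumed to be at least as large as the search radius so only adjacent zones need scanning):

    -- Neighbor pairs within 1 arcmin, using a precomputed zoneID on declination.
    DECLARE @r float; SET @r = 1.0 / 60.0;    -- search radius in degrees
    SELECT g1.objID AS objID1, g2.objID AS objID2
    FROM   Galaxy AS g1
    JOIN   Galaxy AS g2
      ON   g2.zoneID BETWEEN g1.zoneID - 1 AND g1.zoneID + 1   -- adjacent zones only
     AND   g2.dec BETWEEN g1.dec - @r AND g1.dec + @r
     AND   g2.ra  BETWEEN g1.ra - @r / COS(RADIANS(g1.dec))
                      AND g1.ra + @r / COS(RADIANS(g1.dec))
    WHERE  g1.objID < g2.objID;               -- count each pair once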
SQL Server vs Files
SQL Server
Resolving a target of 66 deg² requires:
• Step A: Find Candidates
  – Input data = 108 MB covering 104 deg² (72 bytes/row x 1,574,656 rows)
  – Time ~ 6 h on a dual 2.6 GHz server
  – Output data = 1.5 MB covering 84 deg² (40 bytes/row x 40,123 rows)
• Step B: Find Clusters
  – Input data = 1.5 MB
  – Time = 20 minutes
  – Output = 0.43 MB covering 66 deg² (40 bytes/row x 11,249 rows)
Total time = 6h 20’
• Some extra space is required for indexes and other auxiliary tables.
• Scales linearly with the number of servers.
FILES
Resolving a target of 66 deg² requires:
– Input data = 66 x 4 x 16 MB ~ 4 GB
– Output data = 66 x 4 x 6 KB = 1.5 MB
– Time ~ 73 hours; using 10 nodes ~ 7.3 hours
Notes:
        Buffer     brgSearch z(0..1) step
Files   0.25 deg   0.01
SQL     0.5 deg    0.001
FILES would require 20–60 times longer to solve this problem for a 0.5 deg buffer with steps of 0.001.
II. Technology Sharing
• Virtual Observatory
• SkyServer database/website templates
– Edinburgh, STScI, Caltech, Cambridge, Cornell
• OpenSkyQuery/OpenSkyNodes
– International standard for federating astro archives
– Interoperable SOAP implementations working
• NVO Registry Web Service (O’Mullane, Greene)
• Distributed logging and harvesting (Thakar, Gray)
• MyDB: workbench for science (O’Mullane, Li)
• Publish your own data
  – À la Spectro Service, but for images and databases
• SkyServer → Soil Biodiversity
National Virtual Observatory
• NSF ITR project, “Building the Framework for the
National Virtual Observatory” is a collaboration of 17
funded and 3 unfunded organizations
– Astronomy data centers
– National observatories
– Supercomputer centers
– University departments
– Computer science/information technology specialists
• PIs: Alex Szalay (JHU), Roy Williams (Caltech)
• Connect the disjoint pieces of data in the world
• Bridge the technology gap for astronomers
• Based on interoperable Web Services
International Collaboration
• Similar efforts now in 14 countries:
– USA, Canada, UK, France, Germany, Italy, Holland, Japan,
Australia, India, China, Russia, Hungary, South Korea, ESO
• Total awarded funding world-wide is over $60M
• Active collaboration among projects
– Standards, common demos
– International VO roadmap being developed
– Regular telecons across 10 time zones
• Formal collaboration: the International Virtual Observatory Alliance (IVOA)
• Aiming to have production services by Jan 2005
Boundary Conditions
• Standards driven by evolving new technologies
  – Exchange of rich and structured data (XML…)
  – DB connectivity, Web Services, Grid computing
• Application to the astronomy domain
  – Data dictionaries (UCDs)
  – Data models
  – Protocols
  – Registries and resource/service discovery
  – Provenance, data quality
• Dealing with the astronomy legacy (the boundary conditions)
  – FITS data format
  – Software systems
Main VO Challenges
• How to avoid trying to be everything for everybody?
• Database connectivity is essential
– Bring the analysis to the data
• Core web services
• Higher level applications built on top
• Use the 90-10 rule:
– Define the standards and interfaces
– Build the framework
– Build the 10% of services that are used by 90% of the users
– Let the users build the rest from the components
[Chart: fraction of users served vs. fraction of services built, illustrating the 90-10 rule]
Core Services
• Metadata information about resources
– Waveband
– Sky coverage
– Translation of names to a universal dictionary (UCD)
– Registry
• Simple search patterns on the resources
– Spatial Search
– Image mosaic
– Unit conversions
• Simple filtering, counting, histograms
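As a concrete (if simplified) example of a core spatial-search service, a cone search can be expressed in SQL in the SkyServer style; fGetNearbyObjEq(ra, dec, radius_arcmin) is the HTM-based helper exposed by the public SkyServer database, but the exact columns joined below should be treated as assumptions:

    -- All photometric objects within 3 arcmin of (ra, dec) = (185.0, 0.0),
    -- with a simple brightness filter applied on top.
    SELECT p.objID, p.ra, p.dec, p.r
    FROM   dbo.fGetNearbyObjEq(185.0, 0.0, 3.0) AS n
    JOIN   PhotoObj AS p ON p.objID = n.objID
    WHERE  p.r < 20.0;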
Higher Level Services
• Built on Core Services
• Perform more complex tasks
• Examples
– Automated resource discovery
– Cross-identifications
– Photometric redshifts
– Image segmentation
– Outlier detection
– Visualization facilities
• Expectation:
– Build custom portals in a matter of days from existing building blocks (as is done today in IRAF or IDL)
Web Services in Progress
• Registry
– Harvesting and querying
• Data Delivery
– Query-driven queue management
– Spectro service
– Logging services
• Graphics and visualization
– Query driven vs interactive
– Show spatial objects (Chart/Navi/List)
• Footprint/intersect
– It is a “fractal”
• Cross-matching
– SkyQuery and SkyNode
– Ferris-wheel
– Distributed vs parallel
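The cross-matching services (SkyQuery/SkyNode) accept declarative queries that are decomposed and shipped to the participating archives. The sketch below is only a conceptual stand-in written as plain SQL over two hypothetical tables (SdssObj, TwomassObj); the real SkyQuery grammar and its distributed execution plan differ:

    -- Match objects from two (assumed) archives within ~3 arcsec,
    -- using a simple bounding-box condition as the join predicate.
    DECLARE @tol float; SET @tol = 3.0 / 3600.0;   -- 3 arcsec in degrees
    SELECT s.objID AS sdssID, t.objID AS twomassID
    FROM   SdssObj    AS s
    JOIN   TwomassObj AS t
      ON   t.dec BETWEEN s.dec - @tol AND s.dec + @tol
     AND   t.ra  BETWEEN s.ra - @tol / COS(RADIANS(s.dec))
                     AND s.ra + @tol / COS(RADIANS(s.dec));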
MyDB: eScience Workbench
• Prototype of bringing analysis to the data
• Everybody gets a workspace (database)
• Executes analysis at the data
• Store intermediate results there
• Long queries run in batch
• Results shared within groups
• Only fetch the final results
• Extremely successful – matches the pattern of work
• Next steps: multiple locations, single authentication
• Farther down the road: parallel workflow system
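A minimal sketch of the MyDB pattern ("mydb" denotes the user's personal database; the table and column names are illustrative, not the exact schema): a long batch query writes its result server-side, and only the finished product is downloaded.

    -- Batch query: keep the intermediate result in the user's own database.
    SELECT p.objID, p.ra, p.dec, p.r AS rMag, s.z AS redshift
    INTO   mydb.BrightLRG
    FROM   PhotoObj AS p
    JOIN   SpecObj  AS s ON s.bestObjID = p.objID
    WHERE  s.z BETWEEN 0.1 AND 0.3
      AND  p.r < 19.5;
    -- Later: fetch only the final (small) result.
    SELECT COUNT(*) FROM mydb.BrightLRG;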
eEducation Prototype
• SkyServer Educational Projects, aimed at advanced high-school students but also covering middle school
• Teach how to analyze data, discover patterns,
not just astronomy
• 3.7 million project hits,
1.25 million page views
of educational content
• More than 4000 textbooks
• On the whole web site: 44 million web hits
• Largely a volunteer effort by many individuals
• Matches the 2020 curriculum
[Chart: SkyServer project page views per month, Jul 2001 – early 2004, on a scale up to 80,000 page views]
Soil Biodiversity
How does soil biodiversity affect ecosystem
functions, especially decomposition and nutrient
cycling in urban areas?
• JHU is part of the Baltimore Ecosystem Study,
one of the NSF LTER monitoring sites
• High resolution monitoring will capture
– Spatial heterogeneity of environment
– Change over time
Sensor Monitoring
• Plan: use 400 wireless (Intel) sensors, monitoring:
  – Air temperature and moisture
  – Soil temperature and moisture, at least at two depths (5 cm, 20 cm)
  – Light (intensity, composition)
  – Gases (O2, CO2, CH4, …)
• Long-term continuous data
• Small (hidden) and affordable (many)
• Less disturbance
• 200 million measurements/year
• Collaboration with Intel and UCB (PI: Szlavecz, JHU)
• Complex database of sensor data and samples (a hypothetical schema sketch follows)
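A hypothetical minimal layout for such a sensor archive (all names and types below are illustrative assumptions, not the project's actual design):

    -- One row per reading; ~200 million rows/year at the planned rate.
    CREATE TABLE SensorReading (
        sensorID  int          NOT NULL,   -- which of the ~400 sensors
        obsTime   datetime     NOT NULL,   -- time of measurement
        quantity  varchar(32)  NOT NULL,   -- e.g. 'soilTemp_5cm', 'CO2'
        value     float        NOT NULL,
        PRIMARY KEY (sensorID, obsTime, quantity)
    );
    CREATE TABLE Sensor (
        sensorID  int   PRIMARY KEY,
        lat       float NOT NULL,          -- location within the plot
        lon       float NOT NULL,
        depth_cm  float NULL               -- NULL for above-ground sensors
    );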
III. Beyond Terabytes
• Numerical simulations of turbulence
– 100 TB across multiple SQL Servers
– Storing each timestep, enabling backtracking to initial
conditions
– Also fundamental problem in cosmological simulations of
galaxy mergers
• Will teach us how to do scientific analysis of 100TBs
• By the end of the decade several PB / year
– One needs to demonstrate fault tolerance, fast enough
loading speeds…
Exploration of Turbulence
For the first time, we can now “put it all together”:
• Large scale range, scale-ratio O(1,000)
• Three-dimensional in space
• Time evolution and Lagrangian approach (follow the flow)
Unique turbulence database:
• We will create a database of O(2,000) consecutive snapshots of a 1,024^3 simulation of turbulence: close to 100 Terabytes
• Analysis cluster on top of the DB
• Treat it as a physics experiment, change configurations every 2 months
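The kind of query such a database is meant to serve might look like the sketch below (purely illustrative; the Velocity table and its columns are assumptions): pull the velocity field in a small sub-cube at one timestep, so an analysis code or a Lagrangian particle tracker only touches the data it needs.

    -- Fetch velocities in a 32^3 sub-cube of the 1,024^3 grid at one snapshot.
    SELECT x, y, z, vx, vy, vz
    FROM   Velocity
    WHERE  timestep = 1000
      AND  x BETWEEN 256 AND 287
      AND  y BETWEEN 256 AND 287
      AND  z BETWEEN 256 AND 287;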
Computational Layer: 128 compute nodes, dual Xeon; 2 GB RAM/node = 256 GB total; 30 GB disk/node = 3.8 TBytes total, plus a 2 TB RAID5 disk system.
Interconnect Layer: 8 Gigabit Ethernet switches, 12 ports each; 1 GByte/sec throughput across layers.
Data Access Layer: 32 database servers, dual Xeon; 3.2 TB disk/node = 102 TB total; 300 MB/sec/node = 9.6 GByte/sec aggregate data access speed.
LSST
• Large Synoptic Survey Telescope (2012)
• Few PB/yr data rate
• Repeats the SDSS survey every 4 nights
• Main issue is data management
• Data volume similar to high energy physics, but object granularity is needed
• Very high resolution time series, moving objects…
• Need to build 100TB scale prototypes today
• Hierarchical organization of data products
The Big Picture
[Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations all feed facts into a common archive; scientists pose questions and get back answers]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Query and Vis tools
• Support/training
• Performance
  – Execute queries in a minute
  – Batch query scheduling
• Download
→ new SCIENCE!