Talk given to the ADASS 2005 conference in Oct 2005

Massive Science with VO & Grids
Bob Nichol
ICG, Portsmouth
Thanks to all my colleagues in AG2, VOtech, & PiCA
Special thanks to Chris Miller, Alex Gray, Gauri Kulkarni, Garry
Smith, Brent Bryan, Chris Genovese, Jeff Schneider
Outline
1. VO + Grid provides a powerful emerging infrastructure
for massive scientific calculations
2. Discussion of VO infrastructure and VOtechbroker
3. Examples:
• N-point correlation functions
• Nonparametric analyses and massive model fitting
ADASS 2005
Euro-VO
• The Euro-VO Data Centre Alliance (DCA):
– A collaborative and operational network of European data centres
who publish data and metadata to the Euro-VO and who provide a
research infrastructure of GRID-enabled processing and storage
facilities.
• The Euro-VO Facility Centre (VOFC):
– An organization that provides the Euro-VO with a centralized resource
registry, standards definition and promotion, as well as community
support for VO technology take-up and scientific program support
using VO technologies and resources.
• The Euro-VO Technology Centre (VOTC)
– A distributed organisation that coordinates a set of research and
development projects on VO technology, systems and tools.
EuroVO: VOTech Project
• Aims:
– Complete all technical preparatory work necessary for the
construction of the European Virtual Observatory,
– Responsible for development of infrastructure and tools:
• Intelligent resource discovery (ontology and the semantic
web), data interoperability, data mining, and visualisation
capabilities.
– Provide the ability to offload large-scale computational
processes onto the Enabling Grids for E-sciencE (EGEE)
backbone.
Existing infrastructure
• VOTech is tasked with building upon existing
infrastructure:
– IVOA for standards
– AstroGrid for middleware
• Web Services based
• Presumably IVOA will continue to look towards
other standards bodies:
– World Wide Web Consortium (W3C)
– Global Grid Forum (GGF)
IVOA Standards
• VOTable Format Definition Version 1.1:
– An XML language,
– Flexible storage and exchange format for tabular data, with emphasis
on astronomical tables,
– Allows metadata and data to be stored separately, with links
to remote data.
• Resource Metadata for the VO Version 1.01:
– For describing what data and computational facilities are
available, and once identified how to use them.
• Unified Content Descriptor (UCD) (Proposed):
– A formal (and restricted) vocabulary for astronomical data.
• IVOA Identifiers Version 1.10 (Proposed):
– Syntax for globally unique resource names.
AstroGrid Components
• MySpace: Distributed file store for workflows and results,
• Common Execution Architecture (CEA):
– Codes need wrapping before use,
– Take command line apps and present as a Web Service.
• Algorithm Registry:
– Metadata from wrapped codes are published in a yellow pages, for
searching.
• Portal:
– Web interface for interacting with preceding services,
– Workflow: Coordinate data flow/control of components within a larger
system of work,
– Submit jobs and observe status, and access files in MySpace.
• Dashboard/Workbench:
– Interact with MySpace, Registry, CEA from any language that provides
an XML-RPC library. Web Start application.
AG rollout and access via portal
AG Portal
VOtechbroker
• Execute potentially thousands of sequential processes
simultaneously, repeat multiple times:
– Parameter sweep.
• Utilise existing infrastructure at remote sites:
– e.g. computational resources: Condor, Globus,
– Transparent to the user.
• Locate suitable compute nodes (i.e. processor type,
available libraries, CPU load, memory),
• Stage code and observe status of running processes,
• Combine results for further analysis, e.g. as input to a
post-mortem visualisation component in the AG
workflow.
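A parameter sweep of this kind can be pictured as the expansion of a parameter grid into many independent job descriptions, one per combination. The sketch below is only an illustration of that idea: the function `expand_sweep`, the executable name `npt` and the parameter names are hypothetical and are not the VOtechbroker interface.

```python
from itertools import product

def expand_sweep(executable, grid):
    """Expand a dict of parameter lists into one job description per combination."""
    names = sorted(grid)  # fixed ordering so the job list is reproducible
    jobs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        args = [f"--{name}={params[name]}" for name in names]
        jobs.append({"executable": executable, "args": args})
    return jobs

# Hypothetical sweep: 3 scale bins x 4 count types = 12 independent jobs
jobs = expand_sweep("npt", {"bin": [1, 2, 3],
                            "counts": ["DDD", "DDR", "DRR", "RRR"]})
print(len(jobs), jobs[0])
```

Each entry is independent of the others, which is what allows a broker to send them to separate nodes and combine the results afterwards.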
Architecture: Job Submission Description Language (JSDL) from the Global Grid
Forum
[Figure: broker architecture diagram. Labels: London eScience Center, GGF, AstroGrid]
Web form
Broker Summary
• A broker to submit parameter sweeps to the Grid, and
other distributed resources, in a transparent way,
• Aim to allow arbitrary algorithms to be added easily,
just a new web form
• Aim to interoperate with a wide range of job
submission systems using a plug-in system,
• Distributed architecture based on Web Services,
allows for multiple types of client,
• AstroGrid workflow important:
– CEA command line to thin algorithm wrappers,
– Wrapper and Broker interaction with MySpace.
• X.509 and myproxy for authentication/authorisation.
• Ready for full-scale testing: n-point functions
N-point Correlation Functions
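The 2-point function ($\xi(r)$) has a long history in
cosmology (Peebles 1980). It is the excess joint
probability ($dP_{12}$) of a pair of points over that
expected from a Poisson process.
$dP_{12} = n^2 \, dV_1 \, dV_2 \, [1 + \xi(r)]$
[Figure: two volume elements $dV_1$ and $dV_2$ separated by a distance $r$]
$dP_{123} = n^3 \, dV_1 \, dV_2 \, dV_3 \, [1 + \xi_{23}(r) + \xi_{13}(r) + \xi_{12}(r) + \zeta_{123}(r)]$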
Measure of the topology of the large-scale structure
in the universe
Credit: Alex Szalay
Same 2pt, different 3pt
Multi-resolutional KD-trees
• Scale to n-dimensions
• Use Cached Representation (store at each
node summary sufficient statistics). Compute
counts from these statistics
• Prune the tree which is stored in memory
• (Moore et al. 2001 astro-ph/0012333)
[Figure: tree partitions at the top, 1st, 2nd and 5th levels]
Just a set of range searches
• Prune cells outside range
• Also prune cells inside! Greater saving in time
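The two pruning rules can be sketched with a small kd-tree in which every node caches its point count and bounding box. This is a minimal illustration and not the code of Moore et al.; the names `Node`, `min_dist`, `max_dist` and `range_count`, and the leaf size of 8, are assumptions made for the example.

```python
import math
import random

class Node:
    """kd-tree node that caches its point count and bounding box."""
    def __init__(self, points, leaf_size=8, depth=0):
        dim = len(points[0])
        self.count = len(points)
        self.lo = [min(p[k] for p in points) for k in range(dim)]
        self.hi = [max(p[k] for p in points) for k in range(dim)]
        self.points = self.left = self.right = None
        if self.count <= leaf_size:
            self.points = points              # leaf: keep the raw points
        else:
            axis = depth % dim                # cycle through the dimensions
            ordered = sorted(points, key=lambda p: p[axis])
            mid = len(ordered) // 2
            self.left = Node(ordered[:mid], leaf_size, depth + 1)
            self.right = Node(ordered[mid:], leaf_size, depth + 1)

def min_dist(node, q):
    """Smallest possible distance from q to any point in the node's box."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for lo, hi, x in zip(node.lo, node.hi, q)))

def max_dist(node, q):
    """Largest possible distance from q to any point in the node's box."""
    return math.sqrt(sum(max(x - lo, hi - x) ** 2
                         for lo, hi, x in zip(node.lo, node.hi, q)))

def range_count(node, q, r):
    """Count points within distance r of q."""
    if min_dist(node, q) > r:
        return 0                              # cell entirely outside: prune
    if max_dist(node, q) <= r:
        return node.count                     # cell entirely inside: use cached count
    if node.points is not None:
        return sum(1 for p in node.points if math.dist(p, q) <= r)
    return range_count(node.left, q, r) + range_count(node.right, q, r)

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(500)]
tree = Node(pts)
print(range_count(tree, (0.5, 0.5), 0.2))
```

Only cells that straddle the search radius are opened; everything else is either dropped or counted from the cached statistic without touching individual points.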
Dual Tree Algorithm
Usually binned into annuli: rmin < r < rmax
Thus, for each r, traverse both
trees and prune pairs of nodes. For a pair of nodes holding N1 and N2
points, dmin and dmax are the smallest and largest possible separations
between them:
• No count: dmin > rmax or dmax < rmin
• Count all N1 x N2 pairs: rmin < dmin and dmax < rmax
Therefore, only need to calculate
pairs cutting the boundaries.
Scales to n-point functions; can also do
all r values at once
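A dual-tree pair count can be sketched as follows. For two nodes, the bounding boxes give the smallest and largest possible separation: if the whole range lies outside the bin nothing is counted, if it lies wholly inside the bin all N1 x N2 pairs are added at once, and only node pairs that cut a bin boundary are opened further. The names `Node`, `box_dists` and `dual_count`, the half-open bin [rmin, rmax) and the leaf size are assumptions made for this illustration, which is not the npt code itself.

```python
import math
import random

class Node:
    """kd-tree node that caches its point count and bounding box."""
    def __init__(self, points, leaf_size=8, depth=0):
        dim = len(points[0])
        self.count = len(points)
        self.lo = [min(p[k] for p in points) for k in range(dim)]
        self.hi = [max(p[k] for p in points) for k in range(dim)]
        self.points = self.left = self.right = None
        if self.count <= leaf_size:
            self.points = points
        else:
            axis = depth % dim
            ordered = sorted(points, key=lambda p: p[axis])
            mid = len(ordered) // 2
            self.left = Node(ordered[:mid], leaf_size, depth + 1)
            self.right = Node(ordered[mid:], leaf_size, depth + 1)

def box_dists(a, b):
    """Minimum and maximum possible separation between points of nodes a and b."""
    dmin2 = dmax2 = 0.0
    for alo, ahi, blo, bhi in zip(a.lo, a.hi, b.lo, b.hi):
        gap = max(alo - bhi, blo - ahi, 0.0)   # zero if the boxes overlap on this axis
        span = max(ahi - blo, bhi - alo)
        dmin2 += gap * gap
        dmax2 += span * span
    return math.sqrt(dmin2), math.sqrt(dmax2)

def dual_count(a, b, rmin, rmax):
    """Count pairs (p in a, q in b) with rmin <= |p - q| < rmax."""
    dmin, dmax = box_dists(a, b)
    if dmin >= rmax or dmax < rmin:
        return 0                               # no pair can fall in the bin
    if rmin <= dmin and dmax < rmax:
        return a.count * b.count               # every pair falls in the bin
    if a.points is not None and b.points is not None:
        return sum(1 for p in a.points for q in b.points
                   if rmin <= math.dist(p, q) < rmax)
    # Otherwise open the larger node that is not a leaf
    if b.points is not None or (a.points is None and a.count >= b.count):
        return dual_count(a.left, b, rmin, rmax) + dual_count(a.right, b, rmin, rmax)
    return dual_count(a, b.left, rmin, rmax) + dual_count(a, b.right, rmin, rmax)

random.seed(0)
data = [(random.random(), random.random()) for _ in range(300)]
rand = [(random.random(), random.random()) for _ in range(600)]
data_tree, rand_tree = Node(data), Node(rand)
dr = dual_count(data_tree, rand_tree, 0.1, 0.2)   # a DR count for one bin
print(dr)
```

The same recursion extends to triplets by traversing three trees at once, at the cost of more cases for the node that is opened.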
3-point Correlation Function of SDSS
Luminous Red Galaxies
Details of the dataset:
0.15 < z < 0.55
-23.2 < Mg < -21.2
(~50000 LRGs from
Eisenstein et al. 2005)
Each bin can be hundreds of
individual calculations
(errors)
Employing npt on TeraGrid - I
• Scale of computing npt:
– For the value of the 2-point correlation function within any given
bin, we need 3 types of pair counts (DD, DR and RR), while
for the value of the 3-point correlation function, we need 4 types
of triplet counts (DDD, DDR, DRR, RRR).
– Memory requirement depends upon the size of the dataset
and random catalog. For ~50,000 LRGs and a random dataset
of ~800,000, the NPT code makes a tree of ~50MB.
– Each bin requires an error estimate, which can mean 30 jackknifes.
Therefore, each bin can be hundreds of individual
jobs, each of which can be sent to a separate node.
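As an illustration of how the three pair counts and the jackknife resamples combine, the sketch below uses the Landy-Szalay estimator, a common choice that needs exactly DD, DR and RR. The slides do not say which estimator is used, so this is an assumption, and the counts in the example are made-up numbers chosen only to exercise the arithmetic.

```python
import math

def landy_szalay(dd, dr, rr, n_data, n_rand):
    """xi = (DD - 2 DR + RR) / RR, with each count normalised by the number of possible pairs."""
    dd_n = dd / (n_data * (n_data - 1) / 2)
    dr_n = dr / (n_data * n_rand)
    rr_n = rr / (n_rand * (n_rand - 1) / 2)
    return (dd_n - 2 * dr_n + rr_n) / rr_n

def jackknife_error(estimates):
    """Delete-one jackknife standard error from the estimates with each subregion left out."""
    n = len(estimates)
    mean = sum(estimates) / n
    return math.sqrt((n - 1) / n * sum((x - mean) ** 2 for x in estimates))

# Illustrative counts only: 100 data points, 1000 random points, one bin
xi_unclustered = landy_szalay(dd=495, dr=10000, rr=49950, n_data=100, n_rand=1000)
xi_clustered = landy_szalay(dd=990, dr=10000, rr=49950, n_data=100, n_rand=1000)
print(xi_unclustered, xi_clustered)
```

Every jackknife resample repeats all of the counts with one subregion removed, which is why a single bin turns into many independent jobs.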
Employing npt on TeraGrid - II
[Figure: two sets of timing plots, each with panels for 0.5 < s < 1.5, 1.5 < s < 2.5, 9.5 < s < 10.5 and 19.5 < s < 20.0 Mpc/h. Axes: time (sec) against an angle in radians]
Time taken on TeraGrid to compute
DDD triplets for LRG data.
Time taken on TeraGrid to compute
RRR triplets.
Employing npt on TeraGrid - III
• Limitations:
– Long queue time (sometimes stretching to
6 hours).
– After 24 hrs, jobs are terminated, so
bigger datasets need to be processed on
a different cluster.
Non-parametric techniques
• The complexity and wealth of the data
demand non-parametric techniques,
i.e. can one describe phenomena using
the fewest assumptions?
CMB Power Spectrum
Before WMAP
WMAP data
Are the 2nd and 3rd
peaks detected?
In parametric models of the CMB power spectrum the
answer is likely “yes” as all CMB models have multiple
peaks. But that has not really answered our question!
Can we answer the question non-parametrically? E.g.,
$Y_i = f(X_i) + c_i$
where $Y_i$ is the observed data, $f(X_i)$ is a sum of orthogonal
functions ($\beta_i \cos(i \pi X_i)$) and $c_i$ is the noise, described by the covariance matrix. The
challenge is to “shrink” $f(X_i)$. We use
• Beran (2000) to shrink $f(X_i)$ to N terms equal to the number
of data points - optimal for all smooth functions and
provides valid confidence intervals
• Monotonic shrinkage of $\beta_i$ - specifically nested subset
selection (NSS)
See Genovese et al. (2004) astro-ph/0410104
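The idea of nested subset selection can be sketched on a toy problem: expand the data in a cosine basis, then keep only the first J coefficients, with J chosen to minimise an estimate of the risk (a bias-variance trade-off). The sketch assumes equally spaced points, independent noise with a known standard deviation `sigma`, and a standard risk estimate for this kind of truncation; it does not build the confidence ball and is not the procedure of Genovese et al. in detail. The names `cosine_basis` and `nss_fit` are made up for the example.

```python
import math
import random

def cosine_basis(j, x):
    """Orthonormal cosine basis on [0, 1]."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * x)

def nss_fit(y, sigma):
    """Fit y_i observed at x_i = (i + 0.5) / n with a cosine series cut off at J terms."""
    n = len(y)
    xs = [(i + 0.5) / n for i in range(n)]
    # Empirical coefficients Z_j = (1/n) sum_i Y_i phi_j(x_i)
    z = [sum(yi * cosine_basis(j, x) for yi, x in zip(y, xs)) / n for j in range(n)]
    var = sigma ** 2 / n          # variance of each empirical coefficient

    def risk(J):
        # Variance of the J kept terms plus estimated squared bias of the dropped terms
        return J * var + sum(max(zj * zj - var, 0.0) for zj in z[J:])

    J = min(range(1, n + 1), key=risk)

    def f_hat(x):
        return sum(z[j] * cosine_basis(j, x) for j in range(J))

    return J, f_hat

# Toy data: a smooth curve with two cosine components plus noise
random.seed(0)
n, sigma = 200, 0.3
truth = lambda x: math.cos(2 * math.pi * x) + 0.5 * math.cos(6 * math.pi * x)
xs = [(i + 0.5) / n for i in range(n)]
y = [truth(x) + random.gauss(0.0, sigma) for x in xs]
J, f_hat = nss_fit(y, sigma)
mse = sum((f_hat(x) - truth(x)) ** 2 for x in xs) / n
print(J, mse)
```

Keeping too few terms biases the fit and keeping all of them reproduces the noise; the chosen J sits between the two.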
Results
(optimal smoothing through bias-variance trade-off)
Concordance
Our f(Xi)
Note: the WMAP-only fit is not the same as the concordance model
Testing models
• The main advantage of this method is that we can construct
a “confidence ball” (in N dimensions) around f(Xi) and thus
perform non-parametric inferences, e.g. is the second
peak detected?
Not at 95%
confidence!
Gray are models in the
95% confidence ball
ASA “Outstanding Application of the year” (2005)
Using CMBfast we can make parametric models (11
parameters) and test if they are within the “confidence
ball”. Varying the baryon density parameter, we get a range of 0.0169 to 0.0287
Testing in high D
• We can now jointly search 7 cosmological
parameters in the parametric model and determine
which models fit in the confidence ball (at 95%).
• Traditionally this is done by marginalising over the
other parameters to gain confidence intervals on
each parameter separately. This is a problem in high-D,
where the likelihood function could be degenerate,
ill-defined and under-identified
• This is computationally intensive, as millions of models
need to be searched and each takes ~3 minutes to run
Find boundaries
We use kriging
– “method of interpolation which predicts unknown values from data
observed at known locations”
– Also known as Gaussian process regression; a form of Bayesian
inference
– Different metrics for evaluation (Variance, Entropy, least probable)
Variance: pick points
far from other searches
Straddle: points far from
other searches and near
predicted boundary
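A one-dimensional sketch of this search is given below. A Gaussian process is fitted to the models evaluated so far, and each candidate point is scored by 1.96 x (predicted standard deviation) minus the distance of the predicted value from the threshold, which is one common way to write a straddle-type score: it is high for points that are both uncertain and close to the predicted boundary. The squared-exponential kernel, its length scale, the toy function and all function names are assumptions made for the example.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential covariance between two 1D locations."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution for L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution for L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, ys))
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 0.0)
    return mean, math.sqrt(var)

def straddle_score(mean, std, threshold):
    """High when the point is uncertain and close to the predicted boundary."""
    return 1.96 * std - abs(mean - threshold)

def model(x):
    return x * x                      # stand-in for an expensive model evaluation

xs = [0.0, 0.25, 0.5, 0.75, 1.0]      # locations already evaluated
ys = [model(x) for x in xs]
threshold = 0.3                       # boundary value being searched for
candidates = [i / 20 for i in range(21)]

def score(x):
    mean, std = gp_predict(xs, ys, x)
    return straddle_score(mean, std, threshold)

next_x = max(candidates, key=score)   # next location to evaluate
print(next_x)
```

A pure variance criterion would use only the standard deviation term and so would spread points evenly; the extra term concentrates the evaluations where the boundary is expected to be.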
50 samples
200 samples
Results
Add two heuristics:
• Path - explore
between peaks
• Depth - flood
peaks
[Figure axes: baryons, dark matter]
Purple: 68%
Red: 95%
1.2 million models
6.8 yrs of CPU time
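The CPU time quoted here is consistent with the run time of about 3 minutes per model given earlier in the talk, as a short calculation shows. The 3 minutes is approximate, so the result is only a rough check.

```python
MINUTES_PER_YEAR = 60 * 24 * 365.25

def cpu_years(n_models, minutes_per_model=3.0):
    """Total sequential CPU time, in years, for n_models runs."""
    return n_models * minutes_per_model / MINUTES_PER_YEAR

print(round(cpu_years(1.2e6), 1))   # 6.8
```

At the same rate, 10 million models would need roughly 57 years of CPU time, which is why the search has to be spread over many nodes.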
Future
• Marriage with VOtechbroker and run 10
million models on TeraGrid (300Gb of
models)
• Java code exists to query the dataspace and provide a web service
• Add other data (CMB, LSS)
• Convergence test: shape of surface
• Visualization of 7D space
Future applications
• Selection function
for XMM Cluster
Survey
• Add fake clusters
and then analyse
• Over a million
combinations, or
4 yrs of CPU time
Fake cluster added to XMM field
Summary
• VO infrastructure with emerging Grids
provides a powerful framework within
which to do massive calculations
• VOtechbroker will abstract the Grid from the
user and interface with VO MySpace
• Registry of advanced algorithms (npt,
kriging, nonparametric statistics etc.)