Association techniques for the Virtual Observatory Bob Mann

Why associations are crucial to the Virtual Observatory
The essence of the VO is database federation
• Usually DBs of independent origin
• No links between entries in different DBs
• Such links needed for prototypical VO query
  • e.g. “give me all galaxies in region A of the sky with an optical/X-ray flux ratio greater than X which are not detected in the radio to a limiting flux of Y”
Why you might think associations are easy to make
Natural spatial indexing to astro databases
• Plus uncertainties on positions, in general
• Just perform matching by proximity
• Simple-ish methods for doing this [Clive]
• Some practical issues for distributed case
• Data volumes
  • Think about transfers & performance
• Metadata for interoperability
• Restriction to SQL Server databases & .NET
• Requires special facilities at data centres? [Greg]
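As a rough illustration of matching by proximity, the sketch below lists, for each source in B, the entries of A that lie within a search radius, nearest first. The function names, the (RA, Dec) tuple layout and the brute-force double loop are assumptions made for illustration; a real service would use a spatial index rather than comparing every pair.

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def proximity_match(cat_a, cat_b, radius_deg):
    """Map each index j of cat_b to the indices of cat_a within radius_deg.

    cat_a and cat_b are lists of (ra_deg, dec_deg) tuples (an assumed layout).
    Candidates are returned nearest first. Cost is O(N_A * N_B).
    """
    matches = {}
    for j, (ra_b, dec_b) in enumerate(cat_b):
        candidates = []
        for i, (ra_a, dec_a) in enumerate(cat_a):
            sep = angular_sep_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg:
                candidates.append((sep, i))
        matches[j] = [i for _, i in sorted(candidates)]
    return matches

# Made-up positions: two A objects close to the single B source, one far away.
cat_a = [(10.0, 0.0), (10.001, 0.0), (50.0, 20.0)]
cat_b = [(10.0003, 0.0)]
matches = proximity_match(cat_a, cat_b, radius_deg=5.0 / 3600.0)
```

With a 5 arcsec radius both nearby A objects survive the cut, which is exactly the situation where proximity alone cannot say which one is the counterpart.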
Matching by proximity alone
Matching by proximity is not always adequate
Need astrophysical information to know which of the red objects is the most likely counterpart to the cyan source
General Case
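Database A:
• Positions: $(\mathrm{RA}_i, \mathrm{Dec}_i)$ for $i = 1, \dots, N_A$
• Positional uncertainties: $(\sigma_{\mathrm{RA},i}, \sigma_{\mathrm{Dec},i})$ or $(\sigma_{X,i}, \sigma_{Y,i})$ or $\sigma_i$ or $\sigma$
• Other attributes $A_{ij}$ for $j = 1, \dots, M_A$
Ditto for Database B
$(N_A, N_B)$ may be up to $\sim 10^9$
$(M_A, M_B)$ may be $\sim 10^2$
• <10 likely to be used in association procedure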
General Requirements
Users can readily assess whether associations are suitable for their analysis
• Transparency of method used
• Figure of merit for each association
• User-supplied association methods(?)
• Performance: pre-computation vs. on-the-fly
• Incorporating astrophysical prior knowledge, but not biasing associations unduly
• Often new classes of source involved
Likelihood Ratio technique(s)
Likelihood Ratio, $\mathrm{LR}_{ij}$, for association of the $i$th entry of DB A and the $j$th of B, defined to be
$$\mathrm{LR}_{ij} = \frac{P(A_i \text{ is the true counterpart of } B_j)}{P(A_i \text{ is not the true counterpart of } B_j)}$$
Choose the $i$ that maximises $\mathrm{LR}_{ij}$
LR example
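A is an optical catalogue, with magnitudes $m$ and negligible positional errors
• Gaussian positional uncertainty, $e(x,y)$, for B
• Then, $\mathrm{LR}_{ij} = n_{A,\mathrm{ID}}(m_i)\, e(x_j, y_j) / n_A(m_i)$
  • $n_{A,\mathrm{ID}}(m)$: magnitude distribution of the true counterparts; $n_A(m)$: magnitude distribution of all objects in A
• Problems:
  • Might not know form of $n_{A,\mathrm{ID}}(m_i)$
  • Might have several populations in B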
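The LR example can be sketched in a few lines. This is a minimal illustration, not the method of any particular pipeline: it assumes a circular Gaussian error with a single width `sigma`, and it takes the two magnitude distributions as caller-supplied functions `n_id` and `n_bg` (hypothetical names standing in for $n_{A,\mathrm{ID}}$ and $n_A$).

```python
import math

def gaussian_error(dx, dy, sigma):
    """Circular 2-D Gaussian positional error density e(x, y), per unit area."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) / (2.0 * math.pi * sigma * sigma)

def likelihood_ratio(m, dx, dy, sigma, n_id, n_bg):
    """LR = n_id(m) * e(dx, dy) / n_bg(m) for one candidate of magnitude m at offset (dx, dy)."""
    return n_id(m) * gaussian_error(dx, dy, sigma) / n_bg(m)

def best_candidate(candidates, sigma, n_id, n_bg):
    """candidates: list of (index_in_A, magnitude, dx, dy). Returns (index, LR) of the highest LR."""
    scored = [(likelihood_ratio(m, dx, dy, sigma, n_id, n_bg), i) for i, m, dx, dy in candidates]
    lr, i = max(scored)
    return i, lr

# Made-up inputs: background counts rise steeply towards faint magnitudes,
# and the counterpart distribution is taken to be flat.
n_bg = lambda m: 10.0 ** (0.4 * (m - 20.0))
n_id = lambda m: 1.0
candidates = [(0, 22.0, 0.5, 0.0),   # faint object, closer to the B position
              (1, 18.0, 1.0, 0.0)]   # bright object, further away
best_index, best_lr = best_candidate(candidates, sigma=1.0, n_id=n_id, n_bg=n_bg)
```

In this toy case the brighter, more distant object wins, because bright objects are rare in the background; a pure nearest-neighbour match would have picked the faint one.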
If $n_{A,\mathrm{ID}}(m_i)$ is not known
• Estimate it:
  • Compare $n_A(m)$ around source positions with $n_A(m)$ for full database A
• Learn it:
  • Use EM algorithm to learn form of $n_{A,\mathrm{ID}}(m_i)$ [Emma Taylor PhD thesis]
• Circumvent it:
  • Set $n_{A,\mathrm{ID}}(m_i) = \mathrm{const.}$ and normalise $\mathrm{LR}_{ij}$ using randomly-located fictitious sources
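The "estimate it" option might look roughly like the following: histogram the magnitudes of A objects found near the B positions, then subtract the number expected from the full catalogue scaled to the same area. The function names, the binning scheme and the clipping of negative excesses to zero are all assumptions made for this sketch.

```python
def magnitude_histogram(mags, m_min, bin_width, n_bins):
    """Count magnitudes in n_bins equal-width bins starting at m_min."""
    counts = [0] * n_bins
    for m in mags:
        k = int((m - m_min) // bin_width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

def estimate_counterpart_counts(mags_near, mags_all, search_area, total_area,
                                m_min, bin_width, n_bins):
    """Excess counts per magnitude bin around source positions.

    mags_near: magnitudes of A objects inside the search regions (total area search_area).
    mags_all:  magnitudes of every object in A (covering total_area).
    The background expected in the search regions is the full histogram scaled
    by search_area / total_area; negative excesses are clipped to zero.
    """
    near = magnitude_histogram(mags_near, m_min, bin_width, n_bins)
    full = magnitude_histogram(mags_all, m_min, bin_width, n_bins)
    scale = search_area / total_area
    return [max(0.0, n - f * scale) for n, f in zip(near, full)]

# Made-up numbers: the excess is concentrated at bright magnitudes.
mags_all = [18.5] * 10 + [19.5] * 100 + [20.5] * 1000
mags_near = [18.5] * 5 + [19.5] * 3 + [20.5] * 10
excess = estimate_counterpart_counts(mags_near, mags_all, search_area=1.0,
                                     total_area=100.0, m_min=18.0,
                                     bin_width=1.0, n_bins=3)
```

Normalising `excess` so that it sums to one would give a usable estimate of the shape of $n_{A,\mathrm{ID}}(m)$, subject to the noise in each bin.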
All of these methods require statistics on A
• e.g. $n_A(m)$
• …or histogram of any other attribute(s)
• The more complicated the physical model (e.g. multiple source populations in B), the more complicated the statistics that are required
• Not an insurmountable problem: just lots of count(*) queries
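One way such a statistic could be gathered with a single count(*) query is sketched here using Python's built-in sqlite3 module. The table `optical` and its column `mag` are hypothetical stand-ins for database A; a real archive would have its own schema and SQL dialect.

```python
import sqlite3

def magnitude_counts(conn, m_min, bin_width):
    """Return [(bin_index, count), ...] for the hypothetical table optical(mag)."""
    sql = (
        "SELECT CAST((mag - ?) / ? AS INTEGER) AS mbin, COUNT(*) "
        "FROM optical WHERE mag >= ? GROUP BY mbin ORDER BY mbin"
    )
    return conn.execute(sql, (m_min, bin_width, m_min)).fetchall()

# Tiny in-memory stand-in for database A, with made-up magnitudes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE optical (id INTEGER PRIMARY KEY, mag REAL)")
conn.executemany("INSERT INTO optical (mag) VALUES (?)",
                 [(18.2,), (18.7,), (19.1,), (20.9,)])
histogram = magnitude_counts(conn, 18.0, 1.0)
```

Histograms over other attributes, or over several attributes at once, follow the same pattern with a different GROUP BY expression.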
Pre-computing cross-neighbours
LR usually chooses between a few candidates
• Pre-compute & store cross-neighbours
• At least for the few, very large DBs
CrossNeighbours (B,C)
CrossNeighbours (C,B)
Can then allow many probabilistic models to be used following the initial proximity cut
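A possible shape for this, sketched under simplifying assumptions: the cross-neighbour list is built once with a generous proximity cut, and any number of scoring functions can then be run over the stored candidates without repeating the spatial search. The flat-sky separation, the dictionary layout and the function names are illustrative choices, not part of the proposal.

```python
import math

def build_cross_neighbours(pos_b, pos_c, max_sep):
    """Return {b_index: [(c_index, separation), ...]} for pairs closer than max_sep.

    Positions are (x, y) tuples in a flat-sky approximation (an assumption
    that is only reasonable over small fields). Brute force for clarity.
    """
    table = {}
    for b, (xb, yb) in enumerate(pos_b):
        table[b] = [(c, math.hypot(xc - xb, yc - yb))
                    for c, (xc, yc) in enumerate(pos_c)
                    if math.hypot(xc - xb, yc - yb) <= max_sep]
    return table

def associate(cross_neighbours, score):
    """For each B entry with stored neighbours, pick the one maximising score(b, c, sep)."""
    best = {}
    for b, neighbours in cross_neighbours.items():
        if neighbours:
            best[b] = max(neighbours, key=lambda nc: score(b, nc[0], nc[1]))[0]
    return best

# Made-up data: one B source, three C objects, one of them far away.
pos_b = [(0.0, 0.0)]
pos_c = [(0.1, 0.0), (0.3, 0.0), (5.0, 5.0)]
mags_c = [22.0, 18.0, 17.0]
neighbours = build_cross_neighbours(pos_b, pos_c, max_sep=1.0)

nearest = associate(neighbours, lambda b, c, sep: -sep)
brightest = associate(neighbours, lambda b, c, sep: -mags_c[c])
```

The two calls to `associate` reuse the same stored neighbours with different models, which is the point of pre-computing them.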
Distributed Association Service?
cf. Distributed Annotation Server
• Allows third-party annotation in bio DBs
  • “inferred function of this gene is junk”
• Can be included in queries (somehow)
  • Select whatever from BioDB where not “function is junk”
• Some sort of join between BioDB and the Distributed Annotation Server
Distributed Association Service (2)
Is something like this needed in the VO?
• Easier than adding extra columns to tables
• What would it contain:
  • References to original databases
  • “entry N in DB A is entry M in DB B”
  • Descriptions of methods used
  • Links to literature references…ADS/CDS
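To make the list of contents concrete, one record in such a service might carry fields like those below. This is purely a hypothetical layout: the class name, field names and the database identifiers in the example are invented for illustration and do not correspond to any existing VO interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AssociationRecord:
    """One assertion of the form 'entry N in DB A is entry M in DB B' (hypothetical schema)."""
    db_a: str                                # reference to the first original database
    entry_a: int                             # entry N in DB A
    db_b: str                                # reference to the second original database
    entry_b: int                             # entry M in DB B
    method: str                              # description of the method used
    figure_of_merit: Optional[float] = None  # e.g. a likelihood ratio, if available
    reference: Optional[str] = None          # link to a literature reference, if any

def counterparts(records, db_a, entry_a, db_b):
    """All recorded associations of one DB A entry with entries of DB B."""
    return [r for r in records
            if r.db_a == db_a and r.entry_a == entry_a and r.db_b == db_b]

# Made-up records with placeholder database names.
records = [
    AssociationRecord("optical", 42, "xray", 7, "likelihood ratio", 12.5),
    AssociationRecord("optical", 42, "radio", 3, "nearest neighbour"),
    AssociationRecord("optical", 43, "xray", 9, "likelihood ratio", 0.8),
]
found = counterparts(records, "optical", 42, "xray")
```

Keeping the method and figure of merit alongside each match is what would let users judge whether a recorded association suits their analysis.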
Associations in the VO
Basically, something like Greg’s picture…
• Start with a large dose of SkyQuery
• Add possibility of running user-defined algorithms on dataset from proximity cut
• Pre-compute cross-neighbours for big DBs
• Distributed Association Service to record matches made?…and methods used?