Association techniques for the Virtual Observatory
Bob Mann

Why associations are crucial to the Virtual Observatory
- The essence of the VO is database federation
  - Usually DBs of independent origin
  - No links between entries in different DBs
- Such links are needed for the prototypical VO query
  - e.g. "give me all galaxies in region A of the sky with an optical/X-ray flux ratio greater than X which are not detected in the radio to a limiting flux of Y"
- [Figures: Optical, X-ray, Radio]

Why you might think associations are easy to make
- Natural spatial indexing to astro databases
  - Plus uncertainties on positions, in general
- Just perform matching by proximity
  - Simple-ish methods for doing this [Clive]
- Some practical issues for the distributed case
  - Data volumes: think about transfers & performance
  - Metadata for interoperability
- SkyQuery: www.skyquery.net
  - Restriction to SQLServer databases & .Net
  - Requires special facilities at data centres? [Greg]

Matching by proximity alone
- Matching by proximity is not always adequate
- Need astrophysical information to know which of the red objects is the most likely counterpart to the cyan source

General Case
- Database A:
  - Positions: (RA_i, Dec_i) for i = 1, ..., N_A
  - Positional uncertainties: (σ_RA,i, σ_Dec,i) or (σ_X,i, σ_Y,i) or σ_i or σ
  - Other attributes A_ij for j = 1, ..., M_A
- Ditto for Database B
- (N_A, N_B) may be up to ~10^9
- (M_A, M_B) may be ~10^2
  - <10 of these likely to be used in the association procedure

General Requirements
- Users can readily assess whether associations are suitable for their analysis
  - Transparency of the method used
  - Figure of merit for each association
  - User-supplied association methods(?)
- Performance: pre-computation vs. on-the-fly
- Incorporating astrophysical prior knowledge, but not biasing associations unduly
  - Often new classes of source involved

Likelihood Ratio technique(s)
- Likelihood Ratio, LR_ij, for the association of the ith entry of DB A with the jth entry of DB B, defined to be:
  LR_ij = P(A_i is the true counterpart of B_j) / P(A_i is not the true counterpart of B_j)
- Choose the i that maximises LR_ij

LR example
- A is an optical catalogue, with magnitudes m and negligible positional errors
- Gaussian positional uncertainty, e(x, y), for B
- Then LR_ij = n_A,ID(m_i) e(x_j, y_j) / n_A(m_i)
- Problems:
  - Might not know the form of n_A,ID(m_i)
  - Might have several populations in B

If n_A,ID(m_i) is not known
- Estimate it: compare n_A(m) around source positions with n_A(m) for the full database A
- Learn it: use an EM algorithm to learn the form of n_A,ID(m_i) [Emma Taylor PhD thesis]
- Circumvent it: set n_A,ID(m_i) = const. and normalise LR_ij using randomly-located fictitious sources

But…
- All of these methods require statistics on A
  - e.g. n_A(m)
  - …or histograms of any other attribute(s)
- The more complicated the physical model (e.g. multiple source populations in B), the more complicated the statistics that are needed
- Not an insurmountable problem: just lots of count(*) queries

Pre-computing cross-neighbours
- LR usually chooses between a few candidates
- Pre-compute & store cross-neighbours
  - At least for the few, very large DBs
- [Diagram: databases A, B and C with pre-computed CrossNeighbours(B,C) and CrossNeighbours(C,B) tables]
- Can then allow many probabilistic models to be used following the initial proximity cut
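As an illustration of the LR example above, the sketch below (not part of the original talk) scores the proximity-cut candidates for one source in B under the stated assumptions: negligible positional errors in A, a circular Gaussian positional uncertainty for B, and known (or estimated) magnitude distributions n_A,ID(m) and n_A(m). All function and variable names are hypothetical.

```python
import numpy as np

def likelihood_ratio(dx, dy, sigma, m, n_id, n_a):
    """LR_ij = n_A,ID(m_i) * e(x_j, y_j) / n_A(m_i), with e a circular
    Gaussian positional error distribution for the B source.

    dx, dy : offsets of the A candidate from the B position (arcsec)
    sigma  : 1-sigma positional uncertainty of the B source (arcsec)
    m      : magnitude of the A candidate
    n_id   : callable giving n_A,ID(m), magnitude distribution of true counterparts
    n_a    : callable giving n_A(m), magnitude distribution of the full catalogue A
    """
    e = np.exp(-(dx**2 + dy**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return n_id(m) * e / n_a(m)

def best_counterpart(candidates, sigma, n_id, n_a):
    """Choose the A entry that maximises LR_ij among the candidates for one
    B source; `candidates` is a list of (a_id, dx, dy, m) tuples, e.g. read
    from a pre-computed CrossNeighbours table."""
    lr, a_id = max((likelihood_ratio(dx, dy, sigma, m, n_id, n_a), a_id)
                   for a_id, dx, dy, m in candidates)
    return a_id, lr  # the LR value doubles as a figure of merit for the match
```

Estimating n_A,ID(m) and n_A(m) by the "lots of count(*) queries" route amounts to filling two magnitude histograms and passing interpolators over them as n_id and n_a.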
Distributed Association Service?
- cf. Distributed Annotation Server
  - Allows third-party annotation in bio DBs
    - "inferred function of this gene is junk"
  - Can be included in queries (somehow)
    - Select whatever from BioDB where not "function is junk"
    - Some sort of join between BioDB and the Distributed Annotation Server

Distributed Association Service (2)
- Is something like this needed in the VO?
  - Easier than adding extra columns to tables
- What would it contain:
  - References to the original databases
    - "entry N in DB A is entry M in DB B"
  - Descriptions of the methods used
  - Links to literature references… ADS/CDS

Associations in the VO
- Basically, something like Greg's picture…
- Start with a large dose of SkyQuery
- Add the possibility of running user-defined algorithms on the dataset from the proximity cut
- Pre-compute cross-neighbours for the big DBs
- Distributed Association Service to record the matches made? …and the methods used?
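To make the "what would it contain" list concrete, here is a minimal sketch of what a single record in such a Distributed Association Service might hold. The schema and field names are purely illustrative assumptions, not part of the talk or of any existing service.

```python
from dataclasses import dataclass, field

@dataclass
class AssociationRecord:
    """One cross-identification held by a hypothetical Distributed Association Service."""
    db_a: str                # reference to the original database A (e.g. a registry identifier)
    entry_a: int             # "entry N in DB A ..."
    db_b: str                # reference to the original database B
    entry_b: int             # "... is entry M in DB B"
    method: str              # description of the association method used
    figure_of_merit: float   # e.g. the likelihood ratio LR_ij for this match
    references: list[str] = field(default_factory=list)  # literature links (ADS/CDS)

# Example record; all identifiers and values are made up.
rec = AssociationRecord(
    db_a="ivo://example/opticalDB", entry_a=1234,
    db_b="ivo://example/xrayDB", entry_b=987,
    method="likelihood ratio with Gaussian positional errors",
    figure_of_merit=42.7,
)
```

A service exposing records like this could then be joined against the original databases in a query, in the same spirit as the Distributed Annotation Server example above.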