Indexing the Sky
Clive Page
2003 Apr 8

Formats of Raw Data
• Radio:
  – Complex visibility for each polarisation at a set of points sampling the (u,v) plane.
• Infra-red, Optical, Ultra-violet:
  – Images from 1k×1k to 18k×20k, collected every few seconds or few minutes.
• X-ray, Gamma-ray:
  – Lists of detected photons (x, y, time, energy), typically accumulated for several hours.

Formats of Reduced Data
• Images
• Time-series
• Spectra
• Source Catalogues:
  – Vital for cross-identifying sources from different wavebands; the basis for many subsequent data mining investigations.
  – Problem: they can be large, for example:
      USNO-B                      1,045,913,669 rows    30 columns
      1st XMM-Newton catalogue           56,711 rows   379 columns

Required Functionality
• SELECT sources in a given small patch of sky (circle, rectangle, or polygon)
• JOIN two tables, e.g. from different wavebands, to find corresponding sources
  – Principal matching criterion is a positional match, typically overlap of error-circles.

Problems handling source catalogues
• Positions use spherical-polar coordinates (RA, Dec)
  – Right Ascension corresponds to geographic longitude
  – Declination corresponds to geographic latitude
• There are singularities at the poles and distortions in the scales everywhere except at the equator.
• RA wraps from 24 hours (360 degrees) to zero.
• Two-dimensional indexing is really needed.
• All source positions are imprecise: points have an error radius.
• Distances between points must use a great-circle distance function, not cartesian distance.

Indexing Possibilities
1. Use simple B-tree on one spatial axis only
2. Use 2-d to 1-d mapping function then B-tree
3. Use spatial index such as R-tree

(1) Index one spatial axis only
• For example consider USNO-B: a table of a billion rows.
• Typical search/join uses a radius of say 3 arc-seconds.
• Probability of finding a source in a circle of radius 3 arc-seconds at a random position is around 17%, so most searches find 0 or 1 rows.
• An index on just one coordinate (say Dec) will effectively search a strip 360° wide by 6 arc-seconds high, and will find some 10,000 rows matching. These have to be scanned sequentially to find at most one matching row.
• Conclusion: a true 2-d index can gain five orders of magnitude in efficiency.

(2) 2-d to 1-d mapping
• Cover the space with cells (pixels) and number them.
• Create conventional B-tree on resulting set of integers.
• Each point maps to an integer.
• Areas map to a list of integers:
  – Ideally a small spatial area maps to a small range of integers so one can do a range search using the B-tree.
  – Various space-filling curves such as the Z-ordering index and Peano Curve have been used in the hope that this works…

[Figure: Z-order mapping function]

Space-filling Curves
• All have the same failing:
  – At some places in the grid a high-order bit flips and the range of integers becomes huge.
  – Tests confirm this defect: the worst-case performance is rather poor.
• Simple cartesian grids are also unsuited to spherical-polar coordinates as there are too many tiny pixels near the poles.

Covering the sky evenly with pixels
• Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix) – invented at European Southern Observatory.
• Hierarchical Triangular Mesh (HTM) – invented at Johns Hopkins University.
• Can use either algorithm – call it pixel-code or PCODE for short
  – Do not try to conduct a spatial range search using a range of PCODE values.
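The failing of space-filling curves noted under "Space-filling Curves" can be seen in a minimal Z-order sketch. The function names `z_encode` and `z_range` are illustrative only and do not come from the talk; the grid is a plain cartesian one, not HEALPix or HTM. The sketch interleaves the bits of the two cell coordinates and then reports the range of integers covered by a small 2×2 box.

```python
def z_encode(x, y, bits=16):
    """Interleave the bits of x and y (x in the even positions) into one integer."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def z_range(x0, y0, x1, y1):
    """Smallest and largest Z-order code of the cells in an inclusive box."""
    codes = [z_encode(x, y)
             for x in range(x0, x1 + 1)
             for y in range(y0, y1 + 1)]
    return min(codes), max(codes)


# A 2x2 box aligned with the curve maps to four consecutive integers.
print(z_range(0, 0, 1, 1))   # (0, 3)

# The same size of box straddling the x=8, y=8 boundary flips a higher-order
# bit, so a B-tree range search would have to scan codes 63..192.
print(z_range(7, 7, 8, 8))   # (63, 192)
```

Both boxes contain four cells, but the second spans 130 integers. The span grows with the order of the bit that flips, which is the worst-case behaviour the slide refers to.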

[Figure: Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix)]

[Figure: Hierarchical Triangular Mesh (HTM)]

Spatial Join using PCODE
Table CAT1 has columns
• ID1
• RA
• DEC
• POSERR
• MAGNITUDE
• etc
Table CAT2 has columns
• ID2
• RA
• DEC
• POSERR
• FLUX
• etc

Create additional tables with PCODE values
Table CAT1 has columns
• ID1 – primary key
• RA
• DEC
• POSERR
• MAGNITUDE
• etc
Table P1 has columns
• ID1
• PCODE1 – primary key
Table P2 has columns
• ID2
• PCODE2 – primary key
Table CAT2 has columns
• ID2 – primary key
• RA
• DEC
• POSERR
• FLUX
• etc

JOIN the two PCODE tables
Note: tables P1, P2 have extra rows when error-circles overlap more than one pixel.
• Join P1 and P2 on PCODE1=PCODE2, making a table PJOIN with just two columns: ID1 and ID2.
• Use SELECT DISTINCT to remove any duplicates.
• Table PJOIN identifies pairs of sources that share a pixel: their error circles may overlap, or they may just be near but not overlapping.
• Create B-tree index on PJOIN(ID1)

Use PJOIN table to match catalogue rows
• Three-way join then produces required results, e.g.

    SELECT cols
    FROM CAT1, PJOIN, CAT2
    WHERE CAT1.ID1 = PJOIN.ID1
      AND PJOIN.ID2 = CAT2.ID2
      AND (2 * asin(sqrt(pow(sin((cat1.dec - cat2.dec)/2), 2)
             + cos(cat1.dec) * cos(cat2.dec)
             * pow(sin((cat1.ra - cat2.ra)/2), 2)))
           <= cat1.poserr + cat2.poserr);

(3) True Multi-dimensional Indexing
• Hot topic of research in computer science departments for more than 20 years
• Very many algorithms have been proposed:
  – BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hB-tree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree, Z-order index.
  – So many alternatives, but none of them provides a good general solution, as the B-tree does for 1-D indexing.
• R-tree indexing is built into several modern DBMS.
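Returning to the PCODE join described under "Spatial Join using PCODE": the sequence of steps (build P1 and P2, join on pixel code, remove duplicates, then filter with the great-circle distance) can be sketched in a few lines. This is a minimal sketch under loud assumptions: `toy_pcodes` is a hypothetical stand-in that uses a plain RA/Dec grid rather than HEALPix or HTM, it ignores RA wrap-around and the poles, and all names and test positions are invented for illustration. Angles are in radians, as the SQL formula assumes.

```python
import math
from collections import defaultdict

ARCSEC = math.radians(1.0 / 3600.0)


def great_circle(ra1, dec1, ra2, dec2):
    """Haversine great-circle distance in radians (same formula as the SQL)."""
    h = (math.sin((dec1 - dec2) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra1 - ra2) / 2) ** 2)
    return 2 * math.asin(math.sqrt(min(1.0, h)))


def toy_pcodes(ra, dec, err, cell):
    """ASSUMPTION: toy grid pixels, not HEALPix/HTM; no RA wrap, no poles.
    Returns every grid cell touched by the box enclosing the error circle."""
    dra = math.asin(min(1.0, math.sin(err) / math.cos(dec)))
    cols = range(math.floor((ra - dra) / cell), math.floor((ra + dra) / cell) + 1)
    rows = range(math.floor((dec - err) / cell), math.floor((dec + err) / cell) + 1)
    return {(i, j) for i in cols for j in rows}


def pcode_join(cat1, cat2, cell):
    """cat1, cat2 map id -> (ra, dec, poserr). Returns (PJOIN pairs, matches)."""
    p2 = defaultdict(set)                      # table P2: pcode -> ids
    for id2, (ra, dec, err) in cat2.items():
        for code in toy_pcodes(ra, dec, err, cell):
            p2[code].add(id2)
    pjoin = set()                              # a set acts as SELECT DISTINCT
    for id1, (ra, dec, err) in cat1.items():   # rows of P1, joined on pcode
        for code in toy_pcodes(ra, dec, err, cell):
            for id2 in p2.get(code, ()):
                pjoin.add((id1, id2))
    matches = []
    for id1, id2 in sorted(pjoin):             # final filter on true distance
        ra1, dec1, e1 = cat1[id1]
        ra2, dec2, e2 = cat2[id2]
        if great_circle(ra1, dec1, ra2, dec2) <= e1 + e2:
            matches.append((id1, id2))
    return pjoin, matches


ra0, dec0 = math.radians(10.005), math.radians(20.005)
cat1 = {1: (ra0, dec0, ARCSEC),
        2: (math.radians(50.005), math.radians(-30.005), ARCSEC)}
cat2 = {"a": (ra0, dec0 + 1.5 * ARCSEC, ARCSEC),    # 1.5 arcsec away: a match
        "b": (ra0, dec0 + 10 * ARCSEC, ARCSEC),     # same pixel, too far away
        "c": (math.radians(200.005), math.radians(5.005), ARCSEC)}
candidates, matches = pcode_join(cat1, cat2, cell=math.radians(1.0 / 60.0))
print(sorted(candidates), matches)
```

With one-arcminute cells, source "b" shares a pixel with source 1 and so appears in the candidate list, but only "a" survives the distance test. This is the "near but not overlapping" case that the final WHERE clause removes.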

Spatial Options in current DBMS
Commercial:
    DB2          Spatial Extender – multi-level grid file
    Ingres       None
    Oracle       Spatial Option – R-tree (?)
    SQL Server   None
    Sybase       Spatial Option (Boeing SQS) – R-tree
Open Source:
    MySQL        R-tree in V4.1 (beta, documentation lacking)
    Interbase    None
    PostgreSQL   R-tree

Using R-trees
Used R-trees in Postgres – does what it says on the box.
Problems/limitations include:
• Object indexed by R-tree is a rectangular box, so must draw a box outside each error circle
• Boxes get rather extended (along RA axis) near poles
• Need a subsequent filter to remove spurious matches where rectangles overlap but circles do not.
• R-tree indices are large, creation is slow (2 hours for table of 3.5 million rows using Postgres).
  – Kalpakis et al. used Informix to load part of USNO-A2 and found data load and R-tree creation would have taken 39 days for the entire 500M row table.

Comparison of PCODE and R-tree
• Advantages
  – PCODE join seems to be faster (but not yet benchmarked with identical systems).
  – Takes up less disc space in total.
  – Can use any DBMS, not just those with an R-tree or other spatial data option.
• Disadvantages
  – Additional tables and indices have to be created.
  – More complex set of joins.
  – Needs external code, as neither HTM nor HEALPix can be expressed as an SQL-callable function (they return a variable-length array of integers).

Conclusions
• Indexing on just one spatial axis is simply too inefficient for large tables.
• R-trees are powerful and easy to use, but index creation times are a serious cause for concern.
• 2-d to 1-d mapping functions such as HTM or HEALPix are more complicated to use, but may be worthwhile for JOINs if they turn out to be faster.
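The box-drawing step mentioned under "Using R-trees" can be illustrated with a short sketch. The function name `error_circle_bbox` is hypothetical and is not taken from the talk or from any DBMS; angles are in degrees, and RA wrap-around at 0/360 is deliberately not handled, so the returned RA limits may fall outside that range.

```python
import math


def error_circle_bbox(ra, dec, err):
    """Box (ra_min, ra_max, dec_min, dec_max) enclosing an error circle.
    ASSUMPTION: degrees throughout; RA limits are not wrapped into [0, 360)."""
    dec_min = max(dec - err, -90.0)
    dec_max = min(dec + err, 90.0)
    if abs(dec) + err >= 90.0:
        # The circle contains a pole, so it covers every RA value.
        return (0.0, 360.0, dec_min, dec_max)
    # Half-width in RA of a small circle on the sphere.
    ratio = math.sin(math.radians(err)) / math.cos(math.radians(dec))
    dra = math.degrees(math.asin(ratio))
    return (ra - dra, ra + dra, dec_min, dec_max)


err = 3.0 / 3600.0                       # 3 arc-second error radius
for dec in (0.0, 60.0, 89.0):
    box = error_circle_bbox(180.0, dec, err)
    half_width_arcsec = (box[1] - box[0]) / 2 * 3600.0
    print(f"Dec {dec:4.1f}: RA half-width {half_width_arcsec:7.2f} arcsec")
```

The RA half-width grows roughly as 1/cos(Dec), which is why boxes become extended near the poles. Any pair of boxes found to overlap by the index still needs the great-circle test to discard cases where the rectangles overlap but the circles do not.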