Spatial Indexing
Clive Page
2003 June 30

Formats of Raw Data
• Radio:
  – Complex visibility for each polarisation at a set of points sampling the complex (u,v) plane.
• Infra-red, Optical, Ultra-violet:
  – Images from 1k×1k to 18k×20k, collected at intervals of a few seconds to several minutes.
• X-ray, Gamma-ray:
  – Photon-event lists: properties (x, y, time, energy) for each detected photon. A list may hold millions of these over an integration period of a few hours.

Formats of Reduced Data
• Images
• Time-series
• Spectra
• Catalogues of astronomical objects:
  – Vital to cross-identify objects from different wavebands; the basis for many subsequent data mining investigations.
  – Problem: tables can be long or wide:

    Catalogue           Rows            Columns
    Optical: USNO-B     1,045,913,669   30
    Infra-red: 2MASS    470,992,970     60
    X-ray: 1XMM         56,711          379

Main Functionality
• SELECT objects in a given small patch of sky
  – In a rectangle – to cover the same region as an image.
  – In a circle – to cover a radius around a point of interest (also known as a cone search).
  – In a polygon – e.g. around an extended object.
• Spatial JOIN: cross-match objects from, for example, two wavebands or two epochs.
  – Principal matching criterion is overlap of error-circles.
  – Often important to find objects in one table which are NOT matched in the other: this needs a left outer join.

Functionality (continued)
• Self JOIN useful in some applications, for example:
  – Search for clusters of galaxies
  – Search for double stars
  – Find spatial distribution or spatial correlation functions for particular classes of objects
  – Those observing with adaptive optics need a suitable reference star in or near each field, e.g. might want to find all AGNs near bright stars to plan an observing campaign.

Current on-line services
• Search in a cone – many on-line services
• Cross-matching (spatial JOIN) – few current examples:
  – Astrobrowse (at GSFC) has pre-computed joins for a limited number of important catalogues.
  – Vizier service (CDS) allows cross-match of user’s own table of positions with a list of catalogues.
    • But results for a list of N sources and M catalogues are presented as an unmerged list of M*N separate tables.
  – Skyserver – joins between SDSS and other tables.
• Self-join: do not know of any on-line services at present.

Problems handling object catalogues
• Positions use spherical-polar coordinates (RA, Dec)
  – Right Ascension corresponds to geographic longitude
  – Declination corresponds to geographic latitude
• There are singularities at the poles and distortions in the scales everywhere except at the equator.
• RA wraps from 24 hours (360 degrees) to zero.
• All object positions are imprecise: each position has an error radius.
• Distances between points must use a great-circle distance function, not cartesian distance.
• Two-dimensional indexing is really needed.

Indexing Possibilities
1. Home-brew slicing of the sky
2. Use B-tree on one spatial axis only
3. Use 2-d to 1-d mapping function then simple B-tree
4. Use true spatial index such as R-tree.

Home-brew slicing of sky
• Both USNO-B and 2MASS issued as separate files, each covering 0.1 degree in declination, sorted in RA within each strip.
• Software from Harvard-Smithsonian Center for Astrophysics (WCSTOOLS) allows fairly efficient access to data stored in these strips.
• WCSTOOLS is used by many astronomical archives around the world (including LEDAS, www.ledas.ac.uk).
• OK for cone-search services, not suitable for spatial joins.

Index one spatial axis only
• Widely used, including Vizier collection of ~3000 astronomical tables.
• Index on declination (avoids RA wrap-around problem)
• Poor (but acceptable) performance on cone-search:
  – E.g. consider USNO-B: a table of a billion rows
  – Typical search/join uses a radius of say 3 arc-seconds.
  – 17% chance of finding a match at a random position.
  – Index on declination effectively searches a strip 360° x 6 arc-seconds: get around 10,000 rows matching. Need to check all these to find the (0 or 1) true matches.
• Conclusion: might gain five orders of magnitude in efficiency by using a true 2-d index.

One-dimensional index: spatial join problem
• Very inefficient, and efficiency is needed when joining one large table with another
• Join criterion is, typically:
  – SELECT * FROM cat1,cat2 WHERE ABS(dec1 - dec2) < combinedError AND …
  – I have not yet found any DBMS with an optimiser which uses an index when confronted with a join criterion of this form (or indeed any other expression for joining on an overlap of error ranges).

Mapping functions (2-d to 1-d)
• Cover the space with cells (pixels) and number them.
• Create conventional B-tree on resulting set of integers.
• Each point in the sky maps to a single integer.
• An area maps to a set of integers.
• A seductive idea:
  – If a small spatial area maps to a smallish range of integers, then a search over a spatial area might be done by a B-tree search over a small integer range.
  – Various space-filling curves have been used in the hope that this works in practice, e.g. Z-order index, Hilbert, Peano curves, and many others.

Hilbert Curve
[Figure: Hilbert curve]

Z-order (bit-interleaved) Mapping Function
[Figure: Z-order (bit-interleaved) mapping]

Space-filling Curves as Mapping Functions
• Excellent performance when searching for single points.
• For area searches, the median performance is adequate, but all curves suffer the same drawback:
  – There exist cases in which nearby points in space are very far apart on the curve (in the Z-order index this corresponds to a high-order bit flipping).
  – Performance tests confirm this defect: the worst-case performance is so abysmal that the average performance is very poor.
• Another problem: simple cartesian grids are also unsuited to spherical-polar coordinate searches, as there are too many tiny distorted pixels near the poles.

Better Mapping Functions
Aim: cover sky uniformly with pixels - integer pixel-codes.
• HEALPix - Hierarchical Equal Area iso-Latitude Pixelisation – invented at ESO for COBE.
  – Pseudo-square pixels, 2-d arrays of pixels suitable for analysis of large-scale spatial structures.
• HTM - Hierarchical Triangular Mesh – invented at Johns Hopkins University – now used by several projects.
  – Triangular pixels.
  – Hierarchical numbering: high-order bits are a valid HTM index for a coarser grid.
• Both algorithms are perfectly good when searching for points.

Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix)
[Figure: HEALPix pixelisation]

Hierarchical Triangular Mesh (HTM)
[Figure: HTM pixelisation]

Spatial Join using Pixel-code Method
Given table CAT1 with columns:
• ID1 – primary key
• RA
• DEC
• POSERR
• MAGNITUDE
• etc
Given table CAT2 with columns:
• ID2 – primary key
• RA
• DEC
• POSERR
• FLUX
• etc

Create tables P1, P2 with pixel-code column
Create table P1 (from CAT1):
• ID1 – foreign key
• PCODE1 – primary key
Create table P2 (from CAT2):
• ID2 – foreign key
• PCODE2 – primary key
Note: P1 and P2 have extra rows where an error-circle overlaps two or more pixels.
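
The construction of P1 and P2 can be sketched as below. This is only an illustration: it uses a plain equal-angle RA/Dec grid as a stand-in for HTM or HEALPix (which would avoid the tiny polar pixels), and the names `CELL_DEG`, `ra_half_width`, `pixel_codes` and `build_pixel_table` are hypothetical. Coordinates and error radii are assumed to be in degrees. An object whose error circle straddles a cell boundary produces more than one row, which is the "extra rows" case noted above.

```python
import math

CELL_DEG = 1.0                      # hypothetical pixel size for this sketch
N_RA = round(360.0 / CELL_DEG)      # number of cells around the RA axis
N_DEC = round(180.0 / CELL_DEG)     # number of cells along the Dec axis


def ra_half_width(dec_deg, radius_deg):
    """Half-width in RA (degrees) of a circle on the sky; 180 if it encloses a pole."""
    if abs(dec_deg) + radius_deg >= 90.0:
        return 180.0
    s = math.sin(math.radians(radius_deg)) / math.cos(math.radians(dec_deg))
    return math.degrees(math.asin(min(1.0, s)))


def pixel_codes(ra_deg, dec_deg, poserr_deg):
    """Codes of all grid cells touched by the box enclosing one error circle."""
    dec_lo = max(-90.0, dec_deg - poserr_deg)
    dec_hi = min(90.0, dec_deg + poserr_deg)
    row_lo = min(N_DEC - 1, int((dec_lo + 90.0) // CELL_DEG))
    row_hi = min(N_DEC - 1, int((dec_hi + 90.0) // CELL_DEG))
    half = ra_half_width(dec_deg, poserr_deg)
    if half >= 180.0:
        cols = range(N_RA)          # circle reaches a pole: take every RA column
    else:
        col_lo = math.floor((ra_deg - half) / CELL_DEG)
        col_hi = math.floor((ra_deg + half) / CELL_DEG)
        cols = {c % N_RA for c in range(col_lo, col_hi + 1)}  # modulo handles RA wrap
    return sorted(row * N_RA + col
                  for row in range(row_lo, row_hi + 1) for col in cols)


def build_pixel_table(catalogue):
    """catalogue: iterable of (obj_id, ra, dec, poserr) -> list of (obj_id, pcode) rows."""
    return [(obj_id, code)
            for obj_id, ra, dec, err in catalogue
            for code in pixel_codes(ra, dec, err)]


cat1 = [(1, 10.5, 20.5, 0.001),      # well inside one cell: one row
        (2, 10.9999, 20.5, 0.001)]   # straddles a cell boundary: two rows
p1 = build_pixel_table(cat1)
```

The box around each circle is a superset of the circle, so a few cells may be listed that the circle does not quite reach; this only adds candidates that the later distance test discards.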

Join P1 and P2 on pixel-codes creating PJOIN
Create table PJOIN by joining P1 and P2 on PCODE1=PCODE2, with columns:
• ID1
• ID2

PJOIN table
• Join using SELECT DISTINCT to remove any duplicates (two error-circles may each overlap two or more pixels).
• Table PJOIN identifies all pixels where error-circles potentially overlap
  – Circles may or may not actually overlap; they may just be nearby in the same pixel.
• Next step: create B-tree index on PJOIN(ID1)

Use PJOIN table to match catalogue rows
• Three-way join then produces required results, e.g.
    SELECT cols FROM CAT1, PJOIN, CAT2
    WHERE CAT1.ID1=PJOIN.ID1 AND PJOIN.ID2=CAT2.ID2
    AND (2 * asin(sqrt(pow(sin((cat1.dec - cat2.dec)/2),2) +
         cos(cat1.dec) * cos(cat2.dec) * pow(sin((cat1.ra - cat2.ra)/2),2)))
         <= cat1.poserr+cat2.poserr) ;
• This works, and speed appears good, but more testing is needed on a wider range of datasets.

True Spatial Indexing
• Hot topic of research in computer science departments for more than 20 years
• Very many algorithms have been proposed:
  – BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hB-tree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree.
  – So many alternatives, but none of them has properties as good as the B-tree in one dimension (e.g. compact and efficient, with fairly good worst-case performance).
• R-tree: one of the earliest structures, and one of the best.
  – R-trees are built into several modern DBMS.
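
Returning to the pixel-code method, the PJOIN step and the final three-way join can be tried end to end with a small in-memory database. The sketch below uses Python's sqlite3 module purely for illustration; the table contents, the pixel codes, the `_rad` column names and the `separation` helper are all made up for the example, and angles are assumed to be in radians. The helper is registered as an SQL function because trigonometric functions are not available in every SQLite build.

```python
import math
import sqlite3


def separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians (haversine form, as in the three-way join)."""
    h = (math.sin((dec1 - dec2) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra1 - ra2) / 2) ** 2)
    return 2 * math.asin(math.sqrt(min(1.0, h)))


conn = sqlite3.connect(":memory:")
conn.create_function("separation", 4, separation)
conn.executescript("""
    CREATE TABLE cat1 (id1 INTEGER PRIMARY KEY, ra_rad REAL, dec_rad REAL, poserr_rad REAL);
    CREATE TABLE cat2 (id2 INTEGER PRIMARY KEY, ra_rad REAL, dec_rad REAL, poserr_rad REAL);
    CREATE TABLE p1 (id1 INTEGER, pcode1 INTEGER);
    CREATE TABLE p2 (id2 INTEGER, pcode2 INTEGER);
    CREATE INDEX p1_pcode ON p1(pcode1);
    CREATE INDEX p2_pcode ON p2(pcode2);
""")
conn.executemany("INSERT INTO cat1 VALUES (?, ?, ?, ?)",
                 [(1, 0.100000, 0.200000, 1e-5), (2, 0.300000, 0.200000, 1e-5)])
conn.executemany("INSERT INTO cat2 VALUES (?, ?, ?, ?)",
                 [(10, 0.100005, 0.200005, 1e-5),    # true counterpart of object 1
                  (11, 0.100100, 0.200000, 1e-5),    # same pixel as object 1, too far away
                  (12, 0.900000, -0.300000, 1e-5)])  # elsewhere on the sky
# Made-up pixel codes; objects 1 and 10 each straddle pixels 500 and 501.
conn.executemany("INSERT INTO p1 VALUES (?, ?)", [(1, 500), (1, 501), (2, 700)])
conn.executemany("INSERT INTO p2 VALUES (?, ?)",
                 [(10, 500), (10, 501), (11, 500), (12, 900)])

# Candidate pairs: DISTINCT removes the duplicate (1, 10) found via both pixels.
conn.executescript("""
    CREATE TABLE pjoin AS
        SELECT DISTINCT p1.id1, p2.id2 FROM p1 JOIN p2 ON p1.pcode1 = p2.pcode2;
    CREATE INDEX pjoin_id1 ON pjoin(id1);
""")
candidates = sorted(conn.execute("SELECT id1, id2 FROM pjoin").fetchall())

# Three-way join keeps only pairs whose error-circles really overlap.
matches = conn.execute("""
    SELECT cat1.id1, cat2.id2
    FROM cat1, pjoin, cat2
    WHERE cat1.id1 = pjoin.id1 AND pjoin.id2 = cat2.id2
      AND separation(cat1.ra_rad, cat1.dec_rad, cat2.ra_rad, cat2.dec_rad)
          <= cat1.poserr_rad + cat2.poserr_rad
""").fetchall()
```

Here PJOIN holds two candidate pairs, (1, 10) and (1, 11), and only (1, 10) survives the great-circle test. Nothing can be read from this toy example about performance on catalogue-sized tables.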

Spatial Options in current DBMS

    DBMS                    Spatial Index
    DB2                     Spatial Extender: multi-level grid file
    Informix                R-tree indexing included
    Ingres                  (none listed)
    Interbase/Firebird      (none listed)
    Microsoft SQL Server    (none listed)
    MySQL                   R-trees on arbitrary polygons
    Oracle                  Spatial option: R-trees
    Postgres                R-trees on rectangular boxes
    Sybase                  Spatial Option (Boeing SQS): R-trees

Using R-trees
• Must draw a rectangular box outside each error circle
• Boxes get rather extended (along RA axis) near poles
• Need a subsequent filter to remove spurious matches where rectangles overlap but error-circles do not.
• If the error criterion alters (e.g. user wants 99% probability circle rather than 90%), need to recreate the column of boxes, and then recreate the R-tree index.
  – Solution: always use a box as large as anyone could possibly want; subsequent filtering on error-circles is still quite cheap.

R-tree Performance
• Postgres: R-tree indexing works as advertised
  – R-trees are large, creation is slow, e.g. 2 hours for table of 3.5 million rows.
• MySQL: latest version allows R-trees on any polygon (but anything above 4 sides is wasteful).
  – Very verbose external data format
  – Works as advertised, not yet measured performance.
• Informix: Kalpakis et al. (ADASS conference report) loaded part of USNO-A2 and found data load and R-tree creation would have taken 39 days for the entire table of 500 million rows.

Comparison of Pixel-code and R-tree Methods
• Advantages of the pixel-code method
  – Pcode join seems to be faster (but not yet benchmarked with identical systems).
  – Takes up less disc space in total.
  – Can use any DBMS, not just those with an R-tree or other spatial data option.
• Disadvantages of the pixel-code method
  – Additional tables and indices have to be created.
  – More complex set of joins.
  – Needs external code, as neither HTM nor HEALPix can be expressed as an SQL-callable function (because they return a variable-length array of integers).
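
The box-then-filter procedure described under "Using R-trees" can be illustrated without any DBMS. The sketch below is an assumption-laden example, not taken from any of the products listed: the function names are hypothetical, angles are in degrees, and RA limits are left unwrapped, so a box crossing RA = 0 would need splitting before being stored in a real index. It shows a pair of objects whose boxes overlap although their error-circles do not, and how much wider a box becomes in RA near the pole.

```python
import math

ARCSEC = 1.0 / 3600.0   # one arc-second in degrees


def error_box(ra_deg, dec_deg, radius_deg):
    """Box (ra_min, ra_max, dec_min, dec_max) in degrees enclosing an error circle."""
    dec_min = max(-90.0, dec_deg - radius_deg)
    dec_max = min(90.0, dec_deg + radius_deg)
    if abs(dec_deg) + radius_deg >= 90.0:
        half = 180.0    # circle encloses a pole: every RA is covered
    else:
        s = math.sin(math.radians(radius_deg)) / math.cos(math.radians(dec_deg))
        half = math.degrees(math.asin(min(1.0, s)))
    return (ra_deg - half, ra_deg + half, dec_min, dec_max)


def boxes_overlap(a, b):
    """Rectangle intersection test, the kind of predicate an R-tree answers."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec1 - dec2) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra1 - ra2) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, h))))


# Two objects with 1 arcsec error radii, offset diagonally by 1.8 arcsec on each axis.
a = (10.0, 0.0, 1.0 * ARCSEC)
b = (10.0 + 1.8 * ARCSEC, 1.8 * ARCSEC, 1.0 * ARCSEC)
candidate = boxes_overlap(error_box(*a), error_box(*b))              # boxes do overlap
true_match = separation_deg(a[0], a[1], b[0], b[1]) <= a[2] + b[2]   # circles do not

# RA width of the box for the same 1 arcsec circle at the equator and at Dec = 89 degrees.
box_eq = error_box(10.0, 0.0, ARCSEC)
box_89 = error_box(10.0, 89.0, ARCSEC)
stretch = (box_89[1] - box_89[0]) / (box_eq[1] - box_eq[0])
```

The diagonal pair is the spurious case that the subsequent filter has to remove, and `stretch` is roughly 1/cos(89°), i.e. the box is several tens of times wider in RA than at the equator.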

Indexing: summary
• Indexing on just one spatial axis is simply too inefficient for large tables and cannot support joins.
• R-trees are powerful and easy to use, but index creation times are a serious cause for concern.
• 2-d to 1-d mapping functions such as HTM or HEALPix are more complicated to use but may be faster in some cases.
• AstroGrid will continue work in this area.

Idea: HTM as Universal Position Locator?
Can choose HTM (or HEALPix) resolution as high as needed
• 32-bit words: pixels around 24 arcsec on a side.
  – More pixels than objects in largest current catalogue
• 64-bit words: pixels around 0.4 milliarcsec on a side.
  – Ample precision for some time to come
Maybe every object catalogue should list the Universal Position Locator (UPL) of each celestial object, allowing fast and easy searches and joins?
• Fine for points in sky, but real objects have finite extent, and some will correspond to more than one UPL
• Some objects move – should pixel-code change with time?
• Reference frame also moves

UPLs for extended objects: possible work-arounds
• Choose a pixel size large enough to encompass the error region in each case
  – In some cases this would be very large indeed
  – Different objects would have different pixel size
• Have extra rows in tables for objects crossing pixel boundaries.
  – Need to alter original tables, not always acceptable.
• Add variable-length vector to table to contain list of UPLs
  – Non-normalised table; not allowed by most DBMS
• Introduce separate table to map object IDs to UPLs.
  – Adds significant complexity to search/join operations

Speeding up Spatial Data Access
Use Parallel Hardware such as Beowulf Clusters
• We may be able to spread a large table over the disc drives of several nodes in a cluster for faster searches.
• Some support for this is built into DBMS such as Oracle, DB2 and SQL Server. There is currently nothing to support it in Open Source products such as MySQL and Postgres.
• Possible do-it-yourself approach: split table into separate DBMS instances on separate nodes. Farm out query to all instances, then combine the results.
Column-oriented storage
• Have tested Sybase-IQ: performance good, but expensive. No multi-dimensional indexing.

Can we do Distributed Searches and Joins?
• Object catalogues of interest are located on data archive servers all over the world. Is it feasible to search or join without copying entire tables?
  – Cone search services are becoming widespread. A few servers allow a single search to be farmed out to many servers, but no attempt is made to merge the results: they are just concatenated. Room for improvement here?
  – An outer join requires every row in table#1 to be present in the output, so there is little point in attempting a join over the network, when it is simpler to copy table#1 to the host holding table#2.
• Perhaps need to set up an astronomical data warehouse with services to import tables and join them.