Indexing the Sky Clive Page 2003 Apr 8 1

advertisement
Indexing the Sky
Clive Page
2003 Apr 8
1
2003 Apr 8
2
Formats of Raw Data
• Radio:
– Complex visibility for each polarisation at set of points
sampling the (u,v) plane.
• Infra-red, Optical, Ultra-violet:
– Images from 1k×1k to 18k×20k, collected every few
seconds or few minutes.
• X-ray, Gamma-ray:
– Lists of detected photons (x, y, time, energy) typically
accumulated for several hours.
2003 Apr 8
3
Formats of Reduced Data
•
•
•
•
Images
Time-series
Spectra
Source Catalogues:
– Vital to cross-identify sources from different wavebands, basis for
many subsequent data mining investigations.
– Problem: can be large, examples:
USNO-B
1st XMM-Newton catalogue
2003 Apr 8
1,045,913,669 rows
30 columns
56,711 rows
379 columns
4
Required Functionality
• SELECT sources in given small patch of sky (circle,
rectangle, or polygon)
• JOIN two tables e.g. from different wavebands to find
corresponding sources
– Principal matching criterion is positional match typically overlap of error-circles.
2003 Apr 8
5
Problems handling source catalogues
• Positions use spherical-polar coordinates (RA, Dec)
– Right Ascension corresponds to geographic longitude
– Declination corresponds to geographic latitude
• There are singularities at the poles and distortions in the
scales everywhere except at the equator.
• RA wraps from 24 hours (360 degrees) to zero.
• Two-dimensional indexing is really needed.
• All source positions are imprecise  points have an error
radius.
• Distances between points must use a great-circle distance
function not cartesian distance.
2003 Apr 8
6
Indexing Possibilities
1. Use simple B-tree on one spatial axis only
2. Use 1-d to 2-d mapping function then B-tree
3. Use spatial index such as R-tree
2003 Apr 8
7
(1) Index one spatial axis only
• For example consider USNO-B: a table of a billion rows.
• Typical search/join uses a radius of say 3 arc-seconds.
• Probability of finding a source in a circle of radius 3 arcseconds in a random position is around 17%, so most
searches find 0 or 1 rows.
• An index on just one coordinate (say Dec) will effectively
search a strip 360° wide by 6 arc-seconds high, and will
find some 10,000 rows matching. These have to be
scanned sequentially to find at most one matching row.
• Conclusion: a true 2-d index can gain five orders of
magnitude in efficiency.
2003 Apr 8
8
(2) 2-d to 1-d mapping
•
•
•
•
Cover the space with cells (pixels) and number them.
Create conventional B-tree on resulting set of integers.
Each point maps to an integer.
Areas map to a list of integers:
– Ideally a small spatial area maps to a small range of
integers so one can do a range search using the B-tree.
– Various space-filling curves such as the Z-ordering
index and Peano Curve have been used in the hope that
this works…
2003 Apr 8
9
Z-order mapping function
2003 Apr 8
10
Space-filling Curves
• All have same failing:
– At some places in the grid a high-order bit flips and the
range of integers becomes huge.
– Tests confirm this defect: the worst-case performance is
rather poor.
• Simple cartesian grids also unsuited to spherical-polar
coordinates as there are too many tiny pixels near the poles.
2003 Apr 8
11
Covering the sky evenly with pixels
• Hierarchical Equal Area iso-Latitude Pixelisation
(HEALPix) – invented at European Southern Observatory.
• Hierarchical Triangular Mesh (HTM) – invented at Johns
Hopkins University
• Can use either algorithm – call it pixel-code or PCODE for
short
– Do not try to conduct spatial range search using range of
PCODE values.
2003 Apr 8
12
Hierarchical Equal Area iso-Latitude Pixelisation
(HEALPix)
2003 Apr 8
13
Hierarchical Triangular Mesh (HTM)
2003 Apr 8
14
Spatial Join using PCODE
Table CAT1 has columns
• ID1
• RA
• DEC
• POSERR
• MAGNITUDE
• etc
2003 Apr 8
Table CAT2 has columns
• ID2
• RA
• DEC
• POSERR
• FLUX
• etc
15
Create additional tables with PCODE values
Table CAT1 has columns
• ID1 – primary key
• RA
• DEC
• POSERR
• MAGNITUDE
• Etc
•
•
•
•
•
•
•
Table P1 has columns
• ID1
• PCODE1 – primary key
Table P2 has columns
• ID2
• PCODE2 – primary key
2003 Apr 8
Table CAT2 has columns
ID2 – primary key
RA
DEC
POSERR
FLUX
Etc
16
JOIN the two PCODE tables
Note: tables P1, P2 have extra rows when error-circles overlap
more than one pixel.
• Join P1 and P2 on PCODE1=PCODE2 making a table
PJOIN with just two columns: ID1 and ID2.
• Use SELECT DISTINCT to remove any duplicates
• Table PJOIN identifies pixels which may contain sources
with overlapping error circles (or they may just be near but
not overlapping)
• Create B-tree index on PJOIN(ID1)
2003 Apr 8
17
Use PJOIN table to match catalogue rows
• Three-way join then produces required results, e.g.
SELECT cols FROM CAT1, PJOIN, CAT2
WHERE CAT1.ID1=PJOIN.ID1
AND PJOIN.ID2=CAT2.ID2
AND (2 * asin(sqrt(pow(sin((cat1.deccat2.dec)/2),2) + cos(cat1.dec) *
cos(cat2.dec) * pow(sin((cat1.racat2.ra)/2),2))) <=
cat1.poserr+cat2.poserr) ;
2003 Apr 8
18
(3) True Multi-dimensional Indexing
• Hot topic of research in computer science departments for
more than 20 years
• Very many algorithms have been proposed:
– BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hBtree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree, Zorder index.
– So many alternatives, but none of them provides a good
general solution, like the B-tree in 1-D indexing.
• R-tree indexing is built into several modern DBMS.
2003 Apr 8
19
Spatial Options in current DBMS
Commercial:
DB2
Spatial Extender – multi-level grid file
Ingres
Oracle
None
Spatial Option – R-tree (?)
SQL Server
Sybase
None
Spatial Option (Boeing SQS) – R-tree
Open Source:
MySQL
R-tree in V4.1 (beta, documentation lacking)
Interbase
PostgreSQL
2003 Apr 8
None
R-tree
20
Using R-trees
Used R-trees in Postgres – does what it says on the box.
Problems/limitations include:
• Object indexed by R-tree is a rectangular box, so must draw
a box outside each error circle
• Boxes get rather extended (along RA axis) near poles
• Need a subsequent filter to remove spurious matches where
rectangles overlap but circles do not.
• R-tree indices are large, creation is slow (2 hours for table
of 3.5 million rows using Postgres).
– Kalpakis et al. used Informix to load part of USNO-A2
and found data load and R-tree creation would have
taken 39 days for the entire 500M row table.
2003 Apr 8
21
Comparison of PCODE and R-tree
• Advantages
– PCODE join seems to be faster (but not yet
benchmarked with identical systems).
– Takes up less disc space in total.
– Can use any DBMS, not just those with an R-tree or
other spatial data option.
• Disadvantages
– Additional tables and indices have to be created
– More complex set of joins.
– Needs external code as neither HTM or HEALPix can
be expressed as an SQL-callable function (they return a
variable-length array of integers).
2003 Apr 8
22
Conclusions
• Indexing on just one spatial axis is simply too inefficient
for large tables.
• R-trees are powerful and easy to use, but index creation
times are a serious cause for concern.
• 2d1d mapping functions such as HTM or HEALPix are
more complicated to use, but may be worthwhile for JOINs
if they turn out to be faster.
2003 Apr 8
23
Download