Spatial Indexing Clive Page 2003 June 30 1

advertisement
Spatial Indexing
Clive Page
2003 June 30
1
2003 June 30
2
Formats of Raw Data
• Radio:
– Complex visibility for each polarisation at set of points
sampling the complex (u,v) plane.
• Infra-red, Optical, Ultra-violet:
– Images from 1k×1k to 18k×20k, collected at intervals of
a few seconds to several minutes.
• X-ray, Gamma-ray:
– Photon-even lists: get properties (x, y, time, energy) for
each detected photon. May get a list of millions of these
over an integration period of a few hours.
2003 June 30
3
Formats of Reduced Data
•
•
•
•
Images
Time-series
Spectra
Catalogues of astronomical objects:
– Vital to cross-identify objects from different wavebands, basis for
many subsequent data mining investigations.
– Problem: tables can be long or wide:
Catalogue
Optical: USNO-B
Infra-red: 2MASS
X-ray:
2003 June 30
1XMM
Rows
Columns
1,045,913,669
30
470,992,970
60
56,711
379
4
Main Functionality
• SELECT objects in a given small patch of sky
– In a rectangle – to cover same region as an image
– In a circle – to cover a radius around point of interest
(also known as search in a cone).
– In a polygon – e.g. around extended object.
• Spatial JOIN: Cross-match objects from, for example,
two wavebands or two epochs.
– Principal matching criterion is overlap of error-circles.
– Often important to find objects in one table which are
NOT matched in the other: need a left outer join.
2003 June 30
5
Functionality (continued)
• Self JOIN useful in some applications, for example:
– Search for clusters of galaxies
– Search for double stars
– Find spatial distribution or spatial correlation functions
for particular classes of objects
– Those observing with adaptive optics need a suitable
reference star in or near each field, e.g. might want to
find all AGNs near to bright stars to plan an observing
campaign.
2003 June 30
6
Current on-line services
• Search in a cone – many on-line services
• Cross-matching (spatial JOIN) – few current examples:
– Astrobrowse (at GSFC) has pre-computed joins for a
limited number of important catalogues.
– Vizier service (CDS) allows cross-match of user’s own
table of positions with list of catalogues.
• But results of list of N sources and M catalogues are
presented as unmerged list of M*N separate tables.
– Skyserver – joins between SDSS and other tables.
• Self-join: do not know of any on-line services at present.
2003 June 30
7
Problems handling object catalogues
• Positions use spherical-polar coordinates (RA, Dec)
– Right Ascension corresponds to geographic longitude
– Declination corresponds to geographic latitude
• There are singularities at the poles and distortions in the
scales everywhere except at the equator.
• RA wraps from 24 hours (360 degrees) to zero.
• All object positions are imprecise  positions have an
error radius.
• Distances between points must use a great-circle distance
function not cartesian distance.
• Two-dimensional indexing is really needed
2003 June 30
8
Indexing Possibilities
1. Home-brew slicing of the sky
2. Use B-tree on one spatial axis only
3. Use 1-d to 2-d mapping function then simple B-tree
4. Use true spatial index such as R-tree.
2003 June 30
9
Home-brew slicing of sky
• Both USNO-B and 2MASS issued as separate files, each
covering 0.1 degree in declination, sorted in RA within
each strip.
• Software from Harvard-Smithsonian Center for
Astrophysics (WCSTOOLS) allows fairly efficient access
to data stored in these strips.
• WFCSTOOLS is used by many astronomical archives
around the world, (including LEDAS www.ledas.ac.uk).
• Ok for cone-search services, not suitable for spatial joins.
2003 June 30
10
Index one spatial axis only
• Widely used, including Vizier collection of ~3000
astronomical tables.
• Index on declination (avoids RA wrap-around problem)
• Poor (but acceptable) performance on cone-search:
– E.g. consider USNO-B: a table of a billion rows
– Typical search/join uses a radius of say 3 arc-seconds.
– 17% chance of finding a match at a random position.
– Index on declination effectively searches a strip 360° x
6 arc-seconds: get around 10,000 rows matching. Need
to check all these to find the (0 or 1) true matches.
• Conclusion: might gain five orders of magnitude in
efficiency by using a true 2-d index.
2003 June 30
11
One-dimensional index: spatial join problem
• Very inefficient, and efficiency is needed when joining one
large table with another
• Join criterion is, typically:
– SELECT * FROM cat1,cat2 WHERE
ABS(dec1 - dec2) < combinedError
AND …
– I have not yet found any DBMS with an optimiser
which uses an index when confronted with a join
criterion of this form (or indeed any other expression for
joining on an overlap of error ranges).
2003 June 30
12
Mapping functions (2-d to 1-d)
•
•
•
•
•
Cover the space with cells (pixels) and number them.
Create conventional B-tree on resulting set of integers.
Each point in the sky maps to a single integer.
An area maps to a set of integers.
A seductive idea:
– If a small spatial area maps to a smallish range of
integers, then a search over a spatial area might be done
by a B-tree search over a small integer range.
– Various space-filling curves have been used in the hope
that this works in practice, e.g. Z-order index, Hilbert,
Peano curves, and many others.
2003 June 30
13
Hilbert Curve
2003 June 30
14
Z-order (bit-interleaved) Mapping Function
2003 June 30
15
Space-filling Curves as Mapping Functions
• Excellent performance when searching for single points.
• For area searches, the median performance is adequate, but
all curves suffer the same drawback:
– There exist cases in which nearby points in space are
very far apart on the curve (in the Z-order index this
corresponds to a high-order bit flipping).
– Performance tests confirm this defect: the worst-case
performance is so abysmal that the average performance
is very poor.
• Another problem: simple cartesian grids also unsuited to
spherical-polar coordinate searches as there are too many
tiny distorted pixels near the poles.
2003 June 30
16
Better Mapping Functions
Aim: cover sky uniformly with pixels - integer pixel-codes.
• HEALPix - Hierarchical Equal Area iso-Latitude
Pixelisation – invented at ESO for COBE.
– Pseudo-square pixels, 2-d arrays of pixels suitable for
analysis of large-scale spatial structures.
• HTM - Hierarchical Triangular Mesh – invented at Johns
Hopkins University – now used by several projects.
– Triangular pixels.
– Hierarchical numbering: high-order bits are a valid
HTM index for coarser grid.
• Both algorithms perfectly good when searching for points.
2003 June 30
17
Hierarchical Equal Area iso-Latitude Pixelisation
(HEALPix)
2003 June 30
18
Hierarchical Triangular Mesh (HTM)
2003 June 30
19
Spatial Join using Pixel-code Method
Given table CAT1 with columns:
• ID1 – primary key
• RA
• DEC
• POSERR
• MAGNITUDE
• etc
2003 June 30
Given table CAT2 with columns:
• ID2 – primary key
• RA
• DEC
• POSERR
• FLUX
• etc
20
Create tables P1, P2 with pixel-code column
Given table CAT1 with columns:
• ID1 – primary key
• RA
• DEC
• POSERR
• MAGNITUDE
• etc
Create table P1:
• ID1 – foreign key
• PCODE1 – primary key
Given table CAT2 with columns:
• ID2 – primary key
• RA
• DEC
• POSERR
• FLUX
• etc
Create table P2:
• ID2 – foreign key
• PCODE2 – primary key
Note: P1 and P2 have extra rows
where error-circle overlaps two
or more pixels.
2003 June 30
21
Join P1 and P2 on pixel-codes creating PJOIN
Given table CAT1 with columns:
• ID1 – primary key
• RA
• DEC
• POSERR
• MAGNITUDE
• etc
Create table P1:
• ID1 – foreign key
• PCODE1 – primary key
Create table PJOIN by joining P1
and P2 on PCODE1=PCODE2
• ID1
• ID2
2003 June 30
Given table CAT2 with columns:
• ID2 – primary key
• RA
• DEC
• POSERR
• FLUX
• etc
Create table P2:
• ID2 – foreign key
• PCODE2 – primary key
22
PJOIN table
• Join using SELECT DISTINCT to remove any duplicates
(two error-circles may each overlap two or more pixels).
• Table PJOIN identifies all pixels where error-circles
potentially overlap
– Circles may or may not actually overlap, may just be
nearby in the same pixels.
• Next step: create B-tree index on PJOIN(ID1)
2003 June 30
23
Use PJOIN table to match catalogue rows
• Three-way join then produces required results, e.g.
SELECT cols FROM CAT1, PJOIN, CAT2
WHERE CAT1.ID1=PJOIN.ID1
AND PJOIN.ID2=CAT2.ID2
AND (2 * asin(sqrt(pow(sin((cat1.deccat2.dec)/2),2) + cos(cat1.dec) *
cos(cat2.dec) * pow(sin((cat1.racat2.ra)/2),2)))
<= cat1.poserr+cat2.poserr) ;
• This works, and speed appears good, but more testing is
needed on a wider range of datasets.
2003 June 30
24
True Spatial Indexing
• Hot topic of research in computer science departments for
more than 20 years
• Very many algorithms have been proposed:
– BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hBtree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree.
– So many alternatives, but none of them has properties as
good as the B-tree in one dimension (e.g. compact and
efficient, with fairly good worst-case performance).
• R-tree one of the earliest structures, one of the best.
– R-trees are built into several modern DBMS.
2003 June 30
25
Spatial Options in current DBMS
DBMS
Spatial Index
DB2
Spatial Extender: multi-level grid file
Informix
R-tree indexing included
Ingres
Interbase/Firebird
Microsoft SQL Server
MySQL
R-trees on arbitrary polygons
Microsoft SQL Server
Oracle
Spatial option: R-trees
Postgres
R-trees on rectangular boxes
Sybase
Spatial Option (Boeing SQS): R-trees
2003 June 30
26
Using R-trees
• Must draw a rectangular box outside each error circle
• Boxes get rather extended (along RA axis) near poles
• Need a subsequent filter to remove spurious matches where
rectangles overlap but error-circles do not.
• If the error criterion alters (e.g. user wants 99% probability
circle rather than 90%) need to recreate column of boxes,
and then recreate the R-tree index.
– Solution: always use a box as large as anyone could
possible want, subsequent filtering on error-circles is
still quite cheap.
2003 June 30
27
R-tree Performance
• Postgres: R-tree indexing works as advertised
– R-trees are large, creation is slow, e.g. 2 hours for table
of 3.5 million rows.
• MySQL: latest version allows R-trees on any polygon (but
anything above 4 sides is wasteful).
– Very verbose external data format
– Works as advertised, not yet measured performance.
• Informix: Kalpakis et al. (ADASS conference report)
loaded part of USNO-A2 and found data load and R-tree
creation would have taken 39 days for the entire table of
500 million rows.
2003 June 30
28
Comparison of Pixel-code and R-tree Methods
• Advantages
– Pcode join seems to be faster (but not yet benchmarked
with identical systems).
– Takes up less disc space in total.
– Can use any DBMS, not just those with an R-tree or
other spatial data option.
• Disadvantages
– Additional tables and indices have to be created.
– More complex set of joins.
– Needs external code as neither HTM or HEALPix can
be expressed as an SQL-callable function (because they
return a variable-length array of integers).
2003 June 30
29
Indexing: summary
• Indexing on just one spatial axis is simply too inefficient
for large tables and cannot support joins.
• R-trees are powerful and easy to use, but index creation
times are a serious cause for concern.
• 2d1d mapping functions such as HTM or HEALPix are
more complicated to use but may be faster in some cases.
• AstroGrid will continue work in this area.
2003 June 30
30
Idea: HTM as Universal Position Locator?
Can choose HTM (or HEALPix) resolution as high as needed
• 32-bit words: pixels around 24 arcsec on a side.
– More pixels than objects in largest current catalogue
• 64-bit words: pixels around 0.4 milliarcsec on a side.
– Ample precision for some time to come
Maybe every object catalogue should list the UPL of each
celestial object, allowing fast and easy searches and joins?
• Fine for points in sky, but real objects have finite extent,
and some will correspond to a more than one UPL
• Some objects move – should pixel-code change with time?
• Reference frame also moves
2003 June 30
31
UPLs for extended objects: possible work-arounds
• Choose a pixel size large enough to encompass the error
region in each case
– In some cases this would be very large indeed
– Different objects would have different pixel size
• Have extra rows in tables for objects crossing pixel
boundaries.
– Need to alter original tables, not always acceptable.
• Add variable-length vector to table to contain list of UPLs
– Non-normalised table; not allowed by most DBMS
• Introduce separate table to map object IDs to UPLs.
– Adds significant complexity to search/join operations
2003 June 30
32
Speeding up Spatial Data Access
Use Parallel Hardware such as Beowulf Clusters
• We may be able to spread a large table over the disc drives
of several nodes in a cluster for faster searches.
• Some support for this is built into DBMS such as Oracle,
DB2 and SQL Server. There is currently nothing to support
it in Open Source products such as MySQL and Postgres.
• Possible do-it-yourself approach: split table into separate
DBMS instances on separate nodes. Farm out query to all
instances, then combine the results.
Column-oriented storage
• Have tested Sybase-IQ: performance good, but expensive.
No multi-dimensional indexing.
2003 June 30
33
Can we do Distributed Searches and Joins?
• Object catalogues of interest are located on data archive
servers all over the world. Is it feasible to search or join
without copying entire tables?
– Cone search services are becoming widespread. A few
servers allow a single search to be farmed out to many
servers, no attempt is made to merge the results, they are
just concatenated. Room for improvement here?
– An outer join requires every row in table#1 to be present
in the output, so there is little point in attempting a join
over the network, when it is simpler to copy table#1 to
the host holding table#2.
• Perhaps need to set up astronomical data warehouse
with services to import tables and join them.
2003 June 30
34
Download