SPATIAL JOIN

Biplob Kumar Debnath
Department of Electrical and Computer Engineering, University of Minnesota

SYNONYMS

Intersect Join

DEFINITION

The spatial join operation is used to combine two or more datasets with respect to a spatial predicate. The predicate can be a combination of directional, distance, and topological spatial relations. In a nonspatial join, the joining attributes must be of the same type, but in a spatial join they can be of different types. Usually each spatial attribute is represented by its minimum bounding rectangle (MBR). A typical example of a spatial join is “Find all pairs of rivers and cities that intersect”. For example, in Figure 1, the result of the join between the set of rivers {R1, R2} and the set of cities {C1, C2, C3, C4, C5} is {(R1, C1), (R2, C5)}.

Figure 1: Example of spatial join

HISTORICAL BACKGROUND

In 1986, Orenstein used a grid-based technique to perform spatial join. It is the first known technique for the spatial join operation. Using a grid, the multidimensional space is divided into smaller blocks, known as pixels. A z-ordering is then used to order the pixels. Each object is approximated by the pixels that intersect its MBR. Because the pixels are ordered by z-ordering, each object is represented by a set of z-values, which are one-dimensional. Any one-dimensional index (e.g., B+-tree) can then be used to sort them, and the spatial join is performed with a sort-merge join. The performance of this technique depends solely on the granularity of the grid: the finer the grid, the more accurate the results, but the more memory is consumed. Later, to remedy this problem, multidimensional indices (e.g., R-tree) were devised, which can handle spatial data directly. Various new spatial join algorithms based on multidimensional indices (e.g., R-tree join, sort and match, spatial hash join, slot index spatial join) then appeared.

KEY CONCEPTS

Spatial join is done in two steps: the filter step and the refinement step.
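
As a rough illustration of this two-step scheme, the following Python sketch joins two small relations by first comparing MBRs and then checking the exact geometry. It is only a sketch under stated assumptions: the objects are modelled as discs (a hypothetical stand-in for real river and city geometries), and the names Disc, mbrs_intersect, exact_intersect, and spatial_join are illustrative and not taken from any particular system.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Disc:
    """Hypothetical exact geometry: a disc with centre (x, y) and radius r."""
    name: str
    x: float
    y: float
    r: float

    def mbr(self):
        # Minimum bounding rectangle as (xmin, ymin, xmax, ymax).
        return (self.x - self.r, self.y - self.r,
                self.x + self.r, self.y + self.r)


def mbrs_intersect(a, b):
    # Filter test: four comparisons decide whether two rectangles intersect.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def exact_intersect(p, q):
    # Refinement test on the exact geometry (assumed here to be discs).
    return hypot(p.x - q.x, p.y - q.y) <= p.r + q.r


def spatial_join(rel1, rel2):
    # Filter step: keep every pair whose MBRs intersect (may contain false hits).
    candidates = [(p, q) for p in rel1 for q in rel2
                  if mbrs_intersect(p.mbr(), q.mbr())]
    # Refinement step: check the exact predicate on the surviving pairs only.
    result = [(p, q) for p, q in candidates if exact_intersect(p, q)]
    return candidates, result


rivers = [Disc("R1", 0.0, 0.0, 1.0)]
cities = [Disc("C1", 1.5, 0.0, 1.0),   # MBRs and discs both intersect
          Disc("C2", 1.8, 1.8, 1.0),   # MBRs intersect, discs do not
          Disc("C3", 5.0, 5.0, 1.0)]   # MBRs are disjoint

candidates, result = spatial_join(rivers, cities)
print([(p.name, q.name) for p, q in candidates])  # [('R1', 'C1'), ('R1', 'C2')]
print([(p.name, q.name) for p, q in result])      # [('R1', 'C1')]
```

The pair (R1, C2) shows why refinement is needed: the bounding rectangles overlap near a corner, but the exact objects do not intersect.
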
In the filter step, the pairs of tuples whose MBRs overlap are determined. This step is not computationally expensive, as at most four comparisons are required to determine whether two rectangles intersect. The tuples that pass the filter step are fed to the refinement step, where the exact spatial representations are used and the spatial predicate is checked on them. The refinement step is computationally expensive, but the number of tuples it processes is smaller because of the initial filter step.

Spatial join algorithms can be classified into three categories. For the discussion below, we assume that we want to spatially join relations R1 and R2, and we focus only on the intersection join. The same techniques can be extended to other join variants (e.g., distance join).

Nested Loop

In this algorithm, for each tuple of R1, the entire R2 is scanned; any pair of tuples of R1 and R2 that satisfies the spatial join predicate is added to the result. The basic algorithm follows:

    for all tuples r1 ∈ R1
        for all tuples r2 ∈ R2
            if pair (r1, r2) satisfies the spatial join predicate
                add <r1, r2> to result

Here, R1 is the outer relation and R2 is the inner relation. If an index is available on one of the relations, that relation can be used as the inner one; in this case, the entire inner relation need not be scanned.

Tree Matching

The tree matching algorithm can be applied when indices are available on both relations. For this discussion, we assume that an R-tree index is available. In an R-tree, every node entry has the form <ref, rect>, where ref (written ptr in the pseudocode below) is a pointer to a child node and rect is the MBR of the child node or the MBR of a spatial object. The pages that contain leaf nodes are called data pages, and the pages that contain non-leaf nodes are called directory pages. Because a directory entry contains the MBR of its child node's entries, if the MBRs of two directory entries Er1 and Er2 are disjoint, then there can be no match between the entries of the two corresponding child pages.
If they are not disjoint, there may be matches between their entries, so we have to traverse deeper into the trees to find the matching tuples. The basic algorithm follows:

    Spatial_Join (R1, R2)   // R1 and R2 are R-tree nodes
        for all Er1 ∈ R1
            for all Er2 ∈ R2
                if (Not_Disjoint (Er1.rect, Er2.rect))
                    if (R1 and R2 are leaf pages)
                        if pair (Er1, Er2) satisfies the spatial join predicate
                            add <Er1, Er2> to result
                    else if (R1 is a leaf page)
                        Read_Page (Er2.ptr)
                        Spatial_Join (R1, Er2.ptr)
                    else if (R2 is a leaf page)
                        Read_Page (Er1.ptr)
                        Spatial_Join (Er1.ptr, R2)
                    else
                        Read_Page (Er1.ptr)
                        Read_Page (Er2.ptr)
                        Spatial_Join (Er1.ptr, Er2.ptr)

When an index exists for only one relation, an index on the other relation is built on the fly and the tree matching technique is applied.

Partition-Based Spatial Merge Join

In this case, both relations are first divided into p partitions if they do not both fit in main memory. Then partition i of R1, where 1 ≤ i ≤ p, is compared with the corresponding partition i of R2. We briefly go through the filter step of this algorithm:

1. For each tuple in R1 and R2, form new relations R1’ and R2’ in which each tuple consists of the unique object id of the original tuple and the MBR of the joining attribute.
2. If both R1’ and R2’ fit in main memory, the join can be processed with a plane-sweep algorithm.
3. If R1’ and R2’ do not both fit in main memory, partition both relations into p parts (R1’1, …, R1’p and R2’1, …, R2’p) such that each pair of partitions (R1’i, R2’i) fits in main memory. In addition, make sure that, for each R1’i, any overlapping tuples in R2’ reside in partition R2’i. The plane-sweep algorithm can then be applied to each pair of partitions.

This strategy works well when no index is present on either relation.

KEY APPLICATIONS

One of the applications of spatial join is to find all the objects that either intersect or overlap with each other.
Some variants of spatial join (e.g., distance join) are used in data mining for data analysis and clustering. Spatial join can also be used to process closest-pair queries, k-nearest-neighbor queries, and ε-distance queries.

FUTURE DIRECTIONS

Some issues in spatial join require further attention from the research community. Spatial join queries are usually processed by performing the filter step and then the refinement step, in that order. In some cases, variants of this scheme (e.g., interleaving the two steps) may be more beneficial. It remains to be explored where such variants are beneficial and what information needs to be collected to decide this. Although intersection join algorithms (e.g., R-tree join) can be directly extended to other join types (e.g., distance join), doing so often results in poor performance. Various optimization techniques can be applied to remedy this. Extending existing intersection join algorithms with such optimizations to other domains will be an interesting area for research.

CROSS REFERENCES

1. Intersection join
2. Distance join
3. Similarity join
4. Spatial access method
5. R-Tree

RECOMMENDED READING

1. Shekhar S., Chawla S. (2003). Spatial Databases: A Tour, First Edition, Prentice Hall.
2. Patel J. M., DeWitt D. J. (1996). Partition Based Spatial-Merge Join. Proceedings of the ACM SIGMOD Conference, pages 259-270.
3. Brinkhoff T., Kriegel H., Seeger B. (1993). Efficient processing of spatial joins using R-trees. Proceedings of the ACM SIGMOD Conference, pages 237-246.
4. Brinkhoff T., Kriegel H., Seeger B. (1996). Parallel processing of spatial joins using R-trees. Proceedings of the ICDE Conference, pages 258-265.
5. Manolopoulos Y., Papadopoulos A., Vassilakopoulos M. Gr. (2005). Spatial Databases: Technologies, Techniques and Trends, Idea Group Publishing.
6. Böhm C., Krebs F. (2002). High Performance Data Mining Using the Nearest Neighbor Join. Proceedings of the IEEE International Conference on Data Mining, pages 43-55.
7.
Shou Y., Mamoulis N., Cao H., Papadias D., Cheung D. W. (2003). Evaluation of Iceberg Distance Joins. Proceedings of the Eighth International Symposium on Spatial and Temporal Databases, pages 270-288.
8. Corral A., Manolopoulos Y., Theodoridis Y., Vassilakopoulos M. (2000). Closest pair queries in spatial databases. Proceedings of the ACM SIGMOD Conference, pages 189-200.
9. Guttman A. (1984). R-trees: A dynamic index structure for spatial searching. Proceedings of the ACM SIGMOD Conference, pages 47-57.
10. Koudas N., Sevcik K. (2000). High Dimensional Similarity Join. Proceedings of the ACM SIGMOD Conference, pages 324-335.
11. Mamoulis N., Papadias D. (2001). Multi-way Spatial Joins. ACM Transactions on Database Systems (TODS), 26(4), pages 424-475.
12. An N., Yang Z.-Y., Sivasubramaniam A. (2001). Selectivity estimation for Spatial Joins. Proceedings of the IEEE ICDE Conference, pages 368-375.
13. Faloutsos C., Seeger B., Traina A., Traina C. (2000). Spatial Join Selectivity Using Power Laws. Proceedings of the ACM SIGMOD Conference, pages 177-188.
14. Mamoulis N., Papadias D. (2003). Slot Index Spatial Join. IEEE Transactions on Knowledge and Data Engineering (TKDE), 15(1), pages 211-231.
15. Orenstein J. (1986). Spatial Query Processing in an Object-Oriented Database System. Proceedings of the ACM SIGMOD Conference, pages 326-336.