Spatial Approximate String Search

Abstract—This work deals with the approximate string search in large spatial databases.
Specifically, we investigate range queries augmented with a string similarity search
predicate in both Euclidean space and road networks. We dub this query the spatial
approximate string (SAS) query. In Euclidean space, we propose an approximate solution,
the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature
for an index node u keeps a concise representation of the union of q-grams from strings
under the subtree of u. We analyze the pruning functionality of such signatures based on
the set resemblance between the query string and the q-grams from the subtrees of index
nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for
which we present a novel adaptive algorithm to find balanced partitions using both the
spatial and string information stored in the tree. For queries on road networks, we propose a
novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in
practice. RSASSOL combines q-gram-based inverted lists with reference-node-based
pruning. Extensive experiments on large real data sets demonstrate the efficiency and
effectiveness of our approaches.
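The signature idea in the abstract can be illustrated with a small sketch. It assumes padded q-grams, CRC32 as a stable q-gram identifier, and linear hash functions modulo a Mersenne prime as a stand-in for min-wise independent permutations; the names `qgrams`, `make_hashes`, `signature`, `merge`, and `resemblance` are hypothetical and are not taken from the paper. The point it shows is that the signature of a union of q-gram sets is the element-wise minimum of the individual signatures, which is what would let an index node summarize its subtree from its children.

```python
import random
import zlib

P = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)


def qgrams(s, q=2):
    """Return the set of q-grams of s, padded with q-1 sentinel characters."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def make_hashes(k, seed=0):
    """Draw k linear hash functions h(x) = (a*x + b) mod P from a seeded RNG."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def signature(grams, hashes):
    """Min-wise signature of a non-empty q-gram set: one minimum per hash."""
    ids = [zlib.crc32(g.encode("utf-8")) for g in grams]
    return [min((a * x + b) % P for x in ids) for a, b in hashes]


def merge(sig_a, sig_b):
    """Signature of the union of two sets, built from their signatures alone."""
    return [min(x, y) for x, y in zip(sig_a, sig_b)]


def resemblance(sig_a, sig_b):
    """Estimate set resemblance as the fraction of matching signature slots."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    hashes = make_hashes(64, seed=1)
    left = signature(qgrams("theatre"), hashes)
    right = signature(qgrams("theater"), hashes)
    parent = merge(left, right)  # what a parent node would store
    print(resemblance(left, right))
    print(parent == signature(qgrams("theatre") | qgrams("theater"), hashes))
```

The estimate from `resemblance` is approximate and depends on the number of hash functions, which is consistent with the abstract describing the MHR-tree as an approximate solution.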
INTRODUCTION
Keyword search over a large amount of data is an important operation in a wide range
of domains. Felipe et al. have recently extended its study to spatial databases [17],
where keyword search becomes a fundamental building block for an increasing
number of real-world applications, and proposed the IR2-Tree. A main limitation of the
IR2-Tree is that it only supports exact keyword search. In practice, keyword search for
retrieving approximate string matches is required [3], [9], [11], [27], [28], [30], [36], [43].
Architecture Diagram
CONCLUSION
This paper presents a comprehensive study of spatial approximate string queries in both
Euclidean space and road networks. We use the edit distance as the similarity
measurement for the string predicate and focus on range queries as the spatial
predicate. We also address the problem of query selectivity estimation for queries in
Euclidean space. Future work includes examining spatial approximate substring queries,
designing methods that are more update friendly, and solving the selectivity estimation
problem for RSAS queries.
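To make the query definition concrete, the following is a minimal brute-force sketch of a SAS range query in Euclidean space: it returns every point that lies inside a query rectangle and whose string is within edit distance tau of the query string. It is a linear scan meant only to pin down the semantics; it is not the MHR-tree or RSASSOL. The names `Point`, `edit_distance`, and `sas_range_query`, and the rectangle encoding `(x_lo, y_lo, x_hi, y_hi)`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float
    text: str


def edit_distance(s, t):
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string on the inner loop
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sas_range_query(points, rect, query, tau):
    """Return points inside rect whose text is within edit distance tau."""
    x_lo, y_lo, x_hi, y_hi = rect
    return [p for p in points
            if x_lo <= p.x <= x_hi and y_lo <= p.y <= y_hi
            and edit_distance(p.text, query) <= tau]


if __name__ == "__main__":
    data = [
        Point(1, 1, "theatre"),
        Point(2, 2, "theater"),
        Point(9, 9, "theatre"),  # matches the string but lies outside the range
        Point(1, 2, "bakery"),   # inside the range but too far in edit distance
    ]
    for p in sas_range_query(data, (0, 0, 5, 5), "theatre", tau=2):
        print(p)
```

A scan like this evaluates both predicates for every point, which is the cost an index that prunes on spatial and string information together would aim to avoid.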
REFERENCES
1. S. Acharya, V. Poosala, and S. Ramaswamy, "Selectivity Estimation in Spatial
Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 13-24, 1999.
2. S. Alsubaiee, A. Behm, and C. Li, "Supporting Location-Based Approximate-Keyword
Queries," Proc. SIGSPATIAL 18th Int'l Conf. Advances in Geographic Information
Systems (GIS), pp. 61-70, 2010.
3. A. Arasu, S. Chaudhuri, K. Ganjam, and R. Kaushik, "Incorporating String
Transformations in Record Matching," Proc. ACM SIGMOD Int'l Conf. Management
of Data, pp. 1231-1234, 2008.
4. A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," Proc. 32nd
Int'l Conf. Very Large Data Bases (VLDB), pp. 918-929, 2006.
5. N. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger, "The R*-Tree: an Efficient
and Robust Access Method for Points and Rectangles," Proc. ACM SIGMOD Int'l
Conf. Management of Data, pp. 322-331, 1990.
6. A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, "Min-Wise
Independent Permutations (Extended Abstract)," Proc. ACM 30th Symp. Theory of
Computing (STOC), pp. 327-336, 1998.
7. X. Cao, G. Cong, and C.S. Jensen, "Retrieving Top-k Prestige-Based Relevant Spatial
Web Objects," Proc. VLDB Endowment, vol. 3, pp. 373-384, 2010.
8. K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin, "An Efficient Filter for
Approximate Membership Checking," Proc. ACM SIGMOD Int'l Conf. Management
of Data, pp. 805-818, 2008.
9. S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy
Match for Online Data Cleaning," Proc. ACM SIGMOD Int'l Conf. Management of
Data, pp. 313-324, 2003.
10. S. Chaudhuri, V. Ganti, and L. Gravano, "Selectivity Estimation for String Predicates:
Overcoming the Underestimation Problem," Proc. Int'l Conf. Data Eng. (ICDE), pp.
227-238, 2004.
www.frontlinetechnologies.org
projects@frontl.in
+91 7200247247