Efficient Processing of Top-k Spatial Keyword Queries
João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg
www.ntnu.no
SSTD 2011 - Minneapolis, Minnesota, USA

Outline
• Top-k spatial keyword queries
• Current approaches
• Spatial inverted index
• Single-keyword queries
• Multiple-keyword queries
• Experimental evaluation
• Conclusion

Motivation
• More and more documents on the Internet are associated with a spatial location
  – Ex.: tweets, images (Flickr), Wikipedia sites, OpenStreetMap objects, …
• Most of these geotagged objects are associated with a text (description)

Top-k spatial keyword queries
• Query
  – Spatial location
  – Query keywords (e.g., Italian food)
• Returns the k best spatio-textual objects ranked in terms of both
  – Spatial distance to the query location
  – Textual relevance to the query keywords

Another example…
[Figure: objects around the query location q, with the distance from q to an object marked]

Ranking objects
• Score: τ(p,q) = α · δ(p,q) + (1 − α) · θ(p,q)
• The spatial proximity (δ) is the normalized Euclidean distance between p and q
• The textual relevance (θ) is the cosine similarity between the description of p and the query keywords
• The query preference parameter (α) defines the importance of one measure over the other

Current approaches
• Employ a modified R-tree [1,2]
  – Each node keeps an abstract document representing all documents in the node's sub-tree
• Abstract document
  – Pairs (term, weight), one pair per term
  – The weight permits computing an upper-bound score for the objects in the node's sub-tree
[1] Cong, G., Jensen, C.S., Wu, D.: "Efficient retrieval of the top-k most relevant spatial web objects", VLDB, 2009.
[2] Li, Z., Lee, K.C., Zheng, B., Lee, W., Lee, D., Wang, X.: "IR-tree: an efficient index for geographic document search", TKDE, 2010.

Example
[Figure: modified R-tree with root: e1 e2 e3; e1: p1 p2 p3; e2: p4 p6; e3: p5 p7; and the query location q. Term:weight lists shown in the figure: (bar:2 pop:2 pub:1 rock:1 samba:1), (bar:1 pop:2 pub:1 rock:1), (bar:2 pub:2 samba:1), (pop:1 pub:1 samba:1)]
• For simplicity, we assume that the impact of a term is defined by its frequency

Current approaches
• There are several variations
  – Incorporating document similarity
  – Clustering the nodes
• Main problems
  – Frequent and infrequent terms are stored in the same way (have the same cost)
  – Several nodes are accessed because of the dimensionality of the text
  – Complex management of inverted files and/or vectors, one per node

Spatial inverted index (S2I)
• Like an inverted index, S2I maps each term to the objects that contain it
  – The most frequent terms are stored in aggregated R-trees (aR-trees)
  – The less frequent terms are stored in blocks in a file
• The aR-tree permits accessing the objects in decreasing order of term relevance
• The blocks permit storing the less frequent terms efficiently

Distribution of terms
[Figure: frequency (y-axis) versus terms (x-axis)]
• The distribution of terms is very skewed
• A few hundred terms account for 50% of the text

Example
[Figure: S2I example for the terms bar, pop, pub, rock, samba]

Aggregated R-tree (max) for frequent terms (e.g., pub)
[Figure: aR-tree with e0: e1(1) e2(2); e1: p1(1) p2(1); e2: p5(2) p6(2) p7(1); and the query location q. The value next to a node entry is the max value of its sub-tree (e1, max=1; e2, max=2); the value next to an object is its term impact]
• Only relevant objects are evaluated
• The objects are accessed in decreasing order of score

Single-keyword queries
• Only a single block or tree is accessed
• Block
  – All the objects are read and the k best are reported
• Tree
  – The nodes are accessed in decreasing order of score
  – The algorithm terminates when the score of the k-th object is higher than the upper-bound score of any unvisited node

Example, processing top-1
[Figure: traversal of the aR-tree above (e0: e1(1) e2(2); e1: p1(1) p2(1); e2: p5(2) p6(2) p7(1)) using the minimum distance from q to each node and a max-heap holding node entries (e1, e2) and objects (p5, p6, p7) until the top-1 object is found]

Multiple-keyword queries
• Requires aggregating the partial scores of the objects for each term t of the query keywords
• Similar to Fagin's algorithm (NRA)
  – Different bounds
• Score: τ(p,q) = Σ_{t ∈ q.d} τ_t(p,q), where τ_t(p,q) is the partial score of p for term t

Multiple-keyword algorithm
• For each term t in q, access the objects p in S2I in decreasing order of partial score
  – The objects are retrieved from a tree or a block
• Update the lower-bound score of p
  – Sum of the known partial scores plus the lowest possible partial score (using the spatial distance)
• Update the upper-bound score of the visited objects
• Return the objects whose lower-bound score cannot be exceeded by the remaining objects

Experimental evaluation
• We compare our approach (S2I) with the DIR-tree proposed by Cong et al. [1]
• Both approaches are implemented in Java
• Measures: response time, I/O, update time, and index size
• Size of tree nodes and blocks: 4KB

Datasets
Dataset         Total no. of objects   Avg. no. of unique terms per object   Total no. of terms
Twitter1        1M                     11.94                                 12.5M
Twitter2        2M                     12.00                                 25M
Twitter3        3M                     12.26                                 38.6M
Twitter4        4M                     12.27                                 51.6M
Data1           0.1M                   131.70                                32.6M
Wikipedia       0.4M                   163.65                                169.4M
Flickr          1.4M                   14.49                                 25.4M
OpenStreetMap   3M                     8.76                                  31.5M

Variables studied
• Number of results (k)
  – 10, 20, 30, 40, 50
• Number of query keywords
  – 1, 2, 3, 4, and 5
• Query preference rate (α)
  – 0.1, 0.3, 0.5, 0.7, 0.9
• Scalability (Twitter datasets)
  – 1M, 2M, 3M, 4M

Number of results (k)
• The response time of S2I is one order of magnitude better than that of the DIR-tree because S2I performs fewer disk accesses
  – The DIR-tree reads several nodes before finding the top-k because of the dimensionality of the text

Number of query keywords
• S2I is one order of magnitude better in I/O and response time

Insertion time and index size
• S2I requires neither updating inverted files (and vectors) nor computing document similarity
• S2I requires more space

Conclusions
• Top-k spatial keyword queries are intuitive and have several applications
• We propose a new index
  – Terms with different frequencies are stored differently
• We propose algorithms for single- and multiple-keyword queries
• The efficiency of our approach is verified through experiments on synthetic and real datasets

Thanks!
More information…
João B. Rocha-Junior
joao@idi.ntnu.no
http://www.idi.ntnu.no/~joao

Scalability
• The improvement of S2I over the DIR-tree increases with the cardinality of the dataset

Different datasets
• The advantage of S2I over the DIR-tree is higher for datasets with few terms per document

Term removal
• Terms with length = 1
• Terms that have no letter character
  – !Character.isLetter(token.charAt(i))
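
The two removal rules above can be sketched in a few lines of Java. This is only an illustration, not the authors' preprocessing code: the class name TermFilter and its methods are hypothetical, and the only call taken from the slide is Character.isLetter(token.charAt(i)).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the two removal rules from the slide.
final class TermFilter {
    private TermFilter() {}

    /** True if at least one character of the token is a letter. */
    static boolean hasLetter(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetter(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** A token is kept if it is longer than one character and contains a letter.
     *  Empty tokens are dropped too, since they contain no letter. */
    static boolean keep(String token) {
        return token.length() > 1 && hasLetter(token);
    }

    /** Returns the tokens that survive both rules, in their original order. */
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (keep(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "a" has length 1; "2011" and "!!" contain no letter.
        System.out.println(filter(Arrays.asList("pub", "a", "2011", "!!", "top-k")));
    }
}
```

Note that a mixed token such as "k9" is kept, because only tokens without any letter are removed.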
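
As an illustration of the ranking function from the "Ranking objects" slide and of the block case of single-keyword queries (read all objects, report the k best), here is a minimal Java sketch. It rests on several assumptions that are not in the slides: the spatial proximity is taken as 1 minus the Euclidean distance divided by a maximum distance maxDist, the textual relevance of each object is assumed to be already available as a value in [0,1], and the names BlockTopK, Entry, proximity, score and topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of scoring and of a full scan over one block.
final class BlockTopK {
    private BlockTopK() {}

    /** Hypothetical block entry: object id, location, and textual relevance in [0,1]. */
    static final class Entry {
        final String id;
        final double x, y, relevance;

        Entry(String id, double x, double y, double relevance) {
            this.id = id;
            this.x = x;
            this.y = y;
            this.relevance = relevance;
        }
    }

    /** Assumed proximity: 1 - dist/maxDist, clamped to [0,1] (closer is better). */
    static double proximity(double px, double py, double qx, double qy, double maxDist) {
        double dist = Math.hypot(px - qx, py - qy);
        return 1.0 - Math.min(dist / maxDist, 1.0);
    }

    /** Score = alpha * delta + (1 - alpha) * theta, as on the ranking slide. */
    static double score(double alpha, double delta, double theta) {
        return alpha * delta + (1.0 - alpha) * theta;
    }

    /** Reads every entry of the block and returns the ids of the k best, best first. */
    static List<String> topK(List<Entry> block, double qx, double qy,
                             double alpha, double maxDist, int k) {
        List<Entry> sorted = new ArrayList<>(block);
        // Sort by descending score (negated key gives descending order).
        sorted.sort(Comparator.comparingDouble(
                (Entry e) -> -score(alpha, proximity(e.x, e.y, qx, qy, maxDist), e.relevance)));
        List<String> ids = new ArrayList<>();
        for (Entry e : sorted.subList(0, Math.min(k, sorted.size()))) {
            ids.add(e.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up data: "a" is at the query location, "b" is farther but more relevant.
        List<Entry> block = Arrays.asList(
                new Entry("a", 0, 0, 0.2),
                new Entry("b", 3, 4, 1.0));
        System.out.println(topK(block, 0, 0, 0.5, 10, 1)); // alpha = 0.5
        System.out.println(topK(block, 0, 0, 0.9, 10, 1)); // alpha = 0.9 favours distance
    }
}
```

With the made-up data in main, raising α from 0.5 to 0.9 switches the top-1 result from the more relevant object to the closer one, which is the effect the query preference parameter is meant to have.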
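
The tree case of single-keyword queries can be sketched as a best-first traversal: entries are kept in a max-heap ordered by an upper-bound score, and the search stops once k objects have been popped, since no unvisited node can then contain a better object. The sketch below is a simplified in-memory illustration under stated assumptions, not the S2I implementation: the class ARTreeTopK, its Entry type, the bound formula (α times a proximity derived from the minimum distance to the rectangle, plus 1 − α times the max impact), the coordinates in sampleTree, and the scaling of the slide's impacts 1 and 2 to 0.5 and 1.0 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical in-memory sketch of best-first top-k search over an aR-tree (max).
final class ARTreeTopK {
    private ARTreeTopK() {}

    /** A node (rectangle + max impact of its sub-tree) or an object (point, no children). */
    static final class Entry {
        final String id;
        final double minX, minY, maxX, maxY, maxImpact;
        final List<Entry> children;

        Entry(String id, double minX, double minY, double maxX, double maxY,
              double maxImpact, List<Entry> children) {
            this.id = id;
            this.minX = minX; this.minY = minY;
            this.maxX = maxX; this.maxY = maxY;
            this.maxImpact = maxImpact;
            this.children = children;
        }

        static Entry object(String id, double x, double y, double impact) {
            return new Entry(id, x, y, x, y, impact, Collections.<Entry>emptyList());
        }

        /** Builds a node over at least one child: bounding rectangle and max impact. */
        static Entry node(String id, Entry... kids) {
            double x0 = Double.MAX_VALUE, y0 = Double.MAX_VALUE;
            double x1 = -Double.MAX_VALUE, y1 = -Double.MAX_VALUE, max = 0.0;
            for (Entry c : kids) {
                x0 = Math.min(x0, c.minX); y0 = Math.min(y0, c.minY);
                x1 = Math.max(x1, c.maxX); y1 = Math.max(y1, c.maxY);
                max = Math.max(max, c.maxImpact);
            }
            return new Entry(id, x0, y0, x1, y1, max, Arrays.asList(kids));
        }

        boolean isObject() { return children.isEmpty(); }
    }

    /** Upper bound for a node, exact score for an object (assumed proximity: 1 - dist/maxDist). */
    static double bound(Entry e, double qx, double qy, double alpha, double maxDist) {
        double dx = Math.max(0.0, Math.max(e.minX - qx, qx - e.maxX));
        double dy = Math.max(0.0, Math.max(e.minY - qy, qy - e.maxY));
        double proximity = 1.0 - Math.min(Math.hypot(dx, dy) / maxDist, 1.0);
        return alpha * proximity + (1.0 - alpha) * e.maxImpact;
    }

    static List<String> topK(Entry root, double qx, double qy,
                             double alpha, double maxDist, int k) {
        Comparator<Entry> byBoundDesc = (a, b) -> Double.compare(
                bound(b, qx, qy, alpha, maxDist), bound(a, qx, qy, alpha, maxDist));
        PriorityQueue<Entry> heap = new PriorityQueue<>(byBoundDesc);
        heap.add(root);
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < k) {
            Entry top = heap.poll();
            if (top.isObject()) {
                result.add(top.id);         // no remaining entry can score higher
            } else {
                heap.addAll(top.children);  // expand the most promising node
            }
        }
        return result;
    }

    /** Tree shape from the slides (e0: e1, e2); coordinates and scaled impacts are made up. */
    static Entry sampleTree() {
        Entry e1 = Entry.node("e1", Entry.object("p1", 1, 8, 0.5), Entry.object("p2", 2, 9, 0.5));
        Entry e2 = Entry.node("e2", Entry.object("p5", 6, 5, 1.0),
                Entry.object("p6", 9, 2, 1.0), Entry.object("p7", 8, 8, 0.5));
        return Entry.node("e0", e1, e2);
    }

    public static void main(String[] args) {
        System.out.println(topK(sampleTree(), 5, 5, 0.5, 10, 1));
    }
}
```

The traversal is correct only because a node's bound is never lower than the score of anything beneath it: the minimum distance to the rectangle cannot exceed the distance to any object inside it, and the node stores the maximum impact of its sub-tree.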