Junior - SSTD 2011

Efficient Processing of Top-k Spatial
Keyword Queries
João B. Rocha-Junior, Orestis Gkorgkas,
Simon Jonassen, and Kjetil Nørvåg
www.ntnu.no
SSTD 2011 - Minneapolis, Minnesota, USA
Outline
• Top-k spatial keyword queries
• Current approaches
• Spatial inverted index
• Single-keyword queries
• Multiple-keyword queries
• Experimental evaluation
• Conclusion
Motivation
• More and more documents on the Internet are
being associated with a spatial location
– E.g., tweets, images (Flickr), Wikipedia pages,
OpenStreetMap objects, …
• Most of these geotagged objects are
associated with a text (description)
Top-k spatial keyword queries
• Query
– Spatial location
– Query keywords (e.g., “Italian food”)
• Returns the k best
spatio-textual objects
ranked in terms of both
– Spatial distance to the
query location
– Textual relevance to the
query keywords
Another example…
• Query
– Spatial location
– Query keywords
• Returns the k best
spatio-textual objects
ranked in terms of both
– Spatial distance to the
query location
– Textual relevance to the
query keywords
[Figure: objects around the query location q, annotated with their distance to q]
Ranking objects
• Score: $\tau(p,q) = \alpha \cdot \delta(p,q) + (1-\alpha) \cdot \theta(p,q)$
• The spatial proximity (δ) is the normalized
Euclidean distance between p and q
• The textual relevance (θ) is the cosine
similarity between the description of p and
the query keywords
• The query preference parameter (α) defines
the importance of one measure over the other
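The ranking function can be sketched in a few lines of Java. This is only an illustration, not the authors' code: the class and method names are hypothetical, and it assumes that the proximity is taken as 1 minus the distance normalized by a maximum distance dMax, so that closer objects score higher.

```java
import java.util.Map;

final class RankingSketch {
    /** Assumed proximity: 1 at the query location, falling to 0 at distance dMax. */
    static double proximity(double px, double py, double qx, double qy, double dMax) {
        double d = Math.hypot(px - qx, py - qy);
        return 1.0 - Math.min(d, dMax) / dMax;
    }

    /** Cosine similarity between two sparse term-weight vectors. */
    static double cosine(Map<String, Double> doc, Map<String, Double> query) {
        double dot = 0, docNorm = 0, queryNorm = 0;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            docNorm += e.getValue() * e.getValue();
            Double w = query.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : query.values()) queryNorm += w * w;
        return (docNorm == 0 || queryNorm == 0) ? 0 : dot / Math.sqrt(docNorm * queryNorm);
    }

    /** Combined score: alpha * delta + (1 - alpha) * theta. */
    static double score(double alpha, double delta, double theta) {
        return alpha * delta + (1 - alpha) * theta;
    }
}
```

With α close to 1 the ranking is dominated by proximity; with α close to 0 it is dominated by textual relevance.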
Current approaches
• Employ a modified R-tree [1,2]
– Each node keeps an abstract document
representing all documents in the node sub-tree
• Abstract document
– Pairs (term, weight), one pair per term
– The weight permits computing an upper-bound
score for the objects in the node sub-tree
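A minimal sketch of how such an abstract document could be built, assuming the weight kept per term is the maximum weight found in the child documents (the slides only say that the weight permits an upper bound). The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AbstractDocumentSketch {
    /** Merges child documents by keeping, for each term, the largest weight seen. */
    static Map<String, Integer> merge(List<Map<String, Integer>> children) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> child : children) {
            for (Map.Entry<String, Integer> e : child.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        return merged;
    }

    /** Upper bound on the summed query-term weight of any object below the node. */
    static int textualUpperBound(Map<String, Integer> abstractDoc, List<String> queryTerms) {
        int bound = 0;
        for (String t : queryTerms) bound += abstractDoc.getOrDefault(t, 0);
        return bound;
    }
}
```

Because every object in the sub-tree has a weight no larger than the stored one, a node whose bound is too low can be skipped without reading its children.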
[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant spatial
web objects”, VLDB, 2009.
[2] Li, Z., Lee, K.C., Zheng, B., Lee, W., Lee, D., Wang, X.: “IR-tree: an efficient index for
geographic document search”, TKDE, 2010.
Example
[Figure: modified R-tree around the query location q. root: e1 e2 e3; e1: p1 p2 p3; e2: p4 p6; e3: p5 p7.
Abstract documents shown in the figure: (bar:2, pop:2, pub:1, rock:1, samba:1); (bar:1, pop:2, pub:1, rock:1); (bar:2, pub:2, samba:1); (pop:1, pub:1, samba:1)]
For simplicity, we assume that the impact of a term is defined by its frequency.
Current approaches
• There are several variations
– Incorporating document similarity
– Clustering the nodes
• Main problems
– Frequent and infrequent terms are stored in the
same way (have the same cost)
– Accesses several nodes due to text dimensionality
– Complex management of inverted files and/or
vectors, one per node
Spatial inverted index (S2I)
• Similarly to an inverted index, S2I maps terms to
objects that contain the term
– The most frequent terms are stored in aggregated
R-trees (aR-trees)
– The less frequent terms are stored in blocks in a file
• The aR-tree permits accessing the objects in
decreasing order of term relevance
• The blocks permit storing the less frequent
terms efficiently
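The storage decision can be illustrated with a small Java sketch. It is an assumption-laden simplification: the names are hypothetical, the postings are kept in memory, and the rule "more postings than fit in one block means the term goes to an aR-tree" is an assumed threshold, since the slides do not state how frequent and infrequent terms are told apart.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class S2ISketch {
    /** One posting: the object, its location, and the impact of the term in it. */
    static final class Posting {
        final String objectId;
        final double x, y, impact;

        Posting(String objectId, double x, double y, double impact) {
            this.objectId = objectId;
            this.x = x;
            this.y = y;
            this.impact = impact;
        }
    }

    private final int blockCapacity; // assumed number of postings that fit in one block
    private final Map<String, List<Posting>> index = new HashMap<>();

    S2ISketch(int blockCapacity) {
        this.blockCapacity = blockCapacity;
    }

    /** Adds a posting to the list of the given term. */
    void insert(String term, Posting p) {
        index.computeIfAbsent(term, t -> new ArrayList<>()).add(p);
    }

    /** True if the term has outgrown a block and would be kept in an aR-tree. */
    boolean storedInTree(String term) {
        return index.getOrDefault(term, Collections.emptyList()).size() > blockCapacity;
    }
}
```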
Distribution of terms
• The distribution of terms is very skewed
• A few hundred terms account for 50% of the text
[Figure: term frequency (y-axis) plotted over terms (x-axis)]
Example
[Figure: S2I example over the terms bar, pop, pub, rock, and samba]
Aggregated R-tree (max) for
frequent terms (e.g., pub)
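[Figure: aR-tree for the term pub around the query location q. e0: e1(1) e2(2); e1: p1(1) p2(1), max=1; e2: p5(2) p6(2) p7(1), max=2. Node entries carry the max value of their sub-tree; objects carry their term impact.]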
• Only relevant objects are
evaluated
• The objects are accessed in
decreasing order of score
Single-keyword queries
• Only a single block or tree is accessed
• Block
– All the objects are read and the k best are reported
• Tree
– The nodes are accessed in decreasing order of score
– The algorithm terminates when the score of the k-th
object is higher than the score of any unvisited node
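The tree case can be sketched as a best-first traversal with a max-heap. This is an illustration under stated assumptions rather than the authors' implementation: the names are hypothetical, impacts are assumed to be normalized to [0,1], proximity is assumed to be 1 minus the distance divided by dMax, and a node's score is an upper bound computed from the minimum distance to its bounding rectangle and its max impact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class SingleKeywordSketch {
    /** Tree entry: an object (no children) or a node with an MBR and the max impact below it. */
    static final class Entry {
        final String name;
        final double minX, minY, maxX, maxY; // bounding rectangle; a single point for objects
        final double impact;                 // term impact (object) or max impact (node)
        final List<Entry> children = new ArrayList<>();

        Entry(String name, double minX, double minY, double maxX, double maxY, double impact) {
            this.name = name;
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
            this.impact = impact;
        }

        static Entry object(String name, double x, double y, double impact) {
            return new Entry(name, x, y, x, y, impact);
        }

        static Entry node(String name, Entry... kids) {
            double x1 = Double.POSITIVE_INFINITY, y1 = Double.POSITIVE_INFINITY;
            double x2 = Double.NEGATIVE_INFINITY, y2 = Double.NEGATIVE_INFINITY, max = 0;
            for (Entry c : kids) {
                x1 = Math.min(x1, c.minX); y1 = Math.min(y1, c.minY);
                x2 = Math.max(x2, c.maxX); y2 = Math.max(y2, c.maxY);
                max = Math.max(max, c.impact);
            }
            Entry n = new Entry(name, x1, y1, x2, y2, max);
            n.children.addAll(Arrays.asList(kids));
            return n;
        }
    }

    /** Exact score for an object, upper-bound score for a node. */
    static double bound(Entry e, double qx, double qy, double alpha, double dMax) {
        double dx = Math.max(0, Math.max(e.minX - qx, qx - e.maxX));
        double dy = Math.max(0, Math.max(e.minY - qy, qy - e.maxY));
        double proximity = 1 - Math.min(Math.hypot(dx, dy), dMax) / dMax;
        return alpha * proximity + (1 - alpha) * e.impact;
    }

    /** Pops entries in decreasing order of score until k objects have been reported. */
    static List<String> topK(Entry root, int k, double qx, double qy, double alpha, double dMax) {
        PriorityQueue<Entry> heap = new PriorityQueue<>((a, b) ->
            Double.compare(bound(b, qx, qy, alpha, dMax), bound(a, qx, qy, alpha, dMax)));
        heap.add(root);
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < k) {
            Entry e = heap.poll();
            if (e.children.isEmpty()) result.add(e.name); // object: no unvisited entry can beat it
            else heap.addAll(e.children);                 // node: expand its children
        }
        return result;
    }
}
```

An object that reaches the head of the heap has a score at least as high as the bound of every unvisited node, which is the termination condition stated above.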
Example, processing top-1
[Figure: aR-tree around the query location q, with the minimum distance from q to each node. e0: e1(1) e2(2); e1: p1(1) p2(1), max=1; e2: p5(2) p6(2) p7(1), max=2.]
Max-heap: <p5, p6, e1, p7>
Top-1: first entry of the max-heap
Multiple-keyword queries
• Requires aggregating the partial scores of the
objects for each term t of the query keywords
• Similar to Fagin’s algorithm (NRA)
– Different bounds
• Score: $\tau(p,q) = \sum_{t \in q.d} \tau_t(p,q)$,
where $\tau_t(p,q)$ is the partial score of p for term t
Multiple-keyword algorithm
• For each term t in q, access the objects p in S2I in
decreasing order of partial score
– The objects are retrieved from a tree or block
• Update the lower bound score of p
– Sum of the known partial scores plus the lowest
possible partial score (using the spatial distance)
• Update the upper bound score of the visited
objects
• Return the objects whose lower bound score
cannot be overcome by the remaining objects
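The bound bookkeeping can be sketched as follows. Several things here are assumptions, not taken from the slides: the names are hypothetical; the partial score of a term is assumed to be the spatial part split evenly over the n query terms plus the textual relevance of that term; and the upper bound follows plain NRA (the score of the last object read from each list), whereas the slides note that S2I uses different bounds.

```java
final class MultiKeywordBoundsSketch {
    /** Assumed partial score of an object for one of n query terms. */
    static double partial(double alpha, double delta, double termRelevance, int n) {
        return alpha * delta / n + (1 - alpha) * termRelevance;
    }

    /**
     * Lower bound: known partial scores plus, for each term not yet seen for this
     * object, the lowest possible partial score (spatial part only, zero relevance).
     * known[i] is the partial score for term i, or NaN if it has not been seen yet.
     */
    static double lowerBound(double[] known, double alpha, double delta) {
        double sum = 0;
        for (double s : known) {
            sum += Double.isNaN(s) ? alpha * delta / known.length : s;
        }
        return sum;
    }

    /**
     * NRA-style upper bound: known partial scores plus, for each unseen term,
     * the partial score of the last object read from that term's list.
     */
    static double upperBound(double[] known, double[] lastRead) {
        double sum = 0;
        for (int i = 0; i < known.length; i++) {
            sum += Double.isNaN(known[i]) ? lastRead[i] : known[i];
        }
        return sum;
    }
}
```

An object can be returned once its lower bound is at least the upper bound of every other candidate, including objects that have not been seen in any list yet.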
Experimental evaluation
• We compare our approach (S2I) with the DIR-tree proposed by Cong et al. [1]
• Both approaches are implemented in Java
• Measures: response time, I/O, update time,
and index size
• Size of tree nodes and blocks: 4KB
[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant
spatial web objects”, VLDB, 2009.
Datasets
Dataset         Total no. of objects   Avg. no. of unique terms per object   Total no. of terms
Twitter1        1M                     11.94                                 12.5M
Twitter2        2M                     12.00                                 25M
Twitter3        3M                     12.26                                 38.6M
Twitter4        4M                     12.27                                 51.6M
Data1           0.1M                   131.70                                32.6M
Wikipedia       0.4M                   163.65                                169.4M
Flickr          1.4M                   14.49                                 25.4M
OpenStreetMap   3M                     8.76                                  31.5M
Variables studied
• Number of results
– 10, 20, 30, 40, 50
• Number of query keywords
– 1, 2, 3, 4, and 5
• Query preference rate (α)
– 0.1, 0.3, 0.5, 0.7, 0.9
• Scalability (Twitter dataset)
– 1M, 2M, 3M, 4M
Number of results (k)
• The response time of S2I is one order of
magnitude better due to fewer disk accesses
– DIR-tree reads several nodes before finding the top-k
due to text dimensionality
Number of query keywords
• One order of magnitude better in I/O and
response time
Insertion time and index size
• S2I does not require updating inverted files (and
vectors) or computing document similarity
• S2I requires more space
Conclusions
• Top-k spatial keyword queries are intuitive and
have several applications
• We propose a new index
– Terms with different frequency are stored differently
• We propose algorithms for single- and multiple-keyword queries
• The efficiency of our approach is verified through
experiments on synthetic and real datasets
Thanks!
More information…
João B. Rocha-Junior
joao@idi.ntnu.no
http://www.idi.ntnu.no/~joao
Scalability
• The improvement of S2I over DIR-tree increases with
the cardinality of the dataset
Different datasets
• The advantage of S2I over DIR-tree is higher
for datasets with few terms per document
Terms removal
• Terms with length=1
• Terms that have no letter character
– ! Character.isLetter(token.charAt(i))
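The two removal rules can be combined into one filter. This is a minimal sketch with a hypothetical class and method name; it additionally drops empty tokens, which the slide does not mention.

```java
final class TermFilterSketch {
    /** Keeps a token only if it is longer than one character and contains a letter. */
    static boolean keep(String token) {
        if (token.length() <= 1) {
            return false; // rule 1: terms with length 1 (and empty tokens) are removed
        }
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetter(token.charAt(i))) {
                return true; // at least one letter character: keep the term
            }
        }
        return false; // rule 2: no letter character anywhere in the term
    }
}
```

For example, a purely numeric token such as "2011" is dropped, while a mixed token such as "4sq" is kept because it contains letters.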