Reverse Spatial and Textual k Nearest Neighbor Search

Jiaheng Lu
Renmin University of China
Presentation at Aalborg University, August 11, 2011
Research experience
 Associate Professor: Renmin University of China
 XML data management, spatial data management, cloud data management
 Post-doc: University of California, Irvine
 Data integration, approximate string matching
 PhD: National University of Singapore
 XML data management
Outline
 XML data management
 XML twig query processing
 XML keyword search
 Approximate string matching
 Reverse Spatial and Textual k Nearest Neighbor
Search
XML twig query processing
 XPath: Section[Title]/Paragraph//Figure
 Twig pattern: a small tree with root Section, child Title, child Paragraph, and descendant Figure under Paragraph
XML twig query processing (Cont.)
 Problem Statement
Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
 E.g., consider the query twig (section, title, figure) against a document in which section s1 contains title t1 and a nested section s2, and s2 contains title t2 and paragraph p1, which holds figure f1.
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
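As a toy illustration of the problem statement (a brute-force matcher, far less efficient than the holistic algorithms this talk covers; element names and IDs follow the slide's example), the three solutions can be enumerated by nested traversal:

```python
import xml.etree.ElementTree as ET

# The slide's example document: section s1 contains title t1 and a nested
# section s2; s2 contains title t2 and paragraph p1, which holds figure f1.
doc = ET.fromstring(
    "<section id='s1'><title id='t1'/>"
    "<section id='s2'><title id='t2'/>"
    "<paragraph id='p1'><figure id='f1'/></paragraph>"
    "</section></section>"
)

def match_twig(root):
    """All (section, title, figure) solutions of the twig pattern
    section//title, section//figure (both edges are descendant edges)."""
    out = []
    for s in root.iter("section"):       # every section, including nested ones
        for t in s.iter("title"):        # titles anywhere below s
            for f in s.iter("figure"):   # figures anywhere below s
                out.append((s.get("id"), t.get("id"), f.get("id")))
    return out

print(sorted(match_twig(doc)))
# [('s1', 't1', 'f1'), ('s1', 't2', 'f1'), ('s2', 't2', 'f1')]
```

The holistic algorithms below (TJFast, TwigStackList, etc.) avoid exactly this kind of exhaustive enumeration of partial matches.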
XML twig query processing (Cont.)
 Several efficient pattern matching algorithms
 TJFast (VLDB 05)
 iTwigJoin (SIGMOD 05)
 TwigStackList (CIKM 04)
 TreeMatch (TKDE 10)
 Current work: distributed XML twig pattern processing
XML twig query processing
 Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach. CIKM 2004:533-542
 Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding to Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
 Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
 Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: Effective Processing of XML Twig Pattern Matching. WWW (Special interest tracks and posters) 2005:1118-1119
 Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
 Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
 Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
 Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
 Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques. SIGMOD 2005:455-466
 ……
XML keyword search
Background: XQuery vs. keyword search
XQuery (complicated):
for $a in doc("bib.xml")//author,
    $n in $a/name
where $n = "Mike"
return $a//inproceedings
Keyword search (query papers by "Mike"):
Mike, inproceedings
XML keyword search
 The proposed keyword search returns the set of smallest trees containing all keywords.
 E.g., for the keywords "Mike" and "hobby" against a bibliography document (with authors Mike and John, their publications, years, and hobbies), the smallest subtrees containing both keywords are returned rather than the whole document.
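A common formalization of "smallest trees containing all keywords" is the smallest lowest common ancestor (SLCA). A minimal sketch over Dewey-style node paths (the paths below are hypothetical, not taken from the slide's document):

```python
def lca(p1, p2):
    """Lowest common ancestor of two Dewey-style paths = longest common prefix."""
    out = []
    for a, b in zip(p1, p2):
        if a != b:
            break
        out.append(a)
    return tuple(out)

def slca(list1, list2):
    """Smallest LCAs of two keyword inverted lists (toy quadratic version):
    LCAs that have no other LCA as a descendant."""
    lcas = {lca(a, b) for a in list1 for b in list2}
    return sorted(v for v in lcas
                  if not any(w != v and w[:len(v)] == v for w in lcas))

# Keyword 1 ("Mike") occurs at node (0,0,0); keyword 2 ("hobby") at
# (0,0,1) and (0,1,2). The smallest containing subtree is rooted at (0,0).
print(slca([(0, 0, 0)], [(0, 0, 1), (0, 1, 2)]))  # [(0, 0)]
```

The root (0,) also contains both keywords, but it is not returned because a smaller containing subtree exists.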
Effectiveness
 Capture the user's search intention
 Identify the target that users intend to search for
 Infer the predicate constraints that users intend to search via
 Result ranking
 Rank the query results according to their objective relevance to the user's search intention
XML keyword search
 Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: An Interactive XML Keyword Searching. CIKM 2010:1933-1934
 Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
 Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE 22(8):1077-1092 (2010)
 Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
 Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
 Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
 Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
 ……
Outline
 XML data management
 XML twig query processing
 XML keyword search
 Approximate string matching
 Reverse Spatial and Textual k Nearest Neighbor
Search
Motivation: Data Cleaning
 Real-world data is dirty
 Typos (e.g., a Wikipedia caption that should clearly read "Niels Bohr")
 Inconsistent representations (PO Box vs. P.O. Box)
 Approximately check against a clean dictionary
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Motivation: Record Linkage
We want to link records belonging to the same entity.
Table 1 (Name, Phone, Age): Brad Pitt, Arnold Schwarzeneger, George Bush, Angelina Jolie, Forrest Whittaker
Table 2 (Name, Hobbies, Address): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzenegger
No exact match! The same entity may have similar representations:
 Arnold Schwarzeneger versus Arnold Schwarzenegger
 Forrest Whittaker versus Forest Whittacker
Motivation: Query Relaxation
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
 Errors in queries
 Errors in data
 Bring queries and meaningful results closer together
What is Approximate String Search?
String collection (people): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger, …
Queries against the collection:
 Find all entries similar to "Forrest Whitaker"
 Find all entries similar to "Arnold Schwarzenegger"
 Find all entries similar to "Brittany Spears"
What do we mean by "similar to"?
 Edit distance
 Jaccard similarity
 Cosine similarity
 Dice
 Etc.
The "similar to" predicate can help the applications described above!
How can we support these types of queries efficiently?
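Edit distance, the first measure listed, is the minimum number of single-character insertions, deletions, and substitutions turning one string into the other. A standard dynamic-programming sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))          # distances from "" prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute / match
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))                    # 3
print(edit_distance("Forest Whittacker", "Forrest Whitaker"))  # 3
```

So the misspelled name from the record-linkage slide is only three edits away from the correct one, which is what the "similar to" predicate exploits.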
Approximate Query Answering
Main idea: use q-grams as signatures for a string.
"irvine" → (sliding window) → 2-grams {ir, rv, vi, in, ne}
Intuition: similar strings share a certain number of grams. An inverted index on grams supports finding all data strings sharing enough grams with a query.
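The sliding-window signature extraction is a one-liner:

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping q-grams of s, obtained by a sliding window."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("irvine"))  # ['ir', 'rv', 'vi', 'in', 'ne']

# Similar strings share most of their grams:
print(len(set(qgrams("irvine")) & set(qgrams("irvin"))))  # 4
```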
Approximate Query Example
Query: "irvine", edit distance 1; 2-grams {ir, rv, vi, in, ne}
Look up the grams in an inverted index that maps each 2-gram to the IDs of the data strings containing it.
Each edit operation can "destroy" at most q grams, so answers must share at least T = 5 - 1 * 2 = 3 grams with the query.
Merging the five inverted lists yields Candidates = {1, 5, 9}. The candidates may contain false positives, so the real similarity must still be computed.
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list-merging; T is called the merging threshold.
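The T-occurrence step can be sketched as a simple count over the inverted lists (the lists below are hypothetical stand-ins for those in the slide's figure, chosen so that the candidate set matches the slide's {1, 5, 9}):

```python
from collections import Counter

def merge_count(lists, T):
    """Naive list-merging for the T-occurrence problem: return the string
    IDs that appear in at least T of the inverted lists."""
    counts = Counter()
    for lst in lists:
        counts.update(lst)
    return sorted(sid for sid, c in counts.items() if c >= T)

# Hypothetical inverted lists for the query grams of "irvine".
lists = [[1, 3, 4, 5, 7, 9],  # ir
         [5, 9],              # rv
         [1, 5],              # vi
         [1, 2, 3, 9],        # in
         [9]]                 # ne
print(merge_count(lists, T=3))  # [1, 5, 9]
```

The cited ICDE 2008 paper studies much faster merging strategies (heap-based, skipping) than this count-everything baseline.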
Approximate string matching
 Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient Algorithms for Approximate Member Extraction Using Signature-Based Inverted Lists. CIKM 2009:315-324
 Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
 Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
 Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
 ……
Outline
 XML data management
 XML twig query processing
 XML keyword search
 Approximate string matching
 Reverse Spatial and Textual k Nearest Neighbor
Search (SIGMOD 2011)
Motivation
 If we add a new shop at location Q, which shops will be influenced?
 Influence factors:
 Spatial distance (results: D, F)
 Textual similarity of services/products (results: F, C)
(Figure: a map of shops labeled by category: clothes, food, sports.)
Problems of Finding Influential Sets
Traditional query: reverse k nearest neighbor query (RkNN)
Our new query: reverse spatial and textual k nearest neighbor query (RSTkNN)
Problem Statement
Spatial-textual similarity
• describes the similarity between objects based on both spatial proximity and textual similarity.
Spatial-textual similarity function
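The slide only names the similarity function; assuming the usual linear-combination form for such hybrid queries, with a user-specified weight α, it can be written as:

```latex
\mathrm{SimST}(o_1, o_2) \;=\; \alpha \cdot \mathrm{SimS}(o_1, o_2) \;+\; (1 - \alpha) \cdot \mathrm{SimT}(o_1, o_2)
```

where SimS is a normalized spatial proximity and SimT a textual similarity between the objects' descriptions.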
Problem Statement (con't)
 RSTkNN query
 finding the objects that have the query object as one of their k most spatial-textually similar objects.
Related Work
• Pre-computing the kNN for each object
(Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
• (Hyper) Voronoi cell/plane pruning strategies
(Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
• 60-degree-pruning method
(Stanoi et al., SIGMOD 2000)
• Branch and bound (based on Lp-norm metric space)
(Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Challenging features of RSTkNN:
• Loses Euclidean geometric properties.
• High dimensionality of the text space.
• k and α differ from query to query.
Intersection and Union R-tree (IUR-tree)
Each node of the R-tree is augmented with two text vectors, the intersection and the union of the text vectors in its subtree, which bound the textual similarity between the query and any object below the node.
(Figure: the IUR-tree over objects p1..p5 and query q(0.5, 2.5), with nodes N1..N4, their MBRs, and their intersection/union text vectors.)
Overview of the Search Algorithm
 RSTkNN algorithm:
 Traverse from the IUR-tree root
 Progressively update lower and upper bounds
 Apply the search strategy:
 prune unrelated entries into Pruned;
 report entries guaranteed to be results into Ans;
 add candidate objects to Cnd.
 Final verification
 For objects in Cnd, check whether they are results by updating the bounds for candidates using expanding entries in Pruned.
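Before the pruning machinery, the query semantics itself can be pinned down with a brute-force check. This is a hedged sketch, not the paper's algorithm: the similarity formula, data, and field names are illustrative stand-ins.

```python
import math

def sim_st(o1, o2, alpha=0.6):
    """Hypothetical spatial-textual similarity: a weighted sum of a toy
    spatial proximity in (0, 1] and Jaccard similarity of keyword sets."""
    sim_s = 1.0 / (1.0 + math.dist(o1["xy"], o2["xy"]))
    union = o1["kw"] | o2["kw"]
    sim_t = len(o1["kw"] & o2["kw"]) / len(union) if union else 0.0
    return alpha * sim_s + (1 - alpha) * sim_t

def rstknn(objects, q, k=2, alpha=0.6):
    """Brute-force RSTkNN: IDs of objects that have q among their k most
    spatial-textually similar objects (checks the semantics only)."""
    result = []
    for o in objects:
        rivals = [p for p in objects if p is not o] + [q]
        topk = sorted(rivals, key=lambda p: sim_st(o, p, alpha),
                      reverse=True)[:k]
        if q in topk:
            result.append(o["id"])
    return result

objects = [
    {"id": "A", "xy": (0, 0),     "kw": {"food"}},
    {"id": "B", "xy": (10, 10),   "kw": {"clothes"}},
    {"id": "C", "xy": (10, 10.2), "kw": {"clothes"}},
]
q = {"id": "q", "xy": (0, 0.1), "kw": {"food"}}
print(rstknn(objects, q, k=1))  # ['A']
```

The IUR-tree algorithm computes the same answer set while visiting only entries whose similarity bounds leave their status undecided.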
Example: Execution of the RSTkNN Algorithm on the IUR-tree, given k=2, alpha=0.6
(Figure: the IUR-tree over p1..p5 with query q(0.5, 2.5).)
Step 1: Initialize N4.CLs; EnQueue(U, N4).
The priority queue U now contains N4 with bounds (0, 0).
Example (Cont.)
Step 2: DeQueue(U, N4); EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1).
Mutual effects are computed between the pairs N1/N2, N1/N3, and N2/N3, yielding the bounds N1(0.37, 0.432), N2(0.21, 0.619), N3(0.323, 0.619); N4 had (0, 0.619).
Example (Cont.)
Step 3: DeQueue(U, N3); Answer.add(p4); Candidate.add(p5).
Bounds: p4(0.21, 0.619), p5(0.374, 0.374); U still holds N1(0.37, 0.432) and N2(0.21, 0.619). Mutual effects involve p5/N2 and p4/N2.
Example (Cont.)
Step 4: DeQueue(U, N2); Answer.add(p2, p3); Pruned.add(p5).
Since U and Candidate are now empty, the algorithm ends.
Results: p2, p3, p4.
Cluster IUR-tree: CIUR-tree
IUR-tree: texts in an index node could be very different.
CIUR-tree: an enhanced IUR-tree that additionally records, in each node, the textual clusters of the objects in its subtree (e.g., C1:1, C2:2).
(Figure: the CIUR-tree over p1..p5 with per-node cluster annotations and intersection/union text vectors.)
Optimizations
 Motivation
 To give tighter bounds during CIUR-tree traversal
 To purify the textual description in an index node
 Outlier Detection and Extraction (ODE-CIUR)
 Extract subtrees with outlier clusters
 Take the outliers into special account and calculate their bounds separately
 Text-entropy based optimization (TE-CIUR)
 Define TextEntropy to depict the distribution of text clusters in an entry of the CIUR-tree
 Visit entries with higher TextEntropy, i.e., more diverse texts, first
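TextEntropy is not spelled out on the slide; a natural reading is the Shannon entropy of the cluster-label distribution within an entry, sketched here under that assumption:

```python
import math
from collections import Counter

def text_entropy(cluster_labels):
    """Shannon entropy (bits) of the cluster distribution in an index entry.
    Higher entropy = more textually diverse entry, visited first in TE-CIUR."""
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(text_entropy(["C1", "C1", "C1", "C1"]))  # 0.0  (pure entry)
print(text_entropy(["C1", "C2", "C3", "C1"]))  # 1.5  (mixed entry)
```

Visiting high-entropy entries first tightens the bounds quickly, because resolving a diverse subtree removes the most uncertainty.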
Experimental Study
 Experimental setup
 OS: Windows XP; CPU: 2.0 GHz; Memory: 4 GB; Page size: 4 KB; Language: C/C++
 Compared methods
 baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE
 Datasets
 ShopBranches (Shop), extended from a small real dataset
 GeographicNames (GN), real data
 CaliforniaDBpedia (CD), generated by combining locations in California with documents from DBpedia
 Metrics
 Total query time
 Page access number

Statistics                      Shop      CD         GN
Total # of objects              304,008   1,555,209  1,868,821
Total unique words in dataset   3,933     21,578     222,409
Average # words per object      45        47         4
Scalability
(Figures: (1) log-scale and (2) linear-scale query time as the dataset grows from 0.2K to 3K, 40K, 550K, and 4M objects.)
Effect of k
(Figure: query time as k varies.)
Conclusion
 Proposed a new query problem, RSTkNN.
 Presented a hybrid index, the IUR-tree.
 Showed the enhanced variant CIUR-tree and two optimizations, ODE-CIUR and TE-CIUR, to further improve search processing.
Current and future works
 Distributed XML query processing
 Cloud-based data management
Thank you
Q&A