PPTX - CSE - University of New South Wales

Efficient Error-tolerant Query Autocompletion
Chuan Xiao¹, Jianbin Qin², Wei Wang², Yoshiharu Ishikawa¹, Koji Tsuda³, Kunihiko Sadakane⁴
¹ Nagoya University, Japan
² University of New South Wales, Australia
³ AIST and JST ERATO, Japan
⁴ NII, Japan
Presenter: Jianbin Qin (jqin@cse.unsw.edu.au)
Database Group – CSE - UNSW
Target String set S = {s1, s2, …, sn}.
Edit distance threshold τ .
User query string q
Return the result set R containing every string s ∈ S that has a prefix s′ ≼ s with ed(s′, q) ≤ τ.
[Figure: a client (mobile phone or browser) sends the query q to the Edit Distance Prefix Searcher, which consults an index built over the target string set S and returns R.]
Example: τ = 1, q = “abc”, S = {“acdefg”, “cda”, …}.
Then R = {“acdefg”}, because ED(“abc”, “ac”) = 1 ≤ τ.
Challenges: the string set S is usually very large, and query response time is critical.
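As a reference point for the problem definition, the following is a minimal brute-force sketch: it scans every string and computes the smallest edit distance between q and any prefix of s with a standard dynamic program. The function names are illustrative and not from the slides, and this linear scan is exactly what an index is meant to avoid.

```python
def prefix_edit_distance(s: str, q: str) -> int:
    """Smallest ed(s', q) over all prefixes s' of s."""
    # row[j] = ed(current prefix of s, q[:j]); start with the empty prefix.
    row = list(range(len(q) + 1))
    best = row[-1]
    for i, ch in enumerate(s, start=1):
        new_row = [i]
        for j, qc in enumerate(q, start=1):
            new_row.append(min(
                row[j] + 1,               # ch has no counterpart in q
                new_row[j - 1] + 1,       # qc has no counterpart in s
                row[j - 1] + (ch != qc),  # match or substitution
            ))
        row = new_row
        best = min(best, row[-1])
    return best


def error_tolerant_prefix_search(strings, q, tau):
    """Return every s that has a prefix within edit distance tau of q."""
    return [s for s in strings if prefix_edit_distance(s, q) <= tau]


print(error_tolerant_prefix_search(["acdefg", "cda"], "abc", 1))
```

On the example above this returns only “acdefg”, since its prefix “ac” is one edit away from “abc”.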
Directly index string set S into a trie.
Simulate the edit distance calculation while traversing the trie.
Example: τ = 1. As the user types q = “abc”, the query grows through “”, “a”, “ab”, “abc”.
Drawback: too many nodes are tracked during the process, O(|Σ|^τ).

SiD  String
S1   “abcd”
S2   “abdc”
S4   “bcd”

[Figure: trie over S with nodes numbered 0 to 9 and root ξ; each node is marked as ED = 0, ED = 1, or ED > 1 with respect to the current query.]
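The idea of simulating the edit distance calculation during trie traversal can be sketched as follows. This is a generic one-shot version that carries one dynamic-programming row per trie node; it is an illustration under that assumption, not the ICAN or ICPAN algorithm, and the class and function names are made up for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.string_ids = []  # ids of the strings that end at this node


def build_trie(strings):
    root = TrieNode()
    for sid, s in strings.items():
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.string_ids.append(sid)
    return root


def collect_ids(node, out):
    out.update(node.string_ids)
    for child in node.children.values():
        collect_ids(child, out)


def trie_prefix_search(root, q, tau):
    results = set()

    def visit(node, row):
        # row[j] = ed(prefix spelled by this node, q[:j])
        if row[-1] <= tau:
            collect_ids(node, results)  # every string below has a matching prefix
            return
        if min(row) > tau:
            return  # row minima never decrease, so no descendant can match
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j, qc in enumerate(q, start=1):
                new_row.append(min(row[j] + 1,
                                   new_row[j - 1] + 1,
                                   row[j - 1] + (ch != qc)))
            visit(child, new_row)

    visit(root, list(range(len(q) + 1)))
    return results


root = build_trie({"S1": "abcd", "S2": "abdc", "S4": "bcd"})
print(sorted(trie_prefix_search(root, "abc", 1)))
```

With τ = 1 all three strings qualify (through the prefixes “abc”, “ab”, and “bc”); with τ = 0 only S1 does.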
We offer another option: trade space for runtime performance.
• Transform the edit prefix search problem into an exact prefix search problem.
• Build a deletion-variants trie as the index.
• Error-tolerant prefix searcher: up to 1000× faster.
• Index: up to 20× larger.
• One server can serve up to 1000 times more users simultaneously.
Deletion Neighborhood Generation.
s = abcd; 2-variant family of s, V(s,2):
0-variants: abcd {}
1-variants: bcd {1}, acd {2}, abd {3}, abc {4}
2-variants: cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
⟨x, Dx⟩ is called a variant-list pair, where Dx is the deletion list.
V(s,k) is the union of the 0- to k-variant-list pairs and is called the k-variant family of s.
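The variant family can be generated mechanically. The sketch below assumes the deletion list records positions sequentially, that is, each position is counted in the string as it stands after the earlier deletions; this reading is inferred from the lists on the slide (for example bd {1,2}). The function name is illustrative, and duplicates that arise for strings with repeated characters are not merged.

```python
from itertools import combinations


def variant_family(s, k):
    """Return V(s, k) as a list of (variant, deletion_list) pairs."""
    family = []
    for i in range(min(k, len(s)) + 1):
        # Choose i positions (0-based, increasing) of s to delete.
        for positions in combinations(range(len(s)), i):
            deleted = set(positions)
            variant = "".join(c for idx, c in enumerate(s) if idx not in deleted)
            # The n-th deletion sits at 1-based position p + 1 in s,
            # shifted left by the n deletions performed before it.
            deletion_list = tuple(p + 1 - n for n, p in enumerate(positions))
            family.append((variant, deletion_list))
    return family


for variant, dlist in variant_family("abcd", 2):
    print(variant, set(dlist) if not dlist else list(dlist))
```

For s = abcd and k = 2 this yields the 11 pairs listed on the slide, in the same order.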
s = abcd, V(s,2):
abcd {}
bcd {1}, acd {2}, abd {3}, abc {4}
cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
q = abxd, V(q,2):
abxd {}
bxd {1}, axd {2}, abd {3}, abx {4}
xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}
Variants Matching Principle:
Given two strings s and t, ED(s,t) ≤ τ iff there exist ⟨x,Dx⟩ ∈ V(s,τ) and ⟨y,Dy⟩ ∈ V(t,τ) such that x = y and |Dx ∪ Dy| ≤ τ.
Two conditions must be satisfied:
1. x = y: identity check (efficiently processed with an index).
2. |Dx ∪ Dy| ≤ τ: deletion-list union size check (no efficient method).
q = abxd, enumerated 2-variant family of q, EnumV(q,2):
abxd {} → abxd {}, abxd {1}, abxd {2}, …, abxd {3,4}, …, abxd {4,4}
abd {3} → abd {}, abd {1}, abd {2}, abd {3}, …, abd {3,3}
ab {3,3} → ab {}, ab {3}, ab {3,3}
……
Enumerated Variants Matching Principle:
Given two strings s and q, ED(s,q) ≤ τ iff there exist ⟨x,Dx⟩ ∈ V(s,τ) and ⟨y,Dy⟩ ∈ EnumV(q,τ) such that x = y and Dx = Dy.
Then we encode ⟨x, Dx⟩ together by marking each deleted position with #; for example, bd {1,2} becomes #b#d.
s = “abcd” → {abcd, #bcd, a#cd, ab#d, abc#, ##cd, #b#d, …}

SiD  String   Encoded variants
S1   “abcd”   abcd, #bcd, a#cd, ab#d, abc#, …
S2   “abdc”   abdc, #bdc, a#dc, ab#c, abd#, …
S3   “bcd”    bcd, #cd, b#d, bc#, …

[Figure: trie built over the encoded variants of S1, S2, and S3, rooted at ξ, with # as an additional edge label.]
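A minimal sketch of the encoding step, assuming each variant is written by replacing the deleted characters with # and then inserted into an ordinary trie. It covers only the data side: the query-side enumeration and the trie compression are not reproduced. The nested-dict trie and all function names are choices made for this example.

```python
from itertools import combinations


def marked_variants(s, tau, mark="#"):
    """All copies of s with at most tau characters replaced by the mark."""
    out = []
    for i in range(min(tau, len(s)) + 1):
        for positions in combinations(range(len(s)), i):
            chars = list(s)
            for p in positions:
                chars[p] = mark
            out.append("".join(chars))
    return out


def build_variant_trie(strings, tau):
    """Nested-dict trie; the key None holds ids of strings whose variant ends there."""
    root = {}
    for sid, s in strings.items():
        for variant in marked_variants(s, tau):
            node = root
            for ch in variant:
                node = node.setdefault(ch, {})
            node.setdefault(None, set()).add(sid)
    return root


def exact_prefix_lookup(root, prefix):
    """Ids of all strings that have an encoded variant starting with prefix."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return set()
    ids, stack = set(), [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key is None:
                ids |= value
            else:
                stack.append(value)
    return ids


index = build_variant_trie({"S1": "abcd", "S2": "abdc", "S3": "bcd"}, tau=1)
print(marked_variants("abcd", 1))
print(sorted(exact_prefix_lookup(index, "ab#")))
```

Once the variants are indexed, an error-tolerant lookup reduces to exact prefix lookups such as “ab#”, which here reaches S1 (through ab#d) and S2 (through ab#c).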
q = abc, τ = 2
Deletion-marked variants of q: abc, #bc, a#c, ab#, ##c, #b#, a##
Enumerated variants for each:
abc → abc, #abc, a#bc, ab#c, abc#, ##abc, #a#bc, #ab#c, #abc#, a##bc, a#b#c, a#bc#, ab##c, ab#c#, abc##
#bc → bc, #bc, ##bc, #b#c, #bc#
a#c → ac, a#c, #a#c, a##c, a#c#
ab# → ab, ab#, #ab#, a#b#, ab##
##c → c, #c, ##c
#b# → b, #b, b#, #b#
a## → a, a#, a##
q = “”
EnumV = {ξ, #}
q = “a”
EnumV = {a, a#, #}
q = “ab”
EnumV = {ab, ab#, a#, #b}
q = “abc”
EnumV = {abc, abc#, ab#c, ab#, a#c, #bc}
O(τ · (|q| + τ)^τ)
[Figure: trie over the encoded variants of S1, S2, and S3, rooted at ξ.]
Dataset: DBLP, 351,207 terms, average length 8, |Σ| = 27.
Prefix length is the query length. The time and size are all interval count.
Results are averaged over 1000 queries.
Edit distance threshold τ = 3.
IncNGTrie: our algorithm.
ICAN and ICPAN: previous direct trie methods.
DirectTrie: original trie.
NoReduction: IncNGTrie before compression.
StringMerge: merge branches reaching the same string.
SubtreeMerge: merge subtrees with identical content.
• An alternative way to solve the edit prefix search problem.
• Our method is independent of the character set size.
• Up to 1000 times improvement in query performance.
• A data-adaptive enumeration method.
The core component is the Prefix Edit Similarity Search.
• A string Q t-edit prefix matches another string S if there exists a prefix of S whose edit distance to Q is within t.
• R = {s | s ∈ S, ∃s′ ∈ P(s) such that ed(s′, Q) ≤ t}, where P(s) denotes all the prefixes of s.
Example: if t = 1, Q = “abc” t-edit prefix matches “acdefghtijk”, as “ac” is a prefix of “acdefghtijk” and ed(Q, “ac”) ≤ 1.
[Figure: the user client sends Q to the Fuzzy Prefix Searcher (the core), which uses an index over the target string set; a Result Ranker returns R.]
Q = “p”

SiD  String
S1   abcd
S2   abdc
S4   bade
S5   bcd

[Figure: trie over S with nodes numbered 0 to 12, rooted at ξ.]
q = “”
EnumV = {ξ1, #1, #2}
q = “a”
EnumV = {a2, a#2, a#3, #2}
q = “ab”
EnumV = {ab3, ab#3, ab#4, a#3, #b3}
q = “abc”
EnumV = {abc4, abc#4, abc#5, ab#c4, ab#4, a#c4, #bc4}
[Figure: trie over the encoded variants of S1, S2, and S3, rooted at ξ.]
Index the data strings into a trie (radix tree).
Keep the active nodes while traversing the tree.
For each query character Q[i] entered, traverse the trie and incrementally maintain all the nodes n such that ed(n, Q[1..i]) ≤ t (also called active nodes or states).
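The per-keystroke maintenance of active nodes can be sketched as below. This is a generic illustration of the idea rather than the exact ICAN or ICPAN procedure; the `prefix` field, the helper names, and the choice t = 1 in the usage are assumptions made for the example. Each keystroke derives the new active set only from the previous one.

```python
class Node:
    def __init__(self, prefix=""):
        self.prefix = prefix   # string spelled from the root, kept for display
        self.children = {}     # char -> Node
        self.string_ids = []


def build_trie(strings):
    root = Node()
    for sid, s in strings.items():
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = Node(node.prefix + ch)
            node = node.children[ch]
        node.string_ids.append(sid)
    return root


def _relax(new, node, d, t):
    # Record distance d for node if it is within t and improves on what we have,
    # then let descendants inherit it at +1 per extra trie character.
    if d <= t and d < new.get(node, t + 1):
        new[node] = d
        for child in node.children.values():
            _relax(new, child, d + 1, t)


def initial_active(root, t):
    """Active nodes for the empty query: every node of depth at most t."""
    new = {}
    _relax(new, root, 0, t)
    return new


def advance(active, c, t):
    """Active nodes after the user types one more character c."""
    new = {}
    for node, d in active.items():
        _relax(new, node, d + 1, t)                # c has no counterpart in the trie
        for ch, child in node.children.items():
            _relax(new, child, d + (ch != c), t)   # c matches or substitutes ch
    return new


root = build_trie({"s1": "cab", "s2": "eat", "s3": "map"})
active = advance(initial_active(root, 1), "p", 1)
print(sorted((n.prefix, d) for n, d in active.items()))
```

For Q = “p” and t = 1 the active nodes are the root and the three depth-one nodes c, e, and m, each at distance 1.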
Id  String
s1  cab
s2  eat
s3  map

[Figure: the trie over s1, s2, and s3 with nodes numbered 0 to 9, shown once for Q = Ø and once for Q = “p”.]
Embed the second condition into the first condition and process it efficiently with the index.
s = “abcd”: 0-Variant-list = {<abcd>}
            1-Variant-list = {<#bcd>, <a#cd>, <ab#d>, <abc#>}
            2-Variant-list = {<##cd>, <#b#d>, <#bc#>, …}
q = “abxd”: 0-Variant-list = {<abxd, {}>}
            1-Variant-list = {<bxd, {1}>, <axd, {2}>, <abd, {3}>, …}
            2-Variant-list = {<xd, {1,1}>, <bd, {1,2}>, <bx, {1,3}>, …}
Extended from exact prefix search methods:
Directly index the strings S into a trie.
Find the node that exactly matches the query q.
Example: the user types q = “abc” through “”, “a”, “ab”, “abc”.

SiD  String
S1   “abcd”
S2   “abdc”
S3   “bade”
S4   “bcd”

[Figure: trie over S with nodes numbered 0 to 12, rooted at ξ.]
Simulate the edit distance calculation while traversing the trie.
Directly index the strings S into a trie.
Example: t = 1. The user types q = “abc” through “”, “a”, “ab”, “abc”.
Drawback: too many nodes are tracked during the process.

SiD  String
S1   “abcd”
S2   “abdc”
S3   “bade”
S4   “bcd”

[Figure: trie over S with nodes numbered 0 to 12, rooted at ξ; each node is marked as ED = 0, ED = 1, or ED > 1.]
Extended from exact prefix search methods:
Directly index the strings S into a trie.
Find the node that exactly matches the query q.
Example: the user types q = “abc” through “”, “a”, “ab”, “abc”.

SiD  String
S1   “abcd”
S2   “abdc”
S3   “bcd”

[Figure: trie over S with nodes numbered 0 to 9, rooted at ξ.]
[Figure: trie with # edges over the encoded variants of S1, S2, and S3, rooted at ξ.]
K-Matching Variant
Given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y contains the same string content as x (not counting the mark symbol) and the size of the union of their deletion-position lists is ≤ k, y is called a k-matching variant of x.
cb, c#b, #cb, cb#, c##b, #c#b, c#b#
Problem Transformation
c#b
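The definition can be checked against the list on this slide, read here as the 2-matching variants of c#b. The sketch rests on two assumptions that are not stated on the slide: each # is recorded as one plus the number of content characters before it, and the union is a multiset union (maximum of the counts). Under these assumptions the enumeration reproduces the seven strings listed. All function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def deletion_list(marked, mark="#"):
    """Position of each mark: 1 + number of content characters before it."""
    positions, seen = [], 0
    for ch in marked:
        if ch == mark:
            positions.append(seen + 1)
        else:
            seen += 1
    return positions


def is_k_matching(x, y, k, mark="#"):
    if x.replace(mark, "") != y.replace(mark, ""):
        return False  # different string content
    # Multiset union: Counter's | operator keeps the maximum count per position.
    union = Counter(deletion_list(x, mark)) | Counter(deletion_list(y, mark))
    return sum(union.values()) <= k


def k_matching_variants(x, k, mark="#"):
    """Enumerate every marked string with at most k marks that k-matches x."""
    content = x.replace(mark, "")
    found = []
    for i in range(k + 1):
        n = len(content) + i
        for marks in combinations(range(n), i):
            chars = iter(content)
            y = "".join(mark if p in marks else next(chars) for p in range(n))
            if is_k_matching(x, y, k, mark):
                found.append(y)
    return found


print(k_matching_variants("c#b", 2))
```

Strings such as ##cb, #cb#, and cb## are rejected because their union with the deletion list of c#b has size 3.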