Efficient Error-tolerant Query Autocompletion

Chuan Xiao (1), Jianbin Qin (2), Wei Wang (2), Yoshiharu Ishikawa (1), Koji Tsuda (3), Kunihiko Sadakane (4)
(1) Nagoya University, Japan
(2) University of New South Wales, Australia
(3) AIST and JST ERATO, Japan
(4) NII, Japan

Presenter: Jianbin Qin, jqin@cse.unsw.edu.au
Database Group – CSE - UNSW

• Target string set S = {s1, s2, …, sn}.
• Edit distance threshold τ.
• User query string q.
• Return a set of result strings R that contains all strings s ∈ S such that ∃ s′ ≼ s with ed(s′, q) ≤ τ, where s′ ≼ s means that s′ is a prefix of s.

[Figure: system diagram. A mobile phone or browser sends q and receives R; the Edit Distance Prefix Searcher (the core) uses an index over the target string set S.]

Example: τ = 1, q = “abc”, S = {“acdefg”, “cda”, …}. Then R = {“acdefg”}, as ed(“abc”, “ac”) = 1 ≤ τ.

Challenges:
• The string set S is usually very large.
• Query response time is critical.

Directly index the string set S into a trie.
Simulate the edit distance calculation while traversing the trie.

Example: τ = 1. The user types q = “abc” one character at a time: “”, “a”, “ab”.

SID   String
S1    abcd
S2    abdc
S4    bcd

[Figure: trie over S rooted at ξ; nodes are marked ED = 0, ED = 1, or ED > 1 with respect to the current query prefix.]

Drawback: too many nodes must be tracked during the process, O(|Σ|^τ).

We offer another option that trades space for runtime performance.
• Transform the edit prefix search problem into an exact prefix search problem.
• Build a deletion-variants trie.
• Error-tolerant prefix searcher: up to 1000× faster.
• Index: up to 20× larger.
• One server can serve up to 1000 times more users simultaneously.

Deletion Neighborhood Generation

s = abcd; 2-variant family of s, V(s, 2):
0-variants: abcd {}
1-variants: bcd {1}, acd {2}, abd {3}, abc {4}
2-variants: cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

⟨x, Dx⟩ is called a variant-list pair; Dx is the deletion list.
V(s, k), the union of the 0- to k-variant-list pairs, is called the k-variant family of s.
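The deletion neighborhood can be generated mechanically. The following is a minimal Python sketch, not the authors' implementation. The function name `variant_family` is hypothetical. It assumes, consistently with the table for s = abcd, that each entry of a deletion list is the position of the deleted character in the string as it stood just before that deletion, so deleting original positions 1 and 3 is recorded as {1,2}.

```python
from itertools import combinations


def variant_family(s, k):
    """Return V(s, k) as a list of (variant, deletion_list) pairs."""
    family = []
    for i in range(min(k, len(s)) + 1):
        # Choose which i original positions (1-based) to delete.
        for positions in combinations(range(1, len(s) + 1), i):
            removed = set(positions)
            variant = "".join(
                ch for pos, ch in enumerate(s, start=1) if pos not in removed
            )
            # The j-th deletion (j = 0, 1, ...) happens after j earlier
            # characters were removed, so its recorded position shifts left by j.
            deletion_list = tuple(p - j for j, p in enumerate(positions))
            family.append((variant, deletion_list))
    return family


for variant, deletion_list in variant_family("abcd", 2):
    print(variant, set(deletion_list) if not deletion_list else list(deletion_list))
```

For s = abcd and k = 2 this yields 11 pairs (1 + 4 + 6), the same variants and deletion lists as the slide.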
s = abcd, V(s, 2):
abcd {}
bcd {1}, acd {2}, abd {3}, abc {4}
cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

q = abxd, V(q, 2):
abxd {}
bxd {1}, axd {2}, abd {3}, abx {4}
xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}

Variants Matching Principle: Given two strings s and t, ed(s, t) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ V(t, τ) such that x = y and |Dx ∪ Dy| ≤ τ.

Two conditions must be satisfied:
1. x = y: identical check (processed efficiently with an index).
2. |Dx ∪ Dy| ≤ τ: deletion-list union size check (no efficient method).

q = abxd, enumerated 2-variant family of q, EnumV(q, 2):

Pair in V(q, 2)   Enumerated pairs
abxd {}           abxd {}, abxd {1}, abxd {2}, …, abxd {3,4}, …, abxd {4,4}
…                 …
abd {3}           abd {}, abd {1}, abd {2}, abd {3}, …, abd {3,3}
ab {3,3}          ab {}, ab {3}, ab {3,3}
…                 …

Enumerated Variants Matching Principle: Given two strings s and q, ed(s, q) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ EnumV(q, τ) such that x = y and Dx = Dy.
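The two-condition Variants Matching Principle can be tried out directly. The sketch below is illustrative only. The names `variant_family`, `union_size`, and `matches` are hypothetical, and it assumes that deletion lists are multisets, so that {1,1} ∪ {1,2} has size 3. A hash index handles the identical check, and the union size check is done pairwise on the surviving candidates.

```python
from collections import Counter, defaultdict
from itertools import combinations


def variant_family(s, k):
    """V(s, k) as (variant, deletion_list) pairs, positions as on the slides."""
    family = []
    for i in range(min(k, len(s)) + 1):
        for positions in combinations(range(1, len(s) + 1), i):
            removed = set(positions)
            variant = "".join(c for p, c in enumerate(s, 1) if p not in removed)
            family.append((variant, tuple(p - j for j, p in enumerate(positions))))
    return family


def union_size(dx, dy):
    """Size of the multiset union of two deletion lists (assumed semantics)."""
    return sum((Counter(dx) | Counter(dy)).values())


def matches(s, t, tau):
    """True if some pair satisfies x = y and |Dx ∪ Dy| <= tau."""
    # Condition 1 (identical check): hash index from variant to deletion lists.
    index = defaultdict(list)
    for x, dx in variant_family(s, tau):
        index[x].append(dx)
    # Condition 2 (union size check): evaluated pairwise on the candidates.
    for y, dy in variant_family(t, tau):
        for dx in index.get(y, ()):
            if union_size(dx, dy) <= tau:
                return True
    return False


print(matches("abcd", "abxd", 1))  # the pair abd {3} / abd {3} has union size 1
```

With s = abcd and q = abxd, the common variants at τ = 2 are abd, bd, ad, and ab. The pair for abd already satisfies the union size check at τ = 1, which agrees with the single substitution that separates the two strings.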
Then we encode ⟨x, Dx⟩ together; for example, ⟨bd, {1,2}⟩ becomes #b#d.

s = “abcd” → {abcd, #bcd, a#cd, ab#d, abc#, ##cd, #b#d, …}

SID   String   Encoded variants
S1    abcd     abcd, #bcd, a#cd, ab#d, abc#, …
S2    abdc     abdc, #bdc, a#dc, ab#c, abd#, …
S3    bcd      bcd, #cd, b#d, bc#, …

[Figure: trie rooted at ξ over the encoded variants of S1, S2, and S3, with '#' edges and leaves labeled by string IDs.]

q = abc, τ = 2

Variant of q   Enumerated variants
abc            abc, #abc, a#bc, ab#c, abc#, ##abc, #a#bc, #ab#c, #abc#, a##bc, a#b#c, a#bc#, ab##c, ab#c#, abc##
#bc            bc, #bc, ##bc, #b#c, #bc#
a#c            ac, a#c, #a#c, a##c, a#c#
ab#            ab, ab#, #ab#, a#b#, ab##
##c            c, #c, ##c
#b#            b, #b, b#, #b#
a##            a, a#, a##

q = “”      EnumV = {ξ, #}
q = “a”     EnumV = {a, a#, #}
q = “ab”    EnumV = {ab, ab#, a#, #b}
q = “abc”   EnumV = {abc, abc#, ab#c, ab#, a#c, #bc}

O(τ · (|q| + τ)^τ)

[Figure: trie rooted at ξ over the encoded variants of S1, S2, and S3, with '#' edges and leaves labeled by string IDs.]

Dataset: DBLP, 351,207 terms; average length 8; |Σ| = 27.
• Prefix length is the query length.
• Time and size are all interval counts.
• Results are averaged over 1000 queries.
• Edit distance threshold τ = 3.
• IncNGTrie: our algorithm.
• ICAN and ICPAN: previous direct-trie methods.

Method         Description
DirectTrie     Original trie.
NoReduction    IncNGTrie before compression.
StringMerge    Merge branches reaching the same string.
SubtreeMerge   Merge subtrees with identical content.

• An alternative way to solve the edit prefix search problem.
• Our method is independent of the character set size.
• Up to 1000 times faster query processing.
• Data-adaptive enumeration method.

The core component is the prefix edit similarity search.
• A string Q t-edit prefix matches another string S if there exists a prefix of S whose edit distance to Q is within t.
• R = {s | s ∈ S, ∃ s′ ∈ P(s) such that ed(s′, Q) ≤ t}, where P(s) denotes the set of all prefixes of s.

Example: if t = 1, Q = “abc” t-edit prefix matches “acdefghtijk”, as “ac” is a prefix of “acdefghtijk” and ed(Q, “ac”) ≤ 1.

[Figure: system diagram with a user client (sending Q, receiving R), a result ranker, the fuzzy prefix searcher (the core), an index, and the target string set.]

[Figure: trie rooted at ξ over the string set S = {abcd, abdc, bade, bcd}, shown for the query Q = “p”.]

q = “”      EnumV = {ξ_1, #_1, #_2}
q = “a”     EnumV = {a_2, a#_2, a#_3, #_2}
q = “ab”    EnumV = {ab_3, ab#_3, ab#_4, a#_3, #b_3}
q = “abc”   EnumV = {abc_4, abc#_4, abc#_5, ab#c_4, ab#_4, a#c_4, #bc_4}

[Figure: trie rooted at ξ over the encoded variants of S1, S2, and S3, with '#' edges and leaves labeled by string IDs.]

Index the data strings into a trie (radix tree).
Keep a set of active nodes while traversing the trie.
For each query character Q[i] entered, traverse the trie and incrementally maintain all the nodes n such that ed(n, Q[1..i]) ≤ t (also called active nodes or active states).

Id   String
s1   cab
s2   eat
s3   map

[Figure: the trie over s1, s2, and s3, shown once for Q = Ø and once for Q = “p”.]

Embed the second condition into the first condition and process it efficiently with the index.

s = “abcd”
0-Variant-list = {<abcd>}
1-Variant-list = {<#bcd>, <a#cd>, <ab#d>, <abc#>}
2-Variant-list = {<##cd>, <#b#d>, <#bc#>, …}

q = “abxd”
0-Variant-list = {<abxd, {}>}
1-Variant-list = {<bxd, {1}>, <axd, {2}>, <abd, {3}>, …}
2-Variant-list = {<xd, {1,1}>, <bd, {1,2}>, <bx, {1,3}>, …}

Extended from exact prefix search methods:
Directly index the strings in S into a trie.
Find the node that exactly matches the query q.
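Exact prefix search on a trie, the starting point that the error-tolerant methods extend, can be sketched as follows. This is a minimal illustration rather than the authors' code: `TrieNode`, `build_trie`, and `exact_prefix_search` are hypothetical names, and the string IDs S1 to S4 follow the example table used on the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.ids = []       # IDs of the strings that end at this node


def build_trie(strings):
    """Index a dict {string_id: string} into a trie and return the root."""
    root = TrieNode()
    for sid, s in strings.items():
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(sid)
    return root


def exact_prefix_search(root, q):
    """Return the sorted IDs of all strings that have q as an exact prefix."""
    node = root
    for ch in q:
        node = node.children.get(ch)
        if node is None:
            return []  # no string starts with q
    # Every string stored in the subtree below the matched node is a result.
    result, stack = [], [node]
    while stack:
        current = stack.pop()
        result.extend(current.ids)
        stack.extend(current.children.values())
    return sorted(result)


S = {"S1": "abcd", "S2": "abdc", "S3": "bade", "S4": "bcd"}
root = build_trie(S)
for prefix in ["", "a", "ab", "abc"]:
    print(repr(prefix), exact_prefix_search(root, prefix))
```

Because each keystroke only moves one edge further down from the previously matched node, an incremental version would keep that node between keystrokes instead of restarting from the root.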
Example: the user types q = “abc” one character at a time: “”, “a”, “ab”.

SID   String
S1    abcd
S2    abdc
S3    bade
S4    bcd

[Figure: trie over S rooted at ξ, with leaves labeled S1 to S4.]

Simulate the edit distance calculation while traversing the trie.
• Directly index the strings in S into a trie.
• Example: t = 1; the user types q = “abc” one character at a time: “”, “a”, “ab”.
• Drawback: too many nodes must be tracked during the process.

[Figure: the same trie over S, with nodes marked ED = 0, ED = 1, or ED > 1 with respect to the current query prefix.]

[Figure: trie rooted at ξ over deletion-marked variants, with '#' edges and leaves labeled S1 to S3.]

K-Matching Variant

Given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y has the same string content as x (not counting the mark symbol) and the size of the union of their deletion-position lists is ≤ k, then y is called a k-matching variant of x.

Problem transformation, example for c#b: cb, c#b, #cb, cb#, c##b, #c#b, c#b#
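The k-matching definition can be checked against the c#b example. The sketch below is a hedged illustration: the function names are hypothetical, k = 2 is an assumption (the slide does not state k), and the union of deletion-position lists is taken as a multiset union. Under those assumptions the enumeration returns exactly the seven strings listed for c#b.

```python
from collections import Counter
from itertools import combinations_with_replacement

MARK = "#"


def decode(marked):
    """Split a deletion-marked variant into (content, deletion_list)."""
    content, deletions = [], []
    for ch in marked:
        if ch == MARK:
            # A mark is recorded at the position of the next kept character.
            deletions.append(len(content) + 1)
        else:
            content.append(ch)
    return "".join(content), tuple(deletions)


def encode(content, deletions):
    """Inverse of decode: put the marks back in front of their positions."""
    counts = Counter(deletions)
    out = []
    for pos in range(1, len(content) + 2):
        out.append(MARK * counts[pos])
        if pos <= len(content):
            out.append(content[pos - 1])
    return "".join(out)


def k_matching_variants(x, k):
    """All marked variants y with x's content and |Dx ∪ Dy| <= k."""
    content, dx = decode(x)
    found = []
    for i in range(k + 1):  # y carries at most k marks
        slots = range(1, len(content) + 2)
        for dy in combinations_with_replacement(slots, i):
            # Multiset union size (assumed reading of the definition).
            if sum((Counter(dx) | Counter(dy)).values()) <= k:
                found.append(encode(content, dy))
    return found


print(k_matching_variants("c#b", 2))
```

Variants such as ##cb or cb## are rejected because their marks, together with the mark of c#b, give a union of size 3.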