Efficient Approximate Search on String Collections, Part II
Marios Hadjieleftheriou, Chen Li

Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

N-Gram Signatures
- Use string signatures that upper bound similarity.
- Use the signatures as a filtering step.
- Properties:
  - The signature has to have small size.
  - Signature verification must be fast.
  - False positives are allowed; false negatives are not (the upper bound guarantees no true match is dropped).
  - Signatures have to be "indexable".

Known Signatures
- Minhash: Jaccard, Edit distance
- Prefix filter (CGK06): Hamming, Jaccard, Edit distance
- LSH (GIM99): Jaccard, Edit distance
- PartEnum (AGK06): Jaccard, Edit distance
- Mismatch filter (XWL08): Edit distance

Prefix Filter
- View q and s as bit vectors over the n-gram universe (positions 1 to 14 in the example).
- Mismatch vector for s: 6 matches, 2 missing, 2 extra.
- If |s ∩ q| ≥ 6, then every subset s' ⊆ s with |s'| ≥ 3 satisfies |s' ∩ q| ≥ 1, since s has only 2 n-grams outside q.
- In general, to witness at least k matches it suffices to keep a prefix of size |s'| = l − k + 1, where l = |s|.

Using Prefixes
- Take a random permutation of the n-gram universe and order all sets by it.
  [figure: q and s laid out under the permuted order 6 9 11 14 8 1 2 3 4 5 7 10 12 13]
- Take prefixes from both sets: with |s'| = |q'| = 3, if |s ∩ q| ≥ 6 then s' ∩ q' ≠ ∅.
- A code sketch of this unweighted filter appears at the end of this section.

Prefix Filter for Weighted Sets
- Order the n-grams by weight (a new coordinate space).
  [figure: q = {t1, t2, t4, …} and s = {t6, t8, t11, t14, …} under weights w1 ≥ w2 ≥ … ≥ w14]
- Query condition: w(q ∩ s) = Σ_{t ∈ q ∩ s} w(t) ≥ τ.
- Keep a prefix s' such that w(s') ≥ w(s) − α; the suffix s \ s' then has weight at most α.
- Best case: the suffixes alone contribute w((q \ q') ∩ (s \ s')) = α.
- Hence it suffices to check w(q' ∩ s') ≥ τ − α.

Prefix Filter Properties
- The larger we make α, the smaller the prefix.
- The larger we make α, the smaller the range of thresholds we can support: we need τ ≥ α, otherwise τ − α is negative.
- We need to pre-specify a minimum τ.
- Applies to Jaccard, Edit distance, and IDF weights.

Other Signatures
- Minhash (still to come).
- PartEnum: upper bounds Hamming distance; selects multiple subsets instead of one prefix; larger signature, but stronger guarantee.
- LSH: probabilistic with guarantees; based on hashing.
- Mismatch filter: uses positional mismatching n-grams within the prefix to obtain a lower bound on edit distance.

Signature Indexing
- Straightforward solution:
  - Create an inverted index on signature n-grams.
  - Merge the inverted lists to compute signature intersections.
- For a given string q: access only the lists in q', and find strings s with w(q' ∩ s') ≥ τ − α.

The Inverted Signature Hashtable (CCGX08)
- Maintain a signature vector for every n-gram. Consider prefix signatures for simplicity:
  - s'1 = {'tt ', 't L'}, s'2 = {'t&t', 't L'}, s'3 = …
- Co-occurrence lists: 't L': ['tt ', 't&t', …]; '&tt': ['t L', …]
- Hash all n-grams (h: n-gram → [0, m)) and convert each co-occurrence list to a bit-vector of size m.

Example
- Hash values: h(lab) = 5, h(at&) = 4, h(t&t) = 5, h(t L) = 1, h(la ) = 0, …
- Signatures: s'1 = {at&, la }, s'2 = {t&t, at&}, s'3 = {t L, at&}, s'4 = {abo, t&t}, s'5 = {t&t, la }, …
- Hashtable: 'at&' → 100011, 't&t' → 010101, … (bit i is set iff some co-occurring n-gram hashes to i)

Using the Hashtable
- Let list 'at&' correspond to bit-vector 100011: there exists a string s such that 'at&' ∈ s' and s' also contains n-grams that hash to 0, 1, or 5.
- Given query q, construct the query signature matrix: one row per n-gram in q', one column per n-gram in q; a cell is 1 when the row n-gram's bit-vector has the bit of the column n-gram's hash set.
  [figure: 0/1 matrix with rows lab, t&t, at& (from q') and columns at&, res, lab, … (from q)]
- Consider only solid (all-ones) sub-matrices r × p with r ⊆ q' and p ⊆ q.
- We only need to look at sub-matrices with w(r) ≥ τ − α and w(p) ≥ τ.

Verification
- How do we find which strings correspond to a given sub-matrix?
- Create an inverted index on string n-grams.
- Examine only the lists in r, and within them only strings with w(s) ≥ τ (remember that r ⊆ q').
- The hashtable can be used with other signatures as well.
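To make the prefix-filter mechanics concrete, here is a minimal sketch of the unweighted filter described above, assuming sets of n-grams and a fixed global ordering (alphabetical here for simplicity; the slides use a random permutation). The function names and toy data are illustrative, not from the tutorial.

```python
# Sketch of the unweighted prefix filter: under a fixed global ordering,
# if two sets share at least k elements, their prefixes of size
# |set| - k + 1 must intersect.

def prefix(ngram_set, k, order):
    """Prefix of size |s| - k + 1 under the global order."""
    sorted_grams = sorted(ngram_set, key=order)
    return set(sorted_grams[:len(ngram_set) - k + 1])

def may_share_k(q, s, k, order):
    """Filter step: returns False only when |q & s| >= k is impossible."""
    return bool(prefix(q, k, order) & prefix(s, k, order))

# Usage: with |q| = |s| = 8 and k = 6, prefixes have size 3.
order = {g: i for i, g in enumerate("abcdefghijklmnopqrstuvwxyz")}.get
q = set("abcdefgh")                  # 8 "n-grams"
s = set("abcdefxy")                  # shares 6 of them with q
print(may_share_k(q, s, 6, order))   # True: the candidate survives
```

The guarantee is one-sided, as the slides note: a surviving pair may still fail verification, but no true match is filtered out.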
Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

Length Normalized Measures
- What is normalization? Normalize similarity scores by the lengths of the strings; this can result in more meaningful matches.
- Can use L0 (i.e., the length of the string), L1, L2, etc.
- For example, L2:
  - Let w2(s) = Σ_{t ∈ s} w(t)², where the weight can be IDF, unary, language-model based, etc.
  - ||s||₂ = w2(s)^{1/2}

The L2-Length Filter (HCK+08)
- Why L2? It targets almost exact matches. Two strings match only if:
  - They have very similar n-gram sets, and hence similar L2 lengths.
  - The "extra" n-grams have truly insignificant weights in aggregate (hence, again, similar L2 lengths).

Example
- "AT&T Labs – Research"  L2 = 100
- "ATT Labs – Research"   L2 = 95
- "AT&T Labs"             L2 = 70
- What if "Research" happened to be very popular and had small weight?
- "The Dark Knight"  L2 = 75
- "Dark Night"       L2 = 72

Why L2 (continued)
- Tight L2-based length filtering results in very efficient pruning.
- L2 yields scores bounded within [0, 1]; a score of 1 means a truly perfect match, so scores are easy to interpret.
- L0 and L1 do not have the same properties:
  - Scores are bounded only by the largest string length in the database.
  - For L0, an exact match can score lower than a non-exact match!

Example
- q = {'ATT', 'TT ', 'T L', 'LAB', 'ABS'}, L0 = 5
- s1 = {'ATT'}, L0 = 1; s2 = q, L0 = 5
- S(q, s1) = Σ w(q ∩ s1) / (||q||₀ ||s1||₀) = 10/5 = 2
- S(q, s2) = Σ w(q ∩ s2) / (||q||₀ ||s2||₀) = 40/25 < 2: the exact match scores lower.

Problems
- L2 normalization poses challenges. For example, with S(q, s) = w2(q ∩ s) / (||q||₂ ||s||₂), the prefix filter cannot be applied: the minimum prefix weight α depends on both ||s||₂ and ||q||₂, but ||q||₂ is unknown at index construction time.

Important L2 Properties
- Length filtering: S(q, s) ≥ τ implies τ ||q||₂ ≤ ||s||₂ ≤ ||q||₂ / τ, so we only need to look at strings within these lengths (proof in the paper).
- Monotonicity (next).

Monotonicity
- Let s = {t1, t2, …, tm} and pw(s, t) = w(t) / ||s||₂ (the partial weight of s for t).
- Then S(q, s) = Σ_{t ∈ q ∩ s} w(t)² / (||q||₂ ||s||₂) = Σ_{t ∈ q ∩ s} pw(s, t) pw(q, t).
- If pw(s, t) > pw(r, t), then w(t)/||s||₂ > w(t)/||r||₂, hence ||s||₂ < ||r||₂.
- Hence, for any other t': w(t')/||s||₂ > w(t')/||r||₂, i.e., pw(s, t') > pw(r, t').

Indexing
- Use inverted lists sorted by pw().
  [figure: strings 0 'rich', 1 'stick', 2 'stich', 3 'stuck', 4 'static' and the inverted lists of their 2-grams (at, ch, ck, ic, ri, st, ta, ti, tu, uc), each sorted by pw(); e.g., list ic = [0, 4, 1, 2]]
- pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) implies ||0||₂ < ||4||₂ < ||1||₂ < ||2||₂, so each list is sorted by increasing L2 length.

L2 Length Filter
- Given q and τ, length filtering restricts the search to a contiguous range of each list.
  [figure: the same inverted lists, with only the in-range portions highlighted]
- We examine only a small fraction of the lists.

Monotonicity (continued)
- If I have seen string 1 in a list already, then string 4 cannot appear later in that list: lists are ordered by increasing L2 length, so once a string's position has passed, it can be skipped for good in all remaining lists.

Other Improvements
- Use the properties of the weighting scheme:
  - Scan high-weight lists first.
  - Prune according to string length and maximum potential score.
  - Ignore low-weight lists altogether.

Conclusion
- The concepts extend easily to BM25, Weighted Jaccard, DICE, and IDF.
- Take-away message: the properties of the similarity/distance function can play a big role in designing very fast indexes; L2 is super fast for almost exact matches.
- A code sketch of the length filter follows.
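Below is a small sketch of the two properties the L2-length filter exploits, under the scoring function S(q, s) = Σ_{t ∈ q ∩ s} w(t)² / (||q||₂ ||s||₂) used above. The helper names and the toy weights are mine, not the paper's.

```python
import math

# Sketch of L2 length filtering: a string can score >= tau against q
# only if its L2 length lies in [tau * ||q||2, ||q||2 / tau].

def l2(s, w):
    return math.sqrt(sum(w[t] ** 2 for t in s))

def length_filter(q, candidates, w, tau):
    """Keep only strings whose L2 length falls in the admissible range."""
    lq = l2(q, w)
    return [s for s in candidates if tau * lq <= l2(s, w) <= lq / tau]

def score(q, s, w):
    common = set(q) & set(s)
    return sum(w[t] ** 2 for t in common) / (l2(q, w) * l2(s, w))

# Usage with toy IDF-like weights: the dissimilar string is pruned
# by length alone, before any list is merged.
w = {"at": 2.0, "ti": 1.5, "ic": 1.0, "ch": 1.0, "ri": 3.0, "st": 0.5}
q = ["st", "ti", "ic"]
cands = [["st", "ti", "ic"], ["ri", "ic", "ch"]]
survivors = length_filter(q, cands, w, tau=0.8)
print([(s, round(score(q, s, w), 3)) for s in survivors])
```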
Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

The Problem
- Estimate the number of strings with:
  - Edit distance smaller than k from a query q
  - Cosine similarity higher than τ to a query q
  - Jaccard, Hamming, etc.
- Issues: estimation accuracy, size of the estimator, cost of estimation.

Motivation
- Query optimization: the optimizer needs the selectivity of query predicates, so we must support selectivity estimation for approximate string predicates.
- Visualization/querying: the expected result set size helps with visualization, and matters for remote query processing.

Flavors
- Edit distance:
  - Based on clustering (JL05)
  - Based on min-hash (MBK+07)
  - Based on wild-card n-grams (LNS07)
- Cosine similarity:
  - Based on sampling (HYK+08)

Selectivity Estimation for Edit Distance
- Problem: given a query string q, estimate the number of strings s ∈ D such that ed(q, s) ≤ δ.

Sepia – Clustering (JL05, JLV08)
- Partition the strings using clustering.
- Store per-cluster histograms: the number of strings within edit distance 0, 1, …, δ of the cluster center; this enables pruning of whole clusters.
- Compute global dataset statistics: use a training query set to compute the frequency of strings within edit distance 0, 1, …, δ of each query.

Edit Vectors
- Scalar edit distance is not discriminative, so use edit vectors, which split the distance into its insertion, deletion, and substitution counts: a 3D space instead of a 1D space.
  [figure: query q and a cluster Ci with center pi = 'Lucas'; members 'Lukas', 'Lucia', 'Luciano' annotated with edit vectors such as <1,1,1>, <2,0,0>, <1,1,0> alongside their scalar distances]

Visually
- Per-cluster frequency tables of edit vectors v(pi, s):

  F1 (cluster C1)   F2 (cluster C2)   Fn (cluster Cn)
  <0, 0, 0>: 4      <0, 0, 0>: 3      <0, 0, 0>: 2
  <0, 0, 1>: 12     <0, 1, 0>: 40     <1, 0, 2>: 84
  <1, 0, 2>: 7      <1, 0, 1>: 6      <1, 1, 1>: 1

- Global table over (v(q, pi), v(pi, s), ed(q, s)):

  v(q, pi)   v(pi, s)   ed(q, s)   #    %
  <1, 0, 1>  <0, 0, 1>  1          1    14
  <1, 0, 1>  <0, 0, 1>  2          4    57
  <1, 0, 1>  <0, 0, 1>  3          7    100
  …
  <1, 1, 0>  <1, 0, 2>  3          21   25
  <1, 1, 0>  <1, 0, 2>  4          63   75
  <1, 1, 0>  <1, 0, 2>  5          84   100
  …

Selectivity Estimation
- Use the triangle inequality: compute the edit vector v(q, pi) for every cluster i; if |v(q, pi)| > ri + δ, disregard cluster Ci (ri is the cluster radius).
- For all entries in a surviving cluster's frequency table:
  - If |v(q, pi)| + |v(pi, s)| ≤ δ, then ed(q, s) ≤ δ for all those strings.
  - If ||v(q, pi)| − |v(pi, s)|| > δ, ignore those strings.
  - Else use the global table: look up the entry <v(q, pi), v(pi, s), δ> and use the estimated fraction of strings.

Example
- δ = 3, v(q, p1) = <1, 1, 0>, v(p1, s) = <1, 0, 2>.
- Global lookup [<1, 1, 0>, <1, 0, 2>, 3] gives 25%; the F1 entry <1, 0, 2> holds 7 strings, contributing 25% × 7 = 1.75.
- Iterate through F1 and add up the contributions.

Cons
- Hard to maintain if clusters start drifting.
- Hard to find a good number of clusters.
- Space/time tradeoffs.
- Needs training to construct a good dataset statistics table.

VSol – Minhash (MBK+07)
- A solution based on minhash. Minhash is used to:
  - Estimate the size of a set, |s|.
  - Estimate the resemblance of two sets, i.e., J = |s1 ∩ s2| / |s1 ∪ s2|.
  - Estimate the size of the union, |s1 ∪ s2|.
  - Hence estimate the size of the intersection: |s1 ∩ s2| ≈ J̃(s1, s2) · |s1 ∪ s2|, using the estimates of both factors.

Minhash
- Given a set s = {t1, …, tm}, use k independent hash functions h1, …, hk, with hi: n-gram → [0, 1].
- Hash the elements of s k times; for each hash function keep the element that hashed to the smallest value.
- This reduces the set s from m to k elements. Denote the minhash signature by s'.
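Here is a compact sketch of minhash signatures as just described, together with the resemblance, set-size, and union estimators that VSol builds on. The salted-md5 hashing is a stand-in for the k independent hash functions; the formulas follow these slides.

```python
import hashlib

# Sketch of minhash: k hash functions map n-grams to [0, 1); the
# signature keeps the minimum hash value per function.

def h(i, gram):
    digest = hashlib.md5(f"{i}:{gram}".encode()).hexdigest()
    return int(digest, 16) / 16 ** 32          # roughly uniform in [0, 1)

def minhash(s, k):
    return [min(h(i, t) for t in s) for i in range(k)]

def jaccard_est(sig_q, sig_s):
    """J(q, s) ~ fraction of positions where the signatures agree."""
    return sum(a == b for a, b in zip(sig_q, sig_s)) / len(sig_q)

def size_est(sig):
    """|s| ~ k / (sum of the k minima) - 1."""
    return len(sig) / sum(sig) - 1

# Usage: true J = 3/6 = 0.5 and true |q U s| = 6.
q = {"ab", "bc", "cd", "de", "ef"}
s = {"ab", "bc", "cd", "xx"}
k = 256
sq, ss = minhash(q, k), minhash(s, k)
union_sig = [min(a, b) for a, b in zip(sq, ss)]   # signature of q U s
print(round(jaccard_est(sq, ss), 2))              # close to 0.5
print(round(size_est(union_sig)))                 # close to 6
```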
How to Use Minhash
- Given two signatures q' and s':
  - J(q, s) ≈ Σ_{1≤i≤k} I{q'[i] = s'[i]} / k
  - |s| ≈ (k / Σ_{1≤i≤k} s'[i]) − 1
  - The signature of the union is (q ∪ s)'[i] = min(q'[i], s'[i]).
  - Hence |q ∪ s| ≈ (k / Σ_{1≤i≤k} (q ∪ s)'[i]) − 1.

VSol Estimator
- Construct one inverted list per n-gram in D; the lists are our sets.
- Compute a minhash signature for each list.
  [figure: inverted lists t1 = [1, 5, …], t2 = [3, 5, …], …, t10 = [1, 8, …] and a minhash signature, e.g., [14, 25, 43]]

Selectivity Estimation
- Use the edit distance length filter: if ed(q, s) ≤ δ, then q and s share at least L = |s| − 1 − n(δ − 1) n-grams.
- Given a query q = {t1, …, tm}: the answer is the size of the union of all non-empty L-intersections of the lists (there are m choose L of them).
- We can estimate the sizes of the L-intersections using the minhash signatures.

Example
- δ = 2 and n = 3 give L = 6.
- Look at all 6-intersections of the inverted lists: A = |∪_{i1 < … < i6 ∈ [1,10]} (t_{i1} ∩ t_{i2} ∩ … ∩ t_{i6})|.
- There are (10 choose 6) such terms.

The m-L Similarity
- The union can be estimated efficiently using the minhashes:
  - ρ = Σ_{1≤j≤k} I{∃ i1, …, iL: t_{i1}'[j] = … = t_{iL}'[j]} / k
  - A ≈ ρ · |t1 ∪ … ∪ tm|
- The proof is very similar to the proof for minhash resemblance.

Cons
- Will overestimate the result: many L-intersections share strings, and the edit distance length filter is loose.

OptEQ – Wild-Card n-Grams (LNS07)
- Use extended n-grams: introduce a wild-card symbol '?'; e.g., "ab?" stands for "aba", "abb", "abc", ….
- Build an extended n-gram table:
  - Extract all 1-grams, 2-grams, …, n-grams.
  - Generalize to extended 2-grams, …, n-grams.
  - Maintain an extended n-gram/frequency hashtable.

Example n-Gram Table
- Dataset strings: abc, def, ghi, …
- Table: ab → 10, bc → 15, de → 4, ef → 1, gh → 21, hi → 2, …, ?b → 13, a? → 17, ?c → 23, …, abc → 5, def → 2, …

Query Expansion (Replacements Only)
- Given query q = "abcd", δ = 2, and replacements only, the base strings are:
  - "??cd", "?b?d", "?bc?", "a??d", "a?c?", "ab??"
- With Si = {s ∈ D: s matches base string i}, the query answer is
  A = |S1 ∪ S2 ∪ S3 ∪ S4 ∪ S5 ∪ S6| = Σ_{1≤n≤6} (−1)^{n−1} Σ_{i1 < … < in} |S_{i1} ∩ … ∩ S_{in}|.

Replacement Intersection Lattice
- To evaluate A we need the sizes of all 2-intersections, 3-intersections, …, 6-intersections; the n-gram table then yields the sum A.
- There is an exponential number of intersections, but they have well-defined structure.

Replacement Lattice
- Build the replacement lattice:
  [figure: lattice with the 2-'?' base strings ??cd, ?b?d, ?bc?, a??d, a?c?, ab?? at the top, the 1-'?' strings ?bcd, a?cd, ab?d, abc? in the middle, and the 0-'?' string abcd at the bottom]
- Many intersections are empty; others produce the same pattern, and we need to count everything only once.

General Formulas
- Similar reasoning yields closed formulas for r replacements and for d deletions.
- Other combinations are difficult: multiple insertions, and combinations of insertions with replacements.
- But we can generate the corresponding lattice algorithmically: expensive, but possible. A sketch of the replacements-only case follows.
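The following sketch shows the replacements-only expansion end to end: generating the base strings, merging wild-card patterns (the intersection lattice), and combining counts by inclusion-exclusion. In OptEQ the counts come from the precomputed extended n-gram table; here `freq` is a stand-in callable and the dataset is a toy.

```python
from itertools import combinations

# Sketch of the replacements-only expansion for a query q and distance
# delta: the intersection of two base strings is again a wild-card
# pattern, so every inclusion-exclusion term is one table lookup.

def base_strings(q, delta):
    return ["".join("?" if i in pos else c for i, c in enumerate(q))
            for pos in combinations(range(len(q)), delta)]

def intersect(p1, p2):
    """Merge two wild-card patterns; None means an empty intersection."""
    out = []
    for a, b in zip(p1, p2):
        if a == "?":
            out.append(b)
        elif b == "?" or a == b:
            out.append(a)
        else:
            return None                      # conflicting literals
    return "".join(out)

def answer_size(q, delta, freq):
    bases = base_strings(q, delta)           # 6 patterns for |q|=4, delta=2
    total = 0
    for n in range(1, len(bases) + 1):       # exact inclusion-exclusion
        for combo in combinations(bases, n):
            p = combo[0]
            for other in combo[1:]:
                p = p and intersect(p, other)
            if p is not None:
                total += (-1) ** (n - 1) * freq(p)
    return total

# Toy usage: count strings within 2 replacements of "abcd". In OptEQ,
# freq(p) would be a lookup in the extended n-gram table instead.
data = ["abcd", "xbcd", "abxd", "abcx", "xxcd", "wxyz"]
match = lambda s, p: all(b == "?" or a == b for a, b in zip(s, p))
freq = lambda p: sum(len(s) == len(p) and match(s, p) for s in data)
print(answer_size("abcd", 2, freq))          # 5: every string but "wxyz"
```

The exponential blow-up the slides mention is visible here: 2^6 − 1 = 63 terms even for this tiny query, which is what the lattice structure and the closed formulas are meant to tame.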
BasicEQ
- Partition the strings by length: for a query q of length l, possible matching strings have lengths in [l − δ, l + δ].
- For k = l − δ to l + δ:
  - Find all combinations of i insertions, d deletions, and r replacements with i + d + r = δ and l + i − d = k.
  - If (i, d, r) is a special case, use the closed formula.
  - Else generate the lattice incrementally: start from the query base strings (easy to generate), begin with the 2-intersections, and build up from there.

OptEQ
- The details are cumbersome (left as homework).
- Various optimizations are possible to reduce the complexity.

Cons
- Fairly complicated implementation.
- Expensive.
- Works for small edit distances only.

Hashed Sampling (HYK+08)
- Used to estimate the selectivity of TF/IDF, BM25, and DICE scores (vector space model).
- Main idea: take a sample of the inverted index, but do it intelligently to improve variance.
  [figure: the example inverted lists, first with an independent per-list sample, then with a coordinated (hashed) sample]

Construction
- Draw the samples deterministically: use a hash function h: N → [0, 100] and keep the ids that hash to values smaller than σ.
- Invariant: if a given id is sampled in one list, it is sampled in every other list that contains it. Consequently:
  - S(q, s) can be computed directly from the sample.
  - No need to store complete sets in the sample.
  - No extra I/O is needed to compute scores.

Selectivity Estimation
- The union of arbitrary list samples is a σ% sample.
- Given a query q = {t1, …, tm}: A ≈ |Aσ| · |t1 ∪ … ∪ tm| / |tσ1 ∪ … ∪ tσm|, where Aσ is the query answer computed on the sample and the fraction is the scale-up factor.
- But these unions contain duplicates! We need to know:
  - The distinct number of ids in t1 ∪ … ∪ tm.
  - The distinct number of ids in tσ1 ∪ … ∪ tσm.

Count Distinct
- The distinct count of tσ1 ∪ … ∪ tσm is easy: scan the sampled lists.
- The distinct count of t1 ∪ … ∪ tm is hard: scanning the full lists is the same as computing the exact answer naively.
- We are lucky: each list sample doubles as a k-minimum-value estimator by construction, so we can use the list samples themselves to estimate the distinct count of t1 ∪ … ∪ tm.

The k-Minimum Value Synopsis
- Used to estimate the distinct size of arbitrary set unions (serving the same purpose as an FM sketch):
  - Take a hash function h: N → [0, 100] and hash each element of the set.
  - The r-th smallest hash value hr gives an unbiased estimator of the count distinct: intuitively, r distinct values fall in [0, hr], so the set has about r · 100 / hr distinct elements.
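Below is a sketch of the hashed-sample construction and the distinct-count estimate it enables. The md5-based hash, the value of σ, and the names are stand-ins; the estimator fixes the threshold σ rather than the rank r, which is the same bottom-k idea the slide describes.

```python
import hashlib

# Sketch of hashed sampling: ids hash to [0, 100); every list keeps the
# ids below sigma, so an id sampled in one list is sampled in every
# list that contains it (the invariant above).

def h(sid):
    return int(hashlib.md5(str(sid).encode()).hexdigest(), 16) % 10**6 / 10**4

def sample_list(inverted_list, sigma):
    return [sid for sid in inverted_list if h(sid) < sigma]

def kmv_distinct(sampled_union, sigma):
    """If r distinct ids hash below sigma, the full union has roughly
    r * 100 / sigma distinct ids."""
    return len(set(sampled_union)) * 100 / sigma

# Usage: sample two overlapping lists, then estimate the distinct size
# of their union; the invariant lets set() deduplicate correctly.
t1, t2 = list(range(0, 5000)), list(range(2500, 7500))   # true union: 7500
sigma = 10.0
s1, s2 = sample_list(t1, sigma), sample_list(t2, sigma)
print(round(kmv_distinct(s1 + s2, sigma)))               # close to 7500
```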
Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

Future Directions
- Result ranking: in practice we need to run multiple types of searches and identify the "best" results.
- Diversity of query results: some queries have multiple meanings, e.g., "Jaguar".
- Updates: incremental index maintenance.

References
[AGK06] A. Arasu, V. Ganti, R. Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006.
[BJL+09] A. Behm, S. Ji, C. Li, J. Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009.
[HCK+08] M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava: Fast Indexes and Algorithms for Set Similarity Selection Queries. ICDE 2008.
[HYK+08] M. Hadjieleftheriou, X. Yu, N. Koudas, D. Srivastava: Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB 2008.
[JL05] L. Jin, C. Li: Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. VLDB 2005.
[KSS06] N. Koudas, S. Sarawagi, D. Srivastava: Record Linkage: Similarity Measures and Algorithms. SIGMOD 2006.
[LLL08] C. Li, J. Lu, Y. Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008.
[LNS07] H. Lee, R. T. Ng, K. Shim: Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. VLDB 2007.
[LWY07] C. Li, B. Wang, X. Yang: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007.
[MBK+07] A. Mazeika, M. H. Böhlen, N. Koudas, D. Srivastava: Estimating the Selectivity of Approximate String Queries. ACM TODS 2007.
[XWL08] C. Xiao, W. Wang, X. Lin: Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. PVLDB 2008.
[XWL+08] C. Xiao, W. Wang, X. Lin, J. X. Yu: Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
[YWL08] X. Yang, B. Wang, C. Li: Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. SIGMOD 2008.
[JLV08] L. Jin, C. Li, R. Vernica: SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. VLDB Journal 2008.
[CGK06] S. Chaudhuri, V. Ganti, R. Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
[CCGX08] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin: An Efficient Filter for Approximate Membership Checking. SIGMOD 2008.
[SK04] S. Sarawagi, A. Kirpal: Efficient Set Joins on Similarity Predicates. SIGMOD 2004.
[BK02] J. Barbay, C. Kenyon: Adaptive Intersection and t-Threshold Problems. SODA 2002.
[CGG+05] S. Chaudhuri, K. Ganjam, V. Ganti, R. Kapoor, V. R. Narasayya, T. Vassilakis: Data Cleaning in Microsoft SQL Server 2005. SIGMOD 2005.