Efficient Approximate Search on String Collections
Part II
Marios Hadjieleftheriou
Chen Li
Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
N-Gram Signatures
- Use string signatures that upper bound similarity
- Use signatures as a filtering step
- Properties:
  - The signature has to have small size
  - Signature verification must be fast
  - False positives/false negatives
  - Signatures have to be “indexable”
Known Signatures
- Minhash
  - Hamming, Jaccard, Edit distance
- Prefix filter (CGK06)
  - Jaccard, Edit distance
- LSH (GIM99)
  - Jaccard, Edit distance
- PartEnum (AGK06)
  - Jaccard, Edit distance
- Mismatch filter (XWL08)
  - Edit distance
Prefix Filter
- Bit vectors: (figure: q and s as bit vectors over a universe of 14 n-grams)
- Mismatch vector:
  - s: matches 6, missing 2, extra 2
- If |s ∩ q| ≥ 6, then every s’ ⊆ s with |s’| = 3 satisfies |s’ ∩ q| ≥ 1
- In general, to catch at least k matches, |s’| = l − k + 1 (where l = |s|)
Using Prefixes
- Take a random permutation of the n-gram universe: (figure: the grams of q and s reordered under the permutation)
- Take prefixes from both sets:
  - |s’| = |q’| = 3; if |s ∩ q| ≥ 6 then s’ ∩ q’ ≠ ∅ (see the sketch below)
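A minimal sketch of the prefix filter just described, assuming unweighted gram sets and a random permutation of the universe as the global ordering (all names are illustrative, not from the papers):

import random

def prefix(grams, order, k):
    """Keep the len(grams) - k + 1 grams that come first in the global
    random order; any set sharing >= k grams with this one must
    overlap that prefix."""
    ranked = sorted(grams, key=order.__getitem__)
    return set(ranked[:len(ranked) - k + 1])

# Toy universe of 14 gram ids, as in the figure above.
random.seed(0)
universe = list(range(1, 15))
order = {g: r for r, g in enumerate(random.sample(universe, len(universe)))}

q = {1, 2, 3, 4, 5, 6, 7, 8}
s = {3, 4, 5, 6, 7, 8, 9, 10}   # shares 6 grams with q

k = 6
q_pref, s_pref = prefix(q, order, k), prefix(s, order, k)   # sizes 8 - 6 + 1 = 3
# Since the sets share at least k grams, the two prefixes are guaranteed
# to intersect; an empty intersection would let us prune s without
# any verification work.
print(q_pref & s_pref)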
Prefix Filter for Weighted Sets
- Order n-grams by weight (a new coordinate space): (figure: grams t1 … t14 under weights w1 ≥ w2 ≥ … ≥ w14, with q and s projected onto this order)
- Query: w(q ∩ s) = Σ_{i ∈ q∩s} wi ≥ τ
- Keep a prefix s’ s.t. w(s’) ≥ w(s) − α
- Best case: w((q \ q’) ∩ (s \ s’)) = α
- Hence, we need w(q’ ∩ s’) ≥ τ − α (a prefix-construction sketch follows)
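A sketch of the weighted prefix construction under these definitions, with hypothetical gram weights; the heaviest grams are kept until the suffix left behind weighs at most α:

def weighted_prefix(grams, w, alpha):
    """Keep grams in decreasing weight order until the suffix left
    behind weighs at most alpha, so that w(prefix) >= w(s) - alpha."""
    ranked = sorted(grams, key=lambda g: -w[g])
    prefix, remaining = set(), sum(w[g] for g in grams)
    for g in ranked:
        if remaining <= alpha:
            break
        prefix.add(g)
        remaining -= w[g]
    return prefix

# Hypothetical IDF-style weights for five grams.
w = {'at&': 5.0, 't&t': 4.0, ' la': 3.0, 'lab': 2.0, 'abs': 1.0}
s_pref = weighted_prefix(set(w), w, alpha=3.0)
print(s_pref)   # the three heaviest grams; the dropped suffix weighs 3.0
# A candidate pair must then satisfy w(q' intersect s') >= tau - alpha.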
Prefix Filter Properties
- The larger we make α, the smaller the prefix
- The larger we make α, the smaller the range of thresholds we can support:
  - Because τ ≥ α; otherwise τ − α is negative
  - We need to pre-specify a minimum τ
- Can be applied to Jaccard, Edit Distance, IDF
Other Signatures
- Minhash (still to come)
- PartEnum:
  - Upper bounds Hamming distance
  - Selects multiple subsets instead of one prefix
  - Larger signature, but stronger guarantees
- LSH:
  - Probabilistic with guarantees
  - Based on hashing
- Mismatch filter:
  - Uses positional mismatching n-grams within the prefix to attain a lower bound on Edit Distance
Signature Indexing
- Straightforward solution:
  - Create an inverted index on signature n-grams
  - Merge inverted lists to compute signature intersections
- For a given string q:
  - Access only the lists in q’
  - Find strings s with w(q’ ∩ s’) ≥ τ − α (see the sketch below)
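A small sketch of this index, assuming precomputed prefix signatures and gram weights (hypothetical names throughout):

from collections import defaultdict

def build_signature_index(signatures):
    """Inverted index over signature grams: gram -> string ids."""
    index = defaultdict(set)
    for sid, sig in signatures.items():
        for g in sig:
            index[g].add(sid)
    return index

def candidates(index, q_prefix, w, tau, alpha):
    """Merge only the lists of q's prefix grams; keep strings whose
    matched signature weight reaches tau - alpha."""
    acc = defaultdict(float)
    for g in q_prefix:
        for sid in index.get(g, ()):
            acc[sid] += w[g]
    return {sid for sid, total in acc.items() if total >= tau - alpha}

w = {'at&': 5.0, 't&t': 4.0, ' la': 3.0}
index = build_signature_index({1: {'at&', ' la'}, 2: {'t&t', 'at&'}})
print(candidates(index, {'at&', ' la'}, w, tau=9.0, alpha=2.0))   # {1}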
The Inverted Signature Hashtable (CCGX08)
- Maintain a signature vector for every n-gram
- Consider prefix signatures for simplicity:
  - s’1 = {‘tt ’, ‘t L’}, s’2 = {‘t&t’, ‘t L’}, s’3 = …
  - Co-occurrence lists: ‘t L’: ‘tt ’ → ‘t&t’ → …; ‘&tt’: ‘t L’ → …
- Hash all n-grams (h: n-gram → [0, m])
- Convert the co-occurrence lists to bit-vectors of size m
Example
- Hash values: ‘lab’ → 5, ‘at&’ → 4, ‘t&t’ → 5, ‘t L’ → 1, ‘ la’ → 0, …
- Signatures: s’1 = {‘at&’, ‘ la’}, s’2 = {‘t&t’, ‘at&’}, s’3 = {‘t L’, ‘at&’}, s’4 = {‘abo’, ‘t&t’}, s’5 = {‘t&t’, ‘ la’}, …
- Hashtable: ‘at&’ → 100011, ‘t&t’ → 010101, …
Using the Hashtable?
- Let list ‘at&’ correspond to bit-vector 100011:
  - There exists a string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5
- Given query q:
  - Construct the query signature matrix (figure: one row per gram of q’, e.g. ‘at&’, ‘res’, …; one column per gram of q, e.g. ‘lab’, ‘t&t’, ‘at&’; an entry is 1 iff the row gram’s bit-vector has the bit of the column gram’s hash value set)
- Consider only solid sub-matrices P: r ⊆ q’, p ⊆ q
- We need to look only at r ⊆ q’ such that w(r) ≥ τ − α and w(p) ≥ τ
Verification
- How do we find which strings correspond to a given sub-matrix?
  - Create an inverted index on string n-grams
  - Examine only the lists in r, and only strings with w(s) ≥ τ
    - Remember that r ⊆ q’
- Can be used with other signatures as well
Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
Length Normalized Measures
- What is normalization?
  - Normalize similarity scores by the lengths of the strings
  - Can result in more meaningful matches
- Can use L0 (i.e., the length of the string), L1, L2, etc.
- For example, L2:
  - Let w2(s) = Σ_{t∈s} w(t)²
  - Weights can be IDF, unary, language model based, etc.
  - ||s||2 = w2(s)^{1/2}
The L2-Length Filter (HCK+08)
- Why L2?
  - For almost exact matches
  - Two strings match only if:
    - They have very similar n-gram sets, and hence similar L2 lengths
    - The “extra” n-grams have truly insignificant weight in aggregate (hence, again, similar L2 lengths)
Example
- “AT&T Labs – Research” → L2 = 100
- “ATT Labs – Research” → L2 = 95
- “AT&T Labs” → L2 = 70
- What if “Research” happened to be very popular and had small weight?
- “The Dark Knight” → L2 = 75
- “Dark Night” → L2 = 72
Why L2 (continued)
- Tight L2-based length filtering results in very efficient pruning
- L2 yields scores bounded within [0, 1]:
  - 1 means a truly perfect match
  - Easier to interpret scores
- L0 and L1 do not have the same properties:
  - Scores are bounded only by the largest string length in the database
  - For L0, an exact match can have a smaller score than a non-exact match!
Example
- q = {‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} → L0 = 5
- s1 = {‘ATT’} → L0 = 1
- s2 = q → L0 = 5
- S(q, s1) = Σ w(q ∩ s1) / (||q||0 ||s1||0) = 10/5 = 2
- S(q, s2) = Σ w(q ∩ s2) / (||q||0 ||s2||0) = 40/25 < 2
Problems
- L2 normalization poses challenges. For example:
  - S(q, s) = w2(q ∩ s) / (||q||2 ||s||2)
- The prefix filter cannot be applied:
  - Minimum prefix weight α?
  - Its value depends both on ||s||2 and ||q||2
  - But ||q||2 is unknown at index construction time
Important L2 Properties
- Length filtering:
  - For S(q, s) ≥ τ:
    τ ||q||2 ≤ ||s||2 ≤ ||q||2 / τ
  - We only need to look for strings within these lengths (a sketch follows)
  - Proof in the paper
- Monotonicity …
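A tiny sketch of the length filter under these definitions; gram weights are passed in as a plain dict, and all names are illustrative:

import math

def l2_length(grams, w):
    """||s||_2 = sqrt(sum of squared gram weights)."""
    return math.sqrt(sum(w[g] ** 2 for g in grams))

def passes_length_filter(q_len, s_len, tau):
    """For S(q, s) >= tau, ||s||_2 must lie in [tau * ||q||_2, ||q||_2 / tau]."""
    return tau * q_len <= s_len <= q_len / tau

w = {'ATT': 5.0, 'TT ': 3.0, 'T L': 2.0}
lq = l2_length(set(w), w)
print(passes_length_filter(lq, 0.9 * lq, tau=0.8))   # True
print(passes_length_filter(lq, 0.5 * lq, tau=0.8))   # False: s is too short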
Monotonicity
- Let s = {t1, t2, …, tm}
- Let pw(s, t) = w(t) / ||s||2 (the partial weight of s for t)
- Then: S(q, s) = Σ_{t ∈ q∩s} w(t)² / (||q||2 ||s||2) = Σ_{t ∈ q∩s} pw(s, t) pw(q, t)
- If pw(s, t) > pw(r, t):
  - w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2
- Hence, for any t’ ≠ t:
  - w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)
Indexing
- Use inverted lists sorted by pw(): (figure: strings 0 = “rich”, 1 = “stick”, 2 = “stich”, 3 = “stuck”, 4 = “static”, and one inverted list per 2-gram: ‘at’, ‘ch’, ‘ck’, ‘ic’, ‘ri’, ‘st’, ‘ta’, ‘ti’, ‘tu’, ‘uc’)
- pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) ⇒ ||0||2 < ||4||2 < ||1||2 < ||2||2 (an index-construction sketch follows)
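A sketch of building these pw-sorted lists for the five example strings; it assumes unary gram weights, so the resulting order inside a list may differ from the IDF-weighted figure above (helper names ngrams and l2_length are mine):

import math
from collections import defaultdict

def ngrams(s, n=2):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def l2_length(grams, w):
    return math.sqrt(sum(w[g] ** 2 for g in grams))

strings = {0: 'rich', 1: 'stick', 2: 'stich', 3: 'stuck', 4: 'static'}
w = defaultdict(lambda: 1.0)        # unary weights for the sketch

gram_sets = {sid: ngrams(s) for sid, s in strings.items()}
lengths = {sid: l2_length(g, w) for sid, g in gram_sets.items()}

index = defaultdict(list)
for sid, grams in gram_sets.items():
    for g in grams:
        index[g].append(sid)
for lst in index.values():
    # Decreasing pw(s, t) = w(t) / ||s||_2 is the same order as
    # increasing ||s||_2, which makes the length filter a contiguous
    # range of each list.
    lst.sort(key=lengths.__getitem__)

print(index['ic'])   # ids in increasing L2 length (ties broken arbitrarily)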
L2 Length Filter
- Given q and τ, and using length filtering: (figure: the same inverted lists, where only the segment of each list whose strings have L2 length within [τ ||q||2, ||q||2 / τ] is scanned)
- We examine only a small fraction of the lists
Monotonicity
- If I have seen string 1 in a list already, then string 4 cannot appear later in that list: (figure: lists are sorted by decreasing pw, i.e., by increasing L2 length, and ||4||2 < ||1||2, so 4 would have had to appear before 1)
Other Improvements
- Use properties of the weighting scheme:
  - Scan high weight lists first
  - Prune according to string length and maximum potential score
  - Ignore low weight lists altogether
Conclusion
- The concepts extend easily to:
  - BM25
  - Weighted Jaccard
  - DICE
  - IDF
- Take-away message:
  - Properties of the similarity/distance function can play a big role in designing very fast indexes
  - L2 is super fast for almost exact matches
Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
The Problem
- Estimate the number of strings with:
  - Edit distance smaller than k from query q
  - Cosine similarity higher than τ to query q
  - Jaccard, Hamming, etc.
- Issues:
  - Estimation accuracy
  - Size of the estimator
  - Cost of estimation
Motivation
- Query optimization:
  - Selectivity of query predicates
  - Need to support the selectivity of approximate string predicates
- Visualization/querying:
  - The expected result set size helps with visualization
  - The result set size is important for remote query processing
Flavors
- Edit distance:
  - Based on clustering (JL05)
  - Based on minhash (MBK+07)
  - Based on wild-card n-grams (LNS07)
- Cosine similarity:
  - Based on sampling (HYK+08)
Selectivity Estimation for Edit Distance
- Problem:
  - Given a query string q
  - Estimate the number of strings s ∈ D
  - Such that ed(q, s) ≤ δ
Sepia - Clustering (JL05, JLV08)
- Partition the strings using clustering:
  - Enables pruning of whole clusters
- Store per-cluster histograms:
  - The number of strings within edit distance 0, 1, …, δ of the cluster center
- Compute global dataset statistics:
  - Use a training query set to compute the frequency of strings within edit distance 0, 1, …, δ of each query
Edit Vectors
- Edit distance alone is not discriminative:
  - Use edit vectors instead: the number of insertions, deletions, and substitutions of an optimal edit script
  (figure: strings “Lukas”, “Lucia”, “Luciano” around the cluster center “Lucas”, with edit vectors <1,1,1> (ed 3), <2,0,0> (ed 2), <1,1,0> (ed 2))
- A 3D space vs a 1D space
Visually
(figure: clusters C1, …, Cn with centers p1, …, pn; each cluster Ci keeps a frequency table Fi keyed by the edit vector v(pi, s) of its members; a single global table is keyed by <v(q, pi), v(pi, s), ed(q, s)>)

F1:               F2:               Fn:
Edit Vector   #   Edit Vector   #   Edit Vector   #
<0, 0, 0>     4   <0, 0, 0>     3   <0, 0, 0>     2
<0, 0, 1>    12   <0, 1, 0>    40   <1, 0, 2>    84
<1, 0, 2>     7   <1, 0, 1>     6   <1, 1, 1>     1
…                 …                 …

Global Table:
v(q, pi)   v(pi, s)   ed(q, s)   #    %
<1, 0, 1>  <0, 0, 1>  1          1    14
<1, 0, 1>  <0, 0, 1>  2          4    57
<1, 0, 1>  <0, 0, 1>  3          7    100
…
<1, 1, 0>  <1, 0, 2>  3          21   25
<1, 1, 0>  <1, 0, 2>  4          63   75
<1, 1, 0>  <1, 0, 2>  5          84   100
…
Selectivity Estimation
- Use the triangle inequality:
  - Compute the edit vector v(q, pi) for every cluster i
  - If |v(q, pi)| > ri + δ, disregard cluster Ci
  (figure: query q at distance |v(q, pi)| from center pi of a cluster with radius ri)
Selectivity Estimation
- Use the triangle inequality:
  - Compute the edit vector v(q, pi) for every cluster i
  - If |v(q, pi)| > ri + δ, disregard cluster Ci
- For all entries in the frequency table:
  - If |v(q, pi)| + |v(pi, s)| ≤ δ, then ed(q, s) ≤ δ for all these strings s
  - If ||v(q, pi)| − |v(pi, s)|| > δ, ignore these strings
  - Else use the global table:
    - Look up entry <v(q, pi), v(pi, s), δ> in the global table
    - Use the estimated fraction of strings
Example
- δ = 3, v(q, p1) = <1, 1, 0>, v(p1, s) = <1, 0, 2>
- Global lookup: [<1, 1, 0>, <1, 0, 2>, 3] → the fraction is 25%, i.e., 25% × 7 = 1.75 strings
- Iterate through F1 and add up the contributions (see the sketch below)

F1:
Edit Vector   #
<0, 0, 0>     4
<0, 0, 1>    12
<1, 0, 2>     7

Global Table:
v(q, pi)   v(pi, s)   ed(q, s)   #    %
<1, 0, 1>  <0, 0, 1>  1          1    14
<1, 0, 1>  <0, 0, 1>  2          4    57
<1, 0, 1>  <0, 0, 1>  3          7    100
…
<1, 1, 0>  <1, 0, 2>  3          21   25
<1, 1, 0>  <1, 0, 2>  4          63   75
<1, 1, 0>  <1, 0, 2>  5          84   100
…
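A sketch of the whole estimation loop under the definitions above; the data structures (clusters as (center, radius, frequency table) triples, a global table of fractions, and an edit_vector function) are illustrative stand-ins for Sepia's, not the paper's implementation:

def sepia_estimate(q, clusters, global_table, delta, edit_vector):
    """clusters: (center, radius, freq_table) triples, where freq_table
    maps an edit vector v(pi, s) to a count of strings; global_table
    maps (v(q, pi), v(pi, s), delta) to the fraction of such strings
    expected within edit distance delta; |v| is the sum of its entries."""
    total = 0.0
    for center, radius, freq_table in clusters:
        vqp = edit_vector(q, center)
        if sum(vqp) > radius + delta:            # triangle inequality: prune cluster
            continue
        for vps, count in freq_table.items():
            if sum(vqp) + sum(vps) <= delta:     # all of these strings qualify
                total += count
            elif abs(sum(vqp) - sum(vps)) > delta:
                continue                          # none of them can qualify
            else:                                 # the gray zone: use the global table
                frac = global_table.get((vqp, vps, delta), 0.0)
                total += frac * count             # e.g. 25% x 7 = 1.75 above
    return total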
Cons
- Hard to maintain if the clusters start drifting
- Hard to find a good number of clusters:
  - Space/time tradeoffs
- Needs training to construct a good dataset statistics table
VSol – minhash (MBK+07)
- A solution based on minhash
- minhash is used for:
  - Estimating the size of a set |s|
  - Estimating the resemblance of two sets
    - I.e., estimating J = |s1 ∩ s2| / |s1 ∪ s2|
  - Estimating the size of the union |s1 ∪ s2|
  - Hence, estimating the size of the intersection:
    - |s1 ∩ s2| ≈ J(s1, s2) × |s1 ∪ s2|, using the two estimates above
Minhash
- Given a set s = {t1, …, tm}
- Use k independent hash functions h1, …, hk:
  - hi: n-gram → [0, 1]
  - Hash the elements of s, k times
  - Keep, for each hash function, the element that hashed to the smallest value
- We have reduced set s from m to k elements
- Denote the minhash signature by s’
How to use minhash
- Given two signatures q’, s’:
  - J(q, s) ≈ Σ_{1≤i≤k} I{q’[i] = s’[i]} / k
  - |s| ≈ (k / Σ_{1≤i≤k} s’[i]) − 1
  - (q ∪ s)’ = q’ ∪ s’, where (q ∪ s)’[i] = min(q’[i], s’[i])
- Hence:
  - |q ∪ s| ≈ (k / Σ_{1≤i≤k} (q ∪ s)’[i]) − 1
(see the sketch below)
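A runnable sketch of these three estimators, simulating the k independent hash functions with salted MD5 (all names illustrative):

import hashlib

def h(i, x):
    """The i-th hash function, mapping an n-gram into [0, 1)."""
    d = hashlib.md5(f"{i}:{x}".encode()).digest()
    return int.from_bytes(d[:8], 'big') / 2**64

def minhash(s, k):
    """Signature: per hash function, the minimum value over the set."""
    return [min(h(i, x) for x in s) for i in range(k)]

def jaccard_est(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def size_est(sig):
    return len(sig) / sum(sig) - 1

def union_sig(a, b):
    """The signature of the union is the element-wise minimum."""
    return [min(x, y) for x, y in zip(a, b)]

q = {'at', 'tt', 't&', '&t', ' l'}
s = {'at', 'tt', ' l', 'ab'}
k = 512
qs, ss = minhash(q, k), minhash(s, k)
u = union_sig(qs, ss)
print(round(size_est(u)), round(jaccard_est(qs, ss) * size_est(u), 1))
# roughly 6 (the union size) and 3.0 (the intersection size)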
VSol Estimator
- Construct one inverted list per n-gram in D:
  - The lists are our sets
- Compute a minhash signature for each list
(figure: inverted lists t1 = {1, 5, …}, t2 = {3, 5, …}, …, t10 = {1, 8, …}, each reduced to a small minhash signature such as {14, 25, 43})
Selectivity Estimation
- Use the edit distance length filter:
  - If ed(q, s) ≤ δ, then q and s share at least L = |s| − 1 − n(δ − 1) n-grams
- Given query q = {t1, …, tm}:
  - The answer is the size of the union of all non-empty L-intersections of the m lists (there are m choose L of them)
  - We can estimate the sizes of the L-intersections using the minhash signatures
Example
- δ = 2, n = 3 ⇒ L = 6
(figure: query q with inverted lists t1 = {1, 5, …}, t2 = {3, 5, …}, …, t10 = {1, 8, …})
- Look at all 6-intersections of the inverted lists
- A = |∪_{i1, …, i6 ∈ [1, 10]} (ti1 ∩ ti2 ∩ … ∩ ti6)|
- There are (10 choose 6) such terms
The m-L Similarity
- Can be computed efficiently using the minhashes
- Answer:
  - ρ = Σ_{1≤j≤k} I{∃ i1, …, iL: t’i1[j] = … = t’iL[j]} / k
  - A ≈ ρ × |t1 ∪ … ∪ tm|
- The proof is very similar to the proof for minhashes (see the sketch below)
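A sketch of this estimator, assuming the list signatures were built as in the minhash sketch above and that an estimate of |t1 ∪ … ∪ tm| is supplied by the caller (names illustrative):

from collections import Counter

def mL_estimate(list_signatures, L, union_size):
    """rho: the fraction of signature coordinates j at which at least L
    of the m list signatures agree on the same minhash value; scaled by
    an estimate of the union size it approximates the answer size A."""
    k = len(list_signatures[0])
    hits = 0
    for j in range(k):
        counts = Counter(sig[j] for sig in list_signatures)
        if max(counts.values()) >= L:
            hits += 1
    return hits / k * union_size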
Cons
- Will overestimate the result:
  - Many L-intersections share strings
  - The edit distance length filter is loose
OptEQ – wild-card n-grams (LNS07)
- Use extended n-grams:
  - Introduce the wild-card symbol ‘?’
  - E.g., “ab?” can be:
    - “aba”, “abb”, “abc”, …
- Build an extended n-gram table:
  - Extract all 1-grams, 2-grams, …, n-grams
  - Generalize to extended 2-grams, …, n-grams
  - Maintain an extended n-gram/frequency hashtable
Example
Dataset strings: abc, def, ghi, …

n-gram table:
n-gram   Frequency
ab       10
bc       15
de        4
ef        1
gh       21
hi        2
…
?b       13
a?       17
?c       23
…
abc       5
def       2
…
Query Expansion (Replacements only)
- Given query q = “abcd”, δ = 2, and replacements only:
- Base strings:
  - “??cd”, “?b?d”, “?bc?”, “a??d”, “a?c?”, “ab??”
- Query answer:
  - S1 = {s ∈ D: s matches “??cd”}, S2 = …
  - A = |S1 ∪ S2 ∪ S3 ∪ S4 ∪ S5 ∪ S6| = Σ_{1≤n≤6} (−1)^{n−1} Σ |Si1 ∩ … ∩ Sin| (inclusion-exclusion over all n-subsets)
Replacement Intersection Lattice
- A = Σ_{1≤n≤6} (−1)^{n−1} Σ |Si1 ∩ … ∩ Sin|
- We need to evaluate the sizes of all 2-intersections, 3-intersections, …, 6-intersections
- Then use the n-gram table to compute the sum A
- An exponential number of intersections
- But … there is a well-defined structure
Replacement Lattice
- Build the replacement lattice:
  2 ‘?’: ??cd  ?b?d  ?bc?  a??d  a?c?  ab??
  1 ‘?’: ?bcd  a?cd  ab?d  abc?
  0 ‘?’: abcd
- Many intersections are empty
- Others produce the same results:
  - We need to count everything only once (see the sketch below)
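A small sketch of the two building blocks behind this lattice, generating the base strings and intersecting wild-card patterns (replacements only; names illustrative):

from itertools import combinations

def base_strings(q, r):
    """All ways of blanking out r positions of q with the wild-card '?'."""
    out = []
    for pos in combinations(range(len(q)), r):
        chars = list(q)
        for p in pos:
            chars[p] = '?'
        out.append(''.join(chars))
    return out

def intersect(a, b):
    """Intersection of two equal-length wild-card patterns, or None if
    they conflict; repeated intersections generate the lattice levels."""
    merged = []
    for x, y in zip(a, b):
        if x == '?':
            merged.append(y)
        elif y == '?' or x == y:
            merged.append(x)
        else:
            return None
    return ''.join(merged)

print(base_strings('abcd', 2))     # ['??cd', '?b?d', '?bc?', 'a??d', 'a?c?', 'ab??']
print(intersect('??cd', '?b?d'))   # '?bcd': one level down the lattice
print(intersect('?bc?', 'a??d'))   # 'abcd': the 0-'?' level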
General Formulas
- Similar reasoning applies for:
  - r replacements
  - d deletions
- Other combinations are difficult:
  - Multiple insertions
  - Combinations of insertions/replacements
- But … we can generate the corresponding lattice algorithmically!
  - Expensive, but possible
BasicEQ
- Partition the strings by length:
  - Query q has length l
  - Possible matching strings have lengths in [l − δ, l + δ]
- For k = l − δ to l + δ:
  - Find all combinations of i + d + r = δ and l + i − d = k
  - If (i, d, r) is a special case, use the formula
  - Else generate the lattice incrementally:
    - Start from the query base strings (easy to generate)
    - Begin with the 2-intersections and build from there
OptEQ
- The details are cumbersome:
  - Left for homework
- Various optimizations are possible to reduce the complexity
Cons
- Fairly complicated implementation
- Expensive
- Works for small edit distances only
Hashed Sampling (HYK+08)
- Used to estimate the selectivity of TF/IDF, BM25, DICE (vector space model)
- Main idea:
  - Take a sample of the inverted index
  - But do it intelligently, to improve variance
Example
- Take a sample of the inverted index
(figure: the inverted lists from the earlier indexing example, with an independent uniform sample kept from each list)
Example (Cont.)
- But do it intelligently, to improve variance
(figure: the same lists sampled through one hash function on the string ids, so every id is either kept in all lists that contain it or dropped from all of them)
Construction
- Draw samples deterministically:
  - Use a hash function h: N → [0, 100]
  - Keep the ids that hash to values smaller than σ
- Invariant:
  - If a given id is sampled in one list, it will always be sampled in all other lists that contain it:
    - S(q, s) can be computed directly from the sample
    - No need to store complete sets in the sample
    - No need for extra I/O to compute scores
(see the sketch below)
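A minimal sketch of this construction; MD5 stands in for the hash function h, and all names are illustrative:

import hashlib

def bucket(sid, m=100):
    """Deterministic hash of a string id into [0, m)."""
    d = hashlib.md5(str(sid).encode()).digest()
    return int.from_bytes(d[:4], 'big') % m

def sample_index(index, sigma):
    """Keep an id iff it hashes below sigma; an id that survives in one
    list therefore survives in every list that contains it, so scores
    can be computed from the sample alone."""
    return {g: [sid for sid in lst if bucket(sid) < sigma]
            for g, lst in index.items()}

index = {'at': [4], 'ic': [0, 1, 2, 4], 'st': [1, 2, 3, 4]}
print(sample_index(index, sigma=50))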
Selectivity Estimation
- The union of arbitrary list samples is a σ% sample
- Given query q = {t1, …, tm}:
  - A ≈ |Aσ| × |t1 ∪ … ∪ tm| / |tσ1 ∪ … ∪ tσm|
    - Aσ is the query answer computed from the sample
    - The fraction is the actual scale-up factor
- But there are duplicates in these unions! We need to know:
  - The distinct number of ids in t1 ∪ … ∪ tm
  - The distinct number of ids in tσ1 ∪ … ∪ tσm
Count Distinct
- Distinct |tσ1 ∪ … ∪ tσm| is easy:
  - Scan the sampled lists
- Distinct |t1 ∪ … ∪ tm| is hard:
  - Scanning the full lists is the same as computing the exact answer to the query … naively
- We are lucky:
  - Each list sample doubles as a k-minimum value estimator by construction!
  - We can use the list samples to estimate the distinct |t1 ∪ … ∪ tm|
The k-Minimum Value Synopsis
- It is used to estimate the distinct size of arbitrary set unions (serving the same purpose as an FM sketch):
  - Take a hash function h: N → [0, 100]
  - Hash each element of the set
  - The r-th smallest hash value hr is an unbiased estimator of the count distinct
  (figure: r hash values fall in [0, hr]; scaling that density to the full range [0, 100] gives ≈ r × 100 / hr distinct values; a sketch follows)
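A sketch of the estimator, rescaled to hash values in [0, 1) and using the standard unbiased KMV form (r − 1)/hr; the rescaling and the exact form are my assumptions, and the names are illustrative:

import hashlib

def hval(sid):
    """Hash an id into [0, 1); the hashed sample above is exactly the
    ids with hval below sigma / 100."""
    d = hashlib.md5(str(sid).encode()).digest()
    return int.from_bytes(d[:8], 'big') / 2**64

def kmv_distinct(ids, r):
    """If the r-th smallest hash value is h_r, the hashed values have
    density about r / h_r, so (r - 1) / h_r estimates the distinct
    count over the whole range (the standard unbiased KMV form)."""
    hs = sorted({hval(x) for x in ids})
    if len(hs) < r:
        return len(hs)            # fewer than r distinct values: exact
    return (r - 1) / hs[r - 1]

print(round(kmv_distinct(range(10000), r=200)))   # roughly 10000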
Outline
- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
Future Directions
- Result ranking:
  - In practice one needs to run multiple types of searches
  - Need to identify the “best” results
- Diversity of query results:
  - Some queries have multiple meanings
  - E.g., “Jaguar”
- Updates:
  - Incremental maintenance
References
[AGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006.
[BJL+09] Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009.
[BK02] Jérémy Barbay, Claire Kenyon: Adaptive Intersection and t-Threshold Problems. SODA 2002.
[CCGX08] Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin: An Efficient Filter for Approximate Membership Checking. SIGMOD 2008.
[CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data Cleaning in Microsoft SQL Server 2005. SIGMOD 2005.
[CGK06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
[GIM99] Aristides Gionis, Piotr Indyk, Rajeev Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999.
[HCK+08] Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava: Fast Indexes and Algorithms for Set Similarity Selection Queries. ICDE 2008.
[HYK+08] Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava: Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB 2008.
[JL05] Liang Jin, Chen Li: Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. VLDB 2005.
[JLV08] Liang Jin, Chen Li, Rares Vernica: SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. VLDB Journal 2008.
[KSS06] Nick Koudas, Sunita Sarawagi, Divesh Srivastava: Record Linkage: Similarity Measures and Algorithms. SIGMOD 2006.
[LLL08] Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008.
[LNS07] Hongrae Lee, Raymond T. Ng, Kyuseok Shim: Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. VLDB 2007.
[LWY07] Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007.
[MBK+07] Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava: Estimating the Selectivity of Approximate String Queries. ACM TODS 2007.
[SK04] Sunita Sarawagi, Alok Kirpal: Efficient Set Joins on Similarity Predicates. SIGMOD 2004.
[XWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. PVLDB 2008.
[XWL+08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
[YWL08] Xiaochun Yang, Bin Wang, Chen Li: Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. SIGMOD 2008.