PPT - Jiaheng Lu

String Similarity Measures and
Joins with Synonyms
Jiaheng Lu
Renmin University of China
Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang
Motivation Example (String Measure)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
No semantics: S1 and S2 share no tokens, so a plain string measure treats them as dissimilar.
Example (String Measure)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
How to use the existing synonyms?
Research Problem 1--(String Measurements)
Input
Two strings s and t, and a set of synonyms R
Output
The maximal Jaccard similarity between s and t that can be reached by applying synonyms from R, written Jaccard(s,t,R)
Problem 2-- (String Similarity Join)
Input
Two sets of strings S and T, a set of synonyms R, and a threshold value θ
Output
All similar pairs (s,t), with s in S and t in T, such that Jaccard(s,t,R) >= θ
An example of similarity join
Table S1
ID   String
q1   2013 ACM Intl Conf on Management of Data USA
q2   Very Large Data Bases Conf
q3   VLDB Conf
q4   ICDE 2013
Table S2
ID   String
s1   SIGMOD
s2   VLDB
Synonyms
SIGMOD -> International Conference on Management of Data
VLDB -> Very Large Data Bases
Existing works on approximate string match with synonyms
Transformation-based framework (JaccT) [1]; compared with our method.
Machine learning method [2]: a Hidden Markov Model-based measure.
Depends on training data; not efficient.
[1] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.
[2] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.
Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion
String Similarity Measures (Full-expansion)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using all synonyms
S1’=“International Conference on Management of Data NY USA SIGMOD New York United States”
S2’=“SIGMOD 2013 New York United States International Conference on Management of Data NY USA ACM's Special Interest Group on Management Of Data”
Jaccard(S1’,S2’)= 13/18 = 0.72
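As a rough illustration of full expansion, the Python sketch below appends every applicable synonym to each string and then takes the plain Jaccard coefficient of the two token sets. The helper names (`tokens`, `contains_phrase`, `full_expand`, `jaccard`), the lowercase whitespace tokenization, and the choice to apply each synonym pair in both directions are assumptions made for this example, not details taken from the slides.

```python
from fractions import Fraction

# Synonym pairs from the slide; treated as usable in both directions (assumption).
RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]

def tokens(s):
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return s.lower().split()

def contains_phrase(toks, phrase):
    """True if `phrase` occurs as a contiguous run inside `toks`."""
    n = len(phrase)
    return any(toks[i:i + n] == phrase for i in range(len(toks) - n + 1))

def full_expand(s, rules):
    """Token set of s plus the other side of every synonym pair found in s."""
    toks = tokens(s)
    expanded = set(toks)
    for lhs, rhs in rules:
        left, right = tokens(lhs), tokens(rhs)
        if contains_phrase(toks, left):
            expanded.update(right)
        if contains_phrase(toks, right):
            expanded.update(left)
    return expanded

def jaccard(a, b):
    """Exact Jaccard coefficient of two token sets."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(1)

S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"
full_sim = jaccard(full_expand(S1, RULES), full_expand(S2, RULES))
print(full_sim)  # 13/18
```

The four tokens contributed only by the "ACM's Special Interest Group" synonym enlarge the union without enlarging the intersection, which is why the score stops at 13/18.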
String Similarity Measures (Selective-expansion)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using only good synonyms
S1’=“International Conference on Management of Data NY USA SIGMOD New York United States”
S2’=“SIGMOD 2013 New York United States International Conference on Management of Data NY USA”
Jaccard(S1’,S2’)= 13/14 = 0.93
String Similarity Measures (Selective)
Computing the selective-expansion measure is NP-hard (reduction from 3-SAT).
Greedy algorithm
Choose the synonyms that increase the current similarity, by computing the similarity gain of each.
Property
The greedy result is optimal in more than 70% of cases in practice.
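A minimal sketch of a greedy selection loop is given below in Python. It is a simplified stand-in for the greedy algorithm named on the slide, not the authors' exact procedure: at each step it applies the single synonym whose addition raises the Jaccard score the most, and it stops when no synonym helps. All function names, the tokenization, and the bidirectional use of synonym pairs are assumptions for illustration.

```python
from fractions import Fraction

RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]

def tokens(s):
    return s.lower().split()

def contains_phrase(toks, phrase):
    n = len(phrase)
    return any(toks[i:i + n] == phrase for i in range(len(toks) - n + 1))

def jaccard(a, b):
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(1)

def applicable(s, rules):
    """Token sets that some synonym pair would add to string s."""
    toks = tokens(s)
    additions = []
    for lhs, rhs in rules:
        left, right = tokens(lhs), tokens(rhs)
        if contains_phrase(toks, left):
            additions.append(frozenset(right))
        if contains_phrase(toks, right):
            additions.append(frozenset(left))
    return additions

def greedy_selective_jaccard(s, t, rules):
    """Greedy approximation: keep applying the most helpful synonym."""
    a, b = set(tokens(s)), set(tokens(t))
    # Each candidate is (side, tokens to add): side 0 expands s, side 1 expands t.
    candidates = [(0, add) for add in applicable(s, rules)]
    candidates += [(1, add) for add in applicable(t, rules)]
    current = jaccard(a, b)

    def score(candidate):
        side, add = candidate
        return jaccard(a | add, b) if side == 0 else jaccard(a, b | add)

    while candidates:
        best = max(candidates, key=score)
        if score(best) <= current:
            break  # no remaining synonym increases the similarity
        side, add = best
        if side == 0:
            a |= add
        else:
            b |= add
        current = jaccard(a, b)
        candidates.remove(best)
    return current

S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"
selective_sim = greedy_selective_jaccard(S1, S2, RULES)
print(selective_sim)  # 13/14
```

On this input the loop applies every synonym except the "ACM's Special Interest Group" one, whose extra tokens would lower the score from 13/14 to 13/18.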
Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion
Similarity Joins (Filtering and Verification)
Filtering candidates
• Generate signatures with full expansion: prefix method, LSH method
Verify candidates
• Similarity measures: full expansion, selective expansion
String Similarity Joins (SN-Join)
Prefix method
Global ordering: {a b c d e f g h i j k l}
Threshold=0.8
S1=“c, k, e, a, f”
S2=“d, b, f, e, k”
Order the strings
S1’=“a, c, e, f, k”
S2’=“b, d, e, f, k”
Get signatures
Sig(s1)=“a, c”
Sig(s2)=“b, d”
No overlap, so Jacc(s1,s2)<0.8
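The prefix example can be sketched in Python as follows. The sketch assumes the standard prefix-filter length for Jaccard, |s| - ceil(threshold * |s|) + 1, which reproduces the two-token signatures shown on the slide; the function name `prefix_signature` and the small epsilon guarding against floating-point rounding are choices made for this illustration.

```python
import math

def prefix_signature(toks, order, theta):
    """Sort tokens by the global ordering and keep the filtering prefix."""
    rank = {tok: i for i, tok in enumerate(order)}
    ordered = sorted(set(toks), key=rank.__getitem__)
    n = len(ordered)
    # Standard prefix length for a Jaccard threshold theta (assumed here).
    prefix_len = n - math.ceil(theta * n - 1e-9) + 1
    return ordered[:prefix_len]

ORDER = list("abcdefghijkl")
THETA = 0.8
sig1 = prefix_signature(["c", "k", "e", "a", "f"], ORDER, THETA)
sig2 = prefix_signature(["d", "b", "f", "e", "k"], ORDER, THETA)

# Disjoint prefixes mean the pair cannot reach the threshold.
pruned = not (set(sig1) & set(sig2))
print(sig1, sig2, pruned)  # ['a', 'c'] ['b', 'd'] True
```

The true Jaccard of the two strings is 3/7, so pruning the pair at threshold 0.8 loses nothing.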
Signature selection is important
How should signatures be selected to increase their filtering power?
It is unrealistic to find a “one-size-fits-all” solution.
Estimation-based signature selection
Three steps to select signatures:
• Generate multiple signature schemes for each data set.
• Given two tables for join, quickly estimate the filtering power of each scheme.
• Select the scheme with the best filtering power.
An example on estimator
Self-join:
ID   String                                         Signatures
q1   2013 ACM Intl Conf on Management of Data USA   ACM, International, Conference, on
q2   Very Large Data Bases Conf                     Conf, Conference
q3   VLDB Conf                                      Conf, Conference
q4   ICDE 2013                                      ICDE
Inverted lists:
ACM             q1
Conf            q2, q3
Conference      q1, q2, q3
International   q1
on              q1
ICDE            q4
Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
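To make the filtering step in this example concrete, the Python sketch below builds the inverted lists from the signatures and enumerates every pair of records that shares at least one signature. It counts candidates exactly rather than estimating them, so it only shows what the estimator is trying to approximate; `candidate_pairs` and the variable names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Signatures per record, copied from the example table.
SIGNATURES = {
    "q1": ["ACM", "International", "Conference", "on"],
    "q2": ["Conf", "Conference"],
    "q3": ["Conf", "Conference"],
    "q4": ["ICDE"],
}

def candidate_pairs(signatures):
    """Return (inverted lists, set of record pairs sharing a signature)."""
    inverted = defaultdict(list)
    for record_id, sigs in signatures.items():
        for sig in sigs:
            inverted[sig].append(record_id)
    pairs = set()
    for ids in inverted.values():
        # Every pair of records on the same list is a join candidate.
        pairs.update(combinations(sorted(set(ids)), 2))
    return dict(inverted), pairs

inverted_lists, candidates = candidate_pairs(SIGNATURES)
print(sorted(candidates))  # [('q1', 'q2'), ('q1', 'q3'), ('q2', 'q3')]
```

A scheme with stronger filtering power would yield fewer candidate pairs on the same data, which is the quantity the selection step compares across schemes.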
Applying FM sketches on inverted lists
Self-join over the same strings, signatures, and inverted lists as in the estimator example above.
Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
A Flajolet-Martin (FM) sketch is used for each inverted list.
FM sketches (Flajolet and Martin JCSS 1985)
• Estimates the number of distinct items in a multi-set of values from [0,…, M-1]
Multi-set: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: 5
• Assume a hash function h(x) that maps incoming values x in [0,…, M-1] uniformly across [0,…, 2^L-1], where L = O(log M)
• Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
– A value x is mapped to lsb(h(x))
Example: x=5, h(x) = 101100, lsb(h(x)) = 2
BITMAP
Position: 5 4 3 2 1 0
Bit:      0 0 0 1 0 0
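The following Python sketch shows one way a single FM bitmap could be maintained. The SHA-256 based hash, the 32-bit width `L`, and the function names are assumptions for illustration; the correction constant 0.77351 is the one from the Flajolet and Martin analysis. A single bitmap gives a high-variance estimate, so practical uses average many of them.

```python
import hashlib

L = 32          # number of bitmap positions (assumed for this sketch)
PHI = 0.77351   # correction constant from Flajolet and Martin

def h(x):
    """Deterministic 32-bit hash of x (SHA-256 is an arbitrary choice here)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def lsb(y):
    """Position of the least-significant 1 bit of y, counting from 0."""
    if y == 0:
        return L - 1  # no 1 bit; map to the top position
    return (y & -y).bit_length() - 1

def fm_bitmap(values):
    """Set bit lsb(h(x)) for every value; duplicates set the same bit again."""
    bitmap = 0
    for x in values:
        bitmap |= 1 << lsb(h(x))
    return bitmap

def fm_estimate(bitmap):
    """Estimate the distinct count from the position of the lowest 0 bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

stream = [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]
sketch = fm_bitmap(stream)
print(lsb(0b101100))        # 2, as in the slide's example
print(fm_estimate(sketch))  # rough estimate of the 5 distinct values
```

Because repeated values always hit the same bit, the bitmap for the stream equals the bitmap for its set of distinct values, which is what makes the sketch insensitive to duplicates.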
Estimating the filtering power of a signature scheme
Constructing a two-dimensional hash sketch
Computing tighter upper and lower bounds on the candidate set size
String Similarity Filtering with Length Filter
Filtering candidates
• Generate signatures: prefix method, LSH method
• Compute lengths: length filter
Verify candidates
• Similarity measures: full expansion, selective expansion
String Similarity Joins (SI-Join)
Length filtering
Strings
S1=“a b c d e”
S2=“x y z”
Synonyms
a->f g h
b->k
x->s
Full-expansion
S1’=“a b c d e f g h k”
S2’=“x y z s”
Length range
s1: [5, 9]
s2: [3, 4]
Jacc(s1,s2,R)<0.9
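The reasoning behind the length filter can be sketched in Python as below. Each string gets a length range from its original token count to its fully expanded token count, and since the Jaccard coefficient of two sets never exceeds the ratio of the smaller size to the larger size, two non-overlapping ranges give an upper bound on the similarity. The function names and the dictionary form of the synonym rules are assumptions for this example.

```python
# Synonym rules from the slide, as left-hand token -> right-hand tokens.
RULES = {"a": "f g h", "b": "k", "x": "s"}

def length_range(s, rules):
    """(original token count, token count after full expansion)."""
    toks = set(s.split())
    expanded = set(toks)
    for lhs, rhs in rules.items():
        if lhs in toks:
            expanded.update(rhs.split())
    return len(toks), len(expanded)

def jaccard_upper_bound(range_s, range_t):
    """Largest Jaccard value compatible with the two length ranges."""
    lo_s, hi_s = range_s
    lo_t, hi_t = range_t
    if hi_t < lo_s:          # t is always shorter than s
        return hi_t / lo_s
    if hi_s < lo_t:          # s is always shorter than t
        return hi_s / lo_t
    return 1.0               # ranges overlap, so no pruning is possible

r1 = length_range("a b c d e", RULES)   # (5, 9)
r2 = length_range("x y z", RULES)       # (3, 4)
bound = jaccard_upper_bound(r1, r2)     # 4 / 5 = 0.8
can_prune = bound < 0.9
print(r1, r2, bound, can_prune)  # (5, 9) (3, 4) 0.8 True
```

With a threshold of 0.9 the pair is discarded without computing any similarity, because even the most favourable expansions give at most 4/5.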
String Similarity Joins (SI-Join)
Filtering candidates
• Generate signatures: prefix/LSH method
• Compute lengths: length filter
Verify candidates
• Similarity measures: full expansion, selective expansion
String Similarity Joins (SI-tree)
Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion
Data sets and algorithms
• Compared method: JaccT [Arasu et al. ICDE 2008]
• Three datasets:
Data    # of strings   String Len (avg/max)   # of Synonyms   # of applied synonyms (avg/max)
USPS    1M             6.75/15                300             2.19/5
CONF    10K            5.84/14                1000            1.43/4
SPROT   1M             10.32/20               10K             37.78/104
Effectiveness of different similarity measurements
String Similarity Measures
Selective-expansion (SE) achieves the best effectiveness.
Efficiency of algorithms
String Similarity Joins
Legend: S = selective expansion, F = full expansion
SI-Join achieves the best performance.
Prefix vs. LSH
Prefix scheme vs. LSH scheme
[Figure annotations: "Prefix is better", "LSH is better"]
Estimation effectiveness
Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion
Conclusion and future work
String similarity measure with synonyms
Two new measures and a new join algorithm
One estimator for signature selection
Future work: how to deal with synonym ambiguity
E.g. UW = University of Washington or UW = University of Waterloo?