EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION By Swapnil Kharche and Pavan Basheerabad

INTRODUCTION
• AME (Approximate Member Extraction)
  • How to efficiently extract substrings from a text document that approximately match some string in a given dictionary.
  • Applications – named entity recognition, data cleaning
• Two Steps
  • Filtration – filter out dictionary strings that are very different from the substring
  • Verification – each candidate string is verified to decide whether the substring should be extracted
INTRODUCTION: AN EXAMPLE
• A dictionary of strings we are interested in
  • E.g. conference names, author names, etc.
• We are going to locate their “approximate appearances” in a series of documents.
PROBLEM DEFINITION
• Given a dictionary R of strings and a similarity threshold δ ∈ [0,1], a query M is submitted. Here, M represents a relatively long string (e.g. a text file). The task of AME is to extract all of M’s substrings m such that there exists some r ∈ R satisfying Sim(m,r) ≥ δ.
  • r is a piece of evidence for m
  • Sim() is a function measuring the similarity of two strings
  • An example of a similarity measure is Jaccard similarity: J(r, m) = wt(r ∩ m) / wt(r ∪ m)
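As a rough illustration of the weighted Jaccard measure, the sketch below scores one dictionary string against one substring. The token weights are borrowed from the deck's later SIL example; whitespace tokenization, lowercasing, and the helper names `wt` and `jaccard` are assumptions made for this example only.

```python
# Token weights taken from the deck's SIL example slide.
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2,
           "canon": 2, "camera": 1, "digital": 1}


def wt(tokens):
    """Total weight of a set of tokens."""
    return sum(WEIGHTS[t] for t in tokens)


def jaccard(r, m):
    """Weighted Jaccard similarity of two whitespace-tokenized strings."""
    r_tokens = set(r.lower().split())
    m_tokens = set(m.lower().split())
    union = r_tokens | m_tokens
    if not union:
        return 0.0
    return wt(r_tokens & m_tokens) / wt(union)


r1 = "canon eos 5d digital camera"
m = "canon eos digital camera"
similarity = jaccard(r1, m)  # shared weight 11 over union weight 20
```

With these weights the missing token "5d" is heavy, so the pair scores 11/20 and would not pass verification at δ = 0.8 even though the strings differ by a single token.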
APPROACH
• When the input is given, we need to decide whether a substring m should be extracted
  • Simple verification against all dictionary strings may be inefficient
  • Pre-pruning and post-verifying is beneficial
  • But should it be running-speed-oriented or filtering-power-oriented?
    • Less time or fewer survivors?
FILTRATION-VERIFICATION
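[Diagram: the dictionary R and the input query M feed the Filtration stage, which outputs potential matches; the Verification stage then separates these into true matches and wrong matches.]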
FILTRATION-VERIFICATION(CONT’D)
• We need to balance between the two stages
  • More (less) filtration time → strong (weak) filtration power → fewer (more) candidates → less (more) verification time
  • Overall performance = Tf + Tv (filtration time plus verification time) ??
TECHNIQUES
• If Sim(m,r) ≥ δ, what do we have?
  • Existing techniques: wt(Sig(m)∩Sig(r)) ≥ τ(m)
  • Technique used: wt(Sig(m)∩Sig(r)) ≥ min{τ(m),τ(r)}
Where,
• Sig(m) is a prefix signature set of string m
• τ(m) is wt(Sig(m)) - (1 - δ)wt(m)
• So the threshold does not remain constant
• Use inverted lists to count sig-token overlapping
• Using IDF weights (Inverse Document Frequency)
SIGNATURE-BASED INVERTED LISTS(SIL)
• Lists indexed by sig-tokens
• Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.
• E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “Nikon digital slr camera”, r3 = “canon slr” }.
  • wt(5d, eos, slr, Nikon, canon, camera, digital) = (9, 7, 2, 2, 2, 1, 1)
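The deck does not spell out how a prefix signature is chosen. The sketch below assumes one common construction: order tokens by decreasing weight and keep the shortest prefix whose weight exceeds δ·wt(s). With δ = 0.8 this assumed rule reproduces the signature sets listed on the following slide; the alphabetical tie-break and the function names are illustrative choices, not taken from the source.

```python
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2,
           "canon": 2, "camera": 1, "digital": 1}
DELTA = 0.8


def wt(tokens):
    """Total weight of a set of tokens."""
    return sum(WEIGHTS[t] for t in tokens)


def prefix_signature(s, delta=DELTA):
    """Assumed rule: heaviest tokens first, stop once weight > delta * wt(s)."""
    tokens = set(s.lower().split())
    # Ties between equal-weight tokens are broken alphabetically (arbitrary).
    ordered = sorted(tokens, key=lambda t: (-WEIGHTS[t], t))
    total = wt(tokens)
    signature, sig_weight = set(), 0
    for t in ordered:
        if sig_weight > delta * total:
            break
        signature.add(t)
        sig_weight += WEIGHTS[t]
    return signature


def tau(s, delta=DELTA):
    """tau(s) = wt(Sig(s)) - (1 - delta) * wt(s), as defined on the slide."""
    return wt(prefix_signature(s, delta)) - (1 - delta) * wt(set(s.lower().split()))


sig_r1 = prefix_signature("canon eos 5d digital camera")
sig_r2 = prefix_signature("Nikon digital slr camera")
sig_r3 = prefix_signature("canon slr")
```

Because τ depends on the weights of each string, every string gets its own threshold, which is the point of the "threshold does not remain constant" remark.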
SIL (CONT’D)
rid | String | Signature Set
1 | “canon eos 5d digital camera” | {“canon”, “eos”, “5d”}
2 | “Nikon digital slr camera” | {“nikon”, “slr”, “camera”}
3 | “canon slr” | {“canon”, “slr”}
Signature sets of R’s strings

Signature | String rids
“5d” | (1)
“canon” | (1), (3)
“camera” | (2)
“eos” | (1)
“nikon” | (2)
“slr” | (2), (3)
SIL
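Building the SIL from the signature sets can be sketched as follows. The signature sets are copied from the table above; the dictionary-of-lists layout and the name `build_sil` are assumptions for illustration.

```python
from collections import defaultdict

# Signature sets copied from the "Signature sets of R's strings" table.
SIGNATURES = {
    1: {"canon", "eos", "5d"},
    2: {"nikon", "slr", "camera"},
    3: {"canon", "slr"},
}


def build_sil(signatures):
    """Map each sig-token to the sorted list of string ids whose signature holds it."""
    sil = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            sil[token].append(rid)  # one node per (sig-token, string) pair
    return dict(sil)


sil = build_sil(SIGNATURES)
```

Tokens such as "digital" that appear in no signature set get no list at all, which keeps the index smaller than a full inverted index over every token.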
EvSCAN ALGORITHM BY SIL
• Compute the overlapped sig weight using wt(Sig(m)∩Sig(r))
• The best-matched strings are those that satisfy the condition wt(Sig(m)∩Sig(r)) ≥ min{τ(m),τ(r)}
• E.g. m = “canon eos digital camera”, δ = 0.8

rid | wt(Sig(m)∩Sig(r)) | min{τ(m),τ(r)}
1 | 9.0 | 6.8
2 | 1.0 | 3.8
3 | 2.0 | 3.2

• Only r1 meets the condition (9.0 ≥ 6.8), so it is the only candidate passed on to verification.
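The scan can be sketched as below, under a few labeled assumptions. Sig(m) is taken to be {“eos”, “canon”}, which gives τ(m) = 9 − 0.2·11 = 6.8 as in the table. The table's value of 1.0 for r2 is reproduced only when every token of m probes the lists (the form wt(m ∩ sig(r)) used in the ES(m) definition), so the sketch probes with all of m's tokens; probing with Sig(m) alone would give 0 for r2 and the same surviving candidate. The function name `evscan` and the data layout are illustrative.

```python
from collections import defaultdict

WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2,
           "canon": 2, "camera": 1, "digital": 1}
DELTA = 0.8
DICTIONARY = {1: "canon eos 5d digital camera",
              2: "nikon digital slr camera",
              3: "canon slr"}
SIGNATURES = {1: {"canon", "eos", "5d"},
              2: {"nikon", "slr", "camera"},
              3: {"canon", "slr"}}
# Inverted lists copied from the SIL table.
SIL = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}


def wt(tokens):
    return sum(WEIGHTS[t] for t in tokens)


def tau(tokens, signature, delta=DELTA):
    """tau = wt(Sig) - (1 - delta) * wt(string)."""
    return wt(signature) - (1 - delta) * wt(tokens)


def evscan(m, sig_m, delta=DELTA):
    """Return {rid: (overlap weight, threshold, passes filter)}."""
    m_tokens = set(m.lower().split())
    overlap = defaultdict(float)
    for t in m_tokens:                # probe one inverted list per token of m
        for rid in SIL.get(t, []):
            overlap[rid] += WEIGHTS[t]
    tau_m = tau(m_tokens, sig_m, delta)
    rows = {}
    for rid, r in DICTIONARY.items():
        bound = min(tau_m, tau(set(r.split()), SIGNATURES[rid], delta))
        rows[rid] = (overlap[rid], bound, overlap[rid] >= bound)
    return rows


# Sig(m) = {"eos", "canon"} is an assumption consistent with tau(m) = 6.8.
rows = evscan("canon eos digital camera", sig_m={"eos", "canon"})
survivors = [rid for rid, (_, _, ok) in rows.items() if ok]
```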
EvITER ALGORITHM – PROGRESSIVE COMPUTATION
• Recall we are checking all substrings
  • Some of them are quite similar, indicating that they share duplicate computation
  • This means that if m has potential evidence r, then m + t (m extended with the next token t) is very likely to match r
  • Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}
• Formally we proved that
  • ES(m + t) ⊆ ES(m) ∪ list[t]
  • ES(m) = { r ∈ R | wt(m ∩ sig(r)) ≥ min{ δ * wt(m), τ(r) } }
EXAMPLE
• Document M: “…. cannon eos digital camera lens…”, with m = “cannon eos digital camera” and t = “lens”
  • ES(m) = {r1}
  • List[t] for t = “lens” (weight 3.0): r22, r53
• We know that only r1, r22, and r53 can possibly match “cannon eos digital camera lens”
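The progressive step can be sketched as follows: when a substring m grows by one token t, only the strings in ES(m) ∪ list[t] are re-checked instead of the whole dictionary. The sketch reuses the three-string dictionary, weights, and signature sets from the SIL slides with δ = 0.8; the function names are hypothetical, m is treated as a token set, and every token of m is assumed to have a known weight.

```python
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2,
           "canon": 2, "camera": 1, "digital": 1}
DELTA = 0.8
DICTIONARY = {1: {"canon", "eos", "5d", "digital", "camera"},
              2: {"nikon", "digital", "slr", "camera"},
              3: {"canon", "slr"}}
SIGNATURES = {1: {"canon", "eos", "5d"},
              2: {"nikon", "slr", "camera"},
              3: {"canon", "slr"}}

# list[t]: ids of all dictionary strings that contain token t.
TOKEN_LISTS = {}
for rid, tokens in DICTIONARY.items():
    for t in tokens:
        TOKEN_LISTS.setdefault(t, set()).add(rid)


def wt(tokens):
    return sum(WEIGHTS[t] for t in tokens)


def tau(rid):
    return wt(SIGNATURES[rid]) - (1 - DELTA) * wt(DICTIONARY[rid])


def is_potential_evidence(m_tokens, rid):
    """wt(m ∩ sig(r)) >= min(delta * wt(m), tau(r))."""
    return wt(m_tokens & SIGNATURES[rid]) >= min(DELTA * wt(m_tokens), tau(rid))


def evidence_bruteforce(m_tokens):
    """ES(m) computed by checking every dictionary string."""
    return {rid for rid in DICTIONARY if is_potential_evidence(m_tokens, rid)}


def extend(m_tokens, es_m, t):
    """ES(m + t), checking only the candidates in ES(m) ∪ list[t]."""
    new_m = m_tokens | {t}
    candidates = es_m | TOKEN_LISTS.get(t, set())
    return new_m, {rid for rid in candidates if is_potential_evidence(new_m, rid)}


tokens = ["canon", "eos", "digital", "camera"]
m = {tokens[0]}
es = evidence_bruteforce(m)      # seed with the first token
trace = [sorted(es)]
for t in tokens[1:]:
    m, es = extend(m, es, t)
    trace.append(sorted(es))
```

In this toy run r3 drops out as soon as “eos” is appended, and r2 is re-examined when “digital” and “camera” arrive but never qualifies, so the final evidence set equals the brute-force one.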
FLOW OF EVIDENCE
• EvITER for “Evidence ITERATION”
THE STATIC THRESHOLD PROBLEM
• How does this index work so far?
  - “Get ready for δ = 0.8 please.”
  - “Please wait 30 min for index generation…”
  - “Ready!”
  - “Document M1, δ = 0.8. Go!”
  - “…Extraction complete.”
  - “Document M2, and I want δ = 0.9…”
  - “Sorry, please wait another 30 min for index regeneration…”
THE STATIC THRESHOLD PROBLEM
• This one seems better
  - “Get ready for δ ≥ 0.8 please.”
  - “Please wait 30 min for index generation…”
  - “Ready!”
  - “Document M1, δ = 0.8. Go!”
  - “…Extraction complete.”
  - “Document M2, and I want δ = 0.9…”
  - “…Extraction complete.”
EXPERIMENTAL DATASETS
• Paper titles from the DBLP website
• Author names from the DBLP website
RESULTS
Fig. Performance under different k (δ = 0.85)
PERFORMANCE
Fig. Performance under different thresholds (k = 3)
CONCLUSION
• This method causes no false negatives
• It achieves a good balance between the two phases of filtration and verification.
  • The authors proposed EvITER to eliminate duplicate computation
  • It is both effective and efficient
THANK YOU!