EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION
By Swapnil Kharche and Pavan Basheerabad

INTRODUCTION
• AME (Approximate Member Extraction)
  • How to efficiently extract substrings from a text document that approximately match some string in a given dictionary
  • Applications – named entity recognition, data cleaning
  • Two steps
    • Filtration – filter out the dictionary strings that are very different from the substring
    • Verification – verify each candidate string to decide whether the substring should be extracted

INTRODUCTION: AN EXAMPLE
• A dictionary of strings we are interested in
  • E.g. conference names, author names, etc.
  • We are going to locate their “approximate appearances” in a series of documents

PROBLEM DEFINITION
• Given a dictionary R of strings and a similarity threshold δ ∈ [0,1], a query M is submitted. Here, M is a relatively long string (e.g. a text file). The task of AME is to extract all substrings m of M such that there exists some r ∈ R satisfying Sim(m,r) ≥ δ.
  • r is a piece of evidence for m
  • Sim() is a function measuring the similarity of two strings
  • An example similarity measure is Jaccard similarity: J(r,m) = wt(r ∩ m) / wt(r ∪ m)

APPROACH
• When the input is given, we need to decide whether a substring m should be extracted
  • Simple verification against all dictionary strings may be inefficient
  • Pre-pruning and post-verifying is beneficial
  • But should it be running-speed-oriented or filtering-power-oriented?
• Less time or fewer survivors?

FILTRATION-VERIFICATION
[Diagram: dictionary R and input query M → Filtration → potential matches → Verification → true matches / wrong matches]

FILTRATION-VERIFICATION (CONT’D)
• We need to balance between the two stages
  • Strong (weak) filtration power: more (less) filtration time, fewer (more) candidates, less (more) verification time
  • Overall performance = Tf + Tv

TECHNIQUES
• If Sim(m,r) ≥ δ, what do we have?
  • wt(Sig(m) ∩ Sig(r)) ≥ τ(m) (existing techniques)
  • wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)} (technique used)
• Where,
  • Sig(m) is a prefix signature set of string m
  • τ(m) is wt(Sig(m)) − (1 − δ)·wt(m)
  • So the threshold does not remain constant
• Use inverted lists to count the overlapping sig-tokens
• Use IDF (Inverse Document Frequency) weights

SIGNATURE-BASED INVERTED LISTS (SIL)
• Lists indexed by sig-tokens
• Each sig-token of a string creates a node (containing the string’s id) in the corresponding list
• E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “Nikon digital slr camera”, r3 = “canon slr” }
• wt(5d, eos, slr, Nikon, canon, camera, digital) = (9, 7, 2, 2, 2, 1, 1)

SIL (CONT’D)
Signature sets of R’s strings:
rid | String                        | Signature Set
1   | “canon eos 5d digital camera” | {“canon”, “eos”, “5d”}
2   | “Nikon digital slr camera”    | {“nikon”, “slr”, “camera”}
3   | “canon slr”                   | {“canon”, “slr”}

SIL:
Signature | String rids
“5d”      | (1)
“canon”   | (1), (3)
“camera”  | (2)
“eos”     | (1)
“nikon”   | (2)
“slr”     | (2), (3)

EvSCAN ALGORITHM BY SIL
• Compute the overlapping signature weight wt(Sig(m) ∩ Sig(r))
• The best-matched strings are those that satisfy the condition wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}
• E.g. m = “canon eos digital camera”, δ = 0.8
rid | wt(Sig(m)∩Sig(r)) | min{τ(m),τ(r)}
1   | 9.0               | 6.8
2   | 1.0               | 3.8
3   | 2.0               | 3.2

EvITER Algorithm – Progressive Computation
• Recall we are checking all substrings
  • Some of them are quite similar, indicating that they share duplicate computation
  • This means that, if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r
• Formally we proved that
  • Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}
  • ES(m) = { r ∈ R | wt(m ∩ Sig(r)) ≥ min{δ·wt(m), τ(r)} }
  • We have ES(m·t) ⊆ ES(m) ∪ list[t]

EXAMPLE
• Document M, with current substring m and next token t: “….
canon eos digital camera lens…”
  • ES(m) = {r1}
  • list[t], for t = “lens” (weight 3.0): …, 22, 53, …
• We know that only r1, r22 and r53 can possibly match “canon eos digital camera lens”

FLOW OF EVIDENCE
• EvITER stands for “Evidence ITERATION”

THE STATIC THRESHOLD PROBLEM
• How does this index work so far?
  - “Get ready for δ = 0.8 please.”
  - “Please wait 30 min for index generation…”
  - “Ready!”
  - “Document M1, δ = 0.8. Go!”
  - “…Extraction complete.”
  - “Document M2, and I want δ = 0.9…”
  - “Sorry, please wait another 30 min for index regeneration…”

THE STATIC THRESHOLD PROBLEM
• This one seems better
  - “Get ready for δ ≥ 0.8 please.”
  - “Please wait 30 min for index generation…”
  - “Ready!”
  - “Document M1, δ = 0.8. Go!”
  - “…Extraction complete.”
  - “Document M2, and I want δ = 0.9…”
  - “…Extraction complete.”

EXPERIMENTAL DATASETS
• Paper titles from the DBLP website
• Author names from the DBLP website

RESULTS
Fig. Performance under different k (δ = 0.85)

PERFORMANCE
Fig. Performance under different thresholds (k = 3)

CONCLUSION
• This method causes no false negatives
• It achieves a good balance between the two phases of filtration and verification
  • They proposed EvITER to eliminate duplicate computation
  • It is both effective and efficient

THANK YOU!
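
SUPPLEMENTARY SKETCH: SIL AND EvSCAN
The following is a minimal Python sketch of the SIL and EvSCAN slides, reproducing the camera example with δ = 0.8. It is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. The signature-selection rule (take the heaviest tokens until the leftover weight drops below (1 − δ)·wt) is an assumption inferred from the example signature sets, since the slides do not state it. The overlap is computed strictly as wt(Sig(m) ∩ Sig(r)); under that reading r2 shares no sig-token with m, and only r1 survives the filter.

```python
from collections import defaultdict

# Token weights copied from the SIL slide (lower-cased).
# Tokens outside this table are not handled in this sketch.
WT = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2, "canon": 2, "camera": 1, "digital": 1}


def weight(tokens):
    """Total weight of the distinct tokens in a collection."""
    return sum(WT[t] for t in set(tokens))


def signature(tokens, delta):
    """ASSUMED rule: shortest prefix (heaviest tokens first) whose leftover
    weight is strictly below (1 - delta) * wt(tokens)."""
    ordered = sorted(set(tokens), key=lambda t: (-WT[t], t))
    total = weight(ordered)
    budget = (1 - delta) * total
    sig, covered = set(), 0
    for tok in ordered:
        if total - covered < budget:
            break
        sig.add(tok)
        covered += WT[tok]
    return sig


def tau(tokens, delta):
    """tau(s) = wt(Sig(s)) - (1 - delta) * wt(s), as on the TECHNIQUES slide."""
    return weight(signature(tokens, delta)) - (1 - delta) * weight(tokens)


def build_sil(dictionary, delta):
    """Signature-based inverted lists: sig-token -> list of rids."""
    sil = defaultdict(list)
    for rid, tokens in sorted(dictionary.items()):
        for tok in sorted(signature(tokens, delta)):
            sil[tok].append(rid)
    return sil


def evscan_candidates(m_tokens, dictionary, sil, delta):
    """Accumulate wt(Sig(m) ∩ Sig(r)) by walking the inverted lists, then keep
    the rids whose overlap reaches min{tau(m), tau(r)}."""
    overlap = defaultdict(int)
    for tok in signature(m_tokens, delta):
        for rid in sil.get(tok, []):
            overlap[rid] += WT[tok]
    tau_m = tau(m_tokens, delta)
    return sorted(
        rid for rid, w in overlap.items()
        if w >= min(tau_m, tau(dictionary[rid], delta))
    )


R = {
    1: "canon eos 5d digital camera".split(),
    2: "nikon digital slr camera".split(),
    3: "canon slr".split(),
}
DELTA = 0.8
SIL = build_sil(R, DELTA)
CANDIDATES = evscan_candidates("canon eos digital camera".split(), R, SIL, DELTA)
print(CANDIDATES)  # prints [1]
```

Because the lists are keyed by sig-tokens only, the scan touches just the lists for Sig(m) (here “eos” and “canon”) rather than every dictionary string, which is the point of the filtration step.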