EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION By Swapnil Kharche and Pavan Basheerabad

INTRODUCTION
• AME (Approximate Member Extraction)
  • How to efficiently extract the substrings of a text document that approximately match some string in a given dictionary
  • Applications – named entity recognition, data cleaning
  • Two steps
    • Filtration – filter out the dictionary strings that are very different from the substring
    • Verification – verify each remaining candidate string to decide whether the substring should be extracted

INTRODUCTION: AN EXAMPLE
• A dictionary of strings we are interested in
  • E.g. conference names, author names, etc.
  • We are going to locate their "approximate appearances" in a series of documents.

PROBLEM DEFINITION
• Given a dictionary R of strings and a similarity threshold δ ∈ [0,1], a query M is submitted. Here, M is a relatively long string (e.g. a text file). The task of AME is to extract all substrings m of M such that there exists some r ∈ R satisfying Sim(m, r) ≥ δ.
  • r is a piece of evidence for m
  • Sim() is a function measuring the similarity of two strings
  • An example of a similarity measure is the Jaccard similarity: J(r, m) = wt(r ∩ m) / wt(r ∪ m)

APPROACH
• When the input is given, we need to decide whether a substring m should be extracted
  • Simple verification against all dictionary strings may be inefficient
  • Pre-pruning and post-verifying is beneficial
  • But should it be running-speed-oriented or filtering-power-oriented?
• Less time or fewer survivors?

FILTRATION-VERIFICATION
[Diagram: the dictionary R and the input query M feed Filtration, which outputs potential matches; Verification then separates the true matches from the wrong matches.]

FILTRATION-VERIFICATION (CONT'D)
• We need to balance between the two stages
  • Strong (weak) filtration power → more (less) filtration time
  • Fewer (more) candidates → less (more) verification time
  • Effect on overall performance (= Tf + Tv): ??

TECHNIQUES
• If Sim(m, r) ≥ δ, what do we have?
  • Existing techniques: wt(Sig(m) ∩ Sig(r)) ≥ τ(m)
  • Technique used: wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}
Where,
• Sig(m) is the prefix signature set of string m
• τ(m) = wt(Sig(m)) − (1 − δ)·wt(m)
• So the threshold does not remain constant
• Use inverted lists to count sig-token overlap
• Use IDF (Inverse Document Frequency) weights

SIGNATURE-BASED INVERTED LISTS (SIL)
• Lists indexed by sig-tokens
• Each sig-token of a string creates a node (containing the string's id) in the corresponding list.
• E.g. R = { r1 = "canon eos 5d digital camera", r2 = "Nikon digital slr camera", r3 = "canon slr" }
• wt(5d, eos, slr, nikon, canon, camera, digital) = (9, 7, 2, 2, 2, 1, 1)

SIL (CONT'D)
Signature sets of R's strings:

rid   String                           Signature set
1     "canon eos 5d digital camera"    {"canon", "eos", "5d"}
2     "Nikon digital slr camera"       {"nikon", "slr", "camera"}
3     "canon slr"                      {"canon", "slr"}

SIL:

Sig-token   rids
"5d"        (1)
"canon"     (1), (3)
"camera"    (2)
"eos"       (1)
"nikon"     (2)
"slr"       (2), (3)

EvSCAN ALGORITHM BY SIL
• Compute the overlapped sig weight wt(Sig(m) ∩ Sig(r))
• The strings kept as candidate matches are those that satisfy the condition wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}
• E.g. m = "canon eos digital camera", δ = 0.8

rid   wt(Sig(m)∩Sig(r))   min{τ(m),τ(r)}
1     9.0                 6.8
2     1.0                 3.8
3     2.0                 3.2

EvITER ALGORITHM – PROGRESSIVE COMPUTATION
• Recall we are checking all substrings
  • Some of them are quite similar, indicating that they share duplicate computation
  • This means that, if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r
• Formally, we proved that
  • Let ES(m) be the set of "potential evidence" for m, and list[t] = { s | s is a dictionary string that contains token t }
  • We have ES(m·t) ⊆ ES(m) ∪ list[t]
  • ES(m) = { r ∈ R | wt(m ∩ Sig(r)) ≥ min{ δ·wt(m), τ(r) } }

EXAMPLE
• Document M: "… canon eos digital camera lens …", where m = "canon eos digital camera" and t = "lens"
  • ES(m) = {r1}
  • list[t]:
      …
      lens, 3.0 → 22, 53
      …
• We know that only r1, r22 and r53 can possibly match "canon eos digital camera lens"

FLOW OF EVIDENCE
• EvITER stands for "Evidence ITERATION"

THE STATIC THRESHOLD PROBLEM
• How does this index work so far?
  - "Get ready for δ = 0.8 please."
  - "Please wait 30 min for index generation…"
  - "Ready!"
  - "Document M1, δ = 0.8. Go!"
  - "…Extraction complete."
  - "Document M2, and I want δ = 0.9…"
  - "Sorry, please wait another 30 min for index regeneration…"

THE STATIC THRESHOLD PROBLEM
• This one seems better
  - "Get ready for δ ≥ 0.8 please."
  - "Please wait 30 min for index generation…"
  - "Ready!"
  - "Document M1, δ = 0.8. Go!"
  - "…Extraction complete."
  - "Document M2, and I want δ = 0.9…"
  - "…Extraction complete."

EXPERIMENTAL DATASETS
• Paper titles from the DBLP website
• Author names from the DBLP website

RESULTS
Fig. Performance under different k (δ = 0.85)

PERFORMANCE
Fig. Performance under different thresholds (k = 3)

CONCLUSION
• This method causes no false negatives
• It achieves a good balance between the two phases of filtration and verification.
• They proposed EvITER to eliminate duplicate computation
• It is both effective and efficient

THANK YOU!
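The SIL and EvSCAN example (R = {r1, r2, r3}, m = "canon eos digital camera", δ = 0.8) can be retraced with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, and the signature rule is an assumption (sort tokens by descending weight, then drop the lightest tokens while the removed weight stays strictly below (1 − δ)·wt). Ties are broken alphabetically, and the overlap is accumulated over all tokens of m, as in the ES(m) definition. Under these assumptions the sketch reproduces the signature sets, the SIL, and the overlap and threshold columns of the example table.

```python
from collections import defaultdict

# Token weights and dictionary copied from the SIL example (tokens lowercased).
WT = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2, "canon": 2, "camera": 1, "digital": 1}
R = {1: "canon eos 5d digital camera", 2: "nikon digital slr camera", 3: "canon slr"}

def tokens(s):
    return set(s.lower().split())

def wt(toks):
    return sum(WT[t] for t in toks)

def signature(toks, delta):
    # ASSUMED prefix rule: drop the lightest tokens while the total removed
    # weight stays strictly below (1 - delta) * wt(string).
    ordered = sorted(toks, key=lambda t: (-WT[t], t))  # heaviest first, ties by name
    budget = (1 - delta) * wt(toks)
    removed = 0.0
    while ordered and removed + WT[ordered[-1]] < budget:
        removed += WT[ordered.pop()]
    return set(ordered)

def tau(toks, delta):
    # tau(s) = wt(Sig(s)) - (1 - delta) * wt(s), as on the TECHNIQUES slide.
    return wt(signature(toks, delta)) - (1 - delta) * wt(toks)

def build_sil(dictionary, delta):
    # One list per sig-token; each sig-token of a string adds that string's id.
    sil = defaultdict(list)
    for rid, s in dictionary.items():
        for t in signature(tokens(s), delta):
            sil[t].append(rid)
    return sil

def evscan(m, dictionary, sil, delta):
    # Accumulate the overlapped sig weight per rid by scanning the lists of m's tokens.
    m_toks = tokens(m)
    overlap = defaultdict(float)
    for t in m_toks:
        for rid in sil.get(t, []):
            overlap[rid] += WT[t]
    tau_m = tau(m_toks, delta)
    rows = {}
    for rid, s in dictionary.items():
        threshold = min(tau_m, tau(tokens(s), delta))
        # (overlap weight, threshold, survives filtration?)
        rows[rid] = (overlap[rid], round(threshold, 6), overlap[rid] >= threshold)
    return rows

def jaccard(a, b):
    # Weighted Jaccard similarity used in the verification step.
    return wt(a & b) / wt(a | b)

delta = 0.8
m = "canon eos digital camera"
sil = build_sil(R, delta)
rows = evscan(m, R, sil, delta)
for rid, row in rows.items():
    print(rid, row)
print("J(m, r1) =", jaccard(tokens(m), tokens(R[1])))
```

In this sketch only r1 survives filtration. With the weights above, J(m, r1) = 11/20 = 0.55, so the verification step would still reject r1 at δ = 0.8. This is consistent with filtration being a necessary condition only: it guarantees no false negatives, not that every candidate is a true match.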
