Danny's slides

The Canopies Algorithm from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching” Andrew McCallum, Kamal Nigam, Lyle H. Unger Presented by Danny Wyatt Record Linkage Methods  As classification [Felligi & Sunter]    Data point is a pair of records Each pair is classified as “match” or “not match” Post-process with transitive closure  As clustering    Data point is an individual record All records in a cluster are considered a match No transitive closure if no cluster overlap Motivation  Either way, n2 such evaluations must be performed  Evaluations can be expensive   Many features to compare Costly metrics (e.g. string edit distance)  Non-matches far outnumber matches  Can we quickly eliminate obvious non- matches to focus effort? Canopies  A fast comparison groups the data into overlapping “canopies”  The expensive comparison for full clustering is only performed for pairs in the same canopy  No loss in accuracy if: “For every traditional cluster, there exists a canopy such that all elements of the cluster are in the canopy” Creating Canopies  Define two thresholds   Tight: T1 Loose: T2  Put all records into a set S  While S is not empty     Remove any record r from S and create a canopy centered at r For each other record ri, compute cheap distance d from r to ri If d < T2, place ri in r’s canopy If d < T1, remove ri from S Creating Canopies  Points can be in more than one canopy  Points within the tight threshold will not start a new canopy  Final number of canopies depends on threshold values and distance metric  Experimental validation suggests that T1 and T2 should be equal Canopies and GAC  Greedy Agglomerative Clustering     Make fully connected graph with a node for each data point Edge weights are computed distances Run Kruskal’s MST algorithm, stopping when you have a forest of k trees Each tree is a cluster  With Canopies   Only create edges between points in the same canopy Run as before EM Clustering  Create k cluster prototypes c1…ck  Until convergence    Compute distance from each record to each prototype ( O(kn) ) Use that distance to compute probability of each prototype given the data Move the prototypes to maximize their probabilities Canopies and EM Clustering  Method 1   Distances from prototype to data points only computed within a canopies containing the prototype Note that prototypes can cross canopies  Method 2  Same as one, but also use all canopy centers to account for outside data points  Method 3  Same as 1, but dynamically create and destroy prototypes using existing techniques Complexity  n : number of data points  c : number of canopies  f : average number of canopies covering a data point  Thus, expect fn/c data points per canopy  Total distance comparisons needed becomes 2   fn  2    f 2 O c    O n    c    c    Reference Matching Results  Labeled subset of Cora data  1916 citations to 121 distinct papers  Cheap metric   Based on shared words in citations Inverted index makes finding that fast  Expensive metric  Customized string edit distance between extracted author, title, date, and venue fields  GAC for final clustering Reference Matching Results Method F1 Error Precision Recall Minutes Canopies 0.838 0.75% 0.735 0.976 7.65 Complete GAC 0.835 0.76% 0.737 0.965 134.09 Author/Year 0.697 1.60% 0.559 0.926 0.03 none 1.99% 1.000 0.000 0.00 Discussion  How do cheap and expensive distance metrics interact?    Ensure the canopies property Maximize number of canopies Minimize overlap  Probabilistic extraction, probabilistic clustering  How do the two interact?  Canopies and classification-based linkage  Only calculate pair data points for records in the same canopy

Danny's slides

Related documents

Products

Support

Danny's slides

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib