Author Disambiguation Using Error-driven Machine Learning with a Ranking Loss Function

Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, Andrew McCallum
Department of Computer Science, University of Massachusetts, Amherst, MA 01003

Abstract

Author disambiguation is the problem of determining whether records in a publications database refer to the same person. A common supervised machine learning approach is to build a classifier to predict whether a pair of records is coreferent, followed by a clustering step to enforce transitivity. However, this approach ignores powerful evidence obtainable by examining sets (rather than pairs) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these first-order features over sets of records. We then propose a training algorithm well-suited to this representation that is (1) error-driven, in that training examples are generated from incorrect predictions on the training data, and (2) rank-based, in that the classifier induces a ranking over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.

Introduction

Record deduplication is the problem of deciding whether two records in a database refer to the same object. This problem is widespread in any large-scale database, and is particularly acute when records are constructed automatically by text mining.

Author disambiguation, the problem of de-duplicating author records, is a critical concern for digital publication libraries such as Citeseer, DBLP, Rexa, and Google Scholar. Author disambiguation is difficult in these domains because of abbreviations (e.g., Y. Smith), misspellings (e.g., Y. Smiht), and extraction errors (e.g., Smith Statistical).

Many supervised machine learning approaches to author disambiguation have been proposed. Most are variants of the following recipe: (1) train a binary classifier to predict whether a pair of authors are duplicates, (2) apply the classifier to each pair of ambiguous authors, and (3) combine the classification predictions to cluster the records into duplicate sets. This approach can be quite accurate, and it is attractive because it builds upon existing machine learning technology (e.g., classification and clustering). However, because the core of this approach is a classifier over record pairs, aggregate features of an author are never modeled explicitly. That is, by restricting the model representation to evidence over pairs of authors, we cannot leverage evidence available from examining more than two records. For example, we would like to model the fact that authors are generally affiliated with only a few institutions, have only one or two different email addresses, and are unlikely to publish more than thirty publications in one year. None of these constraints can be captured by a pairwise classifier, which, for example, can only consider whether pairs of institutions or emails match.

In this paper we propose a representation for author disambiguation that enables these aggregate constraints. The representation can be understood as a scoring function over a set of authors, indicating how likely it is that all members of the set are duplicates. While flexible, this new representation can make it difficult to estimate the model parameters from training data. We therefore propose a class of training algorithms to estimate the parameters of models adopting this representation. The method has two main characteristics essential to its performance. First, it is error-driven, in that training examples are generated from mistakes made by the prediction algorithm; this focuses training effort on the types of examples expected during prediction. Second, it is rank-based, in that the loss function induces a ranking over candidate predictions. By representing the difference between predictions, preferences can be expressed over partially correct or incomplete solutions; additionally, intractable normalization constants can be avoided because the loss function is a ratio of terms.

In the following sections, we describe the representation in more detail, then present our proposed training methods. We evaluate our proposals on three real-world author deduplication datasets and demonstrate error reductions of up to 60%.

Motivating Examples

Table 1 shows three synthetic publication records that demonstrate the difficulty of author disambiguation. Each record contains equivalent author strings, and the publication titles contain similar words (networks, understanding, protocols). However, only the last two authors are duplicates.

Author | Title | Institution | Year
Y. Li | Understanding Social Networks | Stanford | 2003
Y. Li | Understanding Network Protocols | Carnegie Mellon | 2002
Y. Li | Virtual Network Protocols | Peking Univ. | 2001

Table 1: Author disambiguation example with multiple institutions.

Author | Co-authors | Title
P. Cohen | A. Howe | How evaluation guides AI research
P. Cohen | M. Greenberg, A. Howe, ... | Trial by Fire: Understanding the design requirements ... in complex environments
P. Cohen | M. Greenberg | MU: a development environment for prospective reasoning systems

Table 2: Author disambiguation example with overlapping co-authors.

Consider a binary classifier that predicts whether a pair of records are duplicates. Features may include the similarity of the author, title, and institution strings. Given a labeled dataset, the classifier may learn that authors often have the same institution, but since many authors have multiple institutions, the pairwise classifier may still predict that all of the authors in Table 1 are duplicates. Consider instead a scoring function that considers all records simultaneously. For example, this function can compute a feature indicating that an author is affiliated with three different institutions in a three-year period. Given training data in which this event is rare, it is likely that the classifier would not predict all the authors in Table 1 to be duplicates.

Table 2 shows an example in which co-author information is available. It is very likely that two authors with similar names who share a co-author are duplicates. However, a pairwise classifier will compute that records one and three do not share co-authors, and therefore may not predict that all of these records are duplicates. (A post-processing clustering method may merge the records together through transitivity, but only if the aggregation of the pairwise predictions is sufficiently high.) A scoring function that considers all records simultaneously can capture the fact that records one and three each share a co-author with record two, and are therefore all likely duplicates. In the following section, we formalize this intuition, then describe how to estimate the parameters of such a representation.
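To make the aggregate evidence in Tables 1 and 2 concrete, the following is a minimal sketch (ours, under illustrative assumptions, not the paper's implementation) of two cluster-level features: the number of distinct institutions appearing within a short publication-year window, and whether every record shares a co-author with some other record in the set. The field names ("institution", "year", "coauthors") are assumptions for the example.

```python
# Sketch (not from the paper): aggregate features over a SET of candidate
# duplicate records, as opposed to features over a single pair.

def institutions_in_window(records, window=3):
    """Largest number of distinct institutions appearing within any `window`-year span."""
    years = sorted({r["year"] for r in records})
    best = 0
    for start in years:
        insts = {r["institution"] for r in records
                 if start <= r["year"] < start + window}
        best = max(best, len(insts))
    return best

def every_record_shares_a_coauthor(records):
    """True if each record shares at least one co-author with some other record in the set."""
    for i, r in enumerate(records):
        if not any(set(r["coauthors"]) & set(o["coauthors"])
                   for j, o in enumerate(records) if j != i):
            return False
    return True

# The Y. Li records of Table 1: three institutions within three years is rare
# for a single author, which a set-level score can penalize directly.
cluster = [
    {"institution": "Stanford",        "year": 2003, "coauthors": []},
    {"institution": "Carnegie Mellon", "year": 2002, "coauthors": []},
    {"institution": "Peking Univ.",    "year": 2001, "coauthors": []},
]
print(institutions_in_window(cluster))  # -> 3
```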
Scoring Functions for Disambiguation

Consider a publications database D containing records {R_1 ... R_n}. A record R_i consists of k fields {F_1 ... F_k}, where each field is an attribute-value pair F_j = <attribute, value>. Author disambiguation is the problem of partitioning {R_1 ... R_n} into m sets {A_1 ... A_m}, m ≤ n, where A_l = {R_j ... R_k} contains all the publications authored by the l-th person. We refer to a partitioning of D as T(D) = {A_1 ... A_m}, which we abbreviate as T.

Given some partitioning T, we wish to learn a scoring function S : T → R such that higher values of S(T) correspond to more accurate partitionings. Author disambiguation is then the problem of searching for the highest scoring partitioning:

    T* = argmax_T S(T)

The structure of S determines its representational power. There is a trade-off between the types of evidence we can use to compute S(T) and the difficulty of estimating the parameters of S(T). Below, we describe a pairwise scoring function, which decomposes S(T) into a sum of scores for record pairs, and a clusterwise scoring function, which decomposes S(T) into a sum of scores for record clusters.

Let T be decomposed into p (possibly overlapping) substructures {t_1 ... t_p}, where t_i ⊂ T indicates that t_i is a substructure of T. For example, a partitioning T may be decomposed into a set of record pairs <R_i, R_j>. Let f : t → R^k be a substructure feature function that summarizes t with k real-valued features.¹ Let s : f(t) × Λ → R be a substructure scoring function that maps the features of substructure t to a real value, where Λ ∈ R^k is a set of real-valued parameters of s. For example, a linear substructure scoring function simply returns the inner product <Λ, f(t)>. Let S_f : s(f(t_1), Λ) × ... × s(f(t_p), Λ) → R be a factored scoring function that combines a set of substructure scores into a global score for T. In the simplest case, S_f may simply be the sum of the substructure scores. Below we describe two scoring functions resulting from different choices for the substructures t.

¹ Binary features are common as well: f : t → {0, 1}^k.

Pairwise Scoring Function

Given a partitioning T, let t^ij represent a pair of records <R_i, R_j>. We define the pairwise scoring function as

    S_p(T, Λ) = Σ_{ij} s(f(t^ij), Λ)

Thus, S_p(T) is a sum of scores for each pair of records. Each component score s(f(t^ij), Λ) indicates the preference for the prediction that records R_i and R_j co-refer. This is analogous to approaches adopted recently using binary classifiers to perform disambiguation (Torvik et al. 2005; Huang, Ertekin, & Giles 2006).

Clusterwise Scoring Function

Given a partitioning T, let t^k represent a set of records {R_i ... R_j} (e.g., t^k is a block of the partition). We define the clusterwise scoring function as the sum of scores for each cluster:

    S_c(T, Λ) = Σ_k s(f(t^k), Λ)

where each component score s(f(t^k), Λ) indicates the preference for the prediction that all the elements {R_i ... R_j} co-refer.
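The two factored decompositions above can be written down in a few lines. The sketch below assumes a linear substructure scorer s(f(t), Λ) = <Λ, f(t)> and user-supplied feature functions for pairs and clusters; it illustrates the decomposition only and is not the paper's implementation.

```python
# Sketch of the pairwise and clusterwise factored scoring functions.
# `pair_features` and `cluster_features` are placeholder feature functions.
import itertools
import numpy as np

def linear_s(features, lam):
    """Substructure score s(f(t), Λ) = <Λ, f(t)>."""
    return float(np.dot(lam, features))

def pairwise_score(partition, pair_features, lam):
    """S_p(T, Λ): sum of substructure scores over every pair of records.
    Whether T places the pair in the same block is passed to the feature
    function, so the score can favor or disfavor that placement."""
    labeled = [(b, r) for b, block in enumerate(partition) for r in block]
    return sum(linear_s(pair_features(ri, rj, same_block=(bi == bj)), lam)
               for (bi, ri), (bj, rj) in itertools.combinations(labeled, 2))

def clusterwise_score(partition, cluster_features, lam):
    """S_c(T, Λ): sum of substructure scores over the blocks (clusters) of T."""
    return sum(linear_s(cluster_features(block), lam) for block in partition)
```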
Learning Scoring Functions

Given some training database D_T for which the true author disambiguation is known, we wish to estimate the parameters of S to maximize expected disambiguation performance on new, unseen databases. Below, we outline approaches to estimating Λ for pairwise and clusterwise scoring functions.

Pairwise Classification Training

A standard approach to training a pairwise scoring function is to generate a training set consisting of all pairs of authors (if this is impractical, one can prune the set to only those pairs that share a minimal amount of surface similarity). A classifier is estimated from this data to predict the binary label SameAuthor. Once the classifier is created, we can set each substructure score s(f(t^ij)) as follows. Let p1 = P(SameAuthor = 1 | R_i, R_j). Then the score is

    s(f(t^ij)) ∝ p1        if R_i and R_j are in the same block of T
    s(f(t^ij)) ∝ 1 − p1    otherwise

Thus, if R_i and R_j are placed in the same partition in T, the score is proportional to the classifier output for the positive label; otherwise, it is proportional to the output for the negative label.

Clusterwise Classification Training

The pairwise classification scheme forces each coreference decision to be made independently of all others. Instead, the clusterwise classification scheme implements a form of the clusterwise scoring function described earlier. A binary classifier is built that predicts whether all members of a set of author records {R_i ... R_j} refer to the same person. The scoring function is then constructed in a manner analogous to the pairwise scheme, with the exception that the probability p1 is conditioned on an arbitrarily large set of mentions, and s ∝ p1 only if all members of the set fall in the same block of T.
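The following is a minimal sketch of how these two classification schemes turn a trained binary classifier into substructure scores. Here `prob_same(records)` stands in for any model returning P(SameAuthor = 1 | ·); it is an assumption for illustration, not a specific classifier from the paper.

```python
# Sketch: substructure scores derived from a binary classifier's probability.
import itertools

def pairwise_partition_score(prob_same, partition):
    """Pairwise scheme: each pair contributes p1 if it shares a block of T,
    and 1 - p1 otherwise; the global score is the sum over pairs."""
    labeled = [(b, r) for b, block in enumerate(partition) for r in block]
    total = 0.0
    for (bi, ri), (bj, rj) in itertools.combinations(labeled, 2):
        p1 = prob_same([ri, rj])
        total += p1 if bi == bj else 1.0 - p1
    return total

def clusterwise_partition_score(prob_same, partition):
    """Clusterwise scheme: p1 is conditioned on a whole set of mentions, and a
    set contributes only when all of its members fall in the same block of T."""
    return sum(prob_same(list(block)) for block in partition)
```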
Error-driven Online Training

We employ a sampling scheme that selects training examples based on errors that occur during inference on the labeled training data. For example, if inference is performed with agglomerative clustering, the first time two non-coreferent clusters are merged, the features that describe that merge decision are used to update the parameters.

Let A be a prediction algorithm that computes a sequence of predictions, i.e., A : X × Λ → T^0(X) × ... × T^r(X), where T^r(X) is the final prediction of the algorithm. For example, A could be a clustering algorithm. Algorithm 1 gives high-level pseudo-code for the error-driven framework.

Algorithm 1: Error-driven Training Algorithm
1: Input: training set D, initial parameters Λ_0, prediction algorithm A
2: while not converged do
3:   for all <X, T*(X)> ∈ D do
4:     T(X) ⇐ A(X, Λ_t)
5:     D_e ⇐ CreateExamplesFromErrors(T(X), T*(X))
6:     Λ_{t+1} ⇐ UpdateParameters(D_e, Λ_t)
7:   end for
8: end while

At each iteration, we enumerate over the examples in the original training set. For each example, we run A with the current parameter vector Λ_t to generate T(X), a sequence of predicted structures for X. In general, the function CreateExamplesFromErrors can select an arbitrary number of the errors contained in T(X). In this paper, we select only the first mistake in T(X). When the prediction algorithm is computationally intensive, this greatly increases efficiency, since inference is terminated as soon as an error is made. Given D_e, the parameters Λ_{t+1} are set based on the errors made using Λ_t. In the following section, we describe the nature of D_e in more detail and present a ranking-based method for the parameter update.

Learning To Rank

An important consequence of using a search-based clustering algorithm is that the scoring function is used to compare a set of possible modifications to the current prediction. Given a clustering T^i, let N(T^i) be the set of predictions in the neighborhood of T^i: T^{i+1} ∈ N(T^i) if the prediction algorithm can construct T^{i+1} from T^i in one iteration. For example, at each iteration of the greedy agglomerative system, N(T^i) is the set of clusterings resulting from all possible merges of a pair of clusters in T^i.

We desire a training method that encourages the inference procedure to choose the best possible neighbor at each iteration. Let N̂(T) ∈ N(T) be the neighbor of T that has the maximum predicted global score, i.e., N̂(T) = argmax_{T' ∈ N(T)} S(T', Λ). Let S* : T → R be a global scoring function that returns the true score for a prediction, for example the accuracy of prediction T.

Given this notation, we can now fully describe the method CreateExamplesFromErrors in Algorithm 1. An error occurs when there exists a structure N*(T) such that S*(N*(T)) > S*(N̂(T)); that is, the best predicted structure N̂(T) has a lower true score than another candidate structure N*(T). A training example <N̂(T), N*(T)> is then generated,² and a loss function is computed to adjust the parameters to encourage N*(T) to have a higher predicted score than N̂(T). Below we describe two such loss functions.

² While in general D_e can contain the entire neighborhood N(T), in this paper we restrict D_e to two structures: the incorrectly predicted neighbor and the neighbor that should have been selected.
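Before turning to the two loss functions, the sketch below ties Algorithm 1 and CreateExamplesFromErrors to greedy agglomerative clustering, under our own simplifying assumptions (clusters represented as Python sets, inference run until one cluster remains or an error is found). `predicted_score` plays the role of S(·, Λ) and `true_score` the role of S*(·); both are supplied by the caller.

```python
# Sketch of error-driven example generation during greedy agglomerative clustering.
import itertools

def neighbors(partition):
    """N(T): every clustering reachable by merging one pair of blocks of T."""
    for i, j in itertools.combinations(range(len(partition)), 2):
        rest = [b for k, b in enumerate(partition) if k not in (i, j)]
        yield rest + [partition[i] | partition[j]]

def first_error_example(partition, predicted_score, true_score):
    """Return the training example <N_hat(T), N_star(T)> at the first erroneous
    step, or None if inference completes without an error."""
    while len(partition) > 1:
        cands = list(neighbors(partition))
        n_hat = max(cands, key=predicted_score)   # best neighbor under S(·, Λ)
        n_star = max(cands, key=true_score)       # best neighbor under S*(·)
        if true_score(n_star) > true_score(n_hat):
            return n_hat, n_star                  # first mistake: stop inference here
        partition = n_hat                         # no error: take the predicted merge
    return None
```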
Ranking Perceptron

The perceptron update is

    Λ_{t+1} = Λ_t + y · F(T)

where y = 1 if T = N*(T), and y = −1 otherwise. This is the standard perceptron update (Freund & Schapire 1999), but in this context it results in a ranking update: the update compares a pair of F(T) vectors, one for N̂(T) and one for N*(T), and thus effectively operates on the difference between these two vectors. For robustness, we average the parameters from each iteration at the end of training.

Ranking MIRA

We use a variant of MIRA (Margin Infused Relaxed Algorithm), a relaxed online maximum-margin training algorithm (Crammer & Singer 2003). We update the parameter vector subject to three constraints: (1) the better neighbor must have a higher score by a given margin, (2) the change to Λ should be minimal, and (3) the inferior neighbor must have a score below a user-defined threshold τ (0.5 in our experiments). The second constraint reduces fluctuations in Λ. The update is the solution to the following quadratic program:

    Λ_{t+1} = argmin_Λ ||Λ_t − Λ||²
    s.t.  S(N*(T), Λ) − S(N̂(T), Λ) ≥ 1
          S(N̂(T), Λ) < τ

The quadratic program of MIRA is a norm-minimization that is efficiently solved by the method of Hildreth (Censor & Zenios 1997). As in the perceptron, we average the parameters from each iteration.
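The sketch below shows the two updates on a single error-driven example <N̂(T), N*(T)>, assuming a linear scorer S(T, Λ) = <Λ, F(T)> over global feature vectors. Note that the MIRA variant here enforces only the margin constraint, for which a closed-form step exists; the paper's full quadratic program also requires S(N̂, Λ) < τ and is solved by Hildreth's method, which is not reproduced here.

```python
# Sketch of the ranking perceptron and a simplified ranking-MIRA update.
import numpy as np

def ranking_perceptron_update(lam, f_star, f_hat):
    """Λ_{t+1} = Λ_t + F(N_star) - F(N_hat): reward the better neighbor and
    penalize the predicted one, i.e., update on the difference vector."""
    return lam + (f_star - f_hat)

def ranking_mira_update(lam, f_star, f_hat, margin=1.0):
    """Minimal change to Λ such that S(N_star, Λ) - S(N_hat, Λ) >= margin.
    (The threshold constraint S(N_hat, Λ) < τ from the paper is omitted.)"""
    diff = f_star - f_hat
    violation = margin - float(np.dot(lam, diff))
    if violation <= 0.0 or not np.any(diff):
        return lam                                 # margin already satisfied
    alpha = violation / float(np.dot(diff, diff))  # closed-form step size
    return lam + alpha * diff

# For robustness, both methods average the parameter vectors produced at each
# iteration and use the average at prediction time.
```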
Experiments

Data

We used three datasets for evaluation:
• Penn: 2021 citations, 139 unique authors
• Rexa: 1459 citations, 289 unique authors
• DBLP: 566 citations, 76 unique authors

Each dataset contains multiple sets of citations authored by people with the same last name and first initial. We split the data into training and testing sets such that all authors with the same first initial and last name are in the same split.

The features used in our experiments are as follows. We use the first and middle names of the author in question and the number of overlapping co-authors. We determine the rarity of the last name of the author in question using US census data. We use several different similarity measures on the titles of the two citations, such as the cosine similarity between the words, string edit distance, the TF-IDF measure, and the number of overlapping bigrams and trigrams. We also look for similarity in author emails, institution affiliation, and the venue of publication whenever available. In addition, we use the following first-order features over these pairwise features: for real-valued features, we compute their minimum, maximum, and average values; for binary-valued features, we compute the proportion of pairs for which they are true, as well as existential and universal operators (e.g., "there exists a pair of authors with mismatching middle initials").
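A minimal sketch of these first-order operators, applied to pairwise feature vectors collected from every pair in a candidate cluster, is shown below. The example pairwise features are placeholders, not the exact feature set used in the experiments.

```python
# Sketch of first-order (aggregate) operators over pairwise features of a cluster.
import itertools

def first_order_features(records, pair_features):
    """min/max/mean for real-valued pairwise features; proportion, exists, and
    for-all operators for binary-valued pairwise features."""
    vecs = [pair_features(a, b) for a, b in itertools.combinations(records, 2)]
    if not vecs:
        return {}
    out = {}
    for name in vecs[0]:
        vals = [v[name] for v in vecs]
        if all(isinstance(x, bool) for x in vals):
            out[name + "_prop"] = sum(vals) / len(vals)
            out[name + "_exists"] = any(vals)
            out[name + "_forall"] = all(vals)
        else:
            out[name + "_min"] = min(vals)
            out[name + "_max"] = max(vals)
            out[name + "_mean"] = sum(vals) / len(vals)
    return out

def pair_features(a, b):
    """Example (placeholder) pairwise features for two citation records."""
    return {
        "title_word_overlap": len(set(a["title"].lower().split()) &
                                  set(b["title"].lower().split())),
        "middle_initial_mismatch": bool(a.get("middle") and b.get("middle")
                                        and a["middle"] != b["middle"]),
    }
```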
Results

Table 3 summarizes the systems compared in our experiments. The goal is to determine the effectiveness of the clusterwise scoring function, error-driven example generation, and rank-based training. In all experiments, prediction is performed with greedy agglomerative clustering. We evaluate performance using three popular measures: Pairwise, the precision and recall of each pairwise decision; MUC (Vilain et al. 1995), a metric commonly used in noun coreference resolution; and B-Cubed (Amit & Baldwin 1998).

Component | Name | Description
Score Representation | Pairwise (P) | See Section Pairwise Scoring Function.
Score Representation | Clusterwise (C) | See Section Clusterwise Scoring Function.
Training Example Generation | Error-Driven (E) | See Section Error-driven Online Training.
Training Example Generation | Uniform (U) | Examples are sampled u.a.r. from search trees.
Training Example Generation | All-Pairs (A) | All positive and negative pairs.
Loss Function | Ranking MIRA (Mr) | See Section Ranking MIRA.
Loss Function | Non-Ranking MIRA (Mc) | MIRA trained for classification, not ranking.
Loss Function | Ranking Perceptron (Pr) | See Section Ranking Perceptron.
Loss Function | Non-Ranking Perceptron (Pc) | Perceptron trained for classification, not ranking.
Loss Function | Logistic Regression (L) | Standard logistic regression.

Table 3: Description of the various system components used in the experiments.

Tables 4, 5, and 6 present the performance on the three datasets. Note that accuracy varies significantly between datasets because each has quite different characteristics (e.g., different distributions of papers per unique author) and different available attributes (e.g., institutions, emails, etc.).

System | Pairwise F1 | Pairwise Prec | Pairwise Rec | B-Cubed F1 | B-Cubed Prec | B-Cubed Rec | MUC F1 | MUC Prec | MUC Rec
C/E/Mr | 36.0 | 96.0 | 22.1 | 48.8 | 97.9 | 32.5 | 79.8 | 99.2 | 66.8
C/E/Mc | 24.4 | 99.2 | 13.9 | 35.8 | 98.7 | 21.8 | 71.2 | 98.3 | 55.8
C/E/Pr | 52.0 | 77.9 | 39.0 | 63.8 | 84.1 | 51.4 | 88.6 | 94.5 | 83.4
C/E/Pc | 40.3 | 99.9 | 25.3 | 52.6 | 99.6 | 35.7 | 81.5 | 99.6 | 68.9
C/U/L | 36.2 | 74.8 | 23.9 | 46.1 | 80.6 | 32.2 | 80.4 | 89.6 | 73.0
P/A/L | 44.9 | 96.8 | 29.3 | 56.0 | 95.0 | 39.7 | 87.2 | 96.4 | 79.5

Table 4: Author coreference results on the Penn data. See Table 3 for the definitions of each system.

System | Pairwise F1 | Pairwise Prec | Pairwise Rec | B-Cubed F1 | B-Cubed Prec | B-Cubed Rec | MUC F1 | MUC Prec | MUC Rec
C/E/Mr | 74.1 | 86.3 | 65.0 | 78.0 | 94.6 | 66.4 | 82.7 | 98.0 | 71.5
C/E/Mc | 39.4 | 98.1 | 24.7 | 59.3 | 96.6 | 42.8 | 72.7 | 96.7 | 58.2
C/E/Pr | 86.4 | 87.4 | 85.5 | 81.8 | 81.6 | 82.0 | 88.8 | 87.4 | 90.2
C/E/Pc | 49.5 | 87.2 | 34.5 | 65.8 | 94.6 | 50.4 | 78.3 | 96.2 | 66.0
C/U/L | 45.0 | 87.3 | 30.3 | 67.2 | 86.9 | 54.8 | 82.0 | 87.8 | 76.4
P/A/L | 66.2 | 72.4 | 61.0 | 75.7 | 75.5 | 76.0 | 88.6 | 85.3 | 92.2

Table 5: Author coreference results on the Rexa data. See Table 3 for the definitions of each system.

System | Pairwise F1 | Pairwise Prec | Pairwise Rec | B-Cubed F1 | B-Cubed Prec | B-Cubed Rec | MUC F1 | MUC Prec | MUC Rec
C/E/Mr | 92.2 | 94.2 | 90.2 | 89.0 | 94.4 | 84.2 | 93.5 | 98.5 | 89.0
C/E/Mc | 91.9 | 90.6 | 93.2 | 87.6 | 94.8 | 81.5 | 90.7 | 98.5 | 84.1
C/E/Pr | 45.3 | 29.4 | 99.4 | 72.9 | 57.8 | 98.8 | 94.2 | 89.3 | 99.6
C/E/Pc | 93.1 | 91.0 | 95.3 | 90.6 | 92.0 | 89.3 | 94.3 | 97.2 | 91.6
C/U/L | 82.4 | 95.7 | 72.3 | 83.2 | 93.0 | 75.3 | 93.1 | 96.2 | 90.3
P/A/L | 88.0 | 84.6 | 91.7 | 86.1 | 84.6 | 87.8 | 93.0 | 93.0 | 93.0

Table 6: Author coreference results on the DBLP data. See Table 3 for the definitions of each system.

The first observation is that the simplest method of estimating the parameters of the clusterwise score performs quite poorly. C/U/L trains the clusterwise score by uniformly sampling sets of authors and training a binary classifier to indicate whether all authors in the set are duplicates. It performs consistently worse than P/A/L, the standard pairwise classifier.

We next consider improvements to the training algorithm. The first enhancement is error-driven training. Comparing C/U/L with C/E/Pc (the error-driven perceptron classifier) and C/E/Mc (the error-driven MIRA classifier) shows that error-driven training often improves performance. For example, in Table 6, pairwise F1 increases from 82.4 for C/U/L to 93.1 for C/E/Pc and to 91.9 for C/E/Mc. However, this improvement is not consistent across all datasets; error-driven training alone is not enough to ensure accurate performance for the clusterwise score.

The second enhancement is the rank-based parameter update. With this additional enhancement, the clusterwise score consistently outperforms the pairwise score. For example, in Table 5, C/E/Pr obtains nearly a 60% reduction in pairwise F1 error over the pairwise scorer P/A/L. Similarly, in Table 5, C/E/Mr obtains a 35% reduction in pairwise F1 error over P/A/L. While the perceptron does well on most of the datasets, it performs poorly on the DBLP data (C/E/Pr, Table 6). Because the perceptron update does not constrain the incorrect prediction to be below the classification threshold, the resulting clustering algorithm can over-merge authors. MIRA's additional constraint (S(N̂, Λ) < τ) addresses this issue.

In conclusion, these results indicate that simply increasing representational power by using a clusterwise scoring function may not improve performance unless appropriate parameter estimation methods are used. The experiments on these three datasets suggest that error-driven, rank-based estimation is an effective method for training a clusterwise scoring function.

Related Work

There has been considerable interest in the problem of author disambiguation (Etzioni et al. 2004; Dong et al. 2004; Han et al. 2004; Torvik et al. 2005; Kanani, McCallum, & Pal 2007); most approaches perform pairwise classification followed by clustering. Han, Zha, & Giles (2005) use spectral clustering to partition the data. More recently, Huang, Ertekin, & Giles (2006) use an SVM to learn a similarity metric, along with a version of the DBScan clustering algorithm. Unfortunately, we are unable to perform a fair comparison with their method because the data is not yet publicly available. On et al. (2005) present a comparative study using co-author and string similarity features. Bhattacharya & Getoor (2006) show surprisingly good results using unsupervised learning.

There has also been recent interest in training methods that enable the use of global scoring functions. Perhaps the most closely related is "learning as search optimization" (LaSO) (Daumé III & Marcu 2005). Like the current paper, LaSO is an error-driven training method that integrates prediction and training. However, whereas we explicitly use a ranking-based loss function, LaSO uses a binary classification loss function that labels each candidate structure as correct or incorrect. Thus, each LaSO training example contains all candidate predictions, whereas our training examples contain only the highest-scoring incorrect prediction and the highest-scoring correct prediction. Our experiments show the advantages of this ranking-based loss function. Additionally, we provide an empirical study to quantify the effects of different example generation and loss function decisions.

Conclusions and Future Work

We have proposed a more flexible representation for author disambiguation models and described parameter estimation methods tailored to this new representation. We have performed an empirical analysis of these methods on three real-world datasets, and the experiments support our claims that error-driven, rank-based training of the new representation can improve accuracy. In future work, we plan to investigate more sophisticated prediction algorithms that alleviate the greediness of local search, and to consider representations using features over entire clusterings.

Acknowledgements

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) through the Department of the Interior, NBC, Acquisition Services Division, under contract #NBCHD030010; in part by U.S. Government contract #NBCH040171 through a subcontract with BBNT Solutions LLC; in part by the Central Intelligence Agency, the National Security Agency, and the National Science Foundation under NSF grant #IIS0326249; in part by Microsoft Live Labs; and in part by the Defense Advanced Research Projects Agency (DARPA) under contract #HR0011-06-C-0023.

References

Amit, B., and Baldwin, B. 1998. Algorithms for scoring coreference chains. In Proceedings of MUC7.
Bhattacharya, I., and Getoor, L. 2006. A latent Dirichlet model for unsupervised entity resolution. In SDM.
Censor, Y., and Zenios, S. 1997. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press.
Crammer, K., and Singer, Y. 2003. Ultraconservative online algorithms for multiclass problems. JMLR 3:951–991.
Daumé III, H., and Marcu, D. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In ICML.
Dong, X.; Halevy, A. Y.; Nemes, E.; Sigurdsson, S. B.; and Domingos, P. 2004. SEMEX: Toward on-the-fly personal information integration. In IIWEB.
Etzioni, O.; Cafarella, M.; Downey, D.; Kok, S.; Popescu, A.; Shaked, T.; Soderland, S.; Weld, D.; and Yates, A. 2004. Web-scale information extraction in KnowItAll. In WWW. ACM.
Freund, Y., and Schapire, R. E. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3):277–296.
Han, H.; Giles, L.; Zha, H.; Li, C.; and Tsioutsiouliklis, K. 2004. Two supervised learning approaches for name disambiguation in author citations. In JCDL, 296–305. ACM Press.
Han, H.; Zha, H.; and Giles, L. 2005. Name disambiguation in author citations using a k-way spectral clustering method. In JCDL.
Huang, J.; Ertekin, S.; and Giles, C. L. 2006. Efficient name disambiguation for large-scale databases. In PKDD, 536–544.
Kanani, P.; McCallum, A.; and Pal, C. 2007. Improving author coreference by resource-bounded information gathering from the web. In Proceedings of IJCAI.
On, B.-W.; Lee, D.; Kang, J.; and Mitra, P. 2005. Comparative study of name disambiguation problem using a scalable blocking-based framework. In JCDL, 344–353. New York, NY, USA: ACM Press.
Torvik, V. I.; Weeber, M.; Swanson, D. R.; and Smalheiser, N. R. 2005. A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology 56(2):140–158.
Vilain, M.; Burger, J.; Aberdeen, J.; Connolly, D.; and Hirschman, L. 1995. A model-theoretic coreference scoring scheme. In Proceedings of MUC6, 45–52.