Matching Bibliographic Data from Publication Lists with Large Databases using N-Grams

Mehmet Ali Abdulhayoglu
Bart Thijs

OUTLINE
• Introduction
• Methodology
  ▫ N-gram Notion
  ▫ Levenshtein (Edit) Distance Based on N-grams
  ▫ Kernel Discriminant Analysis
• Results
• Conclusion and Discussions

Introduction
• CVs and publication lists of authors, applicants or institutions are used in evaluative bibliometrics (job promotion, institutional or macro-level assessments).
• These publications usually have to be identified in large databases such as Web of Science or Scopus, and this process requires a lot of manual work.
• Automating the identification of publications in large databases saves time and frees up resources for manual cleaning.
• The main issue is dealing with the different reference standards, such as APA (American Psychological Association), MLA (Modern Language Association), etc.
• The standards may order the components differently: co-author names may be placed at the very beginning of the reference or at its end. Some standards also abbreviate author or journal names.
• Besides diverse standards, there may be incomplete, erroneous or censored data in the publication list, erroneous indexing in the database, or changes in the publication (title, number or sequence of co-authors, publication year, ...).
• To capture the textual similarity between the CV references and the publications indexed in bibliometric databases, we applied the notion of N-grams.

N-Grams
Example: The diffusion of H-related literature
• Word N-grams: adjacent sequences of n words from a given string
  (the diffusion of) (diffusion of h-related) (of h-related literature)
• Character N-grams: adjacent sequences of n characters from a given string
  (__t) (_th) (the) (he_) (e_d) (_di) (dif) and so on
• Word N-grams are suitable for full-text studies but not convenient for this study.
• Character N-grams are powerful and handy, especially for short texts, and need no stemming.
• 3-grams are chosen considering the lengths of the components (e.g. author names, publication year, page).
• Similarity is measured with the method of Kondrak (2005): Levenshtein (edit) distance based on character N-grams.

Modified Levenshtein Distance
• The Levenshtein distance is the minimum number of single-character edits needed to change one string into another (Levenshtein, 1966).
• Operations: Add, Remove, Change
• Kondrak (2005) extended this notion by using N-grams instead of single characters.
• For the N-gram based edit distance between strings $x$ and $y$, a matrix $D$ is constructed where $D_{i,j}$ is the minimum number of edit operations needed to match $x_{1..i}$ to $y_{1..j}$:

$$D_{i,j} = \min \begin{cases} D_{i-1,j} + 1 & \text{(Remove)} \\ D_{i,j-1} + 1 & \text{(Add)} \\ D_{i-1,j-1} + d_N(i,j) & \text{(Change)} \end{cases}$$

  where $d_N(i,j)$ is the cost of changing the $i$-th N-gram of $x$ into the $j$-th N-gram of $y$ (0 if they are identical; 0.33 in the example when one of the three characters differs).

Levenshtein Distance
Example: "diff." (rows) against "diffusion" (columns), 3-grams

                d     i     f     f     u     s     i     o     n
                __d   _di   dif   iff   ffu   fus   usi   sio   ion
          0     1     2     3     4     5     6     7     8     9
d  __d    1     0     1     2     3     4     5     6     7     8
i  _di    2     1     0     1     2     3     4     5     6     7
f  dif    3     2     1     0     1     2     3     4     5     6
f  iff    4     3     2     1     0     1     2     3     4     5
.  ff.    5     4     3     2     1     0.33  1.33  2.33  3.33  4.33

Features of the Approach
• Ordering is crucial. For example, the similarity between
  ▫ "the diffusion" and "the diff." is 0.66
  ▫ "the diffusion" and "diff. the" is 0.31
• When the order of the N-grams is ignored, one can find two different strings (e.g. Xanex and Nexan) that have exactly the same N-gram decomposition, which would give them maximum similarity.

Application
• As can be expected, publication lists provide detailed bibliographic information about the publications: title, journal title, names of the author and co-author(s), publication year, volume, and first and last page.
• Each variable compares a string built from a subset of these components. Numbers give the position of each component in the compared string; "-" marks a component that is left out.

Variables   Title   Journal   Co-authors   Volume   Begin Page   Publication Year
SCORE1      2       3         1            4        5            6
SCORE2      2       -         1            3        4            5
SCORE3      1       2         -            3        4            5
SCORE4      1       -         2            -        -            -
SCORE5      1       -         -            -        -            -
SCORE6      1       2         -            -        -            -
SCORE7      -       1         2            -        -            -
SCORE8      -       1         -            -        -            -

Example reference from a CV:
  Zhang, L., Thijs, B., Glänzel, W., The diffusion of H-related literature. JOI, 2011, 5 (4), 583-593

3-gram scores of the CV reference against the WoS strings:

Variable   Score   WoS string
SCORE1     0.67    ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature, JOURNAL OF INFORMETRICS, 5, 583, 2011
SCORE2     0.73    ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature, 5, 583, 2011
SCORE3     0.34    The diffusion of H-related literature, JOURNAL OF INFORMETRICS, 5, 583, 2011
SCORE4     0.65    ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature
SCORE5     0.36    The diffusion of H-related literature
SCORE6     0.41    The diffusion of H-related literature, JOURNAL OF INFORMETRICS
SCORE9     0.73    (maximum)

• Using these scores, we would like to decide whether the publication is indexed in the given database. Discriminant analysis is a convenient tool for this purpose.

Kernel Discriminant Analysis
• Since no distributional assumptions are made for the discriminant analysis, a non-parametric method is applied.
• Using a (normal) kernel function, it handles a nonlinear mapping through a linear mapping in a feature space.
• As a result, it is based on estimating a non-parametric density function for the observations.
• There exists a smoothing parameter r, which determines the degree of irregularity in the density estimate.
  As suggested in Khattree and Naik (2000), several values of r were tried to reach the optimal solution.
• Two sets of variables are analysed:
  ▫ Set 1: SCORE1, SCORE2, SCORE3, SCORE4, SCORE5 and SCORE6
  ▫ Set 2: SCORE9, SCORE5 and SCORE8
• The first set is chosen to examine the variables that all include the "Title" component and its variations; the second is chosen as a relatively more independent set consisting of "Maximum", "Title" and "Journal Name".

Data
• Training set
  ▫ 6525 real pairs from applicants' CVs, matched correctly by hand (Group 1)
  ▫ 3 x 6525 randomly generated unmatched pairs for the same publications as in Group 1 (Group 0)
• Test set
  ▫ 2570 new pairs to be classified, completely different from the ones in the training set
  ▫ The publications are queried against a sample of WoS data of size 7387

Results
Accuracy, false negatives and false positives (%) for different values of r

      Set 1                                         Set 2
r     accuracy  false negatives  false positives    accuracy  false negatives  false positives
2     92.96     4.48             6.20               91.13     2.63             10.15
3     90.3      13.9             1.20               94.9      5.68             2.23
4     78.25     33.31            0.20               90.97     13.33            0.62
5     60.62     60.53            0                  82.33     27.03            0.16

• false positive: wrongly assigned to Group 1
• false negative: wrongly assigned to Group 0

Classification of the test set (the counts correspond to Set 2 with r = 3)

              Estimated 0   Estimated 1   Total
Observed 0    862           36            898
Observed 1    95            1577          1672
Total         957           1613          2570

• false positives: 36
• false negatives: 95
• 94.3% of the publications in Group 1 are classified correctly
• 97.8% of the publications estimated to be in Group 1 are classified correctly
• Even though the vast majority of the publications in Group 1 are classified correctly:
  ▫ Group 1 pairs with similarity scores between 0.30 and 0.45 end up as false negatives
  ▫ Group 0 pairs with similarity scores higher than 0.45 end up as false positives

Conclusion
• With the proposed model, about 95% correct classification is achieved.
• Matches with a similarity score of 0.6267 or higher indicate a precise, correct match.
• For bibliometric evaluation processes, the model will help to reduce the manual work.
However...

Parts not to be ruled out!
• Only for papers in English
• "False positives" remain an issue to be solved
• Tolerance for "false positives" depends on the application (micro- vs. macro-level studies)

Thank you!
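
Backup: N-gram edit distance sketch
The N-gram edit distance from the "Modified Levenshtein Distance" slide could be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the "_" padding symbol, the per-position Change cost and the normalisation by the longer string's length are all assumptions; they were chosen because they reproduce the 4.33 in the example matrix.

```python
def ngrams(text, n=3, pad="_"):
    """Positional character n-grams of `text`, prefix-padded with n-1 pad symbols."""
    padded = pad * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def gram_cost(a, b):
    """Change cost (assumed): fraction of positions where two n-grams differ."""
    return sum(ca != cb for ca, cb in zip(a, b)) / len(a)


def ngram_distance(x, y, n=3):
    """Edit distance over n-grams, filled row by row (dynamic programming)."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    prev = [float(j) for j in range(len(gy) + 1)]  # row 0: j Add operations
    for i, a in enumerate(gx, start=1):
        curr = [float(i)]  # column 0: i Remove operations
        for j, b in enumerate(gy, start=1):
            curr.append(min(
                prev[j] + 1.0,                  # Remove
                curr[j - 1] + 1.0,              # Add
                prev[j - 1] + gram_cost(a, b),  # Change (0 if identical)
            ))
        prev = curr
    return prev[-1]


def ngram_similarity(x, y, n=3):
    """Similarity in [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - ngram_distance(x, y, n) / longest
```

Under these assumptions, ngram_distance("diff.", "diffusion") comes out at about 4.33, the last cell of the example matrix, and ngram_similarity("the diffusion", "the diff.") at about 0.67. Case folding and punctuation handling are left to the caller.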
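
Backup: kernel discriminant rule sketch
The decision rule behind the "Kernel Discriminant Analysis" slides can be illustrated with a small Python sketch. It estimates one density per group with a spherical normal kernel of radius r and assigns a score vector to the group with the larger prior-weighted density. This is a simplification under stated assumptions: it uses plain Euclidean distance on made-up toy scores, so its r is not on the same scale as the r values in the Results table, and the function names and data are hypothetical.

```python
import math


def kernel_density(x, sample, r):
    """Average of spherical normal kernels of radius r centred on each sample point."""
    p = len(x)
    norm = (2.0 * math.pi * r * r) ** (p / 2.0)
    total = 0.0
    for point in sample:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, point))
        total += math.exp(-0.5 * sq_dist / (r * r))
    return total / (len(sample) * norm)


def classify(x, groups, r, priors=None):
    """Return the label whose prior-weighted kernel density at x is largest.

    groups maps a label to a list of score vectors (the training pairs).
    Priors default to the group proportions in the training data.
    """
    if priors is None:
        n = sum(len(sample) for sample in groups.values())
        priors = {label: len(sample) / n for label, sample in groups.items()}
    return max(groups, key=lambda g: priors[g] * kernel_density(x, groups[g], r))


# Hypothetical toy training data: two similarity scores per pair.
toy_groups = {
    1: [(0.70, 0.80), (0.65, 0.75), (0.80, 0.90)],  # matched pairs (Group 1)
    0: [(0.10, 0.20), (0.15, 0.10), (0.20, 0.25)],  # unmatched pairs (Group 0)
}

print(classify((0.72, 0.80), toy_groups, r=0.1))  # lands in the matched group
print(classify((0.12, 0.18), toy_groups, r=0.1))  # lands in the unmatched group
```

A small r makes the density estimate follow individual training points closely, while a large r smooths it out, which is why several values of r are worth comparing on held-out pairs.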