sCooL: A System for Academic Institution Name Normalization
Ferosh Jacob, Faizan Javed, Meng Zhao, and Matt McNair
Classification R & D, CareerBuilder

Slide 2. Presentation overview
About sCooL
  ◦ What is entity normalization?
  ◦ Why is academic entity normalization important?
  ◦ What are the academic entity normalization challenges?
Inside sCooL
  ◦ A high-level overview of the core components
  ◦ Atlas, the mapping manager
Evaluating sCooL
  ◦ Comparing sCooL with the existing implementation
  ◦ Independent evaluation of sCooL
Concluding remarks
  ◦ Demo
  ◦ Questions?

Slide 3. About sCooL: Academic entity normalization facts
Facts
  ◦ 7,021 post-secondary Title IV institutions in 2010-11*
  ◦ 200 Million / 12 Million: unique visitors @ CB U.S. / unique academic institution entries in the CB resume database
* http://nces.ed.gov/fastfacts/display.asp?id=84

Slide 4. About sCooL: Academic entity normalization definition

No.   Name (surface form)   Frequency
1                           410
2                           139
3                           131
4                           6
5                           1
6                           1
7                           1
8                           1
9                           1
10                          1

Entity:

Slide 5. About sCooL: Why academic entity normalization
  ◦ Improved searching
  ◦ Labor market dynamics insights

Slide 6. About sCooL: Academic entity normalization challenges

No.   Name (surface form)             Frequency
1     Salford College                 410
2     Salford College of Technology   139
3     Salford City College            131
4     Salford Uni                     6
5     Salford University -            1
6     The University of Salford.      1
7     Salford University **+          1
8     University of Salford 1982      1
9     =- University OF SALFORD        1
10    University of Salford-          1

Entity: Salford City College, Merchants Quay, Salford Quays, United Kingdom
Entity: University of Salford, Salford, Lancashire, United Kingdom
Entity: Salford College, 68 Grenfell Street, Adelaide, Australia

How will you identify the most accurate normalization from a given surface form?

Slide 7. About sCooL: Academic entity normalization challenges (continued)
String similarity algorithms
  ◦ Edit distance
      Salford university -> Salford Unevarsity (edit distance 2; a spelling error)
      St. Loye’s College -> St. Luke’s College (edit distance 2; two different academic institutions)

How will you distinguish spelling or typing errors from mappings to two different institutions?

Slide 8. About sCooL: Academic entity normalization challenges
Legacy names (mergers)
  ◦ University of Central England in Birmingham is an old name of Birmingham City University.
  ◦ In January 2009, Salford College merged with Eccles College and Pendleton College to form Salford City College.
  ◦ In October 2004, Victoria University of Manchester merged with the University of Manchester Institute of Science and Technology to form The University of Manchester.
Popular names and acronyms
  ◦ Ole Miss is a popular name for The University of Mississippi.
  ◦ MIT is an acronym for Massachusetts Institute of Technology. However, GIT is not used as an acronym for Georgia Institute of Technology; Georgia Tech and Ga Tech are the popular names for that institution.

How will you create and maintain the surface form-entity mappings?

Slide 9. About sCooL: Academic entity normalization challenges
Top 10 frequent universities in the UK dataset

No.   Name                        Frequency
1     N/A                         128976
2     City & Guilds               23992
3     Not Specified               18598
4     City and Guilds             17441
5     Open University             6886
6     MIDDLESEX UNIVERSITY        5490
7     University of East London   5266
8     University of Greenwich     5108
9     CITY UNIVERSITY             4863
10    Kingston University         4856

Institution type   Distribution
College            23.32%
University         16.57%
K-12 school        34.22%
Not sure           25.89%

How can we remove K-12 schools and noise?

Slide 10. About sCooL: Challenges summary
  ◦ How will you identify the most accurate normalization from a given surface form?
  ◦ How will you distinguish spelling or typing errors from mappings to two different institutions?
  ◦ How will you create and maintain the surface form-entity mappings?
  ◦ How can we remove K-12 schools and noise?

Slide 11. Inside sCooL: A high-level overview of the system
[Figure: high-level overview of the system]

Slide 12. Inside sCooL: Atlas, sCooL’s mapping manager
[Figure: diagram with the components sCooL, Lucene, MongoDB, CB mappings, Wikimappings, and Atlas]

Slide 13. Inside sCooL: Refining Lucene results
[Figure: coverage (axis 0 to 0.6) and accuracy (axis 0.82 to 1) plotted against threshold similarity (0 to 2)]

\[ \text{Accuracy} = \frac{\text{True normalizations}}{\text{Non-null normalizations (True + False)}} \]

\[ \text{Coverage} = \frac{\text{True normalizations}}{\text{All normalizations (True + False + Null)}} \]

Slide 14. Evaluation: Comparing sCooL with existing implementation
  ◦ Targeted metrics: accuracy and coverage
  ◦ Precision is more important than recall
  ◦ Stratified sampling to estimate the ratios
  ◦ Favor high-frequency queries in sampling

Slide 15. Evaluation: Comparing sCooL with existing implementation
Sampling design

\[ \Pr\left( \frac{|p_i - P_i|}{P_i} < h_i \right) = C \]

\[ n_0 = \frac{Z_{\alpha}^{2}\, P_i (1 - P_i)}{h_i^{2}} \]

\[ n_i = \frac{n_0}{1 + (n_0 - 1)/N_i} \]

\[ p = \frac{\sum_{i=1}^{3} N_i p_i}{\sum_{i} N_i} \]

[Figure: pie chart of the three frequency groups: [1, 6] 91%, [7, 39] 7%, [40, max] 2%]

Slide 16. Evaluation: Comparing sCooL with existing implementation
Dataset: UK CareerBuilder data

Groups                     [1, 6]    [7, 39]   [40, max]   Total
Group size                 145,126   11,938    3,896       160,960
Sample size                780       736       653         2,169
Sampling rate              1%        6%        17%         1%
sCooL accuracy             92%       96%       95%         95%
Existing system accuracy   75%       79%       85%         80%

                    sCooL   Existing system
Coverage            40%     1%
Weighted coverage   73%     46%

Slide 17. Evaluation: Independent evaluation of sCooL
  ◦ Test 1, 4ICU university list: the 4ICU [22] website lists 145 popular universities and colleges in the U.K.
  ◦ Test 2, Guardian university list: The Guardian [23] publishes a list of 135 universities in the U.K.

                 Accuracy                  Coverage
Dataset          sCooL   Existing system   sCooL   Existing system
Test 1 (145)     93%     91%               95%     79%
Test 2 (135)     93%     88%               90%     72%

Slide 18. sCooL: Demo
Atlas: http://ec2-54-193-1-73.us-west-1.compute.amazonaws.com/Atlas/

Slide 19. sCooL: Questions

Slide 20. sCooL: Appendix. Lucene search results for “University of Milan”

Rank   Searchable field                  Display name
1      polytechnic university of milan   Polytechnic University of Milan
2      university of milan               University of Milan
3      catholic university of milan      Università Cattolica del Sacro Cuore
4      iulm university of milan          IULM University of Milan
5      university of milan bicocca       University of Milan Bicocca
6      milan university                  University of Milan
7      politecnico of milan              Polytechnic University of Milan
8      milan polytechnic                 Polytechnic University of Milan

Slide 21. sCooL: Appendix. String similarity algorithms

Rank   String similarity algorithm
1      Levenshtein
2      Lucene Levenshtein
3      N-gram
4      Jaccard Similarity
5      Jaro Winkler
6      Hamming
7      Equals
8      Ignore case Equals

Slide 22. Evaluation: Comparing sCooL with existing implementation
Balancing between accuracy and coverage
[Figure: counts of correct, wrong, and null normalizations out of the total input queries (axis 0 to 7,000) against threshold similarity (0 to 2), shown alongside coverage (axis 0 to 0.6) and accuracy (axis 0.82 to 1) against threshold similarity (0 to 2)]

Slide 23. About sCooL: Related work
  ◦ Cucerzan, S. (Microsoft Research): large-scale disambiguation using Wikipedia data, 2007
  ◦ Jijkoun, V. et al. (Univ. of Amsterdam): named entity normalization (NEN) in user-generated content, 2008
  ◦ Liu, X. et al. (Microsoft Research, China): joint inference on named entity recognition (NER) and NEN for tweets, 2012
  ◦ Magdy, W. et al. (IBM, Egypt): NEN for Arabic names, 2007
  ◦ Jonnalagadda, S. et al. (Lnx Research, CA): NEMO, a NER and NEN system for PubMed author affiliations, 2011
  ◦ Cohen, A. (OHSU): gene/protein NEN using automatically generated libraries, 2005
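
Sketch: edit distance on the challenge examples
The edit-distance challenge (a misspelling and a different institution both sit at distance 2) can be reproduced with a textbook Levenshtein implementation. This is a minimal sketch, not sCooL's code; lowercasing before comparison is an assumption made here so that case differences are not counted as edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


pairs = [
    ("Salford university", "Salford Unevarsity"),   # spelling error
    ("St. Loye's College", "St. Luke's College"),   # two different institutions
]
for left, right in pairs:
    # Assumption: compare case-insensitively.
    print(f"{left} -> {right}: {levenshtein(left.lower(), right.lower())}")
```

Both pairs come out at the same distance, which is why edit distance alone cannot separate typing errors from genuinely different institutions.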
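
Sketch: cleaning noisy surface forms before a mapping lookup
The Salford surface forms contain stray punctuation, mixed case, and trailing symbols. One plausible first step, sketched below, is to clean the string and look it up in a surface form-entity mapping. The mapping table `HYPOTHETICAL_MAPPINGS` and both function names are illustrative assumptions, not the contents or API of Atlas.

```python
import re
from typing import Optional

# Hypothetical mapping from cleaned surface forms to entity names.
HYPOTHETICAL_MAPPINGS = {
    "university of salford": "University of Salford",
    "the university of salford": "University of Salford",
    "salford university": "University of Salford",
    "salford uni": "University of Salford",
    "salford city college": "Salford City College",
}


def clean_surface_form(raw: str) -> str:
    """Lowercase and turn every run of non-alphanumeric characters into one space.

    Assumption: ASCII-only cleaning; accented letters would need extra handling.
    """
    return re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()


def normalize(raw: str) -> Optional[str]:
    """Return the mapped entity, or None (a null normalization) when unmapped."""
    return HYPOTHETICAL_MAPPINGS.get(clean_surface_form(raw))


for surface in ["=- University OF SALFORD", "Salford University **+",
                "University of Salford 1982"]:
    print(repr(surface), "->", normalize(surface))
```

Exact lookup after cleaning handles punctuation noise but not the trailing year in "University of Salford 1982", which is where fuzzy search and a similarity threshold would come in.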
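
Sketch: accuracy and coverage at a similarity threshold
The accuracy and coverage definitions from "Refining Lucene results" can be computed from labelled results once a threshold decides which candidates become null normalizations. The sketch below assumes a candidate is accepted when its score is at least the threshold; the scores and labels are made-up illustration data, and the direction of the comparison in sCooL itself is not stated in the slides.

```python
from typing import List, Optional, Tuple

# (similarity score of the best candidate, whether that candidate is the correct entity)
ScoredResult = Tuple[float, bool]


def accuracy_and_coverage(
    results: List[ScoredResult], threshold: float
) -> Tuple[Optional[float], Optional[float]]:
    """Accuracy = True / (True + False); Coverage = True / (True + False + Null)."""
    true = false = null = 0
    for score, is_correct in results:
        if score < threshold:   # assumption: below threshold means "return null"
            null += 1
        elif is_correct:
            true += 1
        else:
            false += 1
    non_null = true + false
    total = non_null + null
    accuracy = true / non_null if non_null else None
    coverage = true / total if total else None
    return accuracy, coverage


# Made-up labelled results for illustration only.
sample = [(0.95, True), (0.90, True), (0.70, True), (0.65, False), (0.40, False)]
for threshold in (0.5, 0.8):
    print(threshold, accuracy_and_coverage(sample, threshold))
```

On this toy data, raising the threshold removes the wrong answers (accuracy goes up) but also discards a correct one (coverage goes down), which is the trade-off the threshold plots describe.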
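
Sketch: sample sizes and the stratified estimate
The "Sampling design" formulas (initial sample size n0, its finite-population correction n_i for a group of size N_i, and the size-weighted estimate p) translate directly into a few lines. The z-value, expected proportion, margin, group sizes, and per-group proportions used below are illustrative assumptions, not the parameters behind the reported sample sizes.

```python
from typing import Sequence


def initial_sample_size(z: float, p: float, h: float) -> float:
    """n0 = z^2 * p * (1 - p) / h^2."""
    return z ** 2 * p * (1 - p) / h ** 2


def corrected_sample_size(n0: float, group_size: int) -> float:
    """n_i = n0 / (1 + (n0 - 1) / N_i), the finite-population correction."""
    return n0 / (1 + (n0 - 1) / group_size)


def stratified_estimate(group_sizes: Sequence[int],
                        group_proportions: Sequence[float]) -> float:
    """p = sum(N_i * p_i) / sum(N_i)."""
    weighted = sum(n * p for n, p in zip(group_sizes, group_proportions))
    return weighted / sum(group_sizes)


# Illustrative inputs (assumptions): z = 1.96, P = 0.5, h = 0.05, N_i = 1000.
n0 = initial_sample_size(z=1.96, p=0.5, h=0.05)
n_i = corrected_sample_size(n0, group_size=1000)
overall = stratified_estimate([100, 50, 50], [0.9, 0.8, 0.7])
print(round(n0, 2), round(n_i, 2), overall)
```

The correction matters most for small groups: the smaller N_i is relative to n0, the further n_i falls below n0.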
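
Sketch: two set-based similarity measures
The appendix lists N-gram and Jaccard similarity among the string similarity algorithms. As a rough illustration of how set-based measures behave on institution names, the sketch below computes Jaccard similarity over word tokens and over character bigrams. Definitions of "N-gram similarity" vary between libraries, so the bigram variant here is one assumed formulation rather than the one sCooL uses.

```python
from typing import Set


def jaccard(left: Set[str], right: Set[str]) -> float:
    """|intersection| / |union|; two empty sets are treated as identical."""
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


def word_tokens(text: str) -> Set[str]:
    """Lowercased whitespace-separated tokens."""
    return set(text.lower().split())


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of character n-grams of the lowercased text."""
    lowered = text.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}


# Word order does not matter for token-level Jaccard.
print(jaccard(word_tokens("University of Salford"), word_tokens("Salford University")))
# Character bigrams tolerate a transposition typo to some degree.
print(jaccard(char_ngrams("salford"), char_ngrams("salfrod")))
```

Token-level Jaccard ignores word order ("Salford University" versus "University of Salford"), while character n-grams give partial credit to misspellings; neither resolves the case of two different institutions with near-identical names.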