sCooL: A System for Academic Institution Name Normalization

Ferosh Jacob, Faizan Javed, Meng Zhao, and Matt McNair
Classification R & D
CareerBuilder
1

About sCooL
◦ What is entity normalization?
◦ Why is academic entity normalization important?
◦ What are the academic entity normalization challenges?

Inside sCooL
◦ A high-level overview of the core components
◦ Atlas- the mapping manager

Evaluating sCooL
◦ Comparing sCooL with existing implementation
◦ Independent evaluation of sCooL

Concluding remarks
◦ Demo
◦ Questions?
Presentation overview
2

Facts
7,021
post-secondary Title IV institutions in 2010-11*
200 Million
12 Million
unique visitors @ CB U.S
unique academic institutions entries in CB resume database
About sCooL:
Academic entity normalization facts
*http://nces.ed.gov/fastfacts/display.asp?id=84
3
Surface forms and their frequencies (names not captured in this slide's text):
No.        1    2    3    4  5  6  7  8  9  10
Frequency  410  139  131  6  1  1  1  1  1  1
Entity:
About sCooL:
Academic entity normalization definition
4
Improved Searching
Labor market dynamics insights
About sCooL:
Why academic entity normalization?
5
No.  Name (surface form)            Frequency
1    Salford College                410
2    Salford College of Technology  139
3    Salford City College           131
4    Salford Uni                    6
5    Salford University -           1
6    The University of Salford.     1
7    Salford University **+         1
8    University of Salford 1982     1
9    =- University OF SALFORD       1
10   University of Salford-         1
Entity:
Salford City College
Merchants Quay, Salford Quays
United Kingdom
Entity:
University of Salford
Salford, Lancashire
United Kingdom
Entity:
Salford College
68 Grenfell Street, Adelaide
Australia
How will you identify the most accurate
normalization for a given surface form?
About sCooL:
Academic entity normalization challenges
6
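Several of the low-frequency surface forms above differ only in case, stray symbols, or a trailing year. A minimal sketch of a cleanup step, assuming simple regular-expression rules (the function name and the rules are illustrative assumptions, not sCooL's actual preprocessing):

```python
import re

def clean_surface_form(raw: str) -> str:
    """Lowercase, drop standalone years and stray symbols, collapse whitespace.

    Hypothetical helper for illustration; the slides do not describe
    how sCooL cleans its input.
    """
    text = raw.lower()
    text = re.sub(r"\b(19|20)\d{2}\b", " ", text)  # standalone years such as 1982
    text = re.sub(r"[^a-z0-9&' ]+", " ", text)     # symbols such as =- ** + .
    return " ".join(text.split())

examples = [
    "Salford University -",
    "The University of Salford.",
    "Salford University **+",
    "University of Salford 1982",
    "=- University OF SALFORD",
    "University of Salford-",
]
cleaned = [clean_surface_form(s) for s in examples]
```

Cleanup alone still leaves three distinct strings ("salford university", "the university of salford", "university of salford"), so a mapping or similarity step is still needed to reach a single entity.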

String similarity algorithms
◦ Edit distance
   ▪ Salford university -> Salford Unevarsity (edit distance 2): a spelling error
   ▪ St. Loye’s College -> St. Luke’s College (edit distance 2): two different academic institutions
How will you distinguish a spelling or typing error from
a genuinely different institution?
About sCooL:
Academic entity normalization challenges (continued)
7
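The two edit-distance examples can be checked with a short sketch of the standard dynamic-programming Levenshtein distance. The strings are lowercased before comparison, which is an assumption made here so that the capital "U" in "Unevarsity" does not count as an extra edit.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

typo_distance = levenshtein("salford university", "salford unevarsity")
different_distance = levenshtein("st. loye's college", "st. luke's college")
```

Both calls return 2, so the distance by itself cannot tell the typo apart from the pair of different institutions.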

Legacy names (Mergers)
◦ University of Central England in Birmingham is an old name of Birmingham City
University
◦ In January 2009, Salford College merged with Eccles College and Pendleton College to
form Salford City College
◦ In October 2004, Victoria University of Manchester merged with the University of Manchester
Institute of Science and Technology to form The University of Manchester

Popular names and Acronyms
◦ Ole Miss is a popular name for The University of Mississippi
◦ MIT is an acronym for Massachusetts Institute of Technology. However, GIT is not used as an
acronym for Georgia Institute of Technology; Georgia Tech and Ga Tech are the popular
names for that institution.
How will you create and maintain the
surface form-entity mappings?
About sCooL:
Academic entity normalization challenges
8
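Legacy names, popular names, and acronyms cannot be recovered by string similarity, so they need explicit surface form-entity mappings. A minimal sketch of such a lookup, using only the examples from this slide (the dictionary and function names are hypothetical and do not reflect how Atlas stores its mappings):

```python
# Hypothetical alias table built from the examples on this slide.
CANONICAL_BY_ALIAS = {
    "university of central england in birmingham": "Birmingham City University",
    "eccles college": "Salford City College",
    "pendleton college": "Salford City College",
    "victoria university of manchester": "The University of Manchester",
    "university of manchester institute of science and technology":
        "The University of Manchester",
    "ole miss": "The University of Mississippi",
    "mit": "Massachusetts Institute of Technology",
    "georgia tech": "Georgia Institute of Technology",
    "ga tech": "Georgia Institute of Technology",
}

def lookup_alias(surface_form: str):
    """Return the mapped entity name, or None when no mapping is known."""
    key = " ".join(surface_form.lower().split())
    return CANONICAL_BY_ALIAS.get(key)

ole_miss = lookup_alias("Ole Miss")
git = lookup_alias("GIT")  # deliberately unmapped, as the slide notes
```

"Salford College" is left out of the table on purpose: it is both a legacy name of Salford City College and the name of a separate institution in Adelaide, so it needs more context than a flat lookup can provide.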
Top 10 frequent universities in UK dataset
No.  Name                       Frequency
1    N/A                        128976
2    City & Guilds              23992
3    Not Specified              18598
4    City and Guilds            17441
5    Open University            6886
6    MIDDLESEX UNIVERSITY       5490
7    University of East London  5266
8    University of Greenwich    5108
9    CITY UNIVERSITY            4863
10   Kingston University        4856

Institution type distribution
College      23.32%
University   16.57%
K-12 school  34.22%
Not sure     25.89%
How can we remove K-12
schools and noise?
About sCooL:
Academic entity normalization challenges
9
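One simple way to approach the question above is a keyword filter applied before normalization. The sketch below is only an assumption for illustration: the placeholder values and K-12 keywords are made up here, and the slides do not say how sCooL filters its input.

```python
# Assumed placeholder values and K-12 keywords; not taken from sCooL.
NOISE_VALUES = {"n/a", "na", "none", "not specified"}
K12_HINTS = ("high school", "primary school", "secondary school", "grammar school")

def classify_entry(name: str) -> str:
    """Label an entry as 'noise', 'k12', or 'candidate' for normalization."""
    key = " ".join(name.lower().split())
    if not key or key in NOISE_VALUES:
        return "noise"
    if any(hint in key for hint in K12_HINTS):
        return "k12"
    return "candidate"

labels = [classify_entry(n) for n in ["N/A", "Not Specified", "Kingston University"]]
```

A keyword list like this will miss many schools and may misfire on institutions whose names contain a keyword, so it is a starting point rather than a complete answer.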

How will you identify the most accurate
normalization for a given surface form?

How will you distinguish a spelling or typing error from
a genuinely different institution?

How will you create and maintain the surface form-entity mappings?

How can we remove K-12 schools and noise?
About sCooL:
Challenges summary
10
Inside sCooL:
A high-level overview of the system
11
[Diagram components: sCooL, Lucene, MongoDB, CB mappings, Wiki mappings, Atlas]
Inside sCooL:
Atlas- sCooL’s mapping manager
12
[Figure: Coverage and Accuracy plotted against threshold similarity (0 to 2)]

\[
\text{Accuracy} = \frac{\text{True normalizations}}{\text{Non-null normalizations (True + False)}}
\]

\[
\text{Coverage} = \frac{\text{True normalizations}}{\text{All normalizations (True + False + Null)}}
\]
Inside sCooL:
Refining Lucene results
13
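The two metrics translate directly into code. A minimal sketch, with made-up counts used only to show the arithmetic:

```python
def accuracy(true_count: int, false_count: int) -> float:
    """True normalizations divided by non-null normalizations."""
    non_null = true_count + false_count
    return true_count / non_null if non_null else 0.0

def coverage(true_count: int, false_count: int, null_count: int) -> float:
    """True normalizations divided by all queries, including null results."""
    total = true_count + false_count + null_count
    return true_count / total if total else 0.0

# Illustrative counts, not taken from the evaluation.
example_accuracy = accuracy(45, 5)
example_coverage = coverage(45, 5, 50)
```

With 45 true, 5 false, and 50 null results, accuracy is 0.9 while coverage is only 0.45: returning null instead of a doubtful match protects accuracy at the cost of coverage.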
Targeted metrics: Accuracy & Coverage
Precision is more important than Recall
Stratified sampling to estimate ratios
Favor high-frequency queries in sampling
Evaluation:
Comparing sCooL with existing implementation
14
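Sampling design

[Pie chart of the three frequency groups: [1, 6] 91%, [7, 39] 7%, [40, max] 2%]

\[
\Pr\left( \left| \frac{p_i - P_i}{P_i} \right| < h_i \right) = C
\]

\[
n_0 = \frac{Z_\alpha^2 \, P_i (1 - P_i)}{h_i^2}
\]

\[
n_i = \frac{n_0}{1 + (n_0 - 1)/N_i}
\]

\[
p = \sum_{i=1}^{3} \frac{N_i}{\sum_{j=1}^{3} N_j} \, p_i
\]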
Evaluation:
Comparing sCooL with existing implementation
15
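The sample-size formulas can be sketched as follows. The values Z = 1.96, P = 0.5, and h = 0.035 are assumptions that do not appear in the slides; they are chosen because they give n0 = 784 and reproduce the sample sizes reported for the three groups in the evaluation table (780, 736, 653).

```python
def base_sample_size(z: float, p: float, h: float) -> float:
    """n0 = z^2 * p * (1 - p) / h^2, before the finite-population correction."""
    return z ** 2 * p * (1 - p) / h ** 2

def stratum_sample_size(n0: float, stratum_size: int) -> int:
    """n_i = n0 / (1 + (n0 - 1) / N_i), rounded to a whole sample."""
    return round(n0 / (1 + (n0 - 1) / stratum_size))

def stratified_estimate(sizes, proportions) -> float:
    """p = sum_i (N_i / N) * p_i, weighting each stratum by its size."""
    total = sum(sizes)
    return sum(n * p for n, p in zip(sizes, proportions)) / total

Z, P, H = 1.96, 0.5, 0.035               # assumed, not stated in the slides
group_sizes = [145_126, 11_938, 3_896]   # groups [1, 6], [7, 39], [40, max]

n0 = base_sample_size(Z, P, H)
sample_sizes = [stratum_sample_size(n0, n) for n in group_sizes]
```

The finite-population correction matters most for the smallest group: [40, max] needs 653 samples out of 3,896 names, while [1, 6] needs only 780 out of 145,126.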
Dataset: UK CareerBuilder data

Groups     Group Size  Sample Size  Sampling Rate  sCooL Accuracy  Existing System Accuracy
[1, 6]     145,126     780          1%             92%             75%
[7, 39]    11,938      736          6%             96%             79%
[40, max]  3,896       653          17%            95%             85%
Total      160,960     2,169        1%             95%             80%

                   sCooL  Existing System
Coverage           40%    1%
Weighted Coverage  73%    46%
Evaluation:
Comparing sCooL with existing implementation
16
Test 1, 4ICU university list: the 4ICU [22] website, 145 popular universities and colleges in the U.K.
Test 2, Guardian university list: The Guardian [23], a list of 135 universities in the U.K.

Dataset       sCooL Accuracy  Existing System Accuracy  sCooL Coverage  Existing System Coverage
Test 1 (145)  93%             91%                       95%             79%
Test 2 (135)  93%             88%                       90%             72%
Evaluation:
Independent evaluation of sCooL
17

Atlas

http://ec2-54-193-1-73.us-west-1.compute.amazonaws.com/Atlas/
sCooL:Demo
18
sCooL:Questions
19
Rank  Searchable field                 Display name
1     polytechnic university of milan  Polytechnic University of Milan
2     university of milan              University of Milan
3     catholic university of milan     Università Cattolica del Sacro Cuore
4     iulm university of milan         IULM University of Milan
5     university of milan bicocca      University of Milan Bicocca
6     milan university                 University of Milan
7     politecnico of milan             Polytechnic University of Milan
8     milan polytechnic                Polytechnic University of Milan
sCooL: Appendix
Lucene search results for “University of Milan”
20
Rank  String similarity algorithm
1     Levenshtein
2     Lucene Levenshtein
3     N-gram
4     Jaccard Similarity
5     Jaro Winkler
6     Hamming
7     Equals
8     Ignore case Equals
sCooL: Appendix
String similarity algorithms
21
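A few of the listed measures are small enough to sketch directly. These are textbook variants written for illustration and are not necessarily the implementations sCooL used; in particular, the n-gram measure here is a Jaccard ratio over character trigrams, which is one common definition among several.

```python
def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def char_ngrams(text: str, n: int = 3) -> set:
    """All character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard ratio over character n-grams (assumed variant)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def equals_ignore_case(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

token_score = jaccard_tokens("University of Salford", "Salford University")
hamming_score = hamming("st. loye's college", "st. luke's college")
```

Token-based Jaccard ignores word order, so "Salford University" and "University of Salford" differ only by the word "of", whereas character-level measures penalize the reordering.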
Balancing between Accuracy and Coverage
[Figure, two panels plotted against threshold similarity (0 to 2): counts of Correct, Wrong, and Null normalizations out of the total input queries (0 to 7000); Coverage and Accuracy]
Evaluation:
Comparing sCooL with existing implementation
22
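The trade-off in these plots can be reproduced in miniature by sweeping a threshold over scored results. The sketch below assumes that a higher score means a better match and that results scoring below the threshold are returned as Null; the slides do not state the score's direction, and the data are made up.

```python
def sweep(results, thresholds):
    """Count Correct/Wrong/Null per threshold and derive accuracy and coverage.

    `results` is a list of (score, is_correct) pairs for the top candidate
    of each query; a score below the threshold yields a Null result.
    """
    rows = []
    for t in thresholds:
        correct = sum(1 for score, ok in results if score >= t and ok)
        wrong = sum(1 for score, ok in results if score >= t and not ok)
        answered = correct + wrong
        rows.append({
            "threshold": t,
            "correct": correct,
            "wrong": wrong,
            "null": len(results) - answered,
            "accuracy": correct / answered if answered else None,
            "coverage": correct / len(results) if results else None,
        })
    return rows

# Made-up scores for five queries, for illustration only.
scored = [(1.8, True), (1.5, True), (1.1, False), (0.9, True), (0.4, False)]
rows = sweep(scored, [0.0, 1.0, 1.6])
```

Under these assumptions, raising the threshold turns wrong answers into nulls, which raises accuracy, but it also discards some correct answers, which lowers coverage.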






Cucerzan, S. (Microsoft Research) worked on large-scale
entity disambiguation using Wikipedia data in 2007
Jijkoun, V. et al. (Univ. of Amsterdam)
proposed NEN for user-generated content in 2008
Liu, X. et al. (Microsoft Research, China)
performed joint inference on NER and NEN for
tweets in 2012
Magdy, W. et al. (IBM, Egypt) developed NEN for
Arabic names in 2007
Jonnalagadda, S. et al. (Lnx Research, CA)
developed NEMO, a NER and NEN system for PubMed
author affiliations, in 2011
Cohen, A. (OHSU) studied gene/protein NEN using
automatically generated libraries in 2005
About sCooL:
Related work
23