Matching Bibliographic Data from
Publication Lists with Large
Databases using N-Grams
Mehmet Ali Abdulhayoglu
Bart Thijs
OUTLINE
• Introduction
• Methodology
▫ N-gram Notion
▫ Levenshtein (Edit) Distance Based on N-grams
▫ Kernel Discriminant Analysis
• Results
• Conclusion and Discussions
Introduction
• CVs and publication lists of authors, applicants or
institutions are used in evaluative bibliometrics
(job promotion, institutional or macro-level
assessments).
• These publications generally have to be identified
in large databases such as Web of Science or Scopus,
and this process requires a lot of manual work.
• Automating the identification of publications in
large databases saves time and frees up resources
for manual cleaning.
Introduction
• The main issue is dealing with the different
reference standards in use, such as APA (American
Psychological Association) and MLA (Modern
Language Association).
• Standards may order the components differently:
co-author names may be placed at the very
beginning of the reference or at the end of it.
Some standards also abbreviate author names or
journal names.
Introduction
• Besides diverse standards, matching is complicated by
incomplete, erroneous or censored data in the
publication list, erroneous indexing in the database,
and changes in the publication (title, number or
sequence of co-authors, publication year…).
• To capture the textual similarity between the CV
references and the indexed publications in
bibliometric databases, we applied the notion of
N-grams.
N-Grams
Example: The diffusion of H-related literature
• Word N-grams: adjacent sequences of n words
from a given string
(the diffusion of) (diffusion of h-related) (of h-related literature)
• Character N-grams: adjacent sequences of n
characters from a given string
(__t) (_th) (the) (he_) (e_d) (_di) (dif) and so on
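The two decompositions above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the lowercasing, the underscore for spaces and the n - 1 leading padding characters are assumptions read off the example on the slide.

```python
def word_ngrams(text, n=3):
    """Return adjacent sequences of n words from the lowercased text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n=3, pad="_"):
    """Return character n-grams, one ending at each character of the text.

    Assumptions: spaces are written as underscores and n - 1 padding
    characters are placed in front of the string.
    """
    padded = pad * (n - 1) + text.lower().replace(" ", pad)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


title = "The diffusion of H-related literature"
print(word_ngrams(title))                  # the three word 3-grams
print(char_ngrams("the diffusion")[:7])    # the first seven character 3-grams
```

Under these assumptions the two calls reproduce the word and character 3-grams listed on the slide.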
N-grams
• Word N-grams are suitable for full-text studies but
not convenient for this study.
• Character N-grams are powerful and handy,
especially for short texts, and need no stemming.
• 3-grams are chosen considering the lengths of the
components (e.g. author names, publication year, page).
• Similarity is measured with the method of Kondrak (2005):
Levenshtein (edit) distance based on character N-grams.
Modified Levenshtein Distance
• The minimum number of single-character edits needed to change
one string into another (Levenshtein, 1966).
• Operations: Add, Remove, Change
• Kondrak (2005) extended this notion by using N-grams instead of
single characters.
• For the N-gram based edit distance between strings x and y, a matrix
$D$ is constructed where $D_{i,j}$ is the minimum number of edit
operations (Remove, Add, Change) needed to match $x_{1..i}$
to $y_{1..j}$.
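The formula itself did not survive on this slide. A hedged reconstruction of the recurrence, written in our own notation and assuming the positional N-gram cost that is consistent with the 0.33 entry in the example matrix of this deck, is given here. $g^x_i$ and $g^y_j$ denote the i-th N-gram of x and the j-th N-gram of y; these symbols are introduced only for this sketch.

```latex
D_{0,0} = 0, \qquad D_{i,0} = i, \qquad D_{0,j} = j

D_{i,j} = \min
\begin{cases}
  D_{i-1,j} + 1                      & \text{(Remove)} \\
  D_{i,j-1} + 1                      & \text{(Add)} \\
  D_{i-1,j-1} + d_n(g^x_i, g^y_j)    & \text{(Change)}
\end{cases}

d_n(a, b) = \frac{1}{n} \sum_{u=1}^{n} [\, a_u \neq b_u \,]
```

With n = 3, changing "ff." into "ffu" costs 1/3, because only one of the three positions differs.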
Levenshtein Distance
3-gram edit distance matrix for "diff." (rows) against "diffusion" (columns):

               d     i     f     f     u     s     i     o     n
         ()    __d   _di   dif   iff   ffu   fus   usi   sio   ion
   ()    0     1     2     3     4     5     6     7     8     9
d  __d   1     0     1     2     3     4     5     6     7     8
i  _di   2     1     0     1     2     3     4     5     6     7
f  dif   3     2     1     0     1     2     3     4     5     6
f  iff   4     3     2     1     0     1     2     3     4     5
.  ff.   5     4     3     2     1     0.33  1.33  2.33  3.33  4.33
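The matrix for "diff." against "diffusion" can be recomputed with a short dynamic-programming sketch. It assumes two leading padding characters and a Change cost equal to the fraction of differing positions between two 3-grams; the function names are hypothetical and this is not necessarily the authors' implementation.

```python
def char_ngrams(text, n=3, pad="_"):
    """One n-gram per character, with n - 1 padding characters in front."""
    padded = pad * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def gram_cost(a, b):
    """Fraction of positions at which two equal-length n-grams differ."""
    return sum(ca != cb for ca, cb in zip(a, b)) / len(a)


def ngram_distance_matrix(x, y, n=3):
    """Return the full edit-distance matrix D between the n-grams of x and y."""
    gx, gy = char_ngrams(x, n), char_ngrams(y, n)
    rows, cols = len(gx), len(gy)
    D = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        D[i][0] = float(i)              # remove i n-grams
    for j in range(1, cols + 1):
        D[0][j] = float(j)              # add j n-grams
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,                                    # Remove
                D[i][j - 1] + 1,                                    # Add
                D[i - 1][j - 1] + gram_cost(gx[i - 1], gy[j - 1]),  # Change
            )
    return D


D = ngram_distance_matrix("diff.", "diffusion")
for row in D:
    print("  ".join(f"{value:.2f}" for value in row))
```

The last row ends in 4.33: one partial Change ("ff." to "ffu") plus four Add operations for the remaining 3-grams of "diffusion".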
Features of the Approach
• Ordering is crucial.
• For example, the similarity between
the diffusion vs. the diff. : 0.66
the diffusion vs. diff. the : 0.31
• Without positional information, one can also find two strings
(Xanex and Nexan) that have exactly the same N-gram
decomposition, which would give them maximum similarity.
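A minimal sketch of the order sensitivity follows. It assumes that similarity is one minus the N-gram edit distance divided by the length of the longer N-gram sequence; the slides do not state the normalization or the padding, so individual values may differ from the ones reported. All names are hypothetical.

```python
def char_ngrams(text, n, pad="_"):
    """Character n-grams with n - 1 leading pads; spaces become underscores."""
    padded = pad * (n - 1) + text.lower().replace(" ", pad)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_similarity(x, y, n=3):
    """1 - (n-gram edit distance / longer n-gram sequence length)."""
    gx, gy = char_ngrams(x, n), char_ngrams(y, n)
    prev = [float(j) for j in range(len(gy) + 1)]
    for i, a in enumerate(gx, start=1):
        curr = [float(i)]
        for j, b in enumerate(gy, start=1):
            change = sum(ca != cb for ca, cb in zip(a, b)) / n
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + change))
        prev = curr
    return 1 - prev[-1] / max(len(gx), len(gy))


def bigram_set(text):
    """Unordered set of character 2-grams, ignoring case and position."""
    lowered = text.lower()
    return {lowered[i:i + 2] for i in range(len(lowered) - 1)}


same_order = ngram_similarity("the diffusion", "the diff.")
swapped = ngram_similarity("the diffusion", "diff. the")
print(same_order > swapped)                         # ordering matters
print(bigram_set("Xanex") == bigram_set("Nexan"))   # identical unordered bigrams
print(ngram_similarity("Xanex", "Nexan", n=2) < 1)  # positional measure separates them
```

Under these assumptions the first pair scores about two thirds, in line with the 0.66 on the slide, and swapping the word order lowers the score. Xanex and Nexan share the same set of bigrams, yet the positional measure still tells them apart.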
Application
• As can be expected, publication lists provide
detailed bibliographic information about each
publication, such as its title, the journal title, the
names of the author and co-author(s), the
publication year, the volume, and the first and last page.
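Components (columns) and Variables (rows). Each number gives the position
of the component in the compared string; "-" means the component is not used.

Variable  Title  Journal  Co-authors  Volume  Begin Page  Publication Year
SCORE1    2      3        1           4       5           6
SCORE2    2      -        1           3       4           5
SCORE3    1      2        -           3       4           5
SCORE4    1      -        2           -       -           -
SCORE5    1      -        -           -       -           -
SCORE6    1      2        -           -       -           -
SCORE7    -      1        2           -       -           -
SCORE8    -      1        -           -       -           -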
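One way to read the table above is as a recipe for concatenating database fields in a given order before comparing them with the CV reference. The sketch below assumes that reading; the field names, the ", " separator and the helper `build_string` are hypothetical, and the record values are taken from the example reference used in these slides.

```python
# Hypothetical field names for one WoS record.
record = {
    "coauthors": "ZHANG, L, THIJS, B, GLANZEL, W",
    "title": "The diffusion of H-related literature",
    "journal": "JOURNAL OF INFORMETRICS",
    "volume": "5",
    "begin_page": "583",
    "year": "2011",
}

# Component order per variable, read off the table (position 1 comes first).
VARIABLES = {
    "SCORE1": ["coauthors", "title", "journal", "volume", "begin_page", "year"],
    "SCORE2": ["coauthors", "title", "volume", "begin_page", "year"],
    "SCORE3": ["title", "journal", "volume", "begin_page", "year"],
    "SCORE4": ["title", "coauthors"],
    "SCORE5": ["title"],
    "SCORE6": ["title", "journal"],
    "SCORE7": ["journal", "coauthors"],
    "SCORE8": ["journal"],
}


def build_string(record, variable):
    """Join the record's components in the order defined for the variable."""
    return ", ".join(record[name] for name in VARIABLES[variable])


for name in VARIABLES:
    print(name, "->", build_string(record, name))
```

Each resulting string would then be scored against the CV reference with the 3-gram similarity, giving one value per variable.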
Application
CV reference:
Zhang, L., Thijs, B., Glänzel, W., The diffusion of H-related
literature. JOI, 2011, 5 (4), 583-593

WoS strings compared with the CV reference, with their 3-gram scores:
SCORE1 (0.67): ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related
literature, JOURNAL OF INFORMETRICS, 5, 583, 2011
SCORE2 (0.73): ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related
literature, 5, 583, 2011
SCORE3 (0.34): The diffusion of H-related literature, JOURNAL OF
INFORMETRICS, 5, 583, 2011
SCORE4 (0.65): ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related
literature
SCORE5 (0.36): The diffusion of H-related literature
SCORE6 (0.41): The diffusion of H-related literature, JOURNAL OF
INFORMETRICS
SCORE9 (0.73): maximum score
Using these scores, we would like to decide whether the publication is
indexed in the given database. Discriminant analysis is a
convenient tool for this purpose.
Kernel Discriminant Analysis
• Since the assumptions of parametric discriminant analysis cannot
be taken to hold, this non-parametric method is applied.
• Exploiting a (normal) kernel function, it handles a nonlinear
mapping through a linear mapping in a feature space.
• As a result, it is based on estimating a non-parametric
density function for the observations.
• A smoothing parameter ‘r’ determines the degree of
irregularity in the density estimate. As suggested in
Khattree and Naik (2000), we tried several values of ‘r’
to reach the optimal solution.
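The idea of classifying by kernel density estimates can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes a product of normal kernels whose bandwidth per variable is r times the standard deviation within the group, equal priors by default, and made-up scores. The scale of r here need not correspond to the values of r reported in the results.

```python
import math
from statistics import stdev


def kernel_density(x, sample, r):
    """Normal-kernel density estimate at x from one group's observations."""
    p = len(x)
    # Bandwidth per variable: r times that variable's standard deviation
    # in the group (a diagonal simplification assumed for this sketch).
    h = [r * stdev([row[k] for row in sample]) for k in range(p)]
    total = 0.0
    for row in sample:
        z2 = sum(((x[k] - row[k]) / h[k]) ** 2 for k in range(p))
        total += math.exp(-0.5 * z2)
    norm = (2 * math.pi) ** (p / 2) * math.prod(h)
    return total / (len(sample) * norm)


def classify(x, groups, r, priors=None):
    """Assign x to the group with the largest prior-weighted density."""
    if priors is None:
        priors = {g: 1 / len(groups) for g in groups}
    scores = {g: priors[g] * kernel_density(x, obs, r) for g, obs in groups.items()}
    return max(scores, key=scores.get)


# Made-up similarity scores (two variables per pair), for illustration only.
groups = {
    1: [(0.70, 0.65), (0.80, 0.75), (0.66, 0.70), (0.90, 0.85)],  # matched pairs
    0: [(0.20, 0.25), (0.30, 0.20), (0.25, 0.35), (0.15, 0.30)],  # unmatched pairs
}
for r in (0.5, 1, 2):
    print(r, classify((0.75, 0.72), groups, r), classify((0.22, 0.28), groups, r))
```

A larger r smooths the density estimate more strongly, which is why several values are usually tried before settling on one.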
Kernel Discriminant Analysis
• Set 1: SCORE1, SCORE2, SCORE3, SCORE4, SCORE5
and SCORE6
• Set 2: SCORE9, SCORE5 and SCORE8
• The former set is chosen to examine the variables
that all include the “Title” component and its
variations; the latter is chosen as a relatively
more independent set with “Maximum”, “Title”
and “Journal Name”.
Data
• Training Set
6525 real pairs from applicants’ CVs (manually verified
correct matches) (Group 1)
3 x 6525 randomly drawn unmatched pairs (Group 0) for the
same publications as in Group 1
• Test Set
2570 new pairs to be classified, completely different
from the ones in the training set
The publications are queried against a sample of WoS
data of size 7387
Results
Classification results by smoothing parameter r (all values in %):

set1
r   accuracy  false negatives  false positives
2   92.96     4.48             6.20
3   90.3      13.9             1.20
4   78.25     33.31            0.20
5   60.62     60.53            0

set2
r   accuracy  false negatives  false positives
2   91.13     2.63             10.15
3   94.9      5.68             2.23
4   90.97     13.33            0.62
5   82.33     27.03            0.16

• false positive: wrongly assigned to Group 1
• false negative: wrongly assigned to Group 0
Results
            Estimated 0  Estimated 1  Total
Observed 0  862          36           898
Observed 1  95           1577         1672
Total       957          1613         2570

• false positive: 36
• false negative: 95
• 94.3% of publications in Group 1 are classified correctly
• 97.8% of publications estimated in Group 1 are classified correctly
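The percentages on this slide follow directly from the confusion matrix. The short check below takes the Observed 1 / Estimated 1 cell as 1577, the value implied by the row total (1672 - 95) and the column total (1613 - 36); the variable names are our own.

```python
# Cells of the 2 x 2 confusion matrix for the test set.
true_negatives, false_positives = 862, 36    # Observed 0 row
false_negatives, true_positives = 95, 1577   # Observed 1 row (1672 - 95)

total = true_negatives + false_positives + false_negatives + true_positives

accuracy = (true_negatives + true_positives) / total
group1_correct = true_positives / (true_positives + false_negatives)
estimated1_correct = true_positives / (true_positives + false_positives)
false_negative_rate = false_negatives / (true_positives + false_negatives)
false_positive_rate = false_positives / (true_positives + false_positives)

print(f"accuracy: {accuracy:.1%}")
print(f"Group 1 classified correctly: {group1_correct:.1%}")
print(f"estimated Group 1 that is correct: {estimated1_correct:.1%}")
print(f"false negatives: {false_negative_rate:.2%}")
print(f"false positives: {false_positive_rate:.2%}")
```

These values agree with the set2 row for r = 3 in the results table (accuracy 94.9, false negatives 5.68, false positives 2.23).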
Results
• Even though the vast majority of the publications in Group 1 are
classified correctly, misclassifications remain:
• Group 1 pairs with similarity scores between 0.30 and 0.45: false negatives
• Group 0 pairs with similarity scores higher than 0.45: false positives
Conclusion
• With the proposed model, about 95% correct
classification is achieved.
• Matches with a similarity score of 0.6267 or
higher can be regarded as correct matches.
• For bibliometric evaluation processes, this will
help to reduce the manual work.
However…
Points not to be overlooked!
• Applies only to papers in English
• “False positives” remain an issue to be solved
• Tolerance to “false positives” depends on the
application (micro vs. macro level studies)
Thank you!