Institutional classification schemes
in bibliometrics
Matthias Winterhager
Bielefeld University
euroCRIS Membership Meeting
Bonn
May 13, 2013
- Institutions: a major object of bibliometric studies
- Relevant data fields in bibliometric databases: the case of Web of Science
- Institutional data: ready to use?
- The advent of identifiers
- Processing institutional data: how to count?
2
Web of Science anno 1982
3
Hardcover times ...
4
Institutional address data:
two dimensions
5
Web of Science anno 2013
6
Web of Science: Institutional Data
7
Web of Science Institutional Data
- Reprint address
- Authors' institutional affiliation(s) (from 2008 onwards linked to author names)
- Funding agencies
- Publishers

Do not expect any institutional affiliation data
before 1965
8
Sample document
(from Scientometrics 2008)
9
Address Data in Web of Science
Author Affiliations („work done at“):

Tech Univ Denmark, Tech Knowledge Ctr Denmark, DARC
DTU Anal & Res Promot Ctr, Lyngby, Denmark

Ctr Sci & Technol Studies CEST, Bern, Switzerland

Inst Res Informat & Qual Assurance, Bonn, Germany
Reprint Address:

Larsen, PO (reprint author), Marievej 10A,2, Hellerup,
Denmark
Present Address („current potential“):

missing
10
Which bibliometric indicators
depend on address data?
Almost all:
- Every indicator that is (directly or indirectly) based on distinct sets of national, organisational or geographic entities
- Any normalised indicator (like observed vs. expected citation ratios) that takes into account regional (e.g. EU, country, state) instead of world averages
- Any indicator on cross-national, cross-organisational or cross-institutional cooperation as measured via co-authorships

11
Institutional address data: Issues (1)




- Substantial numbers of records come without any address data (they may or may not be included in world total counts for expected ratios)
- Different proportions of missing address data per discipline (e.g. humanities) and document type
- Few records with address data in “backfiles” (before 1966)
- Spelling variants, misspellings, erroneous entries (samples follow)
12
Erroneous address data (1)
13
Erroneous address data (2)
14
Erroneous address data (3)
15
Uncontrolled affiliation (1)
16
Uncontrolled affiliation (2)
17
Institutional address data: Issues (2)


- “Reality gap”: names for entities which
  - never existed (fictitious institutes)
  - existed, but have been split, merged, renamed or closed
- Geographical and organisational aspects of an address can point in different directions; borderline cases can be complicated to assign (Max Planck Institutes outside Germany, EMBL, CERN, KIT, Charité Berlin)
18
Abbreviations in address records
… can have different origins: author, publisher and
database producer
„Corporate and institution names may or may not be
abbreviated. To be comprehensive, search for the
full name of the institution … as well as the
abbreviation.“
„Abbreviations for corporate and institution names
used in the product database are listed below. Other
address elements … may also be abbreviated.“
(from: Web of Science Help)
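
The advice to search for both the full name and the abbreviation can be sketched as a small query-expansion helper. This is only an illustration: the abbreviation table below is a tiny subset inferred from the sample addresses shown earlier (e.g. "Tech Univ Denmark", "Ctr Sci & Technol Studies"), not the official Web of Science list, and the rules for dropping "of"/"for" and writing "and" as "&" are assumptions made for the example. The function names are hypothetical.

```python
# Illustrative subset only; inferred from the sample addresses, NOT the official list.
ABBREVIATIONS = {
    "technical": "Tech",
    "university": "Univ",
    "center": "Ctr",
    "centre": "Ctr",
    "science": "Sci",
    "technology": "Technol",
    "institute": "Inst",
    "research": "Res",
    "information": "Informat",
    "quality": "Qual",
}

# Assumption: these short words are omitted in abbreviated address strings.
DROPPED = {"of", "for"}


def abbreviate(name):
    """Return an abbreviated variant of an institution name (hypothetical rules)."""
    out = []
    for word in name.split():
        key = word.lower()
        if key in DROPPED:
            continue
        if key == "and":
            out.append("&")
        else:
            out.append(ABBREVIATIONS.get(key, word))
    return " ".join(out)


def search_variants(name):
    """Return the search strings to try: the full name plus its abbreviated form."""
    variants = [name]
    short = abbreviate(name)
    if short != name:
        variants.append(short)
    return variants


if __name__ == "__main__":
    for variant in search_variants("Technical University of Denmark"):
        print(variant)
```

Under these assumptions, "Technical University of Denmark" yields two search strings, the full name and "Tech Univ Denmark". Since abbreviations may come from the author, the publisher or the database producer, a real search would need more variants than a single table can generate.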
19
Cleaning of German address data (1)
Project of the German Competence Centre for Bibliometrics




- Aim: assignment of (almost) every paper with at least one German address from Web of Science (or Scopus) to the relevant German institution(s)
- Introduction of procedures to handle unstandardized, incomplete and incorrect address data
- Testing different algorithms for reduction and standardization of data elements, extraction of cities and tree structures
- Use of geocoding services
20
Cleaning of German address data (2)



- Maintaining a large base of institute-specific string patterns for thousands of German institutions
- Assignment process still heavily based on pattern-matching procedures
- Growing database of institutions (with “history”), currently ~2,000 main institutions (data from 1995-2011), mapping identifiers to external sources
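
A minimal sketch of what pattern-based assignment with institutional “history” could look like. It is not the project's actual implementation: the institution names, identifiers, patterns and validity years below are invented placeholders, and the real procedures are certainly more elaborate.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class InstitutionPattern:
    institution_id: str   # hypothetical project identifier
    pattern: re.Pattern   # institute-specific string pattern
    valid_from: int       # first publication year the pattern applies to
    valid_to: int         # last publication year the pattern applies to


# Invented examples: a fictional university that is replaced by a
# fictional successor institution from 2009 onwards.
PATTERNS = [
    InstitutionPattern(
        "DE-0001", re.compile(r"\buniv\w*\s+musterstadt\b", re.I), 1995, 2008
    ),
    InstitutionPattern(
        "DE-0002", re.compile(r"\bmusterstadt\s+inst\w*\s+technol\w*", re.I), 2009, 2011
    ),
]


def assign(address, year):
    """Return the ids of all institutions whose pattern matches the address
    and whose validity period covers the publication year."""
    return [
        p.institution_id
        for p in PATTERNS
        if p.valid_from <= year <= p.valid_to and p.pattern.search(address)
    ]


if __name__ == "__main__":
    print(assign("Univ Musterstadt, Dept Phys, D-12345 Musterstadt, Germany", 2000))
    print(assign("Musterstadt Inst Technol, Musterstadt, Germany", 2010))
```

Keeping a validity period next to each pattern is one simple way to model splits, mergers and renamings: the same address string can map to different institutions depending on the publication year, and an address that matches no currently valid pattern is left unassigned for manual review.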
21
Institutional Identifiers
Identifiers of the project are currently being mapped to
“Research Explorer”, the research directory of the
Deutsche Forschungsgemeinschaft (DFG, German
Research Foundation) and of the German Academic
Exchange Service (DAAD) in cooperation with the German
Rectors' Conference (HRK).
22
Institutional Identifier (I²)
Initiatives are underway, but it will take several years to bring
such standards into operation on a broad scale (as can be seen
from the case of the author identifier initiative – ORCID)
23
Processing institutional data:
How to count?
Challenges coming from international clinical trials and
from high energy physics:
Hundreds of different addresses from thousands of
authors on a single publication – how to attribute
publication (and citation) counts in the right way?
24
Counting methods





- Complete (C): each basic unit gets 1 credit
- Complete-normalized (CN): all the basic units in a publication share 1 credit
- Straight (S): the first basic unit gets 1 credit
- Whole (W): each unique basic unit gets 1 credit
- Whole-normalized (WN): all the unique basic units in a publication share 1 credit
Basic units can be countries, organisations, institutes.
Normalized counts are often called “fractional”.
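
The five methods can be compared on a single publication with a short sketch. It assumes that the basic units are given in address order with repeats allowed (for example one country code per author address), and that normalized credit is split equally; the function name and the example data are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def count_credits(units, method):
    """Return {basic unit: credit} for one publication.

    units  -- basic units in address order, repeats allowed
    method -- "C", "CN", "S", "W" or "WN"
    """
    if not units:
        return {}
    occurrences = Counter(units)
    if method == "C":    # every occurrence of a unit gets 1 credit
        return {u: Fraction(n) for u, n in occurrences.items()}
    if method == "CN":   # all occurrences share 1 credit equally
        return {u: Fraction(n, len(units)) for u, n in occurrences.items()}
    if method == "S":    # only the first unit gets credit
        return {units[0]: Fraction(1)}
    if method == "W":    # each unique unit gets 1 credit
        return {u: Fraction(1) for u in occurrences}
    if method == "WN":   # the unique units share 1 credit equally
        return {u: Fraction(1, len(occurrences)) for u in occurrences}
    raise ValueError(f"unknown counting method: {method}")


if __name__ == "__main__":
    # Hypothetical paper with two Danish addresses and one Swiss address.
    paper = ["DK", "DK", "CH"]
    for method in ("C", "CN", "S", "W", "WN"):
        credits = count_credits(paper, method)
        print(method, {unit: str(credit) for unit, credit in credits.items()})
```

For this hypothetical paper, complete counting gives DK 2 and CH 1, complete-normalized gives DK 2/3 and CH 1/3, straight gives DK 1 only, whole gives 1 each, and whole-normalized gives 1/2 each. Only the normalized methods and straight counting keep the total credit per publication at 1.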
25
No „gold standard“ for counting
Small units may be favoured by whole counting (W) – but
other effects make it difficult to give a general rule here.
„40 years of publication counting have not resulted in
general agreement on definitions of methods and
terminology nor in any kind of standardization.”
(Larsen, P. O.: The state of the art in publication counting,
Scientometrics, 77(2), 2008, 235-251)
26
Thank you for listening!
27