Institutional classification schemes in bibliometrics
Matthias Winterhager
Bielefeld University
euroCRIS Membership Meeting, Bonn, May 13, 2013

- Institutions: a major object of bibliometric studies
- Relevant data fields in bibliometric databases: the case of Web of Science
- Institutional data: ready to use?
- The advent of identifiers
- Processing institutional data: how to count?

Web of Science anno 1982

Hardcover times ...

Institutional address data: two dimensions

Web of Science anno 2013

Web of Science: Institutional Data

Web of Science Institutional Data
- Reprint address
- Authors' institutional affiliation(s) (from 2008 onwards linked to author names)
- Funding agencies
- Publishers
Do not expect any institutional affiliation data before 1965

Sample document (from Scientometrics 2008)

Address Data in Web of Science
Author Affiliations („work done at“):
- Tech Univ Denmark, Tech Knowledge Ctr Denmark, DARC DTU Anal & Res Promot Ctr, Lyngby, Denmark
- Ctr Sci & Technol Studies CEST, Bern, Switzerland
- Inst Res Informat & Qual Assurance, Bonn, Germany
Reprint Address: Larsen, PO (reprint author), Marievej 10A,2, Hellerup, Denmark
Present Address („current potential“): missing

Which bibliometric indicators depend on address data?
Almost all:
- Every indicator that is (directly or indirectly) based on distinct sets of national, organisational or geographic entities
- Any normalised indicator (like observed vs. expected citation ratios) that takes into account regional (e.g. EU, country, state) instead of world averages
- Any indicator of cross-national, cross-organisational or cross-institutional cooperation as measured via co-authorships

Institutional address data: Issues (1)
- Substantial numbers of records come without any address data (they may or may not be included in world total counts for expected ratios)
- Different proportions of missing address data per discipline (e.g. humanities) and document type
- Few records with address data in “backfiles” (before 1966)
- Spelling variants, misspellings, erroneous entries (samples following)

Erroneous address data (1)

Erroneous address data (2)

Erroneous address data (3)

Uncontrolled affiliation (1)

Uncontrolled affiliation (2)

Institutional address data: Issues (2)
- “Reality gap”: names for entities which never existed (fictitious institutes), or which existed but have been split, merged, renamed or closed
- Geographical and organisational aspects of an address can point in different directions; borderline cases can be complicated to assign (Max Planck Institutes outside Germany, EMBL, CERN, KIT, Charité Berlin)

Abbreviations in address records
… can have different origins: author, publisher and database producer
„Corporate and institution names may or may not be abbreviated. To be comprehensive, search for the full name of the institution … as well as the abbreviation.“
„Abbreviations for corporate and institution names used in the product database are listed below.
Other address elements … may also be abbreviated.“ (from: Web of Science Help)

Cleaning of German address data (1)
- Project of the German competence centre for bibliometrics
- Aim: assignment of (almost) every paper with at least one German address from Web of Science (or Scopus) to the relevant German institution(s)
- Introduction of procedures to handle unstandardized, incomplete and incorrect address data
- Testing different algorithms for reduction and standardization of data elements, extraction of cities and tree structures
- Use of geocoding services

Cleaning of German address data (2)
- Maintaining a large base of institute-specific string patterns for thousands of German institutions
- Assignment process still heavily based on pattern-matching procedures
- Growing database of institutions (with “history”), currently ~2,000 main institutions (data from 1995-2011), mapping identifiers to external sources

Institutional Identifiers
Identifiers of the project are currently being mapped to “Research Explorer”, the research directory of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) and of the German Academic Exchange Service (DAAD) in cooperation with the German Rectors' Conference (HRK).

Institutional Identifier (I²)
Initiatives are underway, but it will take several years to bring such standards into operation on a broad scale (as can be seen from the case of the author identifier initiative, ORCID).

Processing institutional data: How to count?
Challenges coming from international clinical trials and from high energy physics: hundreds of different addresses from thousands of authors on a single publication. How should publication (and citation) counts be attributed in the right way?
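
Returning briefly to the address-cleaning slides: the pattern-matching assignment they describe can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the project's actual pattern base or code: the two regular expressions, the `normalize` and `assign` helpers, and the choice to reduce addresses to lowercase ASCII tokens before matching.

```python
import re

# Hypothetical pattern base: institution name -> regexes matched against
# a normalized address string. A real pattern base would be far larger.
INSTITUTION_PATTERNS = {
    "Bielefeld University": [
        r"\buniv\w*\s+bielefeld\b",
        r"\bbielefeld\s+univ\w*",
    ],
    "Institute for Research Information and Quality Assurance": [
        r"\binst\w*\s+res\w*\s+informat\w*",
    ],
}

# Compile once so repeated assignments do not re-parse the patterns.
COMPILED = {
    name: [re.compile(p) for p in patterns]
    for name, patterns in INSTITUTION_PATTERNS.items()
}


def normalize(address: str) -> str:
    """Lowercase the address and collapse every non-alphanumeric run to one space.

    Assumes ASCII input, as in abbreviated database address strings.
    """
    return re.sub(r"[^a-z0-9]+", " ", address.lower()).strip()


def assign(address: str) -> list[str]:
    """Return the sorted names of all institutions whose patterns match."""
    text = normalize(address)
    return sorted(
        name
        for name, regexes in COMPILED.items()
        if any(rx.search(text) for rx in regexes)
    )


if __name__ == "__main__":
    for raw in (
        "Inst Res Informat & Qual Assurance, Bonn, Germany",
        "Univ Bielefeld, Inst Sci & Technol Studies, D-33501 Bielefeld, Germany",
        "Marievej 10A,2, Hellerup, Denmark",
    ):
        print(raw, "->", assign(raw))
```

Addresses that match no pattern come back as an empty list, which is where manual review or additional patterns would be needed.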
Counting methods
- Complete (C): each basic unit gets 1 credit
- Complete-normalized (CN): all the basic units in a publication share 1 credit
- Straight (S): the first basic unit gets 1 credit
- Whole (W): each unique basic unit gets a credit of 1
- Whole-normalized (WN): all the unique basic units share 1 credit
Basic units can be countries, organisations, institutes. Normalized counts are often called “fractional”.

No „gold standard“ for counting
Small units may be favoured by whole counting (W), but other effects make it difficult to give a general rule here.
„40 years of publication counting have not resulted in general agreement on definitions of methods and terminology nor in any kind of standardization.“ (Larsen, P.O.: The state of the art in publication counting, Scientometrics, 77(2), 2008, 235-251)

Thank you for listening!
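
Supplementary sketch for the counting-methods slide: the five methods (C, CN, S, W, WN) can be compared on a single publication with a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `count_credits`, the representation of a publication as an ordered list of basic units (one entry per address, repeats allowed), and the example country codes are all hypothetical.

```python
from collections import Counter
from fractions import Fraction


def count_credits(units: list[str], method: str) -> dict[str, Fraction]:
    """Return the credit each basic unit receives from one publication.

    units  -- ordered basic units (e.g. countries), one per address;
              a unit may occur more than once
    method -- "C", "CN", "S", "W" or "WN"
    """
    if not units:
        return {}
    credits: Counter = Counter()
    if method == "C":        # complete: 1 credit per occurrence
        for unit in units:
            credits[unit] += Fraction(1)
    elif method == "CN":     # complete-normalized: occurrences share 1 credit
        share = Fraction(1, len(units))
        for unit in units:
            credits[unit] += share
    elif method == "S":      # straight: only the first unit is credited
        credits[units[0]] = Fraction(1)
    elif method == "W":      # whole: 1 credit per unique unit
        for unit in set(units):
            credits[unit] = Fraction(1)
    elif method == "WN":     # whole-normalized: unique units share 1 credit
        unique = set(units)
        for unit in unique:
            credits[unit] = Fraction(1, len(unique))
    else:
        raise ValueError(f"unknown counting method: {method!r}")
    return dict(credits)


if __name__ == "__main__":
    # Hypothetical paper: two Danish addresses, one Swiss, one German.
    paper = ["DK", "DK", "CH", "DE"]
    for method in ("C", "CN", "S", "W", "WN"):
        result = count_credits(paper, method)
        print(method, {unit: str(credit) for unit, credit in sorted(result.items())})
```

For this hypothetical paper, Denmark receives 2 credits under C, 1/2 under CN, 1 under S and W, and 1/3 under WN. The spread for a single four-address paper shows why the choice of method matters for publications with hundreds of addresses.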