Lecture09 Bibliometric searching.ppt

Bibliometric
[scientometric, webometric, informetric …]
searching
Data used for assessing impact of
scholarly output
tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/
Tefko Saracevic
Central idea
• Use of quantitative methods – statistics – to
study & characterize recorded communication (‘literature’) of all kinds
• In order to:
– describe research output with various indicators &
distributions
– use in evaluating scholarly scientific performance
• New tools have significantly increased & changed the role
of searching & searchers
ToC
1. Goals, definitions
2. Reasons, applications – why?
3. Data sources for bibliometric analyses
4. Methods & measures – how?
5. A sample of examples
6. Implications for searching. Caveats
Bibliometrics, scientometrics, webometrics …
1. Goals, definitions
Metric studies
• Applied in many
fields:
Sociometrics,
Econometrics,
Biometrics …
– deal with statistical
properties, relations, &
principles of a variety
of entities in their
domain
• Metric studies in
information science
follow these by
concentrating on
statistical properties &
the discovery of
associated relations &
principles of
information objects,
structures, &
processes
Goals of metric studies
• To characterize
statistically entities
under study
– more ambitiously to
discover regularities &
relations in their
distributions &
dynamics in order to
observe predictive
regularities &
formulate laws
• describe numerically,
predict, apply
• Same in information
science
– portray statistically
entities under study:
• literature, documents, …
all kinds of inf. objects &
processes as related to
science, institutions, the
Web …
• but also people – authors
• more recently: also
scholarly productivity
Definitions
• biblio derived from
“biblion” Greek word
for book
• metrics derived from
“metrikos” Greek
word for
measurement
• Bibliometrics
– “...the application of
mathematical and
statistical methods to
books and other media of
communication.”
Alan Pritchard (1969)
– “… the quantitative
treatment of the properties
of recorded discourse and
behavior pertaining to it.”
Robert Fairthorne (1969)
Definitions … more
but with differing contexts
• Scientometrics
bibliometric & other
metric studies specifically
concentrating on science
• Informetrics
study of the quantitative
aspects of information in
any form - broadest
• Webometrics
quantitative analysis of
web-related phenomena
• Cybermetrics
quantitative aspects of
information resources on
the whole Internet
• E-metrics
measures of electronic
resources, particularly in
libraries
For simplicity, we will use bibliometrics here to cover all of these
Why? What? What for?
2. Base, reasons, use
Based on what entities have & could be
COUNTED
• In documents (as entities):
– authors
– their institutions,
countries
– sources – e.g. journals
– references – who &
what is cited
– age of references
• & anything else that is
countable
• In Web entities
– identifying
relationships between
Web objects
– link structures
• out-links
• in-links
• self-links
• nodes, central nodes
• in a way analogous to citations
And derivation of structures based on any of these
A lot is based on citations
• Citation analysis:
– analysis of data
derived from
references cited in
footnotes or
bibliographies of
scholarly publications
Used to be just counts
• Now it also leads to
examination & mapping
of intellectual impact of
scholars, projects,
institutions, journals,
disciplines, and nations
Becoming increasingly popular & widely used –
with important implications for searching
Reasons for bibliometric studies
• Understanding of patterns
– discovery of regularities, behavior
– “order out of documentary chaos” [Bradford, 1948]
• Analysis of structures & dynamics
– discovery of connections, relations, networks
– search for regularities - possible predictions
• Discovery of impacts, effects
– relation between entities & amounts of their various uses
– providing support for making of decisions, policies
Major branches of bibliometrics
Relational
• Older - patterns, structures,
relations, mappings
– where bibliometrics started
• Data on what was observed
– e.g. no. of articles/citations by/to
an author; no. of journals with
articles relevant to a topic; no. of
articles/citations in/to a journal …
• Used for description,
mapping of relations &
prediction
Evaluative
• Newer – impacts, effects
– where bibliometrics
became a big deal in many
arenas
• Data from what was
observed but looking for
– measures of impact,
prominence, ranking, bang
• Discovers who’s up &
how much up
• Used for decisions,
policies
Seeking …
Thelwall (2008)
Relational
Evaluative
• Relational bibliometrics
seeks to illuminate
relationships within
research, such as the
cognitive structure of
research fields, the
emergence of new
research fronts, or
national and international
co-authorship patterns
• Evaluative bibliometrics
seeks to assess the
impact of scholarly work,
usually to compare the
relative scientific
contributions of two or
more individuals or
groups
Major approaches
Empirical
• Collection & study of data
– establishment of measures
– statistical & graphic analyses
• We will pursue some of
these here
– concentrate on empirical
Theoretical
• Building of generalized
models, theories
– often mathematical, abstract
– becoming highly specialized
• We will NOT pursue this
here
– but you should be aware
that there are a lot of
theoretical efforts
Users
Relational
• Mostly scholars
• Mostly research oriented
• But also librarians for
decisions
– e.g. on collections,
purchase, weeding
Evaluative – new audience
• Library managers
• Analysts
• University administrators
(deans, provosts)
• Directors of institutional
research
• National governments &
ministries
• Grant & funding
agencies
Used in a variety of functions & areas
• In collection development
identifying the most-useful materials: by analyzing circulation
records; journal / e-journal usage statistics; etc.
• In information retrieval
identifying top-ranked documents, authors: those most highly-cited;
most highly co-cited; most popular; etc.
• In the sociology of knowledge
identifying structural and temporal relationships between
documents, authors, research areas, universities etc.
• In policy making
justifying, managing or prioritizing support for course of action in
a number of areas – e.g. science policy, institutional policy
Use of evaluative bibliometrics
• Academic, research & government institutions for:
– promotion and tenure, hiring, salary raises
– decisions for support of departments, disciplines
– grant decisions; research policy making
– visualization of scholarly networks, identifying key contributions &
contributors
– monitoring scholarly developments
– determining journal citation impact
• Resource allocation:
– identifying authors most worthy of support;
– research areas most worthy of funding
– journals most worthy of support or purchase; etc.
Major bibliometric factors for evaluation of
academic performance
For individuals
• Number of publications in
peer reviewed journals
• The impact factor of
those journals
• The h-index
For institutions
• Total no. of publications
• Total no. of citations
• Various ratios - per
faculty, project …
Impact indicators and studies
• Several governments mandate citation analysis
to
– assess quality of research and institutions
– inform decisions on support
– determine support for journals
– rank institutions, programs, departments, projects
• Many institutions practice it regularly
Where does stuff for analysis come from?
3. Data sources for bibliometric
analyses
Main sources for bibliometric analyses
• Bibliographies, indexes
– once popular, not any more
– once done manually - limited
• Documents in
databases
– computerization enabled
wide collection of data &
development of new
methods
• Science statistics
• And then there are
citations
– as they became automated,
use of bibliometrics
exploded
• Web & Internet
– mining connections & other
networked aspects
– but also applying some
older methods to new data
Institute for Scientific Information
(ISI, now Thomson Reuters)
• ISI launched in 1962 by Eugene Garfield
– started by publishing Science Citation Index (SCI) &
later Social Sciences Citation Index (SSCI) and Arts &
Humanities Citation Index (A&HCI) [all still in Dialog]
– these morphed into Web of Science (WoS)
• All only cover an ISI selected set of journals
– thus all citation results & studies are based on that set
of journals, not the universe of journals and books,
but the citations themselves are to whatever is cited
– true of any database – Scopus, Google Scholar etc.
Impact of ISI citation databases
• Major source for bibliometric analysis
• Revolutionized use of citations
– e.g. easy citation counts, tracing, establishment of
connections … became possible
• Provided data for new types of analysis
– e.g. mapping of fields, identifying research fronts
• Laid base for evaluative bibliometrics
• Instigated new types of searching
– above & beyond subject searching
Expansion of citation data sources
• Starting in early
2000s citation data
are being offered by a
number of databases
other than Web of
Science, most notably
– Scopus
– Google Scholar
• and a host of others
• This expanded
dramatically
availability of data &
types of analyses
– a number of
innovations were
introduced
– use of such data also
expanded
• Challenge to WoS
databases
Connections
• Data from relational bibliometrics is used for
sorting, ranking, mapping … in evaluative
bibliometrics
• Raw data obtained from relational analyses is
then “milked” in many ways
– often combined with other data
• e.g. ranked citation counts and financial data, enrollment
data …
4. Methods & measures – how?
Overview
• A few older bibliometric
laws & methods:
• Lotka’s law
– deals with distribution of
authors in a field
• Bradford’s law
– deals with distribution of
articles relevant to a
subject across journals
where they appear
• From citations:
– citation age (or
obsolescence)
– co-citation
– clustering & co-citation
maps
– bibliographic coupling
– journal impact factor
– self citation (auto-citation)
– & many more.
Lotka’s law (1926) – papers & authors
Alfred Lotka (1880-1949, American mathematician, chemist and statistician)
Formal
Number of authors who had
published n papers in a
given field is roughly 1/n²
the number of authors
who had published one
paper only
English
A large proportion of the total
literature in a field is
authored by a small
proportion of the total
number of authors, falling
down regularly, where the
majority of authors produce
but one paper
e.g. for 100 authors who each
wrote one article over a specific
period, we also have 25 authors with 2
articles (100/2² = 25), 11 with 3 articles
(100/3² ≈ 11), 6 with 4 articles (100/4² ≈ 6)
etc.
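The arithmetic in the example above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the function name `lotka_expected_authors` and the adjustable `exponent` parameter are assumptions made here, and rounding to whole authors follows the slide's own approximation.

```python
def lotka_expected_authors(single_paper_authors, max_papers, exponent=2.0):
    """Return {n: expected number of authors with n papers} under Lotka's law.

    Assumes the classic inverse-square form: authors(n) = authors(1) / n**2.
    The exponent is exposed only for illustration.
    """
    return {
        n: round(single_paper_authors / n ** exponent)
        for n in range(1, max_papers + 1)
    }


if __name__ == "__main__":
    # Reproduces the slide's example for 100 single-paper authors.
    print(lotka_expected_authors(100, 4))  # {1: 100, 2: 25, 3: 11, 4: 6}
```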
Bradford’s law (1934) – papers & journals
Samuel C. Bradford (1878-1948, British mathematician and librarian)
Formal
If scientific journals are arranged
in order of decreasing
productivity of articles on a
given subject, they may be
divided into a nucleus of
periodicals more particularly
devoted to the subject and
several groups or zones
containing the same number of
articles as the nucleus, when
the numbers of periodicals in
the nucleus and succeeding
zones will be as 1 : n : n² : n³
n is called the Bradford multiplier
English
• Basically states that most
articles in a subject are
produced by few journals
(called nucleus) and the rest
are made up of many
separate sources that
increase in numbers in a
regular, exponential way
• Like Lotka’s law this is a law
that generally follows laws of
diminishing returns
Bradford’s law: How did he do it?
• He grouped periodicals with articles relevant to a subject
(from a bibliography) into 3 zones in order of decreasing
yield
– from journals with largest no. of articles to those with smallest; at
the end are journals with one article each on the subject
• Each zone had the SAME number of articles but different
no. of journals
• The number of journals in each zone increases
exponentially
– e.g. if there are 5 journals in the first zone that produced 12
relevant articles; there may be 10 journals in the second zone for
next 12 articles & 20 for next 12 – Bradford multiplier (n) found here
is 10/5=2
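The zone example above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the multiplier is estimated as the simple average of successive zone-size ratios, and real data rarely fits this neatly.

```python
def bradford_zone_sizes(nucleus_journals, multiplier, zones=3):
    """Journals per zone when each zone is `multiplier` times the previous one."""
    return [round(nucleus_journals * multiplier ** k) for k in range(zones)]


def bradford_multiplier(zone_sizes):
    """Estimate the Bradford multiplier as the mean ratio of successive zones."""
    if len(zone_sizes) < 2:
        raise ValueError("need at least two zones to estimate a multiplier")
    ratios = [later / earlier for earlier, later in zip(zone_sizes, zone_sizes[1:])]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Slide example: 5, 10 and 20 journals each yield the same 12 articles.
    print(bradford_zone_sizes(5, 2))          # [5, 10, 20]
    print(bradford_multiplier([5, 10, 20]))   # 2.0
```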
Cited half-life
Formal
• Definition: the number of
years that the number of
citations takes to decline to
50% of its current total
value
English
• How far back in time one
must go to account for
one half of the citations a
journal receives in a
given year
– e.g. if in 2008 the journal XYZ
has a cited half-life of 7.0, it
means that articles published
in XYZ between 2002 and 2008
(inclusive) account for 50% of
all citations to articles from
that journal (anyplace) in 2008
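A rough way to see how the cited half-life example works is to walk back year by year until half of the citations are covered. The sketch below is a simplification: it returns whole years only, whereas published values such as 7.0 carry a decimal, and both the function name and the citation counts for journal XYZ are made up for illustration.

```python
def cited_half_life_years(citations_by_pub_year, citing_year):
    """Whole years (counting the citing year) needed to cover 50% of citations.

    citations_by_pub_year maps the publication year of the cited articles to
    the number of citations they received in citing_year.
    """
    total = sum(citations_by_pub_year.values())
    if total == 0:
        raise ValueError("journal received no citations")
    running = 0
    oldest = min(citations_by_pub_year)
    for age, year in enumerate(range(citing_year, oldest - 1, -1), start=1):
        running += citations_by_pub_year.get(year, 0)
        if 2 * running >= total:
            return age
    raise ValueError("citations are dated after the citing year")


# Hypothetical 2008 citation counts for journal XYZ, by year of cited article.
xyz_2008 = {2008: 5, 2007: 10, 2006: 10, 2005: 10, 2004: 5,
            2003: 5, 2002: 5, 2001: 20, 2000: 15, 1999: 15}

if __name__ == "__main__":
    print(cited_half_life_years(xyz_2008, 2008))  # 7, i.e. 2002 to 2008 inclusive
```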
Citing half-life
Formal
• Definition: the median
age of the articles cited by
the journal during a given
year
English
• A measure of how current
(or how old) the
references cited in a
journal are
– e.g. if in 2008 for journal XYZ
citing half-life was 9.0, it means
that 50% of articles cited
(references) in XYZ were
published between years 2000
and 2008 (inclusive)
Co-citation
a popular similarity measure between two entities
Formal
The frequency with which two
items of earlier literature are
cited together by the later
literature
1. frequency with which two
documents are cited together,
or
2. frequency with which two
authors are cited together
irrespective of what document
English
• As for 2.: How often are two
authors cited together
• If authors A and B are both
cited by C, they may be said to
be related to one another, even
though they don’t directly
reference each other
– if A and B are both cited by many
other articles, they have a
stronger relationship. The more
items they are cited by, the
stronger their relationship is
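Counting co-citations as described above amounts to counting, for every pair of cited items, how many later documents cite both. The following is a minimal sketch with made-up data; the function name, the document labels C1 to C3 and the cited items A, B, D are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of cited items appears in the same reference list.

    reference_lists maps a citing document to the items it cites.
    """
    counts = Counter()
    for refs in reference_lists.values():
        # Sort so each pair has one canonical order; set() ignores repeats.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing documents and the items (or authors) they cite.
citing = {"C1": ["A", "B"], "C2": ["A", "B", "D"], "C3": ["B", "D"]}
counts = cocitation_counts(citing)

if __name__ == "__main__":
    print(counts[("A", "B")])  # 2: A and B are cited together by C1 and C2
```

The same function works for author co-citation if each reference list holds author names rather than document identifiers.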
Use of co-citation
• Co-citation is often used as a measure of similarity
– if authors or documents are co-cited they are likely to be similar
in some way
• This means that if collections of documents are arranged
according to their co-citation counts then this should
produce a pattern reflecting cognitive scientific
relationships
• Author co-citation analysis (ACA) is a technique that
measures the similarity of pairs of authors through the
frequency with which their work is co-cited
• These are then arranged in maps showing the structure of
a field, domain, area of research …
Map of Author Co-citation Analysis of information science
Zhao & Strotmann (2008)
Bibliographic coupling
Formal
• Links two items that reference
the same items, so that if A
and B both reference C, they
may be said to be related,
even though they don't directly
reference each other. The
more items they both reference
in common, the stronger their
relationship is
• It is backward chaining, while
co-citation is forward chaining
English
• Occurs when two works
reference a common third work
in their bibliographies, e.g.
if in one article Saracevic cites
Kantor, P. &
in another article Belkin cites
Kantor, P.,
• but neither Saracevic nor Belkin
cites the other in those articles
• then Saracevic & Belkin are
bibliographically coupled because
they cite Kantor
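Coupling strength, as defined above, is simply the number of references two works share. A minimal sketch follows; the function name and the two reference lists are illustrative placeholders, not the actual bibliographies of any article.

```python
def coupling_strength(refs_a, refs_b):
    """Number of references two works have in common."""
    return len(set(refs_a) & set(refs_b))


# Hypothetical reference lists for two articles.
saracevic_article = ["Kantor, P.", "Other work 1"]
belkin_article = ["Kantor, P.", "Other work 2"]

if __name__ == "__main__":
    # The two articles share one reference, so they are coupled with strength 1.
    print(coupling_strength(saracevic_article, belkin_article))  # 1
```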
Journal Impact Factor
in Journal Citation Reports (JCR)
Formal
The average number of times
articles from the journal
published in the past two
years have been cited in
the JCR year.
The number of citations published
in the year X to articles in the
journal published in years X − 1
and X − 2, divided by the
number of articles published in
the journal in the years X − 1
and X − 2.
English
• Measures how often
articles in a specific
journal have been cited
– a Journal Impact Factor for
journal XYZ of 2.5 means that,
on average, the articles
published in XYZ one or two
years ago have been cited two
and a half times
• How to use Journal
Citation Reports
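The Journal Impact Factor ratio defined on this slide can be sketched as a single division. The function name and the counts below are hypothetical; they are chosen only so that the result matches the slide's example value of 2.5.

```python
def journal_impact_factor(citations_in_year_x, articles_prev_two_years):
    """Citations in year X to articles from X-1 and X-2, per such article."""
    if articles_prev_two_years <= 0:
        raise ValueError("journal published no articles in the two prior years")
    return citations_in_year_x / articles_prev_two_years


if __name__ == "__main__":
    # Hypothetical: 100 articles from X-1 and X-2 drew 250 citations in year X.
    print(journal_impact_factor(250, 100))  # 2.5
```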
h-index - Hirsch (2005)
Formal
• For a scientist, it is the
largest number h such
that s/he has at least h
publications each cited at least
h times, while the other
publications have no more than
h citations each
– it is more than a straight
citation count because it
takes into account BOTH:
number of publications one
had AND number of
citations one received
English
• Number of papers a
scientist has published
that have each received at
least that same number of citations
• I published (as listed in
Scopus):
– 74 articles
– 31 of which were considered for the h-index (their criteria)
– of these, 15 were cited at least 15
times
– others were cited less
– my h-index is 15
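The definition above translates directly into a short computation: rank the papers by citations and find the last rank whose paper has at least that many citations. This is a minimal sketch; the function name and the citation counts in the usage example are invented, and databases such as Scopus apply their own inclusion criteria before computing the index.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


if __name__ == "__main__":
    # Hypothetical author with five papers.
    print(h_index([10, 8, 5, 4, 3]))  # 4
```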
h-index differences
• There are differences
in typical h values in
different fields,
determined in part by
– the average number of
references in a paper in the
field
– the average number of
papers produced by each
scientist in the field
– the size (number of
scientists) of the field
• Thus, comparison of
h-indexes of scientists
in different fields may
not be valid
• Keep it to the same
field!
– e.g. h indices in biological
sciences tend to be higher
than in physics
Citation frequency: citations are skewed
Research front
• A few articles are cited a
lot, others less, and many very
little or not at all
– 80-20 distribution: 20% of
articles may account for
80% of the citations
– from 1900-2005, about one
half of one percent of cited
papers were cited over 200
times. Out of about 38
million source items about
half were not cited at all.
(Garfield, 2005)
• This led to identifying of a
“research front”
– cluster of highly cited
papers in a domain
– showing also links among
the highly cited papers in
form of maps
• indicating what papers are
frequently cited together, i.e.
co-cited
• For searchers: identifying
current & evolving
research fronts in a
domain
Aggregate article & citation statistics
• Derived from citation
databases
– combined statistics for
a variety of entities
• “Milked” in great
many, even ingenious
ways
– e.g. a major
component in ranking
of universities (shown
later)
• The number of
citations to all articles
in a
– journal (base for Journal
Impact Factor)
– or all articles or
citations received by
• author
• research group
• institution
• country
5. A sample of examples
Scopus citation tracking for an author
Scopus journal analyzer three journals selected for comparison
could be further analyzed by tabs or listed in a table
Web of Science citation report for an author
Web of Science Journal Citation Report for three
journals
Histogram for JASIST using Garfield's HistCite
LCS= Local Citation Score; count of how much cited in JASIST
GCS=Global Citation Score; count of how much cited in all journals in WoS
LCR=Local Cited References; how many references from JASIST
NCR=Number of Cited References; how many references in the paper
WoS: Essential Science Indicators
WoS: InCites
SCImago Journal & Country Rank (SJR)
a great resource – from Spain
SJR Journal Analysis for Information Processing & Management
SJR Country Indicators
University rankings
• Times Higher Education ranking: QS World University
Rankings 2008 - Top 400 Universities
http://www.topuniversities.com/worlduniversityrankings/results/2008/
overall_rankings/fullrankings/
• Shanghai ranking: Academic Ranking of World
Universities – 2007 - Shanghai Jiao Tong University
http://www.arwu.org/rank/2007/ranking2007.htm
– Miscellaneous Information on University Rankings
http://www.arwu.org/rank/2008/200810/ARWU2008Resources.htm
• Leiden ranking: Top 100 & 250 universities, Europe &
world, 2008 - Centre for Science and Technology
Studies (CWTS), Leiden University, Netherlands
http://www.cwts.nl/ranking/LeidenRankingWebSite.html
What to watch for? Ethical issues as well
6. Implications for searching.
Caveats
Role of searchers
Relational bibliometric
searching
Older:
• Connected with subject
searches
– adding dimension of
authors, sources …
• Performing citation
analyses
– e.g. identifying key papers,
authors, sources
– citation pearl growing
Evaluative bibliometric
searching
Newer - higher responsibility:
• Called to perform searches
related to bibliometric
indicators of impact
– often by administrators,
decision makers, policy
wonks, managers
– e.g. for tenure & promotion;
resource allocation; grants;
purchase decisions;
justification …
Implication for searching because of scatter
• Journals & articles are
scattered, so are authors
– many articles are in core
journals – easy to find
– BUT: a number of relevant
articles will be scattered
throughout other journals
– These need to be found
• not to miss relevant articles in
non-core journals
• High precision searching
concentrates on top producing
journals and authors in a
subject
• High recall searching includes
the long tail of authors and
journals
– but the long tail could be
very long
• need to know when to
stop
Key: Adjusting effectiveness & efficiency of searching
to laws of diminishing returns
Caveats for citations (and there are many)
• Citation rates & practices differ greatly among fields
– citation & publication practices are NOT homogeneous within
specialties and fields of science (Leydesdorff, 2008)
• The context could be negative
• A citation may not be relevant to the work
• The second, third … author may not be cited at all
• Matthew effect (rich get richer) or success-breeds-success mechanism works in citations
– already well-known individuals receive disproportionately high
rate of citation
• Self citation practices & citation padding
– author citing him/herself; journal articles citing their own journal
Caveat for author & citation disambiguation
• Distinguishing Saracevic, T. from other authors
is not hard – to zero in on that one author
– Belkin, N. is harder; Kantor, P. still harder; Ying, Z. almost
impossible
– thus, VERY careful disambiguation is necessary
• sometimes very time consuming; sometimes never sure
• Citations in articles are often messy & careless
– e.g. my name while being cited was misspelled in many creative
ways
– no corrections are made by databases
– thus, variations have to be explored to be included in citation
counts
Caveats for h-index - (Hirsch, 2005)
• “Obviously, a single number can never give
more than a rough approximation to an
individual’s multifaceted profile, and many other
factors should be considered in combination in
evaluating an individual.”
• “Furthermore, the fact that there can always be
exceptions to rules should be kept in mind,
especially in life-changing decisions such as the
granting or denying of tenure.”
Caveat for webometrics & Web sources –
Thelwall (2008)
• Web data is not quality controlled
– caveat emptor (search for what it means)
• Web data is not standardized
– e.g. there does not seem to be a simple way to
separate out web citations in online journal articles
from those in online course reading lists
• It can be impossible to find the publication date
of a web page
– results typically combine new and old web pages
• Web data is incomplete in several senses and in
arbitrary ways
Caveat for Journal Impact Factor (JIF)
• Assumption: journals with higher JIFs tend to
publish higher impact research & hence tend to
be better regarded. But:
– JIFs vary greatly from field to field, because citation
practices differ greatly
– even within discrete subject fields, ranking journals
based upon JIFs is problematic – it is but one
measure, other characteristics are important
– because of their popularity, journal citations are misused:
• recommendations to authors to cite other articles in a given
journal to improve its JIF
Caveat for coverage: differences can be
substantial
• Different databases cover different articles, citations,
handle them differently …
– there is no one answer to: “How many citations did X receive?”
• For the same author (institution …) different databases
will provide different
– no. of articles, citations; h-index; … overlap may not be great
– in citations there are even ghost citations (listed as citing an article
but there is no actual citation in the article)
• Careful comparisons & use of multiple databases are
necessary
• A whole literature on these inconsistencies emerged
– one of the frequent analyzers is Péter Jacsó, U of Hawaii
Searching ….