BIBLIOMETRICS Tefko Saracevic Rutgers University

advertisement
BIBLIOMETRICS
Tefko Saracevic
Rutgers University
http://www.scils.rutgers.edu/~tefko
© Tefko Saracevic
11
What is?
“… all studies which seek to quantify
processes of written communication.”
Pritchard
“… the quantitative treatment of the
propertiesd of recorded discourse and
behavior pertaining to it.”
Fairthorne
Recorded communication - ‘literature’->
quantitative methods
© Tefko Saracevic
2
Alan Pritchard 1969
Coined the term "bibliometrics"
"the application of mathematics and statistical
methods to books and other media of
communication“
Journal of Documentation (1969) 25(4):348-349
© Tefko Saracevic
3
and other related metrics …
 Also used to study broader than books,
articles …
Scientometrics
 covering science in general, not just publications
Infometrics
 all information objects
Webmetrics or cybermetrics
 web connections, manifestations
 using bibliometric techniques to study the
relationship or properties of different sites on the
web
© Tefko Saracevic
4
Concepts
Basic (primitive) concepts:
1. Subject
2. Recorded communication -> document,
information object
3. Subject literature
Bibliometrics related to:
 science of science
sociology of science - numerical methods
© Tefko Saracevic
5
Literature studies
Qualitative
often in humanities, librarianship
Quantitative
bibliometrics
Mixed
© Tefko Saracevic
6
Reasons for quantitative
studies of literature
Analysis of structure and dynamics
search for regularities - predictions possible
Understanding of patterns
“order out of documentary chaos”
verification of models, assumptions
Rationale for policies & design
© Tefko Saracevic
7
Why quantitative studies?
Qualitative methods often depend on
assertions. ‘authoritative’ statements,
anecdotal evidence
Science searches for regularities
Success of statistical methods in social
sciences
Need for justification & basis for decisions
Something can be counted - irresistible
© Tefko Saracevic
8
Application in ...
History of science
Sociology of science
Science policy; resource allocation
Library selection, weeding, policies
Information organization
Information management
utilization
© Tefko Saracevic
9
Historical note
Bibliometrics long precedes information science
But found intellectual home in information
science
study of a basic phenomenon - literature
It is not ‘hot’ lately, but still produces very
interesting results
Branched out into web studies (web is a
“literature” as well)
© Tefko Saracevic
10
What studied?
Governed by data available in documents
or information resources in general - that
what can be counted
author(s)
origin
 organization, country, language
source
 journal, publisher, patent …
© Tefko Saracevic
11
what … more
contents
 text, parts of text, subject, classes
representation
citations
 to a document, in a document, co-citation
utilization
 circulation, various uses
links
any other quantifiable attribute
© Tefko Saracevic
12
Tools
Science Citation Index
Compilation of variables from journals in a
subject
Use data
Publication counts from indexes, or other
data bases
Web structures, links
© Tefko Saracevic
13
Variable: authors
number in a subject, field, institution, country
growth
correlation with indicators like GNP, energy etc.
productivity e.g. Lotka’s law
collaboration - co-authorship, associated networks
dynamics - productive life, transcience, epidemics
papers/author in a subject
mapping
© Tefko Saracevic
14
Variable: origin
Rates of production, size, growth by
country, institution, language, subject
Comparison between these
Correlation with economic & other
indicators
© Tefko Saracevic
15
Variable: sources
Concentration most often on journals
Growth, dynamics, numbers
information explosion - exponential laws
time movements, life cycles
Scatter - quantity/yield distribution
Bradford’s law
 Various distributions
 by subject, language, country
© Tefko Saracevic
16
Variable: contents
Analysis of texts
distribution of words – Zipf’s law
words, phrases in various parts
subject analysis, classification
co-word analysis
© Tefko Saracevic
17
Variable: representation
frequency of use of index terms, classes
distribution laws - key terms where?
thesaurus structure
© Tefko Saracevic
18
Variable: citations
Studied a lot; many pragmatic results
base for citation indexes, web of science,
impact factors, co-citation studies etc
Derived:
number of references in articles
number of citations to articles
 research front; citation classics
bibliographic coup[ling
© Tefko Saracevic
19
citations … more
co-citations
 author connections, subject structure, networks,
maps
centrality
 of authors, papers
validation with qualitative methods
impact
© Tefko Saracevic
20
Variable: utilization
frequency
distribution of requests for sources, titles
 e.g. 20/80 law
relevance judgement distributions
circulation patterns
use patterns
© Tefko Saracevic
21
Variable: links
Development of link-based metrics
in-links, out-links
Web structure
Web page depth; update
PageRank vs quality
© Tefko Saracevic
22
Examples from classic
studies
Comparative publications over centuries
Number of journals founded over time
Number of abstracts published over time
National share of abstracts in chemistry
National scientific size vs. economy size
Bibliographic coupling and co-citation
Web structures, links
© Tefko Saracevic
23
Examples of laws & methods
Lotka’s law
Bradford’s law
Zipf’s law
Impact factor
Citation structures
Co-citation structures
© Tefko Saracevic
24
Alfred J. Lotka 1926
 Statistics—the frequency distribution of
scientific productivity
Purpose: to "determine, if possible, the part which
men of different calibre contribute to the
progress of science“
Looked at Chemical Abstracts Index, then
Geschichtstafeln der Physik
 J. Washington Acad. Sci. 16:317-325
© Tefko Saracevic
25
Lotka’s law: xn • y = C
The total number of authors y in a given subject,
each producing x publications, is inversely
proportional to some exponential function n of x.
Where:
 x
 y
=
=
 n
=
 C
=
number of publications
no. of authors credited with x
publications
constant (equals 2 for scientific
subjects)
constant
inverse square law of scientific productivity
© Tefko Saracevic
26
No. of authors
Lotka's Law - scientific publications
1 publ.
© Tefko Saracevic
2 publ.
3 publ.
xn • y = C
4 publ.
27
Samuel Clement Bradford 1934, 1948
 Distribution of quantity vs yield of sources of
information on specific subjects
 he studied journals as sources, but applicable to other
 what journals produce how many articles in a subject and how
are they distributed? or
 How are articles in a subject scattered across journals?
 Purpose: to develop a method for identification of the
most productive journals in a subject & deal with what
he called “documentary chaos”
First published in: Engineering (1934) 137:85-86, then in his book
Documentation, (1948)
© Tefko Saracevic
28
Bradford’s law
"If scientific journals are arranged in order of
decreasing productivity of articles on a
given subject, they may be divided into a
nucleus of periodicals more particularly
devoted to the subject and several groups or
zones containing the same number of
articles as the nucleus, when the numbers of
periodicals in the nucleus and succeeding
zones will be as a : n : n2 : n3 …"
© Tefko Saracevic
29
Bradford's Law of Scattering –
an idealized example
No. of
No. of articles
per source
source journals
60
1
3 2
35
30
1
25
2
9 2
9
8
4
6
10
5
7
27 5
4
3
5
© Tefko Saracevic
Total no. of
articles
60
130
70
30
50
18 130
32
60
35
130
20
15
30
Bradford's Law of Scattering – zones
nucleus
3 sources
130 articles
9 sources
130 articles
27 sources
130 articles
© Tefko Saracevic
Garfield hypothesis
31
George Kingsley Zipf 1935, 1949
 The psycho-biology of language: an introduction
to dynamic philology (1935)
 Human behavior and the principle of least effort:
An introduction to human ecology (1949)
Looked, among others, at frequency distributions
of words in given texts
counted distribution in James Joyces’ Ulysses
Provided an explanation as to why the found
distributions happen:
Principle of least effort
© Tefko Saracevic
32
Zipf’s law: r • f = c
Where:
r =
f =
rank (in terms of frequency)
frequency (no. of times the given word
is used in the text)
c = constant for the given text
For a given text the rank of a word multiplied by
the frequency is a constant
Works well for high frequency words, not so well
for low – thus a number of modifications
© Tefko Saracevic
33
Charles F. Gosnell 1944
Obsolescence
 He studied obsolescence of books in
academic libraries via their use
• College Res. Libr. (1994) 5:115-125
But this was extended to study of articles
via citations, and other sources
 Age of citations in articles in a subject:
half life – half of the citations are x year old etc
 different subjects have very different half-lives
© Tefko Saracevic
34
Curve of obsolescence
Age at time of use
© Tefko Saracevic
35
Eugene Garfield 1955
 Focused on scientific & scholarly communication
based on citations
• Science (1995) 122:108-111
Founded Institute for Scientific Information (ISI)
major proeduct now ISI Web of Knowledge
Impact factor for journals, based on how much is a
journal cited
Mapping of a literature in a subject
Citation indexes/web of knowledge
 MAJOR resources in bibliometric studies
© Tefko Saracevic
36
Citation matrix
citing
article
cited
article
cited
article
cited
article
© Tefko Saracevic
article
citing
article
citing
article
citing
article
citing
article
citing
article
citing
article
37
Science Citation Index
Association-of-ideas index
citing
article
cited
article
cited
article
cited
article
© Tefko Saracevic
article
citing
article
citing
article
citing
article
citing
article
citing
article
citing
article
38
Co-citation analysis
Articles that cite the same article are likely to both
be of interest to the reader of the cited article
citing
article
article
citing
article
© Tefko Saracevic
These two
articles are
likely to be
related
39
Impact factor (IF)
number of citations received in current year by
papers published in the journal in the previous two
years
divided by
number of papers published in the journal in the
previous two years
IF has become over time a crucial indicator of
journal quality and
given ISI a monopoly position in the evaluation of journal
quality
Reported in Journal Citation Reports (1976-)
© Tefko Saracevic
40
Garfield’s HistCite
“Bibiliographic Analysis and Visualization
Software”
Provides citation statistics & graphs for people,
journals, institutions …
various citations scores, no. of cited references in
articles … various graphs with connections
Example: articles and authors for JASIST (and
predecessor names) for 1956-2004
includes citations to authors
© Tefko Saracevic
41
Conclusion
Bibliometrics, & related scientometrics,
infometrics, webmetrics provide insight
into a number of properties of information
objects
some general, predictive “laws” formulated
structures have been exposed, graphed
myriad data collected & analyzed
A good area for research!
© Tefko Saracevic
42
Sources used in making this
presentation– among others
Ruth Palmquist Bibliometrics
Donna Bair-Mundy Boolean, bibliometrics, and
beyond
Short set of bibliometric exercises by J. Downie
http://people.lis.uiuc.edu/~jdownie/biblio/
© Tefko Saracevic
43
Download