Supporting Online Material
Data
We parsed 8,592,483 PubMed records, extracting from each the PubMed ID,
journal name, year of publication, publication type, chemical names, and MeSH
terms. We chose the period from 1985 to 2004, which is characterized by steady
growth in the number of MeSH terms, chemical names, and articles (Supplementary Figure 1).
Our dataset for this period covers 12,039 journals that mention a total of 22,371
unique MeSH terms and 153,756 unique chemical names. There are 49 different
publication types.
Supplementary Figure 1. Number of articles, MeSH terms and
chemical names mentioned in PubMed since 1950.
Analysis
A list of terms (MeSH terms or chemical names) that accompanies most time-stamped records in the PubMed database allows us to monitor how the popularity of
terms changes over time. Given a time point t, we define $N_{t,i}$ as the number of terms
that occur in PubMed before t exactly i times. To characterize the probability of
encountering terms with the same level of popularity (number of instances in
PubMed before the same point in time), we introduce a popularity variable q that
takes integer values 0, 1, 2, …; the notation P(q | t, parameter values) represents the
expected proportion of terms with popularity q at time point t given our model
and parameter values.
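As an illustration of the popularity counts defined above, the following minimal Python sketch tallies how many distinct terms occur exactly i times before a time point t. The input layout (an iterable of `(timestamp, term)` pairs), the function name `popularity_counts`, and the toy mentions are assumptions made for this example and are not taken from the original analysis.

```python
from collections import Counter

def popularity_counts(mentions, t):
    """Return a mapping i -> number of distinct terms seen exactly i times before t.

    `mentions` is a hypothetical iterable of (timestamp, term) pairs.
    """
    # Count how often each term was mentioned strictly before time t.
    per_term = Counter(term for stamp, term in mentions if stamp < t)
    # Count how many terms share each popularity level.
    return Counter(per_term.values())

# Toy data, invented for illustration only.
mentions = [
    (1985, "insulin"), (1986, "insulin"), (1986, "apoptosis"),
    (1987, "insulin"), (1988, "prion"),
]
counts = popularity_counts(mentions, 1988)
```

Terms that have not been mentioned before t do not appear in the mapping; they correspond to popularity q = 0.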
When we model stochastic generation of scientific texts, we assume that each
time-stamped text is allowed to contain terms with zero popularity (novel terms),
and that the expected frequency of such terms is $\nu$, a parameter that we call
novelty:
$$p(q = 0 \mid \mathbf{N}_t, \nu, \tau) = \nu, \qquad (1)$$
where $\mathbf{N}_t = (N_{t,1}, N_{t,2}, \ldots)$ is a vector summarizing all popularity counts associated with time
point t. We further assume that the expected frequency of known (q > 0) terms is
$$p(q \mid \mathbf{N}_t, \nu, \tau) = (1 - \nu)\,\frac{N_{t,q}\, q^{\tau}}{\sum_{n=1}^{\infty} N_{t,n}\, n^{\tau}}, \qquad (2)$$
where $\tau$ is another model parameter that we call temperature.
We use equations (1) and (2) to compute the likelihood of any collection of term
mentions given parameter values, and, assuming an uninformative prior
parameter distribution, estimate the joint posterior distribution of the novelty and temperature parameters.
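The likelihood computation can be sketched in Python as follows. This is a simplified reading of equations (1) and (2), not the original implementation: it holds the popularity counts fixed while scoring a batch of mentions, it takes the temperature to be the exponent applied to popularity, and it replaces full posterior estimation with a search for the highest-likelihood point on a coarse grid (under a flat prior the posterior is proportional to the likelihood). The function names and the toy inputs are hypothetical.

```python
import math

def popularity_probability(q, counts, novelty, temperature):
    """Probability of a mention with popularity q, given counts {n: number of terms}."""
    if q == 0:
        return novelty  # novel term, equation (1)
    # Normalizing sum over all observed popularity levels.
    norm = sum(n_terms * n ** temperature for n, n_terms in counts.items())
    return (1.0 - novelty) * counts.get(q, 0) * q ** temperature / norm

def log_likelihood(observed_q, counts, novelty, temperature):
    """Log-likelihood of a list of observed popularity values."""
    return sum(math.log(popularity_probability(q, counts, novelty, temperature))
               for q in observed_q)

def grid_map(observed_q, counts, novelties, temperatures):
    """Grid point with the highest likelihood (flat prior assumed)."""
    grid = [(v, tau) for v in novelties for tau in temperatures]
    return max(grid, key=lambda p: log_likelihood(observed_q, counts, p[0], p[1]))

# Toy inputs, invented for illustration only.
counts = {1: 50, 2: 20, 5: 5}
observed_q = [0, 0, 1, 1, 1, 2, 2, 2, 5, 5]
best = grid_map(observed_q, counts, [0.1, 0.2, 0.3], [0.5, 1.0, 1.5])
```

The grid must exclude novelty values of exactly 0 or 1, and every observed popularity must be present in `counts`, since either case would require the logarithm of zero.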
In our analysis, we first estimate the novelty and temperature separately for topic
(MeSH) and method (chemical) content of articles published in journals that
mentioned at least 1,000 MeSH terms, and at least 1,000 chemicals within the
chosen interval, and had a known impact factor. This left us with a set of 1,757
journals. The journal’s impact factor was computed as an average of its IF values
reported between 1999 and 2004.
We use the following linear regression model within a stepwise regression analysis
framework to test for a five-way correlation among journal-specific parameters ($\tau$
and $\nu$ are temperature and novelty, respectively) and the impact factor (IF) of a
journal,
$$\mathrm{IF}_i = A\,\tau_{\mathrm{topic},i} + B\,\nu_{\mathrm{topic},i} + C\,\tau_{\mathrm{method},i} + D\,\nu_{\mathrm{method},i} + E + \mathrm{error},$$
where subscript i refers to the ith journal and A, B, C, D and E are parameters of
the linear regression model. We assume that the error term follows a normal
distribution. Our analysis shows that the estimates of B and D are not significantly
different from zero. The estimate for A is significantly larger than zero (4.55,
with 95% confidence interval [3, 6]) and the estimate for C is significantly smaller
than zero (-9.8, with 95% confidence interval [-12.6, -7]).
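For readers who want to reproduce the form of this regression, the sketch below fits an ordinary least-squares model with four predictors and an intercept by solving the normal equations in plain Python. It does not implement the stepwise selection or the confidence intervals described above, and the data are synthetic: the predictor values and the coefficients used to generate the targets are invented for the example and are not the journal data. The function name `fit_linear_model` is hypothetical.

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares; returns coefficients with the intercept last."""
    design = [list(r) + [1.0] for r in rows]  # append the intercept column
    k = len(design[0])
    # Augmented normal equations: (X^T X | X^T y).
    aug = [[sum(x[a] * x[b] for x in design) for b in range(k)]
           + [sum(x[a] * y for x, y in zip(design, targets))]
           for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Synthetic predictors: (topic temperature, topic novelty,
# method temperature, method novelty) for six made-up journals.
rows = [
    [1.0, 0.1, 0.5, 0.2], [1.2, 0.3, 0.4, 0.1], [0.8, 0.2, 0.9, 0.3],
    [1.5, 0.4, 0.7, 0.5], [1.1, 0.5, 0.3, 0.4], [0.9, 0.6, 0.6, 0.6],
]
# Noise-free targets generated from arbitrary coefficients (4, 0, -9, 0, 2).
targets = [4.0 * r[0] - 9.0 * r[2] + 2.0 for r in rows]
A, B, C, D, E = fit_linear_model(rows, targets)
```

Because the synthetic targets contain no noise, the fit should recover the generating coefficients up to rounding error; with real data a statistics package would also be needed to obtain interval estimates.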
We estimate model parameters and credible intervals for publication types using
a version of the Markov chain Monte Carlo approach (see Figures 1c and 1d; we
use the maximum posterior probability estimator in each case). Parameter
estimation for topics is done for publication types that mentioned at least 1,000
MeSH terms (the same selection strategy applies to parameter estimation for
methods). ‘Average temperature’ and ‘average novelty’ refer to weighted
averages of temperature and novelty when all publication types are considered
together.
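The text does not specify which variant of Markov chain Monte Carlo was used. As one possible illustration, the sketch below runs a random-walk Metropolis sampler over two parameters, keeps the highest-posterior state visited as the maximum posterior estimate, and reads a 95% credible interval from the retained samples. The toy log-posterior, the step size, the burn-in length, and all function names are assumptions made for this example.

```python
import math
import random

def log_posterior(novelty, temperature):
    """Toy stand-in for a real log-posterior (hypothetical, for illustration)."""
    if not 0.0 < novelty < 1.0:
        return -math.inf
    # 20 novel mentions out of 100, plus a Gaussian-shaped term in temperature.
    return (20 * math.log(novelty) + 80 * math.log(1.0 - novelty)
            - (temperature - 1.5) ** 2 / (2 * 0.1 ** 2))

def metropolis(log_post, start, step, n_steps, seed=0):
    """Random-walk Metropolis; returns all samples and the best state visited."""
    rng = random.Random(seed)
    current, current_lp = start, log_post(*start)
    best, best_lp = current, current_lp
    samples = []
    for _ in range(n_steps):
        proposal = tuple(x + rng.gauss(0.0, step) for x in current)
        proposal_lp = log_post(*proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            if current_lp > best_lp:
                best, best_lp = current, current_lp
        samples.append(current)
    return samples, best

def credible_interval(values, mass=0.95):
    """Equal-tailed interval taken from sorted samples."""
    ordered = sorted(values)
    lo = int((1.0 - mass) / 2 * len(ordered))
    hi = int((1.0 + mass) / 2 * len(ordered)) - 1
    return ordered[lo], ordered[hi]

samples, map_estimate = metropolis(log_posterior, (0.5, 1.0), 0.05, 20000)
kept = samples[2000:]  # discard an arbitrary burn-in period
novelty_interval = credible_interval([s[0] for s in kept])
```

In a real analysis the toy `log_posterior` would be replaced by the log-likelihood of the observed term mentions plus the log-prior, and convergence would need to be checked before the intervals are trusted.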
To fit the topic and method volumes to a Zipf’s (Pareto) distribution we use the
maximum likelihood estimate of the exponent parameter of Zipf’s distribution; our estimates
of the exponent for topic and method volumes are 1.153 and 1.528, respectively.
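The exact parametrization used for the fit is not stated. As a hedged illustration, the sketch below uses the standard closed-form maximum likelihood estimator for a continuous power law with density proportional to x^(-alpha) for x >= x_min, namely alpha = 1 + n / sum(ln(x_i / x_min)). A discrete Zipf fit would instead require a zeta-function normalization, and a Pareto shape parameter would differ from this alpha by one. The function name, the choice of `x_min`, and the synthetic sample are assumptions made for this example.

```python
import math

def power_law_exponent_mle(values, x_min):
    """MLE of alpha for a density proportional to x**(-alpha), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

# Synthetic check: draw evenly spaced quantiles from a power law with a
# known exponent via the inverse CDF, then recover the exponent.
true_alpha, x_min, n = 2.5, 1.0, 10000
sample = [x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0))
          for i in range(n)]
alpha_hat = power_law_exponent_mle(sample, x_min)
```

The estimator is sensitive to the choice of `x_min`; values below it are ignored, so the threshold should be set where power-law behaviour plausibly begins.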