Download patterns of journal papers and their influencing factors
Article in Scientometrics · July 2017
DOI: 10.1007/s11192-017-2456-1
Scientometrics (2017) 112:1761–1775
Yufeng Duan1,2 • Zequan Xiong1,3
Received: 1 April 2017 / Published online: 1 July 2017
© Akadémiai Kiadó, Budapest, Hungary 2017
Abstract A two-step cluster analysis was performed on the absolute downloads and the
relative downloads of Chinese journal papers published between 2006 and 2008. Four
patterns were identified from the perspective of absolute downloads; the first three patterns
can be expressed as power functions, signifying their evident aging trends, although this
does not apply to pattern 4. Two patterns were identified from the perspective of relative
downloads, and both present power distributions with minor differences in decline speed.
Furthermore, we delved into the relationships between total downloads and article features
in varying patterns and found that there are only weak correlations between total downloads and title length, number of authors, and number of keywords. However, there are
moderate to high correlations between initial downloads—defined as downloads made
during the first year after publication—and total downloads, suggesting that it is possible to
forecast total downloads according to initial downloads. Additionally, it was found that
the total downloads of highly downloaded papers have no correlation with article features.
Keywords: Downloads · Download pattern · Citation · Aging · Obsolescence · Correlations
Corresponding author: Zequan Xiong, zqxiong@library.ecnu.edu.cn
1 Faculty of Economics and Management, East China Normal University, Shanghai 200241, People’s Republic of China
2 Institute for Academic Evaluation and Development, East China Normal University, Shanghai 200241, People’s Republic of China
3 Library, East China Normal University, Shanghai 200241, People’s Republic of China
Introduction
With the development of the Internet and the digitalization of publishing, an increasing
number of academic papers can be accessed and utilized in digital form via electronic
databases, and these download behaviors are continuously recorded by usage monitoring
systems and stored as big datasets. Although these data do not include unknowable
quantities, such as user motivations or goals, they disclose direct information about the
usage preferences of articles, such as which article was used, who used it, where that
person was, when it was used and so on (Kurtz and Bollen 2010). In addition, a tally of
downloads is a timely measurement of usage. Thus, some scholars are aware of the
importance of usage data in exploring user behaviors (Davis and Solla 2003; Davis and
Price 2006; Wang et al. 2012, 2013a, b), the obsolescence principles of articles (Moed
2005; Kurtz and Bollen 2010; Wang et al. 2014a, b), or even the supplementary effects of
citations in research evaluation (Bollen and van de Sompel 2008; Wan et al. 2010; Glänzel
and Gorraiz 2015).
Although increasing attention is given to downloads, systematic research on them is
seldom conducted. Existing studies are concerned with the following three aspects: the
comparison between downloads and citations (Bollen et al. 2005; Moed 2005;
Schloegl and Gorraiz 2010, 2011; Lippi and Favaloro 2013; Lu et al. 2016); the half-life of downloads (Liu 2012; Xu 2014); and the influencing factors of downloads
(Jamali and Nikzad 2011; Guerrero-Bote and Moya-Anegon 2014; Subotic and
Mukherjee 2014).
Despite the significant correlations between downloads and citations that have been
demonstrated, there are still some problems in terms of data selection and processing.
Restricted by commercial database suppliers, most research can use only one or several
journals as data sources. Moed (2005) put forward a two-factor model of downloads
using data from Tetrahedron Letters and noted that during the first 3 months after a paper
is cited, its downloads are 25% higher than would be expected
had the paper not been cited. Jamali and Nikzad (2011) researched the
influence of title type on downloads using data from six journals in PLoS and found that
question titles and short titles tend to get more downloads. These studies evidently did not
consider journal features, such as impact factors and themes, which may also influence
downloads.
Additionally, some scholars use ScienceDirect as the data source for downloads in
combination with citations data from other databases (such as Web of Knowledge) to
analyze the correlations between downloads and citations (Schloegl and Gorraiz
2010, 2011; Schloegl et al. 2014). These studies, however, take samples as a whole unit
and ignore that there may be different download patterns for varying reasons even within
the same sample. In the study of literature aging, diverse citation patterns have been
identified based on the annual citations of highly cited articles (Aversa 1985), while
analogous studies based on annual downloads have not yet been reported. Therefore, the
mining of download patterns, the reasons for the formation of these patterns, and the
influencing factors of different types of papers will contribute to our studies on download
behaviors as well as on the correlations between downloads and citations.
In this paper, we used data from Chinese Library and Information Science (CLIS) journals to
explore whether there are diverse download patterns. We also examined how article
downloads age over time and the main factors determining downloads. The results may
pave the way for further research on the correlations between different download patterns
and citation patterns and provide theoretical foundations for using
downloads as a complementary indicator in evaluating article influence.
Methods
Data source and processing
The data were harvested from 11 CLIS journals published between 2006 and 2008 in the China
National Knowledge Infrastructure (CNKI), preliminarily resulting in 10,334 papers. These
journals are all core publications and are fully included in CNKI; further, their print publication dates are nearly in line with their online dates. Some other journals, such as Library and
Information Service and Journal of Library Science, have long online lags and thus are not
included in our research. We eliminated noisy records, such as catalogs, prefaces,
contributions, and news, and obtained DataSet 1, consisting of 9919 papers.
In DataSet 1, the data are composed of basic information such as titles, authors, journals, and downloads per year from 2006 to 2015. The actual downloads of a paper in the year following its publication are computed as $D'_{Y+1} = D_Y + \frac{D_{Y+1}}{12}(12 - M)$, and its actual downloads two years after publication are expressed as $D'_{Y+2} = \frac{D_{Y+1}}{12} M + \frac{D_{Y+2}}{12}(12 - M)$; therein, $D_Y$ denotes the raw downloads in calendar year $Y$, the publication year, and $M$ denotes the remaining months of the current year after its publication. Similarly, we obtained other years’ actual download data and stored them in DataSet 2. It should be emphasized that in these calculations, we assumed an equal number of downloads in each month throughout the year. Next, we normalized these data and obtained the ratio of downloads in each year to total downloads, $R_{Y+1} = D'_{Y+1} / \sum_{i} D'_{Y+i}$, which were stored in DataSet 3. Ultimately, we collected three datasets: the raw downloads dataset, DataSet 1; the absolute downloads dataset, DataSet 2; and the normalized downloads dataset, DataSet 3. The following is an example to demonstrate the calculations for DataSet 2 and DataSet 3 based on DataSet 1. Given a paper published in September 2008, its raw downloads in each year as of 2015 are stored in DataSet 1, shown in Table 1. In DataSet 2, $D'_{2008+1} = D_{2008} + \frac{D_{2008+1}}{12}(12 - 3) = 36 + \frac{45}{12} \times 9 = 69.75$, and $D'_{2008+2} = \frac{D_{2008+1}}{12} \times 3 + \frac{D_{2008+2}}{12}(12 - 3) = \frac{45}{12} \times 3 + \frac{30}{12} \times 9 = 33.75$. Correspondingly, in DataSet 3, $R_{2008+1} = 69.75 / (69.75 + 33.75 + 26.25 + 21.25 + 12.50 + 20.50 + 16.50) \times 100\% = 34.79\%$, and $R_{2008+2} = 33.75 / (69.75 + 33.75 + 26.25 + 21.25 + 12.50 + 20.50 + 16.50) \times 100\% = 16.83\%$.
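The re-binning from calendar years to years since publication can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, under the same equal-downloads-per-month assumption; the function names `absolute_downloads` and `normalized_downloads` are hypothetical and do not come from the paper.

```python
from typing import Dict, List


def absolute_downloads(raw: Dict[int, float], pub_year: int, months_left: int) -> List[float]:
    """Re-bin calendar-year counts (DataSet 1) into years since publication (DataSet 2).

    raw          -- raw downloads per calendar year
    pub_year     -- publication year Y
    months_left  -- M, the months remaining in year Y after publication
    """
    last_year = max(raw)
    # First year: all downloads in Y plus the first (12 - M) months of Y + 1.
    result = [raw[pub_year] + raw[pub_year + 1] / 12 * (12 - months_left)]
    # Later years: the last M months of one calendar year plus
    # the first (12 - M) months of the next one.
    for year in range(pub_year + 1, last_year):
        result.append(raw[year] / 12 * months_left
                      + raw[year + 1] / 12 * (12 - months_left))
    return result


def normalized_downloads(absolute: List[float]) -> List[float]:
    """Share of each year in the total of the observed window (DataSet 3)."""
    total = sum(absolute)
    return [value / total for value in absolute]


# The worked example from the text: a paper published in September 2008 (M = 3).
raw_2008 = {2008: 36, 2009: 45, 2010: 30, 2011: 25,
            2012: 20, 2013: 10, 2014: 24, 2015: 14}
dataset2 = absolute_downloads(raw_2008, pub_year=2008, months_left=3)
dataset3 = normalized_downloads(dataset2)
print([round(v, 2) for v in dataset2])
print([f"{r:.2%}" for r in dataset3])
```

Because the last calendar year only contributes its first 12 - M months, eight raw values yield seven re-binned years, which matches the seven columns of DataSet 2 and DataSet 3 in the example.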
Analyzing method
Normality test
We used the Q–Q (Quantile–Quantile) Plot to observe the distribution of total downloads
and examined whether it passed the K–S test. The Q–Q Plot shows a scatter plot with
observed values on the X axis and expected values on the Y axis. If all the scatter points are
close to the reference line, we can say that the dataset follows the given distribution.
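As a rough illustration of this check, the sketch below builds Q–Q pairs and the K–S distance against a normal distribution fitted to the sample, using only the Python standard library. The sample is synthetic and exactly log-normal, chosen by us to mimic skewed download counts; it is not the paper's data, and the p-value step of the K–S test is omitted.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev
from typing import List, Tuple


def qq_points(values: List[float]) -> List[Tuple[float, float]]:
    """Pair each sorted observation (X) with its expected normal value (Y)."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n is one common plotting position; other conventions exist.
    return [(x, fitted.inv_cdf((i + 0.5) / n)) for i, x in enumerate(ordered)]


def ks_statistic(values: List[float]) -> float:
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    gaps = []
    for i, x in enumerate(ordered):
        cdf = fitted.cdf(x)
        # Compare the fitted CDF with the empirical step just after and just before x.
        gaps.append(max((i + 1) / n - cdf, cdf - i / n))
    return max(gaps)


# Synthetic, exactly log-normal "download counts" (illustrative only).
n = 50
downloads = [exp(5 + NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]
log_downloads = [log(d) for d in downloads]

print("K-S distance, raw counts:", round(ks_statistic(downloads), 3))
print("K-S distance, log counts:", round(ks_statistic(log_downloads), 3))
```

For a skewed sample like this one, the raw counts sit far from the reference line while their logarithms sit close to it, which is the behavior the Q–Q plot is meant to reveal.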
Table 1 An example of calculations for different datasets
DataSet 1: raw download
Year       2008   2009   2010   2011   2012   2013   2014   2015
Downloads  36     45     30     25     20     10     24     14
DataSet 2: absolute download
Year       1st      2nd      3rd      4th      5th      6th      7th
Downloads  69.75    33.75    26.25    21.25    12.50    20.50    16.50
DataSet 3: normalized download
Year       1st      2nd      3rd      4th      5th      6th      7th
Share      34.79%   16.83%   13.09%   10.60%   6.23%    10.22%   8.23%
Cluster analysis
A two-step cluster analysis was conducted to explore different download patterns. The two-step cluster method is a scalable cluster analysis algorithm designed to handle very large datasets. It can handle both continuous and categorical variables and is also considered more reliable and accurate than traditional clustering methods such as the k-means clustering algorithm (Norusis 2007). As the name suggests, the two-step cluster procedure involves two distinct steps: (1) pre-cluster the cases (or records) into many small sub-clusters; (2) cluster the sub-clusters resulting from the pre-cluster step into the desired number of clusters. This procedure can also automatically select the number of clusters (Tkaczynski 2017). In this study, the annual downloads in DataSet 2 and DataSet 3 were each used as continuous variables, and the Bayesian information criterion (BIC) was used for clustering. The clustering quality is estimated by the silhouette measure of cohesion and separation, which measures the relationship of the variables within and between clusters. A score above 0.0 indicates that the within-cluster distance and the between-cluster distance are valid among the different variables (Tkaczynski 2017).
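To make the silhouette measure concrete, here is a minimal sketch of the classic per-point silhouette coefficient averaged over all points. It assumes Euclidean distance and at least two clusters with distinct points; the statistical package used for the two-step procedure may compute a simplified variant, so treat this as an illustration of the idea rather than the exact measure reported in the paper. The yearly profiles below are made up.

```python
from math import dist
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # A point alone in its cluster gets a score of 0 by convention.
            scores.append(0.0)
            continue
        # a: cohesion, the mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: separation, the mean distance to the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical three-year download profiles for four papers.
profiles = [(40, 20, 15), (38, 22, 14), (230, 180, 160), (225, 190, 150)]
print(round(silhouette_score(profiles, [0, 0, 1, 1]), 3))   # well separated
print(round(silhouette_score(profiles, [0, 1, 0, 1]), 3))   # poorly assigned
```

Scores near 1 mean that papers sit much closer to their own cluster than to any other; scores below 0 suggest that many papers would fit better in a different cluster.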
Correlation analysis
The Spearman correlation coefficient was used to test the correlations between
downloads and the features of papers in the context of different patterns.
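For readers unfamiliar with the statistic, the Spearman coefficient is the Pearson correlation of the ranks, with tied values sharing the average of their ranks. The sketch below implements that definition in plain Python; the five sample values are hypothetical and only show the call pattern.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1   # mean of the 1-based ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical initial and total downloads for five papers.
initial = [12, 40, 40, 95, 230]
total = [90, 150, 310, 280, 1500]
print(round(spearman(initial, total), 3))
```

Because only ranks enter the calculation, a few extremely highly downloaded papers cannot dominate the coefficient, which is why a rank-based measure suits skewed download counts.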
Half-life analysis
Analogously to the concept of cited half-life, the download half-life of a paper is defined as
‘‘the median age of the articles that were downloaded in the considered year’’ (Schloegl
and Gorraiz 2010). The half-lives of different download patterns can be computed
from the average downloads of papers per year.
Results
Normality test
We drew the Q–Q probability plot of downloads, shown in the left chart of Fig. 1. In this normality test, the X axis represents the observed values in DataSet 2 and the Y axis represents the expected normal values. Overall, the curve of the total downloads of most articles deviates significantly from the expected values and presents a remarkably skewed distribution, which is in line with the results of Lu et al. (2016). Given these results, we employed the Spearman correlation coefficient rather than the Pearson coefficient for further correlation tests.
Fig. 1 Probability plot of downloads (left) and the log-transformed downloads (right)
We converted the total downloads of each paper into logarithms and obtained a new log-transformed dataset. We also used the Q–Q plot to test whether its distribution is normal (the right chart in Fig. 1). The result shows that the log-transformed data fit a normal distribution satisfactorily. Therefore, a logarithmic function was applied to fit the total downloads distribution, shown in Fig. 2, and the regression equation is $y = -259.1\ln(x) + 2409.4$ with $R^2 = 0.930$.
Fig. 2 Frequency distribution of total downloads of papers
Dynamic changing patterns of absolute downloads
The data of absolute downloads in DataSet 2 can be divided into four groups representing
four different changing patterns after two-step cluster analysis with a silhouette measure of
cohesion and separation of 0.5. The detailed information is shown in Table 2.
The annual downloads of each pattern were computed, as shown in Table 3 and Fig. 3. Patterns 1, 2, and 3 show the same tendency: their download counts peak in the first year and decline thereafter, which fits negative power functions. They differ in absolute downloads: each year, the absolute downloads of pattern 2 and pattern 3 are roughly 2.3–2.6 times and 3.7–6.7 times those of pattern 1, respectively.
Table 2 Information of each pattern from DataSet 2
Pattern  Paper number included  Proportion of the total (%)  Fitting expression and goodness of fit
1        4885                   49.25                        $y_1 = 36.219x_1^{-0.744}$; $R^2 = 0.992$
2        3512                   35.41                        $y_2 = 84.428x_2^{-0.712}$; $R^2 = 0.992$
3        1328                   13.39                        $y_3 = 132.75x_3^{-0.482}$; $R^2 = 0.944$
4        194                    1.96                         $y_4 = 7.7456x_4^2 - 63.552x_4 + 282.32$; $R^2 = 0.938$
Table 3 Annual downloads in four patterns
Pattern  1st year  2nd year  3rd year  4th year  5th year  6th year  7th year
1        38.13     21.08     14.75     12.65     11.23     9.86      8.70
2        88.07     51.43     36.15     30.04     26.62     24.31     22.19
3        142.85    94.26     71.86     62.13     58.21     58.49     58.35
4        230.24    184.24    157.22    147.85    159.31    194.21    208.08
Fig. 3 Changing trend of the four patterns in terms of average absolute downloads (x axis: time, 1st to 7th year; y axis: average downloads)
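A power function of this kind can be fitted by ordinary least squares on the log-transformed values, since $y = ax^{b}$ becomes $\ln y = \ln a + b \ln x$. The sketch below assumes that this is how the fit is obtained, which the paper does not state; applied to the pattern 1 averages from Table 3, it gives a coefficient, exponent, and goodness of fit close to the values listed in Table 2. The function name `fit_power_law` is ours.

```python
from math import exp, log
from typing import Sequence, Tuple


def fit_power_law(y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = a * x**b for x = 1, 2, ... by least squares on (ln x, ln y).

    Returns (a, b, r_squared); r_squared is measured in log-log space.
    """
    lx = [log(i + 1) for i in range(len(y))]
    ly = [log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    syy = sum((v - my) ** 2 for v in ly)
    b = sxy / sxx                    # slope in log-log space = exponent
    a = exp(my - b * mx)             # intercept in log-log space = ln(a)
    return a, b, sxy ** 2 / (sxx * syy)


pattern_1 = [38.13, 21.08, 14.75, 12.65, 11.23, 9.86, 8.70]   # Table 3, pattern 1
a, b, r2 = fit_power_law(pattern_1)
print(f"y = {a:.3f} * x^{b:.3f}, R^2 = {r2:.3f}")
```

The same function does not suit pattern 4, whose decline-then-rise shape is not monotonic; a low goodness of fit from this routine is one quick signal that a different functional form is needed.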
By contrast, the downloads in pattern 4 decline and then rise, hitting bottom
in the 4th year before climbing by the 7th year to nearly the number of downloads in the 1st
year. The best-fitting function is a quadratic polynomial, and the absolute count is 6.04–23.92
times that of pattern 1.
Table 4 shows the download half-life, total downloads, and average downloads for each
pattern. Patterns 1 and 2, both with a 2-year-half-life, account for 85% of the total sample
and contribute to only 60% of total downloads. By contrast, pattern 4, which represents
highly downloaded papers, accounts for less than 2% of the total sample and contributes
more than 10% to the total downloads, with average downloads up to 1530. Pattern 4 also
shows a lower aging rate with a half-life of 3.47.
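The arithmetic behind these half-lives is not spelled out. One reading that appears to match the reported values is linear interpolation on the cumulative average downloads: find the point in time at which half of the downloads in the seven-year window have accumulated, assuming downloads are spread evenly within each year. Under that assumption, the Table 3 averages give 1.95 for pattern 1 and 3.47 for pattern 4. The function name `download_half_life` is hypothetical.

```python
from typing import Sequence


def download_half_life(annual: Sequence[float]) -> float:
    """Years needed to accumulate half of the downloads in the observed window.

    Assumes non-negative values and an even spread of downloads within each year.
    """
    half = sum(annual) / 2
    if half <= 0:
        raise ValueError("annual downloads must contain at least one positive value")
    cumulative = 0.0
    for year_index, downloads in enumerate(annual):
        if cumulative + downloads >= half:
            # Interpolate inside the year in which the halfway point is crossed.
            return year_index + (half - cumulative) / downloads
        cumulative += downloads
    raise ValueError("halfway point not reached; check the input values")


table_3 = {
    1: [38.13, 21.08, 14.75, 12.65, 11.23, 9.86, 8.70],
    2: [88.07, 51.43, 36.15, 30.04, 26.62, 24.31, 22.19],
    3: [142.85, 94.26, 71.86, 62.13, 58.21, 58.49, 58.35],
    4: [230.24, 184.24, 157.22, 147.85, 159.31, 194.21, 208.08],
}
for pattern, annual in table_3.items():
    print(pattern, round(download_half_life(annual), 2))
```

Note that a half-life computed this way depends on the length of the observation window: a pattern whose downloads are still rising at the end of the window, such as pattern 4, would show a longer half-life if more years were observed.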
Dynamic changing pattern of relative download
We continued using the two-step cluster analysis for DataSet 3. The sample of 9919 papers was ultimately divided into two clusters, with a silhouette measure of cohesion and separation of 0.4, representing two different relative download patterns. This basic information is shown in Table 5.
The proportions of annual downloads to total downloads in each pattern are shown in Table 6 and Fig. 4. Overall, the annual proportions in both patterns show a downtrend. The download half-life and the proportion of the total sample in each pattern are shown in Table 7. Pattern A, with a half-life of 1.54, has higher proportions than pattern B, with a half-life of 2.68, in the first 2 years, but it also declines faster: starting from the third year, pattern A’s annual proportion falls behind that of pattern B.
We also identified the aging trend of the two patterns from their absolute downloads: the absolute downloads of pattern A are larger than those of pattern B in the 1st year but drop by about 50% the following year and are lower than those of pattern B in subsequent years.
Table 4 Download half-life and average download in four patterns
Pattern  Download half-life  Sample number  Proportion of total sample (%)  Total download  Proportion of total download (%)  Average download
1        1.95                4885           49.25                           620,142         22.04                             126.95
2        2.00                3512           35.41                           1,072,632       38.13                             305.42
3        2.50                1328           13.39                           823,751         29.28                             620.30
4        3.47                194            1.96                            296,957         10.56                             1530.71
Table 5 Information of each pattern from DataSet 3
Pattern  Paper number included  Proportion of the total (%)  Fitting expression and goodness of fit
A        4426                   44.62                        $y_A = 0.3943x_A^{-1.029}$; $R^2 = 0.9994$
B        5493                   55.38                        $y_B = 0.2291x_B^{-0.421}$; $R^2 = 0.9682$
Table 6 Annual download proportion and average download in DataSet 3
Pattern  Measure            1st year  2nd year  3rd year  4th year  5th year  6th year  7th year
A        Relative download  39.16%    19.90%    12.36%    9.43%     7.57%     6.25%     5.32%
A        Absolute download  81.13     42.02     26.21     19.77     15.95     13.58     11.79
B        Relative download  24.35%    16.63%    13.26%    12.30%    11.75%    11.19%    10.53%
B        Absolute download  67.52     47.07     38.04     34.77     33.86     34.37     33.88
Fig. 4 Changing trend of the two download patterns in DataSet 3 (x axis: time, 1st to 7th year; y axis: download ratio, %)
Table 7 Download half-life and average downloads in the two patterns in DataSet 3
Pattern  Download half-life  Sample number  Proportion of total sample (%)  Total download  Proportion of total download (%)  Average download
A        1.54                4426           44.62                           995,491         35.38                             224.92
B        2.68                5493           55.38                           1,817,991       64.62                             330.97
Matrix analysis on absolute and relative download changing patterns
We established a sample matrix and a download matrix for DataSets 2 and 3, as shown in Tables 8 and 9. As is evident, patterns 1 and 2, with smaller absolute downloads, are distributed almost equally between patterns A and B, while patterns 3 and 4, with larger downloads, are concentrated in pattern B. The distribution indicates that articles with higher downloads tend to show smaller changes in annual download proportion and have longer half-lives.
Table 8 Sample matrix of different patterns in the two datasets
Pattern  1               2               3               4             Total
A        2442 (24.62%)   1681 (16.95%)   291 (2.93%)     12 (0.12%)    4426 (44.62%)
B        2443 (24.63%)   1831 (18.46%)   1037 (10.46%)   182 (1.84%)   5493 (55.38%)
Total    4885 (49.25%)   3512 (35.41%)   1328 (13.39%)   194 (1.96%)   9919 (100%)
Table 9 Download matrix of different patterns in the two datasets
Pattern  1                  2                    3                  4                  Total
A        312,339 (11.10%)   497,834 (17.70%)     171,579 (6.10%)    13,739 (0.49%)     995,491 (35.38%)
B        307,803 (10.94%)   574,798 (20.43%)     652,172 (23.18%)   283,218 (10.07%)   1,817,991 (64.62%)
Total    620,142 (22.04%)   1,072,632 (38.13%)   823,751 (29.28%)   296,957 (10.56%)   2,813,482 (100%)
After sorting articles by downloads, the top 1% of papers, 99 in total, were identified as
highly downloaded papers. These received 190,665 downloads, comprising 6.78% of the
total downloads. Moreover, they all belong to pattern 4 in terms of absolute downloads; 97
of them are attributed to pattern B in terms of relative download, with the other two
belonging to pattern A.
Correlation analysis of download and paper features
Generally, people search for targeted papers by way of titles, keywords, and authors before
downloading them, so papers with longer titles, a larger number of keywords and more
authors are more likely to be retrieved and then downloaded. Hence, we analyzed the correlation
between total downloads of articles and their features, including title length, number of
authors, number of keywords, and impact factor, as shown in Table 10. Additionally, we
analyzed the correlation between total downloads and initial downloads—defined as
downloads made during the first year after publication—to observe the impact of earlier
downloads on total downloads.
Table 10 Correlations between total download and paper features in different patterns
Feature
Pattern
1
2
Title length
0.018
-0.062**
Number of authors
0.185**
0.043
3
4
-0.071**
-0.116
-0.010
Total sample
0.180*
0.019
0.253**
Number of keywords
0.228*
0.006
-0.028
-0.065
0.174**
Impact factor
0.127**
0.087**
-0.004
0.064
0.055**
Initial downloads
0.731**
0.474**
0.411**
0.869**
0.466**
** Correlation is significant at the 0.01 level (2-tailed); * correlation is significant at the 0.05 level (2-tailed)
As shown in Table 10, there are only weak correlations between total downloads and features such as title length, number of authors, and number of keywords, regardless of pattern. Moreover, the correlations between downloads and individual features vary across patterns. Title length has a weak negative correlation with total downloads in patterns 2 and 3: the longer the title, the lower the total downloads. In patterns 1 and 4, by contrast, total downloads show no correlation with title length but a weak positive correlation with number of authors. Overall, the higher correlations between total downloads and number of authors, number of keywords and impact factor are seen in pattern 1.
High correlations between total downloads and initial downloads are observed only in the whole sample and in pattern 1, with coefficients of 0.869 and 0.731, respectively. Hence, taking initial downloads as the independent variable x and total downloads as the dependent variable y (shown in Fig. 5), we carried out curve estimation on the whole sample in the new DataSet 2 and found that the best-fitting curve is the power function $y = 7.198x^{0.839}$ ($R^2 = 0.751$).
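As a small illustration of how the fitted curve could be used for forecasting, the sketch below simply evaluates the reported power function. The function name and the input value are ours, and with $R^2 = 0.751$ the output should be read as a rough central estimate for a paper in this sample rather than a precise prediction.

```python
def predict_total_downloads(initial_downloads: float,
                            coefficient: float = 7.198,
                            exponent: float = 0.839) -> float:
    """Evaluate the fitted curve y = 7.198 * x**0.839 for first-year downloads x."""
    if initial_downloads <= 0:
        raise ValueError("the curve is only evaluated here for positive downloads")
    return coefficient * initial_downloads ** exponent


# Hypothetical paper with 100 downloads in its first year.
print(round(predict_total_downloads(100), 1))
```

Because the exponent is below 1, predicted total downloads grow less than proportionally with initial downloads: doubling the first-year count raises the estimate by a factor of about $2^{0.839}$, not 2.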
We further analyzed the relationships between themes and patterns. In accordance with
Chinese Library Classification, we assigned every paper a classification number denoting a
certain theme. Classification numbers with more than 100 papers were selected for further
research, as shown in Table 11. Regardless of theme, the average downloads in pattern B
are higher than those in pattern A.
Next, we focused on the features of highly downloaded articles and analyzed the correlations between total downloads and title length, number of authors, and number of
keywords as well as impact factors, as shown in Table 12. It turns out that total downloads
are not correlated with most features but have a weak negative correlation with
initial downloads.
Fig. 5 Correlation between initial downloads and total downloads
Table 11 Download patterns in different themes
                                                                                    Pattern A                                   Pattern B
Classification  Theme                                                Total   Paper number  Download  Average download    Paper number  Download  Average download
G250            Library science                                      1598    762           173,050   227.10              836           272,833   326.36
TP39            Application of computer                              927     472           101,171   214.35              455           154,905   340.45
G252            Service for readers                                  758     298           56,327    189.02              460           122,723   266.79
G259            Librarianship of the world                           591     307           59,543    193.95              284           85,812    302.16
G258            Various types of library                             532     321           71,961    224.18              211           81,110    384.41
G251            Library management                                   366     80            10,470    130.88              286           51,332    179.48
F49             Information industry economics (pandect)             331     219           42,148    192.46              113           45,504    402.69
G201            Information theory                                   329     205           51,514    251.29              124           56,764    457.77
G255            Various document service                             316     146           26,888    184.16              170           45,226    266.04
G353            The processing of intelligence data                  307     138           20,367    147.59              169           37,387    221.23
G203            Information resource management                      274     108           28,397    262.94              166           75,037    452.03
G256            Philology                                            254     150           50,647    337.65              104           46,556    447.65
F272            Enterprise planning and business decision making     243     151           50,150    332.12              92            45,771    497.51
F270            Enterprise economic theory                           238     13            991       76.231              225           31,460    139.82
G350            Information science                                  230     93            21,659    232.89              137           58,704    428.45
G254            Document indexing and cataloging                     216     59            11,428    193.70              157           77,492    493.58
G253            Collection development and collection organization   197     81            12,440    153.58              116           20,904    180.21
G354            Information retrieval                                121     38            8915      234.61              83            37,515    451.99
TP31            Computer software                                    126     53            11,902    224.57              73            30,261    414.53
Table 12 Correlations between downloads of highly downloaded papers and paper features
                         Title length  Author number  Keyword number  Impact factor  Initial downloads
Correlation coefficient  -0.037        0.128          -0.020          0.030          -0.281**
Significance             0.718         0.206          0.844           0.769          0.005
** Correlation is significant at the 0.01 level (2-tailed)
Discussion
Literature aging law based on downloads
Traditional research on the law of literature aging relies mainly on citations and has generated a series of mathematical models, such as the negative exponential model, the Burton–Kebler aging equation, the Brookes cumulative exponential model, and the Avramescu equation. It has also been reported that there are different citation patterns for articles (Avramescu 1979; Aversa 1985). Meanwhile, because of the limited accessibility of data, there is little research on the aging law based on downloads, let alone on whether different patterns exist. With two-step cluster analysis, article downloads can be separated into distinct patterns. As far as absolute downloads are concerned, patterns 1, 2 and 3 follow the power function $y = at^{-b}$, where $a$ denotes the initial downloads and $b$ represents the rate of change of downloads. Pattern 4 shows a decline followed by a rise, akin to a quadratic function. Taking the inevitable aging trend of literature into consideration, we conjecture that pattern 4 will eventually follow a power function as the other three do. Within the time frame of our study, the downloads in pattern 4 deviate from this aging trajectory, which may be due to the effect of citation. Moed (2005) noted that during the first 3 months after an article is cited, its downloads rose 25% compared with the non-cited condition; Schloegl et al. (2014) also reported that the downloads of an article always show an increase after being cited. Assuming that $z$ denotes the influence of citations in period $t$ on the downloads in period $t + 1$, we speculate that a more general function for absolute downloads changing over time is $Y = \alpha z + (1 - \alpha)y$ ($0 < \alpha < 1$). For most papers, this influence can be ignored because they receive few citations, namely, $\alpha \approx 0$ and thus $Y = y$.
From the perspective of relative downloads, the rising trend seen in pattern 4 does not appear, and both patterns follow power functions. Under relative downloads, the total downloads of each paper are set to 1, which eliminates the differences in absolute downloads among papers and thus yields a more generalized literature aging law: the downloads of a paper peak in the first year after its publication and then gradually fall, finally approaching 0, at which point the paper can be regarded as dead.
There are pros and cons to both perspectives. Concerning absolute downloads, we
identified the ‘‘unusual’’ download pattern of highly downloaded papers and speculated on
the impact of citations on downloads. However, this perspective relies on a research time
window that is too narrow to forecast the developing trend of pattern 4 in the future.
Regarding relative downloads, we focus more on the changing trend of downloads over
time to explore the aging pattern of literature but ignore some factors that may impact
absolute downloads, resulting in a generalized aging pattern.
Influencing factors of paper downloads
As mentioned above, there is a body of research on the correlations between paper
citations and their influencing factors, including title type (Jamali and Nikzad 2011; Fox and
Burns 2015), title length (Jacques and Sebire 2010; Jamali and Nikzad 2011), number of
authors (Borsuk et al. 2009; Rao 2014) and number of keywords (Uddin and Khan 2016),
but research on correlations between downloads and these factors is rare. At present, the
most relevant research is concerned with the correlation between downloads and title
length. Some scholars found that articles with short titles tend to receive more downloads
than the long ones (Jamali and Nikzad 2011; Lin 2012), while some others came to the
opposite conclusion (Habibzadeh and Yadollahie 2010; Jacques and Sebire 2010).
Therefore, the disparate results could be due to data differences. In this paper, we find
weak correlations between downloads and paper features including title length and number
of authors, as well as number of keywords. Moreover, downloads show different reactions
to these features in different patterns: for example, title length has a weak negative correlation with downloads in patterns 2 and 3 but no correlation in patterns 1, 4 and the total
sample. It is unreliable to simply study the relationship between only a single feature and
downloads without considering the different influencing factors of downloads that could
give rise to different changing patterns.
We found moderate to high correlations between initial downloads and total downloads, with a coefficient of up to 0.869 for the whole sample. Across patterns, the higher the downloads, the weaker the correlation. Similarly, the total downloads of highly downloaded papers show no significant correlations with paper features, indicating that article quality, rather than these features, contributes to their higher downloads. In view of the decline-then-rise trend of highly downloaded papers, and considering the lag of paper citation, we speculate that citation plays a promoting role in the rise of downloads. After the source paper is downloaded, references to it would garner more attention from researchers and contribute to more downloads, which also makes its download aging pattern different from the patterns of other types of papers. Meanwhile, higher correlations between total downloads and number of authors, number of keywords and impact factor can be observed in pattern 1. These correlations indicate that the downloads of rarely downloaded articles may depend primarily on factors that increase the probability of being retrieved, such as longer titles, more keywords and more authors.
Acknowledgements The authors are grateful to China National Knowledge Infrastructure (CNKI) for making
available the download files analyzed in this paper.
Appendix
See Table 13.
Table 13 Basic information for 11 journals
Journal                                               Paper number  Total download  Average download
Journal of Academic Libraries                         431           184,902         429.007
Information Studies: Theory and Application           671           271,441         404.532
Information Science                                   1211          448,009         369.950
Document Information and Knowledge                    450           155,264         345.031
Journal of Information                                1854          591,758         319.179
New Technology of Library and Information Service     739           193,928         262.419
Library and Information                               647           163,611         252.876
Library Tribune                                       1267          280,108         221.080
Library Journal                                       986           200,474         203.320
Library Work and Study                                871           172,700         198.278
Library                                               792           151,287         191.019
Total                                                 9919          2,813,482       283.646
References
Aversa, E. S. (1985). Citation patterns of highly cited papers and their relationship to literature aging—A
study of the working literature. Scientometrics, 7(3–6), 383–389.
Avramescu, A. (1979). Actuality and obsolescence of scientific literature. Journal of the American Society
for Information Science, 30(5), 296–303.
Bollen, J., de Sompel, H. V., Smith, J. A., & Luce, R. (2005). Toward alternative metrics of journal impact:
A comparison of download and citation data. Information Processing and Management, 41(6),
1419–1440.
Bollen, J., & van de Sompel, H. (2008). Usage impact factor: The effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and Technology, 59(1),
136–149.
Borsuk, R. M., Budden, A. E., Leimu, R., Aarssen, L. W., & Lortie, C. J. (2009). The influence of author
gender, national language and number of authors on citation rate in ecology. Open Ecology Journal,
2(1), 25–28.
Davis, P. M., & Price, J. S. (2006). Ejournal interface can influence usage statistics: Implications for
libraries, publishers, and project counter. Journal of the Association for Information Science and
Technology, 57(9), 1243–1248.
Davis, P. M., & Solla, L. R. (2003). An IP-level analysis of usage statistics for electronic journals in
chemistry: Making inferences about user behavior. Journal of the Association for Information Science
and Technology, 54(11), 1062–1068.
Fox, C. W., & Burns, C. S. (2015). The relationship between manuscript title structure and success: Editorial
decisions and citation performance for an ecological journal. Ecology and Evolution, 5(10),
1970–1980.
Glänzel, W., & Gorraiz, J. (2015). Usage metrics versus altmetrics: Confusing terminology? Scientometrics,
102(3), 2161–2164.
Guerrero-Bote, V. P., & Moya-Anegon, F. (2014). Relationship between downloads and citations at journal
and paper levels, and the influence of language. Scientometrics, 101(2), 1043–1065.
Habibzadeh, F., & Yadollahie, M. (2010). Are shorter article titles more attractive for citations? Crosssectional study of 22 scientific journals. Croatian Medical Journal, 51(2), 165–170.
Jacques, T. S., & Sebire, N. J. (2010). The impact of article titles on citation hits: An analysis of general and
specialist medical journals. JRSM Short Reports, 1(1), 2.
Jamali, H. R., & Nikzad, M. (2011). Article title type and its relation with the number of downloads and
citations. Scientometrics, 88(2), 653–661.
Kurtz, M. J., & Bollen, J. (2010). Usage bibliometrics. Annual Review of Information Science and Technology, 44, 3–64.
123
Scientometrics (2017) 112:1761–1775
1775
Lin, J. (2012). Article title and its relation with the number of downloads and citations. Journal of Academic
Libraries, 04, 14–17.
Lippi, G., & Favaloro, E. J. (2013). Article downloads and citations: Is there any relationship? Clinica
Chimica Acta, 415, 195.
Liu, X. (2012). Download’ half-life establishment of scientific journal and bibliometrics importance. Chinese Journal of Scientific and Technical Periodicals, 04, 561–564.
Lu, W., Qian, K., & Tang, X. (2016). Correlation analysis of paper download and citations-with library and
information field. Information Science, 01, 3–8.
Moed, H. F. (2005). Statistical relationships between downloads and citations at the level of individual
documents within a single journal. Journal of the American Society for Information Science and
Technology, 56(10), 1088–1097.
Norusis, M. J. (2007). SPSS 15.0 advanced statistical procedures companion. Chicago, IL: Prentice Hall.
Rao, I. K. R. (2014). Weak relations among the impact factors, number of citations, references and authors.
Collnet Journal of Scientometrics and Information Management, 8(1), 17–30.
Schloegl, C., & Gorraiz, J. (2010). Comparison of citation and usage indicators: The case of oncology
journals. Scientometrics, 82(3), 567–580.
Schloegl, C., & Gorraiz, J. (2011). Global usage versus global citation metrics: The case of pharmacology
journals. Journal of the American Society for Information Science and Technology, 62(1), 161–170.
Schloegl, C., Gorraiz, J., Gumpenberger, C., Jack, K., & Kraker, P. (2014). Comparison of downloads,
citations and readership data for two information systems journals. Scientometrics, 101(2), 1113–1128.
Subotic, S., & Mukherjee, B. (2014). Short and amusing: The relationship between title characteristics,
downloads, and citations in psychology articles. Journal of Information Science, 40(1), 115–124.
Tkaczynski, A. (2017). Segmentation using two-step cluster analysis. In T. Dietrich, S. Rundle-Thiele, & K.
Kubacki (Eds.), Segmentation in social marketing: Process, methods and application (pp. 109–125).
Singapore: Springer Singapore.
Uddin, S., & Khan, A. (2016). The impact of author-selected keywords on citation counts. Journal of
Informetrics, 10(4), 1166–1177.
Wan, J. K., Hua, P. H., Rousseau, R., & Sun, X. K. (2010). The journal download immediacy index (DII):
Experiences using a chinese full-text database. Scientometrics, 82(3), 555–566.
Wang, X., Mao, W., Xu, S., & Zhang, C. (2014a). Usage history of scientific literature: Nature metrics, and
metrics of nature, publications. Scientometrics, 98(3), 1923–1933.
Wang, X., Peng, L., Zhang, C., Xu, S., Wang, Z., Wang, C., et al. (2013a). Exploring scientists’ working
timetable: A global survey. Journal of Informetrics, 7(3), 665–675.
Wang, X., Wang, Z., Mao, W., & Liu, C. (2014b). How far does scientific community look back? Journal of
Informetrics, 8(3), 562–568.
Wang, X., Wang, Z., & Xu, S. (2013b). Tracing scientist’s research trends realtimely. Scientometrics, 95(2),
717–729.
Wang, X., Xu, S., Peng, L., Wang, Z., Wang, C., Zhang, C., et al. (2012). Exploring scientists’ working
timetable: Do scientists often work overtime? Journal of Informetrics, 6(4), 655–660.
Xu, X. (2014). Empirical research on half-life period of journal based on downloads. Journal of Intelligence,
06, 117–121.