Download patterns of journal papers and their influencing factors

Yufeng Duan¹,² • Zequan Xiong¹,³

Scientometrics (2017) 112:1761–1775
DOI 10.1007/s11192-017-2456-1
Received: 1 April 2017 / Published online: 1 July 2017
© Akadémiai Kiadó, Budapest, Hungary 2017

Abstract A two-step cluster analysis was performed on the absolute downloads and the relative downloads of Chinese journal papers published between 2006 and 2008. Four patterns were identified from the perspective of absolute downloads; the first three patterns can be expressed as power functions, signifying their evident aging trends, although this does not apply to pattern 4. Two patterns were identified from the perspective of relative downloads, and both present power distributions with minor differences in decline speed. Furthermore, we examined the relationships between total downloads and article features in the different patterns and found only weak correlations between total downloads and title length, number of authors, and number of keywords. However, there are moderate to high correlations between initial downloads (defined as downloads made during the first year after publication) and total downloads, suggesting that it is possible to forecast total downloads from initial downloads. Additionally, the total downloads of highly downloaded papers show no correlation with article features.
Keywords Downloads · Download pattern · Citation · Aging · Obsolescence · Correlations

✉ Zequan Xiong, zqxiong@library.ecnu.edu.cn
1 Faculty of Economics and Management, East China Normal University, Shanghai 200241, People's Republic of China
2 Institute for Academic Evaluation and Development, East China Normal University, Shanghai 200241, People's Republic of China
3 Library, East China Normal University, Shanghai 200241, People's Republic of China

Introduction

With the development of the Internet and the digitalization of publishing, an increasing number of academic papers can be accessed and utilized in digital form via electronic databases, and these download behaviors are continuously recorded by usage monitoring systems and stored as big datasets. Although these data do not include unknowable quantities, such as user motivations or goals, they disclose direct information about the usage preferences of articles, such as which article was used, who used it, where that person was, when it was used and so on (Kurtz and Bollen 2010). In addition, a tally of downloads is a timely measurement of usage. Thus, some scholars are aware of the importance of usage data in exploring user behaviors (Davis and Solla 2003; Davis and Price 2006; Wang et al. 2012, 2013a, b), the obsolescence principles of articles (Moed 2005; Kurtz and Bollen 2010; Wang et al. 2014a, b), or even the supplementary effects of citations in research evaluation (Bollen and van de Sompel 2008; Wan et al. 2010; Glänzel and Gorraiz 2015).

Although increasing attention is given to downloads, systematic research on them is seldom conducted. Existing studies are concerned with the following three aspects: the comparison between downloads and citations (Bollen et al. 2005; Moed 2005; Schloegl and Gorraiz 2010, 2011; Lippi and Favaloro 2013; Lu et al.
2016); the half-life of downloads (Liu 2012; Xu 2014); and the influencing factors of downloads (Jamali and Nikzad 2011; Guerrero-Bote and Moya-Anegon 2014; Subotic and Mukherjee 2014).

Despite the significant correlations between downloads and citations that have been demonstrated, there are still some problems in terms of data selection and processing. Restricted by commercial database suppliers, most research can use only one or several journals as data sources. Moed (2005) put forward the two-factor model of downloads using data from Tetrahedron Letters and noted that during the first 3 months after a paper is cited, its downloads are 25% higher than would be expected had the paper not been cited. Jamali and Nikzad (2011) researched the influence of title type on downloads using data from six journals in PLoS and found that question titles and short titles tend to get more downloads. These studies evidently did not consider journal features, such as impact factors and themes, which may also influence downloads. Additionally, some scholars use ScienceDirect as the data source for downloads in combination with citation data from other databases (such as Web of Knowledge) to analyze the correlations between downloads and citations (Schloegl and Gorraiz 2010, 2011; Schloegl et al. 2014). These studies, however, treat the sample as a single unit and ignore that there may be different download patterns for varying reasons even within the same sample. In the study of literature aging, diverse citation patterns have been identified based on the annual citations of highly cited articles (Aversa 1985), while analogous studies based on annual downloads have not yet been reported.
Therefore, the mining of download patterns, the reasons for the formation of these patterns, and the influencing factors of different types of papers will contribute to our studies on download behaviors as well as on the correlations between downloads and citations.

In this paper, we used data from Chinese Library and Information Science (CLIS) to explore whether there are diverse download patterns. We also examined how article downloads age over time and the main factors determining downloads, which may pave the way for further research on the correlations between different download patterns and citation patterns and provide theoretical foundations for the feasibility of using downloads as a complementary indicator for article influence evaluation.

Methods

Data source and processing

The data were harvested from 11 CLIS journals between 2006 and 2008 in the China National Knowledge Infrastructure (CNKI), preliminarily resulting in 10,334 papers. These journals are all core publications and fully included in CNKI; further, their print publication dates are nearly in line with their on-line dates, whereas some others, such as Library and Information Service and Journal of Library Science, have long on-line lags and thus are not included in our research. We eliminated some noisy data, such as catalogs, prefaces, contributions, and news, and obtained DataSet 1, consisting of 9919 papers. In DataSet 1, the data are composed of basic information such as titles, authors, journals, and downloads per year from 2006 to 2015.

The actual downloads of a paper in the year following its publication are computed as $D'_{Y+1} = D_Y + \frac{D_{Y+1}}{12}(12 - M)$, and its actual downloads two years after publication are expressed as $D'_{Y+2} = \frac{D_{Y+1}}{12}M + \frac{D_{Y+2}}{12}(12 - M)$; therein, $D_Y$ is the raw download count in the publication year $Y$ and $M$ denotes the remaining months of the current year after its publication. Similarly, we obtained other years' actual download data and stored them in DataSet 2.
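The adjustment above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' own procedure: the function names `adjusted_downloads` and `yearly_shares` are hypothetical, the second function anticipates the per-year share of the total that the paper uses for its normalized dataset, and the input is the raw series of the worked example in Table 1 (a paper published in September 2008, so M = 3).

```python
def adjusted_downloads(raw, remaining_months):
    """Re-bin calendar-year download counts into years since publication.

    raw[0] is the count for the publication year Y, raw[1] for Y+1, and so on.
    remaining_months is M, the months left in year Y after publication.
    Downloads are assumed to be spread evenly over the months of each year.
    """
    m = remaining_months
    if not 0 <= m <= 12:
        raise ValueError("remaining_months must lie between 0 and 12")
    adjusted = []
    for k in range(1, len(raw)):
        # Part carried over from calendar year Y+k-1: all of D_Y for the
        # first year, otherwise the last M months of that calendar year.
        carried = raw[0] if k == 1 else raw[k - 1] * m / 12
        # Part taken from the first 12-M months of calendar year Y+k.
        current = raw[k] * (12 - m) / 12
        adjusted.append(carried + current)
    return adjusted


def yearly_shares(adjusted):
    """Express each adjusted yearly count as a fraction of their sum."""
    total = sum(adjusted)
    if total <= 0:
        raise ValueError("total downloads must be positive")
    return [value / total for value in adjusted]


# Raw counts for 2008..2015 from the worked example (published September 2008).
raw_counts = [36, 45, 30, 25, 20, 10, 24, 14]
absolute = adjusted_downloads(raw_counts, remaining_months=3)
shares = yearly_shares(absolute)

print(absolute)
print([round(100 * share, 2) for share in shares])
```

Run on this series, the sketch gives the DataSet 2 and DataSet 3 rows of Table 1 (69.75, 33.75, 26.25, ... and 34.79%, 16.83%, 13.09%, ...).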
It should be emphasized that in these calculations, we assumed an equal number of downloads in each month throughout the year. Next, we normalized these data and obtained the ratio of downloads in each year to total downloads, $R_{Y+1} = \frac{D'_{Y+1}}{\sum_k D'_{Y+k}}$, which were stored in DataSet 3. Ultimately, we collected three datasets: the raw downloads dataset, DataSet 1; the absolute downloads dataset, DataSet 2; and the normalized downloads dataset, DataSet 3.

The following is an example to demonstrate the calculations for DataSet 2 and DataSet 3 based on DataSet 1. Given a paper published in September 2008, its raw downloads in each year as of 2015 are stored in DataSet 1, shown in Table 1. In DataSet 2, $D'_{2008+1} = D_{2008} + \frac{D_{2008+1}}{12}(12 - 3) = 36 + \frac{45}{12} \times 9 = 69.75$, and $D'_{2008+2} = \frac{D_{2008+1}}{12} \times 3 + \frac{D_{2008+2}}{12}(12 - 3) = \frac{45}{12} \times 3 + \frac{30}{12} \times 9 = 33.75$. Correspondingly, in DataSet 3, $R_{2008+1} = 69.75 / (69.75 + 33.75 + 26.25 + 21.25 + 12.50 + 20.50 + 16.50) \times 100\% = 34.79\%$, and $R_{2008+2} = 33.75 / (69.75 + 33.75 + 26.25 + 21.25 + 12.50 + 20.50 + 16.50) \times 100\% = 16.83\%$.

Analyzing method

Normality test

We used the Q–Q (Quantile–Quantile) Plot to observe the distribution of total downloads and examined whether it passed the K–S test. The Q–Q Plot shows a scatter plot with observed values on the X axis and expected values on the Y axis. If all the scatter points are close to the reference line, we can say that the dataset follows the given distribution.

Cluster analysis

A two-step cluster analysis was conducted to explore different download patterns. The two-step cluster method is a scalable cluster analysis algorithm designed to handle very large datasets.
It can handle both continuous and categorical variables and is also considered more reliable and accurate when compared to traditional clustering methods such as the k-means clustering algorithm (Norusis 2007). As the name suggests, the two-step cluster procedure involves two distinct steps: (1) pre-cluster the cases (or records) into many small sub-clusters; (2) cluster the sub-clusters resulting from the pre-cluster step into the desired number of clusters. This procedure can also automatically select the number of clusters (Tkaczynski 2017). In this study, annual downloads in DataSet 2 and DataSet 3 were chosen as continuous variables, respectively, and the Bayesian information criterion (BIC) was used for clustering. The clustering quality is estimated by the silhouette measure of cohesion and separation. This procedure measures the relationship of the variables within and between clusters. A score above 0.0 would ensure that the within-cluster distance and the between-cluster distance were valid among the different variables (Tkaczynski 2017).

Table 1 An example of calculations for different datasets

DataSet 1: raw download

| 2008 (year) | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 36 | 45 | 30 | 25 | 20 | 10 | 24 | 14 |

DataSet 2: absolute download

| 1st year | 2nd year | 3rd year | 4th year | 5th year | 6th year | 7th year |
| --- | --- | --- | --- | --- | --- | --- |
| 69.75 | 33.75 | 26.25 | 21.25 | 12.50 | 20.50 | 16.50 |

DataSet 3: normalized download

| 1st year | 2nd year | 3rd year | 4th year | 5th year | 6th year | 7th year |
| --- | --- | --- | --- | --- | --- | --- |
| 34.79% | 16.83% | 13.09% | 10.60% | 6.23% | 10.22% | 8.23% |

Correlation analysis

The Spearman correlation coefficient was used to test the correlations between downloads and the features of papers in the context of different patterns.

Half-life analysis

Analogously to the concept of cited half-life, the download half-life of a paper is defined as “the median age of the articles that were downloaded in the considered year” (Schloegl and Gorraiz 2010).
The half-lives of different download patterns can be computed from the average downloads of papers per year.

Results

Normality test

We drew the Q–Q probability plot of downloads, shown in the left chart of Fig. 1. In this normality test, the X axis represents the observed values in DataSet 2 and the Y axis represents the expected normal values. Overall, the curve of the total downloads of most articles deviates significantly from the expected values and presents a remarkably skewed distribution, which is in line with the results from Lu et al. (2016). Given these results, we employed the Spearman correlation coefficient rather than Pearson for further correlation tests.

Fig. 1 Probability plot of downloads (left) and the log-transformed downloads (right)

We converted the total downloads of each paper into logarithms and obtained a new log-transformed dataset. We also used the Q–Q plot to test whether its distribution is normal (the right chart in Fig. 1). The result shows that the log-transformed data fit the normal distribution satisfactorily. Therefore, a logarithmic function was applied to fit the total downloads distribution, shown in Fig. 2, and the regression equation is $y = -259.1\ln(x) + 2409.4$ with $R^2 = 0.930$.

Fig. 2 Frequency distribution of total downloads of papers

Dynamic changing patterns of absolute downloads

The data of absolute downloads in DataSet 2 can be divided into four groups representing four different changing patterns after two-step cluster analysis with a silhouette measure of cohesion and separation of 0.5. The detailed information is shown in Table 2. The annual downloads of each pattern were computed, as shown in Table 3 and Fig. 3.
Patterns 1, 2, and 3 show the same tendency: their download counts peak in the first year and decline thereafter, which fits power functions with negative exponents. The differences among them lie in the absolute downloads: each year, the absolute downloads of pattern 2 and pattern 3 are nearly 2.3–2.6 times and 3.7–6.7 times those of pattern 1, respectively. By contrast, the downloads in pattern 4 first decline and then rise, hitting bottom in the 4th year before climbing in the 7th year to nearly the number of downloads in the 1st year. The best-fitting function is a second-order polynomial, and the absolute downloads are 6.04–23.92 times those of pattern 1.

Table 2 Information of each pattern from DataSet 2

| Pattern | Paper number included | Proportion of the total (%) | Fitting expression and goodness of fit |
| --- | --- | --- | --- |
| 1 | 4885 | 49.25 | $y_1 = 36.219x_1^{-0.744}$; $R^2 = 0.992$ |
| 2 | 3512 | 35.41 | $y_2 = 84.428x_2^{-0.712}$; $R^2 = 0.992$ |
| 3 | 1328 | 13.39 | $y_3 = 132.75x_3^{-0.482}$; $R^2 = 0.944$ |
| 4 | 194 | 1.96 | $y_4 = 7.7456x_4^2 - 63.552x_4 + 282.32$; $R^2 = 0.938$ |

Table 3 Annual downloads in four patterns

| Pattern | 1st year | 2nd year | 3rd year | 4th year | 5th year | 6th year | 7th year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 38.13 | 21.08 | 14.75 | 12.65 | 11.23 | 9.86 | 8.70 |
| 2 | 88.07 | 51.43 | 36.15 | 30.04 | 26.62 | 24.31 | 22.19 |
| 3 | 142.85 | 94.26 | 71.86 | 62.13 | 58.21 | 58.49 | 58.35 |
| 4 | 230.24 | 184.24 | 157.22 | 147.85 | 159.31 | 194.21 | 208.08 |

Fig. 3 Changing trend of the four patterns in terms of average absolute downloads (x axis: time, 1st to 7th year; y axis: average downloads)

Table 4 shows the download half-life, total downloads, and average downloads for each pattern. Patterns 1 and 2, both with a half-life of about 2 years, account for 85% of the total sample but contribute only 60% of total downloads.
By contrast, pattern 4, which represents highly downloaded papers, accounts for less than 2% of the total sample and contributes more than 10% of the total downloads, with average downloads up to 1530. Pattern 4 also shows a lower aging rate, with a half-life of 3.47.

Dynamic changing pattern of relative download

We continued using the two-step cluster analysis for DataSet 3. The sample of 9919 papers was finally divided into two clusters with a silhouette measure of cohesion and separation of 0.4, representing two different relative download patterns. This basic information is shown in Table 5. The proportions of annual downloads to total downloads in each pattern are shown in Table 6 and Fig. 4. Overall, the annual proportions in both patterns show a downtrend. Download half-life and the proportion of the total sample in each pattern are shown in Table 7. Pattern A, with a half-life of 1.54, has higher proportions than pattern B, with a half-life of 2.68, in the first 2 years, but it also declines faster: starting from the third year, pattern A's annual proportion begins to lag behind that of pattern B. We also identified the aging trend of the two patterns from their absolute downloads: the absolute downloads of pattern A are larger than those of pattern B in the 1st year but drop by about 50% the following year and become progressively smaller than those of pattern B in subsequent years.
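The half-lives quoted for the patterns can be reproduced from the yearly averages by linear interpolation on the cumulative series. The sketch below is one reading of the computation, not a procedure stated in the paper: the function name `download_half_life` is hypothetical, the seven-year observation window is taken as the total, and downloads are assumed to accrue evenly within each year. The inputs are the yearly averages of the four absolute patterns (Table 3) and the yearly shares of the two relative patterns (Table 6).

```python
def download_half_life(yearly_values):
    """Return the age (in years) at which half of the windowed total is reached.

    yearly_values[0] is the first year after publication. Within the year in
    which the cumulative sum crosses 50%, the crossing point is interpolated
    linearly (an assumption of this sketch).
    """
    if any(value < 0 for value in yearly_values):
        raise ValueError("yearly values must be non-negative")
    total = sum(yearly_values)
    if total <= 0:
        raise ValueError("total must be positive")
    half = total / 2
    cumulative = 0.0
    for year, value in enumerate(yearly_values, start=1):
        if cumulative + value >= half:
            # Full years already elapsed plus the fraction of this year needed.
            return (year - 1) + (half - cumulative) / value
        cumulative += value
    raise RuntimeError("unreachable: the cumulative sum always reaches half")


# Yearly averages per absolute pattern (Table 3).
absolute_patterns = {
    "1": [38.13, 21.08, 14.75, 12.65, 11.23, 9.86, 8.70],
    "2": [88.07, 51.43, 36.15, 30.04, 26.62, 24.31, 22.19],
    "3": [142.85, 94.26, 71.86, 62.13, 58.21, 58.49, 58.35],
    "4": [230.24, 184.24, 157.22, 147.85, 159.31, 194.21, 208.08],
}
# Yearly shares (%) per relative pattern (Table 6).
relative_patterns = {
    "A": [39.16, 19.90, 12.36, 9.43, 7.57, 6.25, 5.32],
    "B": [24.35, 16.63, 13.26, 12.30, 11.75, 11.19, 10.53],
}

half_lives = {
    name: download_half_life(values)
    for name, values in {**absolute_patterns, **relative_patterns}.items()
}
for name, value in half_lives.items():
    print(f"pattern {name}: half-life = {value:.2f} years")
```

Under these assumptions the results round to 1.95, 2.00, 2.50 and 3.47 for patterns 1 to 4 and to 1.54 and 2.68 for patterns A and B, in line with Tables 4 and 7.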
Table 4 Download half-life and average download in four patterns

| Pattern | Download half-life | Sample number | Proportion of total sample (%) | Total download | Proportion of total download (%) | Average download |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.95 | 4885 | 49.25 | 620,142 | 22.04 | 126.95 |
| 2 | 2.00 | 3512 | 35.41 | 1,072,632 | 38.13 | 305.42 |
| 3 | 2.50 | 1328 | 13.39 | 823,751 | 29.28 | 620.30 |
| 4 | 3.47 | 194 | 1.96 | 296,957 | 10.56 | 1530.71 |

Table 5 Information of each pattern from DataSet 3

| Pattern | Paper number included | Proportion of the total (%) | Fitting expression and goodness of fit |
| --- | --- | --- | --- |
| A | 4426 | 44.62 | $y_A = 0.3943x_A^{-1.029}$; $R^2 = 0.9994$ |
| B | 5493 | 55.38 | $y_B = 0.2291x_B^{-0.421}$; $R^2 = 0.9682$ |

Table 6 Annual download proportion and average download in DataSet 3

| Pattern | | 1st year | 2nd year | 3rd year | 4th year | 5th year | 6th year | 7th year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | Relative download | 39.16% | 19.90% | 12.36% | 9.43% | 7.57% | 6.25% | 5.32% |
| A | Absolute download | 81.13 | 42.02 | 26.21 | 19.77 | 15.95 | 13.58 | 11.79 |
| B | Relative download | 24.35% | 16.63% | 13.26% | 12.30% | 11.75% | 11.19% | 10.53% |
| B | Absolute download | 67.52 | 47.07 | 38.04 | 34.77 | 33.86 | 34.37 | 33.88 |

Fig. 4 Changing trend of the two download patterns in DataSet 3 (x axis: time, 1st to 7th year; y axis: download ratio, %)

Table 7 Download half-life and average downloads in the two patterns in DataSet 3

| Pattern | Download half-life | Sample number | Proportion of total sample (%) | Total download | Proportion of total download (%) | Average download |
| --- | --- | --- | --- | --- | --- | --- |
| A | 1.54 | 4426 | 44.62 | 995,491 | 35.38 | 224.92 |
| B | 2.68 | 5493 | 55.38 | 1,817,991 | 64.62 | 330.97 |

Matrix analysis on absolute and relative download changing patterns

We established a sample matrix and a download matrix for DataSets 2 and 3, as shown in Tables 8 and 9. As is evident, patterns 1 and 2, with smaller absolute downloads, are distributed almost equally between patterns A and B, while patterns 3 and 4, with larger downloads, are concentrated in pattern B.
The distribution indicates that articles with a higher download rate tend to show smaller changes in annual download proportion and to have longer half-lives.

Table 8 Sample matrix of different patterns in the two datasets

| | 1 | 2 | 3 | 4 | Total |
| --- | --- | --- | --- | --- | --- |
| A | 2442 (24.62%) | 1681 (16.95%) | 291 (2.93%) | 12 (0.12%) | 4426 (44.62%) |
| B | 2443 (24.63%) | 1831 (18.46%) | 1037 (10.46%) | 182 (1.84%) | 5493 (55.38%) |
| Total | 4885 (49.25%) | 3512 (35.41%) | 1328 (13.39%) | 194 (1.96%) | 9919 (100%) |

Table 9 Download matrix of different patterns in the two datasets

| | 1 | 2 | 3 | 4 | Total |
| --- | --- | --- | --- | --- | --- |
| A | 312,339 (11.10%) | 497,834 (17.70%) | 171,579 (6.10%) | 13,739 (0.49%) | 995,491 (35.38%) |
| B | 307,803 (10.94%) | 574,798 (20.43%) | 652,172 (23.18%) | 283,218 (10.07%) | 1,817,991 (64.62%) |
| Total | 620,142 (22.04%) | 1,072,632 (38.13%) | 823,751 (29.28%) | 296,957 (10.56%) | 2,813,482 (100%) |

After sorting articles by downloads, the top 1% of papers, 99 in total, were identified as highly downloaded papers. These received 190,665 downloads, comprising 6.78% of the total downloads. Moreover, they all belong to pattern 4 in terms of absolute downloads; 97 of them are attributed to pattern B in terms of relative download, with the other two belonging to pattern A.

Correlation analysis of download and paper features

Generally, people search for targeted papers by way of titles, keywords, and authors before downloading them, so papers with longer titles, a larger number of keywords and more authors are more likely to be retrieved and then downloaded. Hence, we analyzed the correlation between total downloads of articles and their features, including title length, number of authors, number of keywords, and impact factor, as shown in Table 10. Additionally, we analyzed the correlation between total downloads and initial downloads (defined as downloads made during the first year after publication) to observe the impact of earlier downloads on total downloads.
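For readers who want to see what the Spearman coefficient used in this analysis measures, the sketch below ranks two variables (ties share their average rank) and then takes the Pearson correlation of the ranks. It is a minimal illustration, not the authors' workflow: the function names are hypothetical and the two short lists are made-up toy values, not data from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for index in order[start:end + 1]:
            ranks[index] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


# Toy values for illustration only (hypothetical, not taken from the paper).
toy_initial_downloads = [12, 30, 7, 45, 18]
toy_total_downloads = [40, 95, 60, 210, 80]
rho = spearman(toy_initial_downloads, toy_total_downloads)
print(round(rho, 3))
```

Because the coefficient depends only on ranks, it is not distorted by the heavy right skew of download counts, which is the reason the paper gives for preferring it to the Pearson coefficient.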
Table 10 Correlations between total download and paper features in different patterns

| Feature | Pattern 1 | Pattern 2 | Pattern 3 | Pattern 4 | Total sample |
| --- | --- | --- | --- | --- | --- |
| Title length | 0.018 | -0.062** | -0.071** | -0.116 | 0.019 |
| Number of authors | 0.185** | 0.043 | -0.010 | 0.180* | 0.253** |
| Number of keywords | 0.228* | 0.006 | -0.028 | -0.065 | 0.174** |
| Impact factor | 0.127** | 0.087** | -0.004 | 0.064 | 0.055** |
| Initial downloads | 0.731** | 0.474** | 0.411** | 0.869** | 0.466** |

** Correlation is significant at the 0.01 level (2-tailed); * correlation is significant at the 0.05 level (2-tailed)

As shown in Table 10, there are weak correlations between total downloads and features, including title length, number of authors, and number of keywords, regardless of patterns. Moreover, we found that the correlations of downloads with the different features vary across patterns. Title length has a weak negative correlation with downloads in patterns 2 and 3: the longer the title, the lower the total downloads. However, in patterns 1 and 4, total downloads show no correlation with title length but a weak positive correlation with number of authors. Overall, the higher correlations between total downloads and number of authors, number of keywords and impact factors can be seen in pattern 1. Strong correlations are observed only between total downloads and initial downloads, in the whole sample and in pattern 1, with coefficients of 0.869 and 0.731, respectively. Hence, taking initial downloads as independent variable x and total downloads as dependent variable y (shown in Fig. 5), we carried out curve estimation on the whole sample in the new DataSet 2 and eventually found that the best fitting curve is the power function, which can be expressed as $y = 7.198x^{0.839}$ ($R^2 = 0.751$).

We further analyzed the relationships between themes and patterns. In accordance with Chinese Library Classification, we assigned every paper a classification number denoting a certain theme.
Classification numbers with more than 100 papers were selected for further research, as shown in Table 11. Regardless of theme, the average downloads in pattern B are higher than that in pattern A. Next, we focused on the features of highly downloaded articles and analyzed the correlations between total downloads and title length, number of authors, and number of keywords as well as impact factors, as shown in Table 12. It turns out that total downloads do not associate with most features at all but have a weak and negative correlation with initial downloads. Fig. 5 Correlation between initial downloads and total downloads 123 Computer software Information retrieval Collection development and collection organization Document indexing and cataloging Information science Enterprise economic theory Enterprise planning and business decision making Philology Information resource management The processing of intelligence data Various document service Information theory Information industry economics (pandect) TP31 G354 G253 G254 G350 F270 F272 G256 G203 G353 G255 G201 F49 Application of computer G251 Librarianship of the world Library management G258 G259 Various types of library G252 TP39 Library science Service for readers G250 Theme Classification Table 11 Download patterns in different themes 126 121 197 216 230 238 243 254 274 307 316 329 331 366 532 591 758 927 1598 Total 53 38 81 59 93 13 151 150 108 138 146 205 219 80 321 307 298 472 762 11,902 8915 12,440 11,428 21,659 991 50,150 50,647 28,397 20,367 26,888 51,514 42,148 10,470 71,961 59,543 56,327 101,171 173,050 224.57 234.61 153.58 193.70 232.89 76.231 332.12 337.65 262.94 147.59 184.16 251.29 192.46 130.88 224.18 193.95 189.02 214.35 227.10 73 83 116 157 137 225 92 104 166 169 170 124 113 286 211 284 460 455 836 Paper number Average download Paper number Download Pattern B Pattern A 30,261 37,515 20,904 77,492 58,704 31,460 45,771 46,556 75,037 37,387 45,226 56,764 45,504 51,332 81,110 85,812 122,723 154,905 
272,833 Download 414.53 451.99 180.21 493.58 428.45 139.82 497.51 447.65 452.03 221.23 266.04 457.77 402.69 179.48 384.41 302.16 266.79 340.45 326.36 Average download

Table 12 Correlations between downloads of highly downloaded papers and paper features

| | Title length | Author number | Keyword number | Impact factor | Initial downloads |
| --- | --- | --- | --- | --- | --- |
| Correlation coefficient | -0.037 | 0.128 | -0.020 | 0.030 | -0.281** |
| Significance | 0.718 | 0.206 | 0.844 | 0.769 | 0.005 |

** Correlation is significant at the 0.01 level (2-tailed)

Discussion

Literature aging law based on downloads

The traditional research on the literature aging law mainly relies on citations, generating a series of mathematical models such as the negative exponential model, the Burton–Kebler aging equation, the Brookes accumulation exponential model, and the Avrami equation. It has also been reported that there are different citation patterns for articles (Avramescu 1979; Aversa 1985). Meanwhile, because download data are hard to access, there is little research regarding the aging law based on downloads, let alone whether there are different patterns.

With two-step cluster analysis, article downloads can be separated into distinct patterns. As far as absolute downloads are concerned, patterns 1, 2 and 3 follow the power function $y = at^{-b}$, where $a$ denotes initial download and $b$ represents the download changing rate. For pattern 4, downloads first decline and then rise, akin to a quadratic function. Taking the inevitable aging trend of literature into consideration, we conjecture that pattern 4 would finally be subject to the power function, as are the former three. In the time frame of our study, the changing pattern of downloads in pattern 4 deviates from its aging trajectory, which may be due to the effect of citation.
Moed (2005) noted that an article's downloads in the 3 months after being cited rose 25% compared to the non-cited condition; Schloegl et al. (2014) also reported that the downloads of an article always show an increase after being cited. Assuming that $z$ denotes the influence of citation in period $t$ on the downloads in period $t + 1$, we speculate that a more general function for absolute downloads changing over time is $Y = az + (1 - a)y$ $(0 < a < 1)$. For most papers, the influence can be ignored because of few citations, namely, $a \approx 0$ and thus $Y = y$.

From the perspective of relative downloads, the changing trend occurring in pattern 4 does not appear, and both patterns follow power functions. The total downloads in the context of relative downloads are set to 1, which eliminates the differences in absolute downloads among papers and thus yields a more generalized literature aging law: the downloads of a paper reach peak activity in the first year after its publication and then gradually fall, finally approaching 0, at which point the paper is considered dead.

There are pros and cons to both perspectives. Concerning absolute downloads, we identified the “unusual” download pattern of highly downloaded papers and speculated on the impact of citations on downloads. However, this perspective relies on a research time window that is too narrow to forecast the developing trend of pattern 4 in the future. Regarding relative downloads, we focus more on the changing trend of downloads over time to explore the aging pattern of literature but ignore some factors that may impact absolute downloads, resulting in a generalized aging pattern.
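The power function $y = at^{-b}$ discussed in this section is linear on log-log axes, so $a$ and $b$ can be estimated by ordinary least squares on $(\ln t, \ln y)$. The sketch below assumes that this is how the trend lines were obtained, which the paper does not state; `fit_power_law` is a hypothetical helper, and the inputs are the yearly averages of patterns 1 to 3 from Table 3.

```python
import math


def fit_power_law(yearly_values):
    """Fit y = a * t**(-b) for t = 1, 2, ... by least squares in log-log space.

    Returns the pair (a, b). All values must be strictly positive because the
    fit is done on their logarithms.
    """
    if len(yearly_values) < 2 or min(yearly_values) <= 0:
        raise ValueError("need at least two strictly positive values")
    log_t = [math.log(t) for t in range(1, len(yearly_values) + 1)]
    log_y = [math.log(y) for y in yearly_values]
    n = len(log_y)
    mean_t = sum(log_t) / n
    mean_y = sum(log_y) / n
    s_tt = sum((u - mean_t) ** 2 for u in log_t)
    s_ty = sum((u - mean_t) * (v - mean_y) for u, v in zip(log_t, log_y))
    slope = s_ty / s_tt                 # equals -b
    intercept = mean_y - slope * mean_t  # equals ln(a)
    return math.exp(intercept), -slope


# Yearly average downloads of patterns 1 to 3 (Table 3).
table3_averages = {
    1: [38.13, 21.08, 14.75, 12.65, 11.23, 9.86, 8.70],
    2: [88.07, 51.43, 36.15, 30.04, 26.62, 24.31, 22.19],
    3: [142.85, 94.26, 71.86, 62.13, 58.21, 58.49, 58.35],
}

fits = {name: fit_power_law(values) for name, values in table3_averages.items()}
for name, (a, b) in fits.items():
    print(f"pattern {name}: y = {a:.2f} * t^(-{b:.3f})")
```

With these inputs the estimates come out close to the expressions reported in Table 2 (for pattern 1, roughly a = 36.2 and b = 0.744). In this form, a is the fitted first-year level and b controls how fast downloads decay.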
Influencing factors of paper downloads

As mentioned above, there has been a body of research on the correlations between paper citation and its influencing factors, including title type (Jamali and Nikzad 2011; Fox and Burns 2015), title length (Jacques and Sebire 2010; Jamali and Nikzad 2011), number of authors (Borsuk et al. 2009; Rao 2014) and number of keywords (Uddin and Khan 2016), but research on correlations between downloads and these factors is rare. At present, the most relevant research is concerned with the correlation between downloads and title length. Some scholars found that articles with short titles tend to receive more downloads than those with long ones (Jamali and Nikzad 2011; Lin 2012), while others came to the opposite conclusion (Habibzadeh and Yadollahie 2010; Jacques and Sebire 2010). The disparate results could be due to data differences.

In this paper, we find weak correlations between downloads and paper features, including title length, number of authors, and number of keywords. Moreover, downloads react differently to these features in different patterns: for example, title length has a weak negative correlation with downloads in patterns 2 and 3 but no correlation in patterns 1 and 4 or in the total sample. It is unreliable to study the relationship between a single feature and downloads without considering the different influencing factors of downloads that could give rise to different changing patterns.

We found moderate to high correlations between initial downloads and total downloads, with a coefficient reaching up to 0.869 for the whole sample. For papers with different patterns, the higher the downloads, the weaker the correlation. Similarly, the total downloads of highly downloaded papers show no significant correlations with paper features, indicating that article quality, rather than these features, contributes to their higher downloads.
In view of the decline-then-rise trend of highly downloaded papers, and simultaneously considering the lag of paper citation, we speculate that citation plays a promoting role in the rising behavior of downloads. After the source paper is downloaded, references to it would garner more attention from researchers and contribute to more downloads, which also makes its download aging pattern different from the patterns of other types of papers.

Meanwhile, higher correlations between total downloads and number of authors, number of keywords and impact factors can be observed in pattern 1. These correlations indicate that articles downloaded at a low rate may depend primarily upon the factors that increase the probability of being retrieved, such as longer titles, a larger number of keywords and a larger number of authors.

Acknowledgements The authors are grateful to China National Knowledge Infrastructure (CNKI) for making available the download files analyzed in this paper.

Appendix

See Table 13.

Table 13 Basic information for 11 journals

| Journal | Paper number | Total download | Average download |
| --- | --- | --- | --- |
| Journal of Academic Libraries | 431 | 184,902 | 429.007 |
| Information Studies: Theory and Application | 671 | 271,441 | 404.532 |
| Information Science | 1211 | 448,009 | 369.950 |
| Document Information and Knowledge | 450 | 155,264 | 345.031 |
| Journal of Information | 1854 | 591,758 | 319.179 |
| New Technology of Library and Information Service | 739 | 193,928 | 262.419 |
| Library and Information | 647 | 163,611 | 252.876 |
| Library Tribune | 1267 | 280,108 | 221.080 |
| Library Journal | 986 | 200,474 | 203.320 |
| Library Work and Study | 871 | 172,700 | 198.278 |
| Library | 792 | 151,287 | 191.019 |
| Total | 9919 | 2,813,482 | 283.646 |

References

Aversa, E. S. (1985). Citation patterns of highly cited papers and their relationship to literature aging: A study of the working literature. Scientometrics, 7(3–6), 383–389. Avramescu, A. (1979). Actuality and obsolescence of scientific literature.
Journal of the American Society for Information Science, 30(5), 296–303. Bollen, J., de Sompel, H. V., Smith, J. A., & Luce, R. (2005). Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing and Management, 41(6), 1419–1440. Bollen, J., & van de Sompel, H. (2008). Usage impact factor: The effects of sample characteristics on usagebased impact metrics. Journal of the American Society for Information Science and Technology, 59(1), 136–149. Borsuk, R. M., Budden, A. E., Leimu, R., Aarssen, L. W., & Lortie, C. J. (2009). The influence of author gender, national language and number of authors on citation rate in ecology. Open Ecology Journal, 2(1), 25–28. Davis, P. M., & Price, J. S. (2006). Ejournal interface can influence usage statistics: Implications for libraries, publishers, and project counter. Journal of the Association for Information Science and Technology, 57(9), 1243–1248. Davis, P. M., & Solla, L. R. (2003). An IP-level analysis of usage statistics for electronic journals in chemistry: Making inferences about user behavior. Journal of the Association for Information Science and Technology, 54(11), 1062–1068. Fox, C. W., & Burns, C. S. (2015). The relationship between manuscript title structure and success: Editorial decisions and citation performance for an ecological journal. Ecology and Evolution, 5(10), 1970–1980. Glänzel, W., & Gorraiz, J. (2015). Usage metrics versus altmetrics: Confusing terminology? Scientometrics, 102(3), 2161–2164. Guerrero-Bote, V. P., & Moya-Anegon, F. (2014). Relationship between downloads and citations at journal and paper levels, and the influence of language. Scientometrics, 101(2), 1043–1065. Habibzadeh, F., & Yadollahie, M. (2010). Are shorter article titles more attractive for citations? Crosssectional study of 22 scientific journals. Croatian Medical Journal, 51(2), 165–170. Jacques, T. S., & Sebire, N. J. (2010). 
The impact of article titles on citation hits: An analysis of general and specialist medical journals. JRSM Short Reports, 1(1), 2.
Jamali, H. R., & Nikzad, M. (2011). Article title type and its relation with the number of downloads and citations. Scientometrics, 88(2), 653–661.
Kurtz, M. J., & Bollen, J. (2010). Usage bibliometrics. Annual Review of Information Science and Technology, 44, 3–64.
Lin, J. (2012). Article title and its relation with the number of downloads and citations. Journal of Academic Libraries, 04, 14–17.
Lippi, G., & Favaloro, E. J. (2013). Article downloads and citations: Is there any relationship? Clinica Chimica Acta, 415, 195.
Liu, X. (2012). Download’ half-life establishment of scientific journal and bibliometrics importance. Chinese Journal of Scientific and Technical Periodicals, 04, 561–564.
Lu, W., Qian, K., & Tang, X. (2016). Correlation analysis of paper download and citations-with library and information field. Information Science, 01, 3–8.
Moed, H. F. (2005). Statistical relationships between downloads and citations at the level of individual documents within a single journal. Journal of the American Society for Information Science and Technology, 56(10), 1088–1097.
Norusis, M. J. (2007). SPSS 15.0 advanced statistical procedures companion. Chicago, IL: Prentice Hall.
Rao, I. K. R. (2014). Weak relations among the impact factors, number of citations, references and authors. Collnet Journal of Scientometrics and Information Management, 8(1), 17–30.
Schloegl, C., & Gorraiz, J. (2010). Comparison of citation and usage indicators: The case of oncology journals. Scientometrics, 82(3), 567–580.
Schloegl, C., & Gorraiz, J. (2011). Global usage versus global citation metrics: The case of pharmacology journals. Journal of the American Society for Information Science and Technology, 62(1), 161–170.
Schloegl, C., Gorraiz, J., Gumpenberger, C., Jack, K., & Kraker, P. (2014).
Comparison of downloads, citations and readership data for two information systems journals. Scientometrics, 101(2), 1113–1128.
Subotic, S., & Mukherjee, B. (2014). Short and amusing: The relationship between title characteristics, downloads, and citations in psychology articles. Journal of Information Science, 40(1), 115–124.
Tkaczynski, A. (2017). Segmentation using two-step cluster analysis. In T. Dietrich, S. Rundle-Thiele, & K. Kubacki (Eds.), Segmentation in social marketing: Process, methods and application (pp. 109–125). Singapore: Springer Singapore.
Uddin, S., & Khan, A. (2016). The impact of author-selected keywords on citation counts. Journal of Informetrics, 10(4), 1166–1177.
Wan, J. K., Hua, P. H., Rousseau, R., & Sun, X. K. (2010). The journal download immediacy index (DII): Experiences using a Chinese full-text database. Scientometrics, 82(3), 555–566.
Wang, X., Mao, W., Xu, S., & Zhang, C. (2014a). Usage history of scientific literature: Nature metrics, and metrics of nature, publications. Scientometrics, 98(3), 1923–1933.
Wang, X., Peng, L., Zhang, C., Xu, S., Wang, Z., Wang, C., et al. (2013a). Exploring scientists’ working timetable: A global survey. Journal of Informetrics, 7(3), 665–675.
Wang, X., Wang, Z., Mao, W., & Liu, C. (2014b). How far does scientific community look back? Journal of Informetrics, 8(3), 562–568.
Wang, X., Wang, Z., & Xu, S. (2013b). Tracing scientist’s research trends realtimely. Scientometrics, 95(2), 717–729.
Wang, X., Xu, S., Peng, L., Wang, Z., Wang, C., Zhang, C., et al. (2012). Exploring scientists’ working timetable: Do scientists often work overtime? Journal of Informetrics, 6(4), 655–660.
Xu, X. (2014). Empirical research on half-life period of journal based on downloads. Journal of Intelligence, 06, 117–121.