Parasite prevalence and sample size: misconceptions and solutions

Roger Jovani and José L. Tella

Department of Applied Biology, Estación Biológica de Doñana, Consejo Superior de Investigaciones Científicas, Avenida Maria Luisa s/n, 41013 Sevilla, Spain

Parasite prevalence (the proportion of infected hosts) is a common measure used to describe parasitaemias and to unravel ecological and evolutionary factors that influence host–parasite relationships. Prevalence estimates are often based on small sample sizes because of either low abundance of the hosts or logistical problems associated with their capture or laboratory analysis. Because the accuracy of prevalence estimates is lower with small sample sizes, addressing sample size has been a common problem when dealing with prevalence data. Different methods are currently being applied to overcome this statistical challenge, but far from being different correct ways of solving the same problem, some are clearly wrong, and others need improvement.

Introduction

In a given population with N individuals, the prevalence (P) denotes the proportion of individuals infected by a given parasite species or group of species (e.g. a parasite genus). The actual prevalence of a population, however, is usually unknown because the number of sampled hosts (n; the sample size) is generally lower than the total population size (N). However, we can easily obtain an estimate (p) by dividing the number of infected individuals (i) by the number of sampled ones [p = (i/n) × 100], where i = 0, 1, 2, ..., n and n = 1, 2, 3, ..., N. Nonetheless, the accuracy of this estimate is known to be affected by sample size. This problem has long concerned researchers because the results of statistical analyses dealing with prevalence data, and the conclusions derived from them, could depend greatly on the number of sampled hosts [1]. Here, we review current methods used to overcome this statistical problem, identify wrong approaches, improve others and suggest more powerful practices.
Corresponding author: Jovani, R. (jovani@ebd.csic.es).

Basic concepts

The low accuracy of prevalence estimates when using a small sample size has a mathematical basis. Given that both the number of infected individuals and the sample size are integers, sampling prevalence (i.e. the prevalence within a given sample of individuals) is constrained to a particular set of values for each sample size (Figure 1). Moreover, the behaviour of prevalence estimates for slight increases in the number of infected individuals changes dramatically when we move from small to large sample sizes. For instance, if we sample six adult lizards and find four to be infected (n = 6, i = 4), we obtain p = (4/6) × 100 = 67%, but if we found five infected, then p = 83%, a difference of 16 percentage points. However, for n = 100, i = 4 returns p = 4% and i = 5 returns p = 5%, a difference of only one percentage point. As a result, slight prevalence differences can be detected at high sample sizes but not at low ones. In other words, our uncertainty about the real prevalence (i.e. population prevalence, P) is higher at low sample sizes (Figure 2).

The accuracy with which we calculate the prevalence decreases not only at low sample sizes (as expected), but also when the populational prevalence is close to 50% (Figure 2). At first, this property could seem counterintuitive, because rare events (e.g. a parasitised individual at P = 1% or an uninfected individual at P = 99%) are difficult to observe, particularly at low sample sizes. However, when populational prevalence is close to zero or 100%, sample prevalence is also close to or equal to zero or 100%, respectively, precisely because rare events are difficult to detect. At intermediate populational prevalences (e.g. P = 52%), the chances of finding parasitised and non-parasitised individuals are similar, and sample prevalence could then fluctuate freely from zero to 100%, falling away from the actual populational prevalence and increasing the uncertainty in the prevalence estimate.
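The constraint on attainable prevalence values can be made concrete with a few lines of code. The following is a minimal sketch, not part of the original analysis; the function names possible_prevalences and prevalence_step are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def possible_prevalences(n):
    """Return every sampling prevalence (in %) attainable with n sampled hosts."""
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    # i infected hosts out of n gives p = (i / n) * 100 for i = 0, 1, ..., n
    return [float(Fraction(100 * i, n)) for i in range(n + 1)]


def prevalence_step(n):
    """Change in sampling prevalence (percentage points) per extra infected host."""
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    return 100.0 / n


if __name__ == "__main__":
    for n in (1, 2, 6, 100):
        values = possible_prevalences(n)
        print(f"n={n}: {len(values)} attainable values, "
              f"step of {prevalence_step(n):.2f} percentage points")
```

With n = 6 the step between neighbouring attainable values is about 16.7 percentage points, which is the jump from 67% to 83% in the lizard example; with n = 100 it is one percentage point.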
Figure 1. All the possible values that prevalence could reach at different sample sizes. For instance, for a sample size of 1, prevalence could be either (0/1) × 100 = 0% or (1/1) × 100 = 100%; for a sample size of 2, there are three possibilities: (0/2) × 100 = 0%, (1/2) × 100 = 50%, and (2/2) × 100 = 100%, and so on. This is illustrated for sample sizes (a) from 1 to 50 in linear axes, (b) from 1 to 3500 in log-log axes, and (c) from 1 to 3500 with logarithmic x-axis and linear y-axis. Axes: sample size (n) against sampling prevalence (p). TRENDS in Parasitology.

Minimum sample size

The most intuitive method to overcome the high statistical uncertainty of prevalences calculated from low sample sizes is to reject data obtained from small sample sizes. However, the threshold for establishing a minimum sample size is highly variable because it depends on subjective decisions of the researchers. This leads to researchers either not considering a minimum sample size (e.g. [2]), or considering low minimum sample sizes, such as three [3], five [4], or eight [5]; medium sample sizes, such as ten [6], 15 [7] or 20 [8]; and higher ones up to 30 [9] or even 75 [10]. Obviously, more is better, but the relationship is not linear: uncertainty decreases rapidly as sample size increases up to 10–20 individuals, but not much more with further increases in sample size (Figure 2). Thus, a sample size around 15 could be used as a reasonable trade-off between not losing too much data from analyses and maintaining acceptable levels of uncertainty (around 1/3 of the sampling prevalence). An additional recommendation is to test the robustness of the results of the analyses using different minimum sample size cut-offs [11,12].
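The diminishing return of larger samples can be checked numerically. The sketch below takes the standard error formula of the Figure 2 caption in its square-root form, SE = 100 × sqrt(p × q/(n − 1)); the square root is an assumption here, chosen because it reproduces the 0–50 range of the axis in Figure 2. The function names are hypothetical.

```python
import math


def standard_error(p, n):
    """Standard error (in %) of a prevalence estimate.

    p is the sample prevalence as a proportion in [0, 1]; n is the sample size.
    Assumed form: 100 * sqrt(p * q / (n - 1)) with q = 1 - p.
    """
    if n < 2:
        raise ValueError("n must be at least 2 because of the (n - 1) denominator")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return 100.0 * math.sqrt(p * (1.0 - p) / (n - 1))


def se_reduction(p, n_from, n_to):
    """Drop in standard error (percentage points) when n grows from n_from to n_to."""
    return standard_error(p, n_from) - standard_error(p, n_to)


if __name__ == "__main__":
    # Worst case for uncertainty: prevalence of 50%
    for n in (2, 5, 10, 15, 20, 50, 100):
        print(f"n={n:3d}  SE={standard_error(0.5, n):5.1f}")
```

Under this assumed formula, most of the reduction in standard error occurs before n reaches about 15; going from 15 to 100 hosts reduces it far less than going from 2 to 15.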
Figure 2. Standard error of prevalence estimates at different sample sizes and populational prevalences. Standard error (SE) is calculated using the formula $\mathrm{SE} = 100\sqrt{p\,q/(n-1)}$, where p is the sample prevalence, q = 1 − p, and n is the sample size. The white line shows results for n = 15. Axes: sample size (n), populational prevalence and standard error.

www.sciencedirect.com

Avoiding zero prevalences

Gregory and Blackburn [13] claimed that the minimum but not the maximum prevalence that could be achieved in a sample of a population is affected by sample size. This influential paper caused several studies to reject zero prevalences from their analyses or to question their previous use. These authors stated correctly [13] that large sample sizes (e.g. 1000 hosts) are needed to detect very low prevalences (e.g. 0.1%), but that 100% prevalences could be detected with only one individual sampled if it is infected. They presented figures with both axes log-transformed (similar to Figure 1b), but this hid the symmetric shape of the actual relationship between sample size and prevalence (Figure 1a,c). That is, as happens with prevalences near zero, prevalences near 100% (e.g. 99.9%) could only be achieved with high sample sizes (e.g. 1000 hosts). Thus, according to this symmetry and the suggestions of Gregory and Blackburn [13], we should also reject 100% prevalences from analyses.

Parasite prevalence differs widely both between and within host species [14,15]. Therefore, when assessing sources of variability in parasite prevalences (e.g. between marine and freshwater habitats [16]), a prevalence value of zero has the same ecological relevance as a prevalence of 0.1%, and the same holds for 100% and 99.9%. Thus, we are throwing out very relevant information by rejecting zero and 100% prevalences. Accordingly, zero (see Box 1) and 100% prevalences should be included in the analyses.
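The symmetry between the two ends of the prevalence scale can be illustrated with a short sketch. It assumes, for illustration only, that hosts are sampled independently from a large population, so that the number of infected hosts in a sample is binomial; the function names are hypothetical.

```python
def min_nonzero_prevalence(n):
    """Smallest sampling prevalence above 0% attainable with n hosts (1 infected)."""
    return 100.0 / n


def max_below_full_prevalence(n):
    """Largest sampling prevalence below 100% attainable with n hosts (1 uninfected)."""
    return 100.0 * (n - 1) / n


def prob_all_uninfected(P, n):
    """Chance that a sample of n hosts shows 0% prevalence when the true value is P (%)."""
    return (1.0 - P / 100.0) ** n


def prob_all_infected(P, n):
    """Chance that a sample of n hosts shows 100% prevalence when the true value is P (%)."""
    return (P / 100.0) ** n


if __name__ == "__main__":
    n = 1000
    print(min_nonzero_prevalence(n), max_below_full_prevalence(n))
    # The two extremes mirror each other:
    print(prob_all_uninfected(0.1, n), prob_all_infected(99.9, n))
```

Under the binomial assumption, a sample of 1000 hosts from a population with P = 0.1% shows a sampling prevalence of zero with probability of roughly 0.37, and a sample of 1000 hosts from a population with P = 99.9% shows a sampling prevalence of 100% with the same probability. Any argument for discarding zeros therefore applies equally to 100% values.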
Residuals of prevalence on sample size

Another method used to control for the potential effects of sample size is to obtain the residuals of a linear regression between sampling prevalence (the dependent variable) and sample size (the independent variable), and use them as the dependent variable for comparative studies [17,18]. A clear example of this rationale is a study [18] in which residuals of parasite prevalence against sample size were used in an independent contrast analysis when a correlation was found between these variables, but not in another analysis of the same study, in which such a relationship was not found [18]. This approach follows previous methods that aimed to remove the effect of body size on body-size-related variables, such as home range area [19]. Moreover, it has been influenced by the need to control for sample size when analysing parasite richness, because the more host individuals that are examined, the more parasite species could be found [7,20].

Adding more support to this method, and using empirical log-log plots similar to Figure 1b, Gregory and Blackburn [13] suggested that a negative relationship between sample size and prevalence is expected as a mathematical artefact (and thus needed to be controlled for). In addition, they stated that nothing but negative slopes were found when simulating the effect of sample size (from 1 to 3500) on prevalence estimates when zero values were deliberately avoided. Clearly, however, there is no mathematical relationship between prevalence and sample size per se (Figure 1). We confirmed this by repeating the same simulation study done in Ref. [13]. We performed 100 simulations in which 200 hypothetical species (or populations) took prevalence values varying between zero and 100% and sample sizes between 1 and 3500 (R.J. and J.L.T., unpublished).
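A simulation of this kind can be sketched as follows. This is not the original (unpublished) code: it assumes that each hypothetical species receives a sample size drawn uniformly between 1 and 3500 and a number of infected hosts drawn uniformly between 0 (or 1, when zeros are avoided) and n. The rank correlation is written out by hand so that the sketch needs only the standard library, and all function names are hypothetical.

```python
import random


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def simulate_once(rng, n_species=200, max_n=3500, allow_zero=True):
    """Correlation between sample size and prevalence for one simulated data set."""
    sizes, prevalences = [], []
    for _ in range(n_species):
        n = rng.randint(1, max_n)                    # assumed: uniform sample size
        i = rng.randint(0 if allow_zero else 1, n)   # assumed: uniform infected count
        sizes.append(n)
        prevalences.append(100.0 * i / n)
    return spearman(sizes, prevalences)


def run_simulations(n_sims=100, seed=1, allow_zero=True):
    """Repeat the simulation n_sims times and return the correlation coefficients."""
    rng = random.Random(seed)
    return [simulate_once(rng, allow_zero=allow_zero) for _ in range(n_sims)]


if __name__ == "__main__":
    for allow_zero in (True, False):
        rs = run_simulations(allow_zero=allow_zero)
        negative = sum(r < 0 for r in rs)
        print(f"allow_zero={allow_zero}: {negative} negative, "
              f"{len(rs) - negative} non-negative, mean r={sum(rs) / len(rs):.4f}")
```

Under these assumptions the coefficients scatter around zero whether or not zero prevalences are allowed, because nothing in the sampling scheme ties prevalence to sample size.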
Among the resulting relationships, 49 were negative (Spearman rank correlations, r range = −0.0008 to −0.1714; mean = −0.0565) and 51 positive (r range = 0.0045–0.2675; mean = 0.0598), with only two negative and four positive weak correlations being statistically significant. Moreover, the results were identical when zero prevalences were not allowed in the simulations. Our results differ greatly from those of Ref. [13], suggesting that the conclusions of Gregory and Blackburn were based on the visual examination of log-log plots similar to Figure 1b.

In a more recent but largely unnoticed study regarding the relationships between prevalence and sample size, Gregory and Woolhouse [21] simulated the effect of sample sizes ranging from 10 to 1280 on the mean and the accuracy of prevalences estimated for a theoretical population with a prevalence of 80%. They concluded that prevalence estimates were not biased under any sample size, with low sample sizes only causing greater inaccuracy, which is consistent with our own analysis (Figure 3; see later). However, Gregory and Woolhouse [21] did not link these challenging results with their previous work, and their first recommendations [13] have prevailed among researchers so far.

Curiously, however, although there is clearly no relationship between prevalence and sample size (Figure 1a), there have been reports not only of null empirical correlations between prevalence and sample size [18], but also of negative [18] and positive [22] correlations. Why? Because of the effect of detecting rare events at low sample sizes. This point is illustrated by simple simulations using different sample sizes and populational prevalences (Figure 3). At low sample sizes (1–15), sampling prevalences and sample sizes were usually positively correlated at low populational prevalences, uncorrelated at intermediate populational prevalences, and negatively correlated at high populational prevalences (Figure 3a). At higher sample sizes (15–100), however, significant correlations were fewer and were evenly distributed (Figure 3b). This is because, at low populational prevalence, the chance of finding an infected individual in a small sample is low, but the chance increases with increasing sample size; the reverse happens at higher populational prevalences (Figure 3a). However, when all the species have a minimum sample size above 15, the effect of rare events on sample prevalence is buffered at any population prevalence (Figure 3b).

Figure 3. Simulation of the effect of sample size on the correlation between sample size and prevalence at different actual populational prevalences. Each point indicates the Spearman correlation coefficient (in red, p < 0.05) between sample size and prevalence for 100 simulated species with sample sizes randomly varying (a) from 1 to 15 or (b) from 15 to 100. Axes: populational prevalence (P) against Spearman r between n and p.

The problem with using the residuals from a regression between prevalence and sample size is that it artificially increases the prevalence estimates of some species and underestimates the prevalences of others. Moreover, this method could not be applied even if (by chance) a correlation exists between prevalence and sample size in a given data set. To illustrate this point, one can deliberately create a statistical relationship between sample size and populational prevalence by sampling more individuals from populations with a higher populational prevalence (Figure 4). The simulated sample size and the sample prevalence correlate here, of course, because they have been made to do so deliberately (Figure 4a). There is also the expected correlation between estimated and real prevalences in the initial data (Figure 4b). However, the correlation between real prevalences and calculated residuals is null. This means that the residuals are unrelated to populational prevalences! Residuals thus become statistical artefacts that cannot be used as estimators of prevalence for comparative purposes, and thus previous results obtained through this method should be taken with caution.

Figure 4. An illustration of why residuals from prevalence–sample-size regressions cannot be used to correct for low sample sizes. (a) A simulated positive relationship between prevalence and sample size was created deliberately by retrieving sample prevalence (p) for 25 species (in red) with a populational prevalence P = 12.5% and n from 1 to 25, 25 (in blue) with P = 37.5% and n from 26 to 50, 25 (in green) with P = 62.5% and n from 51 to 75, and 25 (in black) with P = 87.5% and n from 76 to 100. (b) The correlation between populational prevalence and sampling prevalence (Spearman r = 0.847, n = 100, p < 0.0001) for the same simulated prevalences. (c) A graph showing that there was no relationship (Spearman r = 0.124, n = 100, p = 0.219) between true populational prevalence and the prevalence estimated as the residuals of the regression line obtained in (a) between sample size and populational prevalence. Axes: (a) sample size (n) against sampling prevalence (p); (b) populational prevalence (P) against sampling prevalence (p); (c) populational prevalence (P) against residuals n–p.

Finally, it is worth noting that some relevant biological factors could be shaping a prevalence–sample-size relationship. For instance, in a recent study, Ricklefs et al. [15] used the number of individuals trapped (the sample size) as an index of bird species abundance in a given study area. In this way, they used the relationship between prevalence and sample size to assess the potential relationship between host density and parasite prevalence. Thus, the prevalence–sample-size relationship should be seen as a potentially interesting pattern by itself, rather than a statistical artefact that should be controlled for.

Box 1. Zeros are also relevant prevalences

Non-parasitised individuals and populations are clearly not the scope of parasitologists. This has produced a parasitological literature traditionally biased towards positive prevalence values. As an example, the world avian host–haemoparasite catalogues [26,27] report, for a given bird species, all studies that found some parasites, but only one study example for species that were not parasitised. In 1982, the seminal paper by Hamilton and Zuk [28] extended the interest in parasites among evolutionary ecologists by suggesting a role for parasites in the evolution of plumage colouration and song in birds. Moreover, this generated a plethora of hypotheses that were also initially tested by making use of data previously published by parasitologists, thus suffering from their biases [7]. These new hypotheses encouraged evolutionary ecologists to initiate extensive parasite surveys, finding great variability in parasite burdens among and within species and previously hidden zero prevalences. This led to new questions about the ecological factors, host behaviours and life history traits shaping such variation in nature [2,5,15,23].

Perhaps recalling the parasitological tradition, doubts arose about the effects of sample size on zero prevalences and the accuracy of prevalence estimates, even including suggestions that zero prevalences should be excluded from analyses and journal reports [13]. This publication bias has started to creep into ecological and zoological journals, enhanced by the fact that zero prevalences have already ceased to be surprising and thus generate less interest among journal editors.

Failure to find parasites in a sample could either be because of a real absence of parasites, or be the result of a low prevalence and intensity of parasites, leading to an apparent absence because of low sample size or low methodological sensitivity [29]. However, the relevant question here is whether the inclusion of zero prevalences improves the conclusions of the research. We feel that a full understanding of natural variation in parasite burdens should also consider why some populations and species have zero or very low parasite prevalences, whereas others have 100% or very high prevalences. For comparative purposes, a zero prevalence is as informative as a 1% prevalence (and similarly for 99% or 100%), if calculated using appropriate sample sizes (as for any other prevalence data). However, by excluding zero prevalences, we are throwing out extreme values from analyses, and thus excluding potential host populations or species with interesting ecological or life-history traits that make them completely or almost completely free from parasites.

Thus, we make here a plea that zero prevalences should be reported. As an alternative to scientific journals, we propose the creation of a website under scientific supervision for compiling data published in leading journals as well as 'grey' literature (such as non-international journals or conference proceedings), old, never-published data submitted by researchers, and data coming from future surveys of parasite prevalence, whether or not they are finally published. This would provide an easily available source of permanently updated data for researchers worldwide, becoming one more of the invaluable virtual services provided by natural history museums in this century.

Concluding remarks

Current practices for the analysis of prevalence data must be revised. Statistical tools that take into account the sample size from which each proportion has been obtained [23], that weight for sample size [8], or use individual infection status (infected or not) as the dependent variable [16] are increasingly being used. In this way, information is not lost because of sample size restrictions, but more weight is given to those data with higher sample sizes, such as in meta-analysis [24] or generalized linear (mixed) models [25]. However, there are many circumstances in which methods that do not control for sample size when analysing prevalence data must still be used [17], and some decision must be taken about how to choose and analyse data. In these cases, we have shown that the use of residuals is a flawed method; avoiding zero prevalences is unfounded and entails the loss of very relevant information; and rejecting prevalence data obtained from low sample sizes should be done in a conscientious way according to the shape of the curve in Figure 2, even to the extent of testing the robustness of the results at different minimum sample sizes.

References

1 Read, A.F. and Harvey, P.H. (1989) Reassessment of comparative evidence for Hamilton and Zuk theory on the evolution of secondary sexual characters. Nature 339, 619–620
2 Torchin, M.E. et al. (2003) Introduced species and their missing parasites. Nature 421, 628–630
3 Poiani, A. (1992) Ectoparasitism as a possible cost of social life: a comparative analysis using Australian passerines (Passeriformes). Oecologia 92, 429–441
4 Yezerinac, S.M. and Weatherhead, P.J. (1995) Plumage coloration, differential attraction of vectors and haematozoa infections in birds. J. Anim. Ecol. 64, 528–537
5 Schalk, G. and Forbes, M.R. (1997) Male biases in parasitism of mammals: effects of study type, host age, and parasite taxon. Oikos 78, 67–74
6 Poulin, R. (1996) Sexual inequalities in helminth infections: a cost of being a male? Am. Nat.
147, 287–295
7 Tella, J.L. (2002) The evolutionary transition to coloniality promotes higher blood parasitism in birds. J. Evol. Biol. 15, 32–41
8 Scheuerlein, A. and Ricklefs, R.E. (2004) Prevalence of blood parasites in European passeriform birds. Proc. Biol. Sci. 271, 1363–1370
9 Arneberg, P. et al. (1998) Host densities as determinants of abundance in parasite communities. Proc. R. Soc. Lond. B. Biol. Sci. 265, 1283–1289
10 Poulin, R. and Mouritsen, K.N. (2003) Large-scale determinants of trematode infections in intertidal gastropods. Mar. Ecol. Prog. Ser. 254, 187–198
11 John, J. (1995) Parasites and the avian spleen: helminths. Biol. J. Linn. Soc. 54, 87–106
12 Pruett-Jones, S.G. et al. (1990) Parasites and sexual selection in birds of Paradise. Am. Zool. 30, 287–298
13 Gregory, R.D. and Blackburn, T.M. (1991) Parasite prevalence and host sample size. Parasitol. Today 7, 316–318
14 Sol, D. et al. (2000) Geographical variation in blood parasites in feral pigeons: the role of vectors. Ecography 23, 307–314
15 Ricklefs, R.E. et al. (2005) Community relationships of avian malaria parasites in southern Missouri. Ecol. Monogr. 75, 543–559
16 Mendes, L. et al. (2005) Disease limited distributions? Contrasts in the prevalence of avian malaria in shorebird species using marine and freshwater habitats. Oikos 109, 396–404
17 Harvey, P.H. and Pagel, M.D., eds (1991) The Comparative Method in Evolutionary Biology, Oxford University Press
18 Poulin, R. and Valtonen, E.T. (2001) Nested assemblages resulting from host size variation: the case of endoparasite communities in fish hosts. Int. J. Parasitol. 31, 1194–1204
19 Garland, T., Jr. et al. (1992) Procedures for the analysis of comparative data using phylogenetically independent contrasts. Syst. Biol. 41, 18–32
20 Walther, B.A. and Morand, S. (1998) Comparative performance of species richness estimation methods. Parasitology 116, 395–405
21 Gregory, R.D. and Woolhouse, M.E.J. (1993) Quantification of parasite aggregation: a simulation study. Acta Trop. 54, 131–139
22 Pruett-Jones, M. and Pruett-Jones, S. (1991) Analysis and ecological correlates of tick burdens in a New Guinea avifauna. In Bird-Parasite Interactions (Loye, J.E. and Zuk, M., eds), pp. 155–176, Oxford University Press
23 Tella, J.L. et al. (1999) Habitat, world geographic range, and embryonic development of host explain the prevalence of avian hematozoa at small spatial and phylogenetic scales. Proc. Natl. Acad. Sci. U. S. A. 96, 1785–1789
24 Hedges, L.V. and Olkin, I. (1985) Statistical Methods for Meta-Analysis, Academic Press
25 Paterson, S. and Lello, J. (2003) Mixed models: getting the best use of parasitological data. Trends Parasitol. 19, 370–375
26 Bennett, G.F. et al. (1982) Host–Parasite Catalogue of the Avian Haematozoa, Occasional Papers in Biology. Memorial University of Newfoundland
27 Bishop, M.A. and Bennett, G.F. (1992) Host–Parasite Catalogue of the Avian Haematozoa (Suppl. 1), Occasional Papers in Biology. Memorial University of Newfoundland
28 Hamilton, W.D. and Zuk, M. (1982) Heritable true fitness and bright birds: a role for parasites? Science 218, 384–387
29 Cooper, J.E. and Anwar, M.A. (2001) Blood parasites of birds: a plea for more cautious terminology. Ibis 143, 149–150