jovani & tella trends parasitol 2006.doc

Parasite prevalence and sample size:
misconceptions and solutions
Roger Jovani and José L. Tella
Department of Applied Biology, Estación Biológica de Doñana, Consejo Superior de Investigaciones Científicas,
Avenida Maria Luisa s/n, 41013 Sevilla, Spain
Parasite prevalence (the proportion of infected hosts) is
a common measure used to describe parasitaemias and
to unravel ecological and evolutionary factors that
influence host–parasite relationships. Prevalence estimates are often based on small sample sizes because of
either low abundance of the hosts or logistical problems
associated with their capture or laboratory analysis.
Because the accuracy of prevalence estimates is lower
with small sample sizes, addressing sample size has
been a common problem when dealing with prevalence
data. Different methods are currently being applied to
overcome this statistical challenge, but far from being
alternative correct ways of solving the same problem, some
are clearly wrong, and others need improvement.
Introduction
In a given population with N individuals, the prevalence
(P) denotes the proportion of individuals infected by a
given parasite species or group of species (e.g. a parasite
genus). The actual prevalence of a population, however, is
usually unknown because the number of sampled hosts (n;
the sample size) is generally lower than total population
size (N). However, we can easily obtain an estimate (p) by
dividing the number of infected individuals (i) by the
number of sampled ones [p = (i/n)*100], where i = 0, 1, 2, ..., n
and n = 1, 2, 3, ..., N. Nonetheless, the accuracy of this
estimate is known to be affected by sample size. This
problem has long concerned researchers because the
results of statistical analyses dealing with prevalence
data and the derived conclusions could depend greatly on
the number of sampled hosts [1]. Here, we review current
methods used to overcome this statistical problem,
identify wrong approaches, improve others and suggest
more powerful practices.
Basic concepts
The low accuracy of prevalence estimates when using a
small sample size has a mathematical basis. Given that
both the number of infected individuals and sample size
are integers, sampling prevalence (i.e. the prevalence
within a given sample of individuals) is constrained to a
particular set of values for each sample size (Figure 1).
Moreover, the behaviour of prevalence estimates for
slight increases in the number of infected individuals
changes dramatically when we move from small to large
sample sizes. For instance, if we sample six adult
lizards and find four to be infected (n = 6, i = 4) we
obtain p = (4/6)*100 = 67%, but if we found five infected,
then p = 83%, a difference of 16 percentage points. However,
for n = 100, i = 4 returns p = 4%, and i = 5 returns p = 5%,
a difference of only 1 percentage point. As a result, slight
prevalence differences can be detected at high sample sizes
but not at low ones. In other words, our uncertainty about
the real prevalence (i.e. population prevalence, P) is
higher at low sample sizes (Figure 2).
Corresponding author: Jovani, R. (jovani@ebd.csic.es).
The accuracy with which we calculate the prevalence
decreases not only at low sample sizes (as expected), but
also when the populational prevalence is close to 50%
(Figure 2). At first, this property could seem
counterintuitive, because rare events (e.g. a parasitised
individual at P = 1% or an uninfected individual at P = 99%) are
difficult to observe, particularly at low sample sizes.
However, when populational prevalence is close to zero or
100%, sample prevalence is also close to or equal to zero
or 100%, respectively, precisely because rare events are
difficult to detect. At intermediate populational
prevalences (e.g. P = 52%), the chance of finding parasitised or
non-parasitised individuals is similar, and sample
prevalence can then fluctuate freely between zero and 100%,
straying from the actual populational prevalence and
increasing the uncertainty in the prevalence estimate.
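Both properties described in this section can be checked numerically. The following Python sketch lists the sample prevalences that a sample of n hosts can yield and evaluates the binomial standard deviation of the sample prevalence. The binomial model (hosts sampled independently, each infected with probability P/100) and all function names are assumptions introduced here for illustration; they are not taken from the article.

```python
import math


def attainable_prevalences(n):
    """All sample prevalences (%) that a sample of n hosts can yield."""
    return [100.0 * i / n for i in range(n + 1)]


def prevalence_step(n):
    """Change in sample prevalence (%) caused by one more infected host."""
    return 100.0 / n


def sd_of_sample_prevalence(P, n):
    """Binomial standard deviation (%) of the sample prevalence.

    P is the population prevalence in percent. Assumes hosts are sampled
    independently, each infected with probability P / 100.
    """
    prop = P / 100.0
    return 100.0 * math.sqrt(prop * (1.0 - prop) / n)


if __name__ == "__main__":
    # The three values that are possible with two sampled hosts.
    print(attainable_prevalences(2))
    # One extra infected host matters far more at n = 6 than at n = 100.
    print(prevalence_step(6), prevalence_step(100))
    # Spread of the estimate at n = 15 for several population prevalences.
    for P in (1, 25, 50, 75, 99):
        print(P, round(sd_of_sample_prevalence(P, 15), 1))
```

Under this model the spread is symmetric around P = 50%, where it is largest, and it shrinks as n grows.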
Minimum sample size
The most intuitive method to overcome the high
statistical uncertainty of prevalences calculated from
low sample sizes is to reject data obtained from small
sample sizes. However, the threshold for establishing a
minimum sample size is highly variable because it
depends on subjective decisions of the researchers. This
leads to researchers either not considering a minimum
sample size (e.g. [2]), or considering low minimum sample
sizes, such as three [3], five [4], or eight [5]; medium
sample sizes, such as ten [6], 15 [7] or 20 [8]; and higher
ones up to 30 [9] or even 75 [10]. Obviously, more is
better, but the relationship is not linear: uncertainty
decreases rapidly as sample size increases up to 10–20
individuals, and only slowly with further increases
(Figure 2). Thus, a sample size around 15 could be used
as a reasonable trade-off between not losing too much data
from analyses and maintaining acceptable levels of
uncertainty (around 1/3 of the sampling prevalence). An
additional recommendation is to test the robustness of the
results of the analyses using different minimum sample size
cut-offs [11,12].
[Figure 1: three panels, (a) to (c); x-axis: sample size (n); y-axis: sampling prevalence (p).]
Figure 1. All the possible values that prevalence could reach at different sample
sizes. For instance, for a sample size of 1, prevalence could be either (0/1)*100 = 0%
or (1/1)*100 = 100%; for a sample size of 2, there are three possibilities:
(0/2)*100 = 0%, (1/2)*100 = 50%, and (2/2)*100 = 100%, and so on. This is
illustrated for sample sizes (a) from 1 to 50 on linear axes, (b) from 1 to 3500 on
log-log axes, and (c) from 1 to 3500 with a logarithmic x-axis and a linear y-axis.
[Figure 2: surface plot of standard error against sample size (n) and populational prevalence.]
Figure 2. Standard error of prevalence estimates at different sample sizes and
populational prevalences. Standard error (SE) is calculated using the formula:
SE = 100*sqrt[p*q/(n-1)], where p is the sample prevalence (as a proportion),
q = 1-p, and n is the sample size. The white line shows results for n = 15.
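As a rough check of the trade-off around n = 15, the standard error formula of Figure 2 can be evaluated directly. The sketch below assumes the square-root form SE = 100*sqrt[p*q/(n-1)] with p expressed as a proportion; the function name and the sample sizes chosen are illustrative only.

```python
import math


def standard_error(p, n):
    """SE (%) of a sample prevalence p (a proportion, 0-1) from n hosts.

    Uses SE = 100 * sqrt(p * q / (n - 1)) with q = 1 - p; needs n >= 2.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return 100.0 * math.sqrt(p * (1.0 - p) / (n - 1))


if __name__ == "__main__":
    # Worst case (p = 0.5): the gain per added host shrinks quickly with n.
    for n in (2, 5, 10, 15, 20, 50, 100):
        print(n, round(standard_error(0.5, n), 1))
```

With p = 0.5 the SE falls from 50 at n = 2 to about 13 at n = 15, but only to about 5 at n = 100, which is the flattening that motivates a cut-off near 15.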
www.sciencedirect.com
Avoiding zero prevalences
Gregory and Blackburn [13] claimed that the minimum
but not the maximum prevalence that could be achieved in
a sample of a population is affected by sample size. This
influential paper caused several studies to reject zero
prevalences from their analyses or to question their previous
use. These authors stated correctly [13] that large sample
sizes (e.g. 1000 hosts) are needed to detect very low
prevalences (e.g. 0.1%), but that 100% prevalences could
be detected with only one individual sampled if it is
infected. They presented figures with both axes
log-transformed (similar to Figure 1b), but this hid the
symmetric shape of the actual relationship between
sample size and prevalence (Figure 1a,c). That is, as
happens with prevalences near zero, prevalences near
100% (e.g. 99.9%) could only be achieved with high sample
sizes (e.g. 1000 hosts). Thus, according to this symmetry
and the suggestions of Gregory and Blackburn [13], we
should also reject 100% prevalences from analyses.
Parasite prevalence differs widely both between and
within host species [14,15]. Therefore, when assessing
sources of variability in parasite prevalences (e.g. between
marine and freshwater habitats [16]), a prevalence value
of zero has the same ecological relevance as a prevalence of
0.1%; and the same happens between 100 and 99.9%.
Thus, we are throwing out very relevant information by
rejecting zero and 100% prevalences. Accordingly, zero
(see Box 1) and 100% prevalences should be included in
the analyses.
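The symmetry between zero and 100% prevalences can be made explicit with a small calculation. The sketch below assumes that hosts are sampled independently from a population with prevalence P (given as a proportion), so that a sample of n hosts contains no infected host with probability (1 - P)^n and only infected hosts with probability P^n. This model and the function names are assumptions made for illustration.

```python
def prob_zero_prevalence(P, n):
    """Probability that none of n sampled hosts is infected."""
    return (1.0 - P) ** n


def prob_full_prevalence(P, n):
    """Probability that all n sampled hosts are infected."""
    return P ** n


def extreme_intermediate_values(n):
    """Smallest non-zero and largest below-100 sample prevalence (%), n >= 2."""
    return 100.0 / n, 100.0 * (n - 1) / n


if __name__ == "__main__":
    # A population prevalence of 0.1% is recorded as zero exactly as often
    # as a population prevalence of 99.9% is recorded as 100%.
    for n in (10, 100, 1000):
        print(n, prob_zero_prevalence(0.001, n), prob_full_prevalence(0.999, n))
    # Both 0.1% and 99.9% first become attainable at n = 1000.
    print(extreme_intermediate_values(1000))
```

Any argument for discarding zeros on sample-size grounds therefore applies equally to 100% values.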
Residuals of prevalence on sample size
Another method used to control for the potential effects of
sample size is to obtain the residuals of a linear regression
between sampling prevalence (the dependent variable)
and sample size (the independent variable), and use them
as the dependent variable for comparative studies [17,18].
A clear example of this rationale is a study [18] in which
residuals of parasite prevalence against sample size were
used in an independent contrast analysis when a
correlation was found between these variables, but not in
another analysis of the same study, in which such a
relationship was not found [18]. This approach follows
previous methods that aimed to remove the effect of body
size on body-size-related variables, such as home range
area [19]. Moreover, it has been influenced by the need to
control for sample size when analysing parasite richness,
because the more host individuals that are examined, the
more parasite species could be found [7,20].
Box 1. Zeros are also relevant prevalences
Non-parasitised individuals and populations are clearly not within
the scope of parasitologists. This has produced a parasitological
literature traditionally biased towards positive prevalence values.
As an example, the world avian host–haemoparasite catalogues
[26,27] report, for a given bird species, all studies that found some
parasites, but only one study example for species that were not
parasitised. In 1982, the seminal paper by Hamilton and Zuk [28]
extended the interest in parasites among evolutionary ecologists by
suggesting a role for parasites in the evolution of plumage
colouration and song in birds. Moreover, this generated a plethora
of hypotheses that were also initially tested by making use of data
previously published by parasitologists, thus suffering from their
biases [7].
These new hypotheses encouraged evolutionary ecologists to
initiate extensive parasite surveys, finding great variability in
parasite burdens among and within species and previously hidden
zero prevalences. This led to new questions about the ecological
factors, host behaviours and life history traits shaping such a
variation in nature [2,5,15,23]. Perhaps recalling the parasitological
tradition, doubts arose about the effects of sample size on zero
prevalences and the accuracy of prevalence estimates, even
including suggestions that zero prevalences should be excluded
from analyses and journal reports [13]. This publication bias has
started to creep into ecological and zoological journals, enhanced
by the fact that zero prevalences have already ceased to be
surprising and thus generate less interest among journal editors.
Failure to find parasites in a sample could either be because of a
real absence of parasites, or be the result of a low prevalence and
intensity of parasites, leading to an apparent absence because of
low sample size or low methodological sensitivity [29]. However,
the relevant question here is whether the inclusion of zero
prevalences improves the conclusions of the research. We feel
that a full understanding of natural variation in parasite burdens
should also consider why some populations and species have zero
or very low parasite prevalences, whereas others have 100% or
very high prevalences. For comparative purposes, a zero prevalence
is as informative as a 1% prevalence (and similarly for 99% or
100%), if calculated using appropriate sample sizes (as for any
other prevalence data). However, by excluding zero prevalences, we
are throwing out extreme values from analyses, and thus excluding
potential host populations or species with interesting ecological or
life-history traits that make them completely or almost completely
free from parasites.
Thus, we make here a plea that zero prevalences should be
reported. As an alternative to scientific journals, we propose the
creation of a website under scientific supervision for compiling data
published in leading journals as well as ‘grey’ literature (such as
non-international journals or conference proceedings), old
unpublished data submitted by researchers, and data coming from
future surveys of parasite prevalence, whether or not they are
finally published. This would provide an easily available source of
permanently updated data for researchers worldwide, becoming one
more of the invaluable virtual services provided by natural history
museums in this century.

Adding more support to this method, and using
empirical log-log plots similar to Figure 1b, Gregory and
Blackburn [13] suggested that a negative relationship
between sample size and prevalence is expected as a
mathematical artefact (and thus needed to be controlled
for). In addition, they stated that nothing but negative
slopes were found when simulating the effect of sample
size (from 1 to 3500) on prevalence estimates when zero
values were deliberately avoided. Clearly, however, there
is no mathematical relationship between prevalence and
sample size per se (Figure 1). We confirmed this by
repeating the same simulation study done in Ref. [13]. We
performed 100 simulations in which 200 hypothetical
species (or populations) took prevalence values varying
between zero and 100% and sample sizes between 1 and
3500 (R.J. and J.L.T., unpublished). Among the resulting
relationships, 49 were negative (Spearman rank
correlations, r range = −0.0008 to −0.1714; mean = −0.0565)
and 51 positive (r range = 0.0045–0.2675; mean = 0.0598),
only two negative and four positive weak correlations
being statistically significant. Moreover, the results were
identical when zero prevalences were not allowed in the
simulations. Our results differ greatly from those of Ref.
[13], suggesting that the conclusions of Gregory and
Blackburn were based on the visual examination of
log-log plots similar to Figure 1b.
In a more recent but largely unnoticed study regarding
the relationships between prevalence and sample size,
Gregory and Woolhouse [21] simulated the effect of sample
sizes ranging from 10 to 1280 on the mean and the
accuracy of prevalences estimated for a theoretical
population with a prevalence of 80%. They concluded
that prevalence estimates were not biased under any
sample size, with low sample sizes only causing greater
inaccuracy, which is consistent with our own analysis
(Figure 3; see later). However, Gregory and Woolhouse
[21] did not link these challenging results with their
previous work, and their first recommendations [13] have
prevailed among researchers so far.
Curiously, however, although there is clearly no
relationship between prevalence and sample size
(Figure 1a), there have been reports not only of null
empirical correlations between prevalence and sample
size [18], but also of negative [18] and positive [22]
correlations. Why? Because of the effect of detecting rare
events at low sample sizes. This point is illustrated by
simple simulations using different sample sizes and
populational prevalences (Figure 3). At low sample sizes
(1–15), sampling prevalences and sample sizes were
usually positively correlated at low populational
prevalences, uncorrelated at intermediate populational
prevalences, and negatively correlated at high populational
prevalences (Figure 3a). At higher sample sizes (15–100),
however, significant correlations were fewer and were
evenly distributed (Figure 3b). This is because, at low
populational prevalence, the chances of finding an infected
individual in a small sample are low, but the
chances increase with increasing sample size; the reverse
happens at higher populational prevalences (Figure 3a).
However, when all the species have a minimum sample
size above 15, the effect of rare events on sample
prevalence is buffered at any population prevalence
(Figure 3b).

[Figure 3: two panels, (a) and (b); x-axis: populational prevalence (P); y-axis: Spearman r between n and p.]
Figure 3. Simulation of the effect of sample size on the correlation between sample
size and prevalence at different actual populational prevalences. Each point
indicates the Spearman correlation coefficient (in red, p value < 0.05) between
sample size and prevalence for 100 simulated species with sample sizes randomly
varying (a) from 1 to 15 or (b) from 15 to 100.
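Simulations of this kind are easy to reproduce in outline. The Python sketch below is an illustration written for this purpose, not the authors' code: it assumes that each species' sample size is drawn uniformly within a range, that infection of each sampled host is an independent draw with probability P, and, for the 200-species runs, that each species' populational prevalence is drawn uniformly between 0 and 100%. All function names are hypothetical.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of average ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return 0.0  # convention used here when one variable is constant
    return sxy / (sxx * syy) ** 0.5


def simulate_correlation(rng, n_species, n_min, n_max, P=None):
    """Spearman r between sample size and sample prevalence for one data set.

    P is the population prevalence as a proportion; if None, each species
    draws its own P uniformly between 0 and 1.
    """
    sizes, prevalences = [], []
    for _ in range(n_species):
        n = rng.randint(n_min, n_max)
        p_true = rng.random() if P is None else P
        infected = sum(rng.random() < p_true for _ in range(n))
        sizes.append(n)
        prevalences.append(100.0 * infected / n)
    return spearman(sizes, prevalences)


if __name__ == "__main__":
    rng = random.Random(1)
    # Small samples (n = 1-15) at low, intermediate and high prevalence.
    for P in (0.05, 0.5, 0.95):
        rs = [simulate_correlation(rng, 100, 1, 15, P) for _ in range(50)]
        print(P, round(sum(rs) / len(rs), 3))
    # Prevalence unrelated to sample size (n = 1-3500), 200 species per run.
    rs = [simulate_correlation(rng, 200, 1, 3500) for _ in range(5)]
    print(sum(r < 0 for r in rs), "negative of", len(rs))
```

Averaged over repeated data sets, the small-sample correlations should come out positive at low P, near zero at P = 50% and negative at high P, in line with the rare-event explanation.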
The problem with using the residuals from a regression
between prevalence and sample size is that it artificially
inflates the prevalence estimates of some species and
deflates those of others. Moreover, this
method should not be applied even if (by chance) a
correlation exists between prevalence and sample size in
a given data set. To illustrate this point, one can
deliberately create a statistical relationship between
sample size and populational prevalence by sampling
more individuals from populations with a higher
populational prevalence (Figure 4). The simulated sample size
and the sample prevalence correlate here, of course,
because they have been made to do so deliberately
(Figure 4a). There is also the expected correlation between
estimated and real prevalences in the initial data
(Figure 4b). However, the correlation between real
prevalences and calculated residuals is null. This means
that the residuals are unrelated to populational
prevalences! Residuals thus become statistical artefacts that
cannot be used as estimators of prevalence for
comparative purposes, and thus previous results obtained through
this method should be taken with caution.
Finally, it is worth noting that some relevant biological
factors could be shaping a prevalence–sample-size
relationship. For instance, in a recent study, Ricklefs et al.
[15] used the number of individuals trapped (the
sample size) as an index of bird species abundance in a
given study area. In this way, they used the relationship
between prevalence and sample size to assess the
potential relationship between host density and parasite
prevalence. Thus, the prevalence–sample-size relationship
should be seen as a potentially interesting pattern by
itself, rather than a statistical artefact that should be
controlled for.
Concluding remarks
Current practices for the analysis of prevalence data must
be revised. Statistical tools that take into account the
sample size from which each proportion has been obtained
[23], that weight for sample size [8], or use individual
infection status (infected or not) as the dependent variable
[16] are increasingly being used. In this way, information
is not lost because of sample size restrictions, but more
weight is given to those data with higher sample sizes,
such as in meta-analysis [24] or generalized linear (mixed)
models [25].
However, there are many circumstances in which
methods that do not control for sample size when
analysing prevalence data must still be used [17], and
some decision must be taken about how to choose and
analyse data. In these cases, we have shown that the use
of residuals is a flawed method; avoiding zero prevalences
is unfounded and entails the loss of very relevant
information; and rejecting prevalence data obtained from
low sample sizes should be done in a conscientious way
according to the shape of the curve in Figure 2, even to the
extent of testing the robustness of the results at different
minimum sample sizes.

[Figure 4: three panels; (a) sampling prevalence (p) against sample size (n); (b) sampling prevalence (p) against populational prevalence (P); (c) residuals n–p against populational prevalence (P).]
Figure 4. An illustration of why residuals from prevalence–sample-size regressions
cannot be used to correct for low sample sizes. (a) A simulated positive relationship
between prevalence and sample size was created deliberately by retrieving sample
prevalence (p) for 25 species (in red) with a populational prevalence P = 12.5% and n
from 1 to 25, 25 (in blue) with P = 37.5% and n from 26 to 50, 25 (in green) with
P = 62.5% and n from 51 to 75, and 25 (in black) with P = 87.5% and n from 76 to 100. (b)
The correlation between populational prevalence and sampling prevalence
(Spearman r = 0.847, n = 100, p < 0.0001) for the same simulated prevalences. (c) A
graph showing that there was no relationship (Spearman r = 0.124, n = 100,
p = 0.219) between true populational prevalence and the prevalence estimated as the
residuals of the regression line obtained in (a) between sample size and
sampling prevalence.
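The Figure 4 argument against residuals can be reproduced approximately with a short simulation. The Python sketch below is an illustration, not the authors' original code: it assumes one species per sample size from 1 to 100, independent infection draws for each sampled host, and an ordinary least-squares line of sample prevalence on sample size. All function names are hypothetical.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of average ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def residuals_on(x, y):
    """Residuals of the least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def simulate_figure4(rng):
    """One data set: four groups of 25 species, with P rising with n."""
    true_P, sizes, sampled_p = [], [], []
    for group, P in enumerate((12.5, 37.5, 62.5, 87.5)):
        for n in range(25 * group + 1, 25 * group + 26):
            infected = sum(rng.random() < P / 100.0 for _ in range(n))
            true_P.append(P)
            sizes.append(n)
            sampled_p.append(100.0 * infected / n)
    return true_P, sizes, sampled_p


if __name__ == "__main__":
    true_P, sizes, sampled_p = simulate_figure4(random.Random(1))
    resid = residuals_on(sizes, sampled_p)
    print("n vs p:        ", round(spearman(sizes, sampled_p), 3))
    print("P vs p:        ", round(spearman(true_P, sampled_p), 3))
    print("P vs residuals:", round(spearman(true_P, resid), 3))
```

In runs of this sketch, the sample prevalence should track the true prevalence closely, whereas the residuals retain little of that signal, because the regression on n removes most of the real differences between groups.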
References
1 Read, A.F. and Harvey, P.H. (1989) Reassessment of comparative
evidence for Hamilton and Zuk theory on the evolution of secondary
sexual characters. Nature 339, 619–620
2 Torchin, M.E. et al. (2003) Introduced species and their missing
parasites. Nature 421, 628–630
3 Poiani, A. (1992) Ectoparasitism as a possible cost of social life: a
comparative analysis using Australian passerines (Passeriformes).
Oecologia 92, 429–441
4 Yezerinac, S.M. and Weatherhead, P.J. (1995) Plumage coloration,
differential attraction of vectors and haematozoa infections in birds.
J. Anim. Ecol. 64, 528–537
5 Schalk, G. and Forbes, M.R. (1997) Male biases in parasitism of
mammals: effects of study type, host age, and parasite taxon. Oikos 78,
67–74
6 Poulin, R. (1996) Sexual inequalities in helminth infections: a cost of
being a male? Am. Nat. 147, 287–295
7 Tella, J.L. (2002) The evolutionary transition to coloniality promotes
higher blood parasitism in birds. J. Evol. Biol. 15, 32–41
8 Scheuerlein, A. and Ricklefs, R.E. (2004) Prevalence of blood parasites
in European passeriform birds. Proc. Biol. Sci. 271, 1363–1370
9 Arneberg, P. et al. (1998) Host densities as determinants of abundance in
parasite communities. Proc. R. Soc. Lond. B. Biol. Sci. 265, 1283–1289
10 Poulin, R. and Mouritsen, K.N. (2003) Large-scale determinants of
trematode infections in intertidal gastropods. Mar. Ecol. Prog. Ser.
254, 187–198
11 John, J. (1995) Parasites and the avian spleen: helminths. Biol.
J. Linn. Soc. 54, 87–106
12 Pruett-Jones, S.G. et al. (1990) Parasites and sexual selection in birds
of Paradise. Am. Zool. 30, 287–298
13 Gregory, R.D. and Blackburn, T.M. (1991) Parasite prevalence and
host sample size. Parasitol. Today 7, 316–318
14 Sol, D. et al. (2000) Geographical variation in blood parasites in feral
pigeons: the role of vectors. Ecography 23, 307–314
15 Ricklefs, R.E. et al. (2005) Community relationships of avian malaria
parasites in southern Missouri. Ecol. Monogr. 75, 543–559
16 Mendes, L. et al. (2005) Disease limited distributions? Contrasts in the
prevalence of avian malaria in shorebird species using marine and
freshwater habitats. Oikos 109, 396–404
17 Harvey, P.H. and Pagel, M.D., eds (1991) The Comparative Method in
Evolutionary Biology, Oxford University Press
18 Poulin, R. and Valtonen, E.T. (2001) Nested assemblages resulting
from host size variation: the case of endoparasite communities in fish
hosts. Int. J. Parasitol. 31, 1194–1204
19 Garland, T., Jr. et al. (1992) Procedures for the analysis of comparative
data using phylogenetically independent contrasts. Syst. Biol. 41, 18–32
20 Walther, B.A. and Morand, S. (1998) Comparative performance
of species richness estimation methods. Parasitology 116,
395–405
21 Gregory, R.D. and Woolhouse, M.E.J. (1993) Quantification of parasite
aggregation: a simulation study. Acta Trop. 54, 131–139
22 Pruett-Jones, M. and Pruett-Jones, S. (1991) Analysis and ecological
correlates of tick burdens in a New Guinea avifauna. In Bird-Parasite
Interactions (Loye, J.E. and Zuk, M., eds), pp. 155–176, Oxford
University Press
23 Tella, J.L. et al. (1999) Habitat, world geographic range, and
embryonic development of host explain the prevalence of avian
hematozoa at small spatial and phylogenetic scales. Proc. Natl.
Acad. Sci. U. S. A. 96, 1785–1789
24 Hedges, L.V. and Olkin, I. (1985) Statistical Methods for Meta-Analysis, Academic Press
25 Paterson, S. and Lello, J. (2003) Mixed models: getting the best use of
parasitological data. Trends Parasitol. 19, 370–375
26 Bennett, G.F. et al. (1982) Host–Parasite Catalogue of the Avian
Haematozoa, Occasional Papers in Biology. Memorial University of
Newfoundland
27 Bishop, M.A. and Bennett, G.F. (1992) Host-Parasite Catalogue of the
Avian Haematozoa (Suppl. 1), Occasional Papers in Biology. Memorial
University of Newfoundland
28 Hamilton, W.D. and Zuk, M. (1982) Heritable true fitness and bright
birds: a role for parasites? Science 218, 384–387
29 Cooper, J.E. and Anwar, M.A. (2001) Blood parasites of birds: a plea for
more cautious terminology. Ibis 143, 149–150