Electronic Supplementary Material 3. Addressing alternative explanations for a correlation
between the occurrence of rare alleles and expected heterozygosity.
ESM 3a Allele-calling errors.
The process of calling microsatellite alleles in next generation sequencing data is inevitably error-prone, particularly when the data are low coverage. The likelihood-based program lobSTR appears
effective at making the best calls given the data, but cannot of course guard against issues such as
slippage during library preparation. Having said this, errors appear to be rare,
as evidenced by the high probability, 0.699, that a called allele is the reference sequence allele.
For the low variability loci I analyse, this level of agreement with the reference is very much what is expected.
To assess the possible impact of mis-called alleles on the patterns I described, it is useful to
categorise the three possible classes of event:
i) A mis-called allele creates a new allele that is neither a PNM nor one of the other alleles present in
all four population groups.
ii) An allele is mis-called in a way that creates a PNM.
iii) An allele is mis-called to create a non-PNM allele that is called elsewhere in the dataset.
Case (i) can be dismissed because such instances will be excluded from the analysis. Case (ii) will
be ignored if the allele created is already present as a PNM. If the mis-called allele creates a PNM de
novo then there exists the possibility that populations with lower quality data will carry more
PNMs. However, this possibility will only be realized if the same population has higher
heterozygosity, a scenario that is addressed below. Finally, case (iii) has the potential to impact
heterozygosity. However, the impact will be very small because most population group – locus
combinations are based on >200 allele calls. More importantly, there is no particular reason why
such mis-calls will predictably increase (or decrease) heterozygosity. Thus, although at first glance
it might seem plausible that mis-called alleles might increase both heterozygosity and the
frequency of PNMs, in practice the stringent requirement for all populations to carry the same
number and identity of alleles means that no net change in heterozygosity is expected.
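The size of the effect in case (iii) can be illustrated with a minimal sketch. It assumes expected heterozygosity is estimated as one minus the sum of squared allele frequencies, which is a common estimator but is not stated in this supplement. The allele labels and counts are hypothetical and chosen only to match the scale of about 200 calls.

```python
from collections import Counter

def expected_heterozygosity(allele_calls):
    """Return 1 - sum(p_i^2) for a list of allele calls (assumed estimator)."""
    n = len(allele_calls)
    counts = Counter(allele_calls)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical locus: 200 calls, two alleles (repeat counts 12 and 13).
calls = [12] * 140 + [13] * 60
h_before = expected_heterozygosity(calls)

# One copy of the common allele is mis-called as the rarer, existing allele.
h_after = expected_heterozygosity([12] * 139 + [13] * 61)

# The reverse error: one copy of the rarer allele is mis-called as the common one.
h_reverse = expected_heterozygosity([12] * 141 + [13] * 59)

shift = h_after - h_before
print(f"before={h_before:.5f} after={h_after:.5f} reverse={h_reverse:.5f}")
```

Under these assumptions a single mis-call moves heterozygosity by roughly 0.004, and the direction depends on which allele is mis-called, which is consistent with the argument that no predictable net change is expected.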
ESM 3b Controlling more generally for population-specific issues.
Mis-called alleles are just one possible reason why a given population or population group might
carry more PNMs than others. Demographic history and sample structure are two more. For
example, expanding populations are likely to carry more rare alleles while population groups
comprising disparate populations will tend to carry alleles at more even frequencies, as evidenced
by Africa having on average 18% higher heterozygosity than the other three groups. Wherever
populations differ in heterozygosity, the possibility exists that the same population also carries
more PNMs, for reasons that may or may not be directly related. Such a pattern can potentially
drive a correlation between PNM occurrence and heterozygosity even when no causal relationship
exists.
To remove the risk of trends driven by population group characteristics I transformed the data
such that heterozygosity values within each population group all have the same mean and variance.
This was achieved by subtracting the group mean from each value and then dividing by the group
standard deviation. This yields standardised values (z-scores) with mean 0 and unit standard
deviation within each group. When this is done, trends driven by population-specific properties such as demographic
history, the frequency of allele mis-calling and others should be removed. All that will remain will
be trends in which PNMs are genuinely associated with higher heterozygosity relative both to the
same locus in other populations and to other loci with the same number of alleles in the same
population.
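The transformation described above can be sketched as follows. This is an illustration rather than the code used for the analysis: the function name, the record layout and the example values are hypothetical, and the population standard deviation is assumed because the text does not say which form was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardise_within_groups(records):
    """Convert heterozygosity to z-scores within each population group.

    records: list of (group, locus, heterozygosity) tuples.
    Assumes every group has at least two distinct heterozygosity values.
    """
    by_group = defaultdict(list)
    for group, _locus, het in records:
        by_group[group].append(het)
    # Per-group mean and (population) standard deviation.
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}
    return [(g, locus, (het - stats[g][0]) / stats[g][1])
            for g, locus, het in records]

# Hypothetical values: one group has higher heterozygosity overall.
records = [
    ("Africa", "locus1", 0.52), ("Africa", "locus2", 0.40), ("Africa", "locus3", 0.46),
    ("Other", "locus1", 0.41), ("Other", "locus2", 0.33), ("Other", "locus3", 0.37),
]
z_scores = standardise_within_groups(records)
for row in z_scores:
    print(row)
```

After the transformation the overall difference between the groups disappears, so only the relative ranking of loci within each group can contribute to any remaining trend.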
ESM 3c Is there an influence of natural selection and selective sweeps?
Positive selection has the potential both to reduce heterozygosity and to remove PNMs. If the
selection is very recent and impacts humans in just one part of the world, selection might possibly
offer an explanation why population-locus combinations with PNMs also have relatively high
heterozygosity. Other evidence argues against this, with a recent review concluding “Classical
sweeps have been shown to be rare in humans and, if they do exist, they occur around loci with
large effect alleles”. Nonetheless, I also tested a clear prediction arising from models based on selection.
Selection will tend to generate regions with high linkage disequilibrium, within which
microsatellites will tend to show the same pattern: the same population groups carrying higher
heterozygosity and PNMs. Consequently, models based on selection predict that PNMs will be
clustered. In practice, I find no evidence of clustering: the probability that adjacent PNMs are found
in the same population is indistinguishable from that expected under a random ordering, and this probability does
not vary with distance between adjacent PNMs (see text).
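One way such a clustering prediction could be examined is with a permutation test. The sketch below is an assumption about how the comparison with a random ordering might be set up, not a description of the test reported in the main text. The function names and the example labels are hypothetical.

```python
import random

def adjacent_match_fraction(labels):
    """Fraction of neighbouring PNMs that fall in the same population group."""
    pairs = list(zip(labels, labels[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

def clustering_p_value(labels, n_perm=2000, seed=0):
    """One-sided permutation p-value for excess same-group adjacency."""
    rng = random.Random(seed)
    observed = adjacent_match_fraction(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random ordering of the same labels
        if adjacent_match_fraction(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical group labels of PNMs in chromosomal order, strongly clustered.
clustered = ["A"] * 10 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10
obs, p = clustering_p_value(clustered)
print(f"observed match fraction = {obs:.3f}, p = {p:.4f}")
```

Under selection-driven clustering the observed fraction would sit well above the shuffled values, as in this artificial example. A value close to the middle of the shuffled distribution would correspond to the absence of clustering described above.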