Tree Genetics and Genomes - Springer Static Content Server

advertisement
Assessing microsatellite linkage disequilibrium in wild, cultivated,
and mapping populations of Theobroma cacao L and its impact
on association mapping.
Tree Genetics and Genomes
J. Conrad Stack1, Stefan Royaert2, Osman Gutiérrez3, Chifumi Nagai4, Ioná Santos Araújo Holanda5,
Raymond Schnell1, Juan-Carlos Motamayor1*
1Mars,
Incorporated, McLean, VA, USA
Center for Cocoa Science, CP55 Itajuípe, Bahia, Brazil, 45625-000
3USDA-ARS, Subtropical Horticulture Research Station, Miami, FL, USA
4Hawaii Agriculture Research Center, Kunia, HI, USA
5Universidade Federal Rural do Semi-Arido, Departamento de Ciências Vegetais, BR 110 - Km 47,
Bairro Pres. Costa e Silva, Mossoró-RN, Brazil CEP 59.625-900
2Mars
*corresponding author (juan.motamayor@effem.com)
PHASE parameters, convergence diagnostics, and sample partitioning
Usage
Due to the large number of alleles (and allele combinations) present in our data sets, PHASE
(v2.1.1) need to be compiled with minor modifications to constants.hpp (Line ~14: const
int KMAX=200;). Each time PHASE was run, a new random variable seed was used (-S
argument). A typical run looks like this:
PHASE -MS -d1 -l3 -X10 –S123456789 infile.txt outfile.out 5000 1000 2500
Modeling Parameter Sensitivity
A number of additional PHASE analyses were carried out in order to gauge the possible effects
that our modeling parameter choice might have had on the linkage disequilibrium values we
observed (i.e. Figure 2+ in the main text). This modeling parameter, represented in PHASE by the
-M argument, indicates that recombination either should (-MR) or should not (-MS) be explicitly
considered when determining the haplotype phase of the genotypes of the sampled individuals.
The no-recombination model (-MS) was our model of choice for the analyses presented in the
main text. The reasoning behind this choice was that we did not want to bias our downstream LD
calculations by using a haplotype phasing model that assumes LD decays with the distance
between markers (cite Stephens 2003).
According to the PHASE (v2.1.1) documentation, in a very limited trial, model choice did not
appear to have a substantial affect on error rates. All PHASE analyses mentioned below, using
either model, had converged according to the other criteria discussed in the main text. Due to the
extremely long runtimes exhibited by PHASE, most of our sensitivity analyses were run on a
subset of the main data. In Figure 1A, we compare LD values from a subset of both the
microsatellites (N=67) and sample individuals (N=200, 20 from each of the ten structural groups).
The sample individuals were the top twenty individuals in each structural group assigned to that
group by STRUCTURE. In Figure 1B, we compare LD values from the same subset of microsatellite
loci and all of the samples from the mapping population.
Figure 2: PHASE modeling comparison plots for a subset of the full population data (A) and the
mapping population (B). LD values for PHASE (no-recombination model, -MS) on the x-axis and
PHASE (recombination model, -MR) on the y-axis. Grey lines represent rough confidence intervals
generated by resampling haplotypes from PHASE output (N=500). All within linkage group
pairwise comparisons are shown above, for all linkage groups.
Convergence
Both convergence criteria mentioned in the main text (passing either Geweke’s or Heidel’s
convergence test) were implemented in R using the coda package. We only looked at the
convergence of the first column in the “_monitor” output trace file, which was recommended by
the program author in its documentation. The R snippet below demonstrates how these tests
were run:
require(coda)
ff = dir(path=finaldir,pattern=".*monitor",full.names=T)
for(ii in seq(length(ff)))
{
mcts = mcmc(read.table(ff[ii],header=F)[,1])
plot(mcts)
title(basename(ff[ii]))
print(geweke.diag(mcts)) # Geweke’s test
print(heidel.diag(mcts)) # Heidel’s test
print(effectiveSize(mcts)) # effective sample size
readline()
}
These convergence criteria appeared to be rather conservative in that “fuzzy caterpillar”
likelihood traces were often judged to have failed one or the other. Examples of this are shown
below (Figures 2 and 3). These runs were deemed to have converged, based on the two statistics
and our own judgment. A main reason why extensive re-runs were not carried out was the
extremely long and variable runtimes of PHASE, which ranged from a few hours to a few weeks.
Figure 2. Example likelihood trace where Geweke’s test failed. Plots were generated the
plot.mcmc function in coda with default arguments.
Figure 3. Example likelihood trace where Heidel’s test failed. Plots were generated the
plot.mcmc function in coda with default arguments
Sample partitioning
Below, pairwise LD values are compared for the full structured population data, which was phased
under two different partitioning schemes. All linkage groups were analyzed separately. In the first
partitioning scheme (“Unstructured”), the 778 samples were randomly divided into two groups
prior to haplotype phasing. This was done due to the extremely long runtime of PHASE, which
would not accommodate the full set of samples. All analyses started with the full set failed to
finish within two weeks of starting, and appeared likely not to finish with a month or two
(depending on the linkage group). In the second partitioning scheme (“Structured”), samples
were divided into groups based on their structural group assignment (Motamayor, 2008). We
compared the LD values resulting from the haplotypes inferred under both partitioning schemes
(Figure 4).
While a majority of the LD values did not differ substantially, a sizable minority did, mostly at the
low to middle range of values. We suspect that these differences were due in large part to the
genotype imputation that PHASE performs and to a lesser degree the haplotyping. In the
unstructured partitioning scheme, the genotype imputation algorithm has in theory a larger pool
of alleles to choose from for most loci. As a result, some of the unstructured LD values are
underestimated relative the structured LD values, where a more limited pool of alleles are
available (with a greater chance for inducing non-random associations between the alleles at a
pair of loci).
Interestingly, the qualitative patterns of LD decay versus physical distance between loci did not
change substantially between the two datasets (Figures 5 and 6). So, while sample partitioning is
clearly a serious consideration, one that warrants further exploration, it does not substantially
affect the finding presented in the main manuscript.
1.0
0.8
0.6
0.4
0.2
0.0
LD values, Ten Structural groups
0.0
0.2
0.4
0.6
0.8
1.0
LD values, Two random groups
Figure 4. Pairwise LD values calculated using haplotypes from PHASE when all 778 samples were
randomly assigned to one of two groups (x-axis, so-called “unstructed” set) and when the samples
were divided up into their respective structural groups (y-axis). Vertical and horizontal light gray
bars indicate the variability of the LD estimates according to our resampling procedure.
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
10
xvalue
xvalue
xvalue
0
Purus
yvalue
9
Nanay
yvalue
1.00
0.75
0.50
0.25
0.00
8
Nacional
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
7
Maranon
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
6
Iquitos
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
4
Hybrid
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
3
Guiana
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
2
Curaray
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
1
Criollo
yvalue
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
yvalue
Amelonado Contamana
1.00
0.75
0.50
0.25
0.00
20
xvalue
40 0
20
xvalue
40 0
20
xvalue
40 0
20
40 0
20
40 0
20
40 0
20
40 0
20
40 0
xvalue
xvalue
xvalue
xvalue
xvalue
Physical
distance
between
loci (in
megabases)
20
xvalue
40 0
20
xvalue
40 0
20
40
xvalue
Figure 5. LD values from unstructed (two random groups) haplotype phasing, broken down by structural (columns) and linkage
(rows) groups.
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
yvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
xvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
xvalue
yvalue
10
xvalue
xvalue
xvalue
0
Purus
yvalue
9
Nanay
yvalue
1.00
0.75
0.50
0.25
0.00
8
Nacional
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
7
Maranon
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
6
Iquitos
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
4
Hybrid
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
3
Guiana
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
2
Curaray
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
1
Criollo
yvalue
yvalue
yvalue
1.00
0.75
0.50
0.25
0.00
yvalue
Amelonado Contamana
1.00
0.75
0.50
0.25
0.00
20
xvalue
40 0
20
xvalue
40 0
20
xvalue
40 0
20
40 0
20
40 0
20
40 0
20
40 0
20
40 0
xvalue
xvalue
xvalue
xvalue
xvalue
Physical
distance
between
loci (in
megabases)
20
xvalue
40 0
20
xvalue
40 0
20
40
xvalue
Figure 6. LD values from structured (ten structural groups) haplotype phasing, broken down by structural (columns) and linkage
(rows) groups.
Download