Assessing microsatellite linkage disequilibrium in wild, cultivated, and mapping populations of Theobroma cacao L and its impact on association mapping. Tree Genetics and Genomes J. Conrad Stack1, Stefan Royaert2, Osman Gutiérrez3, Chifumi Nagai4, Ioná Santos Araújo Holanda5, Raymond Schnell1, Juan-Carlos Motamayor1* 1Mars, Incorporated, McLean, VA, USA Center for Cocoa Science, CP55 Itajuípe, Bahia, Brazil, 45625-000 3USDA-ARS, Subtropical Horticulture Research Station, Miami, FL, USA 4Hawaii Agriculture Research Center, Kunia, HI, USA 5Universidade Federal Rural do Semi-Arido, Departamento de Ciências Vegetais, BR 110 - Km 47, Bairro Pres. Costa e Silva, Mossoró-RN, Brazil CEP 59.625-900 2Mars *corresponding author (juan.motamayor@effem.com) PHASE parameters, convergence diagnostics, and sample partitioning Usage Due to the large number of alleles (and allele combinations) present in our data sets, PHASE (v2.1.1) need to be compiled with minor modifications to constants.hpp (Line ~14: const int KMAX=200;). Each time PHASE was run, a new random variable seed was used (-S argument). A typical run looks like this: PHASE -MS -d1 -l3 -X10 –S123456789 infile.txt outfile.out 5000 1000 2500 Modeling Parameter Sensitivity A number of additional PHASE analyses were carried out in order to gauge the possible effects that our modeling parameter choice might have had on the linkage disequilibrium values we observed (i.e. Figure 2+ in the main text). This modeling parameter, represented in PHASE by the -M argument, indicates that recombination either should (-MR) or should not (-MS) be explicitly considered when determining the haplotype phase of the genotypes of the sampled individuals. The no-recombination model (-MS) was our model of choice for the analyses presented in the main text. The reasoning behind this choice was that we did not want to bias our downstream LD calculations by using a haplotype phasing model that assumes LD decays with the distance between markers (cite Stephens 2003). According to the PHASE (v2.1.1) documentation, in a very limited trial, model choice did not appear to have a substantial affect on error rates. All PHASE analyses mentioned below, using either model, had converged according to the other criteria discussed in the main text. Due to the extremely long runtimes exhibited by PHASE, most of our sensitivity analyses were run on a subset of the main data. In Figure 1A, we compare LD values from a subset of both the microsatellites (N=67) and sample individuals (N=200, 20 from each of the ten structural groups). The sample individuals were the top twenty individuals in each structural group assigned to that group by STRUCTURE. In Figure 1B, we compare LD values from the same subset of microsatellite loci and all of the samples from the mapping population. Figure 2: PHASE modeling comparison plots for a subset of the full population data (A) and the mapping population (B). LD values for PHASE (no-recombination model, -MS) on the x-axis and PHASE (recombination model, -MR) on the y-axis. Grey lines represent rough confidence intervals generated by resampling haplotypes from PHASE output (N=500). All within linkage group pairwise comparisons are shown above, for all linkage groups. Convergence Both convergence criteria mentioned in the main text (passing either Geweke’s or Heidel’s convergence test) were implemented in R using the coda package. We only looked at the convergence of the first column in the “_monitor” output trace file, which was recommended by the program author in its documentation. The R snippet below demonstrates how these tests were run: require(coda) ff = dir(path=finaldir,pattern=".*monitor",full.names=T) for(ii in seq(length(ff))) { mcts = mcmc(read.table(ff[ii],header=F)[,1]) plot(mcts) title(basename(ff[ii])) print(geweke.diag(mcts)) # Geweke’s test print(heidel.diag(mcts)) # Heidel’s test print(effectiveSize(mcts)) # effective sample size readline() } These convergence criteria appeared to be rather conservative in that “fuzzy caterpillar” likelihood traces were often judged to have failed one or the other. Examples of this are shown below (Figures 2 and 3). These runs were deemed to have converged, based on the two statistics and our own judgment. A main reason why extensive re-runs were not carried out was the extremely long and variable runtimes of PHASE, which ranged from a few hours to a few weeks. Figure 2. Example likelihood trace where Geweke’s test failed. Plots were generated the plot.mcmc function in coda with default arguments. Figure 3. Example likelihood trace where Heidel’s test failed. Plots were generated the plot.mcmc function in coda with default arguments Sample partitioning Below, pairwise LD values are compared for the full structured population data, which was phased under two different partitioning schemes. All linkage groups were analyzed separately. In the first partitioning scheme (“Unstructured”), the 778 samples were randomly divided into two groups prior to haplotype phasing. This was done due to the extremely long runtime of PHASE, which would not accommodate the full set of samples. All analyses started with the full set failed to finish within two weeks of starting, and appeared likely not to finish with a month or two (depending on the linkage group). In the second partitioning scheme (“Structured”), samples were divided into groups based on their structural group assignment (Motamayor, 2008). We compared the LD values resulting from the haplotypes inferred under both partitioning schemes (Figure 4). While a majority of the LD values did not differ substantially, a sizable minority did, mostly at the low to middle range of values. We suspect that these differences were due in large part to the genotype imputation that PHASE performs and to a lesser degree the haplotyping. In the unstructured partitioning scheme, the genotype imputation algorithm has in theory a larger pool of alleles to choose from for most loci. As a result, some of the unstructured LD values are underestimated relative the structured LD values, where a more limited pool of alleles are available (with a greater chance for inducing non-random associations between the alleles at a pair of loci). Interestingly, the qualitative patterns of LD decay versus physical distance between loci did not change substantially between the two datasets (Figures 5 and 6). So, while sample partitioning is clearly a serious consideration, one that warrants further exploration, it does not substantially affect the finding presented in the main manuscript. 1.0 0.8 0.6 0.4 0.2 0.0 LD values, Ten Structural groups 0.0 0.2 0.4 0.6 0.8 1.0 LD values, Two random groups Figure 4. Pairwise LD values calculated using haplotypes from PHASE when all 778 samples were randomly assigned to one of two groups (x-axis, so-called “unstructed” set) and when the samples were divided up into their respective structural groups (y-axis). Vertical and horizontal light gray bars indicate the variability of the LD estimates according to our resampling procedure. yvalue yvalue yvalue yvalue yvalue yvalue yvalue yvalue yvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue 10 xvalue xvalue xvalue 0 Purus yvalue 9 Nanay yvalue 1.00 0.75 0.50 0.25 0.00 8 Nacional yvalue yvalue 1.00 0.75 0.50 0.25 0.00 7 Maranon yvalue yvalue 1.00 0.75 0.50 0.25 0.00 6 Iquitos yvalue yvalue 1.00 0.75 0.50 0.25 0.00 4 Hybrid yvalue yvalue 1.00 0.75 0.50 0.25 0.00 3 Guiana yvalue yvalue 1.00 0.75 0.50 0.25 0.00 2 Curaray yvalue yvalue 1.00 0.75 0.50 0.25 0.00 1 Criollo yvalue yvalue yvalue 1.00 0.75 0.50 0.25 0.00 yvalue Amelonado Contamana 1.00 0.75 0.50 0.25 0.00 20 xvalue 40 0 20 xvalue 40 0 20 xvalue 40 0 20 40 0 20 40 0 20 40 0 20 40 0 20 40 0 xvalue xvalue xvalue xvalue xvalue Physical distance between loci (in megabases) 20 xvalue 40 0 20 xvalue 40 0 20 40 xvalue Figure 5. LD values from unstructed (two random groups) haplotype phasing, broken down by structural (columns) and linkage (rows) groups. yvalue yvalue yvalue yvalue yvalue yvalue yvalue yvalue yvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue xvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue xvalue yvalue 10 xvalue xvalue xvalue 0 Purus yvalue 9 Nanay yvalue 1.00 0.75 0.50 0.25 0.00 8 Nacional yvalue yvalue 1.00 0.75 0.50 0.25 0.00 7 Maranon yvalue yvalue 1.00 0.75 0.50 0.25 0.00 6 Iquitos yvalue yvalue 1.00 0.75 0.50 0.25 0.00 4 Hybrid yvalue yvalue 1.00 0.75 0.50 0.25 0.00 3 Guiana yvalue yvalue 1.00 0.75 0.50 0.25 0.00 2 Curaray yvalue yvalue 1.00 0.75 0.50 0.25 0.00 1 Criollo yvalue yvalue yvalue 1.00 0.75 0.50 0.25 0.00 yvalue Amelonado Contamana 1.00 0.75 0.50 0.25 0.00 20 xvalue 40 0 20 xvalue 40 0 20 xvalue 40 0 20 40 0 20 40 0 20 40 0 20 40 0 20 40 0 xvalue xvalue xvalue xvalue xvalue Physical distance between loci (in megabases) 20 xvalue 40 0 20 xvalue 40 0 20 40 xvalue Figure 6. LD values from structured (ten structural groups) haplotype phasing, broken down by structural (columns) and linkage (rows) groups.