Convergence Diagnostics of Bayesian MCMC Chains

Convergence Diagnostics for PHASE MCMC Chains
We tested the convergence of the MCMC chains with the CODA library
(Best et al., 1995), as implemented for the R v.2.11.1 environment (Plummer
et al., 2010; R Development Core Team, 2010). The CODA convergence
diagnostics considered in our analyses were: heterogeneity among chains
(Gelman and Rubin, 1992), the convergence test for individual chains
(Geweke, 1992), and the stationarity and half-width interval tests
(Heidelberger and Welch, 1983). We also inspected the density
distributions and the auto- and cross-correlation plots. Point estimates of
the parameters in the different applications were calculated as the medians
of the second half of all stationary chains.
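The point-estimate rule described above can be illustrated with a minimal Python sketch. It assumes that the second halves of all stationary chains are pooled before taking the median, which is one possible reading of the text. The function names, the stationarity flags and the toy values are hypothetical and are not taken from the analyses.

```python
from statistics import median


def second_half(chain):
    """Return the later half of a chain (the earlier half is discarded)."""
    return chain[len(chain) // 2:]


def pooled_point_estimate(chains, stationary_flags):
    """Median of the pooled second halves of the chains flagged as stationary.

    Assumption: pooling before taking the median; the source text does not
    say whether medians were pooled or computed per chain.
    """
    pooled = []
    for chain, is_stationary in zip(chains, stationary_flags):
        if is_stationary:
            pooled.extend(second_half(chain))
    if not pooled:
        raise ValueError("no stationary chain available")
    return median(pooled)


# Toy chains (hypothetical values); the third one is treated as non-stationary.
chains = [
    [5.0, 4.0, 1.0, 1.2, 1.1, 0.9],
    [9.0, 7.0, 1.3, 1.0, 0.8, 1.1],
    [3.0, 3.0, 3.0, 8.0, 8.5, 9.0],
]
estimate = pooled_point_estimate(chains, [True, True, False])
```

Using the median rather than the mean keeps the estimate robust to occasional extreme draws in the retained samples.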
Haplotype Inferences with PHASE
Haplotypes were obtained with the coalescent-based approach of
Stephens & Donnelly (2003). The use of different priors (c, µ, f) for the
haplotype reconstructions had little effect on the final set of inferred
haplotypes. The differences were restricted to a few genotypes that produced
alternative pairs of inferred haplotypes under the different models.
The four linkage groups (LGs) investigated differed in homozygosity and
number of markers, which led us to run chains of different lengths and
thinning intervals. Burn-in iterations, thinning intervals and final chain
lengths of the Bayesian chains that produced the reported results were as
follows: LG2 (-X10, 1000 25 250); LG9 (-X10, 1000 25 500); LG10 (-X10,
1000 15 250); LG12 (-X10, 500 10 100). We measured the goodness of fit of
the estimated haplotypes to an approximate coalescent with recombination
using the posterior pseudo-likelihood of the data under the model (Stephens
and Donnelly, 2003), which we examined with the convergence tools
available within CODA (not shown).
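For readers less familiar with the terms, the following Python sketch shows what discarding a burn-in and applying a thinning interval mean for a stored sequence of draws. PHASE handles both steps internally, so this is only an illustration; the function name and the toy numbers are hypothetical.

```python
def burn_and_thin(draws, burn_in, thin):
    """Drop the first `burn_in` draws, then keep every `thin`-th draw."""
    if burn_in < 0 or thin < 1:
        raise ValueError("burn_in must be >= 0 and thin must be >= 1")
    return draws[burn_in::thin]


# Toy example: 20 raw draws, burn-in of 5, thinning interval of 5.
raw = list(range(20))
kept = burn_and_thin(raw, burn_in=5, thin=5)
```

A longer thinning interval lowers the autocorrelation among retained draws at the cost of more iterations per retained sample.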
Recombination Estimates with PHASE
Recombination was estimated from population data (inferred haplotypes)
using the general recombination model of Li and Stephens (2003) and
Crawford et al. (2004), which allows hotspots and coldspots of
recombination to occur independently in different segments. We used four
different sets of priors by combining two recombination probabilities per
base pair (c: the default value and the value obtained from the ratio of the
oak genome DNA content to its linkage length) with two priors for the
population genetics recombination parameter (µ) and for the difference
allowed (f) between the estimated population genetics recombination
parameter () and its prior (µ). The default values for the hotspot priors
were used in all cases. See the PHASE documentation for a summary of the
recombination models and the priors.
Our first attempts to calculate the population genetics recombination
parameters used the thinning intervals, final chain lengths and burn-in
iterations shown in Table S5-1. The background recombination point
estimates suggest that informative priors might be essential to reach
convergence, at least with this type of data. Only the LG9 and LG10 point
estimates obtained with the “oak priors” attained convergence. The
Gelman-Rubin shrink factor and Geweke's Z-scores indicated convergence
failure in LG2 and LG12 (Figure S5-1 and Table S5-2). The
Heidelberger-Welch stationarity test failed only for the recombination
estimate in the second segment of LG9 (Table S5-3). The half-width tests
failed for LG2 and LG12 (Table S5-3).
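As a reminder of what the shrink factor measures, here is a minimal Python sketch of the basic Gelman and Rubin (1992) potential scale reduction factor for a single parameter. It is not the CODA implementation, which applies further corrections that are omitted here. The function name and the toy chains are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def shrink_factor(chains):
    """Basic potential scale reduction factor for one parameter.

    `chains` is a list of equal-length lists of draws, one per chain.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    # W: average of the within-chain sample variances.
    within = mean([variance(c) for c in chains])
    if within == 0:
        raise ValueError("chains have zero within-chain variance")
    # B: n times the sample variance of the chain means.
    between = n * variance([mean(c) for c in chains])
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


# Toy chains (hypothetical values): two that overlap and two that do not.
agreeing = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 1.0]]
separated = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]
r_agreeing = shrink_factor(agreeing)
r_separated = shrink_factor(separated)
```

Values close to 1 indicate that the between-chain and within-chain variances agree; values well above 1, as for the separated chains here, indicate that the chains have not yet mixed over the same distribution.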
Afterwards, we ran a longer final set of iterations to confirm
convergence in LG9 and LG10 and to attain it in LG2 and LG12. We used a
thinning interval of 500 and a final chain length of 10^9 after 5x10^8
burn-in iterations for all LGs. The results obtained are shown in the main
text (Table 2). Convergence was confirmed for LG9 and LG10 (Figure S5-2,
Tables S5-4, S5-5) and the point estimates of the recombination parameters
varied only slightly. LG12 also passed the CODA convergence tests,
although longer simulations would be needed to obtain accurate point
estimates.
Our attempts to obtain recombination estimates with the modified
recombination probability per base pair (c) failed for three out of the four
LGs, even though we used much longer simulations (Table S5-6). Only the
LG9 simulations obtained with the oak priors seemed to reach convergence,
with a point estimate rather close to the one obtained with the default
recombination probability value. We did not pursue convergence further in
the other segments because a single simulation would take well over one
month on our computers.
REFERENCES S2
Best MG, Cowles MK, Vines SK (1995) CODA Manual version 0.30. MRC
Biostatistics Unit, Cambridge, UK.
Crawford D, Bhangale T, Li N, Hellenthal G, Rieder M, Nickerson D,
Stephens M (2004) Evidence for substantial fine-scale variation in
recombination rates across the human genome. Nature Genetics, 36,
700-706.
Gelman A, Rubin DB (1992) Inference from iterative simulations using
multiple sequences. Statistical Science, 7, 457-472.
Geweke J (1992) Evaluating the accuracy of sampling-based approaches to
calculating posterior moments. In: Bayesian Statistics 4 (Eds.
Bernardo JM, Berger JO, Dawid AP, Smith AFM). Clarendon Press,
Oxford, UK.
Heidelberger P, Welch P (1983) Simulation run length control in the
presence of an initial transient. Operations Research, 31, 1109-1144.
Li N, Stephens M (2003) Modeling linkage disequilibrium and identifying
recombination hotspots using single-nucleotide polymorphism data.
Genetics, 165, 2213-2233.
Plummer M, Best N, Cowles K, Vines K (2006) CODA: Output analysis and
diagnostics for MCMC. R package version 0.13-5. URL
http://CRAN.R-project.org/package=coda
R Development Core Team (2010) R: A language and environment for
statistical computing. R Foundation for Statistical Computing, Vienna,
Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Stephens M, Donnelly P (2003) A comparison of Bayesian methods for
haplotype reconstruction. American Journal of Human Genetics, 73,
1162-1169.
Table S5-1 Point estimates of the recombination parameters, together with burn-in periods and final
chain lengths, for the first recombination estimates obtained with the default recombination
probability per base pair (c=1E-08).
Long Chain  Burn-in   Linkage Group  Background Recombination  r1    r2
Default priors (µ=4E-04, f=1E06)
5E+07       2.5E+05   LG2            2.4E-08                   0.82  0.54
2.5E+07     1.3E+06   LG9            3.4E-08                   0.81  0.85
2.5E+07     1.3E+06   LG10           4.1E-08                   0.62  NA
2.5E+07     1.3E+06   LG12           2.2E-08                   0.88  0.43
Oak priors (µ=0.04, f=1E04)
1E+08       5E+07     LG2            4.8E-08                   0.76  0.43
5E+07       2.5E+06   LG9            0.0331                    1.00  0.99
1E+08       5E+07     LG10           0.0028                    0.75  NA
5E+08       2.5E+06   LG12           4.5E-08                   0.85  0.38
Figure S5-1 Gelman-Rubin-Brooks plots of Gelman and Rubin (1992) shrink factor for the Bayesian chains used in the recombination estimates
shown in Table S5-1 (only for the analyses with the “oak priors”). The pseudo-likelihood tests the goodness of fit of the inferred haplotypes to an
approximate coalescent with recombination (Stephens and Donnelly, 2003).
Table S5-2 Geweke Z-scores for the Bayesian chains used to estimate the recombination parameters
shown in Table S5-1 (only for the analyses with the “oak priors”). Test failures are indicated in
bold characters.
Haplotype  Parameter   Chain 1  Chain 2  Chain 3  Chain 4  Chain 5
LG2        Background  -1.000   1.000    0.130    -0.835   0.920
           r1          0.998    2.230    0.537    -2.995   1.160
           r2          0.323    -0.079   -0.258   -0.125   -0.080
LG9        Background  1.390    -1.986   -1.172   1.426    0.899
           r1          0.600    1.440    0.964    -0.537   0.450
           r2          -0.610   0.070    -1.194   -1.648   -0.451
LG10       Background  0.106    -1.194   -0.269   0.423    -1.178
           r1          1.492    0.570    0.055    0.913    0.540
LG12       Background  -0.242   1.163    1.270    1.564    0.859
           r1          0.218    2.581    -0.330   0.378    -0.623
           r2          0.594    1.174    1.335    1.587    0.722
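The Z-scores in Table S5-2 compare the mean of an early window of each chain with the mean of a late window. The Python sketch below is a simplified version of that idea, assuming the usual windows (first 10% and last 50% of the chain) and plain sample variances. CODA instead estimates the variance of each window mean from the spectral density at frequency zero, so the numbers produced here would not match its output for autocorrelated chains. The function name and the toy chains are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke Z-score: early-window mean minus late-window mean,
    divided by a standard error built from plain sample variances
    (an assumption that ignores autocorrelation within the chain)."""
    n = len(chain)
    early = chain[: int(first * n)]
    late = chain[n - int(last * n):]
    if len(early) < 2 or len(late) < 2:
        raise ValueError("chain too short for the requested windows")
    std_err = sqrt(variance(early) / len(early) + variance(late) / len(late))
    if std_err == 0:
        raise ValueError("both windows are constant")
    return (mean(early) - mean(late)) / std_err


# Toy chains (hypothetical): one with a stable level, one with a steady drift.
stationary = [float(i % 7) for i in range(700)]
drifting = [float(i % 7) + i / 100 for i in range(700)]
z_stationary = geweke_z(stationary)
z_drifting = geweke_z(drifting)
```

For a chain that has reached its stationary distribution the score behaves approximately like a standard normal deviate, so large absolute values, as for the drifting chain, point to a difference between the early and late parts of the run.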
Table S5-3 Heidelberger and Welch stationarity and half-width tests for the Bayesian chains used in the recombination estimates shown in Table S5-1
(only for the analyses with the “oak priors”). Failures to pass the tests are indicated by bold characters.
Haplotype Parameter
LG2
LG9
LG10
LG12
Background
r1
r2
Stationarity
Test
P-value
C1
C2
C3
C4
C5
Half-width
Test
C1
C2
Mean
C3
Half-width
C4
1.04E-07 1.08E-07
C5
C1
C2
C3
C4
C5
passed
passed
0.362 0.758 0.184 0.786 0.996
passed/failed
0.000
0.028
0.002
0.001 0.055 7.96E-09 2.07E-09 0.003
0.715 0.288 0.732 0.163 0.339
passed
1.348
1.341
1.330
1.280
1.273
0.016 0.016
0.025
0.014
0.013
passed
0.684 0.300 0.460 0.684 0.966
passed
0.682
0.671
0.666
0.688
0.682
0.009 0.008
0.013
0.006
0.007
passed
Background
passed
r1
passed/failed
r2
0.261 0.084 0.465 0.777 0.870
passed
12.360 12.760
12.860
13.030
13.190
0.620 0.878
0.594
0.436
0.672
0.563 0.084 0.549 0.072 0.643
passed
1.910
1.910
1.930
1.930
1.920
0.027 0.023
0.031
0.036
0.031
0.558 0.718 0.065 0.215 0.048
passed/NA
1.920
1.940
1.910
1.940
NA
0.026 0.022
0.028
0.042
NA
Background
r1
passed
0.676 0.614 0.604 0.672 0.686
passed
8.200
9.100
10.210
7.940
8.660
0.728 0.976
0.627
0.416
0.699
passed
0.342 0.571 0.618 0.965 0.405
passed
1.480
1.640
1.580
1.430
1.570
0.047 0.039
0.033
0.044
0.100
Background
r1
r2
passed
0.611 0.754 0.661 0.306 0.871
passed/failed
1.576
1.190
1.426
1.330
2.229
0.261 0.243
0.401
0.242
0.783
passed
0.617 0.714 0.304 0.954 0.394
passed
1.552
1.530
1.539
1.550
1.554
0.025 0.026
0.027
0.019
0.032
passed
0.342 0.107 0.609 0.512 0.984
passed
0.683
0.687
0.683
0.680
0.787
0.030 0.031
0.040
0.027
0.070
Figure S5-2 Gelman-Rubin-Brooks plots of Gelman and Rubin (1992) shrink factor for the Bayesian chains used in the reported recombination
estimates (Table 2, main text; only for the analyses with the “oak priors”).
Table S5-4 Geweke Z-scores for the Bayesian chains used to estimate the recombination parameters
shown in Table 2 of the main text (only for the analyses with the “oak priors”). Test failures are
indicated in bold characters.
Haplotype  Parameter   Chain 1  Chain 2  Chain 3  Chain 4
LG2        Background  0.074    0.371    0.411    -1.206
           r1          0.785    0.499    -0.020   -0.669
           r2          -0.508   0.857    0.691    -2.297
LG9        Background  0.229    1.267    0.492    1.267
           r1          -0.783   -1.791   -0.786   -1.791
           r2          0.482    0.256    -1.312   0.256
LG10       Background  -0.652   -0.285   0.726    -1.169
           r1          -0.388   -0.727   -1.218   -0.847
LG12       Background  1.727    -2.315   1.399    1.152
           r1          -0.246   -1.862   -0.453   0.987
           r2          0.342    -1.319   0.707    1.160
Table S5-5 Heidelberger and Welch stationarity and half-width tests for the Bayesian chains used in the recombination estimates shown in Table 2 of
the main text (only for the analyses with the “oak priors”). Failures to pass the tests are indicated by bold characters.
Haplotype  Parameter   Stationarity  P-value                         Half-width  Mean                            Half-width
                       test          C1      C2      C3      C4      test        C1      C2      C3      C4      C1      C2      C3      C4
LG2        Background  passed        0.622   0.307   0.757   0.555   failed      0.025   0.032   0.225   0.010   0.025   0.029   0.160   0.013
           r1          passed        0.345   0.778   0.620   0.695   passed      1.270   1.346   1.326   1.295   0.013   0.008   0.009   0.009
           r2          passed        0.841   0.387   0.507   0.414   passed      0.695   0.703   0.679   0.668   0.012   0.005   0.011   0.005
LG9        Background  passed        0.543   0.392   0.134   0.366   passed      12.720  12.820  13.000  13.040  0.186   0.437   0.237   0.343
           r1          passed        0.730   0.433   0.145   0.406   passed      1.910   1.930   1.910   1.920   0.012   0.010   0.012   0.012
           r2          passed        0.064   0.902   0.985   0.400   passed      1.910   1.930   1.920   1.930   0.011   0.011   0.015   0.011
LG10       Background  passed        0.776   0.146   0.694   0.719   passed      9.560   8.490   7.970   8.250   0.725   0.351   0.218   0.269
           r1          passed        0.934   0.866   0.666   0.529   passed      1.640   1.490   1.490   1.480   0.031   0.025   0.133   0.033
LG12       Background  passed        0.183   0.340   0.227   0.204   passed      8.550   8.250   8.610   8.100   0.371   0.232   0.294   0.244
           r1          passed        0.579   0.458   0.225   0.121   passed      1.800   1.790   1.790   1.730   0.013   0.012   0.016   0.025
           r2          passed        0.663   0.491   0.984   0.154   passed      1.520   1.460   1.490   1.400   0.015   0.017   0.017   0.020
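The half-width results in Table S5-5 can be related to the tabulated means and half-widths with a short Python sketch. It assumes the rule used by CODA's heidel.diag, in which a chain passes when the ratio of the half-width to the mean is below a tolerance (0.1 by default); whether the default tolerance was used in these analyses is an assumption. The function name is hypothetical, and the values are the LG2 Background and LG2 r1 entries of Table S5-5, paired by chain in the order in which they appear.

```python
def half_width_passes(mean_value, half_width, eps=0.1):
    """Return True when |half-width / mean| is below the tolerance `eps`.

    Assumption: eps=0.1, the default tolerance of CODA's heidel.diag.
    """
    if mean_value == 0:
        return False  # the ratio is undefined, so treat it as a failure
    return abs(half_width / mean_value) < eps


# (mean, half-width) pairs for chains C1-C4, taken from Table S5-5.
lg2_background = [(0.025, 0.025), (0.032, 0.029), (0.225, 0.160), (0.010, 0.013)]
lg2_r1 = [(1.270, 0.013), (1.346, 0.008), (1.326, 0.009), (1.295, 0.009)]

background_results = [half_width_passes(m, hw) for m, hw in lg2_background]
r1_results = [half_width_passes(m, hw) for m, hw in lg2_r1]
```

Under this rule the LG2 Background half-widths are of the same order as the means, which is consistent with the reported failure, whereas the LG2 r1 half-widths are about 1% of the means or less.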
Table S5-6 Recombination estimates, together with burn-in periods and final chain lengths,
obtained using an estimate of the oak recombination probability per base pair based on published
data (ratio of the DNA content to the linkage length). Only the longest simulations for each segment
are shown.
Long Chain  Burn-in   LG (1)  Background Recombination  r1    r2
Default priors (µ=4E-04, f=1E06)
1.0E+09     5.0E+08   LG9     2.1E-07                   0.92  0.94
5.0E+08     5.0E+09   LG10    2.7E-07                   0.83  NA
1.0E+09     5.0E+08   LG12    2.5E-08                   0.89  0.50
Oak priors (µ=0.04, f=1E04)
2E+09       1E+09     LG2     1.6E-06                   0.85  0.63
1E+09       2.5E+09   LG9     0.0119                    0.91  0.95
1E+09       2.5E+10   LG10    5.3E-06                   0.62  NA
1E+09       5.0E+08   LG12    1.6E-06                   0.93  0.60
(1) Only simulations for LG9 with the "oak priors" seemed to reach convergence.