S1 Text - Figshare

Text S1
Detailed description of BayesR
Priors
The Bayesian approach requires the assignment of prior distributions to all unknowns in the model. The population mean $\mu$ was assigned an uninformative uniform prior density. We used the same four-distribution mixture for the SNP effects as Erbe et al. [1]. The mixing proportions $\pi = (\pi_1, \ldots, \pi_K)$ are given a symmetric Dirichlet prior, i.e. $p(\pi_1, \ldots, \pi_4) \sim \mathrm{Dirichlet}(\delta, \ldots, \delta)$, with $\delta = 1$. Note that Erbe et al. [1] assigned a known value to the variance of all SNP effects ($\sigma_g^2$), whereas here $\sigma_g^2$ is informed from the data. The prior for $\sigma_g^2$ is chosen to be a scaled inverse-$\chi^2$ distribution, $\mathrm{Inv}\text{-}\chi^2(\nu_0, S_0^2)$, with known hyperparameters $\nu_0$ and $S_0^2$. A scaled inverse-$\chi^2$ distribution was also assumed for $\sigma_e^2$. We used flat priors for the hyperparameters of both variances ($\nu_0 = -2$ and $S_0^2 = 0$).
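As a concrete illustration of these priors, the sketch below draws the mixing proportions from the symmetric Dirichlet and a variance from a scaled inverse-$\chi^2$ distribution. This is not the authors' code; the proper hyperparameter values used for the variance draw are illustrative assumptions, since the flat prior of the text ($\nu_0 = -2$, $S_0^2 = 0$) is improper and cannot be sampled from directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric Dirichlet prior on the four mixing proportions, delta = 1
delta = 1.0
pi = rng.dirichlet([delta] * 4)

def scaled_inv_chi2(nu0, S0sq, rng):
    # A draw from a scaled inverse-chi^2(nu0, S0^2): nu0 * S0^2 / chi2(nu0).
    # Requires nu0 > 0; the flat prior in the text is the improper limit.
    return nu0 * S0sq / rng.chisquare(nu0)

# Illustrative (assumed) proper hyperparameters, for demonstration only
sigma_g2 = scaled_inv_chi2(nu0=4.0, S0sq=0.1, rng=rng)

print(pi.sum())      # mixing proportions sum to 1
print(sigma_g2 > 0)  # the variance draw is positive
```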
Gibbs sampling
On the basis of the prior specification, Gibbs sampling was used to generate samples from the posterior distributions of the parameters (using $\mid \cdot$ to denote conditioning on the data and all other parameters). The Gibbs sampler proceeds as follows:
1. Sample the overall mean from the full conditional posterior distribution
$$\mu \mid \cdot \sim N\!\left( n^{-1} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right),\; \frac{\sigma_e^2}{n} \right)$$
2. Calculate the probability that SNP j is in distribution k. The likelihood of SNP j being in component k is:
$$\mathrm{LogL}(j,k) = \log(\pi_k) - \frac{1}{2\sigma_e^2} \left( \sum_{i=1}^{n} \tilde{y}_i^2 - m_{j,k} \sum_{i=1}^{n} x_{ij} \tilde{y}_i \right) - \frac{1}{2} \log V,$$
where $\tilde{y}_i$ is the phenotype of individual i corrected for the overall mean and the effects of all markers in the model, except marker j:
$$\tilde{y}_i = y_i - \mu - \sum_{l \neq j} x_{il} \beta_l$$
and
$$m_{jk} = \frac{\sum_{i=1}^{n} x_{ij} \tilde{y}_i}{\sum_{i=1}^{n} x_{ij}^2 + \sigma_e^2 / \sigma_k^2}.$$
$\log V$ is the likelihood of the reduced model including only the effect of SNP j and a residual effect:
$$\log V = n \log(\sigma_e^2) + \log\!\left( \frac{\sigma_k^2}{\sigma_e^2} \sum_{i=1}^{n} x_{ij}^2 + 1 \right).$$
Then the probability of SNP j being in distribution k is
$$\Pr(x_j \in k) = 1 \Big/ \sum_{l=1}^{K} \exp\!\left[ \mathrm{LogL}(j,l) - \mathrm{LogL}(j,k) \right].$$
Based on a value sampled from a uniform distribution, assign component k to SNP j.
3. Sample the regression coefficient for SNP j from mixture component k from the full conditional posterior distribution
$$\beta_{jk} \mid \cdot \sim N(m_{jk}, S_k^2),$$
where
$$S_k^2 = \frac{\sigma_e^2}{\sum_{i=1}^{n} x_{ij}^2 + \sigma_e^2 / \sigma_k^2}$$
and $m_{jk}$ is as defined above.
4. Repeat steps 2 and 3 for SNPs $j+1, \ldots, p$.
5. Sample $\sigma_g^2$ from the full conditional posterior distribution:
$$\sigma_g^2 \mid \cdot \sim \mathrm{Inv}\text{-}\chi^2\!\left( \nu_0 + m_g,\; \frac{\sum_{j=1}^{m_g} \beta_j^2 + \nu_0 S_0^2}{\nu_0 + m_g} \right),$$
where $m_g$ is the number of SNPs included in the current model.
6. Sample $\sigma_e^2$ from the full conditional posterior distribution:
$$\sigma_e^2 \mid \cdot \sim \mathrm{Inv}\text{-}\chi^2\!\left( \nu_0 + n,\; \frac{\sum_{i=1}^{n} \left( y_i - \mu - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \nu_0 S_0^2}{\nu_0 + n} \right).$$
7. Update the mixing proportions by sampling from the posterior:
$$\pi \mid \cdot \sim \mathrm{Dirichlet}(m_1 + \delta,\; m_2 + \delta,\; m_3 + \delta,\; m_4 + \delta),$$
where $m_1, \ldots, m_4$ are the numbers of markers in each distribution.
8. Compute the new $\sigma_k^2$ of the mixture components:
$$\sigma_k^2 \sim \begin{cases} \pi_1 \times 0 \times \sigma_g^2 \\ \pi_2 \times 10^{-4} \times \sigma_g^2 \\ \pi_3 \times 10^{-3} \times \sigma_g^2 \\ \pi_4 \times 10^{-2} \times \sigma_g^2 \end{cases}$$
9. Randomly permute the order of the SNPs to provide global moves and to increase mixing.
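One full cycle of the sampler described above can be sketched in Python/NumPy. This is an illustrative reimplementation, not the authors' software: the toy data, dimensions, starting values, and the guard on the inverse-$\chi^2$ degrees of freedom are all assumptions, and the class variances are taken as $0$, $10^{-4}\sigma_g^2$, $10^{-3}\sigma_g^2$, $10^{-2}\sigma_g^2$ following the Erbe et al. mixture, with the mixing proportions entering through the component probabilities.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data; all dimensions and values here are illustrative, not from the text.
n, p, K = 200, 50, 4
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
beta_true = np.zeros(p)
beta_true[:5] = rng.normal(0.0, 0.5, 5)
y = 2.0 + X @ beta_true + rng.normal(0.0, 1.0, n)

# Current state of the chain (assumed starting values)
mu, beta = 0.0, np.zeros(p)
pi = np.full(K, 0.25)
sigma_g2, sigma_e2 = 0.1, 1.0
nu0, S0sq, delta = -2.0, 0.0, 1.0        # flat hyperpriors from the text
eps = y - mu - X @ beta                  # stored residuals (efficiency trick)
xx = (X ** 2).sum(axis=0)                # per-SNP sums of squares

def gibbs_cycle(mu, beta, pi, sigma_g2, sigma_e2, eps):
    # Step 1: sample the overall mean from its full conditional.
    eps += mu
    mu = rng.normal(eps.mean(), np.sqrt(sigma_e2 / n))
    eps -= mu
    # Class variances of the four-component mixture.
    sigma_k2 = np.array([0.0, 1e-4, 1e-3, 1e-2]) * sigma_g2
    comp = np.zeros(p, dtype=int)
    # Steps 2-4 (and step 9): visit the SNPs in a random order.
    for j in rng.permutation(p):
        eps += X[:, j] * beta[j]         # y_tilde = eps + x_j * beta_j
        rhs = X[:, j] @ eps
        logL = np.log(pi)                # component log-likelihoods (up to a constant)
        for k in range(1, K):
            denom = xx[j] + sigma_e2 / sigma_k2[k]
            v = sigma_k2[k] / sigma_e2 * xx[j] + 1.0
            logL[k] += 0.5 * rhs ** 2 / (sigma_e2 * denom) - 0.5 * np.log(v)
        prob = np.exp(logL - logL.max())
        prob /= prob.sum()
        k = rng.choice(K, p=prob)        # step 2: assign a component
        comp[j] = k
        if k == 0:                       # null component: zero effect
            beta[j] = 0.0
        else:                            # step 3: sample the effect
            denom = xx[j] + sigma_e2 / sigma_k2[k]
            beta[j] = rng.normal(rhs / denom, np.sqrt(sigma_e2 / denom))
        eps -= X[:, j] * beta[j]         # new residual
    # Step 5: sample sigma_g2 (skipped when the df would be non-positive).
    m_g = int((comp > 0).sum())
    if nu0 + m_g > 0:
        ssq = (beta[comp > 0] ** 2).sum()
        sigma_g2 = (ssq + nu0 * S0sq) / rng.chisquare(nu0 + m_g)
    # Step 6: sample sigma_e2.
    sigma_e2 = ((eps ** 2).sum() + nu0 * S0sq) / rng.chisquare(nu0 + n)
    # Step 7: update the mixing proportions from the component counts.
    counts = np.bincount(comp, minlength=K)
    pi = rng.dirichlet(counts + delta)
    return mu, beta, pi, sigma_g2, sigma_e2, eps

for _ in range(10):
    mu, beta, pi, sigma_g2, sigma_e2, eps = gibbs_cycle(
        mu, beta, pi, sigma_g2, sigma_e2, eps)
```

Note that the null component is handled by simply setting the effect to zero, and that sampling the component from the normalised probabilities is equivalent to the uniform-draw assignment described in step 2.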
Computational efficiency gain
Updating marker effects requires the computation of $\tilde{y}_i$ at step 2 of the algorithm. For example, for sampling the jth marker effect, it is considerably more efficient to compute $\tilde{y}_i = y_i - \mu - \sum_{l \neq j} x_{il} \beta_l$ in the form $\tilde{y}_i = \varepsilon_i + x_{ij} \beta_j$. If the $\varepsilon_i$ are stored, then $x_{ij} \beta_j$ can be added to the residual for use in sampling the new marker effect $\beta_j$, and once this is done the new residual is available by subtracting the updated $x_{ij} \beta_j$ from $\tilde{y}_i$ (i.e., $\varepsilon_i = \tilde{y}_i - x_{ij} \beta_j$).
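The equivalence of the two forms can be checked directly. In this sketch the data, marker index, and dimensions are arbitrary illustrative choices; the point is that the stored-residual form avoids the sum over all other markers at every update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20                             # illustrative sizes
X = rng.binomial(2, 0.4, size=(n, p)).astype(float)
beta = rng.normal(0.0, 0.1, p)
mu = 1.5
y = mu + X @ beta + rng.normal(0.0, 1.0, n)

eps = y - mu - X @ beta                    # stored residuals

j = 3                                      # marker being updated (arbitrary)
# Fast form: add x_ij * beta_j back to the stored residual.
y_tilde_fast = eps + X[:, j] * beta[j]
# Naive form: re-sum the contributions of all p-1 other markers.
mask = np.arange(p) != j
y_tilde_naive = y - mu - X[:, mask] @ beta[mask]

print(np.allclose(y_tilde_fast, y_tilde_naive))   # True
```

After sampling the new $\beta_j$, the residual is refreshed with `eps = y_tilde_fast - X[:, j] * beta_j_new`, so the cost per marker update stays O(n) rather than O(np).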
MCMC implementation
For all analyses the Markov chain was run for 50,000 cycles, with the first 20,000 samples discarded as burn-in. Posterior estimates of the parameters are based on 3,000 samples, obtained by keeping every 10th sample after burn-in.
Posterior analysis from the MCMC output
After samples have been generated from the posterior distributions, model parameters are estimated by their respective posterior sample means. Estimates of $\mu$, $\pi$, $\beta$, $\sigma_g^2$ and $\sigma_e^2$ are averages over different regression models drawn conditional on different model vectors, a procedure known as "Bayesian model averaging". The posterior inclusion probability (PIP), defined as the proportion of iterations that included a specific marker in the model, was used as a measure of the ability of the model to identify associated SNPs.
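Computing the PIP from MCMC output is a simple column average over the stored inclusion indicators. The indicator matrix and the per-marker inclusion rates below are hypothetical stand-ins for real sampler output:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stored draws: rows are kept MCMC samples, columns are markers;
# an entry is True when that marker was in the model at that iteration.
n_samples, p = 3000, 5
true_rates = np.array([0.9, 0.05, 0.5, 0.01, 0.3])   # assumed inclusion rates
in_model = rng.random((n_samples, p)) < true_rates

# PIP: the proportion of iterations in which each marker was in the model.
pip = in_model.mean(axis=0)
```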
Predictions
Based on the SNP effect estimates obtained from the training population, phenotypes of the validation sample were predicted as:
$$\hat{y}_i = \mu + \sum_{j : \hat{\beta}_j \neq 0} w_{ij} \hat{\beta}_j,$$
where $\hat{\beta}_j$ is the estimated effect of SNP j, $w_{ij} = \left( x_{ij} - 2 p_j \right) \Big/ \sqrt{2 p_j (1 - p_j)}$, and $x_{ij}$ is the number of copies of the reference allele (0, 1, 2) at SNP j for individual i, with $p_j$ being the frequency of the reference allele in the training population.
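The prediction step above amounts to standardising the validation genotypes with training-set allele frequencies and taking a weighted sum over the markers with non-zero estimated effects. In this sketch the effect estimates, allele frequencies, and dimensions are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

n_val, p = 50, 10                          # illustrative validation set
X_val = rng.binomial(2, 0.3, size=(n_val, p)).astype(float)

mu_hat = 2.0                               # assumed posterior mean of mu
beta_hat = np.zeros(p)
beta_hat[[1, 4]] = [0.3, -0.2]             # assumed non-zero effect estimates
freq = np.full(p, 0.3)                     # training-set reference-allele frequencies

# Standardised genotypes: w_ij = (x_ij - 2 p_j) / sqrt(2 p_j (1 - p_j)).
W = (X_val - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))

nonzero = beta_hat != 0.0                  # excluded markers contribute nothing
y_hat = mu_hat + W[:, nonzero] @ beta_hat[nonzero]
```

Restricting the sum to the non-zero effects gives the same result as the full matrix product, since zeroed markers add nothing; it only saves computation.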
1. Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, et al. (2012)
Improving accuracy of genomic predictions within and between dairy
cattle breeds with imputed high-density single nucleotide polymorphism
panels. J Dairy Sci 95: 4114-4129.