Appendices A-D - Proceedings of the Royal Society B

Appendix A: Computation of fraternity coefficients
The fraternity coefficients can be approximated by a straightforward simulation approach, which
keeps track of whether the genes of two individuals are identical by descent. We assume a pedigree
consisting of n individuals, and denote by mi and fi the mother and father of individual i,
respectively. If either of the parents is unknown (e.g. i belonging to the first generation in the
pedigree or being an immigrant), we set mi=-1 or fi=-1. Let the pedigree be ordered so that the
parents always come before the offspring, i.e. mi<i and fi<i. In the simulation, two genes are
assigned for each individual, one originating from the mother and the other one from the father. The
simulation is initiated by assigning unique values for the genes that would originate from the
unknown parents, e.g. labelling them as 1,2,... The remaining genes are then assigned to the
individuals i=1,...,n by randomly choosing either of the genes assigned to the parents. If the values
are the same for two genes, these genes are identical by descent. The procedure is replicated N
times, and the fraternity coefficient Δij is estimated by Δ̂ij, the fraction of replicates for which the
two genes are identical by descent for individuals i and j. We note that Δij can be nonzero only if the
approximation provided by Eq. 1 is nonzero. Hence only a subset of the pairs needs to be examined,
which substantially lowers the computational cost if Δ is a sparse matrix.
As the sum of Bernoulli trials is binomially distributed, the standard error of the estimate is
$$\mathrm{SE} = \sqrt{\frac{\Delta_{ij}\,(1 - \Delta_{ij})}{N}}.$$
In the example of Fig. 1, N = 10^6, hence the maximum standard error (obtained if Δij = 1/2) is
SE = 0.0005.
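The gene-dropping procedure described above can be implemented in a few lines. Below is a minimal Python sketch; it assumes that the pedigree is supplied as arrays of parent indices (with -1 for unknown parents and parents preceding offspring) and that the identity-by-descent condition is checked as an unordered match of the two gene labels. All names are illustrative.

```python
import numpy as np

def fraternity_coefficients(mother, father, pairs, N=100_000, seed=0):
    """Gene-dropping estimate of the fraternity coefficients Delta_ij.

    mother, father : parent indices, -1 if unknown; parents must precede offspring.
    pairs          : list of (i, j) pairs to examine (e.g. only those for which
                     the approximation of Eq. 1 is nonzero).
    N              : number of replicates (e.g. 10**6 as in Fig. 1 for SE = 0.0005).
    """
    rng = np.random.default_rng(seed)
    n = len(mother)
    counts = np.zeros(len(pairs))
    for _ in range(N):
        genes = np.empty((n, 2), dtype=np.int64)     # maternal and paternal gene of each individual
        label = 0
        for i in range(n):
            for slot, parent in enumerate((mother[i], father[i])):
                if parent < 0:                        # unknown parent: assign a unique founder gene
                    genes[i, slot] = label
                    label += 1
                else:                                 # inherit one of the parent's two genes at random
                    genes[i, slot] = genes[parent, rng.integers(2)]
        for p, (i, j) in enumerate(pairs):
            a1, a2 = genes[i]
            b1, b2 = genes[j]
            # both genes of i identical by descent with both genes of j (either pairing)
            if (a1 == b1 and a2 == b2) or (a1 == b2 and a2 == b1):
                counts[p] += 1
    return counts / N
```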
Appendix B. Implementation of Bayesian estimation
Parts of the presentation here closely follow those presented earlier in the context of animal
breeding; see e.g. Sorensen & Gianola (2002) and VanTassell & VanVleck (1996).
B.1. Notation
Let X be a k × k matrix and Y an n × n matrix. Then the Kronecker (outer) product X ⊗ Y is the nk × nk matrix,
which can be written in block form as
$$X \otimes Y = \begin{pmatrix} X_{11}Y & \cdots & X_{1k}Y \\ \vdots & \ddots & \vdots \\ X_{k1}Y & \cdots & X_{kk}Y \end{pmatrix}.$$
We will need the following three identities:
1. $(X \otimes Y)^{-1} = X^{-1} \otimes Y^{-1}$.
2. $|X \otimes Y| = |X|^{n}\, |Y|^{k}$.
3. Let X and Y be symmetric and let a and b be nk × 1 vectors. Then $a^{T}(X \otimes Y)\,b = \mathrm{Tr}(XS)$,
where Tr(·) denotes the trace of a matrix, $S = a_{(n)}^{T}\, Y\, b_{(n)}$ is a k × k matrix, and the subscript refers to
partitioning into blocks of length n, so that $a_{(n)}$ is an n × k matrix.
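These are standard Kronecker-product identities and can be verified numerically. The following sketch checks identities 1-3 with arbitrary symmetric test matrices; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 4
X = rng.normal(size=(k, k)); X = X @ X.T + k * np.eye(k)   # symmetric positive definite, k x k
Y = rng.normal(size=(n, n)); Y = Y @ Y.T + n * np.eye(n)   # symmetric positive definite, n x n
a = rng.normal(size=n * k)
b = rng.normal(size=n * k)

K = np.kron(X, Y)                                          # block form with blocks X_ij * Y

# 1. (X (x) Y)^{-1} = X^{-1} (x) Y^{-1}
assert np.allclose(np.linalg.inv(K), np.kron(np.linalg.inv(X), np.linalg.inv(Y)))

# 2. |X (x) Y| = |X|^n |Y|^k  (compared on the log scale for numerical stability)
assert np.isclose(np.linalg.slogdet(K)[1],
                  n * np.linalg.slogdet(X)[1] + k * np.linalg.slogdet(Y)[1])

# 3. a^T (X (x) Y) b = Tr(X S), with S = a_(n)^T Y b_(n)
A = a.reshape(k, n).T                                      # n x k: column i is block i of a
B = b.reshape(k, n).T
S = A.T @ Y @ B
assert np.isclose(a @ K @ b, np.trace(X @ S))
```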
To simplify the notation, in this Appendix we denote the additive variance-covariance matrix by VA
instead of G.
B.2. The model
The model described in the main paper includes additive and dominance effects,
y  Xβ  Z a a  Z d d  e.
To simplify the notation and to allow for a more general model, we combine the random effects a, d
into a single variable z, so that the model reads
y  Xβ  Zz  e.
To clarify the notation, we point out the following.
• The data y is a vector of length nk, ordered first by trait and then by individual. Denoting by
yi,j the value of trait j for individual i, we have
$$y = (y_{1,1}, y_{2,1}, \ldots, y_{n,1},\; y_{1,2}, y_{2,2}, \ldots, y_{n,2},\; \ldots,\; y_{1,k}, \ldots, y_{n,k})^{T}.$$
• We assume that there are F fixed effects (including the common mean), which can be either
continuous or class variables (or a mixture of these). Let each class variable have fi classes (to
avoid over-parameterization, we normalize one of the values to zero, so that the number of
treatments is fi − 1). For continuous variables we set fi = 1. The vector of fixed effects β is
then an fk-vector, where f = Σi fi. The vector β is ordered first by trait, then by fixed effect,
and then by treatment (in the case of class variables). Denoting by βi,j,l the effect on trait l of having
treatment j for the fixed effect i, we have
$$\beta = (\beta_{1,1,1}, \ldots, \beta_{1,f_1,1}, \ldots, \beta_{F,f_F,1},\; \beta_{1,1,2}, \ldots, \beta_{F,f_F,k})^{T}.$$
The incidence matrix X has dimensions nk × fk.
• Let there be R random effects, each having ri possible values (with none normalized to zero).
For example, in the case of the model consisting of additive and dominance effects, we have
R = 2. We let r = Σi ri denote the total number of treatments, so that z is an rk-vector, ordered
first by trait, then by random-effect type, and then by treatment within the random effect:
$$z = (z_{1,1,1}, \ldots, z_{1,r_1,1}, \ldots, z_{R,r_R,1},\; z_{1,1,2}, \ldots, z_{R,r_R,k})^{T}.$$
The incidence matrix Z is of dimension nk × rk.
To treat fixed and random effects in a single framework, we write
$$y \sim N(Wh, \Sigma),$$
where W is the nk × (f+r)k block matrix W = (X Z), h is the vector h = (β, z), and Σ is the nk × nk matrix
Σ = VE ⊗ In. We denote the prior for h by N(0, V), where V is the block matrix
$$V = \begin{pmatrix} V_F & 0 \\ 0 & V_R \end{pmatrix}.$$
Here VF denotes the prior for the fixed effects and VR the prior for the random effects. For
simplicity we have assumed a zero mean for h, but a non-zero mean could be used just as readily. For
example, when assuming independent additive and dominance effects, VR is the block matrix
$$V_R = \begin{pmatrix} V_A \otimes A & 0 \\ 0 & V_D \otimes \Delta \end{pmatrix}.$$
B.3. Prior distributions
We assume a flat prior for the fixed effects, i.e. VF^{-1} is a zero matrix. For the variance components
VE, VA and VD the inverse-Wishart prior is a commonly made assumption, partly because it allows
straightforward Gibbs sampling (Gelman et al. 2004). This prior is written as
$$V_X \sim \mathrm{Inv\text{-}Wishart}_{\nu_X}(S_X^{-1}),$$
where X ∈ {E, A, D}. Another alternative is to use a flat prior, VX ∝ 1, but depending on the problem,
this choice can lead to an improper posterior.
B.4. The likelihood
The posterior density has the form
$$p(\beta, V_R, \Sigma \mid y) \;\propto\; p(\beta, V_R, \Sigma)\; p(y \mid \beta, V_R, \Sigma).$$
We next discuss the two parts (prior and likelihood) separately.
1. The prior p(β, VR, Σ) = p(β) p(VR, Σ). As we assume a flat prior for the fixed components, we
have p(β) ∝ 1. In the case of just additive and dominance effects, we assumed independent priors
for the variance components, and hence p(VR, Σ) = p(VA) p(VD) p(VE). The probability
density of the inverse-Wishart distribution W ~ Inv-Wishart_ν(S^{-1}) is given (as a function of
W) by (Gelman et al. 2004)
$$p(W) \;\propto\; |W|^{-(\nu + k + 1)/2} \exp\!\left(-\tfrac{1}{2}\,\mathrm{Tr}(S W^{-1})\right).$$
The expectation of the distribution is E(W) = S/(ν − k − 1) for ν > k + 1.
2. The likelihood of the data, p(y | β, VR, Σ). The data y is distributed as
$$y \sim N(X\beta, Q),$$
where Q = Z VR Z^T + Σ. Hence the likelihood is
$$p(y \mid \beta, V_R, \Sigma) \;\propto\; |Q|^{-1/2} \exp\!\left(-\tfrac{1}{2}(y - X\beta)^{T} Q^{-1} (y - X\beta)\right).$$
To facilitate the computations, we note that |Q| = |L|^2, where L is the Cholesky decomposition of Q.
The REML likelihood reads (e.g. Shaw 1987)
$$p_{RE}(y \mid \beta, V_R, \Sigma) \;\propto\; |Q|^{-1/2}\, |X^{T} Q^{-1} X|^{-1/2} \exp\!\left(-\tfrac{1}{2}(y - X\beta)^{T} Q^{-1} (y - X\beta)\right),$$
which differs from the ordinary likelihood only in the second term.
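A minimal sketch of evaluating the log-likelihood through the Cholesky factor of Q, using |Q| = |L|^2 and solving with the factor rather than forming Q^{-1}; the optional REML flag adds the extra determinant term. Function and argument names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood(y, X, beta, Q, reml=False):
    """Log of the (restricted) likelihood above, up to an additive constant."""
    r = y - X @ beta
    c, low = cho_factor(Q)                           # Q = L L^T, so |Q| = |L|^2
    log_det_Q = 2.0 * np.sum(np.log(np.diag(c)))
    quad = r @ cho_solve((c, low), r)                # (y - X beta)^T Q^{-1} (y - X beta)
    value = -0.5 * (log_det_Q + quad)
    if reml:                                         # extra term |X^T Q^{-1} X|^{-1/2}
        value -= 0.5 * np.linalg.slogdet(X.T @ cho_solve((c, low), X))[1]
    return value
```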
B.5. The Gibbs sampler
The Gibbs sampler is based on updating each parameter (or a block of parameters) in turn, always
drawing from the full conditional distribution given the current values of the remaining parameters.
Below we describe how the Gibbs sampler can be used with block updating.
B.5.1. Variance components
Updating VE, VA and VD. These are variance-covariance matrices of multinormal distributions with
a known (zero) mean. They can hence be updated as (Gelman et al. 2004, pages 50, 87; Sorensen &
Gianola 2002, pages 578-584)
$$V_X \mid h \;\sim\; \mathrm{Inv\text{-}Wishart}_{\nu}(\Lambda^{-1}),$$
where
$$\nu = \nu_X + n, \qquad \Lambda = S_X + x\, M^{-1} x^{T}.$$
Here the n × n structure matrix M equals A, Δ or In, corresponding to the cases X = A, D or E,
respectively. In the case X = E, x = y − Wh, organized as a k × n matrix. In the case X ∈ {A, D}, x is
the k × n matrix consisting of the corresponding elements of the vector h.
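A hedged sketch of this conditional update using scipy's inverse-Wishart sampler; here x is the k × n matrix defined above, M the matching structure matrix (A, Δ or In), and S_X and nu_X the prior scale matrix and degrees of freedom. Names are illustrative.

```python
import numpy as np
from scipy.stats import invwishart

def update_variance_component(x, M, S_X, nu_X, rng=None):
    """Draw V_X | h ~ Inv-Wishart with df = nu_X + n and scale S_X + x M^{-1} x^T."""
    n = x.shape[1]
    Lambda = S_X + x @ np.linalg.solve(M, x.T)       # S_X + x M^{-1} x^T  (k x k)
    return invwishart.rvs(df=nu_X + n, scale=Lambda, random_state=rng)
```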
B.5.2. Multivariate regression
To update the vector h as a single block, we utilize multivariate linear regression. To do so, we note
that as a function of h,
$$p(h, V \mid y) \;\propto\; \exp\!\left(-\tfrac{1}{2}\left[(y - Wh)^{T}\Sigma^{-1}(y - Wh) + h^{T} V^{-1} h\right]\right).$$
As a function of h, we have
$$(y - Wh)^{T}\Sigma^{-1}(y - Wh) = -(Wh)^{T}\Sigma^{-1}y - y^{T}\Sigma^{-1}(Wh) + h^{T}(W^{T}\Sigma^{-1}W)\,h + c,$$
where c denotes a constant that does not depend on h. Hence, as a function of h,
$$p(h, V \mid y) \;\propto\; \exp\!\left(-\tfrac{1}{2}\left[-h^{T}(W^{T}\Sigma^{-1}y) - (y^{T}\Sigma^{-1}W)\,h + h^{T}(W^{T}\Sigma^{-1}W + V^{-1})\,h\right]\right).$$
To complete the square, we note that, as a function of h,
$$(h - z)^{T}(W^{T}\Sigma^{-1}W + V^{-1})(h - z) = h^{T}(W^{T}\Sigma^{-1}W + V^{-1})h - z^{T}(W^{T}\Sigma^{-1}W + V^{-1})h - h^{T}(W^{T}\Sigma^{-1}W + V^{-1})z + c.$$
Hence h is distributed as $N\!\left(z,\; (W^{T}\Sigma^{-1}W + V^{-1})^{-1}\right)$, where z is the solution to
$$(W^{T}\Sigma^{-1}W + V^{-1})\, z = W^{T}\Sigma^{-1} y.$$
We sampled h using the Cholesky decomposition of the matrix $W^{T}\Sigma^{-1}W + V^{-1}$. Other sampling
methods that also avoid the need for explicitly inverting the matrix are available (Garcia-Cortes &
Sorensen 2001; Korsgaard et al. 2003).
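A minimal sketch of this block update: the mean z is obtained from two triangular solves with the Cholesky factor of the precision matrix, and correlated noise is added with a third triangular solve, so that the precision matrix is never inverted explicitly. Names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_h(W, Sigma_inv, V_inv, y, rng=np.random.default_rng()):
    P = W.T @ Sigma_inv @ W + V_inv                  # precision matrix of h given y
    b = W.T @ Sigma_inv @ y
    L = np.linalg.cholesky(P)                        # P = L L^T
    # mean z solves P z = b via two triangular solves
    z = solve_triangular(L, solve_triangular(L, b, lower=True),
                         lower=True, trans='T')
    # L^{-T} u with u ~ N(0, I) has covariance (L L^T)^{-1} = P^{-1}
    u = rng.standard_normal(len(b))
    return z + solve_triangular(L, u, lower=True, trans='T')
```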
B.6. Metropolis-Hastings sampler
Another possibility is to use a Metropolis-Hastings algorithm, which is based on proposing changes
in a parameter or block of parameters and accepting or rejecting the change. The probability of
acceptance depends on the posterior likelihoods in the current and proposed states, and on the
likelihoods of proposing the new state from the current state, and proposing the current state from
the new state (Gelman et al. 2004). The acceptance ratio of the algorithm can be tuned by changing
the variance of the proposal distribution. If the proposed changes are too large, the acceptance ratio
becomes too low; if they are too small, the acceptance ratio is very high but the
MCMC does not mix well. In the case of multinormal target and proposal distributions, the optimal
acceptance ratio is known to be about 0.44 if a single parameter is updated, and it decreases to about
0.23 if many parameters are updated in a block (Gelman et al. 2004). To obtain such a value, the
variance of the proposal distribution can be tuned either by hand or using an adaptive algorithm.
We note that the choice of the proposal distribution does not affect the resulting posterior
distribution, but just the rate at which the MCMC converges to the target distribution. Hence there
is a lot of freedom in the choice of the proposal distributions, the optimal proposal being such that it
is computationally feasible and leads to rapid mixing in the MCMC.
While the total variance (e.g. VT  VA  VD  VE ) can generally be well estimated, the data often
contains only a weak signal on the relative contributions made by the individual causal components.
This can lead to poor mixing in the Gibbs sampler, as the algorithm is not able to move quickly
between the alternative hypotheses (e.g. large VA and small VD , versus small VA and large VD ). To
improve the mixing, we developed a Metropolis-Hastings algorithm which accounts for the above
structure of the posterior density. Furthermore, as the likelihood does not depend on the individual-specific
values of a, d, and e, we need not estimate these at all, but only the variance components
VA, VD, VE. If needed, posteriors for the individual-specific values can be drawn later conditional
on the posteriors of the variance components.
We consider below the univariate and multivariate cases separately. In both cases, we consider the
variance components only, as the vector β can be updated using multivariate regression, see B.5.2.
B.6.1. Univariate case
In the univariate case, we re-parameterise the model by replacing the variance components
(VA, VD, VE) by their sum VT and their relative contributions x = (VA, VD, VE)/VT. Hence VT is a
positive number, and x a vector of length 3 whose elements sum to unity.
More generally, if the model has other variance components, the length of x equals the number of
variance components. The following proposals are used to update the values of these parameters (a
minimal sketch of both proposals is given after the list):
• Proposal 1. The total variance VT can be updated using a log-normal proposal distribution
centred at the current value. The variance of the proposal distribution can be tuned to obtain a
reasonable acceptance ratio.
• Proposal 2. The vector x can be updated using the Dirichlet distribution (see e.g. Gelman et al.
2004) as the proposal. A random sample from the Dirichlet distribution returns a vector of
positive numbers that sum to unity. The parameter vector of the Dirichlet distribution can be
chosen so that the expected value is the current value x, and so that the variance leads to the required
acceptance ratio.
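A minimal sketch of proposals 1 and 2 together with their Metropolis-Hastings corrections. Here log_post stands for a user-supplied unnormalised log posterior of the re-parameterised model, and sigma and conc are tuning constants; all of these are assumptions of the sketch, not part of the original description.

```python
import numpy as np
from scipy.stats import dirichlet

def mh_step(V_T, x, log_post, sigma=0.3, conc=200.0, rng=np.random.default_rng()):
    # Proposal 1: log-normal random walk on the total variance V_T.
    V_new = V_T * np.exp(sigma * rng.standard_normal())
    log_alpha = (log_post(V_new, x) - log_post(V_T, x)
                 + np.log(V_new) - np.log(V_T))          # Hastings term q(V|V') / q(V'|V)
    if np.log(rng.uniform()) < log_alpha:
        V_T = V_new

    # Proposal 2: Dirichlet proposal for the relative contributions x,
    # parameterised so that its expectation equals the current x.
    x_new = dirichlet.rvs(conc * x, random_state=rng)[0]
    log_alpha = (log_post(V_T, x_new) - log_post(V_T, x)
                 + dirichlet.logpdf(x, conc * x_new)      # q(x | x')
                 - dirichlet.logpdf(x_new, conc * x))     # q(x' | x)
    if np.log(rng.uniform()) < log_alpha:
        x = x_new
    return V_T, x
```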
As illustrated in Figure A1, this combination of proposals led to reasonably good mixing,
largely resolving the strong correlation between the components VA, VD and VE (see Fig. 2 in the
main paper).
Figure A1. The mixing of the MCMC using the proposals 1 and 2 in the univariate case. The panels
correspond to the cases n=64 (a) and n=1024 (b), both assuming the Siberian Jay pedigree, the non-fitness case, and a model including additive, dominance, and environmental effects.
B.6.2. Multivariate case
In the multivariate case, the convenient Dirichlet distribution cannot be utilized in a straightforward
manner, because we need to propose changes in positive definite matrices instead of positive
numbers. We developed a number of proposals, which change either the total covariance matrix VT ,
or the relative contributions of the individual causal components, or both at the same time. As noted
in the main text, a positive definite matrix can be represented as an ellipsoid, with the lengths of the
semi-axes corresponding to eigenvalues, and their directions to eigenvectors. We will utilize this
graphical representation in the definitions of the proposals, which share features with the search
algorithms of Kirkpatrick & Meyer (2004).
• Proposal 1. Any of the covariance matrices VA, VD, VE can be updated using the Wishart
distribution as the proposal. The parameters of the Wishart distribution (scale matrix and
degrees of freedom) can be chosen so that the expected value is the current value, and so that the
proposal leads to the required acceptance ratio.
• Proposal 2. The total variance VT can be updated by multiplying it by a scalar drawn
from a log-normal distribution centred at 1. This implies a proportional change to all causal
components, and does not change the relative contributions of the individual causal components.
• Proposal 3. In the bivariate case, all of the covariance matrices VA, VD, VE can be
updated (either simultaneously or one at a time) by rotating them by an angle θ. The angle θ
can be drawn e.g. from a uniform distribution on [−θmax, θmax], where the choice of θmax affects the
acceptance ratio. In the general multivariate case, the rotation axis also needs to be randomized. (A
minimal sketch of proposals 2 and 3 is given after this list.)
• Proposal 4. All of the covariance matrices VA, VD, VE can be updated (either simultaneously or
one at a time) by changing their shape but keeping their orientation. In the bivariate case this
can be done by multiplying one of the semi-axes by a scalar k and the other by 1/k.
• Proposal 5. The relative contributions that the causal components make to a given diagonal
element of VT can be updated one at a time. This can be done using the Dirichlet distribution as
in the univariate case, but noting that the matrices VA, VD, VE need to remain positive definite.
To do so, we calculated for each of the matrices VA, VD, VE the minimum value of the diagonal
element (the value that makes the smallest eigenvalue zero), and used the Dirichlet distribution
to propose a change to the fraction exceeding this minimum value.
• Proposal 6. The relative contributions that the causal components make to a given non-diagonal
element of VT can be updated one at a time. As with the diagonal elements, we first
calculated for each of the matrices VA, VD, VE the largest possible change that would keep the
matrices positive definite. We then used a uniform distribution centred at zero to propose a
change for the non-diagonal element of the two causal components with the smallest allowed
changes. To keep the matrix VT unchanged, the proposal for the third causal component follows
from the condition that the sum of the proposed changes has to be zero.
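A minimal sketch of how proposals 2 and 3 can generate candidate matrices in the bivariate case (k = 2); the acceptance step and any required Hastings corrections are left out, and all names and tuning constants are illustrative.

```python
import numpy as np

def propose_scaling(V_A, V_D, V_E, sigma=0.2, rng=np.random.default_rng()):
    s = np.exp(sigma * rng.standard_normal())        # scalar drawn from a log-normal centred at 1
    return s * V_A, s * V_D, s * V_E                 # proportional change of all causal components

def propose_rotation(V, theta_max=0.2, rng=np.random.default_rng()):
    theta = rng.uniform(-theta_max, theta_max)       # rotation angle
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ V @ R.T                               # rotated matrix remains positive definite
```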
The above proposals are not specific to the causal components (VA, VD, VE), but can be used in the
context of any number of random effects. Choosing the optimal set of proposals depends on the
nature of the problem at hand, and hence it is hard to give general recommendations. When analyzing
the frog data, the combination of proposals 1 and 2 was sufficient to achieve acceptable
mixing in the MCMC.
B.7. Generation of data
To generate data, one needs to sample from the multinormal distribution N(0, V), where the
variance-covariance matrix V has dimensions nk × nk. Let us denote by L the Cholesky
decomposition of V so that V = L^T L. Letting x = (x1, …, xnk) be a row vector consisting of independent
N(0,1) distributed random variables, xL is distributed as N(0, V) (Grimmett & Stirzaker 2001).
The covariance matrix has the structure V = X ⊗ Y, where X is a k × k matrix and Y an n × n matrix.
It holds that L can be decomposed as L = LX ⊗ LY, where LX and LY are the Cholesky
decompositions of the matrices X and Y, respectively.
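A small sketch of this Kronecker short-cut for generating data. For clarity the full nk × nk factor is formed with np.kron, although for large problems the product can equivalently be computed blockwise without forming the full matrix; names are illustrative.

```python
import numpy as np

def sample_kron_normal(X, Y, rng=np.random.default_rng()):
    LX = np.linalg.cholesky(X).T          # upper-triangular factor, X = LX^T LX
    LY = np.linalg.cholesky(Y).T          # upper-triangular factor, Y = LY^T LY
    x = rng.standard_normal(X.shape[0] * Y.shape[0])
    return x @ np.kron(LX, LY)            # row vector distributed as N(0, X (x) Y)
```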
Appendix C: Distance between probability distributions
We proposed the measure
$$d(f, g) = \left(\frac{1}{2}\int_{\mathbb{R}^d} \frac{\left(f(x) - g(x)\right)^{2}}{f(x) + g(x)}\, dx\right)^{1/2}$$
as the distance between two probability distributions f and g on Rd. Here we show that d defines a
metric on the space of such probability distributions. To do so, the distance d has to satisfy the
following four conditions for all f, g, and w:
1. d(f, g) ≥ 0 (non-negativity)
2. d(f, g) = 0 if and only if f = g (identity of indiscernibles)
3. d(f, g) = d(g, f) (symmetry)
4. d(f, g) ≤ d(f, w) + d(w, g) (triangle inequality)
It is trivial that the first three conditions hold. To show the triangle inequality, we denote by s the
function
$$\text{(C.1)}\qquad s(x, y) = \sqrt{\frac{(x - y)^{2}}{x + y}},$$
defined for pairs of positive real numbers. Then
$$\text{(C.2)}\qquad s(x, y) \le s(x, z) + s(z, y),$$
and hence s defines a distance for positive real numbers. To see that (C.2) holds, we need to show
that
$$2\, s(x, z)\, s(z, y) \;\ge\; s(x, y)^{2} - s(x, z)^{2} - s(z, y)^{2}.$$
As the left-hand side is non-negative, it suffices to show that
$$\left(2\, s(x, z)\, s(z, y)\right)^{2} \;\ge\; \left(s(x, y)^{2} - s(x, z)^{2} - s(z, y)^{2}\right)^{2}.$$
Simplifying the above inequality leads to
$$\frac{4\,(x - y)^{2} (x - z)^{2} (y - z)^{2}\,(xy + xz + yz)}{(x + y)^{2} (x + z)^{2} (y + z)^{2}} \;\ge\; 0,$$
which is always true. The triangle inequality for d follows from (C.2) by the Cauchy–Schwarz
inequality
$$\int f(x)\, g(x)\, dx \;\le\; \sqrt{\int f(x)^{2}\, dx}\; \sqrt{\int g(x)^{2}\, dx}:$$
applying (C.2) pointwise to f(x), w(x), g(x), squaring, integrating, and bounding the cross term with
the Cauchy–Schwarz inequality gives d(f, g)^2 ≤ (d(f, w) + d(w, g))^2.
Appendix D: Marginal distributions of the G-matrices
Here we show results corresponding to Fig. 4 in the main text for all six pairs of the four traits.
References
Garcia-Cortes, L. A. & Sorensen, D. 2001 Alternative implementations of Monte Carlo EM
algorithms for likelihood inferences. Genetics Selection Evolution 33, 443-452.
Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. 2004 Bayesian Data Analysis. Boca Raton:
Chapman & Hall/CRC.
Grimmett, G. & Stirzaker, D. 2001 Probability and Random Processes. Oxford, UK: Oxford
University Press.
Kirkpatrick, M. & Meyer, K. 2004 Direct estimation of genetic principal components: Simplified
analysis of complex phenotypes. Genetics 168, 2295-2306.
Korsgaard, I. R., Lund, M. S., Sorensen, D., Gianola, D., Madsen, P. & Jensen, J. 2003 Multivariate
Bayesian analysis of Gaussian, right censored Gaussian, ordered categorical and binary
traits using Gibbs sampling. Genetics Selection Evolution 35, 159-183.
Shaw, R. G. 1987 Maximum-likelihood approaches applied to quantitative genetics of natural
populations. Evolution 41, 812-826.
Sorensen, D. & Gianola, D. 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative
Genetics. New York: Springer-Verlag.
VanTassell, C. P. & VanVleck, L. D. 1996 Multiple-trait Gibbs sampler for animal models: Flexible
programs for Bayesian and likelihood-based (co)variance component inference. Journal of
Animal Science 74, 2586-2597.