Appendix A: Computation of fraternity coefficients

The fraternity coefficients can be approximated by a straightforward simulation approach, which keeps track of whether the genes of two individuals are identical by descent. We assume a pedigree consisting of $n$ individuals, and denote by $m_i$ and $f_i$ the mother and the father of individual $i$, respectively. If either of the parents is unknown (e.g. $i$ belonging to the first generation in the pedigree or being an immigrant), we set $m_i = -1$ or $f_i = -1$. Let the pedigree be ordered so that the parents always come before the offspring, i.e. $m_i < i$ and $f_i < i$.

In the simulation, two genes are assigned to each individual, one originating from the mother and the other one from the father. The simulation is initiated by assigning unique values to the genes that would originate from the unknown parents, e.g. labelling them as 1, 2, .... The remaining genes are then assigned to the individuals $i = 1, \ldots, n$ by randomly choosing either of the genes assigned to the parents. If the values of two genes are the same, the genes are identical by descent. The procedure is replicated $N$ times, and the fraternity coefficient $\Delta_{ij}$ is estimated as $\hat{\Delta}_{ij}$, the fraction of replicates for which the two genes of individual $i$ are identical by descent with the two genes of individual $j$. We note that $\Delta_{ij}$ can be nonzero only if the approximation provided by Eq. 1 is nonzero. Hence only a subset of the pairs needs to be examined, which lowers the computational cost substantially if $\Delta$ is a sparse matrix. As a sum of Bernoulli trials is binomially distributed, the standard error of the estimate is
\[
\mathrm{SE} = \sqrt{\frac{\Delta_{ij}(1 - \Delta_{ij})}{N}}.
\]
In the example of Fig. 1, $N = 10^6$, hence the maximum standard error (obtained if $\Delta_{ij} = 1/2$) is $\mathrm{SE} = 0.0005$.
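To make the gene-dropping scheme concrete, a minimal sketch could read as follows (Python with NumPy; the function name simulate_fraternity, its argument names and the full-sib example at the end are our own illustration, not the implementation used for the paper).

import numpy as np

def simulate_fraternity(mother, father, pairs, N=100_000, rng=None):
    """Monte Carlo (gene dropping) estimate of fraternity coefficients.

    mother, father : arrays of length n with parent indices, -1 if unknown;
                     parents are assumed to precede their offspring.
    pairs          : list of (i, j) pairs for which Delta_ij is estimated.
    Returns a dict {(i, j): Delta_hat}.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(mother)
    hits = {pair: 0 for pair in pairs}
    for _ in range(N):
        genes = np.empty((n, 2), dtype=np.int64)
        next_label = 0  # unique labels for genes coming from unknown parents
        for i in range(n):
            for k, parent in enumerate((mother[i], father[i])):
                if parent == -1:
                    genes[i, k] = next_label
                    next_label += 1
                else:
                    # inherit one of the parent's two genes at random
                    genes[i, k] = genes[parent, rng.integers(2)]
        for (i, j) in pairs:
            a, b = genes[i]
            c, d = genes[j]
            # both genes identical by descent, in either pairing
            if (a == c and b == d) or (a == d and b == c):
                hits[(i, j)] += 1
    return {pair: hits[pair] / N for pair in hits}

# Two full sibs (individuals 2 and 3) of unrelated, non-inbred parents:
mother = np.array([-1, -1, 0, 0]); father = np.array([-1, -1, 1, 1])
print(simulate_fraternity(mother, father, [(2, 3)], N=200_000))

For two full sibs of unrelated, non-inbred parents the estimate should be close to 1/4, which provides a quick check of such an implementation.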
Appendix B: Implementation of Bayesian estimation

Part of the presentation here follows closely that given earlier in the context of animal breeding, see e.g. Sorensen & Gianola (2002) and Van Tassell & Van Vleck (1996).

B.1. Notation

Let $X$ be a $k \times k$ matrix and $Y$ an $n \times n$ matrix. Then the Kronecker product $X \otimes Y$ is the $nk \times nk$ matrix, which can be written in block form as
\[
X \otimes Y = \begin{pmatrix} X_{11} Y & \cdots & X_{1k} Y \\ \vdots & \ddots & \vdots \\ X_{k1} Y & \cdots & X_{kk} Y \end{pmatrix}.
\]
We will need the following three identities:

1. $(X \otimes Y)^{-1} = X^{-1} \otimes Y^{-1}$.
2. $|X \otimes Y| = |X|^n |Y|^k$.
3. Let $X$ and $Y$ be symmetric and let $a$ and $b$ be $nk \times 1$ vectors. Then $a^T (X \otimes Y) b = \mathrm{Tr}(XS)$, where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, $S = a_{(n)}^T Y b_{(n)}$ is a $k \times k$ matrix, and the subscript refers to the partition into blocks of length $n$, so that $a_{(n)}$ is an $n \times k$ matrix.

To simplify the notation, we will denote in this Appendix the matrix of additive variance by $V_A$ instead of $G$.

B.2. The model

The model described in the main paper includes additive and dominance effects,
\[
y = X\beta + Z_a a + Z_d d + e.
\]
To simplify the notation and to allow for a more general model, we combine the random effects $a$ and $d$ into a single variable $z$, so that the model reads
\[
y = X\beta + Zz + e.
\]
To clarify the notation, we point out the following. The data $y$ is a vector of length $nk$, ordered first by trait and then by individual. Denoting by $y_{i,j}$ the trait $j$ of individual $i$, we have
\[
y = (y_{1,1}, y_{2,1}, \ldots, y_{n,1}, y_{1,2}, y_{2,2}, \ldots, y_{n,2}, \ldots, y_{1,k}, \ldots, y_{n,k})^T.
\]
We assume that there are $F$ fixed effects (including the common mean), which can be either continuous or class variables (or a mixture of these). Let each class variable have $f_i$ classes (to avoid over-parameterization, we normalize one of the values to zero, so that the number of treatments to be estimated is $f_i - 1$). For continuous variables we set $f_i = 1$. The vector of fixed effects $\beta$ is then of length $fk$, where $f = \sum_i f_i$. The vector $\beta$ is ordered first by trait, then by fixed effect, and then by treatment (in the case of class variables). Denoting by $\beta_{i,j,l}$ the effect on trait $l$ of having treatment $j$ for the fixed effect $i$, we have
\[
\beta = (\beta_{1,1,1}, \ldots, \beta_{1,f_1,1}, \ldots, \beta_{F,f_F,1}, \beta_{1,1,2}, \ldots, \beta_{F,f_F,k})^T.
\]
The incidence matrix $X$ has the dimensions $nk \times fk$.

Let there be $R$ random effects, each having $r_i$ possible values (with none normalized to zero). For example, in the case of the model consisting of additive and dominance effects, we have $R = 2$. We let $r = \sum_i r_i$ denote the total number of treatments, so that $z$ is a vector of length $rk$, ordered first by trait, then by random effect type, and then by treatment within the random effect:
\[
z = (z_{1,1,1}, \ldots, z_{1,r_1,1}, \ldots, z_{R,r_R,1}, z_{1,1,2}, \ldots, z_{R,r_R,k})^T.
\]
The incidence matrix $Z$ is of dimension $nk \times rk$.

To treat fixed and random effects in a single framework, we write $y \sim N(Wh, \Sigma)$, where $W$ is the $nk \times (f+r)k$ block matrix $W = (X \; Z)$, $h$ is the vector $h = (\beta^T, z^T)^T$, and $\Sigma$ is the $nk \times nk$ matrix $\Sigma = V_E \otimes I_n$. We denote the prior for $h$ by $N(0, V)$, where $V$ is the block matrix
\[
V = \begin{pmatrix} V_F & 0 \\ 0 & V_R \end{pmatrix}.
\]
Here $V_F$ denotes the prior covariance of the fixed effects, and $V_R$ that of the random effects. For simplicity we have assumed a zero mean for $h$, but a non-zero mean could be used as readily. For example, when assuming independent additive and dominance effects, $V_R$ is the block matrix
\[
V_R = \begin{pmatrix} V_A \otimes A & 0 \\ 0 & V_D \otimes \Delta \end{pmatrix}.
\]

B.3. Prior distributions

We assume a flat prior for the fixed effects, i.e. $V_F^{-1}$ is a zero matrix. For the variance components $V_E$, $V_A$ and $V_D$, the inverse-Wishart prior is a commonly made assumption, partly because it allows straightforward Gibbs sampling (Gelman et al. 2004). This prior is written as $V_X \sim \text{Inv-Wishart}_{\nu_X}(S_X^{-1})$, where $X \in \{E, A, D\}$. Another alternative is to use a flat prior, $V_X \sim 1$, but depending on the problem, this choice can lead to an improper posterior.

B.4. The likelihood

The posterior density has the form
\[
p(\beta, V_R, \Sigma \mid y) \propto p(\beta, V_R, \Sigma)\, p(y \mid \beta, V_R, \Sigma).
\]
We next discuss the two parts (prior and likelihood) separately.

1. The prior $p(\beta, V_R, \Sigma) = p(\beta)\, p(V_R, \Sigma)$. As we assume a flat prior for the fixed components, we have $p(\beta) \propto 1$. In the case of just additive and dominance effects, we assumed independent priors for the variance components, and hence $p(V_R, \Sigma) = p(V_A)\, p(V_D)\, p(V_E)$. The probability density of the inverse-Wishart distribution $W \sim \text{Inv-Wishart}_{\nu}(S^{-1})$ is given (as a function of $W$) by (Gelman et al. 2004)
\[
p(W) \propto |W|^{-(\nu + k + 1)/2} \exp\!\left(-\tfrac{1}{2} \mathrm{Tr}(S W^{-1})\right).
\]
The expectation of the distribution is $E(W) = S/(\nu - k - 1)$ for $\nu > k + 1$.

2. The likelihood of the data, $p(y \mid \beta, V_R, \Sigma)$. The data $y$ is distributed as $y \sim N(X\beta, Q)$, where $Q = Z V_R Z^T + \Sigma$. Hence the likelihood is
\[
p(y \mid \beta, V_R, \Sigma) \propto |Q|^{-1/2} \exp\!\left(-\tfrac{1}{2} (y - X\beta)^T Q^{-1} (y - X\beta)\right).
\]
To facilitate the computations, we note that $|Q| = |L|^2$, where $L$ is the Cholesky decomposition of $Q$. The REML likelihood reads (e.g. Shaw 1987)
\[
p_{RE}(y \mid \beta, V_R, \Sigma) \propto |Q|^{-1/2}\, |X^T Q^{-1} X|^{-1/2} \exp\!\left(-\tfrac{1}{2} (y - X\beta)^T Q^{-1} (y - X\beta)\right),
\]
which differs from the ordinary likelihood only by the second determinant term.
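As an illustration of how the Cholesky decomposition enters these computations, a minimal sketch could be the following (Python with NumPy/SciPy; the function names gaussian_loglik and reml_loglik are our own labels, and dense matrices are assumed purely for clarity). The determinant is obtained from the diagonal of the Cholesky factor and the quadratic form from a triangular solve, so that Q is never inverted explicitly.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gaussian_loglik(y, X, beta, Q):
    """Log of the (unnormalized) likelihood p(y | beta, V_R, Sigma),
    with Q = Z V_R Z^T + Sigma, evaluated via the Cholesky factor of Q."""
    r = y - X @ beta
    L = cholesky(Q, lower=True)              # Q = L L^T, so |Q| = |L|^2
    half_logdet = np.sum(np.log(np.diag(L))) # = 0.5 * log|Q|
    u = solve_triangular(L, r, lower=True)   # u = L^{-1} r
    quad = u @ u                             # = r^T Q^{-1} r
    return -half_logdet - 0.5 * quad

def reml_loglik(y, X, beta, Q):
    """REML variant: adds the term -0.5 * log |X^T Q^{-1} X| (cf. Shaw 1987)."""
    L = cholesky(Q, lower=True)
    U = solve_triangular(L, X, lower=True)   # U = L^{-1} X, so U^T U = X^T Q^{-1} X
    _, logdet = np.linalg.slogdet(U.T @ U)
    return gaussian_loglik(y, X, beta, Q) - 0.5 * logdet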
B.5. The Gibbs sampler

The Gibbs sampler is based on updating each parameter (or block of parameters) in turn, always drawing from the full conditional distribution given the current values of the remaining parameters. Below we describe how the Gibbs sampler can be used with block updating.

B.5.1. Variance components

Updating $V_E$, $V_A$ and $V_D$: these are variance-covariance matrices of multinormal distributions with a known (zero) mean. They can hence be updated as (Gelman et al. 2004, pages 50, 87; Sorensen & Gianola 2002, pages 578-584)
\[
V_X \mid h \sim \text{Inv-Wishart}_{\nu}(\Lambda^{-1}), \quad \text{where } \nu = \nu_X + n, \quad \Lambda = S_X + x \Psi^{-1} x^T.
\]
Here $\Psi = A$, $\Psi = \Delta$, or $\Psi = I_n$, corresponding to the cases $X \in \{A, D, E\}$, respectively. In the case $X = E$, $x = y - Wh$, organized as a $k \times n$ matrix. In the case $X \in \{A, D\}$, $x$ is the $k \times n$ matrix consisting of the corresponding elements of the vector $h$.

B.5.2. Multivariate regression

To update the vector $h$ as a single block, we utilize multivariate linear regression. To do so, we note that, as a function of $h$,
\[
p(h, V \mid y) \propto \exp\!\left(-\tfrac{1}{2}\left[(y - Wh)^T \Sigma^{-1} (y - Wh) + h^T V^{-1} h\right]\right).
\]
As a function of $h$, we have
\[
(y - Wh)^T \Sigma^{-1} (y - Wh) = -(Wh)^T \Sigma^{-1} y - y^T \Sigma^{-1} (Wh) + h^T (W^T \Sigma^{-1} W) h + c,
\]
where $c$ is a constant (not depending on $h$). Hence, as a function of $h$,
\[
p(h, V \mid y) \propto \exp\!\left(-\tfrac{1}{2}\left[-h^T (W^T \Sigma^{-1} y) - (y^T \Sigma^{-1} W) h + h^T (W^T \Sigma^{-1} W + V^{-1}) h\right]\right).
\]
To complete the square, we note that, as a function of $h$,
\[
(h - z)^T (W^T \Sigma^{-1} W + V^{-1})(h - z) = h^T (W^T \Sigma^{-1} W + V^{-1}) h - z^T (W^T \Sigma^{-1} W + V^{-1}) h - h^T (W^T \Sigma^{-1} W + V^{-1}) z + c.
\]
Hence $h$ is distributed as $N(z, (W^T \Sigma^{-1} W + V^{-1})^{-1})$, where $z$ is the solution to
\[
(W^T \Sigma^{-1} W + V^{-1})\, z = W^T \Sigma^{-1} y.
\]
We sampled $h$ using the Cholesky decomposition of the matrix $W^T \Sigma^{-1} W + V^{-1}$. Other sampling methods that also avoid the need for explicitly inverting the matrix are available (Garcia-Cortes & Sorensen 2001; Korsgaard et al. 2003).
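A minimal sketch of one sweep over the two full conditionals above could read as follows (Python with NumPy/SciPy; the function names, the generic structure matrix Psi, and the convention of passing Sigma^{-1} and V^{-1} as dense arrays are our own simplifications, and an actual implementation would exploit the Kronecker and sparsity structure).

import numpy as np
from scipy.stats import invwishart
from scipy.linalg import cholesky, cho_solve, solve_triangular

def draw_variance(x, Psi, nu_X, S_X, rng=None):
    """Draw V_X | h ~ Inv-Wishart(nu_X + n, (S_X + x Psi^{-1} x^T)^{-1}),
    where x is a k-by-n matrix and Psi the corresponding n-by-n structure
    matrix (A, Delta, or the identity)."""
    n = x.shape[1]
    Lam = S_X + x @ np.linalg.solve(Psi, x.T)
    return invwishart.rvs(df=nu_X + n, scale=Lam, random_state=rng)

def draw_effects(y, W, Sigma_inv, V_inv, rng=None):
    """Draw h ~ N(z, P^{-1}), where P = W^T Sigma^{-1} W + V^{-1} and
    P z = W^T Sigma^{-1} y, using the Cholesky factor of P."""
    rng = np.random.default_rng(rng)
    P = W.T @ Sigma_inv @ W + V_inv
    b = W.T @ Sigma_inv @ y
    L = cholesky(P, lower=True)              # P = L L^T
    z = cho_solve((L, True), b)              # posterior mean: solve P z = b
    u = rng.standard_normal(P.shape[0])
    # If v solves L^T v = u, then Cov(v) = L^{-T} L^{-1} = P^{-1}
    v = solve_triangular(L.T, u, lower=False)
    return z + v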
B.6. Metropolis-Hastings sampler

Another possibility is to use a Metropolis-Hastings algorithm, which is based on proposing changes in a parameter or a block of parameters and accepting or rejecting the change. The probability of acceptance depends on the posterior densities in the current and proposed states, and on the probabilities of proposing the new state from the current state and of proposing the current state from the new state (Gelman et al. 2004). The acceptance ratio of the algorithm can be tuned by changing the variance of the proposal distribution. If too large changes are attempted, the acceptance ratio becomes too low. If too small changes are attempted, the acceptance ratio is very high, but the MCMC does not mix well. In the case of multinormal target and proposal distributions, the optimal acceptance ratio is known to be about 0.44 if a single parameter is updated, and it decreases to about 0.23 if many parameters are updated in a block (Gelman et al. 2004). To obtain such a value, the variance of the proposal distribution can be tuned either by hand or using an adaptive algorithm. We note that the choice of the proposal distribution does not affect the resulting posterior distribution, only the rate at which the MCMC converges to the target distribution. Hence there is a lot of freedom in the choice of the proposal distributions, the optimal proposal being one that is computationally feasible and leads to rapid mixing of the MCMC.

While the total variance (e.g. $V_T = V_A + V_D + V_E$) can generally be well estimated, the data often contain only a weak signal about the relative contributions made by the individual causal components. This can lead to poor mixing in the Gibbs sampler, as the algorithm is not able to move quickly between the alternative hypotheses (e.g. large $V_A$ and small $V_D$, versus small $V_A$ and large $V_D$). To improve the mixing, we developed a Metropolis-Hastings algorithm which accounts for the above structure of the posterior density. Furthermore, as the likelihood does not depend on the individual-specific values of $a$, $d$ and $e$, we need not estimate these at all, but only the variance components $V_A$, $V_D$, $V_E$. If needed, posteriors for the individual-specific values can be drawn later, conditional on the posteriors of the variance components. We consider below the univariate and multivariate cases separately. In both cases, we consider the variance components only, as the vector $\beta$ can be updated using multivariate regression, see B.5.2.

B.6.1. Univariate case

In the univariate case, we re-parameterise the model by replacing the variance components $(V_A, V_D, V_E)$ by their sum $V_T$ and their relative contributions $x = (V_A, V_D, V_E)/V_T$. Hence $V_T$ is a positive number, and $x$ a vector of length 3 whose elements sum to unity. More generally, if the model has other variance components, the length of $x$ equals the number of variance components. The following proposals are used to update the values of these parameters (a sketch of their use is given at the end of this subsection):

Proposal 1. The total variance $V_T$ can be updated using a log-normal proposal distribution centred at the current value. The variance of the proposal distribution can be tuned to obtain a reasonable acceptance ratio.

Proposal 2. The vector $x$ can be updated using the Dirichlet distribution (see e.g. Gelman et al. 2004) as the proposal. A random sample from the Dirichlet distribution returns a vector of positive numbers that sum to unity. The parameter vector of the Dirichlet distribution can be chosen so that the expected value is the current value $x$, and so that the variance leads to a required acceptance ratio.

As illustrated by Figure A1, this combination of proposals led to reasonably good mixing, largely resolving the strong correlation between the components $V_A$, $V_D$ and $V_E$ (see Fig. 2 in the main paper).

Figure A1. The mixing of the MCMC using proposals 1 and 2 in the univariate case. The panels correspond to the cases n=64 (a) and n=1024 (b), both assuming the Siberian Jay pedigree, the non-fitness case, and a model including additive, dominance and environmental effects.
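A minimal sketch of one sweep over proposals 1 and 2, assuming a user-supplied function log_post(VT, x) that returns the log posterior density of the re-parameterised model (up to an additive constant), could be the following (Python with NumPy/SciPy; the function name mh_step and the default tuning constants are our own).

import numpy as np
from scipy.stats import dirichlet

def mh_step(VT, x, log_post, sigma_VT=0.3, conc=100.0, rng=None):
    """One Metropolis-Hastings update of (V_T, x); sigma_VT is the log-scale
    standard deviation of the log-normal proposal and conc the concentration
    of the Dirichlet proposal (both are tuning constants)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)

    # Proposal 1: log-normal proposal for V_T, centred at the current value.
    # The proposal is asymmetric; for a random walk on log(V_T) the Hastings
    # correction q(VT | VT_new) / q(VT_new | VT) reduces to VT_new / VT.
    VT_new = VT * np.exp(sigma_VT * rng.standard_normal())
    log_ratio = (log_post(VT_new, x) - log_post(VT, x)
                 + np.log(VT_new) - np.log(VT))
    if np.log(rng.uniform()) < log_ratio:
        VT = VT_new

    # Proposal 2: Dirichlet proposal for x with expected value equal to the
    # current x; larger conc means smaller proposed changes.
    x_new = rng.dirichlet(conc * x)
    log_ratio = (log_post(VT, x_new) - log_post(VT, x)
                 + dirichlet.logpdf(x, conc * x_new)     # q(x | x_new)
                 - dirichlet.logpdf(x_new, conc * x))    # q(x_new | x)
    if np.log(rng.uniform()) < log_ratio:
        x = x_new

    return VT, x

The two tuning constants would be adjusted, by hand or adaptively, towards the acceptance ratios quoted above.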
B.6.2. Multivariate case

In the multivariate case, the convenient Dirichlet distribution cannot be utilized in a straightforward manner, because we need to propose changes in positive definite matrices instead of positive numbers. We developed a number of proposals, which change either the total covariance matrix $V_T$, or the relative contributions of the individual causal components, or both at the same time. As noted in the main text, a positive definite matrix can be represented as an ellipsoid, with the lengths of the semi-axes corresponding to the eigenvalues and their directions to the eigenvectors. We will utilize this graphical representation in the definitions of the proposals, which share features of the search algorithms of Kirkpatrick & Meyer (2004). A sketch of proposals 1 and 3 is given at the end of this subsection.

Proposal 1. Any of the covariance matrices $V_A$, $V_D$, $V_E$ can be updated using the Wishart distribution as the proposal. The parameters of the Wishart distribution (scale matrix and degrees of freedom) can be chosen so that the expected value is the current value, and so that the proposal leads to a required acceptance ratio.

Proposal 2. The total variance $V_T$ can be updated by multiplying it by a scalar drawn from a log-normal distribution centred at 1. This implies a proportional change to all causal components, and does not change their relative contributions.

Proposal 3. In the bivariate case, all of the covariance matrices $V_A$, $V_D$, $V_E$ can be updated (either simultaneously or one at a time) by rotating them by an angle $\theta$. The angle can be drawn e.g. from a uniform distribution on $[-\theta_{\max}, \theta_{\max}]$, where the choice of $\theta_{\max}$ affects the acceptance ratio. In the multivariate case, the rotation axis also needs to be randomized.

Proposal 4. All of the covariance matrices $V_A$, $V_D$, $V_E$ can be updated (either simultaneously or one at a time) by changing their shape while keeping their orientation. In the bivariate case this can be done by multiplying one of the semi-axes by a scalar $\kappa$ and the other by $1/\kappa$.

Proposal 5. The relative contributions that the causal components make to a given diagonal element of $V_T$ can be updated one at a time. This can be done using the Dirichlet distribution as in the univariate case, but noting that the matrices $V_A$, $V_D$, $V_E$ need to remain positive definite. To do so, we calculated the minimum value of the diagonal element (the value at which the smallest eigenvalue becomes zero) for each of the matrices $V_A$, $V_D$, $V_E$, and used the Dirichlet distribution to propose a change to the fraction exceeding the minimum value.

Proposal 6. The relative contributions that the causal components make to a given non-diagonal element of $V_T$ can be updated one at a time. As with the diagonal elements, we first calculated for each of the matrices $V_A$, $V_D$, $V_E$ the largest possible change that would keep the matrices positive definite. We then used a uniform distribution centred at zero to propose a change to the non-diagonal element of the two causal components with the smallest allowed changes. To keep the matrix $V_T$ unchanged, the proposal for the third causal component follows from the condition that the sum of the proposed changes has to be zero.

The above proposals are not specific to the causal components $(V_A, V_D, V_E)$, but can be used in the context of any number of random effects. Choosing the optimal set of proposals depends on the nature of the problem at hand, and hence it is hard to give general recommendations. When analyzing the frog data, the combination of proposals 1 and 2 was sufficient to lead to acceptable mixing of the MCMC.
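As an illustration, proposals 1 and 3 could be sketched as follows for the bivariate case (Python with NumPy/SciPy; the function names, the default tuning constants, and the convention of returning the log Hastings correction together with the proposal are our own).

import numpy as np
from scipy.stats import wishart

def propose_wishart(V, df=50, rng=None):
    """Proposal 1: draw V' from a Wishart distribution whose expected value
    is the current matrix V (the mean of Wishart(df, scale) is df * scale).
    Returns the proposal and log q(V | V') - log q(V' | V)."""
    rng = np.random.default_rng(rng)
    V_new = wishart.rvs(df=df, scale=V / df, random_state=rng)
    log_hastings = (wishart.logpdf(V, df=df, scale=V_new / df)
                    - wishart.logpdf(V_new, df=df, scale=V / df))
    return V_new, log_hastings

def propose_rotation(V, theta_max=0.2, rng=None):
    """Proposal 3 (bivariate case): rotate the ellipse of a 2x2 covariance
    matrix by an angle drawn uniformly from [-theta_max, theta_max].
    The proposal is symmetric, so the Hastings correction is zero."""
    rng = np.random.default_rng(rng)
    theta = rng.uniform(-theta_max, theta_max)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])   # rotation keeps eigenvalues, turns eigenvectors
    return R @ V @ R.T, 0.0

The returned correction is added to the log posterior ratio in the usual Metropolis-Hastings acceptance step; for the rotation proposal it vanishes because the proposal is symmetric.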
B.7. Generation of data

To generate data, one needs to sample from the multinormal distribution $N(0, V)$, where the variance-covariance matrix $V$ has dimensions $nk \times nk$. Let us denote by $L$ the Cholesky decomposition of $V$, so that $V = L^T L$. Letting $x = (x_1, \ldots, x_{nk})^T$ be a vector of independent $N(0,1)$ distributed random variables, $L^T x$ is distributed as $N(0, V)$ (Grimmett & Stirzaker 2001). The covariance matrix has the structure $V = X \otimes Y$, where $X$ is a $k \times k$ matrix and $Y$ an $n \times n$ matrix. It holds that $L$ can be decomposed as $L = L_X \otimes L_Y$, where $L_X$ and $L_Y$ are the Cholesky decompositions of the matrices $X$ and $Y$, respectively.

Appendix C: Distance between probability distributions

We proposed the measure
\[
d(f, g) = \left[\frac{1}{2} \int_{R^d} \frac{\left(f(x) - g(x)\right)^2}{f(x) + g(x)}\, dx\right]^{1/2}
\]
as the distance between two probability distributions $f$ and $g$ on $R^d$. Here we show that $d$ defines a metric on the space of such probability distributions. To do so, the distance $d$ has to satisfy the following four conditions for all $f$, $g$ and $w$:

1. $d(f, g) \ge 0$ (non-negativity)
2. $d(f, g) = 0$ if and only if $f = g$ (identity of indiscernibles)
3. $d(f, g) = d(g, f)$ (symmetry)
4. $d(f, g) \le d(f, w) + d(w, g)$ (triangle inequality)

It is trivial that the first three conditions hold. To show the triangle inequality, we denote by $s$ the function
\[
s(x, y) = \sqrt{\frac{(x - y)^2}{x + y}}, \tag{C.1}
\]
defined for pairs of positive real numbers. Then
\[
s(x, y) \le s(x, z) + s(z, y), \tag{C.2}
\]
and hence $s$ defines a distance for positive real numbers. To see that (C.2) holds, we need to show that
\[
2\, s(x, z)\, s(z, y) \ge s(x, y)^2 - s(x, z)^2 - s(z, y)^2.
\]
As the left-hand side is non-negative, it suffices to show that
\[
\left(2\, s(x, z)\, s(z, y)\right)^2 \ge \left(s(x, y)^2 - s(x, z)^2 - s(z, y)^2\right)^2.
\]
Simplifying the above inequality leads to
\[
\frac{4\, (x - y)^2 (x - z)^2 (y - z)^2\, (xy + xz + yz)}{(x + y)^2 (x + z)^2 (y + z)^2} \ge 0,
\]
which is always true. The triangle inequality for $d$ follows by applying (C.2) pointwise to $s(f(x), g(x))$ and then using the triangle (Minkowski) inequality in $L^2$, which is a consequence of the Cauchy-Schwarz inequality
\[
\int f(x)\, g(x)\, dx \le \left(\int f(x)^2\, dx\right)^{1/2} \left(\int g(x)^2\, dx\right)^{1/2}.
\]

Appendix D: Marginal distributions of the G-matrices

Here we show results corresponding to Fig. 4 in the main text for all six pairs of the four traits.

References

Garcia-Cortes, L. A. & Sorensen, D. 2001 Alternative implementations of Monte Carlo EM algorithms for likelihood inferences. Genetics Selection Evolution 33, 443-452.

Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. 2004 Bayesian Data Analysis. Boca Raton: Chapman & Hall/CRC.

Grimmett, G. & Stirzaker, D. 2001 Probability and Random Processes. Oxford, UK: Oxford University Press.

Kirkpatrick, M. & Meyer, K. 2004 Direct estimation of genetic principal components: simplified analysis of complex phenotypes. Genetics 168, 2295-2306.

Korsgaard, I. R., Lund, M. S., Sorensen, D., Gianola, D., Madsen, P. & Jensen, J. 2003 Multivariate Bayesian analysis of Gaussian, right censored Gaussian, ordered categorical and binary traits using Gibbs sampling. Genetics Selection Evolution 35, 159-183.

Shaw, R. G. 1987 Maximum-likelihood approaches applied to quantitative genetics of natural populations. Evolution 41, 812-826.

Sorensen, D. & Gianola, D. 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. New York: Springer-Verlag.

Van Tassell, C. P. & Van Vleck, L. D. 1996 Multiple-trait Gibbs sampler for animal models: flexible programs for Bayesian and likelihood-based (co)variance component inference. Journal of Animal Science 74, 2586-2597.