Introduction to multivariate statistics
Terry Speed, SICSA Summer School: Statistical Inference in Computational Biology, Edinburgh, June 14-15, 2010
Lecture 2
1

The multivariate normal: Conditional densities

In the notation of p.8 of Lecture 1, put ak = −qkp/qpp, so that yp = qpp(xp − a1x1 − … − ap-1xp-1). Recall that Yp was found to be independent of X1, …, Xp-1. In other words, the ak are numbers such that T = Xp − a1X1 − … − ap-1Xp-1 is independent of (X1, …, Xp-1), and this property uniquely characterizes the coefficients ak.

To obtain the conditional density of Xp given X1 = x1, …, Xp-1 = xp-1, we must divide the density of (X1, …, Xp) by the marginal density of (X1, …, Xp-1). From the above, we get an exponential with exponent −½ yp²/qpp.
2

Multivariate normal: Conditional densities, II

Thus the conditional density of Xp given X1 = x1, …, Xp-1 = xp-1 is normal with expectation a1x1 + … + ap-1xp-1 and variance 1/qpp, i.e.

(*)  E(Xp | X1, …, Xp-1) = a1X1 + … + ap-1Xp-1.

Theorem. If (X1, …, Xp) has a normal density, the conditional density of Xp given X1, …, Xp-1 is again normal. Further, the conditional expectation (*) is the unique linear function of X1, …, Xp-1 making T independent of (X1, …, Xp-1). The conditional variance equals var(T) = 1/qpp.

I now turn to an alternative development, leading to a more general conclusion. But first,
3

Two simple facts about normal distributions

Introduce the notation X ~ N(μ, Σ) to abbreviate the statement that X is normal with center μ and covariance matrix Σ. Also note that I am no longer making my vectors and matrices bold.

Fact 1: If X ~ N(μ, Σ) and Y = AX + b, then Y ~ N(Aμ + b, AΣAᵀ).

Fact 2: If X ~ N(μ, Σ), then AX and BX are independent iff AΣBᵀ = 0.

The proofs of these facts are straightforward consequences of earlier results; e.g. we use the readily established formulae var(AX) = AΣAᵀ and cov(AX, BX) = AΣBᵀ.
4

Conditional densities: a second approach

Here is a more multivariate version of our recent result. It is proved directly in Bishop's book, but the following derivation is simpler, given the independence result we use.

Write our p-vector X = (X1ᵀ, X2ᵀ)ᵀ, where X1 is an r-vector and X2 is an s-vector, s = p − r. Partition the covariance matrix var(X) = Σ of X into diagonal blocks Σ11 = var(X1), Σ22 = var(X2), and off-diagonal blocks Σ12 = Σ21ᵀ = cov(X1, X2).

Theorem. If X ~ N(μ, Σ), then X1 and X2.1 = X2 − Σ21Σ11⁻¹X1 have the following distributions, and are independent:
X1 ~ N(μ1, Σ11),  X2.1 ~ N(μ2.1, Σ22.1),
where μ2.1 = μ2 − Σ21Σ11⁻¹μ1 and Σ22.1 = Σ22 − Σ21Σ11⁻¹Σ12.
5

Proof of the preceding theorem

The main work lies in establishing that X1 and X2.1 are uncorrelated, and so independent. This and the formulae given follow from the two facts stated on p.4 above.

It follows from what we have just proved that the conditional distribution of X2.1 given X1 is the same as its marginal distribution. But X2 is just X2.1 plus Σ21Σ11⁻¹X1, which is constant, given X1. Hence

Theorem. With the same notation and assumptions,
X2 | X1 ~ N(μ2 + Σ21Σ11⁻¹(X1 − μ1), Σ22.1).

I leave you to fill in the details.
6

Details of the proof

In the lecture, I screwed up the proof. Here's the correct version, using the fact that matrix coefficients in the second argument of cov(·,·) must be transposed (last line, p.4 above).

cov(X1, X2.1) = cov(X1, X2 − Σ21Σ11⁻¹X1)
             = cov(X1, X2) − cov(X1, X1)Σ11⁻¹Σ21ᵀ
             = Σ12 − Σ11Σ11⁻¹Σ21ᵀ
             = 0.
7
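Before moving on, a quick numerical check of the result on pp. 2-3 may be useful. The following is a minimal numpy sketch of my own (the variable names are mine, not from the lecture): it builds a random precision matrix Q, forms the coefficients ak = −qkp/qpp, and confirms that T = Xp − a1X1 − … − ap-1Xp-1 is uncorrelated with X1, …, Xp-1 and has variance 1/qpp.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 5

    # Random symmetric positive-definite precision matrix Q, with Sigma = Q^(-1)
    G = rng.standard_normal((p, p))
    Q = G @ G.T + p * np.eye(p)
    Sigma = np.linalg.inv(Q)

    # Coefficients a_k = -q_kp/q_pp, k = 1, ..., p-1, as on p.2
    a = -Q[:-1, -1] / Q[-1, -1]

    # T = X_p - a_1 X_1 - ... - a_(p-1) X_(p-1): coefficient vector c on X
    c = np.append(-a, 1.0)

    print(np.allclose((Sigma @ c)[:-1], 0.0))        # True: cov(T, X_k) = 0 for k < p
    print(np.isclose(c @ Sigma @ c, 1 / Q[-1, -1]))  # True: var(T) = 1/q_pp

Since T and (X1, …, Xp-1) are jointly normal, zero covariance gives independence, exactly as the theorem on p.3 asserts.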
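A similar sketch (again mine, not part of the slides) checks the block version of pp. 5-7: with a random Σ partitioned as above, cov(X1, X2.1) vanishes, and the conditional parameters are μ2 + Σ21Σ11⁻¹(x1 − μ1) and Σ22.1.

    import numpy as np

    rng = np.random.default_rng(1)
    r, s = 3, 2

    G = rng.standard_normal((r + s, r + s))
    Sigma = G @ G.T + (r + s) * np.eye(r + s)   # a random covariance matrix
    mu = rng.standard_normal(r + s)

    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    mu1, mu2 = mu[:r], mu[r:]

    B = S21 @ np.linalg.inv(S11)                # X2.1 = X2 - B X1

    # cov(X1, X2.1) = Sigma12 - Sigma11 B^T = 0, as in the corrected proof on p.7
    print(np.allclose(S12 - S11 @ B.T, 0.0))    # True

    # Parameters of X2 | X1 = x1, from the theorem on p.6
    x1 = rng.standard_normal(r)
    cond_mean = mu2 + B @ (x1 - mu1)            # mu2 + Sigma21 Sigma11^(-1)(x1 - mu1)
    cond_cov = S22 - B @ S12                    # Sigma22.1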
The partitioned matrix inverse formula

Bishop's approach makes essential use of the following formula, which we need later anyway.

    [ A  B ]⁻¹     [  M          −MBD⁻¹           ]
    [ C  D ]    =  [ −D⁻¹CM      D⁻¹ + D⁻¹CMBD⁻¹ ] ,   where M = (A − BD⁻¹C)⁻¹.

This formula, which simplifies for symmetric matrices (i.e., when Bᵀ = C), permits us to interpret Σ22.1 as the inverse of Q22, where Q = Σ⁻¹ is partitioned in the same way as Σ. Check.
8

Conditional independence with normals

I turn now to the material in the paper "Gaussian Markov distributions over finite graphs". The setting there is a random vector X = (Xγ : γ ∈ C) indexed by a finite set C, which will later be given a graph structure. The covariance matrix of X is denoted by K, and for subsets a, b of C, I use the notation Xa, Xb, Ka,b, Ka = Ka,a for the restrictions of X and K to these subsets. Also, ab and a\b denote intersection and difference, respectively.

Proposition 1. For subsets a and b of C with a∪b = C, the following are equivalent:
(i) Ka,b = Ka,ab Kab⁻¹ Kab,b ;
(i′) Ka\b,b\a = Ka\b,ab Kab⁻¹ Kab,b\a ;
(ii) (K⁻¹)a\b,b\a = 0 ;
(iii) Xa and Xb are c.i. given Xab.

Corollary. Xα and Xβ are c.i. given X{α,β}′ iff K⁻¹(α,β) = 0. (Here c.i. abbreviates conditionally independent, and ′ denotes complement.)
9

Illustration

In the lecture I drew several figures illustrating the previous result. Partitioning C into the regions a\b, ab and b\a, they are best summarized as follows: Xa\b is independent of Xb\a given Xab.
10
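To close, two small numerical sketches of my own (all names hypothetical, assuming numpy). The first verifies the partitioned inverse formula on p.8 with A = Σ11, B = Σ12, C = Σ21, D = Σ22, and carries out the "Check" there: Σ22.1 is the inverse of Q22.

    import numpy as np

    rng = np.random.default_rng(2)
    r, s = 3, 2
    G = rng.standard_normal((r + s, r + s))
    Sigma = G @ G.T + (r + s) * np.eye(r + s)
    Q = np.linalg.inv(Sigma)                    # Q = Sigma^(-1)

    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    D_inv = np.linalg.inv(S22)

    # M = (A - B D^(-1) C)^(-1), then assemble the four blocks of the inverse
    M = np.linalg.inv(S11 - S12 @ D_inv @ S21)
    inv_by_formula = np.block([
        [M,                 -M @ S12 @ D_inv],
        [-D_inv @ S21 @ M,  D_inv + D_inv @ S21 @ M @ S12 @ D_inv],
    ])
    print(np.allclose(inv_by_formula, Q))       # True: the formula reproduces Q

    # The "Check" on p.8: Sigma22.1 is the inverse of Q22
    S22_1 = S22 - S21 @ np.linalg.inv(S11) @ S12
    print(np.allclose(S22_1, np.linalg.inv(Q[r:, r:])))  # True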
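The second sketch illustrates Proposition 1(i) and the corollary on p.9 in the smallest interesting case: a chain on three variables, where a zero entry of K⁻¹ encodes the conditional independence. The precision matrix below is an assumed example, not from the paper.

    import numpy as np

    # Precision matrix of a chain X1 - X2 - X3; note (K^(-1))(1,3) = 0
    P = np.array([[ 2., -1.,  0.],
                  [-1.,  2., -1.],
                  [ 0., -1.,  2.]])
    K = np.linalg.inv(P)                        # covariance matrix K

    a, b, ab = [0, 1], [1, 2], [1]              # a u b = C; ab is the intersection

    # Proposition 1(i): K_{a,b} = K_{a,ab} K_{ab}^(-1) K_{ab,b}
    lhs = K[np.ix_(a, b)]
    rhs = K[np.ix_(a, ab)] @ np.linalg.inv(K[np.ix_(ab, ab)]) @ K[np.ix_(ab, b)]
    print(np.allclose(lhs, rhs))                # True: X1 and X3 are c.i. given X2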