Introduction to multivariate statistics
Terry Speed, SICSA Summer School Statistical Inference in Computational Biology, Edinburgh, June 14-15, 2010
Lecture 1

Use of probability distributions

I will be presenting a view of multivariate statistics based on probability distributions. Although I will only be discussing theory, and not inference, the use of probability distributions underlies analysis methods which regard any data as having been generated by a (multivariate) statistical model. In the CS literature, this is referred to as the use of generative models. Bayesian methods, and likelihood methods more generally, use probability distributions, while many other methods, e.g. principal components analysis (PCA), neural networks (NN), and most clustering methods, do not, at least in their standard forms. However, a good proportion of the methods that do not can in fact be re-phrased in terms of probability models, including PCA, NN and clustering. Some people argue that the probability model paradigm is a big deal. I tend to agree, but donʼt make too much of this claim. Nonsense can be written by those who make use of the alleged power of probability models, and excellent work can be done by those who use non-probabilistic methods.

Two pure classes of multivariate models

It is my observation that there are basically two pure classes of multivariate probability models: the discrete, and the normal (Gaussian). These can be, and are, extended and combined in many creative ways, by making use of other probability distributions such as the geometric, exponential, binomial, multinomial, Dirichlet, Poisson, and negative binomial, and by using general probability ideas such as independence, conditioning and mixing. While some other distributions have apparent multivariate analogues, these usually carry heavy constraints, and are little more than univariate distributions extended and combined together. There are exceptions to this statement, notably some bivariate and hyperbolic distributions. By contrast, the discrete and the normal are truly multivariate, and have much greater flexibility, though of course multivariate normals are constrained by their form. For this reason, I plan to focus on discrete and normal multivariate models today and tomorrow.

Comment

Someone asked whether I regarded the Dirichlet as a truly multivariate distribution. I answered no. Dirichlets can be simulated from i.i.d. gamma random variables, by conditioning on the sum. In this sense, they are variants on i.i.d. random variables; see the remark in the previous section. There is a link between gamma and normal random variables, but it is not a primary relationship (speaking loosely).

The multivariate normal: Approach 1

Iʼll begin with multivariate normal (Gaussian) distributions in p dimensions. There are two main approaches to defining them and deriving the basic facts.

Approach 1: Start with the univariate normal, and say that a random (column) p-vector X has a p-variate normal distribution if for every p-vector of constants a, the linear combination aᵀX has a univariate normal distribution. This approach yields a few important results easily, e.g. the one asserting that an arbitrary linear combination of normals is again normal. (Iʼm dropping "multivariate" from now on.) But some basic results require the use of characteristic functions, which are not, in my view, elementary. Accordingly, I wonʼt pursue this approach. Those of you who are interested can see it further developed in the excellent book Multivariate Analysis by K Mardia, J Kent and J Bibby, Academic Press.
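
A minimal numerical sketch of the Approach 1 definition, assuming Python with numpy and scipy (the names mu, Sigma, a and the dimension p below are illustrative choices, not values from the lecture): simulate a p-variate normal X and check empirically that aᵀX behaves like a univariate normal, with the mean and variance one would expect (aᵀμ and aᵀΣa).

```python
# Sketch only: simulate a p-variate normal X and check empirically that the
# linear combination a^T X looks univariate normal, with the mean and variance
# implied by the mean vector mu and covariance matrix Sigma chosen below.
# (mu, Sigma, a and p are illustrative choices, not values from the lecture.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)          # a positive definite covariance matrix

X = rng.multivariate_normal(mu, Sigma, size=20000)   # rows are draws of X
a = np.array([0.7, -1.3, 2.0])                       # an arbitrary p-vector
y = X @ a                                            # samples of a^T X

print("sample mean:", y.mean(), " theory:", a @ mu)
print("sample var :", y.var(ddof=1), " theory:", a @ Sigma @ a)
# D'Agostino-Pearson test of univariate normality; no systematic rejection expected
print("normality p-value:", stats.normaltest(y).pvalue)
```
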
Comment about multivariate normality

It is far from true that a multivariate distribution is normal if all the marginals are univariate normal. The requirement in Approach 1 says much more: all linear combinations must be univariate normal. Not only is this hard to check in practice, it is hard to satisfy. Having said this, I observed that it is unreasonable to expect a set of data to fit the multivariate normal in every respect. Useful results can be obtained from approximations in this context, as in others.

The multivariate normal: Approach 2

A probability density in p dimensions is called normal and centered at the origin if it has the form φ(x) = γ⁻¹ exp{-½ q(x)}, where γ is a normalizing constant and q(x) = Σj Σk qjk xj xk = xᵀQx is a quadratic form. (Note: many people are more explicit here.) A normal density centered at μ = (μ1, …, μp) is given by φ(x - μ). Elementary arguments show that no diagonal element of Q can vanish. Define Σ = var(X) = E{(X - EX)(X - EX)ᵀ}. Mostly, weʼll center at 0. The details relating Q and Σ which follow can be found in An Introduction to Probability Theory and Its Applications, Volume 2, by W Feller, Wiley, Chapter III, Section 6. See also Pattern Recognition and Machine Learning by CM Bishop, Springer, Section 2.3, for a parallel but slightly different development.

The multivariate normal: Marginal densities

Introduce the transformation y1 = x1, …, yp-1 = xp-1, yp = q1p x1 + … + qpp xp. It can be seen that q(x) - yp²/qpp is a quadratic form in x1, …, xp-1 not involving xp. Thus q(x) = yp²/qpp + q*(y), where q*(y) is a quadratic form in y1, …, yp-1. This shows that the vector Y = AX, with A the matrix of this transformation, has a normal density which factors into two normal densities, one for Yp and one for (Y1, …, Yp-1).

Theorem. All marginal densities of a normal density are again normal.

A simple inductive argument (see Feller for full details) based on the transformation above shows that there is a matrix C with positive determinant such that Z = CX has components which are mutually independent normal random variables.

The multivariate normal: a basic fact

Theorem. The matrices Q and Σ are mutually inverse, and γ² = (2π)^p |Σ|.

Proof. Put D = E(ZZᵀ) = CΣCᵀ. This is a matrix with diagonal elements E(Zj²) = σj² and zeros off the diagonal. The density of Z is the product of the univariate normal densities σj⁻¹ n(xj/σj), where n is the standard normal density, and hence is induced by the matrix D⁻¹ with diagonal elements σj⁻². Now the density of Z is obtained from that of X by the substitution x = C⁻¹z and multiplication by the determinant |C⁻¹|. Thus zᵀD⁻¹z = xᵀQx and (2π)^p |D| = γ² |C|². It follows that Q = CᵀD⁻¹C, and this implies Q = Σ⁻¹. It also follows that |D| = |Σ| |C|², and hence γ² = (2π)^p |Σ|.

The multivariate normal: Covariance matrices

With this we see that factorization of Σ corresponds to factorization of Q, and hence the very important

Corollary. If (X1ᵀ, X2ᵀ)ᵀ is normally distributed, then X1 and X2 are independent iff cov(X1, X2) = 0, that is, iff X1 and X2 are uncorrelated.

A second important fact is the following

Theorem. A matrix Σ is the covariance matrix of a normal density iff Σ is positive definite. Equivalently, a matrix Q induces a normal density by the formula given earlier iff it is positive definite.

The proof is a simple induction; see Feller, cited above. (Of course there are less elementary proofs of this fact.)
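
A minimal numerical check of the basic fact above, assuming Python with numpy and scipy (the covariance Sigma below is an arbitrary illustrative choice): compute Q = Σ⁻¹ and γ = √((2π)^p |Σ|) directly, and compare the resulting log density of γ⁻¹ exp{-½ xᵀQx} with scipy's multivariate normal log density.

```python
# Sketch only: verify numerically that, for the density gamma^{-1} exp{-1/2 x^T Q x},
# taking Q = inverse of Sigma and gamma^2 = (2*pi)^p * det(Sigma) reproduces the
# usual multivariate normal density (checked here against scipy's implementation).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)                            # positive definite covariance
Q = np.linalg.inv(Sigma)                                   # inverse covariance matrix
gamma = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))   # normalizing constant

x = rng.standard_normal(p)                                 # an arbitrary point
log_phi = -np.log(gamma) - 0.5 * x @ Q @ x                 # log of gamma^{-1} exp{-1/2 x^T Q x}
log_ref = multivariate_normal(mean=np.zeros(p), cov=Sigma).logpdf(x)
print(log_phi, log_ref)                                    # agree to numerical precision
```
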
Elaboration on the Corollary above

Suppose that φ1(x1) = γ1⁻¹ exp{-½ q1(x1)} and φ2(x2) = γ2⁻¹ exp{-½ q2(x2)} are two independent normal densities with inverse covariance matrices Q1 and Q2 and covariance matrices Σ1 and Σ2 respectively. Then the product density φ(x1, x2) = (γ1γ2)⁻¹ exp{-½ [q1(x1) + q2(x2)]} has inverse covariance matrix Q and covariance matrix Σ, where

    Q = [ Q1  0  ]          Σ = [ Σ1  0  ]
        [ 0   Q2 ]    and       [ 0   Σ2 ] .

It is clear that if Σ can be partitioned as on the right, i.e. the two components are uncorrelated, then so can Q, and so independence can be inferred.
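
A small sketch of this elaboration, again assuming Python with numpy and scipy (the block sizes and matrices are illustrative): build a block-diagonal Σ from two covariance blocks, confirm that Q = Σ⁻¹ is block diagonal as well, and check that the joint log density equals the sum of the two marginal log densities, i.e. the density factors.

```python
# Sketch only: with Sigma block diagonal (uncorrelated blocks X1 and X2), the
# inverse covariance Q is also block diagonal, and the joint normal density
# factors into the product of the densities of X1 and X2 (independence).
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def random_spd(k):
    """Return an arbitrary symmetric positive definite k x k matrix."""
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

Sigma1, Sigma2 = random_spd(2), random_spd(3)
Sigma = block_diag(Sigma1, Sigma2)              # cov(X1, X2) = 0
Q = np.linalg.inv(Sigma)
print(np.allclose(Q[:2, 2:], 0))                # True: Q is block diagonal as well

x = rng.standard_normal(5)
log_joint = multivariate_normal(mean=np.zeros(5), cov=Sigma).logpdf(x)
log_m1 = multivariate_normal(mean=np.zeros(2), cov=Sigma1).logpdf(x[:2])
log_m2 = multivariate_normal(mean=np.zeros(3), cov=Sigma2).logpdf(x[2:])
print(np.isclose(log_joint, log_m1 + log_m2))   # True: the density factors
```
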