Postgraduate Lectures – MSSL/UCL
A multivariate analysis primer
Ignacio Ferreras
i.ferreras@ucl.ac.uk
May 3, 2016

Introduction

Multivariate analysis deals with the study of datasets where several observations are made for each system within a sample. We can assume these observations depend on a number of parameters. For instance, in astrophysics:

+ Several photometric/spectroscopic measurements of a sample of stars.
+ The analysis of the surface brightness distributions of galaxies.
+ Separation of components from a multi-frequency survey of the CMB.

In general, the goal is to:

• Find the relationship between the data and the parameters (regression).
• Arrange the data into a reduced set of classes (classification).
• Reduce the dimensionality, or determine the driving parameter(s).
• Reduce the noise of the observables based on the statistical properties.

An example

Consider a simple example with three observables: the mass, luminosity and size of spheroidal stellar systems. In this 3D parameter space the observations populate a lower-dimensional space, revealing some relationship. In this case the virial theorem (blue plane) gives a good (although not complete) explanation of the relation. The question is: in general, can we use the data alone to inform us of possible physical correlations in a complex set of observations? (Tollerud et al. 2011, ApJ, 726, 108)

Probability Distribution

We can consider the observations and the sources as originating from a probability density function (pdf), such that a given observation has a probability $p(x_0)\,dx$ of being measured in the interval $[x_0, x_0+dx]$, where $p(x)$ is the probability density function. It can also be given by its cumulative version:

$$F(<x_0) = \int_{-\infty}^{x_0} p(\xi)\,d\xi.$$

The expected value of an arbitrary function of the variable, $g(x)$, is:

$$\langle g \rangle = E(g(x)) = \int_{-\infty}^{+\infty} g(x)\,p(x)\,dx,$$

and the uncertainty interval of $x$ at the confidence level $[c_1, c_2]$ (e.g. $c_1 = 0.05$, $c_2 = 0.95$ for the 90% C.L.) is $[x_1, x_2] = [F^{-1}(c_1), F^{-1}(c_2)]$.

Typical distributions

Binomial: The probability of an event succeeding ($q$) or failing ($1-q$). After $n$ trials, the probability of $k$ successes is:

$$p_k(n; q) = \binom{n}{k}\,q^k\,(1-q)^{n-k}.$$

Poisson: The limit of the binomial distribution when $q \to 0$ but $nq \equiv \lambda$ is finite. The probability of detecting $k$ events (e.g. in a fixed interval of time) is:

$$p_k(\lambda) = \frac{\lambda^k e^{-\lambda}}{k!}.$$

Gaussian: The central limit theorem states that a sum of random variables, with finite variance, will approach the Gaussian distribution, defined by two parameters: mean ($\mu$) and standard deviation ($\sigma$):

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].$$

“chi-squared” ($\chi^2$): A sum of $\nu$ squared, centrally-distributed Gaussian variables with unit variance:

$$p_\nu(x) = \frac{x^{\nu/2-1}\,e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}.$$

This is used (and abused!) in model fitting, where one has a set of observed quantities $\{x_i\}$ with uncertainties $\{\sigma_i\}$ and a model $y(x; \pi_j)$ that explains those data with a set of parameters $\{\pi_j\}$. The $\chi^2$ is formed as follows:

$$\chi^2(\pi_j) = \sum_i \left[\frac{x_i - y(x_i; \pi_j)}{\sigma_i}\right]^2,$$

and the likelihood related to the parameters is therefore:

$$\mathcal{L}(\pi_j\,|\,x) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma_i}\,e^{-[x_i - y(x_i;\pi_j)]^2/2\sigma_i^2} \propto e^{-\chi^2(\pi_j)/2}.$$
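As a concrete illustration of the $\chi^2$ statistic and the associated Gaussian likelihood, the minimal sketch below fits a one-parameter straight-line model to hypothetical data on a parameter grid. All names and numbers are illustrative assumptions, not part of the lectures.

```python
import numpy as np

# Hypothetical data: a noisy straight line y = a*x with known per-point errors.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 25)
sigma = 0.5 * np.ones_like(x)
y_obs = 2.0 * x + rng.normal(0.0, sigma)

def chi2(a):
    """chi^2(a) = sum_i [(y_i - a*x_i) / sigma_i]^2"""
    return np.sum(((y_obs - a * x) / sigma) ** 2)

# Evaluate chi^2 on a grid of the single parameter a, and the likelihood
# L(a) ~ exp(-chi^2/2) (up to a normalisation constant).
a_grid = np.linspace(1.8, 2.2, 401)
chi2_grid = np.array([chi2(a) for a in a_grid])
like = np.exp(-0.5 * (chi2_grid - chi2_grid.min()))   # rescaled for numerical safety

best = a_grid[np.argmin(chi2_grid)]
print(f"best-fit a = {best:.3f}, chi2_min = {chi2_grid.min():.1f} for N = {x.size} points")
```

A good fit should give a minimum $\chi^2$ of order the number of data points, as discussed later in the multivariate-likelihood slides.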
Distribution Moments

The pdf can be coded into a set of numbers (statistics) that are commonly used to describe the distribution. The $n$th order moment of a distribution is defined as the expected value of the $n$th power of the variable:

$$M_n \equiv E(x^n) = \int_{-\infty}^{+\infty} x^n\,p(x)\,dx. \qquad (1)$$

• $n=0$: Normalisation, should always be $M_0 = 1$.
• $n=1$: Average. For a Gaussian, $M_1 = \mu$.
• $n=2$: Related to the “width” of the distribution; for a Gaussian, $M_2 = \sigma^2 + \mu^2$.

Higher order moments give more information about the shape of the pdf. A Gaussian is uniquely defined by $(\mu, \sigma)$, since all its cumulants of order higher than two vanish.

Higher order moments

It is practical to remove the mean ($\mu \equiv M_1$) from the data to compute the moments. These are the central moments:

$$\mu_n = \int_{-\infty}^{+\infty} (x-\mu)^n\,p(x)\,dx. \qquad (2)$$

Two important higher-order moments are used to explore non-Gaussianity:

Skewness (many similar definitions): the third standardised moment, $\gamma_1 \equiv \mu_3/\mu_2^{3/2}$. It indicates deviation from symmetry about the mean ($\gamma_1 = 0$ for a Gaussian distribution).

Kurtosis: $\gamma_2 \equiv (\mu_4/\mu_2^2) - 3$ represents the “degree of peakiness” ($\gamma_2 = 0$ for a Gaussian).

Skewness & Kurtosis

[Figure: example distributions illustrating skewness and kurtosis.]

Extending to more variables

We will extend our measurements from one random variable, $x$, to a set of $n$ variables, written as a column vector (hence the transpose symbol in the definition):

$$\mathbf{x}^{\rm T} = (x_1\ x_2\ x_3\ \cdots\ x_n). \qquad (3)$$

We can thus define a mean vector:

$$\mathbf{m} = E(\mathbf{x}), \qquad (4)$$

and the correlation matrix:

$$r_{ij} = E(x_i x_j) = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} x_i\,x_j\,p(x_i, x_j)\,dx_i\,dx_j, \qquad (5)$$

or, in vector notation:

$$\mathbf{R} = E(\mathbf{x}\mathbf{x}^{\rm T}). \qquad (6)$$

The covariance

The correlation matrix is the extension of the second order moment to a set of $n$ random variables. Analogously, we can define the central second order moment if we subtract the mean vector. This is the covariance matrix:

$$\mathbf{C} \equiv E[(\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^{\rm T}], \qquad (7)$$

the covariances being each of the crossed (off-diagonal) terms in the matrix:

$$c_{ij} = E[(x_i - m_i)(x_j - m_j)], \qquad (8)$$

which trivially reduces to the individual variances for the diagonal terms of the covariance matrix. The off-diagonal terms satisfy:

$$|c_{ij}| \le \sigma_i\,\sigma_j. \qquad (9)$$

The equality holds when $x_i$ and $x_j$ are fully correlated.

Cross-covariance

These definitions can be extended when considering two different random vectors $\mathbf{x}$, $\mathbf{y}$:

Cross-correlation: $\mathbf{R}_{xy} = E[\mathbf{x}\mathbf{y}^{\rm T}]$
Cross-covariance: $\mathbf{C}_{xy} = E[(\mathbf{x}-\mathbf{m}_x)(\mathbf{y}-\mathbf{m}_y)^{\rm T}]$

Correlations and covariances measure the dependence between the random variables using their second-order statistics. Examples:

I. Autocorrelation function of galaxies to find clustering according to galaxy types.
II. Cross-correlation of QSOs and galaxies to look for connections between QSO activity and galaxy formation.

Multivariate Normal distribution

It is the equivalent of the univariate (single-variable) Gaussian distribution, where the mean becomes a vector ($\boldsymbol{\mu}$) and the variance becomes a tensor (the covariance $\boldsymbol{\Sigma}$):

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\sqrt{\det\boldsymbol{\Sigma}}}\,\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\rm T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]. \qquad (10)$$

• Only first and second order statistics are needed.
• Linear transformations are Gaussian.
• Marginal and conditional densities are Gaussian.
• The contours of fixed probability are $n$-dimensional hyperellipsoids centred at $\mathbf{m}_x$.

Multivariate Gaussian Density

The covariance matrix is symmetric and positive definite, which means we can find a rotation (defined by an orthogonal matrix $\mathbf{E}$) such that:

$$\mathbf{C}_x = \mathbf{E}\mathbf{D}\mathbf{E}^{\rm T} = \sum_{i=1}^{n} \lambda_i\,\mathbf{e}_i\mathbf{e}_i^{\rm T}, \qquad (11)$$

with $\mathbf{D}$ a diagonal matrix whose elements $\{\lambda_i\}$ are the variances of the rotated components $\mathbf{e}_i$.
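A minimal numerical check of the decomposition $\mathbf{C}_x = \mathbf{E}\mathbf{D}\mathbf{E}^{\rm T}$, assuming a hypothetical 2D correlated Gaussian sample: rotating the centred data by the eigenvectors of the covariance yields decorrelated components whose variances are the eigenvalues.

```python
import numpy as np

# Sketch: verify C = E D E^T on a 2D Gaussian sample by rotating the data
# with the eigenvectors of the covariance matrix.
rng = np.random.default_rng(1)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])             # symmetric, positive definite
x = rng.multivariate_normal(mean, cov, size=100_000)

C = np.cov(x, rowvar=False)              # sample covariance matrix
lam, E = np.linalg.eigh(C)               # eigenvalues (D) and eigenvectors (E)

y = (x - x.mean(axis=0)) @ E             # project on to the eigenvectors
print(np.cov(y, rowvar=False).round(3))  # ~diag(lam): off-diagonal terms vanish
print(lam.round(3))
```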
The figure shows the hyperellipsoid corresponding to the equation:

$$(\mathbf{x}-\mathbf{m}_x)^{\rm T}\mathbf{C}_x^{-1}(\mathbf{x}-\mathbf{m}_x) = \mathrm{constant}. \qquad (12)$$

Quoting uncertainties

For instance, the quoting of uncertainties often reduces to giving some “$1\sigma$” level, which – if the pdf of the measurement is Gaussian – tells us that the observation $\pm 1\sigma$ defines an interval within which the true value should lie with 68% probability, that the interval $\pm 2\sigma$ contains the true value with probability 95%, etc. Sometimes the pdf is known in detail, and one can quote the non-Gaussian confidence levels, or even show a contour map of the pdf for the parameters considered. The figure shows a typical example, with the 68, 90 and 95% confidence levels for the estimate of the age of the stellar populations in a galaxy from its spectroscopic data. (Ferreras & Yi, 2004, MNRAS, 350, 1322)

Estimation Theory

Normally, we do not have access to the probability density function (pdf). We define a set of “estimators” that allow us to determine the underlying properties of the pdf. If we take a set of $N$ independent measurements, say the length of a rod, we define a data vector:

$$\mathbf{x}^{\rm T} = (x_1\ x_2\ \cdots\ x_N). \qquad (13)$$

Typical estimators for the mean and variance in the $p = 1$ (univariate) case are:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad (14)$$

$$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left[x_i - \hat{\mu}\right]^2. \qquad (15)$$

Likewise for the $p$-dimensional multivariate case:

$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i, \qquad (16)$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N-1}\sum_{i=1}^{N} \left[\mathbf{x}_i - \hat{\boldsymbol{\mu}}\right]\left[\mathbf{x}_i - \hat{\boldsymbol{\mu}}\right]^{\rm T}. \qquad (17)$$

The standard notation is to define $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as the population mean and covariance, and $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ as the sample mean and covariance. Note that sample and population values may differ if the observed data set is biased. Also beware that this result relies on the underlying distribution being Gaussian. You may also come across the scattering matrix, which is a scaled version of the covariance:

$$\tilde{\boldsymbol{\Sigma}} = (N-1)\,\hat{\boldsymbol{\Sigma}} = \sum_{i=1}^{N} \left[\mathbf{x}_i - \hat{\boldsymbol{\mu}}\right]\left[\mathbf{x}_i - \hat{\boldsymbol{\mu}}\right]^{\rm T}. \qquad (18)$$

Multivariate likelihood

In multivariate model fitting, a common approach is to compare a set of $N$ observations ($y_i$) to model predictions according to a parameter, or a set of parameters, $f(x_i; \pi_j)$, with the likelihood:

$$\mathcal{L}(\pi_j) \propto e^{-\chi^2(\pi_j)/2}, \qquad (19)$$

where $\chi^2$ is the standard comparison between observations and model, scaled by the uncertainties ($\sigma_i$):

$$\chi^2(\pi_j) \equiv \sum_{i=1}^{N} \left[\frac{y_i - f(x_i; \pi_j)}{\sigma_i}\right]^2. \qquad (20)$$

A good model fit to the data gives a minimum $\chi^2$ of order $N$. If $\chi^2_{\rm MIN} \gg N$ the model does not describe the data well (or the errors have been underestimated), and any result from this likelihood should be discarded; $\chi^2_{\rm MIN} \ll N$ implies the uncertainties must have been overestimated.

If we use $\mathcal{L}(\pi_j)$ as a PDF, we can follow a Bayesian approach for the derivation of parameters and their uncertainties. However, note that this assumption implies that all the $N$ measurements, $\{y_i\}$, are uncorrelated, i.e. the “variance” attached to, say, the $i$-th measurement is only

$$\sigma_i^2 = \Sigma_{ii}, \qquad \Sigma_{ij} = 0 \ \ {\rm for}\ i \ne j, \qquad (21)$$

where $\boldsymbol{\Sigma}$ is the $N \times N$ covariance matrix of the measurements. For correlated data sets, the definition of $\chi^2$ is:

$$\chi^2(\boldsymbol{\pi}) \equiv \left(\mathbf{y} - \mathbf{f}(\mathbf{x}, \boldsymbol{\pi})\right)^{\rm T}\boldsymbol{\Sigma}^{-1}\left(\mathbf{y} - \mathbf{f}(\mathbf{x}, \boldsymbol{\pi})\right), \qquad (22)$$

and in this case the confidence levels from the use of $\mathcal{L}(\boldsymbol{\pi})$ as a PDF will change with respect to the uncorrelated case (eq. 20), according to the level of correlation.
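A minimal sketch of the generalised $\chi^2$ of eq. (22), with the residuals weighted by the inverse of a full covariance matrix, compared to the diagonal-only case of eq. (20). The data, model and correlation coefficient are hypothetical.

```python
import numpy as np

# Sketch of the generalised chi^2: r^T Sigma^{-1} r, with r = y_obs - y_model.
def chi2_general(y_obs, y_model, cov):
    r = y_obs - y_model
    return float(r @ np.linalg.solve(cov, r))   # solve() avoids an explicit inverse

# Hypothetical example: 4 correlated measurements of a constant signal.
y_obs   = np.array([1.1, 0.9, 1.2, 1.0])
y_model = np.full(4, 1.0)
sigma   = 0.1 * np.ones(4)
rho     = 0.5                                   # assumed correlation between neighbours
cov = np.diag(sigma**2)
for i in range(3):
    cov[i, i+1] = cov[i+1, i] = rho * sigma[i] * sigma[i+1]

chi2_corr = chi2_general(y_obs, y_model, cov)
chi2_diag = np.sum(((y_obs - y_model) / sigma) ** 2)   # eq. (20), ignoring correlations
print(f"chi2 with covariance: {chi2_corr:.2f}; diagonal-only chi2: {chi2_diag:.2f}")
```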
Linear Discriminant Analysis

This technique (pioneered by Fisher) rests on the definition of a function that has to be maximised for an optimal classification of the data points into classes. The simplest case corresponds to two classes ($\{c_1, c_2\}$) in a 2D observable space (e.g. we have a set of galaxies for which we measure mass and size, and we split them according to two morphological types). The goal is to find a projection on a line such that both classes are well separated. We can easily extend this analysis to a higher dimensional space by using vector notation.

This method needs a “training sample” already associated with either of the two classes. A hierarchical method can be built up from this method. We start with $n_1$ data points for class $c_1$ and $n_2$ points for class $c_2$. For an arbitrary direction, given by a unit vector $\mathbf{v}$, the projections of the data points are given by $\mathbf{v}^{\rm T}\mathbf{x}_i$. We can define two different means for each class ($j = \{1, 2\}$):

$$\mathbf{m}_j = \frac{1}{n_j}\sum_{i \in c_j} \mathbf{x}_i, \qquad m_j^v = \frac{1}{n_j}\sum_{i \in c_j} \mathbf{v}^{\rm T}\mathbf{x}_i = \mathbf{v}^{\rm T}\mathbf{m}_j.$$

The first one is a vector quantity, giving the average position of the $j$-th class. The second one is a scalar quantity, representing the average of the projections on to $\mathbf{v}$. The linear discriminant that we need to maximise is:

$$J(\mathbf{v}) = \frac{(m_1^v - m_2^v)^2}{s_1^2(\mathbf{v}) + s_2^2(\mathbf{v})},$$

where $s_j^2(\mathbf{v})$ is the scatter measured within the projections of the $j$-th class, i.e. it is related to the sample variance restricted to the data points in that class:

$$s_j^2(\mathbf{v}) = \sum_{i \in c_j} \left(\mathbf{v}^{\rm T}\mathbf{x}_i - m_j^v\right)^2.$$

Note that in this case there is no factor $1/(n_j - 1)$ in the definition. We can write the discriminant in vector notation:

$$J(\mathbf{v}) = \frac{\mathbf{v}^{\rm T}\tilde{\boldsymbol{\Sigma}}_B\,\mathbf{v}}{\mathbf{v}^{\rm T}\tilde{\boldsymbol{\Sigma}}_W\,\mathbf{v}},$$

where $\tilde{\boldsymbol{\Sigma}}_B$ and $\tilde{\boldsymbol{\Sigma}}_W$ are the scatter matrices between classes and within classes, respectively:

$$\tilde{\boldsymbol{\Sigma}}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^{\rm T},$$

$$\tilde{\boldsymbol{\Sigma}}_W = \tilde{\boldsymbol{\Sigma}}_1 + \tilde{\boldsymbol{\Sigma}}_2 = \sum_{i \in c_1} (\mathbf{x}_i - \mathbf{m}_1)(\mathbf{x}_i - \mathbf{m}_1)^{\rm T} + \sum_{i \in c_2} (\mathbf{x}_i - \mathbf{m}_2)(\mathbf{x}_i - \mathbf{m}_2)^{\rm T}.$$

I leave as an exercise the derivation of the final result: the direction $\mathbf{v}$ that maximises the discriminant satisfies:

$$\tilde{\boldsymbol{\Sigma}}_B\,\mathbf{v} = \lambda\,\tilde{\boldsymbol{\Sigma}}_W\,\mathbf{v},$$

which is a generalized eigenvalue problem. If $\tilde{\boldsymbol{\Sigma}}_W$ has an inverse, we can convert this to an eigenvalue problem:

$$\tilde{\boldsymbol{\Sigma}}_W^{-1}\tilde{\boldsymbol{\Sigma}}_B\,\mathbf{v} = \lambda\,\mathbf{v}.$$

Finally, since for any direction $\mathbf{v}$ the transformed vector $\tilde{\boldsymbol{\Sigma}}_B\,\mathbf{v}$ is always collinear with $(\mathbf{m}_1 - \mathbf{m}_2)$, we can solve the eigenvalue equation:

$$\mathbf{v} = \tilde{\boldsymbol{\Sigma}}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2).$$

Clustering analysis

One can use the observed data to classify the sources into different sets. These classes are defined from the statistical properties within the whole set, but they may reflect an underlying connection with the physical processes of the systems under study. Clustering analyses can be classified as:

• Supervised/Unsupervised
• Hierarchical/Non-hierarchical

The concept of clustering relies on a definition of a distance in $p$-dimensional parameter space. Note these parameters can be comparable (X, Y, Z distances) or not at all (RA, Dec, redshift, luminosity).

Distance in parameter space

A generalisation of the Euclidean distance is the Minkowski metric. The distance between two $p$-dimensional points is:

$$D(\mathbf{x}_i, \mathbf{x}_j) = \left(\sum_{k=1}^{p} |x_{i,k} - x_{j,k}|^m\right)^{1/m}. \qquad (23)$$

The Euclidean case corresponds to $m = 2$. Other choices are $m = 1$ (Manhattan distance) or $m \to \infty$ (Chebyshev distance). A short numerical sketch of these choices follows below.
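A small sketch of the Minkowski distance of eq. (23) for different orders $m$, evaluated between two hypothetical points in a 3D parameter space.

```python
import numpy as np

# Minkowski distance D(x, y) = (sum_k |x_k - y_k|^m)^(1/m) for a few orders m.
def minkowski(x, y, m):
    return np.sum(np.abs(x - y) ** m) ** (1.0 / m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

print("Manhattan (m=1):        ", minkowski(x, y, 1))
print("Euclidean (m=2):        ", minkowski(x, y, 2))
print("m=10 (towards Chebyshev):", minkowski(x, y, 10))
print("Chebyshev (m -> inf):   ", np.max(np.abs(x - y)))
```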
When the different parameters have very different ranges, it is often – but not always! – advisable to re-define them, scaling the parameters with respect to their variance (and also offsetting them to have zero mean):

$$z_{i,j} = \frac{x_{i,j} - \hat{x}_j}{\sqrt{\hat{\Sigma}_{jj}}}. \qquad (24)$$

In addition, those parameters with a large range of variation should be re-scaled by taking the logarithm.

A further approach, taking into account the covariance of the data, leads to the Mahalanobis distance:

$$D(\mathbf{x}_i, \mathbf{x}_j) = \left[(\mathbf{x}_i - \mathbf{x}_j)^{\rm T}\,\hat{\boldsymbol{\Sigma}}^{-1}\,(\mathbf{x}_i - \mathbf{x}_j)\right]^{1/2}. \qquad (25)$$

When the dataset is decorrelated (diagonal covariance), this distance reduces to the Euclidean case weighted by the variances.

Clustering

Once a distance is defined, the clustering proceeds by agglomerative clustering in a hierarchical way, starting with one class per data point. Two nearby data points are merged into one class following some specific threshold (e.g. the closest pair in the whole data set), continuing the process until one class engulfs all data points. The procedure can be visualized as a tree or dendrogram. However, this process needs a definition of the distance between a cluster ($C$, made up of points $\{p_1, p_2, \cdots, p_j\}$) and a new point ($p_k$):

Friends-of-friends (single linkage): $d_{Ck} = \min(d_{1k}, d_{2k}, \cdots, d_{jk})$
Complete linkage: $d_{Ck} = \max(d_{1k}, d_{2k}, \cdots, d_{jk})$
Average linkage: $d_{Ck} = \frac{1}{j}\sum_{i=1}^{j} d_{ik}$

Clustering: k-means

A standard clustering procedure starts with a set of $k$ locations in $p$-dimensional parameter space, representing the centroids of $k$ classes. Data points are assigned to one of these classes, with the choice driven by a minimization of the sum of the squares of distances among points within the same class. The number of classes $k$ and the seed locations are chosen at startup.
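A minimal sketch of the k-means (Lloyd) iteration: assign each point to the nearest centroid, then move each centroid to the mean of its class. The two-blob dataset and all names are illustrative assumptions; no empty-cluster handling is included for brevity.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iteration: assignment step + centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # seed locations
    for _ in range(n_iter):
        # squared Euclidean distance of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Hypothetical data: two well-separated 2D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (100, 2)),
               rng.normal([4.0, 4.0], 0.5, (100, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))   # ~[0, 0] and ~[4, 4]
```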
The Information Bottleneck (IB)

There is a plethora of multivariate techniques aimed at blind source separation. Among the many, the Information Bottleneck (Slonim et al. 2000) is a good method to illustrate how to progressively build up common classes. Its methodology derives from clustering techniques that minimize a defined Euclidean distance in the $n$-dimensional parameter space spanned by the data vectors. So, if we have a set of $s$ classes $\{k_i\}_{i=1}^{s}$ to describe our data sample comprised of $N$ $n$-dimensional vectors $\{\mathbf{x}_j\}_{j=1}^{N}$, we can describe the probability of a class $k$ for a given data vector $\mathbf{x}$ by the use of Bayes’ theorem:

$$p(k|\mathbf{x}) \propto p(k)\,p(\mathbf{x}|k) = p(k)\,\frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\left[-\frac{1}{2\sigma^2}D_E^2(k, \mathbf{x})\right], \qquad (26)$$

where $p(k)$ is the prior of the class, and $D_E$ is defined as the Euclidean distance between the data vector and the class:

$$D_E^2(k, \mathbf{x}) = \sum_{s=1}^{n} \left[k_s - x_s\right]^2. \qquad (27)$$

Defining Classes

The notation will be clearer if we consider a specific example. Let us assume that we have a sample of galaxy spectra. Let us denote by $G$ the set of all galaxies in the sample, and by $\Lambda$ the set of wavelengths observed in each spectrum. The ensemble can be described by a joint probability $p(g, \lambda)$ denoting the probability of observing a photon with wavelength $\lambda \in \Lambda$ from galaxy $g \in G$ (it is necessary to normalize all spectra to unity so that they can be considered probability distributions). We assume a uniform prior on the galaxies: $p(g) = 1/N$, where $N$ is the total number of galaxies. The goal of the IB is to construct a set of classes $C$ that preserves the properties of the original sample, with a minimal number of classes and a minimal loss of information. The spectral information of class $c \in C$ is therefore:

$$p(\lambda|c) = \sum_{g} p(\lambda|g)\,p(g|c). \qquad (28)$$

Mutual Information I

Information is often quantified in terms of the entropy of the class. For the class of galaxies:

$$H(G) = -\sum_{g} p(g)\,\log p(g). \qquad (29)$$

If we include information about wavelengths, we can define a conditional entropy of the galaxies given the spectra:

$$H(G|\Lambda) = -\sum_{\lambda} p(\lambda)\sum_{g} p(g|\lambda)\,\log p(g|\lambda). \qquad (30)$$

The additional knowledge about the wavelength information can only result in less uncertainty in the knowledge of $G$. We can define the mutual information between $G$ and $\Lambda$ as:

$$I(G; \Lambda) \equiv H(G) - H(G|\Lambda) = \sum_{g, \lambda} p(g)\,p(\lambda|g)\,\log\frac{p(\lambda|g)}{p(\lambda)}. \qquad (31)$$

Mutual Information II

Mutual information between two random variables is therefore the amount of uncertainty in one variable that is removed by the knowledge of the other one. In our specific case, we can define the mutual information between the set of galaxies $G$ and the set of classes $C$ as:

$$I(C; G) = \sum_{c, g} p(g)\,p(c|g)\,\ln\frac{p(c|g)}{p(c)}. \qquad (32)$$

The mutual information is symmetric, non-negative, and zero if and only if both sets are independent. “No manipulation of the data can increase the amount of mutual information” (data processing inequality theorem). Hence, by grouping galaxies into classes, one can only lose information about the data:

$$I(C; \Lambda) \le I(G; \Lambda). \qquad (33)$$

The Information Bottleneck

The goal of the IB is then to find a set of classes $C$ that maximize the spectral information $I(C; \Lambda)$ under a constraint on $I(C; G)$. In essence, we pass the spectral information $I(G; \Lambda)$ through the bottleneck of the classes, which are forced to extract the relevant information between $G$ and $\Lambda$. The optimal classification has to maximise the functional:

$$\mathcal{L}[p(c|g)] = I(C; \Lambda) - \frac{1}{\beta}\,I(C; G), \qquad (34)$$

where $1/\beta$ is the Lagrange multiplier attached to the complexity constraint. If $\beta \to 0$ the classification is as non-informative as possible: one class for all galaxies. If $\beta \to \infty$ the complexity constraint vanishes and the description is as detailed as possible: one galaxy, one class. Varying the constraint allows us to probe the level of compactness of the data into simpler classes.

The Information Bottleneck

The maximisation of the functional in equation 34 gives:

$$p(c|g) = \frac{p(c)}{Z(g, \beta)}\,\exp\left[-\beta\,D_{\rm KL}(g||c)\right], \qquad (35)$$

where $Z(g, \beta)$ is the partition function and $D_{\rm KL}$ is the Kullback–Leibler divergence (relative entropy) between $g$ and $c$, defined by:

$$D_{\rm KL}(g||c) = \sum_{\lambda} p(\lambda|g)\,\ln\frac{p(\lambda|g)}{p(\lambda|c)}, \qquad (36)$$

analogous to the result using the Euclidean distance (equation 26). In practice, the IB method follows a hierarchical approach, starting with $C \equiv G$ and merging two classes in each step, checking that the mutual information $I(C; \Lambda)$ is maximally preserved. The iterative method stops when a target minimum number of classes (or a mutual information threshold) is reached.

The Information Bottleneck

An application of the IB to spectral data from the 2dF galaxy survey (Slonim et al. 2001, MNRAS, 323, 720). Just five components are needed to preserve most of the information (crosses in the left-hand panel). Notice that information from real data (2dF) is harder to “compress” into classes than mock samples from galaxy formation models.

The Information Bottleneck

[Figure: application of the IB method. (Ferreras 2012, IAUS, 284, 38)]
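A hedged sketch of the two information measures used above: the discrete mutual information (eqs. 31–32) and the Kullback–Leibler divergence (eq. 36), computed from a small hypothetical joint probability table $p(g, \lambda)$.

```python
import numpy as np

def mutual_information(p_joint):
    """I = sum_{g,lam} p(g,lam) log[ p(g,lam) / (p(g) p(lam)) ], in nats."""
    pg = p_joint.sum(axis=1, keepdims=True)
    pl = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (pg * pl)[mask])))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_lam p(lam) ln[ p(lam) / q(lam) ]."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical joint distribution over 3 "galaxies" and 4 "wavelength bins".
p_joint = np.array([[0.10, 0.05, 0.05, 0.05],
                    [0.05, 0.15, 0.05, 0.05],
                    [0.05, 0.05, 0.20, 0.15]])
print("I(G; Lambda) =", round(mutual_information(p_joint), 4), "nats")
print("D_KL(row0 || row1) =",
      round(kl_divergence(p_joint[0] / p_joint[0].sum(),
                          p_joint[1] / p_joint[1].sum()), 4))
```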
Blind Source Separation

The goal is to separate a set of data into their underlying components.

Example I: Dinner Party Problem
We invite a number of guests to a dinner party. They have $N$ independent conversations. We put $M$ microphones in the room, which record various linear superpositions (depending on their location within the room) of the conversations. Is it possible to disentangle the $M$ recordings into $N$ conversations?

Example II: The formation history of a galaxy
The spectrum of a galaxy represents a superposition of its stellar populations. They comprise all stars ever formed or incorporated in the galaxy (of course excluding remnants). Is it possible to disentangle those populations into a star formation history?

Blind Source Separation (cont’d)

Example III: Face Recognition
Algorithm to identify/classify faces by decomposing the information from a large dataset into “sources” that can cleanly discriminate facial features (no modelling).

Example IV: Response of the brain
In order to understand the processes inside the brain, NMR imaging is often used on people who are subject to stimuli. The spatio-temporal output is fed to some algorithm that separates the output into its key sources, so that one can relate the input stimuli to the region of the brain being activated.

Example V: Time series analysis
E.g. GRB light curves to be classified without any reference to a model, simply decomposed into their simpler sources by the statistical properties of a large sample of GRB data.

Signal Mixture (as a time series)

Let us denote by $\{x_j(t_k)\}$ the sequence of observables ($j = 1 \cdots N$), measured at a number of times ($k = 1 \cdots T$). The measurement process is simply a MIXTURE of the original variables $\{y_i(t_k)\}$ ($i = 1 \cdots N$) into the observations:

$$x_i(t_k) = \sum_j w_{ij}\,y_j(t_k) \;\Rightarrow\; \mathbf{x}(t) = \mathbf{W}\cdot\mathbf{y}(t)\ \left(+\ \mathrm{noise}\right). \qquad (37)$$

The matrix $\mathbf{W}^{-1}$ solves the problem. One can consider the statistical properties of the observations in order to find out about the matrix. For instance, one can consider choices of $\mathbf{W}$ that produce decorrelated components (Principal Component Analysis) or statistically independent components (Independent Component Analysis), or that reduce the mutual information among classes (Information Bottleneck).

... a tough problem to solve

In a Blind Source Separation problem, we do not have any information about the mixtures or about the underlying sources. The only data available is a (hopefully large) set of observations that are known/hoped to originate from a simple set of sources. We do not even know how many sources are responsible for the data. Often, a smaller number of sources can reliably reproduce the observations (data compression). Noise will be considered as an extra, additive component, i.e. by solving the problem one can “denoise” the data.

Uncorrelatedness

Two random vectors $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated if their cross-covariance matrix is a zero matrix:

$$\mathbf{C}_{xy} = 0 \;\Rightarrow\; \mathbf{R}_{xy} = \mathbf{m}_x\mathbf{m}_y^{\rm T}. \qquad (38)$$

One can also consider the case of uncorrelatedness within the components of a random vector $\mathbf{x}$:

$$\mathbf{C}_x = \mathbf{D} = \mathrm{diag}\!\left(\sigma_{x_1}^2\ \sigma_{x_2}^2\ \cdots\ \sigma_{x_n}^2\right), \qquad (39)$$

which is the essence of Principal Component Analysis (PCA). In particular, random vectors having zero mean and unit covariance (up to some constant variance $\sigma^2$) are said to be white:

$$\mathbf{m}_x = 0, \qquad \mathbf{R}_x = \mathbf{C}_x = \mathbf{I}. \qquad (40)$$

Exercise: Show that under an orthogonal transformation of an $n$-dimensional vector, $\mathbf{y} = \mathbf{T}\mathbf{x}$ with $\mathbf{T} \in SO(n)$, the transformed vector $\mathbf{y}$ remains white. A numerical check of this statement is sketched below.
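The following sketch is a numerical check of the exercise (not a proof): generate white data, apply a random rotation $\mathbf{T} \in SO(n)$, and verify that the mean and covariance of the transformed data are still $0$ and $\mathbf{I}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 200_000
x = rng.standard_normal((N, n))              # zero mean, unit covariance (white)

# Build an orthogonal matrix via QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
if np.linalg.det(Q) < 0:                     # force det = +1 so that T is in SO(n)
    Q[:, 0] *= -1

y = x @ Q.T                                  # y = T x applied to every sample
print("mean of y:", y.mean(axis=0).round(3))              # ~0
print("cov of y:\n", np.cov(y, rowvar=False).round(3))    # ~identity
```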
Statistical Independence

We can impose a stronger constraint on the data: two random variables $x$ and $y$ are said to be statistically independent if and only if:

$$p_{x,y}(x, y) = p_x(x)\,p_y(y), \qquad (41)$$

which implies that, for any functions of these variables:

$$E[g(x)h(y)] = E[g(x)]\,E[h(y)]. \qquad (42)$$

If both $x$ and $y$ are Gaussian distributions, uncorrelatedness and statistical independence are the same thing (remember a Gaussian distribution can be fully described by the first and second order moments).

Uncorrelatedness: equality of distributions up to the second order moments.
Independence: equality of distributions for all orders, $n = 1, \cdots, \infty$.

Testing for correlation

A simple example that shows us how two variables can be correlated is the following pdf – the 2D version of the previous definition of a multivariate Gaussian (eq. 10):

$$P(x, y\,|\,\sigma_x, \sigma_y, \rho) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\,\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho\,(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right\}. \qquad (43)$$

The correlation between $x$ and $y$ depends on the parameter $\rho$, disappearing as $\rho \to 0$. This is the correlation coefficient for two variables, defined as:

$$\rho = \frac{\mathrm{cov}[x, y]}{\sigma_x\sigma_y}. \qquad (44)$$

Testing for correlation (cont’d)

The figure shows the contours of the bivariate Gaussian pdf for two choices of $\rho$: a decorrelated case (blue) and a strongly correlated one (red). A typical estimator of correlation is given by the Pearson product-moment correlation coefficient:

$$r \equiv \frac{\sum_{i=1}^{N}(x_i - \langle x\rangle)(y_i - \langle y\rangle)}{\sqrt{\sum_{i=1}^{N}(x_i - \langle x\rangle)^2\,\sum_{i=1}^{N}(y_i - \langle y\rangle)^2}}, \qquad (45)$$

where $\langle\cdots\rangle$ denotes the average.

Testing for correlation (cont’d)

The contours of the previous figure drop from the maximum (at the origin) by a factor $e^{-1/2}$ at a distance $\mathbf{x}$ given by:

$$\mathbf{x}^{\rm T}\mathbf{C}^{-1}\mathbf{x} = 1, \qquad (46)$$

where the covariance matrix is:

$$\mathbf{C} = \begin{pmatrix} \sigma_x^2 & \sigma_x\sigma_y\rho \\ \sigma_x\sigma_y\rho & \sigma_y^2 \end{pmatrix}. \qquad (47)$$

We can use the standard estimator for the covariance term:

$$\mathrm{cov}[x, y] = \sigma_x\sigma_y\rho = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}). \qquad (48)$$

Beware of wrong parameter interpretation!

Anscombe’s quartet shows four sets of data with the same means, regression coefficients and correlation/covariance.
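A sketch of the Pearson product-moment correlation coefficient of eq. (45), evaluated on hypothetical data drawn from a correlated bivariate Gaussian.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of eq. (45)."""
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

rng = np.random.default_rng(3)
rho_true = 0.8
cov = np.array([[1.0, rho_true],
                [rho_true, 1.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=5000).T

print(f"true rho = {rho_true}, estimated r = {pearson_r(x, y):.3f}")
# As Anscombe's quartet warns, identical r values can hide very different
# relationships, so the scatter plot should always be inspected as well.
```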
Principal Component Analysis

Consider a sample of $N$ objects with $n$ parameters measured for each of them. These data can be written as a set of $N$ $n$-dimensional vectors $\{\mathbf{x}^{(k)}\}_{k=1}^{N}$. The aim of PCA is to perform a linear transformation of these vectors (a rotation in $n$-dimensional space) such that one can define an orthogonal set of $n$ vectors (the principal components, $\{\mathbf{e}_i\}_{i=1}^{n}$) that are decorrelated and can be used to describe the original set of $N$ vectors. Furthermore, each principal component has an associated variance, so that we can sort the principal components in decreasing order of the individual variances. Each of the $N$ original vectors can be described by a set of $n$ numbers (“coordinates”) representing the projections on to each of the principal components. This method also allows us to compress the data (lossy): we can truncate this set of projections to the first $m < n$ components, so that most of the information (in the sense of variance) of each vector is preserved.

PCA – Covariance

The easiest way to deal with PCA is to consider the covariance matrix, which is an $n \times n$ real, symmetric matrix:

$$c_{ij} = \sum_{k=1}^{N} \left(x_i^{(k)} - \langle x_i\rangle\right)\left(x_j^{(k)} - \langle x_j\rangle\right), \qquad 1 \le i, j \le n. \qquad (49)$$

One can always diagonalize this matrix:

$$\mathbf{C}\,\mathbf{e}_i = \lambda_i\,\mathbf{e}_i, \qquad (50)$$

with $n$ eigenvalues $\{\lambda_i\}$ and $n$ eigenvectors $\mathbf{e}_i$ (the principal components), and reorder them such that $\lambda_1 > \lambda_2 > \cdots > \lambda_n$. In the basis of the eigenvectors the covariance matrix is diagonal, so the principal components are decorrelated (all the off-diagonal terms of their covariance matrix are equal to zero).

PCA – Covariance

The projections of the original data vectors are often given as ${\rm PC}_{i=1,\cdots,n}$. For the $k$-th input data vector we have the following expansion:

$${\rm PC}_i^{(k)} \equiv \mathbf{x}^{(k)}\cdot\mathbf{e}_i = \sum_{s=1}^{n} x_s^{(k)}\,e_{i,s}. \qquad (51)$$

The original vectors are therefore uniquely given by these $n$ “coordinates”:

$$\mathbf{x}^{(k)} = \sum_{i=1}^{n} {\rm PC}_i^{(k)}\,\mathbf{e}_i; \qquad (52)$$

the truncation of this series leads to data compression.

Scree plot

The scree plot is a very useful figure that shows the variance of each principal component as a function of rank. That allows us to determine how much information is kept in each component and gives a quantitative measurement of the information lost if the series is truncated. This scree plot shows two main trends in the decay of “information” with the increasing rank of the principal components. Typically, the trend after the 7–8th component is characteristic of noise. Hence, by truncating the series around those terms, one would be capable of “de-noising” the data. The inset shows the cumulative variance: with 8 components we retain about 90% of the information in the original data set.
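A compact PCA sketch on hypothetical mock data: centre, build the covariance matrix, diagonalise it, sort the components by variance, and report the cumulative variance that a scree plot would display. The mock data are driven by two latent factors plus noise, so only about two components should matter.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 1000, 6
latent = rng.standard_normal((N, 2))        # two hidden drivers
mixing = rng.standard_normal((2, n))
X = latent @ mixing + 0.1 * rng.standard_normal((N, n))

Xc = X - X.mean(axis=0)                     # remove the mean vector
C = (Xc.T @ Xc) / (N - 1)                   # sample covariance matrix (n x n)
lam, E = np.linalg.eigh(C)                  # eigenvalues and eigenvectors
order = np.argsort(lam)[::-1]               # decreasing variance
lam, E = lam[order], E[:, order]

projections = Xc @ E                        # PC "coordinates" of every object
cumvar = np.cumsum(lam) / np.sum(lam)       # cumulative variance (scree-plot inset)
print("fraction of variance per component:", (lam / lam.sum()).round(3))
print("cumulative variance:", cumvar.round(3))
```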
An example: PCA on galaxy spectra (Rogers, IF et al. 2007)

This is an example of PCA applied to a set of ~7,000 spectra of early-type galaxies in the Sloan Digital Sky Survey. After de-reddening and de-redshifting the sample, one can treat each SED as a data vector, compute the covariance matrix, and find the principal components. The benefit of a BSS approach is that one does not rely on models to extract information from a data set. It is just the information hidden in the data set – in the form of variance – that results in the definition of the principal components. The drawback is that there is no “physics” in the methodology. Even though, in this case, we can see the Balmer series in components 2 and 5, we cannot interpret these spectra as physical ones. Indeed, the enforced orthogonality inherent to PCA introduces spurious non-physical spectral features.

An example: PCA on galaxy spectra (Rogers, IF et al. 2007)

One can put the physics back into the analysis by comparing the projections of the principal components on to the galaxies (i.e. their coordinates) with physical observables. Here we see a strong trend of some of the components with respect to colour or central velocity dispersion.

An example: PCA on galaxy spectra (Rogers, IF et al. 2007)

We can then project synthetic models of population synthesis – of known age and metallicity – to quantify the way PCA has partly disentangled the inherent degeneracies.

An example: PCA on galaxy spectra (IF et al. 2010)

Once we identify the physical meaning of the PCA-related projections, we can use them as a way of describing the essential information in the galaxy spectra. This figure shows how the PCA information (given here by a combination of the projections of the first two principal components) can discriminate between the effects of intrinsic galaxy properties – such as central velocity dispersion – and environment effects – described here by the mass of the host halo.

The complexity of galaxies

If we consider a set of observables of galaxies such as size, colour, luminosity, etc., one finds a very “compressible” distribution. Here, a sample of HI-detected galaxies is analyzed with PCA, to show that one independent parameter may be enough to explain their properties (Disney et al. 2008).

PCA: characterization of the PSF

PCA can be used to represent, in a few numbers, the point spread function of a camera. The figure illustrates the case for the Advanced Camera for Surveys (HST/ACS, Jee et al. 2007). The top panel shows an observed PSF through the F814W passband (a), and reconstructions using wavelets (b, 150 basis functions), shapelets (c, 78 functions) and PCA (d, 20 components, extracted from 800 stellar images). The plot compares these profiles, showing the advantage of PCA, which just uses the variance in the data set as a way to determine the optimal basis functions. The other methods rely on the definition of the basis functions to optimally match the PSF.

Face Recognition

Treating images as data vectors, we can look in the covariance matrix of a set of pictures of faces to decompose the information into principal components. We can then describe an arbitrary face by a number of projections on to the most significant “eigenfaces”.

Other image recognition problems

Similarly, one can use PCA to determine the illumination or the orientation of simple figures. This can help towards the general problem of computer-based visual recognition. It is also used in video surveillance work, separating the interesting data from the background.

Drawbacks of PCA

• Linear.
• Enforced orthogonality of the principal components.
• Non-physical sources.
• Highly sensitive to outliers: robust PCA requires a way of “clipping” outliers from the original data set.
• “Attention deficit”: prone to latch on to consistent instrumental/data-reduction residuals.

PCA: removal of systematic signals

The last point in the list of drawbacks can actually be a strength of PCA when applied to the filtering of residual effects. In this case, Hewett & Wild (2005) use PCA to remove small – but noticeable – night-sky emission residuals from SDSS spectra. A toy sketch of the general idea is given below.
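The following is a toy illustration of that general idea (not the published algorithm): build the leading principal components of a set of nuisance templates – here, hypothetical sky-residual spectra – and subtract each spectrum's projection on to them. All arrays and numbers are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n_pix, n_templates = 300, 50

# Hypothetical sky-residual templates: a narrow feature with varying amplitude.
pix = np.arange(n_pix)
sky_shape = np.exp(-0.5 * ((pix - 200) / 2.0) ** 2)
templates = (rng.uniform(0.5, 2.0, (n_templates, 1)) * sky_shape
             + 0.02 * rng.standard_normal((n_templates, n_pix)))

# Leading principal components of the (centred) templates.
T = templates - templates.mean(axis=0)
lam, E = np.linalg.eigh(np.cov(T, rowvar=False))
basis = E[:, np.argsort(lam)[::-1][:3]]            # keep the first 3 components

# A hypothetical "science" spectrum contaminated by the same systematic.
spectrum = 1.0 + 0.001 * pix + 1.3 * sky_shape + 0.02 * rng.standard_normal(n_pix)
resid = spectrum - spectrum.mean()
cleaned = spectrum - basis @ (basis.T @ resid)     # subtract the projection

print("pixel 200 (continuum ~1.2): before =", round(float(spectrum[200]), 2),
      ", after =", round(float(cleaned[200]), 2))
```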
Factor Analysis (FA)

An alternative methodology to solve the blind source separation problem is to assume a set of $m$ latent variables ($\{f_i\}$), such that the $p$ observed data ($\{y_j\}$, $p > m$) correspond to linear superpositions of these variables plus noise ($\{\epsilon_j\}$):

$$\mathbf{y} = \boldsymbol{\mu} + \mathbf{W}\cdot\mathbf{f} + \boldsymbol{\epsilon}. \qquad (53)$$

Here, $\boldsymbol{\mu}$ is the mean of the data. In FA jargon, the $p \times m$ mixing matrix ($\mathbf{W}$) contains the loadings of the latent variables. There are a number of assumptions about the data: the uncertainties have zero mean and are uncorrelated; there is no cross-covariance between the factors and the uncertainties; and also $\mathrm{cov}(\mathbf{f}) = \mathbb{1}_{m\times m}$.

Note the difference between PCA and FA:

• PCA gives the principal components as linear superpositions of the original data. FA uses latent variables.
• PCA aims at sorting the data with respect to the variance of the observations. FA exploits the covariances among subsets.

After a few steps, we find that the covariance of the data, $\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{y}) \equiv \left\langle(\mathbf{y}-\boldsymbol{\mu})(\mathbf{y}-\boldsymbol{\mu})^{\rm T}\right\rangle$, can be written:

$$\boldsymbol{\Sigma} = \mathbf{W}\mathbf{W}^{\rm T} + \boldsymbol{\Psi},$$

where $\boldsymbol{\Psi}$ is the (diagonal) covariance matrix of the uncertainties. There are several ways to solve this:

1. Principal component method (note PC/FA $\ne$ PCA). Here we neglect the covariance of the uncertainty, and write:

$$\boldsymbol{\Sigma} = \mathbf{C}\mathbf{D}\mathbf{C}^{\rm T} = (\mathbf{C}\mathbf{D}^{1/2})(\mathbf{C}\mathbf{D}^{1/2})^{\rm T},$$

where $\mathbf{D}$ is a diagonal matrix. We can take the square root as we are dealing with a covariance (i.e. non-negative eigenvalues). Note $\mathbf{C}\mathbf{D}^{1/2}$ is a $p \times p$ matrix. The trick is now to select only a few of the top eigenvalues ($m < p$), creating the eigenvector matrix $(\mathbf{C}_1)_{p\times m}$ and the eigenvalue diagonal matrix $(\mathbf{D}_1)_{m\times m}$, such that:

$$\mathbf{W} = \left(\mathbf{C}_1\mathbf{D}_1^{1/2}\right)_{p\times m}.$$

2. Principal factor method. The uncertainty matrix is included. The method is equivalent to PC/FA where the covariance is replaced by:

$$\boldsymbol{\Sigma}_\Psi \equiv \boldsymbol{\Sigma} - \boldsymbol{\Psi}$$

(remember the covariance of the uncertainties is diagonal). A typical assumption for the diagonal elements of this matrix is:

$$(\boldsymbol{\Sigma}_\Psi)_{ii} = (\boldsymbol{\Sigma})_{ii} - \frac{1}{(\boldsymbol{\Sigma}^{-1})_{ii}}.$$

Similarly to the previous case, we diagonalise this matrix and restrict the analysis to the highest $m$ eigenvalues, obtaining:

$$\mathbf{W} = \left(\mathbf{C}_1\mathbf{D}_1^{1/2}\right)_{p\times m}.$$

This method can be iterated, substituting the values of $(\mathbf{W}\mathbf{W}^{\rm T})_{ii}$ back into the diagonal elements of $\boldsymbol{\Sigma}_\Psi$.

Note that the decomposition into factors is not unique: a rotation, i.e. a transformation via an orthogonal matrix ($\mathbf{O}\mathbf{O}^{\rm T} = \mathbb{1}$), produces the same result. Therefore, the last – and important – step in FA is to rotate the mixing matrix ($\mathbf{W}$) until the loadings fall on fewer latent variables (rather than being spread out over all of them). A short sketch with an off-the-shelf FA implementation is given below.
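A minimal sketch of factor analysis with scikit-learn, applied to hypothetical data generated from the model $\mathbf{y} = \boldsymbol{\mu} + \mathbf{W}\mathbf{f} + \boldsymbol{\epsilon}$ of eq. (53) with $m = 2$ latent factors. The `rotation="varimax"` option is only available in recent scikit-learn versions, and the recovered loadings are only defined up to rotation and sign.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(6)
N, p, m = 2000, 6, 2
W_true = rng.standard_normal((p, m))
f = rng.standard_normal((N, m))
noise = rng.standard_normal((N, p)) * np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1])
Y = 5.0 + f @ W_true.T + noise               # observed data, p variables

fa = FactorAnalysis(n_components=m, rotation="varimax")  # rotation: recent sklearn only
factors = fa.fit_transform(Y)                # estimated latent variables f
loadings = fa.components_.T                  # estimated p x m loading matrix W

print("estimated loadings:\n", loadings.round(2))
print("estimated noise variances:", fa.noise_variance_.round(3))
```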
Independent Component Analysis (ICA)

ICA can be considered as an extension of PCA to arbitrary moments of the probability distribution. With PCA, we simply decorrelate the data – hence stopping at the covariance, i.e. the second order moment. With ICA we require a separation of the data vectors into sources that are not only decorrelated but statistically independent. While PCA has a clean recipe – “diagonalise the covariance matrix and project the data vectors on to the eigenvectors in decreasing order of their eigenvalues” – ICA is not uniquely defined, and many techniques have been developed to extract statistically independent components. We will give a few conceptual ideas below. For more details check out specific packages for the implementation of ICA (e.g. FastICA, http://scikit-learn.org/stable/modules/decomposition.html#ica).

Non-Gaussianity

The central tenet of blind source separation is that the observed data vectors are a mixture of the source signals plus some noise:

$$\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n}, \qquad (54)$$

such that the sources $\mathbf{s}$ are statistically independent. But remember neither the mixing matrix nor the sources are known. One way of proceeding makes use of the Central Limit Theorem:

If a set of signals $\mathbf{s} = (s_1\ s_2\ \cdots\ s_N)$ are independent, with means $(\mu_1\ \mu_2\ \cdots\ \mu_N)$ and variances $(\sigma_1^2\ \sigma_2^2\ \cdots\ \sigma_N^2)$, then the signal defined as $x \equiv \sum_{i=1}^{N} s_i$ has a probability density function that approaches (as $N \to \infty$) a Gaussian distribution with mean $\sum_i \mu_i$ and variance $\sum_i \sigma_i^2$.

Non-Gaussianity: an example

Consider the speech signal on the left. It has a leptokurtic (or super-Gaussian) distribution – positive kurtosis. The middle panel shows a sawtooth signal, clearly platykurtic (sub-Gaussian, negative kurtosis). A mixture of both (let's just take the sum, rightmost panels) is a signal closer to a Gaussian. (From “Independent Component Analysis”, Stone.)

Non-Gaussianity (Projection Pursuit)

This means that any mixture of independent (non-Gaussian) signals will appear more Gaussian than the original ones. Hence, one can search for possible decompositions of the original data vectors into those with the highest non-Gaussianity. The downside is that ICA will only be capable of decomposing a set of signals into a number of non-Gaussian sources plus a single Gaussian signal which cannot be decomposed any further. This example shows how to separate the first two principal components of a PCA test into two more independent sources, by maximizing the non-Gaussianity, measured here as kurtosis (contour lines) (Ferreras 2012, IAUS, 284, 38).

A pictorial version of ICA

This is a very simple representation of ICA, where two independent signals (left) are mixed into two observed datasets (middle). By whitening the data (i.e. decorrelating and scaling such that $\mathrm{cov}(\mathbf{y}) = \mathbb{1}$), we see that the final step is to “rotate” the axes so that each signal returns to a set of independent components. (From Hyvärinen et al. 2001.)

Negentropy

Kurtosis is the simplest indicator of non-Gaussianity, but it is strongly affected by outliers. Other, more robust, indicators are used in ICA, for instance negentropy, which is the extra information (entropy) between the observed dataset and the corresponding Gaussian one with the same covariance:

$$J(\mathbf{y}) \equiv H(\mathbf{y}_{\rm gauss}) - H(\mathbf{y}),$$

where $H(\mathbf{y}) = -E[\ln p(\mathbf{y})]$ is the entropy. The trick is to use some function $g(y)$ to avoid the dependence on outliers. One of the methods that follow this approach is FastICA, consisting of a fixed-point (à la Newton–Raphson) method. An approximation is made to describe negentropy. The first approach would involve high order moments:

$$J(y) \approx \frac{1}{12}\,E(y^3)^2 + \frac{1}{48}\left[\mathrm{kurt}(y)\right]^2.$$

However, this method is not robust against outliers. One can go for non-polynomial expressions, finding:

$$J(y) \propto \left[E\{G(y)\} - E\{G(\nu)\}\right]^2,$$

where the data ($y$) have zero mean and unit variance, and $\nu$ is a random variable from a Gaussian distribution, also with zero mean and unit variance. Functions $G(y)$ with a slower growth than $y^3$ will be less sensitive to outliers; a typical case is:

$$G(y) = e^{-y^2/2}.$$

FastICA is a fixed-point method (similar to the Newton–Raphson algorithm to find the roots of a function) that maximises $J(y)$ by an iterative optimization of a projection vector (equivalent to transforming the mixing matrix). (Figure from scikit-learn.org.)
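A short sketch using the FastICA implementation in scikit-learn: two hypothetical non-Gaussian sources (a sine and a sawtooth) are mixed by a matrix $\mathbf{A}$, and the algorithm recovers statistically independent components (only up to order, sign and scale).

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
t = np.linspace(0.0, 8.0, 2000)
s1 = np.sin(2.0 * np.pi * t)                         # non-Gaussian source 1
s2 = 2.0 * (t % 1.0) - 1.0                           # non-Gaussian source 2 (sawtooth)
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5],
              [0.7, 1.2]])                           # unknown mixing matrix in a real problem
X = S @ A.T + 0.02 * rng.standard_normal(S.shape)    # observed mixtures + noise

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                         # estimated independent components
print("estimated mixing matrix:\n", ica.mixing_.round(2))
print("|corr(source 1, best-matching component)| =",
      round(max(abs(np.corrcoef(s1, S_est[:, j])[0, 1]) for j in range(2)), 3))
```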
Infomax

Another way of extracting statistically independent sources is by the use of the entropy (i.e. “the level of surprise”):

I. A set of signals with a uniform joint pdf has maximum joint entropy.
II. A set of signals that have maximum joint entropy are mutually independent.
III. Any invertible function of independent signals yields signals that are also mutually independent.

The last point will be useful if we consider that, for any pdf $p(y)$, the cumulative distribution function

$$g(Y) \equiv \int_{-\infty}^{Y} p(y)\,dy \qquad (55)$$

has a maximum-entropy (uniform) pdf.

Infomax (cont’d)

An example of two source signals ($\mathbf{s}$, leftmost panels) mixed ($\mathbf{x} = \mathbf{A}\mathbf{s}$) and separated via infomax ($\mathbf{y} = \mathbf{W}\mathbf{x}$). The rightmost panels correspond to the cumulative distribution ($Y = g(\mathbf{y})$) when optimized. (From “Independent Component Analysis”, Stone.)

Many more methods ...

This has been a brief introduction. There are many methods to extract information from multivariate data, including the vast realm of machine learning algorithms. Some interesting advanced topics are:

• Non-negative matrix factorization
• Support Vector Machines
• Artificial Neural Networks
• Gaussian Processes

Further Reading

• Methods of Multivariate Analysis, Rencher & Christensen, 2012, Wiley
• Independent Component Analysis, Hyvärinen, Karhunen & Oja, 2001, Wiley
• Independent Component Analysis: A Tutorial Introduction, Stone, 2004, MIT Press
• Modern Statistical Methods for Astronomy, Feigelson & Babu, 2012, Cambridge
• Practical Statistics for Astronomers, Wall & Jenkins, 2003, Cambridge