Principal Component Analysis (PCA) and Independent Component Analysis (ICA)
A brief tutorial by: John Semmlow, Ph.D.
Robert Wood Johnson Medical School and Rutgers University, New Jersey, USA

Multivariate Analysis
Multivariate analysis is concerned with the analysis of multiple variables (or measurements) treated as a single entity; for example, variables from multiple measurements made on the same process or system. In multivariate analysis, these multiple variables are often represented as a single matrix variable that collects the different variables:

x = [x1(t), x2(t), ..., xM(t)]T

In this case, x is composed of M variables, each containing T observations (t = 1, ..., T). In signal processing the observations are time samples, while in image processing they are pixels.
• A major concern of multivariate analysis is to find transformations of the multivariate data that make the data set smaller or easier to understand.
• For example, the relevant information may be contained in a smaller number of dimensions (i.e., variables), and the reduced set of variables may be more meaningful than the original data set.

Multivariate Transformations
• In transformations that reduce the dimensionality of a multivariable data set, the idea is to transform one set of variables into a new set in which some of the new variables have values that are quite small compared to the others. Since the values of these variables are relatively small, they cannot contribute very much information to the overall data set and, hence, can be eliminated.
– Evaluating the significance of a variable by the range of its values assumes that all the original variables have approximately the same range. If not, some form of normalization should be applied to the original data set.

Linear Transformations
A linear transformation can be represented mathematically as:

yi(t) = Σj=1..M wij xj(t),    i = 1, ..., N

where the wij are constant coefficients that define the transformation. Using linear algebra notation, with y = [y1(t), y2(t), ..., yM(t)]T and x = [x1(t), x2(t), ..., xM(t)]T, this is simply:

y = Wx

• A linear transformation can be interpreted as a rotation (and possibly scaling) of the original data set in M-dimensional space.

Principal Component Analysis (PCA)
The data set is transformed to produce a new set of variables (termed the principal components) that are uncorrelated. The goal is to reduce the dimensionality of the data, not necessarily to produce more meaningful variables.

Independent Component Analysis (ICA)
The goal is a bit more ambitious: to find new variables (components) that are both statistically independent and nongaussian.

PCA operates by transforming a set of correlated variables into a new set of uncorrelated variables, the "principal components."
• Uncorrelated data are not, in general, independent (except for Gaussian noise).

[Figure: two variables plotted against Time (sec) (left) and as a scatter plot of x2 versus x1 (right); Covariance = 0.0.]
These two variables are uncorrelated but highly dependent, as they were generated from the equation for a circle (plus noise).

To implement PCA we can use the covariance matrix, defined by:

      | σ1,1  σ1,2  ...  σ1,N |
S  =  | σ2,1  σ2,2  ...  σ2,N |
      |  ...   ...  ...   ... |
      | σN,1  σN,2  ...  σN,N |

If we can rotate the data so that the off-diagonal elements are zero, then the variables will be uncorrelated.
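As a quick illustration of the circle example and of inspecting a covariance matrix in practice, the short MATLAB sketch below generates two such variables and prints their covariance matrix. It is only illustrative; the variable names and noise level are not taken from the tutorial.

% Illustrative sketch: two variables generated from the equation of a
% circle (plus noise) are uncorrelated but clearly dependent.
t  = (0:999)'/1000;                   % 1000 "time" samples
th = 2*pi*t;                          % angle around the circle
x1 = cos(th) + 0.05*randn(size(t));   % first variable
x2 = sin(th) + 0.05*randn(size(t));   % second variable, tied to x1 through the circle
X  = [x1 x2];                         % data matrix, observations in rows
S  = cov(X);                          % covariance matrix
disp(S)                               % off-diagonal elements are near zero (uncorrelated),
                                      %  even though x2 is largely determined by x1

Rotating a data set so that the off-diagonal elements of S become exactly zero is precisely what PCA does.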
(A matrix that has zeros in the off-diagonal elements is termed a "diagonal matrix.")
• A well-known technique exists to reduce a positive-definite matrix (such as the covariance matrix) to a diagonal matrix by pre- and post-multiplication with an orthonormal matrix:

U'SU = D

where S is the N by N covariance matrix, D is a diagonal matrix, and U is the orthonormal matrix that performs the transformation. The diagonal elements of D are the variances of the new data, more generally known as the characteristic roots, or eigenvalues, of S: λ1, λ2, ..., λN.

An alternative technique uses Singular Value Decomposition (SVD), which solves the equation:

X = U D1/2 V'

This equation has a form similar to the one shown previously. In the case of PCA, X is the data matrix, which is decomposed into D1/2, a diagonal matrix that contains the singular values (the square roots of the eigenvalues), and V, the principal components matrix.

In MATLAB
Singular value decomposition of a data array, X, uses:

[U,D,pc] = svd(X);

where D is a diagonal matrix containing the singular values and the columns of pc contain the principal components. The eigenvalues can be obtained from D by squaring its diagonal, using the 'diag' command:

eigen = diag(D).^2;

Order Selection: How many different variables are actually contained in the data set?
The eigenvalues describe how much of the variance is accounted for by the associated principal component, and when singular value decomposition is used these eigenvalues are ordered by size; that is: λ1 > λ2 > λ3 > ... > λM. They can be very helpful in determining how many of the components are really significant and how much the data set can be reduced. If several eigenvalues are zero or close to zero, then the associated principal components contribute little to the data and can be eliminated. The eigenvalues can therefore be used to determine the number of separate variables present in a data set.

[Figure: Original Data Set, x(t) versus Time (sec). This data set contains five variables, but in fact consists of only two variables (a sine and a sawtooth) plus noise, mixed together in different ways.]

The Scree Plot is a plot of each eigenvalue against its number and can be useful in estimating how many independent variables are actually present in the data.

[Figure: Scree Plot (Eigenvalues versus N) obtained from the previous data set. The sharp break at 2 indicates that only two variables are present in the data.]

MATLAB code to calculate the principal components and eigenvalues, and to output the Scree Plot:

% Find Principal Components
[U,D,pc] = svd(X,0);      % Singular value decomposition of the data matrix X
                          %  (columns of pc contain the principal components)
eigen = diag(D).^2;       % Calculate eigenvalues from the singular values
eigen = eigen/N;          % Scale eigenvalues so they equal variances
                          %  (N = number of observations)
plot(eigen);              % Plot Scree Plot

Even though the principal components are uncorrelated, they are not independent.

[Figure: the first two (dominant) principal components (left) and the original components (right), plotted against Time (msec).]
Even though the first two principal components are uncorrelated and contain most of the information, they are not independent, since they are still mixtures of the two independent components.
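The two routes above (diagonalizing the covariance matrix versus taking the SVD of the data) can be checked against each other. The sketch below is only illustrative: it builds a five-variable data set of the kind shown in the figure (two sources plus noise, mixed five ways) and compares the eigenvalues from the two routes; all variable names are made up for the example.

% Minimal check that the covariance-matrix and SVD routes to PCA agree.
N  = 1000;
t  = (0:N-1)'/N;
s1 = sin(2*pi*5*t);                        % sine source
s2 = 2*mod(3*t,1) - 1;                     % sawtooth source (built without toolboxes)
X  = [s1 s2]*randn(2,5) + 0.1*randn(N,5);  % five noisy mixtures of the two sources

% Route 1: diagonalize the covariance matrix (U'*S*U = D)
S       = cov(X);                          % covariance matrix of the data
[U,D]   = eig(S);                          % U orthonormal, D diagonal
eig_cov = sort(diag(D),'descend');         % eigenvalues, largest first

% Route 2: singular value decomposition of the (de-meaned) data
Xc          = X - ones(N,1)*mean(X);       % remove the mean of each variable
[~, Sv, pc] = svd(Xc,0);                   % economy SVD; columns of pc are the PCs
eig_svd     = diag(Sv).^2/(N-1);           % squared singular values scaled to variances
                                           %  (N-1 matches MATLAB's cov normalization)

disp([eig_cov eig_svd])                    % the two columns agree

Plotting either column against its index reproduces the kind of scree plot shown above, with a sharp break after the second eigenvalue.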
Independent Component Analysis
• The motivation for this transformation is primarily to uncover more meaningful variables, not to reduce the dimensions of the data set.
• When data set reduction is also desired, it is usually accomplished by preprocessing the data set using PCA.
• ICA seeks to transform the original data set into a number of independent variables.

The basis of most ICA approaches is a generative model; that is, a model that describes how the measured signals are produced. The model assumes that the measured signals are the product of instantaneous linear combinations of the independent sources:

xi(t) = ai1 s1(t) + ai2 s2(t) + ... + aiN sN(t),    for i = 1, ..., N

or in matrix form, with x = [x1(t), ..., xN(t)]T and s = [s1(t), ..., sN(t)]T, simply:

x = As

where A is known as the "mixing" matrix.

If A is the mixing matrix, then the unknown (hidden) independent variables, s, can be obtained from the "unmixing" matrix, A-1:

s = A-1x

Since A is unknown, we cannot find A-1 directly. To find A-1, we use optimization techniques (trial and error) to find an A-1 that maximizes the independence of s. ICA then becomes the problem of finding an approach that measures the independence of the new data set s.

One measure of the independence of s is "nongaussianity": how different the variables of s are from a Gaussian distribution. Mixtures of non-Gaussian signals look more Gaussian than the unmixed signals (the Central Limit Theorem at work!).

[Figure: probability distributions P(x). A: Gaussian Distribution; B: Sinusoidal Distribution; C: Double Sinusoidal Distribution; D: Quad Sinusoidal Distribution.]

[Figure: the signal mixtures (left) and the corresponding joint density plot (right). The plot on the right is the scatter plot of the two variables in x; the marginal densities are shown at the edges of the plot.]

A first step in many ICA algorithms is to whiten (sphere) the data.

[Figure: scatter plots of x2 versus x1 before whitening (Correlation = 0.77358) and after whitening.]

Data that have been whitened are uncorrelated (as are the principal components) but, in addition, all of the variables have variances of one. A 3-variable data set that has been whitened would have a spherical shape, hence the term "sphering the data."

[Figure: the signals, x, and the joint density p(x) after the two-variable data set has been whitened. Note that the distributions are already less Gaussian.]

After sphering, the separated signals can be found by an orthogonal transformation of the whitened signals x (this is simply a rotation of the joint density). The appropriate rotation is sought by maximizing the nongaussianity of the marginal densities (shown at the edges of the density plot). This works because a linear mixture of independent random variables is necessarily more Gaussian than the original variables; it also implies that in ICA we must restrict ourselves to at most one Gaussian source signal.

[Figure: the result after one step of the FastICA algorithm; the next few slides show subsequent rotations.]

Convergence! The source signals (components of s) in this example were a sinusoid and noise, as can be seen in the left part of the final plot. Note the nongaussian appearance of the probability distribution function.
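As a concrete illustration of the whitening step, the sketch below whitens a two-variable mixture using the eigen-decomposition of its covariance matrix and confirms that the result has unit variances and zero correlation. The sources, mixing matrix, and variable names are invented for the illustration.

% Whitening (sphering): decorrelate the data and scale every new
% variable to unit variance.
N   = 1000;
t   = (0:N-1)'/N;
s   = [sin(2*pi*4*t)  randn(N,1)];   % two illustrative sources: sinusoid and noise
A   = [0.8 0.4; 0.3 0.9];            % an arbitrary 2-by-2 mixing matrix
x   = s*A';                          % observed mixtures (each row is one sample of x = A*s)

xc     = x - ones(N,1)*mean(x);      % remove the means
C      = cov(xc);                    % covariance of the mixtures
[E, D] = eig(C);                     % eigenvectors E, eigenvalues on the diagonal of D
W      = diag(1./sqrt(diag(D)))*E';  % whitening matrix, D^(-1/2)*E'
z      = xc*W';                      % whitened (sphered) data

disp(cov(z))                         % approximately the identity matrix:
                                     %  uncorrelated, with variances of one

After this step, an ICA algorithm only has to search for the orthogonal rotation of z that maximizes the nongaussianity of its components.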
Example application using MATLAB and the Jade algorithm.

[Figure: the three Original Components (Before mixing), s(t), on the left and the five Mixed Signals, X(t), on the right, both plotted against Time (sec).]
The three noisy signals on the left were mixed together five different ways to produce the five signals on the right.

ICA accurately un-mixes the three signals, even in the presence of a small amount of noise. The Scree Plot of the five mixed signals indicates that only three separate signals are present.

[Figure: Scree Plot of the five mixed signals (Eigenvalue versus N, left) and the three recovered Independent Components, X(t) versus Time (sec), right.]
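A minimal sketch of how an example of this kind might be set up is shown below. It assumes that Cardoso's jadeR.m implementation of the JADE algorithm has been downloaded and is on the MATLAB path; the sources, mixing matrix, and noise level are invented for the illustration and are not the ones used in the tutorial.

% Sketch: three sources mixed into five noisy signals, then un-mixed
% with the JADE algorithm (requires Cardoso's jadeR.m on the path).
N  = 1000;
t  = (0:N-1)'/N;
s  = [sin(2*pi*5*t), 2*mod(3*t,1)-1, randn(N,1)]';  % 3-by-N sources: sine, sawtooth, noise
A  = randn(5,3);                                    % arbitrary 5-by-3 mixing matrix
x  = A*s + 0.05*randn(5,N);                         % five noisy mixtures, x = A*s + noise

% A scree plot of the mixtures shows only three significant eigenvalues.
[~, Sv, ~] = svd(x',0);
figure; plot(diag(Sv).^2/(N-1)); title('Scree Plot');

% Un-mix with JADE: jadeR returns an estimate of the un-mixing matrix.
B  = jadeR(x, 3);                                   % ask for 3 independent components
ic = B*x;                                           % estimated independent components
figure; plot(t, ic');                               % compare with the original sources

As with any ICA method, the recovered components come back in arbitrary order and with arbitrary sign and scale.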