Independent Component Analysis

Reference: "Independent Component Analysis: A Tutorial" by Aapo Hyvärinen, http://www.cis.hut.fi/projects/ica

Motivation of ICA - The Cocktail-Party Problem
• Three people at a party are speaking at the same time from different locations (S).
• Their voices are mixed together, and we cannot tell who said what.
• Three microphones placed at different locations record the sound in the room (X).
• Can the recorded microphone signals (X) be used to recover the three original speech signals (S)?
• Demo

Formulation of ICA
• Two speech signals s_1(t) and s_2(t) are received by two microphones; the mixed signals are x_1(t) and x_2(t):
    x_1(t) = a_{11} s_1(t) + a_{12} s_2(t)    ----(1)
    x_2(t) = a_{21} s_1(t) + a_{22} s_2(t)    ----(2)
• It would be very useful if we could estimate the original signals s_1(t) and s_2(t) from only the recorded signals x_1(t) and x_2(t).

Formulation of ICA
• Suppose the a_{ij} are known; then solving the linear equations (1) and (2) retrieves s_1(t) and s_2(t).
• The problem is that we do not know the a_{ij}.
• One approach is to use information about the statistical properties of the signals s_i(t) to estimate the a_{ij}.
• Assume s_1(t) and s_2(t) are statistically independent; then Independent Component Analysis techniques can retrieve s_1(t) and s_2(t) from the mixtures x_1(t) and x_2(t).

[Figures: original signals s_1(t), s_2(t); mixture signals x_1(t), x_2(t); recovered signals for s_1(t), s_2(t)]

Definition of ICA
• For n linear mixtures x_1, ..., x_n of n independent components:
    x_j(t) = a_{j1} s_1 + a_{j2} s_2 + ... + a_{jn} s_n, for all j    ----(3)
    x = As    ----(4)
    x = \sum_{i=1}^{n} a_i s_i    ----(5)
• The independent components s_i are latent variables, meaning that they cannot be directly observed, and the mixing matrix A is assumed to be unknown.
• We would like to estimate both A and s using only the observable random vector x and some statistical assumptions.

Definition of ICA
• S --A--> X --B--> Y: x = As; y = Bx; y is a copy of s.
• If C is non-mixing, then y = Cs is a copy of s.
• A square matrix is said to be non-mixing if it has one and only one non-zero entry in each row and each column.

Illustration of ICA
• We use two independent components with the following uniform distributions to illustrate the ICA model:
    p(s_i) = 1/(2\sqrt{3}) if |s_i| \le \sqrt{3}, and 0 otherwise    ----(7)
  – The distribution has zero mean and variance equal to one.
  – Let us mix these two independent components with a fixed 2x2 mixing matrix A_0.
  – This gives us two mixed variables x_1 and x_2.
  – The mixed data has a uniform distribution on a parallelogram.
  – But x_1 and x_2 are no longer independent: when x_1 attains its maximum or minimum, this also determines the value of x_2.

Illustration of ICA
[Fig 5. Joint density of the original signals s_1 and s_2. Fig 6. Joint density of the observed mixtures x_1 and x_2.]

Illustration of ICA
• The problem of estimating the ICA data model is now to estimate the mixing matrix A_0 using only the information contained in the mixtures x_1 and x_2.
• Fig 6 suggests an intuitive way of estimating A: the edges of the parallelogram are in the directions of the columns of A. That is, we could estimate the ICA model by
  – first estimating the joint density of x_1 and x_2, and then
  – locating the edges.
• However, this only works for random variables with uniform distributions.
• We need a method that works for any type of distribution.

Ambiguities of ICA
Because y = Bx is just a copy of s:
  – we cannot determine the variances (energies) of the independent components;
  – we cannot determine the order of the independent components.
• Applying a permutation matrix P to x = As, i.e., x = AP^{-1}Ps: Ps is still like the original signals, and
• AP^{-1} is just a new unknown mixing matrix, to be solved for by the ICA algorithm,
• so the order of s will be changed.
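Before moving on, here is a minimal numerical sketch (my own illustration, assuming NumPy; the entries of the mixing matrix below are hypothetical, since the slide's A_0 is not reproduced here) of the mixing model x = As from the illustration above: two independent, zero-mean, unit-variance uniform sources are mixed, and the resulting mixtures are correlated even though the sources are not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent sources, uniform on [-sqrt(3), sqrt(3)]: zero mean, unit variance (Eq. 7).
n_samples = 10_000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n_samples))

# Hypothetical 2x2 mixing matrix (stand-in for the slide's A_0).
A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
x = A @ s  # observed mixtures, x = As

# The sources are (nearly) uncorrelated; the mixtures are not.
print("source covariance:\n", np.cov(s))
print("mixture covariance:\n", np.cov(x))
```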
Properties of ICA - Independence
• The variables y_1 and y_2 are said to be independent if information on the value of y_1 does not give any information on the value of y_2, and vice versa.
• Let p(y_1, y_2) be the joint probability density function (pdf) of y_1 and y_2, and let p_1(y_1) be the marginal pdf of y_1:
    p_1(y_1) = \int p(y_1, y_2) dy_2    ----(9)
• y_1 and y_2 are independent if and only if the joint pdf is factorizable:
    p(y_1, y_2) = p_1(y_1) p_2(y_2)    ----(10)
• Thus, given any two functions h_1 and h_2, we always have
    E\{h_1(y_1) h_2(y_2)\} = E\{h_1(y_1)\} E\{h_2(y_2)\}    ----(11)

Properties of ICA - Uncorrelated variables are only partly independent
  – Two variables y_1 and y_2 are said to be uncorrelated if their covariance is zero:
    E\{y_1 y_2\} - E\{y_1\} E\{y_2\} = 0    ----(13)
  – If the variables are independent, they are uncorrelated, but the converse is not true!
• For example: if x is uniformly distributed on [0, 2π], then sin(x) and cos(x) are dependent (both are functions of x), yet cov(sin(x), cos(x)) = 0.

Gaussian variables are forbidden
• The fundamental restriction in ICA is that the independent components must be nongaussian for ICA to be possible.
• Assume the mixing matrix is orthogonal and the s_i are gaussian; then x_1 and x_2 are gaussian, uncorrelated, and of unit variance.
• The joint pdf is
    p(x_1, x_2) = \frac{1}{2\pi} \exp\left(-\frac{x_1^2 + x_2^2}{2}\right)    ----(15)
• This distribution is completely symmetric (shown in the figure on the next page); it does not contain any information on the directions of the columns of the mixing matrix A.
• Thus A cannot be estimated.

[Fig 7. Multivariate distribution of two independent gaussian variables]

ICA Basics
• Source separation by ICA must go beyond second-order statistics.
• Any time structure is ignored, because the information contained in the data is exhaustively represented by the sample distribution of the observed vector.
• Source separation can be obtained by optimizing a "contrast function", i.e., a function that measures independence.

Measures of independence
• Nongaussian is independent
  – The key to estimating the ICA model is nongaussianity.
  – The central limit theorem (CLT) tells us that the distribution of a sum of independent random variables tends toward a gaussian distribution. In other words,
  – a mixture of two independent signals usually has a distribution that is closer to gaussian than either of the two original signals.
• Suppose we want to estimate y, one of the independent components s, from x.
• Let us denote this by y = w^T x = \sum_i w_i x_i, where w is a vector to be determined.
• How can we use the CLT to determine w so that it equals one of the rows of the inverse of A?

Nongaussian is independent
• Let us make a change of variables, z = A^T w.
• Then we have y = w^T x = w^T A s = z^T s = \sum_i z_i s_i.
• Thus y = z^T s is typically more gaussian than any of the original variables s_i.
• y becomes least gaussian when it equals one of the s_i; a trivial way to achieve this is to let only one of the elements z_i of z be nonzero.
• Maximizing the nongaussianity of w^T x therefore gives us one of the independent components.

Measures of nongaussianity
• To use nongaussianity in ICA, we must have a quantitative measure of the nongaussianity of a random variable y.
Kurtosis
• The classical measure of nongaussianity is kurtosis, the fourth-order cumulant:
    kurt(y) = E\{y^4\} - 3 (E\{y^2\})^2    ----(16)
• Assume y is of unit variance; then kurt(y) = E\{y^4\} - 3.
• Kurtosis is then simply a normalized fourth moment E\{y^4\}.
• For a gaussian y, the fourth moment equals 3(E\{y^2\})^2; thus kurtosis is zero for a gaussian random variable.
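Below is a minimal sketch (my own illustration, assuming NumPy) of estimating Eq. (16) from samples: for unit-variance data the estimate is near zero for gaussian samples, negative for uniform samples, and positive for Laplacian samples, anticipating the sub/supergaussian distinction on the next slide.

```python
import numpy as np

def kurtosis(y: np.ndarray) -> float:
    """Sample kurtosis kurt(y) = E{y^4} - 3 (E{y^2})^2 (Eq. 16)."""
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
n = 100_000
print(kurtosis(rng.standard_normal(n)))                    # ~ 0    (gaussian)
print(kurtosis(rng.uniform(-np.sqrt(3), np.sqrt(3), n)))   # ~ -1.2 (uniform, unit variance)
print(kurtosis(rng.laplace(scale=1/np.sqrt(2), size=n)))   # ~ +3   (Laplace, unit variance)
```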
Kurtosis
• Kurtosis can be both positive and negative.
• Random variables with negative kurtosis are called subgaussian.
• Subgaussian RVs typically have a flat pdf, which is rather constant near zero and very small for larger values.
  – The uniform distribution is a typical subgaussian example.
• Supergaussian RVs (positive kurtosis) have a spiky pdf with heavy tails.
  – The Laplace distribution is a typical supergaussian example:
    p(y) = \frac{1}{\sqrt{2}} \exp(-\sqrt{2} |y|)    ----(17)

Kurtosis (cont.)
• Typically, nongaussianity is measured by the absolute value of kurtosis.
• Kurtosis can be estimated from the fourth moments of the sample data.
• If x_1 and x_2 are two independent RVs, it holds that
    kurt(x_1 + x_2) = kurt(x_1) + kurt(x_2)    ----(18)
    kurt(\alpha x_1) = \alpha^4 kurt(x_1)    ----(19)
• To illustrate with a simple example what the optimization landscape for kurtosis looks like, let us look at a 2-D model x = As.
• We seek one of the independent components as y = w^T x.
• Let z = A^T w; then y = w^T x = w^T A s = z^T s = z_1 s_1 + z_2 s_2.

Kurtosis (cont.)
• Using the additive property of kurtosis, we have
    kurt(y) = kurt(z_1 s_1) + kurt(z_2 s_2) = z_1^4 kurt(s_1) + z_2^4 kurt(s_2)
• Let us apply the constraint that the variance of y equals 1; this is the same assumption made for s_1 and s_2.
• This constrains z: E\{y^2\} = z_1^2 + z_2^2 = 1, which means that the vector z is constrained to the unit circle in the 2-D plane.
• The optimization problem becomes: what are the maxima of the function |kurt(y)| = |z_1^4 kurt(s_1) + z_2^4 kurt(s_2)| on the unit circle?
• The maxima are the points where exactly one element of z is nonzero, i.e., z = (0, ±1) or (±1, 0).
• These points correspond to y equaling one of ±s_i.

Kurtosis (cont.)
• In practice we could start from a weight vector w, compute the direction in which the kurtosis of y = w^T x is growing or decreasing most strongly based on the available sample x(1), ..., x(T) of the mixture vector x, and use a gradient method to find a new vector w.
• However, kurtosis has some drawbacks.
• The main problem is that kurtosis can be very sensitive to outliers; in other words, kurtosis is not a robust measure of nongaussianity.
• In the following sections, we introduce negentropy, whose properties are in many ways opposite to those of kurtosis.

Negentropy
• Negentropy is based on the information-theoretic quantity of entropy.
• The entropy of a RV is a measure of the degree of randomness of the observed variable.
• The more unpredictable and unstructured the variable is, the larger its entropy.
• Entropy is defined for a discrete RV Y as
    H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i)    ----(20)
  and for a RV y with density f(y) as
    H(y) = -\int f(y) \log f(y) dy    ----(21)
• A fundamental result of information theory is that a gaussian variable has the largest entropy among all random variables of equal variance.
• Thus, entropy can be used to measure nongaussianity.

Negentropy
• To obtain a measure of nongaussianity that is zero for a gaussian variable and always nonnegative, one often uses negentropy J, which is defined as
    J(y) = H(y_{gauss}) - H(y)    ----(22)
  where y_{gauss} is a gaussian RV with the same covariance matrix as y.
• The advantage of using negentropy is that it is in some sense the optimal estimator of nongaussianity, as far as statistical properties are concerned.
• The problem with using negentropy is that it is computationally very difficult to estimate.
• Thus, simpler approximations of negentropy seem necessary and useful.
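A rough sketch (my own illustration, not from the tutorial) of Eqs. (21)-(22): differential entropy is approximated with a simple histogram, and negentropy is obtained by comparing against gaussian samples of the same variance. It is only meant to show numerically that the gaussian has the largest entropy for a given variance; practical ICA algorithms use the approximations of the next section instead.

```python
import numpy as np

def entropy_hist(y: np.ndarray, bins: int = 200) -> float:
    """Crude histogram estimate of differential entropy H(y) (Eq. 21)."""
    counts, edges = np.histogram(y, bins=bins)
    p = counts / counts.sum()
    width = edges[1] - edges[0]
    p = p[p > 0]
    return -np.sum(p * np.log(p)) + np.log(width)

def negentropy(y: np.ndarray, rng: np.random.Generator) -> float:
    """J(y) = H(y_gauss) - H(y) (Eq. 22), with y_gauss matched in variance."""
    y_gauss = rng.standard_normal(len(y)) * y.std()
    return entropy_hist(y_gauss) - entropy_hist(y)

rng = np.random.default_rng(0)
n = 200_000
print(negentropy(rng.standard_normal(n), rng))                   # ~ 0 (gaussian)
print(negentropy(rng.uniform(-np.sqrt(3), np.sqrt(3), n), rng))  # > 0 (subgaussian)
print(negentropy(rng.laplace(scale=1/np.sqrt(2), size=n), rng))  # > 0 (supergaussian)
```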
Approximations of negentropy
• The classical method of approximating negentropy uses higher-order moments, for example
    J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48} kurt(y)^2    ----(23)
• The RV y is assumed to be of zero mean and unit variance.
• This approach still suffers from the same nonrobustness as kurtosis.
• Another approximation was developed based on the maximum-entropy principle:
    J(y) \approx c [E\{G(y)\} - E\{G(v)\}]^2    ----(25)
  where v is a gaussian variable of zero mean and unit variance, and G is a nonquadratic function.

Approximations of negentropy
• Taking G(y) = y^4, (25) becomes (23).
• Suppose G is chosen to be slowly growing, as in the following contrast functions:
    G_1(u) = \frac{1}{a_1} \log \cosh(a_1 u),    G_2(u) = -\exp(-u^2/2)    ----(26)
  where 1 \le a_1 \le 2.
• This approximation is conceptually simple, fast to compute, and especially robust.
• A practical algorithm based on these contrast functions is presented in Section 6.

Preprocessing - centering
• Some preprocessing techniques make the problem of ICA estimation simpler and better conditioned.
Centering
• Center the variable x, i.e., subtract its mean vector m = E\{x\}, so as to make x a zero-mean variable.
• This preprocessing is done solely to simplify the ICA algorithms.
• After estimating the mixing matrix A with centered data, we can complete the estimation by adding the mean vector of s back to the centered estimates of s.
• The mean vector of s is given by A^{-1} m, where m is the mean vector that was subtracted in the preprocessing.

Preprocessing - whitening
• Another preprocessing step is to whiten the observed variables.
• Whitening means transforming the variable x linearly so that the new variable \tilde{x} is white, i.e., its components are uncorrelated and their variances equal unity:
    E\{\tilde{x} \tilde{x}^T\} = I    ----(33)
• In other words, the variable \tilde{x} is white when its covariance matrix, e.g. for three components
    [ Cov(1,1)  Cov(1,2)  Cov(1,3)
      Cov(2,1)  Cov(2,2)  Cov(2,3)
      Cov(3,1)  Cov(3,2)  Cov(3,3) ]
  equals the identity matrix.

Preprocessing - whitening
• The correlation between two variables x and y is
    \rho(x, y) = Cov(x, y) / (\sigma_x \sigma_y)
• The covariance between x and y is
    Cov(x, y) = (1/n) \sum_i (x_i - \bar{x})(y_i - \bar{y})
• The covariance Cov(x, y) can also be computed as
    Cov(x, y) = (1/n) \sum_i x_i y_i - \big((1/n) \sum_i x_i\big)\big((1/n) \sum_i y_i\big) = E[xy] - E[x]E[y]
• If two variables are uncorrelated, then \rho(x, y) = Cov(x, y) = 0.
• Covariance matrix = I means that Cov(x_i, x_j) = 0 whenever i \ne j.
• In other words, if a random vector's covariance matrix is the identity (white), its components are uncorrelated.

Preprocessing - whitening
• Although uncorrelated variables are only partly independent, decorrelation (using second-order information) can be used to reduce the problem to a simpler form.
• The unwhitened mixing matrix A has n^2 parameters, but the whitened mixing matrix \tilde{A} needs fewer (about half), because it turns out to be orthogonal (shown below).

[Fig 10 (compare Figs 5 and 6): the data of Fig 6 after whitening. The square shows that the distribution is clearly a rotated version of the original square in Fig 5; all that is left is to estimate the single angle that gives the rotation.]

Preprocessing - whitening
• Whitening can be computed via the eigenvalue decomposition (EVD) of the covariance matrix, E\{xx^T\} = E D E^T, where
  – E is the orthogonal matrix of eigenvectors of E\{xx^T\},
  – D is the diagonal matrix of its eigenvalues, D = diag(d_1, ..., d_n),
  – note that E\{xx^T\} can be estimated in the standard way from the available sample x(1), ..., x(T).

Preprocessing - whitening
• Whitening can now be computed by
    \tilde{x} = E D^{-1/2} E^T x    ----(34)
  where D^{-1/2} = diag(d_1^{-1/2}, ..., d_n^{-1/2}).
• It is easy to show that E\{\tilde{x} \tilde{x}^T\} = I, using (34) and E\{xx^T\} = E D E^T.
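A minimal sketch (assuming NumPy and the hypothetical mixing matrix used in the earlier sketch) of centering plus EVD-based whitening as in Eq. (34):

```python
import numpy as np

def whiten(x: np.ndarray) -> np.ndarray:
    """Whiten data x (shape: n_signals x n_samples) via EVD, Eq. (34):
    x_tilde = E D^{-1/2} E^T x, so that E{x_tilde x_tilde^T} ~ I."""
    x = x - x.mean(axis=1, keepdims=True)   # centering preprocessing
    cov = np.cov(x)                         # estimate E{x x^T}
    d, E = np.linalg.eigh(cov)              # EVD: cov = E diag(d) E^T
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return E @ D_inv_sqrt @ E.T @ x

# Example: whitening the mixtures from the earlier sketch.
rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10_000))
x = np.array([[2.0, 3.0], [2.0, 1.0]]) @ s  # hypothetical mixing matrix
x_tilde = whiten(x)
print(np.cov(x_tilde))                      # approximately the identity matrix
```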
Preprocessing - whitening
• Since x = As, whitening transforms the mixing matrix into a new matrix \tilde{A}:
    \tilde{x} = E D^{-1/2} E^T A s = \tilde{A} s    ----(35)
• Since
    E\{\tilde{x} \tilde{x}^T\} = \tilde{A} E\{ss^T\} \tilde{A}^T = \tilde{A} \tilde{A}^T = I    ----(36)
  the new mixing matrix \tilde{A} is orthogonal.

The FastICA Algorithm - FastICA for one unit
• The FastICA learning rule finds a direction, i.e., a unit vector w, such that the projection w^T x maximizes nongaussianity, as measured by the approximation of negentropy J(w^T x):
    J(y) \approx c [E\{G(y)\} - E\{G(v)\}]^2    ----(25)
• The variance of y = w^T x must be constrained to unity; for whitened data this is equivalent to constraining the norm of w to unity: E\{(w^T x)^2\} = ||w||^2 = 1.
• In the following algorithm, g denotes the derivative of the nonquadratic function G.

FastICA for one unit
• The FastICA algorithm:
  1) Choose an initial (e.g., random) weight vector w.
  2) Let w^+ = E\{x g(w^T x)\} - E\{g'(w^T x)\} w.
  3) Let w = w^+ / ||w^+|| (normalization improves stability).
  4) If not converged, go back to step 2.
• The derivation is as follows:
  – The optima of E\{G(w^T x)\} under the constraint E\{(w^T x)^2\} = ||w||^2 = 1 are obtained at points where
    F(w) = E\{x g(w^T x)\} - \beta w = 0    ----(40)
  – We solve this equation by Newton's method: w^+ = w - [\partial F / \partial w]^{-1} F(w).
  – The Jacobian matrix of F (the Hessian of the constrained objective) is
    \partial F / \partial w = E\{xx^T g'(w^T x)\} - \beta I    ----(41)

FastICA for one unit
• In order to simplify the inversion of this matrix, its first term is approximated:
  – since the data is sphered (whitened),
    E\{xx^T g'(w^T x)\} \approx E\{xx^T\} E\{g'(w^T x)\} = E\{g'(w^T x)\} I
• Thus the Jacobian matrix becomes diagonal and can easily be inverted.
• The Newton iteration can then be approximated as
    w^+ = w - \frac{E\{x g(w^T x)\} - \beta w}{E\{g'(w^T x)\} - \beta}    ----(42)
• Multiplying both sides by \beta - E\{g'(w^T x)\} and simplifying algebraically gives the FastICA iteration.

FastICA for one unit (cont.)
• Discussion:
  – Expectations must be replaced by estimates, i.e., sample means.
  – To compute the sample means, ideally all the available data should be used, but to limit the computational cost only a part of the samples, or a smaller sample size, is often used.
  – If convergence is not satisfactory, one may then increase the sample size.

FastICA for several units
• To estimate several independent components, we need to run the FastICA algorithm using several units, with weight vectors w_1, ..., w_n.
• To prevent different vectors from converging to the same maximum, we need to decorrelate the outputs w_1^T x, ..., w_n^T x after every iteration.
• A simple way of achieving decorrelation is to estimate the independent components one by one:
  – when p independent components, i.e., w_1, ..., w_p, have been estimated,
  – run the one-unit fixed-point algorithm for w_{p+1}, and
  – after every iteration subtract from w_{p+1} the "projections" (w_{p+1}^T C w_j) w_j, j = 1, ..., p, onto the previously estimated p vectors, and then renormalize w_{p+1}:

FastICA for several units
  1. Let w_{p+1} = w_{p+1} - \sum_{j=1}^{p} (w_{p+1}^T C w_j) w_j    ----(43)
  2. Let w_{p+1} = w_{p+1} / \sqrt{w_{p+1}^T C w_{p+1}}
• The covariance matrix C = E\{xx^T\} is equal to I if the data is sphered.
• A compact code sketch combining whitening, the one-unit iteration, and this deflation scheme is given below.

Applications of ICA - Finding Hidden Factors in Financial Data
• Some financial data, such as currency exchange rates or daily returns of stocks, may have common underlying factors.
• ICA might reveal driving mechanisms that otherwise remain hidden.
• In a recent study of a stock portfolio, it was found that ICA is a complementary tool to PCA, allowing the underlying structure of the data to be more readily observed.
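As referenced above, here is a compact sketch (my own simplified illustration, not the reference FastICA implementation) of deflationary FastICA on whitened data, using the contrast G(u) = log cosh(u) with a_1 = 1, so g(u) = tanh(u) and g'(u) = 1 - tanh^2(u). It assumes the hypothetical `whiten` helper from the earlier sketch.

```python
import numpy as np

def fastica(x_white: np.ndarray, n_components: int,
            max_iter: int = 200, tol: float = 1e-6, seed: int = 0) -> np.ndarray:
    """Deflationary FastICA sketch on whitened data x_white (n_signals x n_samples)."""
    rng = np.random.default_rng(seed)
    n, _ = x_white.shape
    W = np.zeros((n_components, n))
    for p in range(n_components):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ x_white                                    # y = w^T x
            g = np.tanh(y)
            g_prime = 1.0 - g ** 2
            # One-unit fixed-point step: w+ = E{x g(w^T x)} - E{g'(w^T x)} w
            w_new = (x_white * g).mean(axis=1) - g_prime.mean() * w
            # Deflation, Eq. (43) with C = I for sphered data, then renormalize.
            w_new -= W[:p].T @ (W[:p] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol        # sign-insensitive test
            w = w_new
            if converged:
                break
        W[p] = w
    return W   # rows are the estimated unmixing directions

# Usage (with the `whiten` helper sketched earlier):
#   W = fastica(whiten(x), n_components=2)
#   s_est = W @ whiten(x)   # sources recovered up to order and sign
```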
Term Project
• Use PCA, JADE, and FastICA to analyze Taiwan stock returns for underlying factors.
• JADE and FastICA packages can be found by searching on the WWW.
• Data are available at the course web site.
• Due: