
Independent Component
Analysis
Reference:
Independent Component Analysis: A Tutorial
by Aapo Hyvarinen, http://www.cis.hut.fi/projects/ica
1
Motivation of ICA
The Cocktail-Party Problem
• At a party, three people at different locations are speaking at the same time (S)
• Their voices are mixed together, so we cannot tell who said what
• Three microphones placed at different locations record the sound in the room (X)
• Can the recorded microphone signals (X) be separated back into the three
original speech signals (S)?
Demo
2
Formulation of ICA
• two speech signals s1(t) and s2(t) are received by
two microphones; the mixed signals are x1(t)
and x2(t):
x1(t) = a11 s1 + a12 s2 -----(1)
x2(t) = a21 s1 + a22 s2 -----(2)
• It will be very useful if we could estimate the
original signals s1(t) and s2(t), from only the
recorded signals x1(t) and x2(t)
3
Formulation of ICA
• Suppose the aij’s are known; then solving the linear
Equations 1 and 2 retrieves s1(t) and s2(t)
• the problem is that we do not know the aij’s.
• One approach is to use some information on the
statistical properties of the signals si(t) to estimate the aij’s
• Assume s1(t) and s2(t) are statistically independent;
then Independent Component Analysis techniques
can retrieve s1(t) and s2(t) from the mixtures x1(t)
and x2(t).
4
[Figure: the original signals s1(t), s2(t); the mixture signals x1(t), x2(t); and the recovered signals for s1(t), s2(t)]
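A minimal sketch of this demo in Python, assuming NumPy and scikit-learn (sklearn.decomposition.FastICA, the algorithm presented later in these slides) are available; the two source signals and the mixing matrix below are illustrative choices, not taken from the slides:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                      # first "speaker": a sinusoid
s2 = np.sign(np.sin(3 * t))             # second "speaker": a square wave
S = np.c_[s1, s2]                       # original signals, shape (T, 2)

A = np.array([[1.0, 0.5],               # hypothetical mixing matrix
              [0.6, 1.0]])
X = S @ A.T                             # "microphone" recordings, x = As

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)            # recovered signals (up to order/scale/sign)

# each original signal should correlate strongly with exactly one estimate
print(np.round(np.abs(np.corrcoef(S.T, S_est.T))[:2, 2:], 2))
```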
5
Definition of ICA
• For n linear mixtures x1, …,xn from n independent
components
xj(t) = aj1 s1 + aj2 s2 + ... + ajn sn, for all j -----(3)
x = As -----(4)
x = Σi=1..n ai si -----(5)
• The independent components si are latent variables,
meaning that they cannot be directly observed, and the
mixing matrix A is assumed to be unknown.
• We would like to estimate both A and s using only the
observable random vector x and some statistical
assumptions
6
Definition of ICA
[Diagram: s → (mixing matrix A) → x → (separating matrix B) → y = s]
• x = As ; y = Bx ; y is a copy of s
• If C is non-mixing then y=Cs is a copy of s
• A square matrix is said to be non-mixing if
it has one and only one non-zero entry in
each row and each column
7
Illustration of ICA
• We use two independent components with the following
uniform distributions to illustrate the ICA model:
 1
, if si  3

p( si )   2 3
        (7 )
0 ,
otherwise

– The distribution has zero mean and the variance equal to one
– Let us mixing these two independent components with the
following mixing matrix A   1 2 
0
  1 2
– This gives us two mixed variable x1 and x2.
– The mixed data has a uniform distribution on a parallelogram.
– But x1 and x2 are not independent any more. Since when x1 attains
to its maximum, or minimum, then this also determine the value of
x2
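A small NumPy sketch of this illustration; the 2×2 mixing matrix used here is an illustrative choice (the slide's exact entries are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# uniform sources on [-sqrt(3), sqrt(3)]: zero mean, unit variance (Eq. 7)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))

A = np.array([[1.0, 2.0],      # hypothetical 2x2 mixing matrix
              [0.0, 1.0]])
X = S @ A.T                    # mixtures; their joint density fills a parallelogram

print(np.round(S.var(axis=0), 2))   # ~[1, 1]: unit-variance sources
print(np.round(np.cov(X.T), 2))     # nonzero off-diagonal: x1, x2 are correlated
i = np.argmax(X[:, 0])              # where x1 is (nearly) maximal ...
print(np.round(X[i], 2))            # ... x2 is essentially determined as well
```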
8
Illustration of ICA
Fig 5. Joint density of the original signals s1 and s2
Fig 6. Joint density of the observed mixtures x1 and x2
9
Illustration of ICA
• The problem of estimating the data model of ICA is now to
estimate the mixing matrix A0 using only information
contained in the mixtures x1 and x2 .
• Fig 6 suggests an intuitive way of estimating A:
• the edges of the parallelogram are in the directions of the
columns of A. That is, we could estimate the ICA model by
– first estimating the joint density of x1 and x2, and then
– locating the edges.
• However, this only works for random variables with
uniform distributions
• We need a method that works for any type of distribution
10
Ambiguities of ICA
Because y = Bx is just a copy of s:
– we cannot determine the variances (energies) of
the independent components.
– we cannot determine the order of the independent
components.
• applying a permutation matrix P to x = As, i.e., x = AP^-1 Ps,
leaves Ps still like the original signals, and
• AP^-1 is just a new unknown mixing matrix, to be solved
by the ICA algorithms,
• so the order of s may be changed.
11
Properties of ICA
Independence
• the variables y1 and y2 are said to be independent if
information on the value of y1 does not give any
information on the value of y2, and vice versa.
• Let p(y1, y2) be the joint probability density function (pdf)
of y1 and y2, and let p1(y1) be the marginal pdf of y1:
p1(y1) = ∫ p(y1, y2) dy2 -----(9)
• y1 and y2 are independent if and only if the joint pdf is
factorizable:
p(y1, y2) = p1(y1) p2(y2) -----(10)
• Thus, given any two functions h1 and h2, we always have
E{h1(y1) h2(y2)} = E{h1(y1)} E{h2(y2)} -----(11)
12
Properties of ICA
Uncorrelated variables are only partly
independent
– Two variables y1 and y2 are said to be
uncorrelated if their covariance is zero:
E{y1 y2} − E{y1} E{y2} = 0 -----(13)
– If the variables are independent, they are
uncorrelated, but the converse is not true!
• For example: sin(x) and cos(x) both depend on x, so they
are not independent, but cov(sin(x), cos(x)) = 0 when x is
uniformly distributed on [0, 2π)
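A numerical illustration of this example, assuming NumPy and x uniform on [0, 2π):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 2.0 * np.pi, 500_000)
s, c = np.sin(x), np.cos(x)

print(round(np.cov(s, c)[0, 1], 4))    # ~0: sin(x) and cos(x) are uncorrelated
# but they are not independent: Eq. (11) fails for h1(y) = h2(y) = y^2
print(round(np.mean(s**2 * c**2) - np.mean(s**2) * np.mean(c**2), 4))  # ~ -0.125, not 0
```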
13
Gaussian variables are forbidden
• The fundamental restriction in ICA is that the independent
components must be nongaussian for ICA to be possible
• assume the mixing matrix is orthogonal and si are gaussian,
then x1 and x2 are gaussian, uncorrelated, and of unit
variance.
• The joint pdf is then
p(x1, x2) = (1/(2π)) exp(−(x1^2 + x2^2)/2) -----(15)
• the distribution is completely symmetric (shown in the figure
on the next page); it does not contain any information on the
directions of the columns of the mixing matrix A.
• Thus A cannot be estimated
14
Fig 7. Multivariate distribution of two
independent gaussian variables
15
ICA Basic
• source separation by ICA must go beyond
second-order statistics
• when any time structure is ignored, the
information contained in the data is
exhaustively represented by the sample
distribution of the observed vector
• source separation can then be obtained by
optimizing a ‘contrast function’
– i.e., a function that measures independence.
16
Measures of independence
• Nongaussian is independent
– The key to estimating the ICA model is nongaussianity
– The central limit theorem (CLT) tells us that the
distribution of a sum of independent random variables
tends toward a gaussian distribution. In other words,
– a mixture of two independent signals usually has a
distribution that is closer to gaussian than either of the
original signals
• Suppose we want to estimate y, one of the independent
components of s, from x
• let us denote this by y = wTx = Σi wi xi, where w is a vector to be
determined
• How can we use the CLT to determine w so that it equals
one of the rows of the inverse of A?
17
Nongaussian is independent
• let us make a change of variables, z = ATw
• then we have y = wTx = wTAs = zTs = Σi zi si
• thus y = zTs, a sum of the si, is (by the CLT) more gaussian
than the original variables si
• y becomes least gaussian when it equals one
of the si, which happens exactly when
• only one of the elements zi of z is nonzero
• Maximizing the nongaussianity of wTx therefore
gives us one of the independent components.
18
Measures of nongaussianity
• To use nongaussianity in ICA, we must have a
quantitative measure of the nongaussianity of a
random variable y
Kurtosis
• the classical measure of nongaussianity is kurtosis, or the
fourth-order cumulant:
kurt(y) = E{y^4} − 3 (E{y^2})^2 -----(16)
• Assume y is of unit variance; then kurt(y) = E{y^4} − 3.
• Kurtosis is thus simply a normalized version of the fourth moment E{y^4}
• For a gaussian y, the fourth moment equals 3 (E{y^2})^2
• thus, kurtosis is zero for a gaussian random variable.
19
Kurtosis
• Kurtosis can be both positive and negative
• RVs with negative kurtosis are called subgaussian
• subgaussian RVs typically have a flat pdf, which is rather
constant near zero and very small for larger values
– the uniform distribution is a typical example of a subgaussian distribution
• supergaussian RVs have a spiky pdf with heavy tails
– the Laplace distribution is a typical example of a supergaussian distribution:
p(y) = (1/√2) exp(−√2 |y|) -----(17)
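A sample-based check of these claims, assuming NumPy; kurt() implements Eq. (16) on centered samples:

```python
import numpy as np

def kurt(y):
    y = y - y.mean()                       # Eq. (16) on centered samples
    return np.mean(y**4) - 3.0 * np.mean(y**2)**2

rng = np.random.default_rng(4)
n = 1_000_000
print(round(kurt(rng.standard_normal(n)), 2))                   # ~ 0    (gaussian)
print(round(kurt(rng.uniform(-np.sqrt(3), np.sqrt(3), n)), 2))  # ~ -1.2 (subgaussian, uniform)
print(round(kurt(rng.laplace(scale=1/np.sqrt(2), size=n)), 2))  # ~ +3   (supergaussian, Laplace)
```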
20
Kurtosis (c)
• Typically nongaussianity is measured by the absolute value
of kurtosis.
• Kurtosis can be estimated by using the fourth moments of
the sample data
• if x1 and x2 are two independent RVs, it holds that
kurt(x1 + x2) = kurt(x1) + kurt(x2) -----(18)
kurt(α x1) = α^4 kurt(x1) -----(19)
• To illustrate with a simple example what the optimization landscape
for kurtosis looks like, let us look at a 2-d model x = As.
• We seek one of the independent components as y = wTx
• let z = ATw; then y = wTx = wTAs = zTs = z1 s1 + z2 s2
21
Kurtosis (c)
• Using the additive property of kurtosis, we have
kurt(y) = kurt(z1 s1) + kurt(z2 s2) = z1^4 kurt(s1) + z2^4 kurt(s2)
• let us apply the constraint that the variance of y is equal to
1; this is the same assumption made concerning s1 and s2.
• Thus E{y^2} = z1^2 + z2^2 = 1, which means that the vector z is
constrained to the unit circle on the 2-d plane.
• The optimization problem becomes: what are the maxima of the
function |kurt(y)| = |z1^4 kurt(s1) + z2^4 kurt(s2)| on the unit
circle?
• The maxima are at the points where exactly one element of z is ±1 and the other is zero, i.e., z = (±1, 0) or (0, ±1).
• These points correspond to y equal to ±si for one of the independent components.
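A small NumPy sketch of this optimization landscape; the kurtosis values of the two sources are illustrative (two unit-variance uniform sources):

```python
import numpy as np

k1, k2 = -1.2, -1.2                          # example: kurt(s1), kurt(s2) for uniform sources
theta = np.linspace(0.0, 2.0 * np.pi, 10_000)
z1, z2 = np.cos(theta), np.sin(theta)        # z constrained to the unit circle

objective = np.abs(z1**4 * k1 + z2**4 * k2)  # |kurt(y)| as a function of z
best = np.argmax(objective)
print(np.round([z1[best], z2[best]], 3))     # maximum at (+-1, 0) or (0, +-1)
```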
22
Kurtosis (c)
• In practice we could start from some weight vector w, compute the
direction in which the kurtosis of y = wTx grows or decreases
most strongly based on the available
sample x(1),…, x(T) of the mixture vector x, and use a
gradient method to find a new vector w.
• However, kurtosis has some drawbacks:
• the main problem is that kurtosis can be very sensitive to
outliers; in other words, kurtosis is not a robust measure of
nongaussianity.
• In the following sections, we introduce
negentropy, whose properties are rather opposite to those
of kurtosis.
23
Negentropy
• Negentropy is based on the information-theoretic entropy.
• The entropy of a RV is a measure of the degree of
randomness of the observed variables.
• The more unpredictable and unstructured the variable is,
the larger is its entropy.
• Entropy is defined for a RV Y as:
H(Y) = −Σi P(Y = ai) log P(Y = ai) for a discrete RV -----(20)
H(y) = −∫ f(y) log f(y) dy for a continuous RV y with density f(y) -----(21)
• A fundamental result of information theory is that a gaussian
variable has the largest entropy among all random
variables of equal variance.
• Thus, entropy can be used to measure nongaussianity.
24
Negentropy
• To obtain a measure of nongaussianity that is zero for a
gaussian variable and always nonnegative, one often uses
negentropy J, which is defined as:
J(y) = H(ygauss) − H(y) -----(22)
• where ygauss is a gaussian RV with the same covariance
matrix as y.
• the advantage of using negentropy is that it is in some sense
the statistically optimal estimator of nongaussianity.
• The problem with using negentropy is that it is still
computationally very difficult to estimate.
• Thus simpler approximations of negentropy seem
necessary and useful.
25
Approximations of negentropy
• The classical method of approximating negentropy uses
higher-order moments, for example:
J(y) ≈ (1/12) E{y^3}^2 + (1/48) kurt(y)^2 -----(23)
• The RV y is assumed to be of zero mean and unit variance.
• This approximation still suffers from the same nonrobustness as
kurtosis
• Another approximation was developed based on the
maximum-entropy principle:
J(y) ≈ c [E{G(y)} − E{G(ν)}]^2 -----(25)
• where ν is a gaussian variable of zero mean and unit
variance, and G is a nonquadratic function
26
Approximations of negentropy
• Taking G(y) = y^4, (25) essentially reduces to the kurtosis-based approximation (23)
• suppose instead G is chosen to be slowly growing, as in the following
contrast functions:
G1(u) = (1/a1) log cosh(a1 u),  G2(u) = −exp(−u^2/2) -----(26)
where 1 ≤ a1 ≤ 2.
• This approximation is conceptually simple, fast to
compute, and especially robust.
• A practical algorithm based on these contrast functions
will be presented in Section 6
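A small NumPy sketch of the contrast functions in (26) and the approximation (25); the proportionality constant c is unspecified in the slides and is taken as 1 here, and ν is approximated by a large gaussian sample:

```python
import numpy as np

def G1(u, a1=1.0):
    return np.log(np.cosh(a1 * u)) / a1        # Eq. (26), first contrast function

def G2(u):
    return -np.exp(-u**2 / 2.0)                # Eq. (26), second contrast function

def negentropy_approx(y, G=G1, c=1.0, seed=0):
    y = (y - y.mean()) / y.std()               # zero mean, unit variance
    nu = np.random.default_rng(seed).standard_normal(y.size)   # gaussian reference
    return c * (np.mean(G(y)) - np.mean(G(nu)))**2             # Eq. (25)

rng = np.random.default_rng(5)
print(negentropy_approx(rng.standard_normal(200_000)))   # ~ 0 for a gaussian variable
print(negentropy_approx(rng.laplace(size=200_000)))      # clearly > 0 for a nongaussian one
```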
27
Preprocessing - centering
• Some preprocessing techniques make the problem of ICA
estimation simpler and better conditioned.
Centering
• Center the variable x, i.e., subtract its mean vector m = E{x},
so as to make x a zero-mean variable.
• This preprocessing is done solely to simplify the ICA
algorithms
• After estimating the mixing matrix A with the centered data,
we can complete the estimation by adding the mean
vector of s back to the centered estimates of s.
• the mean vector of s is given by A^-1 m, where m is the mean
vector that was subtracted in the preprocessing
28
Preprocessing - whitening
• Another preprocessing is to whiten the observed
variables.
• Whitening means to transform the variable x
linearly so that the new variable x~ is white, i.e.,
its components are uncorrelated and their
variances equal unity:
E{x~ x~T} = I -----(33)
• In other words, the variable x~ is white means that the
covariance matrix of x~ equals the identity matrix; e.g., for a
3-dimensional x~:
[ Cov(1,1) Cov(1,2) Cov(1,3) ]
[ Cov(2,1) Cov(2,2) Cov(2,3) ] = I
[ Cov(3,1) Cov(3,2) Cov(3,3) ]
29
Preprocessing - whitening
• The correlation ρ between two variables x and y is
ρ(x, y) = Cov(x, y) / (σx σy)
• The covariance between x and y is
Cov(x, y) = (1/n) Σi (xi − μx)(yi − μy)
• The covariance Cov(x, y) can also be computed as
Cov(x, y) = (1/n) Σi xi yi − [(1/n) Σi xi][(1/n) Σi yi] = E[xy] − E[x]E[y]
• If two variables are uncorrelated, then ρ(x, y) = Cov(x, y) = 0
• Covariance matrix = I means that Cov(xi, xj) = 0 whenever i ≠ j
• so if a variable’s covariance matrix is the identity, its
components are uncorrelated.
30
Preprocessing - whitening
• Although uncorrelated variables are only
partly independent, decorrelation (using
second-order information) can be used to
reduce the problem to a simpler form.
• The unwhitened mixing matrix A has n^2 free parameters,
but the whitened mixing matrix A~ is orthogonal and has
only about half as many (n(n−1)/2) parameters
31
Fig 10. The joint distribution of the whitened mixtures (compare Figs 5 and 6):
the data in Fig 6 has been whitened, and the square defining the distribution is
clearly a rotated version of the original square in Fig 5. All that is left
is the estimation of a single angle that gives the rotation.
32
Preprocessing - whitening
• Whitening can be computed by eigenvalue
decomposition (EVD) of the covariance matrix
E{xxT}=EDET
– E is the orthogonal matrix of eigenvectors of
E{xxT}
– D is a diagonal matrix of its eigenvalues,
D=diag(d1,…,dn)
– note that E{xxT} can be estimated in a standard
way from the available sample of x(1), …, x(T).
33
Preprocessing - whitening
• Whitening can now be computed by
x~ = E D^(-1/2) ET x -----(34)
• where D^(-1/2) = diag(d1^(-1/2), …, dn^(-1/2)).
• It is easy to show that E{x~ x~T} = I, using (34) and E{xxT} = E D ET.
• Since x = As, whitening transforms the mixing matrix into a new matrix A~:
x~ = E D^(-1/2) ET A s = A~ s -----(35)
• Since E{x~ x~T} = A~ E{ssT} A~T = A~ A~T = I -----(36)
• the new mixing matrix A~ is orthogonal
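A NumPy sketch of whitening by EVD, following Eqs. (34)–(36); the sources and the mixing matrix here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10_000))  # unit-variance sources
A = np.array([[1.0, 0.7],                                    # hypothetical mixing matrix
              [0.3, 1.0]])
X = A @ S                                                    # mixtures, shape (2, T)

X = X - X.mean(axis=1, keepdims=True)        # centering
C = (X @ X.T) / X.shape[1]                   # sample estimate of E{xx^T}
d, E = np.linalg.eigh(C)                     # eigenvalues d, eigenvector matrix E
V = E @ np.diag(d**-0.5) @ E.T               # whitening matrix E D^(-1/2) E^T
X_white = V @ X                              # Eq. (34)

print(np.round((X_white @ X_white.T) / X.shape[1], 2))  # ~ I, Eq. (33)
A_tilde = V @ A                                          # Eq. (35)
print(np.round(A_tilde @ A_tilde.T, 2))                  # ~ I: the new mixing matrix is (nearly) orthogonal
```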
34
The FastICA Algorithm
- FastICA for one unit
• The FastICA learning rule finds a direction, i.e., a unit
vector w such that the projection wTx maximizes
nongaussianity, which is measured by the approximation
of negentropy J(wTx).
J ( y )  c [E{G(y)}  E{G(v)}]2  (25)
• The variance of y=wTx must be constained to unity, for the
whitened data, this is equivalent to constraining the norm
of w to unity, i.e., E{(wTx)}2}=||w||2 =1.
• In the following algorithm, g denotes the derivative of the
derivative of the nonquadratic function G.
35
FastICA for one unit
• The FastICA algorithm
1) choose an initial (e.g., random) weight vector w.
2) Let w+ = E{x g(wTx)} − E{g'(wTx)} w
3) Let w = w+ / ||w+|| (the normalization improves stability)
4) if not converged, go back to 2.
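A minimal NumPy sketch of these steps, assuming centered and whitened data X (shape n × T) and the contrast g(u) = tanh(u), i.e. the derivative of G1 with a1 = 1:

```python
import numpy as np

def fastica_one_unit(X, max_iter=200, tol=1e-6, seed=0):
    """One-unit FastICA on centered, whitened data X of shape (n, T)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[0])
    w /= np.linalg.norm(w)                        # step 1: random unit vector
    for _ in range(max_iter):
        wx = w @ X                                # projections w^T x, shape (T,)
        g = np.tanh(wx)                           # g = G1'
        g_prime = 1.0 - g**2                      # g'
        w_new = (X * g).mean(axis=1) - g_prime.mean() * w   # step 2
        w_new /= np.linalg.norm(w_new)            # step 3: normalize
        if abs(w_new @ w) > 1.0 - tol:            # step 4: converged (up to sign)
            return w_new
        w = w_new
    return w

# usage sketch: with X_white whitened as on the previous slides,
# w = fastica_one_unit(X_white); w @ X_white is then one estimated
# independent component (up to sign and permutation).
```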
• The derivation is as follows:
– the optima of E{G(wTx)} under the constraint E{(wTx)^2} = ||w||^2 = 1
are obtained at points where
F(w) = E{x g(wTx)} − βw = 0 -----(40)
– We solve this equation by Newton's method: w+ = w − [∂F/∂w]^-1 F(w)
– The Jacobian matrix of F (the Hessian of the objective) is
∂F/∂w = ∂/∂w [E{x g(wTx)} − βw] = E{x xT g'(wTx)} − βI -----(41)
36
FastICA for one unit
• In order to simplify the inversion of the Jacobian matrix,
its first term is approximated as follows:
– since the data is sphered (whitened),
E{x xT g'(wTx)} ≈ E{x xT} E{g'(wTx)} = E{g'(wTx)} I
• thus the Jacobian matrix becomes diagonal, and it can be
easily inverted,
• and the Newton iteration can be approximated by
w+ = w − [E{x g(wTx)} − βw] / [E{g'(wTx)} − β] -----(42)
• Multiplying both sides by β − E{g'(wTx)} and simplifying
algebraically gives the FastICA iteration of step 2.
37
FastICA for one unit (c)
• Discussion:
– The expectations must be replaced by their estimates, which are
sample means
– to compute a sample mean, ideally all of the available data
should be used, but to limit the computational cost
only a part (a smaller sample) of the data may be used,
– If convergence is not satisfactory, one may then
increase the sample size.
38
FastICA for several units
• To estimate several independent components, we need to
run the FastICA algorithm using several units, with weight
vectors w1,…, wn.
• To prevent different vectors from converging to the same
maxima, we need to decorrelate the outputs w1Tx,…, wnTx
after every iteration.
• A simple way of achieving decorrelation is to estimate the
independent components one by one:
– when p independent components, i.e., w1,…, wp, have been estimated,
– run the one-unit fixed-point algorithm for wp+1, and
– subtract from wp+1 the “projections” (wp+1T C wj) wj, j = 1,…, p, onto the
previously estimated p vectors, and then renormalize wp+1, as in (43):
39
FastICA for several units
1.Let w p 1  w p 1   j 1 w
p
2.Let w p 1 
w p 1
w Tp 1Cw p 1
T
p 1
Cw j w j
         (43)
The covariance matrix C=E{xxT} is equal to I, if the data is
sphered.
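A NumPy sketch of this deflation step for whitened data (so C = I); W_prev holds the already-estimated weight vectors as rows:

```python
import numpy as np

def deflate(w_new, W_prev):
    """Decorrelate w_new against previously estimated rows W_prev (Eq. 43, C = I)."""
    for w_j in W_prev:
        w_new = w_new - (w_new @ w_j) * w_j       # 1. subtract projections
    return w_new / np.sqrt(w_new @ w_new)         # 2. renormalize

# usage sketch: after each one-unit update of w_{p+1},
# call w_p1 = deflate(w_p1, W_prev) before checking convergence.
```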
40
Applications of ICA
- Finding Hidden Factors in Financial Data
• Financial data such as currency exchange
rates or daily stock returns may have
some common underlying factors.
• ICA might reveal driving mechanisms that
otherwise remain hidden.
• In a recent study of a stock portfolio, it was found
that ICA is a complementary tool to PCA,
allowing the underlying structure of the data to be
more readily observed.
41
Term project
• Use PCA, JADE, and FastICA to analyze
Taiwan stock returns for underlying factors.
• JADE and FastICA packages can be found by
searching on the Web.
• Data are available at course web site.
• Due:
42