Principal Component Analysis (PCA) and Independent Component Analysis (ICA)
A brief tutorial by:
John Semmlow, Ph.D.
Robert Wood Johnson Medical School and Rutgers University, New Jersey, USA
Multivariate Analysis
Multivariate analysis is concerned with the analysis of
multiple variables (or measurements), but treating them
as a single entity; for example, variables from multiple
measurements made on the same process or system. In
multivariate analysis, these multiple variables are often
represented as a single matrix variable that includes the
different variables:
x = [x1(t), x2(t), ..., xM(t)]^T
In this case, x is composed of M variables each containing T
(t = 1,...T) observations. In signal processing, the
observations are time samples while in image processing
they are pixels.
• A major concern of multivariate analysis is
to find transformations of the multivariate
data that make the data set smaller or
easier to understand.
• For example, the relevant information may be contained in a multidimensional variable of fewer dimensions (i.e., fewer variables), and the reduced set of variables may be more meaningful than the original data set.
Multivariate Transformations
• In transformations that reduce the dimensionality of a multivariable data set, the idea is to transform one set of variables into a new set where some of the new variables have values that are quite small compared to the others. Since the values of these variables are relatively small, they contribute little information to the overall data set and, hence, can be eliminated.
– Evaluating the significance of a variable by the range of its values assumes that all the original variables have approximately the same range. If not, some form of normalization should be applied to the original data set, as sketched below.
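A minimal normalization sketch (illustrative, not from the tutorial) that standardizes each variable to zero mean and unit variance before the transformation:

% Illustrative sketch: standardize each variable (column) of a data matrix X
X  = randn(1000, 5) .* [1 10 0.1 5 2];   % placeholder data with very different ranges
Xn = (X - mean(X)) ./ std(X);            % zero mean, unit variance (implicit expansion, R2016b or later)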
Linear Transformations
A linear transformation can be represented mathematically as:

yi(t) = Σ_{j=1}^{M} wij xj(t),   for i = 1, ..., N

where wij are constant coefficients that define the transformation.
In matrix form:

[y1(t), y2(t), ..., yN(t)]^T = W [x1(t), x2(t), ..., xM(t)]^T

or, using linear algebra notation:

y = Wx
• A linear transformation can be interpreted as
a rotation (and possibly scaling) of the original
data set in M-dimensional space.
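As a simple illustration (a sketch, not part of the original tutorial), the lines below apply a 2-by-2 rotation matrix as W to a two-variable data set:

% Illustrative sketch: a linear transformation as a rotation of 2-D data
theta = pi/6;                                          % rotation angle (30 degrees)
W = [cos(theta) -sin(theta); sin(theta) cos(theta)];   % 2-by-2 rotation matrix
x = randn(2, 1000);                                    % two variables, 1000 observations each
y = W*x;                                               % rotated data set: y = Wx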
Principal Component Analysis (PCA)
The data set is transformed to produce a
new set of variables (termed the principal
components) that are uncorrelated. The
goal is to reduce the dimensionality of the
data, not necessarily to produce more
meaningful variables.
Independent Component Analysis (ICA)
The goal is a bit more ambitious: to find
new variables (components) that are both
statistically independent and non-Gaussian.
PCA operates by transforming a set of correlated
variables into a new set of uncorrelated variables that are
called the “principal components.”
• After transformation: principal components
• Uncorrelated data are not, in general, independent (except for jointly Gaussian variables).
[Figure: left, time plot of x1 and x2 vs. Time (sec); right, scatter plot of x2 vs. x1 with Covariance = 0.0.]
These two variables are uncorrelated but highly dependent as they were
generated from the equation for a circle (plus noise).
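The circle example can be reproduced with a few lines (an illustrative sketch, not the tutorial's own code); the off-diagonal covariance term comes out near zero even though x1 and x2 are completely dependent:

% Illustrative sketch: uncorrelated but dependent variables from a circle
t  = linspace(0, 1, 1000);                 % time (sec)
x1 = cos(2*pi*t) + 0.05*randn(1, 1000);    % horizontal coordinate plus noise
x2 = sin(2*pi*t) + 0.05*randn(1, 1000);    % vertical coordinate plus noise
C  = cov(x1, x2);                          % 2-by-2 covariance matrix
disp(C(1,2))                               % off-diagonal covariance is near zero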
To implement PCA we can use the covariance matrix.
The covariance matrix is defined by:

        | σ1,1  σ1,2  ...  σ1,M |
        | σ2,1  σ2,2  ...  σ2,M |
  S  =  |  ...   ...  ...  ...  |
        | σM,1  σM,2  ...  σM,M |
If we can rotate the data so that the off-diagonals are
zero, then the variables will be uncorrelated.
(A matrix that has zeros in the off-diagonals is termed a
“diagonal matrix.”)
• A well-known technique exists to reduce a matrix that is positive-definite (as is the covariance matrix) to a diagonal matrix by pre- and post-multiplication by an orthonormal matrix:
U'SU = D
where S is the M by M covariance matrix, D is a diagonal matrix, and U is an orthonormal matrix that performs the transformation.
The diagonal elements of D are the variances of the new data, more generally known as the characteristic roots, or eigenvalues, of S: λ1, λ2, ..., λM.
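In MATLAB this diagonalization can be sketched with the eig function (illustrative code, not the tutorial's own):

% Illustrative sketch: diagonalizing a covariance matrix
X = randn(1000, 5);            % placeholder data: 1000 observations of 5 variables
S = cov(X);                    % 5-by-5 covariance matrix
[U, D] = eig(S);               % U is orthonormal, D is diagonal
check  = U'*S*U;               % equals D (to within round-off)
lambda = diag(D);              % eigenvalues: variances of the rotated variables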
An alternative technique uses "Singular Value Decomposition" (SVD), which solves the equation:
X = U D^1/2 V'
This equation has a similar form to that shown previously. In the case of PCA, X is the data matrix that is decomposed into: D^1/2, a diagonal matrix that contains the singular values (the square roots of the eigenvalues); and V, the principal components matrix.
In MATLAB, singular value decomposition of a data array, X, uses:
[U,S,V] = svd(X);
where S is a diagonal matrix containing the singular values and V contains the principal components in columns. The eigenvalues are obtained by squaring the diagonal of S using the 'diag' command:
eigen = diag(S).^2;
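As a quick consistency check (an illustrative sketch, not from the tutorial), the eigenvalues obtained from the SVD of zero-mean data agree with those of the covariance matrix once the squared singular values are scaled by the number of observations:

% Illustrative sketch: SVD-based eigenvalues vs. covariance-matrix eigenvalues
X  = randn(1000, 5);                        % placeholder data
Xc = X - mean(X);                           % remove each variable's mean
[U, S, V] = svd(Xc, 0);                     % economy-size SVD
eig_svd = diag(S).^2 / (size(Xc,1) - 1);    % scaled squared singular values
eig_cov = sort(eig(cov(Xc)), 'descend');    % eigenvalues of the covariance matrix
% eig_svd and eig_cov match to within round-off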
Order Selection: How many different variables
are actually contained in the data set?
The eigenvalues describe how much of the variance is
accounted for by the associated principal component, and
when singular value decomposition is used, these
eigenvalues are ordered by size; that is:
λ1 > λ2 > λ3 ... > λM.
They can be very helpful in determining how many of the
components are really significant and how much the data
set can be reduced.
• If several eigenvalues are zero or 'close to' zero, then
the associated principal components contribute little to the
data and can be eliminated.
The eigenvalues can be used to determine the number
of separate variables present in a data set.
[Figure: Original Data Set, x(t) vs. Time (sec). This data set contains five variables, but in fact consists of only two variables (a sine and a sawtooth) plus noise, mixed together in different ways.]
The Scree Plot is a plot of each eigenvalue against its
number and can be useful in estimating how many
independent variables are actually present in the data.
[Figure: Scree Plot, Eigenvalues vs. N. This is the Scree Plot obtained from the previous data set. The sharp break at 2 indicates that only two variables are present in the data.]
MATLAB Code to calculate the Principal
Components and Eigenvalues, and to output the
Scree Plot
% Find the principal components (D is the data matrix, N the number of observations)
[U,S,pc] = svd(D,0);        % Singular value decomposition; the columns of pc
                            % contain the principal components
eigen = diag(S).^2;         % Calculate eigenvalues from the singular values
eigen = eigen/N;            % Scale eigenvalues to equal variances
plot(eigen);                % Plot the Scree Plot
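For reference, a data set like the one in this example can be generated with a short sketch (illustrative only; the exact signals and mixing weights used in the tutorial are not given), after which the code above can be applied to D and N:

% Illustrative sketch: five mixtures of a sine and a sawtooth plus noise
N   = 1000;                              % number of observations
t   = (0:N-1)/N;                         % time vector (sec)
src = [sin(2*pi*5*t); 2*mod(5*t,1)-1];   % 2-by-N sources: a sine and a sawtooth
A   = rand(5,2);                         % arbitrary 5-by-2 mixing matrix
D   = (A*src)' + 0.1*randn(N,5);         % N-by-5 data matrix: mixtures plus noise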
Even though the principal components are uncorrelated
they are not independent.
[Figure: the first two (dominant) principal components, CP 1 and CP 2 (left), and the original components (right), both plotted against Time (msec). Even though the first two principal components are uncorrelated, and contain most of the information, they are not independent since they are still mixtures of the two independent components.]
Independent Component Analysis
• The motivation for this transformation is primarily to uncover more meaningful variables, not to reduce the dimensions of the data set.
• When data set reduction is also desired, it is usually accomplished by preprocessing the data set using PCA.
• ICA seeks to transform the original data set into a number of independent variables.
The basis of most ICA approaches is a generative model;
that is, a model that describes how the measured
signals are produced. The model assumes that the
measured signals are the product of instantaneous
linear combinations of the independent sources:
xi(t) = ai1 s1(t) + ai2 s2(t) + ... + aiN sN(t)
for i = 1, ..., N
or in matrix form as:

[x1(t), x2(t), ..., xN(t)]^T = A [s1(t), s2(t), ..., sN(t)]^T

or simply:
x = As
where A is known as the "mixing" matrix.
If A is the mixing matrix, then the unknown (hidden) independent variables, s, can be obtained from the "unmixing matrix," A^-1:
s = A^-1 x
Since A is unknown, we cannot find A^-1 directly.
To find A^-1, we use optimization techniques (trial and error) to find an A^-1 that maximizes the independence of s.
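The generative model and its inversion can be illustrated with a short sketch (not from the tutorial); here A is known, so its inverse recovers the sources exactly, whereas in practice A must be estimated:

% Illustrative sketch: mix two known sources, then recover them with the inverse of A
N = 1000;  t = (0:N-1)/N;
s = [sin(2*pi*4*t); sign(sin(2*pi*7*t))];   % 2-by-N independent sources
A = [0.8 0.4; 0.3 0.9];                     % known 2-by-2 mixing matrix
x = A*s;                                    % observed mixtures: x = A*s
s_recovered = A\x;                          % unmixing with the true inverse: s = A^-1 * x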
ICA analysis then becomes a problem of finding an approach that measures the independence of the new data set s.
One measure of the independence of s is "nongaussianity": how different the variables of s are from a Gaussian distribution.
Mixtures of non-Gaussian signals are more like Gaussian signals than non-mixtures.
(The Central Limit Theorem at work!)
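One common nongaussianity measure is kurtosis, whose 'excess' form is zero for a Gaussian. The sketch below (illustrative, not the tutorial's code) shows that a mixture of two non-Gaussian sources has an excess kurtosis closer to zero, i.e. it is more Gaussian, than either source:

% Illustrative sketch: a mixture is more Gaussian than its non-Gaussian sources
N  = 100000;
s1 = sign(randn(1, N));                     % binary source, unit variance (excess kurtosis -2)
s2 = sqrt(12)*(rand(1, N) - 0.5);           % uniform source, unit variance (excess kurtosis -1.2)
m  = 0.6*s1 + 0.8*s2;                       % unit-variance mixture of the two
exkurt = @(v) mean((v - mean(v)).^4) / mean((v - mean(v)).^2)^2 - 3;
disp([exkurt(s1), exkurt(s2), exkurt(m)])   % the mixture's value is closest to zero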
[Figure: probability densities P(x) vs. x for (A) a Gaussian distribution, (B) a sinusoidal distribution, (C) a double sinusoidal distribution, and (D) a quad sinusoidal distribution.]
This plot shows the signal mixtures on the left and the
corresponding joint density plot on the right.
The plot on the right is the scatter plot of the two variables x. The
marginal densities are also shown at the edge of the plot.
A first step in many ICA algorithms is to
whiten (sphere) the data.
[Figure: left, scatter plot of x2 vs. x1 before whitening (correlation = 0.77358); right, scatter plot after whitening.]
Data that has been whitened is uncorrelated (as are the principal components) but, in addition, all of the variables have variances of one. A 3-variable data set that has been whitened would have a spherical shape, hence the term "sphering the data."
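Whitening can be sketched with the eigen-decomposition of the covariance matrix (illustrative code, not the tutorial's own):

% Illustrative sketch: whitening (sphering) a two-variable data set
N = 1000;
x = [1 0.9; 0.4 1]*randn(2, N);             % correlated two-variable data, 2-by-N
x = x - mean(x, 2);                         % remove each variable's mean
[E, D] = eig(cov(x'));                      % eigen-decomposition of the covariance matrix
z = diag(1./sqrt(diag(D)))*E'*x;            % whitened data
disp(cov(z'))                               % approximately the 2-by-2 identity matrix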
The figure below shows the signals, x, and the
joint density p(x) after the two-variable data set
has been whitened. (Note that the distributions are
already less Gaussian.)
After sphering, the separated signals can be found by an
orthogonal transformation of the whitened signals x (this is
simply a rotation of the joint density).
The appropriate rotation is sought by maximizing the nongaussianity of the
marginal densities (shown on the edges of the density plot). This is because
a linear mixture of independent random variables is necessarily
more Gaussian than the original variables. This implies that in ICA we must
restrict ourselves to at most one Gaussian source signal.
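The rotation search can be illustrated by a brute-force sketch (an illustration of the idea only, not the FastICA algorithm): scan a single rotation angle of whitened two-variable data and keep the angle that makes the marginals most nongaussian (largest summed kurtosis magnitude):

% Illustrative sketch: rotate whitened 2-D data to maximize nongaussianity
N = 5000;  t = (0:N-1)/N;
s = [sign(randn(1,N)); sqrt(12)*(rand(1,N)-0.5)];          % two non-Gaussian, unit-variance sources
x = [0.7 0.5; 0.2 0.9]*s;  x = x - mean(x,2);              % mixed, zero-mean signals
[E, D] = eig(cov(x'));  z = diag(1./sqrt(diag(D)))*E'*x;   % whiten the mixtures
exkurt = @(v) mean((v-mean(v)).^4)/mean((v-mean(v)).^2)^2 - 3;
best = -inf;
for theta = 0:0.01:pi                                      % scan rotation angles
    y = [cos(theta) -sin(theta); sin(theta) cos(theta)]*z;
    J = abs(exkurt(y(1,:))) + abs(exkurt(y(2,:)));         % nongaussianity of the marginals
    if J > best, best = J; s_est = y; end                  % keep the most nongaussian rotation
end
% s_est estimates the independent components (up to order, sign, and scale)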
The plot below shows the result after one step of the FastICA algorithm.
The next few slides show subsequent rotations.
The source signals (components of s) in this example were
a sinusoid and noise, as can be seen in the left part of the
plot below.
Note the non-Gaussian appearance of the probability distribution function.
Example application using MATLAB and the Jade
algorithm.
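A call to a JADE implementation might look like the sketch below. This assumes Cardoso's jadeR.m is on the MATLAB path; the function name and its unmixing-matrix output are assumptions about that particular distribution, not something defined in this tutorial:

% Illustrative sketch: un-mixing with JADE (assumes jadeR.m is available)
t = (0:999)/1000;
src = [sin(2*pi*4*t); sign(sin(2*pi*7*t)); randn(1,1000)];   % three sources: sine, square, noise
X = rand(5,3)*src + 0.05*randn(5,1000);                      % five noisy mixtures, 5-by-1000
B = jadeR(X, 3);                                             % assumed call: 3-by-5 unmixing matrix
S_est = B*X;                                                 % rows are the estimated components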
[Figure: Original Components before mixing, s(t) vs. Time (sec) (left), and the Mixed Signals, X(t) vs. Time (sec) (right). The three noisy signals on the left were mixed together five different ways to produce the signals on the right.]
ICA accurately un-mixes the three signals even in
the presence of a small amount of noise.
The Scree Plot of the
five mixed signals
indicates that only
three separate
signals are present.
[Figure: left, Scree Plot of the five mixed signals, Eigenvalue vs. N; right, the recovered Independent Components, X(t) vs. Time (sec).]