CS 476: Networks of Neural Computation
WK9 – Principal Component Analysis
Dr. Stathis Kasderidis
Dept. of Computer Science
University of Crete
Spring Semester, 2009
Contents
•Introduction to Principal Component Analysis
•Generalised Hebbian Algorithm
•Adaptive Principal Components Extraction
•Kernel Principal Components Analysis
•Conclusions
Principal Component Analysis
•The PCA method is a statistical method for Feature Selection and Dimensionality Reduction.
•Feature Selection is a process whereby a data space is transformed into a feature space. In principle, both spaces have the same dimensionality.
•However, in the PCA method, the transformation is designed in such a way that the data set is represented by a reduced number of “effective” features and yet retains most of the intrinsic information contained in the data; in other words, the data set undergoes a dimensionality reduction.
Principal Component Analysis-1
•Suppose that we have a vector x of dimension m and we wish to transmit it using l numbers, where l<m. If we simply truncate the vector x, we will cause a mean-square error equal to the sum of the variances of the elements eliminated from x.
•So, we ask: Does there exist an invertible linear
transformation T such that the truncation of Tx is
optimum in the mean-squared sense?
•Clearly, the transformation T should have the
property that some of its components have low
variance.
•Principal Component Analysis maximises the rate of decrease of variance and is the right choice.
Principal Component Analysis-2
•Before we present the neural-network, Hebbian-based algorithms that perform this task, we first present the statistical analysis of the problem.
•Let X be an m-dimensional random vector representing the environment of interest. We assume that the vector X has zero mean:
E[X] = 0
where E is the statistical expectation operator. If X does not have zero mean, we first subtract the mean from X before we proceed with the rest of the analysis.
Principal Component Analysis-3
•Let q denote a unit vector, also of dimension m, onto which the vector X is to be projected. This projection is defined by the inner product of the vectors X and q:
A = X^T q = q^T X
subject to the constraint:
||q|| = (q^T q)^½ = 1
•The projection A is a random variable with a mean and variance related to the statistics of the vector X. Assuming that X has zero mean, we can calculate the mean value of the projection A:
E[A] = q^T E[X] = 0
Principal Component Analysis-4
•The variance of A is therefore the same as its mean-square value, and so we can write:
σ² = E[A²] = E[(q^T X)(X^T q)] = q^T E[X X^T] q = q^T R q
•The m-by-m matrix R is the correlation matrix of the random vector X, formally defined as the expectation of the outer product of the vector X with itself, as shown:
R = E[X X^T]
•We observe that the matrix R is symmetric, which means that:
R^T = R
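To make the variance probe concrete, here is a minimal numerical sketch in NumPy (the synthetic data, sample size and variable names are my own, not from the lecture): we estimate R from zero-mean samples and check that q^T R q matches the variance of the projection A = q^T x over those samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples of a zero-mean random vector X of dimension m.
m, N = 5, 10000
X = rng.multivariate_normal(np.zeros(m),
                            np.diag([5.0, 3.0, 1.0, 0.5, 0.1]),
                            size=N)              # shape (N, m)
X -= X.mean(axis=0)                              # enforce zero mean, as assumed in the slides

# Sample estimate of the correlation matrix R = E[X X^T].
R = (X.T @ X) / N                                # m-by-m, symmetric

# Variance probe for a unit vector q: sigma^2 = q^T R q.
q = rng.standard_normal(m)
q /= np.linalg.norm(q)                           # ||q|| = 1
print(q @ R @ q)

# It agrees with the variance of the projection A = q^T x over the samples.
A = X @ q
print(A.var())                                   # (almost) the same number
```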
Principal Component Analysis-5
•From this property it follows that for any m-by-1 vectors a and b we have:
a^T R b = b^T R a
•From the above we see that the variance σ² of A is a function of the unit vector q; we can thus write:
ψ(q) = σ² = q^T R q
•From the above we can think of ψ(q) as a variance probe.
•The problem we have to solve is to find the unit vectors q along which ψ(q) has extremal values, subject to the constraint of unit length.
Principal Component Analysis-6
•If q is a unit vector such that the variance probe ψ(q) has an extremal value, then for any small perturbation δq of the unit vector q we find that, to the first order in δq:
ψ(q + δq) = ψ(q)
•Now, from the definition of the variance probe we have:
ψ(q + δq) = (q + δq)^T R (q + δq) = q^T R q + 2 (δq)^T R q + (δq)^T R δq
where in the previous line we have made use of the symmetry of the matrix R.
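As a quick numerical sanity check of this expansion (a sketch with arbitrary made-up numbers, not part of the lecture), the quadratic term (δq)^T R δq is second order in the perturbation and is indeed negligible for small δq:

```python
import numpy as np

rng = np.random.default_rng(1)

m = 4
M = rng.standard_normal((m, m))
R = M @ M.T                          # a symmetric stand-in for the correlation matrix R

q = rng.standard_normal(m)
q /= np.linalg.norm(q)               # unit vector q
dq = 1e-4 * rng.standard_normal(m)   # small perturbation (not necessarily admissible)

psi = lambda v: v @ R @ v            # variance probe

lhs = psi(q + dq)
rhs = psi(q) + 2 * (dq @ R @ q)      # first-order expansion, dropping (dq)^T R dq
print(lhs - rhs)                     # equals dq^T R dq: second order in the perturbation
```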
Principal Component Analysis-7
•Ignoring the second-order term (δq)^T R δq and invoking the definition of ψ(q), we may write:
ψ(q + δq) = q^T R q + 2 (δq)^T R q = ψ(q) + 2 (δq)^T R q
•The above relation implies that:
(δq)^T R q = 0
•Note that not just any perturbation δq of q is admissible; rather, we restrict ourselves to those perturbations for which the Euclidean norm of the perturbed vector q + δq remains equal to unity:
||q + δq|| = 1
or, equivalently: (q + δq)^T (q + δq) = 1
Principal Component Analysis-8
•Taking into account that q is already a vector of unit length, this means that, to the first order in δq:
(δq)^T q = 0
•This means that the perturbation δq must be orthogonal to q, and therefore only a small change in the direction of q is permitted.
•Combining the previous two equations we can now write:
(δq)^T R q - λ (δq)^T q = 0, or equivalently (δq)^T (R q - λ q) = 0
where λ is a scaling factor.
•We can now write:
R q = λ q
Principal Component Analysis-9
•This means that q is an eigenvector and λ is an eigenvalue of R.
•The matrix R has real, non-negative eigenvalues (because it is symmetric and positive semi-definite). Let the eigenvalues of matrix R be denoted by λ_i and the corresponding eigenvectors by q_i, where the eigenvalues are arranged in decreasing order:
λ_1 > λ_2 > … > λ_m
so that λ_1 = λ_max.
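In practice the eigenvalue problem R q = λ q is solved with a standard symmetric eigensolver. The sketch below (NumPy; the data and names are my own) sorts the eigenpairs into decreasing order of λ and verifies that each eigenvector satisfies R q_j = λ_j q_j and that its variance probe q_j^T R q_j equals λ_j.

```python
import numpy as np

rng = np.random.default_rng(2)

# Zero-mean data and its estimated correlation matrix R.
m, N = 5, 20000
X = rng.standard_normal((N, m)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)
R = (X.T @ X) / N

# Solve R q = lambda q with a symmetric eigensolver.
lam, Q = np.linalg.eigh(R)           # eigh returns ascending eigenvalues, orthonormal columns
order = np.argsort(lam)[::-1]        # rearrange so that lambda_1 >= lambda_2 >= ... >= lambda_m
lam, Q = lam[order], Q[:, order]

for j in range(m):
    q_j = Q[:, j]
    assert np.allclose(R @ q_j, lam[j] * q_j)   # R q_j = lambda_j q_j
    assert np.isclose(q_j @ R @ q_j, lam[j])    # the variance probe of q_j equals lambda_j
print(lam)                                      # real, non-negative, decreasing
```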
Principal Component Analysis-10
•We can then write the matrix R in terms of its eigenvalues and eigenvectors as:
R = Σ_{i=1}^{m} λ_i q_i q_i^T
•Combining the previous results we can see that the variance probes are the same as the eigenvalues:
ψ(q_j) = λ_j ,  for j=1,2,…,m
•To summarise the previous analysis, we have two important results:
•The eigenvectors of the correlation matrix R pertaining to the zero-mean random vector X define the unit vectors q_j, representing the principal directions along which the variance probes ψ(q_j) have their extremal values;
Principal Component Analysis-11
•The associated eigenvalues define the extremal values of the variance probes.
•We now want to investigate the representation of a data vector x which is a realisation of the random vector X.
•With m eigenvectors q_j we have m possible projection directions. The projections of x onto the eigenvectors are given by:
a_j = q_j^T x = x^T q_j ,  j=1,2,…,m
•The numbers a_j are called the principal components.
Principal Component Analysis-12
•To reconstruct the original vector x from the projections, we combine all the projections into a single vector:
a = [a_1, a_2,…, a_m]^T = [x^T q_1, x^T q_2,…, x^T q_m]^T = Q^T x
where Q is the matrix whose columns are the eigenvectors of R.
•From the above we see that:
x = Q a = Σ_{j=1}^{m} a_j q_j
•This is nothing more than a coordinate transformation from the input space of the vector x to the feature space of the vector a.
Principal Component Analysis-13
•From the perspective of pattern recognition, the usefulness of the PCA method is that it provides an effective technique for dimensionality reduction.
•In particular, we may reduce the number of features needed for effective data representation by discarding those linear combinations in the previous formula that have small variances and retaining only those terms that have large variances.
Principal Component Analysis-14
•Let λ_1, λ_2, …, λ_l denote the largest l eigenvalues of R. We may then approximate the vector x by truncating the previous sum to the first l terms:
x̂ = Σ_{j=1}^{l} a_j q_j = [q_1, q_2,…, q_l] [a_1, a_2,…, a_l]^T
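Putting the analysis together, here is a compact batch-PCA sketch in NumPy (the function name `pca_truncate` and the synthetic data are my own, for illustration only): estimate R, keep the l leading eigenvectors, form the principal components a = Q^T x, and reconstruct the truncated approximation x̂. The mean-square reconstruction error then equals the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_truncate(X, l):
    """Project zero-mean data X (N x m) onto its l leading principal directions
    and return the reconstruction x_hat along with the eigenpairs used."""
    N, m = X.shape
    R = (X.T @ X) / N                       # correlation matrix R = E[x x^T]
    lam, Q = np.linalg.eigh(R)              # R q = lambda q, ascending order
    order = np.argsort(lam)[::-1]
    lam, Q = lam[order], Q[:, order]        # lambda_1 >= ... >= lambda_m

    A = X @ Q[:, :l]                        # principal components a_j = q_j^T x
    X_hat = A @ Q[:, :l].T                  # x_hat = sum_{j=1..l} a_j q_j
    return X_hat, lam, Q

# Usage with synthetic zero-mean data.
rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 6)) @ np.diag([4.0, 2.5, 1.0, 0.3, 0.1, 0.05])
X -= X.mean(axis=0)

l = 2
X_hat, lam, Q = pca_truncate(X, l)

# The mean-square reconstruction error equals the sum of the discarded eigenvalues.
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lam[l:].sum())                   # approximately equal
```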
Generalised Hebbian Algorithm
•We now present a neural network method which solves the PCA problem. It belongs to the class of so-called re-estimation algorithms for PCA.
•The network which solves the problem is described below.
Generalised Hebbian Algorithm -1
•For this feedforward network we make two structural assumptions:
•Each neuron in the output layer of the network is linear;
•The network has m inputs and l outputs, both of which are specified. Moreover, the network has fewer outputs than inputs (i.e. l < m).
•It can be shown that, under these assumptions and by using a special form of Hebbian learning, the network truly learns to calculate the principal components in its output nodes.
•The GHA can be summarised as follows:
Generalised Hebbian Algorithm -2
1. Initialise the synaptic weights of the network, w_ji, to small random values at time n=1. Assign a small positive value to the learning-rate parameter η;
2. For n=1, j=1,2,…,l and i=1,2,…,m, compute:
y_j(n) = Σ_{i=1}^{m} w_ji(n) x_i(n)
Δw_ji(n) = η [ y_j(n) x_i(n) - y_j(n) Σ_{k=1}^{j} w_ki(n) y_k(n) ]
where x_i(n) is the ith component of the m-by-1 input vector x(n) and l is the desired number of principal components;
Generalised Hebbian Algorithm -3
3. Increment n by 1, go to step 2, and continue until the synaptic weights w_ji reach their steady-state values. For large n, the weight w_ji of neuron j converges to the ith component of the eigenvector associated with the jth eigenvalue of the correlation matrix of the input vector x(n). The output neurons represent the eigenvalues of the correlation matrix in decreasing order, from 1 to l.
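Below is a minimal sketch of the GHA update in NumPy (the function name, learning rate, epoch count and synthetic data are my own choices, not prescribed by the lecture). The weight matrix W is l-by-m, with row j holding w_j; the lower-triangular term implements the sum over k = 1,…,j in step 2.

```python
import numpy as np

def gha(X, l, eta=1e-3, epochs=200, seed=0):
    """Generalised Hebbian Algorithm sketch.
    X: zero-mean data of shape (N, m); l: number of principal components."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    W = 0.01 * rng.standard_normal((l, m))     # row j is the weight vector of output neuron j

    for _ in range(epochs):
        for x in X:
            y = W @ x                          # y_j(n) = sum_i w_ji(n) x_i(n)
            # Delta w_ji = eta * [ y_j x_i - y_j * sum_{k<=j} w_ki y_k ]
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# For large n the rows of W converge (up to sign) to the leading eigenvectors of R.
rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.4, 0.2])
X -= X.mean(axis=0)

W = gha(X, l=2)
lam, Q = np.linalg.eigh((X.T @ X) / len(X))
Q = Q[:, np.argsort(lam)[::-1]]
print(np.abs(np.round(W @ Q[:, :2], 2)))       # approximately the 2x2 identity (up to sign)
```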
Adaptive Principal Components Extraction
•Another algorithm for extracting the principal components is the adaptive principal components extraction (APEX) algorithm. This network uses both feedforward and feedback connections.
•The algorithm is iterative in nature: if we are given the first (j-1) principal components, the jth one can be easily computed.
•This algorithm belongs to the class of decorrelating algorithms.
•The network that implements the algorithm is described next:
Adaptive Principal Components Extraction-1
•The network structure is defined as follows:
•Each neuron in the output layer is assumed to be linear;
•Feedforward connections exist from the input nodes to each of the neurons 1,2,…,j, with j<m. The feedforward connections operate with a Hebbian learning rule; they are excitatory and therefore provide amplification. These connections are represented by the vector w_j(n).
Adaptive Principal Components Extraction-2
•Lateral connections exist from the individual outputs of neurons 1,2,…,j-1 to neuron j of the output layer, thereby applying feedback to the network. These connections are represented by the vector a_j(n). The lateral connections operate with an anti-Hebbian learning rule, which has the effect of making them inhibitory.
•The algorithm is summarised as follows:
Adaptive Principal Components Extraction-3
1. Initialise the feedforward weight vector w_j and the feedback weight vector a_j to small random values at time n=1, where j=1,2,…,m. Assign a small positive value to the learning-rate parameter η;
2. Set j=1, and for n=1,2,…, compute:
y_1(n) = w_1^T(n) x(n)
w_1(n+1) = w_1(n) + η [ y_1(n) x(n) - y_1²(n) w_1(n) ]
where x(n) is the input vector. For large n we have w_1(n) → q_1, where q_1 is the eigenvector associated with the largest eigenvalue λ_1 of the correlation matrix of x(n);
3. Set j=2, and for n=1,2,…, compute:
y_{j-1}(n) = [y_1(n), y_2(n),…, y_{j-1}(n)]^T
y_j(n) = w_j^T(n) x(n) + a_j^T(n) y_{j-1}(n)
w_j(n+1) = w_j(n) + η [ y_j(n) x(n) - y_j²(n) w_j(n) ]
a_j(n+1) = a_j(n) - η [ y_j(n) y_{j-1}(n) + y_j²(n) a_j(n) ]
Adaptive Principal Components Extraction-4
4. Increment j by 1, go to step 3, and continue until j=m, where m is the desired number of principal components. (Note that j=1 corresponds to the eigenvector associated with the largest eigenvalue, which is taken care of in step 2.) For large n we have w_j(n) → q_j and a_j(n) → 0, where q_j is the eigenvector associated with the jth eigenvalue of the correlation matrix of x(n).
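A sketch of the APEX updates in NumPy follows (the function name, learning rate and training schedule are my own assumptions; the lecture only specifies the update equations). The output neurons are trained one at a time, as in steps 2 to 4, with the already-trained neurons feeding the lateral, anti-Hebbian connections; here l denotes the number of extracted components.

```python
import numpy as np

def apex(X, l, eta=1e-3, epochs=300, seed=0):
    """APEX sketch: feedforward weights W (Hebbian) and lateral weights a_j
    (anti-Hebbian), trained one output neuron at a time.
    X: zero-mean data (N, m); l: number of principal components."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    W = 0.01 * rng.standard_normal((l, m))      # row j: feedforward vector w_j
    A = np.zeros((l, l))                        # A[j, :j]: lateral weights a_j

    for j in range(l):                          # neurons are trained in sequence
        for _ in range(epochs):
            for x in X:
                y_prev = W[:j] @ x              # outputs of the already-trained neurons 1..j-1
                y_j = W[j] @ x + A[j, :j] @ y_prev
                # Hebbian update of the feedforward weights.
                W[j] += eta * (y_j * x - y_j ** 2 * W[j])
                # Anti-Hebbian update of the lateral (feedback) weights.
                A[j, :j] -= eta * (y_j * y_prev + y_j ** 2 * A[j, :j])
    return W, A

# For large n: W[j] -> q_j (up to sign) and the lateral weights decay to zero.
rng = np.random.default_rng(5)
X = rng.standard_normal((2000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.4, 0.2])
X -= X.mean(axis=0)

W, A = apex(X, l=2)
lam, Q = np.linalg.eigh((X.T @ X) / len(X))
Q = Q[:, np.argsort(lam)[::-1]]
print(np.abs(np.round(W @ Q[:, :2], 2)))        # approximately the 2x2 identity (up to sign)
print(np.round(A, 3))                           # lateral weights have decayed towards zero
```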
Kernel Principal Components Analysis
•A last algorithm, which uses kernels (more on these in the SVM lecture), will be given below. We simply summarise the algorithm.
•This algorithm can be considered as a non-linear PCA method: we first project the input space into a feature space using a non-linear transform φ(x), and then we perform a linear PCA analysis in the feature space. This is different from the previous methods, which calculate a linear transformation between the input and the feature spaces.
•Summary of the kernel PCA method:
Kernel Principal Components Analysis-1
1. Given the training examples {x_i}, i=1,…,N, compute the N-by-N kernel matrix K = {K(x_i, x_j)}, where:
K(x_i, x_j) = φ^T(x_i) φ(x_j)
2. Solve the eigenvalue problem:
K a = λ a
where λ is an eigenvalue of the kernel matrix K and a is the associated eigenvector;
3. Normalise the eigenvectors so computed by requiring that:
a_k^T a_k = 1/λ_k ,  k=1,2,…,p
where λ_p is the smallest nonzero eigenvalue of the matrix K, assuming that the eigenvalues are arranged in decreasing order;
Kernel Principal Components Analysis-2
4. For the extraction of the principal components of a test point x, compute the projections:
ã_k = q_k^T φ(x) = Σ_{j=1}^{N} a_{k,j} K(x_j, x) ,  k=1,2,…,p
where a_{k,j} is the jth element of the eigenvector a_k.
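Here is a compact sketch of these four steps in NumPy (the Gaussian kernel, the `gamma` value and all function names are my own assumptions; the lecture leaves the kernel unspecified). Note that, as in the summary above, the kernel matrix is not centred, which strictly assumes the mapped data φ(x_i) have zero mean in the feature space.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2) (an assumed choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_pca(X, p, gamma=1.0):
    """Kernel PCA sketch following steps 1-4 above (no kernel centring).
    Returns the normalised eigenvectors and a projection function for new points."""
    N = len(X)
    # Step 1: N-by-N kernel matrix.
    K = np.array([[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X])

    # Step 2: solve K a = lambda a.
    lam, Alpha = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]
    lam, Alpha = lam[order], Alpha[:, order]     # decreasing eigenvalues

    # Step 3: normalise so that a_k^T a_k = 1 / lambda_k
    # (assumes the p leading eigenvalues are nonzero).
    Alpha = Alpha[:, :p] / np.sqrt(lam[:p])

    # Step 4: projections of a test point onto the first p kernel principal components.
    def project(x):
        k_x = np.array([rbf_kernel(xj, x, gamma) for xj in X])
        return Alpha.T @ k_x                     # sum_j a_{k,j} K(x_j, x)

    return lam[:p], Alpha, project

# Usage on a small synthetic data set.
rng = np.random.default_rng(6)
X = rng.standard_normal((100, 2))
lam, Alpha, project = kernel_pca(X, p=3, gamma=0.5)
print(project(np.array([0.3, -0.7])))            # first 3 kernel principal components of a test point
```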
Conclusions
•Typically we use PCA methods for dimensionality reduction as a pre-processing step before we apply other methods, for example in a pattern recognition problem.
•There are batch and adaptive numerical methods for the calculation of the PCA. An example of the first class is the Singular Value Decomposition (SVD) method, while the GHA algorithm is an example of an adaptive method; a brief sketch relating the SVD route to the eigendecomposition of R is given at the end of this section.
•It is used mainly for finding clusters in high-dimensional spaces, as it is difficult to visualise these clusters otherwise.
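As a side note on the batch route mentioned above (a sketch with synthetic data, not from the lecture): the eigendecomposition of R and the SVD of the zero-mean data matrix yield the same principal directions, with λ_j equal to the squared singular values divided by N.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((1000, 4)) @ np.diag([3.0, 1.5, 0.7, 0.2])
X -= X.mean(axis=0)                       # zero-mean data, N samples of dimension m

# Batch route 1: eigendecomposition of the correlation matrix R.
R = (X.T @ X) / len(X)
lam, Q = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]
lam, Q = lam[order], Q[:, order]

# Batch route 2: SVD of the data matrix itself.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(lam, s ** 2 / len(X)))          # eigenvalues of R = squared singular values / N
print(np.allclose(np.abs(Vt), np.abs(Q.T)))       # same principal directions, up to sign
```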