CS 476: Networks of Neural Computation
WK9 – Principal Component Analysis

Dr. Stathis Kasderidis
Dept. of Computer Science
University of Crete
Spring Semester, 2009

CS 476: Networks of Neural Computation, CSD, UOC, 2009

Contents
• Introduction to Principal Component Analysis
• Generalised Hebbian Algorithm
• Adaptive Principal Components Extraction
• Kernel Principal Components Analysis
• Conclusions

Principal Component Analysis
• The PCA method is a statistical method for feature selection and dimensionality reduction.
• Feature selection is a process whereby a data space is transformed into a feature space. In principle, both spaces have the same dimensionality.
• However, in the PCA method the transformation is designed in such a way that the data set is represented by a reduced number of "effective" features and yet retains most of the intrinsic information contained in the data; in other words, the data set undergoes a dimensionality reduction.
• Suppose that we have a vector $x$ of dimension $m$ and we wish to transmit it using $l$ numbers, where $l < m$. If we simply truncate the vector $x$, we cause a mean-square error equal to the sum of the variances of the elements eliminated from $x$.
• So we ask: does there exist an invertible linear transformation $T$ such that the truncation of $Tx$ is optimum in the mean-square sense?
• Clearly, the transformation $T$ should have the property that some of its components have low variance.
• Principal Component Analysis maximises the rate of decrease of variance and is therefore the right choice.
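The truncation claim above can be checked numerically. What follows is only a minimal sketch in plain Python, assuming zero-mean data (the sample mean is subtracted first). The standard deviations in `stds` and all variable names are hypothetical and chosen for illustration.

```python
import random

random.seed(0)

m, l, n_samples = 4, 2, 1000      # keep l of the m components
stds = [3.0, 2.0, 1.0, 0.5]       # hypothetical standard deviations

# Draw samples, then subtract the sample mean so the data are zero-mean.
data = [[random.gauss(0.0, s) for s in stds] for _ in range(n_samples)]
means = [sum(row[i] for row in data) / n_samples for i in range(m)]
data = [[row[i] - means[i] for i in range(m)] for row in data]

# Sample variance of each component (mean square, since the mean is zero).
variances = [sum(row[i] ** 2 for row in data) / n_samples for i in range(m)]

# Truncation keeps components 0..l-1 and replaces the others by zero,
# so the squared error of one sample is the sum of its dropped squares.
mse = sum(sum(row[i] ** 2 for i in range(l, m)) for row in data) / n_samples
discarded_variance = sum(variances[l:])

print(mse, discarded_variance)    # the two numbers agree up to rounding
```

Reordering the components so that the low-variance ones are dropped makes this error small, which is the motivation for looking for a transformation $T$.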
• Before we present neural-network (Hebbian-based) algorithms that perform this analysis, we first present the statistical analysis of the problem.
• Let $X$ be an $m$-dimensional random vector representing the environment of interest. We assume that $X$ has zero mean:
$$E[X] = 0$$
where $E$ is the statistical expectation operator. If $X$ does not have zero mean, we first subtract the mean from $X$ before proceeding with the rest of the analysis.
• Let $q$ denote a unit vector, also of dimension $m$, onto which the vector $X$ is to be projected. This projection is defined by the inner product of the vectors $X$ and $q$:
$$A = X^T q = q^T X$$
subject to the constraint:
$$\|q\| = (q^T q)^{1/2} = 1$$
• The projection $A$ is a random variable with a mean and variance related to the statistics of $X$. Assuming that $X$ has zero mean, the mean value of the projection $A$ is:
$$E[A] = q^T E[X] = 0$$
• The variance of $A$ is therefore the same as its mean-square value, and so we can write:
$$\sigma^2 = E[A^2] = E[(q^T X)(X^T q)] = q^T E[X X^T]\, q = q^T R q$$
• The $m$-by-$m$ matrix $R$ is the correlation matrix of the random vector $X$, formally defined as the expectation of the outer product of $X$ with itself:
$$R = E[X X^T]$$
• We observe that the matrix $R$ is symmetric, which means that:
$$R^T = R$$
• From this property it follows that for any $m$-by-1 vectors $a$ and $b$ we have:
$$a^T R b = b^T R a$$
• From the above we see that the variance $\sigma^2$ of $A$ is a function of the unit vector $q$; we can thus write:
$$\psi(q) = \sigma^2 = q^T R q$$
• We can therefore think of $\psi(q)$ as a variance probe.
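As a small worked illustration of the variance probe $\psi(q) = q^T R q$, consider a hypothetical 2-by-2 correlation matrix (chosen here for convenience, not taken from the lecture) and three unit vectors:

$$R = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, \qquad
q_a = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad
q_b = \tfrac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
q_c = \tfrac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

$$\psi(q_a) = 2, \qquad \psi(q_b) = \tfrac{1}{2}(2 + 1 + 1 + 2) = 3, \qquad \psi(q_c) = \tfrac{1}{2}(2 - 1 - 1 + 2) = 1$$

The variance of the projection therefore depends on the direction of $q$, even though all three vectors have unit length.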
• To find the directions along which the variance of $A$ is extremal, we must find the vectors $q$ that are the extremal points of $\psi(q)$, subject to the constraint of unit length.
• If $q$ is a unit vector such that $\psi(q)$ has an extreme value, then for any small perturbation $\delta q$ of $q$ we find that, to first order in $\delta q$:
$$\psi(q + \delta q) = \psi(q)$$
• Now from the definition of the variance probe we have:
$$\psi(q + \delta q) = (q + \delta q)^T R\, (q + \delta q) = q^T R q + 2 (\delta q)^T R q + (\delta q)^T R\, \delta q$$
where we have made use of the symmetry of the matrix $R$.
• Ignoring the second-order term $(\delta q)^T R\, \delta q$ and invoking the definition of $\psi(q)$, we may write:
$$\psi(q + \delta q) = q^T R q + 2 (\delta q)^T R q = \psi(q) + 2 (\delta q)^T R q$$
• Together with the first-order condition above, this implies that:
$$(\delta q)^T R q = 0$$
• Note that not just any perturbation $\delta q$ of $q$ is admissible; we restrict ourselves to those for which the Euclidean norm of the perturbed vector $q + \delta q$ remains equal to unity:
$$\|q + \delta q\| = 1 \quad \text{or} \quad (q + \delta q)^T (q + \delta q) = 1$$
• Taking into account that $q$ is already a vector of unit length, this means that, to first order in $\delta q$:
$$(\delta q)^T q = 0$$
• The perturbation $\delta q$ must therefore be orthogonal to $q$, so only a small change in the direction of $q$ is permitted.
• Combining the previous two equations, we can now write:
$$(\delta q)^T R q - \lambda (\delta q)^T q = 0 \quad \text{or} \quad (\delta q)^T (R q - \lambda q) = 0$$
where $\lambda$ is a scaling factor with the same dimensions as the elements of $R$.
• For this condition to hold for every admissible perturbation $\delta q$, we must have:
$$R q = \lambda q$$
• This means that $q$ is an eigenvector and $\lambda$ is an eigenvalue of $R$.
• The matrix $R$ has real and non-negative eigenvalues: real because $R$ is symmetric, and non-negative because $q^T R q = E[(q^T X)^2] \ge 0$ for every $q$.
Let the eigenvalues of the matrix $R$ be denoted by $\lambda_i$ and the corresponding eigenvectors by $q_i$, where the eigenvalues are arranged in decreasing order:
$$\lambda_1 > \lambda_2 > \ldots > \lambda_m$$
so that $\lambda_1 = \lambda_{max}$.
• We can then write the matrix $R$ as:
$$R = \sum_{i=1}^{m} \lambda_i\, q_i q_i^T$$
• Combining the previous results, we can see that the variance probes are the same as the eigenvalues:
$$\psi(q_j) = \lambda_j, \quad j = 1, 2, \ldots, m$$
• To summarise the previous analysis, we have two important results:
    • The eigenvectors of the correlation matrix $R$ pertaining to the zero-mean random vector $X$ define the unit vectors $q_j$, representing the principal directions along which the variance probes $\psi(q_j)$ have their extreme values;
    • The associated eigenvalues define the extremal values of the variance probes.
• We now want to investigate the representation of a data vector $x$ which is a realisation of the random vector $X$.
• With $m$ eigenvectors $q_j$ we have $m$ possible projection directions. The projections of $x$ onto the eigenvectors are given by:
$$a_j = q_j^T x = x^T q_j, \quad j = 1, 2, \ldots, m$$
• The numbers $a_j$ are called the principal components. To reconstruct the original vector $x$ from the projections, we combine all projections into a single vector:
$$a = [a_1, a_2, \ldots, a_m]^T = [x^T q_1, x^T q_2, \ldots, x^T q_m]^T = Q^T x$$
where $Q$ is the matrix whose columns are the eigenvectors of $R$.
• From the above, and because $Q$ is orthogonal, we see that:
$$x = Q a = \sum_{j=1}^{m} a_j q_j$$
• This is nothing more than a coordinate transformation from the input space of the vector $x$ to the feature space of the vector $a$.
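The relations $R = \sum_i \lambda_i q_i q_i^T$, $a = Q^T x$ and $x = Q a$ can be illustrated with a short plain-Python sketch. It uses power iteration with deflation to obtain the eigenpairs, a technique swapped in here for convenience and not part of the lecture. The matrix `R`, the vector `x` and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def top_eigenpair(M, iters=500):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    v = [float(i + 1) for i in range(len(M))]    # arbitrary start vector
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(dot(w, w))
        v = [wi / norm for wi in w]
    return dot(v, matvec(M, v)), v               # Rayleigh quotient, unit vector

def eigen_decompose(R):
    """Eigenpairs in decreasing eigenvalue order, found by deflation."""
    size = len(R)
    M = [row[:] for row in R]
    pairs = []
    for _ in range(size):
        lam, q = top_eigenpair(M)
        pairs.append((lam, q))
        # Remove the component just found: M <- M - lam * q q^T
        M = [[M[i][j] - lam * q[i] * q[j] for j in range(size)]
             for i in range(size)]
    return pairs

R = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]                            # hypothetical correlation matrix
pairs = eigen_decompose(R)
eigenvalues = [lam for lam, _ in pairs]          # analytically 3+sqrt(3), 3, 3-sqrt(3)
Q_cols = [q for _, q in pairs]                   # eigenvectors q_1, ..., q_m

x = [1.0, -2.0, 0.5]                             # hypothetical data vector
a = [dot(q, x) for q in Q_cols]                  # principal components a_j = q_j^T x
x_rec = [sum(a[j] * Q_cols[j][i] for j in range(3)) for i in range(3)]
R_rec = [[sum(lam * q[i] * q[j] for lam, q in pairs) for j in range(3)]
         for i in range(3)]

print(eigenvalues)
print(x_rec)
```

Here `x_rec` reproduces `x` and `R_rec` reproduces `R` up to rounding, since all $m$ terms of each sum are kept.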
• From the perspective of pattern recognition, the usefulness of the PCA method is that it provides an effective technique for dimensionality reduction.
• In particular, we may reduce the number of features needed for effective data representation by discarding those linear combinations in the previous formula that have small variances and retaining only those terms that have large variances.
• Let $\lambda_1, \lambda_2, \ldots, \lambda_l$ denote the largest $l$ eigenvalues of $R$. We may then approximate the vector $x$ by truncating the previous sum to the first $l$ terms:
$$\hat{x} = \sum_{j=1}^{l} a_j q_j = [q_1, q_2, \ldots, q_l] \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_l \end{bmatrix}$$

Generalised Hebbian Algorithm
• We now present a neural-network method which solves the PCA problem. It belongs to the class of so-called reestimation algorithms for PCA.
• The network which solves the problem is shown below:
[Figure: feedforward network of the GHA, not reproduced here]
• For the feedforward network shown we make two structural assumptions:
    • Each neuron in the output layer of the network is linear;
    • The network has $m$ inputs and $l$ outputs, both of which are specified. Moreover, the network has fewer outputs than inputs (i.e. $l < m$).
• It can be shown that under these assumptions, and by using a special form of Hebbian learning, the network learns to calculate the principal components at its output nodes.
• The GHA can be summarised as follows:
1. Initialise the synaptic weights of the network, $w_{ji}$, to small random values at time $n = 1$.
Assign a small positive value to the learning-rate parameter $\eta$.
2. For $n = 1$, $j = 1, 2, \ldots, l$ and $i = 1, 2, \ldots, m$, compute:
$$y_j(n) = \sum_{i=1}^{m} w_{ji}(n)\, x_i(n)$$
$$\Delta w_{ji}(n) = \eta \left[ y_j(n)\, x_i(n) - y_j(n) \sum_{k=1}^{j} w_{ki}(n)\, y_k(n) \right]$$
where $x_i(n)$ is the $i$th component of the $m$-by-1 input vector $x(n)$ and $l$ is the desired number of principal components.
3. Increment $n$ by 1, go to step 2, and continue until the synaptic weights $w_{ji}$ reach their steady-state values. For large $n$, the weight $w_{ji}$ of neuron $j$ converges to the $i$th component of the eigenvector associated with the $j$th eigenvalue of the correlation matrix of the input vector $x(n)$. The output neurons correspond to the eigenvalues of the correlation matrix in decreasing order, from neuron 1 to neuron $l$.

Adaptive Principal Components Extraction
• Another algorithm for extracting the principal components is the adaptive principal components extraction (APEX) algorithm. This network uses both feedforward and feedback connections.
• The algorithm is iterative in nature: if we are given the first $(j-1)$ principal components, the $j$th one can be easily computed.
• This algorithm belongs to the class of decorrelating algorithms.
• The network that implements the algorithm is shown next:
[Figure: APEX network, not reproduced here]
• The network structure is defined as follows:
    • Each neuron in the output layer is assumed to be linear;
    • Feedforward connections exist from the input nodes to each of the neurons $1, 2, \ldots, j$, with $j < m$. The feedforward connections operate with a Hebbian rule. They are excitatory and therefore provide amplification.
These connections are represented by the vector $w_j(n)$.
    • Lateral connections exist from the individual outputs of neurons $1, 2, \ldots, j-1$ to neuron $j$ of the output layer, thereby applying feedback to the network. These connections are represented by the vector $a_j(n)$. The lateral connections operate with an anti-Hebbian learning rule, which has the effect of making them inhibitory.
• The algorithm is summarised as follows:
1. Initialise the feedforward weight vector $w_j$ and the feedback weight vector $a_j$ to small random numbers at time $n = 1$, where $j = 1, 2, \ldots, m$. Assign a small positive value to the learning-rate parameter $\eta$.
2. Set $j = 1$, and for $n = 1, 2, \ldots$ compute:
$$y_1(n) = w_1^T(n)\, x(n)$$
$$w_1(n+1) = w_1(n) + \eta \left[ y_1(n)\, x(n) - y_1^2(n)\, w_1(n) \right]$$
where $x(n)$ is the input vector. For large $n$ we have $w_1(n) \to q_1$, where $q_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of the correlation matrix of $x(n)$.
3. Set $j = 2$, and for $n = 1, 2, \ldots$ compute:
$$\mathbf{y}_{j-1}(n) = [y_1(n), y_2(n), \ldots, y_{j-1}(n)]^T$$
$$y_j(n) = w_j^T(n)\, x(n) + a_j^T(n)\, \mathbf{y}_{j-1}(n)$$
$$w_j(n+1) = w_j(n) + \eta \left[ y_j(n)\, x(n) - y_j^2(n)\, w_j(n) \right]$$
$$a_j(n+1) = a_j(n) - \eta \left[ y_j(n)\, \mathbf{y}_{j-1}(n) + y_j^2(n)\, a_j(n) \right]$$
4. Increment $j$ by 1, go to step 3, and continue until $j = m$, where $m$ is the desired number of principal components. (Note that $j = 1$ corresponds to the eigenvector associated with the largest eigenvalue, which is taken care of in step 2.) For large $n$ we have $w_j(n) \to q_j$ and $a_j(n) \to 0$, where $q_j$ is the eigenvector associated with the $j$th eigenvalue of the correlation matrix of $x(n)$.

Kernel Principal Components Analysis
• A last algorithm, which uses kernels (more on these in the SVM lecture), is given below.
We simply summarise the algorithm.
• This algorithm can be considered a non-linear PCA method, as we first map the input space into a feature space using a non-linear transform $\varphi(x)$ and then perform a linear PCA in the feature space. This differs from the previous methods, which calculate a linear transformation between the input and the feature spaces.
• Summary of the kernel PCA method:
1. Given the training examples $\{x_i\}_{i=1}^{N}$, compute the $N$-by-$N$ kernel matrix $K = \{K(x_i, x_j)\}$, where:
$$K(x_i, x_j) = \varphi^T(x_i)\, \varphi(x_j)$$
2. Solve the eigenvalue problem:
$$K \mathbf{a} = \lambda \mathbf{a}$$
where $\lambda$ is an eigenvalue of the kernel matrix $K$ and $\mathbf{a}$ is the associated eigenvector.
3. Normalise the eigenvectors so computed by requiring that:
$$\mathbf{a}_k^T \mathbf{a}_k = \frac{1}{\lambda_k}, \quad k = 1, 2, \ldots, p$$
where $\lambda_p$ is the smallest nonzero eigenvalue of the matrix $K$, assuming that the eigenvalues are arranged in decreasing order.
4. For the extraction of the principal components of a test point $x$, compute the projections:
$$a_k = \tilde{q}_k^T \varphi(x) = \sum_{j=1}^{N} a_{k,j}\, K(x_j, x), \quad k = 1, 2, \ldots, p$$
where $a_{k,j}$ is the $j$th element of the eigenvector $\mathbf{a}_k$.

Conclusions
• Typically we use PCA methods for dimension reduction as a pre-processing step before we apply other methods, for example in a pattern recognition problem.
• There are batch and adaptive numerical methods for the calculation of the PCA. An example of the first class is the Singular Value Decomposition (SVD) method, while the GHA is an example of an adaptive method.
• PCA is used mainly for finding clusters in high-dimensional spaces, as it is difficult to visualise these clusters otherwise.
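To make the adaptive route concrete, the GHA update summarised earlier can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the inputs are synthetic zero-mean Gaussian samples with uncorrelated components, so the eigenvectors of their correlation matrix are the coordinate axes. The learning rate `eta`, the standard deviations `stds`, the number of iterations and all variable names are hypothetical choices, not values from the lecture.

```python
import random

random.seed(1)

m, l = 3, 2                    # m inputs, l outputs (l < m)
eta = 0.005                    # small learning rate (assumed value)
stds = [2.0, 1.0, 0.5]         # hypothetical input standard deviations

# Step 1: small random initial weights; W[j][i] plays the role of w_ji.
W = [[random.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(l)]

for n in range(10000):
    # Zero-mean input sample x(n) with uncorrelated components.
    x = [random.gauss(0.0, s) for s in stds]

    # Step 2: outputs y_j(n) = sum_i w_ji(n) x_i(n)
    y = [sum(W[j][i] * x[i] for i in range(m)) for j in range(l)]

    # Step 2: dw_ji = eta * (y_j x_i - y_j * sum_{k<=j} w_ki y_k)
    dW = [[eta * (y[j] * x[i]
                  - y[j] * sum(W[k][i] * y[k] for k in range(j + 1)))
           for i in range(m)] for j in range(l)]

    # Step 3: apply the update and move on to the next sample.
    W = [[W[j][i] + dW[j][i] for i in range(m)] for j in range(l)]

for row in W:
    print([round(w, 2) for w in row])
```

Under these assumptions the first row of `W` should end up close to plus or minus the first coordinate axis (the direction of largest variance) and the second row close to plus or minus the second axis. A batch method would instead compute the eigenvectors from the full sample correlation matrix in one step.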