Feature Extraction
K. Ramachandra Murthy
Email: k.ramachandra@isical.ac.in

Mathematical Tools (Recall)

Linear Independence
- Vectors pointing in different directions are linearly independent.
- Parallel vectors are linearly dependent (the +ve and -ve directions count as parallel).

Basis & Orthonormal Bases
- A basis (or set of axes) is a frame of reference for the space.
- In the standard basis {[1; 0], [0; 1]}:
  [2; 3] = 2·[1; 0] + 3·[0; 1];    [−9; 6] = −9·[1; 0] + 6·[0; 1]
- In the basis {[2; 1], [−1; 3]}:
  [2; 3] = (9/7)·[2; 1] + (4/7)·[−1; 3], i.e. coordinates [9/7; 4/7]
  [−9; 6] = −3·[2; 1] + 3·[−1; 3], i.e. coordinates [−3; 3]
- {[1; 0], [0; 1]} is a set of representatives for R^2; {[2; 1], [−1; 3]} is another set of representatives for R^2.

Linear Transformation
[Figure-only slide.]

Basis & Orthonormal Bases
- Basis: a space is totally defined by a set of vectors – any vector is a linear combination of the basis vectors.
- Orthonormal: orthogonal + normal. [Sneak peek – Orthogonal: dot product is zero; Normal: magnitude is one.]
- Example: x = [1; 0; 0], y = [0; 1; 0], z = [0; 0; 1], with x·y = 0, x·z = 0, y·z = 0.

Projection: Using Inner Products
Projection of x along the direction a:
[Figure: x, its projection p onto a, and the error vector e, drawn as a shadow cast by a light source.]
- Write the projection as p = t·a and the error vector as e = x − p.
- The error is perpendicular to a (cos 90° = 0), so a·e = 0
  ⇒ a·(x − p) = 0 ⇒ a·(x − t·a) = 0 ⇒ a·x − t·(a·a) = 0 ⇒ t = (a·x)/(a·a)
- Therefore p = ((a·x)/(a·a))·a = (a aᵀ / aᵀa) x, and the projection matrix is P = a aᵀ / (aᵀa).

Eigenvalues & Eigenvectors
- Eigenvectors (for a square m×m matrix S): S v = λ v, with v ≠ 0 a (right) eigenvector and λ the corresponding eigenvalue.
  [Figure: v and Sv point along the same direction.]

Eigenvalues & Eigenvectors (Properties)
- The eigenvalues of real symmetric matrices are real.
  - Start from A v = λ v and conjugate both sides to get A v̄ = λ̄ v̄ (if the entries of A are real, Ā = A).
  - This also shows that complex eigenvalues of real-valued matrices come in conjugate pairs.
  - Now transpose to get v̄ᵀ Aᵀ = λ̄ v̄ᵀ. Because A is a symmetric matrix, v̄ᵀ A = λ̄ v̄ᵀ.
  - Multiply both sides of this equation on the right by v: v̄ᵀ A v = λ̄ v̄ᵀ v.
  - On the other hand, multiply A v = λ v on the left by v̄ᵀ: v̄ᵀ A v = λ v̄ᵀ v.
  - Hence λ̄ v̄ᵀ v = λ v̄ᵀ v ⇒ λ̄ = λ ⇒ λ is real.

Eigenvalues & Eigenvectors (Properties)
- If A is an n×n symmetric matrix, then any two eigenvectors that come from distinct eigenvalues are orthogonal.
  - A v_i = λ_i v_i. Left-multiply by v_jᵀ: v_jᵀ A v_i = λ_i v_jᵀ v_i.
  - Similarly, v_iᵀ A v_j = λ_j v_iᵀ v_j. Since A is symmetric, v_jᵀ A v_i = v_iᵀ A v_j and v_jᵀ v_i = v_iᵀ v_j.
  - Subtracting the two equations gives (λ_i − λ_j) v_jᵀ v_i = 0 ⇒ v_jᵀ v_i = 0 (because λ_i ≠ λ_j).
  - ∴ v_i and v_j are perpendicular.

Diagonalization
- Stack up all the eigenvectors to get A V = V Λ, where V = [v_1 v_2 … v_m] and Λ = diag(λ_1, λ_2, …, λ_m).
- If all eigenvectors are linearly independent, V is invertible and A = V Λ V⁻¹.
- If A is a symmetric matrix, then Vᵀ = V⁻¹, and therefore A = V Λ Vᵀ (see the sketch below).
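As a quick check of the diagonalization statement above, here is a minimal NumPy sketch (not part of the original slides; the matrix A below is an arbitrary example I chose) showing that a real symmetric matrix has real eigenvalues, orthonormal eigenvectors, and satisfies A = V Λ Vᵀ.

```python
import numpy as np

# Arbitrary symmetric matrix (example values, not from the slides).
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# eigh is specialised for symmetric matrices: it returns real eigenvalues
# and orthonormal eigenvectors (stacked as the columns of V).
eigvals, V = np.linalg.eigh(A)
Lam = np.diag(eigvals)

# Eigenvectors are orthonormal: V^T V = I.
print(np.allclose(V.T @ V, np.eye(2)))                  # True
# Diagonalization: A = V Lambda V^T.
print(np.allclose(V @ Lam @ V.T, A))                    # True
# Each column v of V satisfies A v = lambda v.
print(np.allclose(A @ V[:, 0], eigvals[0] * V[:, 0]))   # True
```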
Variance
- Variance is the average squared deviation from the mean of a set of data. It is used to find the standard deviation.
- Variance = [ (x_1 − mean)² + (x_2 − mean)² + … ] / (No. of data points)

Variance Formula
σ² = (1/n) · Σ_{i=1..n} (x_i − x̄)²

Standard Deviation
σ = sqrt( (1/n) · Σ_{i=1..n} (x_i − x̄)² )
[ standard deviation = square root of the variance ]

Variance (2D)
[Figure-only slides: 2-D scatter plots showing the spread along each axis separately.]
- Variance does not explore the relationship between variables.

Covariance
Variance(x) = (1/n) Σ_{i=1..n} (x_i − x̄)² = (1/n) Σ_{i=1..n} (x_i − x̄)(x_i − x̄)
Covariance(x, y) = (1/n) Σ_{i=1..n} (x_i − x̄)(y_i − ȳ)
- Covariance(x, x) = var(x)
- Covariance(x, y) = Covariance(y, x)

Covariance: the sign of each term
[Figures: scatter plots of points relative to the mean point (x̄, ȳ).]
- If x_i and y_i are both above, or both below, their means, then (x_i − x̄)(y_i − ȳ) > 0.
- If one is above its mean while the other is below, then (x_i − x̄)(y_i − ȳ) < 0.
- Positive relation: most terms are positive, so Covariance(x, y) > 0.
- Negative relation: most terms are negative, so Covariance(x, y) < 0.
- No relation: positive and negative terms cancel, so Covariance(x, y) ≈ 0.

Covariance Matrix
Cov = [ cov(x_1, x_1)  cov(x_1, x_2)  …  cov(x_1, x_m)
        cov(x_2, x_1)  cov(x_2, x_2)  …  cov(x_2, x_m)
        ⋮              ⋮                  ⋮
        cov(x_m, x_1)  cov(x_m, x_2)  …  cov(x_m, x_m) ]

Cov = (1/n) (X − μ)(X − μ)ᵀ, where X = [x_1; x_2; …; x_m] and μ is the mean vector.
- The diagonal elements are variances, i.e. Cov(x, x) = var(x).
- The covariance matrix is symmetric.
- It is a positive semi-definite matrix.

Covariance Matrix
Because the covariance matrix is real, symmetric and positive semi-definite:
- All eigenvalues must be real.
- Eigenvectors corresponding to different eigenvalues are orthogonal.
- All eigenvalues are greater than or equal to zero.
- The covariance matrix can be diagonalized, i.e. Cov = P D Pᵀ.

Correlation
[Figures: scatter plots illustrating positive relation, negative relation and no relation.]
- Covariance determines whether a relation is positive or negative, but it cannot measure the degree to which the variables are related.
- Correlation is another way to determine how two variables are related.
- In addition to whether variables are positively or negatively related, correlation also tells the degree to which the variables are related to each other.

ρ_xy = Correlation(x, y) = cov(x, y) / sqrt( var(x) · var(y) ),   with −1 ≤ Correlation(x, y) ≤ +1
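The following small NumPy sketch (the sample values for x and y are made up for illustration) computes covariance with the 1/n convention used above, assembles the 2×2 covariance matrix, and computes the correlation coefficient.

```python
import numpy as np

# Made-up sample: x and y are positively related.
x = np.array([2.1, 2.5, 3.6, 4.0, 5.2])
y = np.array([8.0, 10.0, 12.5, 14.0, 16.0])

# Covariance with the 1/n convention used above
# (note: np.cov defaults to the 1/(n-1) convention instead).
def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

# Covariance matrix: symmetric, diagonal entries are the variances.
S = np.array([[cov(x, x), cov(x, y)],
              [cov(y, x), cov(y, y)]])

# Correlation coefficient, always in [-1, +1].
rho = cov(x, y) / np.sqrt(cov(x, x) * cov(y, y))

print(S)
print(rho)
```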
Feature Extraction

What is feature extraction?
- Feature extraction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
- The criterion for feature extraction can differ based on the problem setting:
  - Unsupervised setting: minimize the information loss.
  - Supervised setting: maximize the class discrimination.
- Given a set of data points with m features {x_1, x_2, …, x_m}, compute the linear transformation (projection)
  P ∈ R^{p×m} : x ∈ R^m → y = Px ∈ R^p   (p << m)

What is feature extraction?
Original data X ∈ R^m  →  linear transformation P ∈ R^{p×m}  →  reduced data Y = PX ∈ R^p

High-dimensional data in computer vision
[Figures: face images and handwritten digits.]

Principal Component Analysis

Example of a problem
- We collected m features about 100 students: height, weight, hair color, average grade, …
- We want to find the most important features that best describe a student.

Example of a problem
- Each student is described by a vector of data of length m, e.g. (180, 70, 'purple', 84, …).
- We have n = 100 such vectors. Let's put them in one matrix, where each column is one student vector.
- So we have an m×n matrix. This will be the input of our problem.

Example of a problem
- Every student is a vector that lies in an m-dimensional vector space spanned by an orthonormal basis.
- All data/measurement vectors in this space are linear combinations of the set of standard unit-length basis vectors.

What features can we ignore?
- Constant features (e.g. number of heads): 1, 1, …, 1.
- Constant features with some noise (e.g. thickness of hair): 0.003, 0.005, 0.002, …, 0.0008 – low variance.
- Features that are linearly dependent on other features (e.g. head size and height): z = ax + by.

Which features do we want to keep?
- Features that don't depend on the others (e.g. eye color), i.e. uncorrelated features with low covariance.
- Features that change a lot (e.g. grades) – the opposite of noise – high variance.

Questions
- How do we describe the 'most important' features using math? – High variance and low covariance.
- How do we represent our data so that the most important features can be extracted easily? – Change of basis.

Change of Basis
- Let X and Y be m×n matrices related by a linear transformation P.
- X is the original recorded data set and Y is a re-representation of that data set: PX = Y.
- Definitions: p_i are the rows of P; x_i are the columns of X; y_i are the columns of Y.

Change of Basis
- What does this mean?
  - P is a matrix that transforms X into Y.
  - Geometrically, P is a rotation and a stretch (scaling) which again transforms X into Y.
  - The rows of P = [p_1; p_2; …; p_m] are a set of new basis vectors for expressing the columns of X.

Change of Basis
- Let's write out the explicit dot products of PX:

  PX = [p_1; p_2; …; p_m] [x_1, x_2, …, x_n]

  Y = [ p_1·x_1  p_1·x_2  …  p_1·x_n
        p_2·x_1  p_2·x_2  …  p_2·x_n
        ⋮        ⋮        ⋱  ⋮
        p_m·x_1  p_m·x_2  …  p_m·x_n ]

- Each column of Y has the form y_i = [p_1·x_i; p_2·x_i; …; p_m·x_i].
- We can see that each coefficient of y_i is a dot product of x_i with the corresponding row of P. In other words, the j-th coefficient of y_i is a projection onto the j-th row of P.
- Therefore, the rows of P are a new set of basis vectors for representing the columns of X (see the sketch below).
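To make the dot-product picture concrete, here is a minimal NumPy sketch (toy sizes and random data, chosen only for illustration and not taken from the slides) checking that the j-th coefficient of y_i = Px_i is p_j·x_i, and that an orthonormal change of basis only re-expresses the data.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 3, 5                      # m features, n data points (toy sizes)
X = rng.normal(size=(m, n))      # columns are the data vectors x_i

# An orthonormal basis whose rows are the new basis vectors p_1..p_m
# (obtained here from a QR factorisation of a random matrix).
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
P = Q.T

Y = P @ X                        # re-representation of the data

# The j-th coefficient of column y_i is the dot product p_j . x_i,
# i.e. the projection of x_i onto the j-th new basis vector.
i, j = 2, 1
print(np.isclose(Y[j, i], P[j] @ X[:, i]))   # True

# An orthonormal change of basis only rotates the data:
# the length of every data vector is preserved.
print(np.allclose(np.linalg.norm(X, axis=0),
                  np.linalg.norm(Y, axis=0)))  # True
```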
Change of Basis
- Changing the basis doesn't change the data – only its representation.
- Changing the basis is actually projecting the data vectors onto the basis vectors.
- Geometrically, P is a rotation and a stretch of X. If the basis P is orthonormal (each basis vector has length 1), then the transformation P is only a rotation.

Questions Remaining !!!
- Assuming linearity, the problem now is to find the appropriate change of basis.
- The row vectors p_1, p_2, …, p_m in this transformation will become the principal components of X.
- Now: What is the best way to re-express X? What is a good choice of basis P?
- Moreover: What features would we like Y to exhibit?

What does "best express" the data mean?
- As we've said before, we want to filter out noise and extract the relevant information from the given data set.
- Hence, the representation we are looking for will decrease both noise and redundancy in the data set at hand.

Noise
- Noise in any data must be low or – no matter the analysis technique – no information about the system can be extracted.
- There exists no absolute scale for noise; rather, it is measured relative to the measurement, e.g. the recorded ball positions.
- A common measure is the signal-to-noise ratio (SNR), a ratio of variances:
  SNR = σ²_signal / σ²_noise
- A high SNR (>>1) indicates high-precision data, while a low SNR indicates noise-contaminated data.

Why Variance ?!!!
- Find the axis rotation that maximizes the SNR, i.e. the rotation that maximizes the variance along the new axes.

Redundancy
- Multiple sensors may record the same dynamic information.
- Consider a range of possible plots between two arbitrary measurements r_1 and r_2.
- Panel (a) depicts two recordings with no redundancy, i.e. they are uncorrelated, e.g. a person's height and his GPA.
- However, in panel (c) both recordings appear to be strongly related, i.e. one can be expressed in terms of the other.

Redundancy
- Computing the covariance matrix (S_X) quantifies the correlations between all possible pairs of measurements. Between one pair of measurements, a large covariance/correlation corresponds to a situation like panel (c), while zero covariance corresponds to entirely uncorrelated data as in panel (a).

Redundancy
- Suppose we have the option of manipulating S_X. We will suggestively define our manipulated covariance matrix S_Y. What features do we want to optimize in S_Y?

Diagonalize the Covariance Matrix
Our goal is to find the covariance matrix S_Y that:
1. Minimizes redundancy, measured by covariance (the off-diagonal entries) – i.e. we would like each variable to co-vary as little as possible with the other variables.
2. Maximizes the signal, measured by variance (the diagonal entries).

Diagonalize the Covariance Matrix – PCA Assumptions
- PCA assumes that all basis vectors p_1, p_2, …, p_m are orthonormal (i.e. p_i · p_j = δ_ij).
- Hence, in the language of linear algebra, PCA assumes P is an orthonormal matrix.
- Secondly, PCA assumes the directions with the largest variances are the most "important", or in other words, the most principal.
- Why do these assumptions make the problem easiest to solve?

Diagonalize the Covariance Matrix – PCA Assumptions
- By the variance assumption, PCA first selects a normalized direction in m-dimensional space along which the variance in X is maximized – it saves this as p_1.
- It then finds another direction along which the variance is maximized; however, because of the orthonormality condition, it restricts the search to all directions perpendicular to the previously selected directions.
- This continues until m directions are selected. The resulting ordered set of p's are the principal components.
- The variances associated with each direction p_i quantify how principal each direction is. We can thus rank-order each basis vector p_i according to the corresponding variance (a small numerical check of this greedy picture is sketched below).
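The greedy description above can be checked numerically. The sketch below is not from the slides: the toy data and the use of numpy.linalg.eigh are my assumptions. It verifies that the top eigenvector of the covariance matrix is a normalized direction whose projected variance is not beaten by randomly drawn unit directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy zero-mean 2-D data with an obvious dominant direction.
n = 1000
X = rng.normal(size=(2, n)) * np.array([[3.0], [0.5]])   # feature 0 varies much more
X = X - X.mean(axis=1, keepdims=True)

S = (X @ X.T) / n                      # covariance matrix (1/n convention)
eigvals, E = np.linalg.eigh(S)
p1 = E[:, np.argmax(eigvals)]          # candidate first principal direction

def projected_variance(d, X):
    """Variance of the data projected onto the unit direction d."""
    d = d / np.linalg.norm(d)
    return np.var(d @ X)

# No random unit direction should beat p1 (up to numerical noise).
best_random = max(projected_variance(rng.normal(size=2), X) for _ in range(500))
print(projected_variance(p1, X) >= best_random - 1e-9)   # True
```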
Solving PCA: Eigenvectors of the Covariance Matrix
- We will derive an algebraic solution to PCA using linear algebra. This solution is based on an important property of eigenvector decomposition.
- Once again, the data set is X, an m×n matrix, where n is the number of data patterns and m is the number of feature types.
- The goal is summarized as follows: find some orthonormal matrix P, with Y = PX, such that S_Y = (1/n) Y Yᵀ is diagonalized. (We assume zero-mean data, i.e. the mean has been subtracted from each feature.)

Solving PCA: Eigenvectors of the Covariance Matrix
We begin by rewriting S_Y in terms of our variable of choice, P:
  S_Y = (1/n) Y Yᵀ
      = (1/n) (PX)(PX)ᵀ
      = (1/n) P X Xᵀ Pᵀ
      = P ( (1/n) X Xᵀ ) Pᵀ
      = P A Pᵀ
- Note that we have defined a new matrix A = (1/n) X Xᵀ, where A is symmetric.
- Our roadmap is to recognize that a symmetric matrix (A) is diagonalized by an orthogonal matrix of its eigenvectors.
- A symmetric matrix A can be written as E D Eᵀ, where D is a diagonal matrix and E is a matrix of the eigenvectors of A arranged as columns.
- The matrix A has r ≤ m orthonormal eigenvectors, where r is the rank of the matrix.

Solving PCA: Eigenvectors of the Covariance Matrix
- Now comes the trick. We select the matrix P to be a matrix in which each row p_i is an eigenvector of (1/n) X Xᵀ.
- By this selection, P = Eᵀ. Hence A = Pᵀ D P.
- With this relation, and the fact that P⁻¹ = Pᵀ since the inverse of an orthonormal matrix is its transpose, we can finish evaluating S_Y as follows:
  S_Y = P A Pᵀ
      = P (Pᵀ D P) Pᵀ
      = (P Pᵀ) D (P Pᵀ)
      = (P P⁻¹) D (P P⁻¹)
      = D
- It is evident that this choice of P diagonalizes S_Y. This was the goal of PCA.

What We Have Done So Far….
[Figures: the data re-plotted in the new basis p_1, p_2. There is no correlation in the new representation of the data, and the spread along p_1 and p_2 corresponds to the eigenvalues λ_1 and λ_2.]

PCA Process – STEP 1 to STEP 5
[Figure-only slides walking through the steps on a numerical example; a code sketch of the usual steps follows below.]

Reconstruction of Original Data
[Figure-only slides showing the data reconstructed from the retained principal components.]
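Since the STEP 1–5 and reconstruction slides are figure-only, here is a minimal sketch of one way the usual steps can be coded, consistent with the derivation above: subtract the mean, form the covariance matrix, take its eigenvectors, project onto the top components, and reconstruct. The data, the 1/n convention, and the choice of keeping k = 1 component are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: m = 2 correlated features, n = 200 samples.
n = 200
t = rng.normal(size=n)
X = np.vstack([t + 0.1 * rng.normal(size=n),
               0.5 * t + 0.1 * rng.normal(size=n)])      # shape (m, n)

# STEP 1: subtract the mean of each feature.
mean = X.mean(axis=1, keepdims=True)
Xc = X - mean

# STEP 2: covariance matrix (1/n convention, as in the derivation).
S = (Xc @ Xc.T) / n

# STEP 3: eigenvalues and (orthonormal) eigenvectors of S.
eigvals, E = np.linalg.eigh(S)

# STEP 4: order components by decreasing eigenvalue and keep the top k.
order = np.argsort(eigvals)[::-1]
k = 1
P = E[:, order[:k]].T             # rows are the chosen principal components

# STEP 5: project the data onto the principal components.
Y = P @ Xc                        # shape (k, n), decorrelated coordinates

# Reconstruction of the original data from the kept components.
X_hat = P.T @ Y + mean
print("relative reconstruction error:",
      np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```

Keeping only the components with the largest eigenvalues discards the directions of smallest variance, which is why the reconstruction error above stays small for strongly correlated data.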