Lecture Slides for ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Why Reduce Dimensionality?
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the features
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions

Feature Selection vs Extraction
Feature selection: choose the $k < d$ most important features and ignore the remaining $d - k$.
 Subset selection algorithms
 Rough sets
Feature extraction: project the original dimensions $x_i$, $i = 1,\dots,d$, onto $k < d$ new dimensions $z_j$, $j = 1,\dots,k$.
 Principal components analysis (PCA)
 Linear discriminant analysis (LDA)
 Factor analysis (FA)

Subset Selection
Subset selection is supervised. There are $2^d$ possible subsets of $d$ features.
Forward search: add the best feature at each step.
 The set of features F is initially empty (Ø).
 At each iteration, find the best new feature $j = \arg\min_i E(F \cup x_i)$.
 Add $x_j$ to F if $E(F \cup x_j) < E(F)$,
 where E(F) is the error when only the inputs in F are used, and F is a subset of the input dimensions $x_i$, $i = 1,\dots,d$.
This is an example of hill climbing; it is an $O(d^2)$ algorithm and a greedy method: a local search, not guaranteed to find the optimal subset.

Hill Climbing
• In computer science, hill climbing is a mathematical optimization technique that belongs to the family of local search.
• Hill climbing can also operate on a continuous space: in that case, the algorithm is called gradient ascent (or gradient descent if the function is minimized).
• A problem with hill climbing is that it finds only local maxima. Other local search algorithms, such as stochastic hill climbing, random walks, and simulated annealing, try to overcome this problem.
(http://en.wikipedia.org/wiki/Hill_climbing)

Subset Selection
Backward search: start with all features and remove one at a time, if possible.
 The set of features F initially contains all the features.
 At each iteration, find the feature $j = \arg\min_i E(F - x_i)$.
 Remove $x_j$ from F if $E(F - x_j) < E(F)$.
 Stop if removing a feature does not decrease the error.
 To decrease complexity, we may decide to remove a feature if its removal causes only a slight increase in error.
Floating search: add k and remove l features at each step.
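The greedy forward-search loop above maps directly onto code. Below is a minimal Python sketch (not from the textbook); the error(F) callback, which stands in for the validation error E(F) of a model trained only on the feature subset F, is a hypothetical placeholder that the caller must supply.

```python
# Greedy forward selection: start from the empty set and repeatedly add the
# single feature that lowers the error the most; stop when no addition helps.
def forward_selection(d, error):
    """d: number of input features; error(F): hypothetical callback returning
    the validation error E(F) when only the features in the set F are used.
    error(set()) must be defined (e.g., the error of a baseline model)."""
    F = set()
    best_err = error(F)
    while len(F) < d:
        # Evaluate E(F + x_i) for every feature not yet selected.
        candidates = [(error(F | {i}), i) for i in range(d) if i not in F]
        new_err, j = min(candidates)       # j = argmin_i E(F + x_i)
        if new_err < best_err:             # add x_j only if the error decreases
            F.add(j)
            best_err = new_err
        else:                              # local optimum of the greedy search
            break
    return F, best_err
```

Each pass evaluates O(d) candidate subsets and at most d passes are made, which is where the O(d^2) complexity quoted above comes from. Backward search is the mirror image: start with all features and greedily remove one at a time.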
Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected there, information loss is minimized.
The projection of x on the direction of w is $z = w^T x$. Find w such that Var(z) is maximized:
$$\text{Var}(z) = \text{Var}(w^T x) = E[(w^T x - w^T \mu)^2] = E[(w^T x - w^T \mu)(w^T x - w^T \mu)]$$
$$= E[w^T (x - \mu)(x - \mu)^T w] = w^T E[(x - \mu)(x - \mu)^T] w = w^T \Sigma w,$$
where $\text{Var}(x) = E[(x - \mu)(x - \mu)^T] = \Sigma$ (= Cov(x)).

If $z_1 = w_1^T x$ with Cov(x) = $\Sigma$, then $\text{Var}(z_1) = w_1^T \Sigma w_1$.
Maximize Var($z_1$) subject to $\|w_1\| = 1$ (i.e., $w_1^T w_1 = 1$):
$$\max_{w_1} \; w_1^T \Sigma w_1 - \alpha (w_1^T w_1 - 1).$$
Taking the derivative with respect to $w_1$ and setting it equal to 0, we have
$$2 \Sigma w_1 - 2 \alpha w_1 = 0 \quad\Rightarrow\quad \Sigma w_1 = \alpha w_1,$$
that is, $w_1$ is an eigenvector of $\Sigma$ and $\alpha$ is the corresponding eigenvalue. Because $w_1^T \Sigma w_1 = \alpha w_1^T w_1 = \alpha$, the maximized objective equals $\alpha$, so we choose the eigenvector with the largest eigenvalue for Var(z) to be maximum.

Eigenvalue, Eigenvector and Eigenspace
• When a transformation is represented by a square matrix A, the eigenvalue equation can be expressed as $A x = \lambda x$.
• This can be rearranged to $(A - \lambda I) x = 0$.
• If there exists an inverse $(A - \lambda I)^{-1}$, then both sides can be left-multiplied by it to obtain the trivial solution x = 0.
• Therefore, if λ is such that $A - \lambda I$ is invertible, λ cannot be an eigenvalue.
• Thus we require there to be no inverse, which by linear algebra means the determinant equals zero: $\det(A - \lambda I) = 0$.
• The determinant requirement is called the characteristic equation of A, and the left-hand side is called the characteristic polynomial. When expanded, this gives a polynomial equation for λ.

Example
• The matrix $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ defines a linear transformation of the real plane.
• The eigenvalues of this transformation are given by the characteristic equation $\det(A - \lambda I) = (2 - \lambda)^2 - 1 = 0$.
• The roots of this equation are λ = 1 and λ = 3.
• Considering first the eigenvalue λ = 3, we have $A \begin{pmatrix} x \\ y \end{pmatrix} = 3 \begin{pmatrix} x \\ y \end{pmatrix}$.
• After matrix multiplication, this matrix equation represents a system of two linear equations, 2x + y = 3x and x + 2y = 3y. Both equations reduce to the single linear equation x = y.
• To find an eigenvector, we are free to choose any value for x, so by picking x = 1 and setting y = x, we find the eigenvector to be $(1, 1)^T$.

Second principal component: maximize Var($z_2$) subject to $\|w_2\| = 1$ and $w_2$ orthogonal to $w_1$:
$$\max_{w_2} \; w_2^T \Sigma w_2 - \alpha (w_2^T w_2 - 1) - \beta (w_2^T w_1 - 0).$$
Taking the derivative with respect to $w_2$ and setting it equal to 0, we have
$$2 \Sigma w_2 - 2 \alpha w_2 - \beta w_1 = 0.$$
Premultiplying by $w_1^T$ gives
$$2 w_1^T \Sigma w_2 - 2 \alpha w_1^T w_2 - \beta w_1^T w_1 = 0.$$
Note that $w_1^T w_2 = 0$ and $\Sigma w_1 = \lambda_1 w_1$, so $w_1^T \Sigma w_2 = w_2^T \Sigma w_1 = \lambda_1 w_2^T w_1 = 0$; since $w_1^T w_1 = 1$, it follows that $\beta = 0$. Therefore
$$\Sigma w_2 = \alpha w_2,$$
that is, $w_2$ should be the eigenvector of $\Sigma$ with the second largest eigenvalue, $\lambda_2$.

What PCA does
$z = W^T (x - m)$, where the columns of W are the eigenvectors of $\Sigma$ and m is the sample mean.
PCA centers the data at the origin and rotates the axes.
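As a compact illustration of the derivation above, here is a hedged NumPy sketch (ours, not the textbook's code): the principal directions are taken as the leading eigenvectors of the sample covariance, and the data are projected with z = W^T(x - m).

```python
import numpy as np

def pca(X, k):
    """X: N x d data matrix. Returns Z (the N x k projections z = W^T(x - m)),
    W (d x k matrix whose columns are the leading eigenvectors of the sample
    covariance), and all eigenvalues sorted in descending order."""
    m = X.mean(axis=0)                       # sample mean m
    Sigma = np.cov(X, rowvar=False)          # d x d sample covariance (estimate of Sigma)
    lambdas, vecs = np.linalg.eigh(Sigma)    # eigh returns ascending eigenvalues
    order = np.argsort(lambdas)[::-1]        # largest variance first
    lambdas, vecs = lambdas[order], vecs[:, order]
    W = vecs[:, :k]                          # keep the k leading eigenvectors
    Z = (X - m) @ W                          # center the data, then project
    return Z, W, lambdas
```

With the eigenvalues sorted in descending order, the proportion of variance discussed next can be read off as lambdas[:k].sum() / lambdas.sum().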
How to choose k?
Proportion of Variance (PoV) explained:
$$\text{PoV} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d},$$
where the $\lambda_i$ are sorted in descending order.
Typically, stop at PoV > 0.9.
Scree graph: plot PoV vs k and stop at the "elbow".

Example of PCA
Figure 6.3: Optdigits data plotted in the space of two principal components. Only the labels of a hundred data points are shown to minimize the ink-to-noise ratio.

Factor Analysis
Find a small number of unobservable, latent factors z which, when combined, generate x:
$$x_i - \mu_i = v_{i1} z_1 + v_{i2} z_2 + \cdots + v_{ik} z_k + \varepsilon_i,$$
where $z_j$, $j = 1,\dots,k$, are the latent factors with $E[z_j] = 0$, $\text{Var}(z_j) = 1$, $\text{Cov}(z_i, z_j) = 0$ for $i \ne j$; the $\varepsilon_i$ are the noise sources with $E[\varepsilon_i] = 0$, $\text{Var}(\varepsilon_i) = \psi_i$, $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$, $\text{Cov}(\varepsilon_i, z_j) = 0$; and the $v_{ij}$ are the factor loadings. Then
$$\text{Var}(x_i) = v_{i1}^2 + v_{i2}^2 + \cdots + v_{ik}^2 + \psi_i.$$
FA, like PCA, is an unsupervised method.

PCA vs FA
PCA goes from x to z: $z = W^T (x - \mu)$.
FA goes from z to x: $x - \mu = V z + \varepsilon$.

Factor Analysis
In FA, the factors $z_j$ are stretched, rotated, and translated to generate x.

Given S, the estimator of $\Sigma$, we would like to find V and $\Psi$ such that
$$S = \text{Cov}(x) = \text{Cov}(V z + \varepsilon) = V V^T + \Psi \quad \text{(p. 121)},$$
where $\Psi$ is a diagonal matrix with the $\psi_i$ ($\psi_i = \text{Var}(\varepsilon_i)$) on its diagonal.
Ignoring $\Psi$: $S = C D C^T = (C D^{1/2})(C D^{1/2})^T$, so $V = C D^{1/2}$ and the factor scores are $Z = X S^{-1} V$ (pp. 123-124), where X is the matrix of observations, C is the matrix of eigenvectors, and D is the diagonal matrix with the eigenvalues on its diagonal.
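A minimal NumPy sketch of the spectral shortcut above, with $\Psi$ ignored as stated on the slide; the function name is ours, and this is an illustration rather than a full maximum-likelihood factor analysis.

```python
import numpy as np

def factor_analysis_loadings(X, k):
    """Sketch of the spectral estimate above with Psi ignored:
    S = C D C^T, loadings V = C D^{1/2} truncated to k factors,
    and factor scores Z = X S^{-1} V (with X centered)."""
    Xc = X - X.mean(axis=0)                  # center the observations
    S = np.cov(Xc, rowvar=False)             # sample estimate of Sigma (assumed invertible)
    d_vals, C = np.linalg.eigh(S)            # eigenvalues in ascending order
    order = np.argsort(d_vals)[::-1][:k]     # indices of the k largest eigenvalues
    C_k = C[:, order]
    D_k = np.diag(d_vals[order])
    V = C_k @ np.sqrt(D_k)                   # factor loadings V = C D^{1/2}
    Z = Xc @ np.linalg.solve(S, V)           # factor scores Z = X S^{-1} V
    return V, Z
```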
Example
• The following example is a simplification for expository purposes and should not be taken to be realistic.
• Suppose a psychologist proposes a theory that there are two kinds of intelligence, "verbal intelligence" and "mathematical intelligence", neither of which is directly observed.
• Evidence for the theory is sought in the examination scores of 1000 students in each of 10 different academic fields.
• If each student is chosen randomly from a large population, then each student's 10 scores are random variables.
• The psychologist's theory may say that, for each of the 10 academic fields, the score averaged over the group of all students who share some common pair of values for verbal and mathematical "intelligences" is some constant times their level of verbal intelligence plus another constant times their level of mathematical intelligence; i.e., it is a linear combination of those two "factors".
• The numbers for a particular subject, by which the two kinds of intelligence are multiplied to obtain the expected score, are posited by the theory to be the same for all intelligence-level pairs, and are called the "factor loadings" for that subject.
• For example, the theory may hold that the average student's aptitude in the field of amphibiology is {10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}.
• The numbers 10 and 6 are the factor loadings associated with amphibiology. Other academic subjects may have different factor loadings.
• Two students having identical degrees of verbal intelligence and identical degrees of mathematical intelligence may have different aptitudes in amphibiology because individual aptitudes differ from average aptitudes. That difference is called the "error", a statistical term that means the amount by which an individual differs from what is average for his or her levels of intelligence.
• The observable data that go into factor analysis are the 10 scores of each of the 1000 students, a total of 10,000 numbers. The factor loadings and the levels of the two kinds of intelligence of each student must be inferred from the data.
(http://en.wikipedia.org/wiki/Factor_analysis)

Multidimensional Scaling
Given the pairwise distances between N points, $d_{ij}$, $i, j = 1,\dots,N$, place the points on a low-dimensional map such that the distances are preserved.
With a linear mapping $z = g(x \mid \theta) = W^T x$, this is the Sammon mapping. Find $\theta$ that minimizes the Sammon stress
$$E(\theta \mid X) = \sum_{r,s} \frac{\left( \|z^r - z^s\| - \|x^r - x^s\| \right)^2}{\|x^r - x^s\|^2} = \sum_{r,s} \frac{\left( \|g(x^r \mid \theta) - g(x^s \mid \theta)\| - \|x^r - x^s\| \right)^2}{\|x^r - x^s\|^2}.$$
The Sammon stress is the normalized error in the mapping. One can use any regression method for $g(x \mid \theta)$ and estimate $\theta$ to minimize the stress on the training data X. If $g(x \mid \theta)$ is nonlinear in x, this corresponds to a nonlinear dimensionality reduction.

Map of Europe by MDS
(Figure; map from CIA – The World Factbook: http://www.cia.gov/)

Linear Discriminant Analysis
LDA is a supervised method for dimensionality reduction in classification problems. Find a low-dimensional space such that when x is projected onto it, the classes are well separated.
For two classes, find w that maximizes
$$J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2},$$
where, on the sample $X = \{x^t, r^t\}$ with $r^t = 1$ if $x^t \in C_1$ and $r^t = 0$ if $x^t \in C_2$,
$$m_1 = \frac{\sum_t w^T x^t r^t}{\sum_t r^t}, \qquad m_2 = \frac{\sum_t w^T x^t (1 - r^t)}{\sum_t (1 - r^t)},$$
$$s_1^2 = \sum_t (w^T x^t - m_1)^2 r^t, \qquad s_2^2 = \sum_t (w^T x^t - m_2)^2 (1 - r^t).$$

Between-class scatter:
$$(m_1 - m_2)^2 = (w^T \mathbf{m}_1 - w^T \mathbf{m}_2)^2 = w^T (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T w = w^T S_B w,$$
where $\mathbf{m}_1, \mathbf{m}_2$ are the class means in the original space and $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$.
Within-class scatter:
$$s_1^2 = \sum_t (w^T x^t - m_1)^2 r^t = \sum_t w^T (x^t - \mathbf{m}_1)(x^t - \mathbf{m}_1)^T w \, r^t = w^T S_1 w, \qquad S_1 = \sum_t r^t (x^t - \mathbf{m}_1)(x^t - \mathbf{m}_1)^T.$$
Similarly, $s_2^2 = w^T S_2 w$ with $S_2 = \sum_t (1 - r^t)(x^t - \mathbf{m}_2)(x^t - \mathbf{m}_2)^T$, so
$$s_1^2 + s_2^2 = w^T S_W w, \qquad S_W = S_1 + S_2.$$

Fisher's Linear Discriminant
Find w that maximizes
$$J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w} = \frac{\left| w^T (\mathbf{m}_1 - \mathbf{m}_2) \right|^2}{w^T S_W w}.$$
The LDA solution is $w = c \, S_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2)$ for a constant c (see p. 130).
Remember that when $p(x \mid C_i) \sim \mathcal{N}(\mu_i, \Sigma)$, we have a linear discriminant with $w = \Sigma^{-1}(\mu_1 - \mu_2)$. Fisher's linear discriminant is optimal if the classes are normally distributed.
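The closed-form Fisher direction above translates directly into code. A hedged NumPy sketch (function and argument names are ours, not from the textbook):

```python
import numpy as np

def fisher_direction(X, r):
    """Two-class Fisher direction for X (N x d) and labels r (1 for C1, 0 for C2):
    returns w proportional to S_W^{-1} (m1 - m2), as derived above."""
    r = np.asarray(r)
    X1, X2 = X[r == 1], X[r == 0]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)             # within-class scatter of C1
    S2 = (X2 - m2).T @ (X2 - m2)             # within-class scatter of C2
    SW = S1 + S2
    w = np.linalg.solve(SW, m1 - m2)         # w = c S_W^{-1}(m1 - m2) with c = 1
    return w / np.linalg.norm(w)             # the scale of w is arbitrary
```

Projecting the data with X @ w gives the one-dimensional representation in which the two classes are maximally separated in the J(w) sense.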
K > 2 Classes
Within-class scatter:
$$S_W = \sum_{i=1}^{K} S_i, \qquad S_i = \sum_t r_i^t (x^t - \mathbf{m}_i)(x^t - \mathbf{m}_i)^T.$$
Between-class scatter:
$$S_B = \sum_{i=1}^{K} N_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T, \qquad \mathbf{m} = \frac{1}{K} \sum_{i=1}^{K} \mathbf{m}_i, \qquad N_i = \sum_t r_i^t.$$
Find W that maximizes
$$J(W) = \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}.$$
The eigenvectors of $S_W^{-1} S_B$ with the largest eigenvalues are the solution.

Exercise
Draw two-class, two-dimensional data such that (a) PCA and LDA find the same direction and (b) PCA and LDA find totally different directions.

Isomap (Isometric Feature Mapping)
Geodesic distance is the distance along the manifold on which the data lie, as opposed to the Euclidean distance in the input space. Isomap uses the geodesic distances between all pairs of data points.
For neighboring points that are close in the input space, the Euclidean distance can be used. For faraway points, the geodesic distance is approximated by the sum of the distances between the points along the way over the manifold.
(Figure: Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html)

Isomap
Instances r and s are connected in the graph if $\|x^r - x^s\| < \epsilon$ or if $x^s$ is one of the k nearest neighbors of $x^r$; the edge length is $\|x^r - x^s\|$.
For two nodes r and s that are not directly connected, the geodesic distance is the length of the shortest path between them.
Once the N×N distance matrix is formed in this way, use MDS (multidimensional scaling) to find a lower-dimensional mapping. (A sketch of this graph construction is given at the end of these notes.)
(Figure: Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html)

Locally Linear Embedding
Locally linear embedding recovers global nonlinear structure from locally linear fits. Each local patch of the manifold can be approximated linearly and, given enough data, each point can be written as a linear, weighted sum of its neighbors. So:
1. Given $x^r$, find its neighbors $x^r_{(s)}$.
2. Find the reconstruction weights $W_{rs}$ that minimize
$$E(W \mid X) = \sum_r \Big\| x^r - \sum_s W_{rs} \, x^r_{(s)} \Big\|^2$$
subject to $W_{rr} = 0$ for all r and $\sum_s W_{rs} = 1$.
3. Find the new coordinates $z^r$ that minimize
$$E(z \mid W) = \sum_r \Big\| z^r - \sum_s W_{rs} \, z^r_{(s)} \Big\|^2.$$

Locally Linear Embedding
LLE first learns the constraints in the original space and then places the points in the new space respecting those constraints. The constraints are learned using the immediate neighbors but also propagate to second-order neighbors.

PCA vs LLE
(Figure: comparison of PCA and LLE embeddings, from http://www.cs.nyu.edu/~roweis/lle/papers/lleintro.pdf)

Exercise
In Isomap, instead of using the Euclidean distance, we can also use the Mahalanobis distance between neighboring points. What are the advantages and disadvantages of this approach, if any?
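Sketch referred to in the Isomap slide above: a minimal Python/SciPy illustration of the neighborhood-graph construction and the shortest-path approximation of geodesic distances (the k-nearest-neighbor variant; the function name and the use of SciPy's Dijkstra routine are our choices, not from the slides). The resulting N×N matrix would then be passed to MDS.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap_distances(X, k=5):
    """Connect each point to its k nearest neighbors with Euclidean edge
    lengths, then approximate geodesic distances by shortest paths in the
    resulting graph. Returns the N x N matrix of approximate geodesics."""
    D = cdist(X, X)                           # Euclidean distances in the input space
    N = D.shape[0]
    G = np.zeros((N, N))                      # zero entries mean "no edge"
    for r in range(N):
        nbrs = np.argsort(D[r])[1:k + 1]      # k nearest neighbors of x^r (index 0 is x^r itself)
        G[r, nbrs] = D[r, nbrs]               # edge length ||x^r - x^s||
    G = np.maximum(G, G.T)                    # keep an edge if either endpoint chose it
    # Shortest-path lengths approximate the geodesic distances; entries are
    # inf if the neighborhood graph is disconnected (then k must be increased).
    return shortest_path(csr_matrix(G), method='D', directed=False)
```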