Lectures 8–9: Linear Methods for Classification
Rice ELEC 697, Farinaz Koushanfar, Fall 2006

Summary
• Bayes classifiers
• Linear classifiers
• Linear regression of an indicator matrix
• Linear discriminant analysis (LDA)
• Logistic regression
• Separating hyperplanes
• Reading: Ch. 4 of ESL

Bayes Classifier
• The marginal distribution of G is specified by the PMF $p_G(g)$, $g = 1, 2, \ldots, K$
• $f_{X|G}(x|G=g)$ is the conditional distribution of X given $G = g$
• The training set $(x_i, g_i)$, $i = 1, \ldots, N$, consists of independent samples from the joint distribution $f_{X,G}(x, g)$
  – $f_{X,G}(x, g) = p_G(g)\, f_{X|G}(x|G=g)$
• The loss of predicting G* when the truth is G is L(G*, G)
• Classification goal: minimize the expected loss
  – $E_{X,G}\, L(G(X), G) = E_X\big( E_{G|X}\, L(G(X), G) \big)$

Bayes Classifier (cont'd)
• It suffices to minimize $E_{G|X}\, L(G(X), G)$ pointwise for each X. The optimal classifier is
  – $G(x) = \arg\min_g E_{G|X=x}\, L(g, G)$

Bayes Classification Rule
• Under 0–1 loss, the Bayes rule is the rule of maximum a posteriori probability
  – $G(x) = \arg\max_g \Pr(G = g \mid X = x)$
• Many classification algorithms estimate $\Pr(G = g \mid X = x)$ and then apply the Bayes rule

More About Linear Classification
• Since the predictor G(x) takes values in a discrete set of classes, we can divide the input space into a collection of regions labeled according to the classification
• For K classes (1, 2, …, K), the fitted linear model for the k-th indicator response variable is $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^T x$
• The decision boundary between classes k and l is the set where $\hat f_k(x) = \hat f_l(x)$, an affine set or hyperplane:
  – $\{x : (\hat\beta_{k0} - \hat\beta_{l0}) + (\hat\beta_k - \hat\beta_l)^T x = 0\}$
• Alternatively, model a discriminant function $\delta_k(x)$ for each class, then classify x to the class with the largest value of $\delta_k(x)$

Linear Decision Boundary
• We only require that some monotone transformation of $\delta_k$ or $\Pr(G=k|X=x)$ be linear in x
• Decision boundaries are the set of points where the log-odds equal zero
• With probability p for class 1 and 1 − p for class 2, apply the logit transformation: $\log[p/(1-p)] = \beta_0 + \beta^T x$
• Two popular methods that model linear log-odds: linear discriminant analysis and linear logistic regression
• Alternatively, explicitly model the boundary between the two classes as linear; for a two-class problem with a p-dimensional input space, this means modeling the decision boundary as a hyperplane
• Two methods that use separating hyperplanes: the perceptron (Rosenblatt) and optimal separating hyperplanes (Vapnik)

Generalizing Linear Decision Boundaries
• Expand the variable set $X_1, \ldots, X_p$ by including squares and cross-products, adding up to p(p+1)/2 additional variables; linear boundaries in the augmented space correspond to quadratic boundaries in the original space

Linear Regression of an Indicator Matrix
• For K classes, define K indicators $Y_k$, $k = 1, \ldots, K$, with $Y_k = 1$ if $G = k$ and $Y_k = 0$ otherwise
• These indicators are collected in an indicator response matrix

Linear Regression of an Indicator Matrix (cont'd)
• For N training observations, form the N×K indicator response matrix Y, a matrix of 0's and 1's
• Fit by least squares: $\hat Y = X(X^T X)^{-1} X^T Y$, with coefficient matrix $\hat B = (X^T X)^{-1} X^T Y$
• A new observation x is classified as follows (see the sketch after this section):
  – Compute the fitted output (a K-vector): $\hat f(x) = [(1, x^T)\hat B]^T$
  – Identify the largest component and classify accordingly: $\hat G(x) = \arg\max_k \hat f_k(x)$
• But… how good is the fit?
  – One can verify that $\sum_k \hat f_k(x) = 1$ for any x
  – However, $\hat f_k(x)$ can be negative or larger than 1
• We can also apply linear regression to a basis expansion h(x) of the inputs
• As the size of the training set increases, adaptively add more basis functions

Linear Regression – Drawback
• For K ≥ 3 classes, and especially for large K, some classes can be masked by others: their fitted value is never the largest
• For large K and small p, masking occurs naturally
• E.g. the vowel recognition data (the example continues after the sketch below)
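To make the indicator-matrix regression above concrete, here is a minimal NumPy sketch (illustrative only, not from the lecture; the function names and the toy data are assumptions). It builds the N×K indicator matrix Y, fits all K columns by least squares with an intercept, and classifies a new point by the largest fitted component.

    import numpy as np

    def fit_indicator_regression(X, g, K):
        """Least-squares fit of the N x K indicator response matrix.
        X: (N, p) inputs; g: (N,) integer class labels in {0, ..., K-1}."""
        N = X.shape[0]
        Y = np.zeros((N, K))
        Y[np.arange(N), g] = 1.0                  # indicator responses Y_k
        X1 = np.hstack([np.ones((N, 1)), X])      # prepend an intercept column
        # B_hat solves min ||X1 B - Y||^2, i.e. B_hat = (X1^T X1)^{-1} X1^T Y
        B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
        return B_hat                              # shape (p+1, K)

    def predict_indicator_regression(B_hat, X_new):
        """Classify by the largest component of f_hat(x) = (1, x^T) B_hat."""
        X1 = np.hstack([np.ones((X_new.shape[0], 1)), X_new])
        F = X1 @ B_hat                            # fitted K-vectors; rows sum to 1
        return F.argmax(axis=1)

    # Toy usage on hypothetical data: three Gaussian classes in 2-D
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    g = np.repeat([0, 1, 2], 50)
    B = fit_indicator_regression(X, g, K=3)
    print(predict_indicator_regression(B, X[:5]))

Solving with lstsq rather than explicitly forming $(X^T X)^{-1}$ gives the same fit but is numerically safer when $X^T X$ is ill-conditioned.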
Linear Regression – Drawback: Example
• Vowel recognition data projected onto a 2-D subspace; K = 11 classes, p = 10 dimensions

Linear Regression and Projection*
• A linear regression function (here in 2-D)
• projects each point $x = [x_1\ x_2]^T$ onto a line parallel to $w_1$
• We can study how well the projected points $\{z_1, z_2, \ldots, z_n\}$, viewed as functions of $w_1$, are separated across the classes
* Slides courtesy of Tommi S. Jaakkola, MIT CSAIL

Projection and Classification
• By varying $w_1$ we get different levels of separation between the projected points

Optimizing the Projection
• We would like to find the $w_1$ that maximizes the separation of the projected points across classes
• We can quantify the separation (overlap) in terms of the means and variances of the resulting 1-D class distributions

Fisher Linear Discriminant: Preliminaries
• Class descriptions in $\mathbb{R}^d$:
  – Class 0: $n_0$ samples, mean $\mu_0$, covariance $\Sigma_0$
  – Class 1: $n_1$ samples, mean $\mu_1$, covariance $\Sigma_1$
• Projected class descriptions in $\mathbb{R}$:
  – Class 0: $n_0$ samples, mean $\mu_0^T w_1$, variance $w_1^T \Sigma_0 w_1$
  – Class 1: $n_1$ samples, mean $\mu_1^T w_1$, variance $w_1^T \Sigma_1 w_1$

Fisher Linear Discriminant
• Estimation criterion: find the $w_1$ that maximizes the ratio of the squared separation of the projected class means to the total projected within-class variance,
  – $J(w_1) = \dfrac{(\mu_1^T w_1 - \mu_0^T w_1)^2}{w_1^T \Sigma_0 w_1 + w_1^T \Sigma_1 w_1}$
• The solution (class separation) is decision-theoretically optimal for two normal populations with equal covariances ($\Sigma_1 = \Sigma_0$)

Linear Discriminant Analysis (LDA)
• $\pi_k$: class prior $\Pr(G = k)$
• $f_k(x)$: density of X in class $G = k$
• Bayes theorem: $\Pr(G = k \mid X = x) = \dfrac{f_k(x)\,\pi_k}{\sum_{l=1}^K f_l(x)\,\pi_l}$
• Different models for $f_k$ lead to LDA, QDA, MDA (mixture DA), kernel DA, and naive Bayes
• Suppose we model each class density as a multivariate Gaussian:
  – $f_k(x) = (2\pi)^{-p/2}\,|\Sigma_k|^{-1/2} \exp\!\big(-\tfrac12 (x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\big)$
• LDA arises when we assume the classes share a common covariance matrix, $\Sigma_k = \Sigma$ for all k; it is then sufficient to look at the log-odds

LDA
• The log-odds between classes k and l are linear in x:
  – $\log\dfrac{\Pr(G=k|X=x)}{\Pr(G=l|X=x)} = \log\dfrac{\pi_k}{\pi_l} - \tfrac12(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l)$
• The decision boundary between k and l, where $\Pr(G=k|X=x) = \Pr(G=l|X=x)$, is therefore linear in x; in p dimensions it is a hyperplane
• Example: three classes and p = 2

LDA (cont'd)
• In practice, we do not know the parameters of the Gaussian distributions; estimate them from the training set:
  – $\hat\pi_k = N_k / N$, where $N_k$ is the number of class-k observations
  – $\hat\mu_k = \sum_{g_i = k} x_i / N_k$
  – $\hat\Sigma = \sum_{k=1}^K \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$
• For two classes, this is similar to linear regression on an indicator response

QDA
• If the $\Sigma_k$ are not assumed equal, the quadratic terms in x remain and we get the quadratic discriminant functions (QDA):
  – $\delta_k(x) = -\tfrac12 \log|\Sigma_k| - \tfrac12 (x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) + \log\pi_k$

QDA (cont'd)
• The estimates are similar to LDA, but each class has a separate covariance matrix
• For large p there is a dramatic increase in the number of parameters:
  – LDA has (K−1)(p+1) parameters
  – QDA has (K−1){p(p+3)/2 + 1} parameters
• LDA and QDA both work remarkably well in practice
• This is not because the data are Gaussian; rather, for simple decision boundaries, the Gaussian estimates are stable
• Bias–variance trade-off

Regularized Discriminant Analysis
• A compromise between LDA and QDA: shrink the separate covariances of QDA towards a common covariance (similar to ridge regression)
  – $\hat\Sigma_k(\alpha) = \alpha\,\hat\Sigma_k + (1-\alpha)\,\hat\Sigma$, with $\alpha \in [0, 1]$
• A similar modification shrinks $\hat\Sigma$ itself towards a scalar covariance

Example – RDA

Computations for LDA
• Suppose we compute the eigendecomposition of each $\hat\Sigma_k = U_k D_k U_k^T$, where $U_k$ is a p×p orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then
  – $(x - \hat\mu_k)^T \hat\Sigma_k^{-1} (x - \hat\mu_k) = [U_k^T (x - \hat\mu_k)]^T D_k^{-1} [U_k^T (x - \hat\mu_k)]$
  – $\log|\hat\Sigma_k| = \sum_l \log d_{kl}$
• The LDA classifier itself can then be implemented in two steps, listed after the sketch below
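A minimal NumPy sketch of this LDA computation (illustrative names and structure, not the lecture's code): it estimates the priors, class means, and pooled covariance as above, spheres the data through the eigendecomposition $\hat\Sigma = U D U^T$, and classifies to the closest sphered centroid after adjusting for the log class priors, which is the two-step procedure listed right after this sketch.

    import numpy as np

    def fit_lda(X, g, K):
        """Estimate LDA parameters: class priors, class means, pooled covariance.
        X: (N, p) inputs; g: (N,) integer class labels in {0, ..., K-1}."""
        N, p = X.shape
        pi = np.array([(g == k).mean() for k in range(K)])          # pi_hat_k = N_k / N
        mu = np.array([X[g == k].mean(axis=0) for k in range(K)])   # mu_hat_k
        # pooled within-class covariance, normalized by N - K
        Sigma = sum((X[g == k] - mu[k]).T @ (X[g == k] - mu[k]) for k in range(K)) / (N - K)
        # sphering transform from the eigendecomposition Sigma = U D U^T
        D, U = np.linalg.eigh(Sigma)
        W = U @ np.diag(D ** -0.5)        # X @ W has (estimated) identity covariance
        return pi, mu, W

    def predict_lda(X_new, pi, mu, W):
        """Closest class centroid in the sphered space, corrected by the log priors."""
        Xs = X_new @ W                    # sphered data X*
        mus = mu @ W                      # sphered class centroids
        # squared distance to each centroid minus 2 * log prior; pick the smallest
        d2 = ((Xs[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2) - 2.0 * np.log(pi)
        return d2.argmin(axis=1)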
  – Step 1: sphere the data with respect to the common covariance estimate, $X^* \leftarrow D^{-1/2} U^T X$ where $\hat\Sigma = U D U^T$; the common covariance estimate of $X^*$ is then the identity
  – Step 2: classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$

Background: Simple Decision Theory*
• Suppose we know the class-conditional densities $p(x|y)$ for y = 0, 1 as well as the overall class frequencies $P(y)$
• How do we decide which class a new example x′ belongs to so as to minimize the overall probability of error?
• Answer: choose the class with the larger posterior $P(y|x') \propto p(x'|y)\,P(y)$; this is the Bayes rule
* Courtesy of Tommi S. Jaakkola, MIT CSAIL

2-Class Logistic Regression
• The optimal decisions are based on the posterior class probabilities $P(y|x)$. For binary classification problems, we can write the decision as: predict y = 1 when $\log\dfrac{P(y=1|x)}{P(y=0|x)} > 0$, and y = 0 otherwise
• We generally don't know $P(y|x)$, but we can parameterize the possible decisions through a linear log-odds (logit) model:
  – $\log\dfrac{P(y=1|x)}{P(y=0|x)} = \beta_0 + \beta^T x$

2-Class Logistic Regression (cont'd)
• Our log-odds model gives rise to a specific form for the conditional probability over the labels (the logistic model):
  – $P(y=1|x) = \sigma(\beta_0 + \beta^T x)$, where $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
• $\sigma(\cdot)$ is a logistic squashing function that turns linear predictions into probabilities

2-Class Logistic Regression: Decisions
• Logistic regression models imply a linear decision boundary, $\{x : \beta_0 + \beta^T x = 0\}$, on which $P(y=1|x) = 1/2$

K-Class Logistic Regression
• The model is specified in terms of K−1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)
• The choice of denominator is arbitrary; typically the last class is used:
  – $\log\dfrac{\Pr(G=1|X=x)}{\Pr(G=K|X=x)} = \beta_{10} + \beta_1^T x$
  – $\log\dfrac{\Pr(G=2|X=x)}{\Pr(G=K|X=x)} = \beta_{20} + \beta_2^T x$
  – …
  – $\log\dfrac{\Pr(G=K-1|X=x)}{\Pr(G=K|X=x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$

K-Class Logistic Regression (cont'd)
• A simple calculation shows that
  – $\Pr(G=k|X=x) = \dfrac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$, k = 1, …, K−1
  – $\Pr(G=K|X=x) = \dfrac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$
• To emphasize the dependence on the entire parameter set $\theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, we denote the probabilities as $\Pr(G=k|X=x) = p_k(x;\theta)$

Fitting Logistic Regression Models
• For two classes, $\operatorname{logit} P(x) = \log\dfrac{P(x)}{1 - P(x)} = \beta^T x$
• Log-likelihood:
  – $\ell(\beta) = \sum_{i=1}^N \{\, y_i \log p_i + (1-y_i)\log(1-p_i)\,\} = \sum_{i=1}^N \{\, y_i\,\beta^T x_i - \log(1 + e^{\beta^T x_i})\,\}$

Fitting Logistic Regression Models (cont'd)
• IRLS (iteratively reweighted least squares), equivalent to the Newton–Raphson procedure:
  – Initialize $\beta$
  – Form the linearized response $z_i = x_i^T\beta + \dfrac{y_i - p_i}{p_i(1 - p_i)}$
  – Form the weights $w_i = p_i(1 - p_i)$
  – Update $\beta$ by weighted least squares of $z_i$ on $x_i$ with weights $w_i$
  – Repeat the last three steps until convergence

Example – Logistic Regression
• South African heart disease data:
  – Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas
  – White males between 15 and 64
  – Response: presence or absence of myocardial infarction
  – Maximum likelihood fit of the logistic model to these data

Logistic Regression or LDA?
• Under the LDA model, the log-posterior odds between classes k and K are linear in x:
  – $\log\dfrac{\Pr(G=k|X=x)}{\Pr(G=K|X=x)} = \log\dfrac{\pi_k}{\pi_K} - \tfrac12(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K) + x^T\Sigma^{-1}(\mu_k-\mu_K) = \alpha_{k0} + \alpha_k^T x$
• This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix
• The logistic model has linear logits by construction: $\log\dfrac{\Pr(G=k|X=x)}{\Pr(G=K|X=x)} = \beta_{k0} + \beta_k^T x$
• They use the same form for the logit function (the comparison continues below)
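Before continuing the comparison, here is a minimal two-class IRLS/Newton–Raphson sketch matching the fitting steps listed above (illustrative names, not the lecture's code): each iteration forms the probabilities $p_i$, the weights $w_i = p_i(1-p_i)$, and the linearized response $z_i$, then updates $\beta$ by weighted least squares.

    import numpy as np

    def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
        """Two-class logistic regression fit by IRLS (equivalent to Newton-Raphson).
        X: (N, p) inputs (an intercept column is added here); y: (N,) labels in {0, 1}."""
        N = X.shape[0]
        X1 = np.hstack([np.ones((N, 1)), X])
        beta = np.zeros(X1.shape[1])
        for _ in range(n_iter):
            eta = X1 @ beta
            p = 1.0 / (1.0 + np.exp(-eta))                 # logistic squashing of the linear predictor
            w = p * (1.0 - p)                              # IRLS weights w_i = p_i (1 - p_i)
            z = eta + (y - p) / np.clip(w, 1e-10, None)    # linearized response z_i
            # weighted least squares of z on X1 with weights w:
            # beta <- (X1^T W X1)^{-1} X1^T W z
            WX = X1 * w[:, None]
            beta_new = np.linalg.solve(X1.T @ WX, WX.T @ z)
            if np.max(np.abs(beta_new - beta)) < tol:
                beta = beta_new
                break
            beta = beta_new
        return beta                                        # beta[0] is the intercept

    def predict_proba(beta, X_new):
        """P(y = 1 | x) under the fitted logistic model."""
        X1 = np.hstack([np.ones((X_new.shape[0], 1)), X_new])
        return 1.0 / (1.0 + np.exp(-(X1 @ beta)))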
Logistic Regression or LDA? (cont'd)
• Discriminative vs. informative (generative) learning:
  – Logistic regression uses the conditional distribution of Y given x to estimate the parameters, while LDA uses the full joint distribution (assuming normality)
  – If normality holds, LDA is up to 30% more efficient; otherwise logistic regression can be more robust. In practice, the two methods give similar results

Separating Hyperplanes
• Perceptrons compute a linear combination of the input features and return the sign
• Properties of the hyperplane $L = \{x : f(x) = \beta_0 + \beta^T x = 0\}$:
  – For any $x_1, x_2$ in L, $\beta^T(x_1 - x_2) = 0$, so $\beta^* = \beta / \|\beta\|$ is the unit normal to the surface L
  – For any $x_0$ in L, $\beta^T x_0 = -\beta_0$
  – The signed distance of any point x to L is $\beta^{*T}(x - x_0) = \dfrac{1}{\|\beta\|}(\beta^T x + \beta_0) = \dfrac{1}{\|f'(x)\|} f(x)$

Rosenblatt's Perceptron Learning Algorithm
• Finds a separating hyperplane by minimizing the distance of misclassified points to the decision boundary
• If a response $y_i = 1$ is misclassified, then $x_i^T\beta + \beta_0 < 0$, and the opposite holds for a misclassified point with $y_i = -1$
• The goal is to minimize
  – $D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i (x_i^T\beta + \beta_0)$, where $\mathcal{M}$ is the set of misclassified points

Rosenblatt's Perceptron Learning Algorithm (cont'd)
• Stochastic gradient descent: the misclassified observations are visited in some sequence, and each one updates the parameters
  – $(\beta, \beta_0) \leftarrow (\beta, \beta_0) + \rho\,(y_i x_i,\; y_i)$
• $\rho$ is the learning rate and can be taken to be 1 without loss of generality
• If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (see the sketch at the end of this section)

Optimal Separating Hyperplanes
• Problem: find the separating hyperplane that maximizes the margin M to the closest points of either class,
  – $\max_{\beta, \beta_0, \|\beta\|=1} M$ subject to $y_i(x_i^T\beta + \beta_0) \ge M$, i = 1, …, N

Example – Optimal Separating Hyperplanes
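To illustrate the perceptron update above, here is a minimal stochastic-gradient sketch (assumed names and data handling, not the lecture's code). Misclassified points, identified by $y_i(x_i^T\beta + \beta_0) \le 0$, are visited in sequence and each one triggers the update $\beta \leftarrow \beta + \rho\, y_i x_i$, $\beta_0 \leftarrow \beta_0 + \rho\, y_i$.

    import numpy as np

    def perceptron(X, y, rho=1.0, max_epochs=100):
        """Rosenblatt's perceptron learning algorithm; y must take values in {-1, +1}.
        Terminates with a separating hyperplane only if the classes are linearly separable."""
        N, p = X.shape
        beta = np.zeros(p)
        beta0 = 0.0
        for _ in range(max_epochs):
            updated = False
            for i in range(N):                            # visit observations in sequence
                if y[i] * (X[i] @ beta + beta0) <= 0:     # misclassified (or on the boundary)
                    beta = beta + rho * y[i] * X[i]       # beta   <- beta   + rho * y_i * x_i
                    beta0 = beta0 + rho * y[i]            # beta_0 <- beta_0 + rho * y_i
                    updated = True
            if not updated:                               # no misclassified points: done
                break
        return beta, beta0

If the classes are not separable, the loop simply stops after max_epochs passes without finding a separating hyperplane.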