Ch 4. Linear Models for Classification (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized and revised by Hee-Woong Lim.
(C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Contents
4.1. Discriminant Functions
4.2. Probabilistic Generative Models

Classification Models
- Linear classification model: the decision surfaces are (D-1)-dimensional hyperplanes in the D-dimensional input space.
- 1-of-K coding scheme for K > 2 classes, e.g. $t = (0, 1, 0, 0, 0)^T$.
- Discriminant function: directly assigns each vector x to a specific class, e.g. Fisher's linear discriminant.
- Approaches using the conditional probability $p(C_k \mid x)$: separation of the inference and decision stages. Two approaches:
  - Direct modeling of the posterior probability.
  - Generative approach: model the likelihood and the prior probability, and combine them to compute the posterior; also capable of generating samples.

Discriminant Functions - Two Classes
- Classification by hyperplanes:
  $y(x) = w^T x + w_0$; if $y(x) \geq 0$ then $x \in C_1$, otherwise $x \in C_2$.
- Equivalently $y(x) = \tilde{w}^T \tilde{x}$, where $\tilde{w} = (w_0, w)$ and $\tilde{x} = (1, x)$.

Discriminant Functions - Multiple Classes
- One-versus-the-rest classifier: K-1 classifiers for a K-class discriminant. Ambiguous in regions where more than one classifier says 'yes'.
- One-versus-one classifier: K(K-1)/2 binary discriminant functions with majority voting. Ambiguous in regions where classes receive equal scores.
[Figure: ambiguous regions of the one-versus-the-rest and one-versus-one constructions]

Discriminant Functions - Multiple Classes (Cont'd)
- A K-class discriminant comprising K linear functions:
  $y_k(x) = w_k^T x + w_{k0}$, $k = 1, \ldots, K$, assigning x to the class with the maximum output, i.e. $x \in C_k$ if $y_k(x) > y_j(x)$ for all $j \neq k$.
- The decision regions are always singly connected and convex:
  for $x_A, x_B \in C_k$, let $\hat{x} = \lambda x_A + (1-\lambda) x_B$ with $0 \leq \lambda \leq 1$. By linearity, $y_k(\hat{x}) = \lambda y_k(x_A) + (1-\lambda) y_k(x_B)$. Since $y_k(x_A) > y_j(x_A)$ and $y_k(x_B) > y_j(x_B)$ for all $j \neq k$, it follows that $y_k(\hat{x}) > y_j(\hat{x})$ for all $j \neq k$, so $\hat{x} \in C_k$.

Approaches for Learning the Parameters of Linear Discriminant Functions
- Least squares method
- Fisher's linear discriminant (relation to least squares; multiple classes)
- Perceptron algorithm

Least Squares Method
- Minimization of the sum-of-squares error (SSE), using the 1-of-K binary coding scheme for the target vector t.
- $y(x) = \tilde{W}^T \tilde{x}$, where $\tilde{W} = (\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_K)$ and $\tilde{w}_k = (w_{k0}, w_k^T)^T$.
- For a training data set $\{x_n, t_n\}$, $n = 1, \ldots, N$, the sum-of-squares error function is
  $E_D(\tilde{W}) = \frac{1}{2} \mathrm{Tr}\{ (\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T) \}$,
  where the n-th rows of $\tilde{X}$ and $T$ are $\tilde{x}_n^T$ and $t_n^T$.
- Minimizing the SSE gives
  $\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^{\dagger} T$, where $\tilde{X}^{\dagger}$ is the pseudo-inverse of $\tilde{X}$.

Least Squares Method (Cont'd) - Limitations and Disadvantages
- The least-squares solution yields y(x) whose elements sum to 1, but the individual outputs are not constrained to lie in the range [0, 1].
- Vulnerable to outliers, because the SSE function also penalizes examples that are 'too correct', i.e. far from the decision boundary on the correct side.
- The lack of robustness comes from the underlying assumption: least squares corresponds to maximum likelihood under a Gaussian conditional distribution, which is unimodal, whereas binary target vectors are far from Gaussian and class data may be multimodal.
[Figure: decision boundaries of the least-squares solution vs. logistic regression on data with outliers]
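The pseudo-inverse solution above is a one-liner in practice. Below is a minimal numpy sketch (not from the slides): it assumes a data matrix X of shape (N, D) and a 1-of-K target matrix T of shape (N, K); the function names are illustrative.

```python
import numpy as np

def lstsq_classifier_fit(X, T):
    """Least-squares classifier: W~ = pinv(X~) @ T.

    X : (N, D) inputs, T : (N, K) 1-of-K targets.
    Returns W_tilde of shape (D+1, K); row 0 holds the biases w_k0.
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1
    return np.linalg.pinv(X_tilde) @ T                   # pseudo-inverse solution

def lstsq_classifier_predict(W_tilde, X):
    """Assign each x to the class with the largest linear output y_k(x)."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```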
Fisher's Linear Discriminant
- A linear classification model viewed as dimensionality reduction from the D-dimensional input space to one dimension.
- In the case of two classes: $y = w^T x$; if $y \geq -w_0$ then $x \in C_1$, otherwise $x \in C_2$.
- The goal is to find w such that the projected data are well separated by class.

Fisher's Linear Discriminant (Cont'd)
- Maximizing the distance between the projected class means?
  $m_2 - m_1 = w^T (m_2 - m_1)$, where $m_1 = \frac{1}{N_1}\sum_{n \in C_1} x_n$ and $m_2 = \frac{1}{N_2}\sum_{n \in C_2} x_n$.
- Not appropriate when the class covariances are strongly nondiagonal: the projected classes may still overlap considerably.

Fisher's Linear Discriminant (Cont'd)
- Also take the within-class variance of the projected data into account, and find w that maximizes the Fisher criterion
  $J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$, where $s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2$.
- In terms of the original data,
  $J(w) = \frac{w^T S_B w}{w^T S_W w}$,
  with the between-class covariance matrix $S_B = (m_2 - m_1)(m_2 - m_1)^T$ and the within-class covariance matrix $S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T$.
- $J(w)$ is maximized when $(w^T S_B w)\, S_W w = (w^T S_W w)\, S_B w$; since $S_B w$ is always in the direction of $(m_2 - m_1)$, Fisher's linear discriminant is
  $w \propto S_W^{-1}(m_2 - m_1)$.
- If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.

Fisher's Linear Discriminant - Relation to Least Squares
- The Fisher criterion arises as a special case of least squares when the target values are set to $N/N_1$ for class $C_1$ and $-N/N_2$ for class $C_2$:
  $E = \frac{1}{2}\sum_{n=1}^{N} (w^T x_n + w_0 - t_n)^2$
  $dE/dw_0 = \sum_{n=1}^{N} (w^T x_n + w_0 - t_n) = 0$   (1)
  $dE/dw = \sum_{n=1}^{N} (w^T x_n + w_0 - t_n)\, x_n = 0$   (2)
- Solving (1) gives $w_0 = -w^T m$, where $m = \frac{1}{N}\sum_{n} x_n = \frac{1}{N}(N_1 m_1 + N_2 m_2)$.
- Solving (2) with this $w_0$ gives $\left(S_W + \frac{N_1 N_2}{N} S_B\right) w = N(m_1 - m_2)$; since $S_B w$ is always in the direction of $(m_2 - m_1)$, again $w \propto S_W^{-1}(m_2 - m_1)$.

Fisher's Discriminant for Multiple Classes
- K > 2 classes: dimensionality reduction from D to D', with D' > 1 linear 'features' $y_k = w_k^T x$, $k = 1, \ldots, D'$.
- Generalization of $S_W$ and $S_B$:
  $S_W = \sum_{k=1}^{K} S_k$, where $S_k = \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^T$ and $m_k = \frac{1}{N_k}\sum_{n \in C_k} x_n$;
  $S_B = \sum_{k=1}^{K} N_k (m_k - m)(m_k - m)^T$.
- $S_B$ follows from the decomposition of the total covariance matrix (Duda and Hart, 1997):
  $S_T = \sum_{n=1}^{N} (x_n - m)(x_n - m)^T$, where $m = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{1}{N}\sum_{k=1}^{K} N_k m_k$, and $S_T = S_W + S_B$.

Fisher's Discriminant for Multiple Classes (Cont'd)
- Covariance matrices in the projected y-space:
  $s_W = \sum_{k=1}^{K} \sum_{n \in C_k} (y_n - \mu_k)(y_n - \mu_k)^T$ and $s_B = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^T$,
  where $\mu_k = \frac{1}{N_k}\sum_{n \in C_k} y_n$ and $\mu = \frac{1}{N}\sum_{k=1}^{K} N_k \mu_k$.
- Fukunaga's criterion:
  $J(W) = \mathrm{Tr}\{ s_W^{-1} s_B \} = \mathrm{Tr}\{ (W S_W W^T)^{-1} (W S_B W^T) \}$.
- Another criterion (Duda et al., 'Pattern Classification', Ch. 3.8.3):
  $J(W) = \frac{|s_B|}{|s_W|} = \frac{|W S_B W^T|}{|W S_W W^T|}$.
  The determinant is the product of the eigenvalues, i.e. the variances in the principal directions.

Perceptron Algorithm
- Classification of x by a perceptron:
  $y(x) = f(w^T \phi(x))$, where $f(a) = +1$ if $a \geq 0$ and $f(a) = -1$ if $a < 0$.
- Error functions:
  - The total number of misclassified patterns: piecewise constant and discontinuous in w, so its gradient is zero almost everywhere.
  - Perceptron criterion: $E_P(w) = -\sum_{n \in \mathcal{M}} w^T \phi_n t_n$, where $t_n \in \{-1, +1\}$ is the target output and $\mathcal{M}$ is the set of misclassified patterns.

Perceptron Algorithm (Cont'd)
- Stochastic gradient descent:
  $w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta\, \phi_n t_n$.
- The error from a misclassified pattern is reduced after each update,
  $-w^{(\tau+1)T} \phi_n t_n = -w^{(\tau)T} \phi_n t_n - (\phi_n t_n)^T \phi_n t_n < -w^{(\tau)T} \phi_n t_n$,
  but this does not imply that the overall error is reduced.
- Perceptron convergence theorem: if an exact solution exists (i.e. the training data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps.
- However, convergence may be slow, and the basic algorithm does not handle linearly nonseparable data or more than two classes.
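As a concrete illustration of the update rule above, here is a minimal numpy sketch (not from the slides). It assumes a feature matrix Phi of shape (N, M) whose rows already include a bias component, and targets t_n in {-1, +1}; the learning rate and pass limit are arbitrary choices.

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_passes=100):
    """Cycle through the data, updating w only on misclassified patterns:
    w <- w + eta * phi_n * t_n, with targets t_n in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_passes):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:      # misclassified (or on the boundary)
                w += eta * phi_n * t_n      # perceptron update
                errors += 1
        if errors == 0:                     # all patterns classified correctly
            break
    return w
```

If the data are linearly separable in feature space, the loop terminates with zero errors (the convergence theorem); otherwise it simply stops after max_passes.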
Perceptron Algorithm (Cont'd)
[Figure: panels (a)-(d) illustrating successive perceptron updates]

Probabilistic Generative Models
- Compute the posterior probabilities from class-conditional densities $p(x \mid C_k)$ and class priors $p(C_k)$.
- Two classes:
  $p(C_1 \mid x) = \frac{p(x \mid C_1) p(C_1)}{p(x \mid C_1) p(C_1) + p(x \mid C_2) p(C_2)} = \sigma(a) = \frac{1}{1 + \exp(-a)}$,
  where $a = \ln \frac{p(x \mid C_1) p(C_1)}{p(x \mid C_2) p(C_2)}$.
- Generalization to K > 2 classes:
  $p(C_k \mid x) = \frac{p(x \mid C_k) p(C_k)}{\sum_j p(x \mid C_j) p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$, where $a_k = \ln p(x \mid C_k) p(C_k)$.
- The normalized exponential is also known as the softmax function, i.e. a smoothed version of the 'max' function.

Probabilistic Generative Models - Continuous Inputs
- Posterior probabilities when the class-conditional densities are Gaussian and share the same covariance matrix $\Sigma$:
  $p(x \mid C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}$.
- Two classes: $p(C_1 \mid x) = \sigma(w^T x + w_0)$, with
  $w = \Sigma^{-1}(\mu_1 - \mu_2)$ and $w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}$.
- The quadratic terms in x from the exponents cancel, so the resulting decision boundary is linear in input space. The priors only shift the decision boundary, i.e. the contours of constant posterior probability move in parallel.

Probabilistic Generative Models - Continuous Inputs (Cont'd)
- Generalization to K classes: $a_k(x) = w_k^T x + w_{k0}$, with
  $w_k = \Sigma^{-1}\mu_k$ and $w_{k0} = -\frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)$.
- When the covariance matrix is shared, the decision boundaries are again linear. If each class-conditional density has its own covariance matrix, we obtain quadratic functions of x, giving rise to a quadratic discriminant.

Probabilistic Generative Models - Maximum Likelihood Solution
- Determine the parameters of $p(x \mid C_k)$ and $p(C_k)$ by maximum likelihood from a training data set.
- Two classes:
  Data set $\{x_n, t_n\}$, $n = 1, \ldots, N$, with $t_n = 1$ denoting $C_1$ and $t_n = 0$ denoting $C_2$.
  Priors $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$.
  $p(x_n, C_1) = p(C_1) p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)$,
  $p(x_n, C_2) = p(C_2) p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)$.
- The likelihood function:
  $p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left[ \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) \right]^{t_n} \left[ (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) \right]^{1 - t_n}$, where $\mathbf{t} = (t_1, \ldots, t_N)^T$.

Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)
- Maximization with respect to $\pi$: the terms of the log-likelihood that depend on $\pi$ are $\sum_{n=1}^{N} \{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \}$; setting the derivative with respect to $\pi$ to zero gives
  $\pi = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$.
- Maximization with respect to $\mu_1$: the relevant terms are
  $\sum_{n=1}^{N} t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) = -\frac{1}{2}\sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \text{const}$,
  giving $\mu_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n x_n$, and analogously $\mu_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1 - t_n) x_n$.
- Maximization with respect to the shared covariance matrix $\Sigma$: the relevant terms are
  $-\frac{N}{2}\ln|\Sigma| - \frac{N}{2}\mathrm{Tr}\{\Sigma^{-1} S\}$,
  where $S = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2$ and $S_k = \frac{1}{N_k}\sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T$,
  giving $\Sigma = S$, a weighted average of the covariance matrices associated with each class. This estimate is not robust to outliers.
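The closed-form ML estimates above can be implemented directly. A minimal numpy sketch (not from the slides), assuming X holds the inputs row-wise and t is a 0/1 array with t_n = 1 for class C1:

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML estimates for the two-class, shared-covariance Gaussian model:
    pi = N1/N, class means mu1 and mu2, and Sigma = (N1/N) S1 + (N2/N) S2."""
    t = np.asarray(t)
    N = len(t)
    X1, X2 = X[t == 1], X[t == 0]
    pi = len(X1) / N
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    Sigma = (len(X1) / N) * S1 + (len(X2) / N) * S2
    return pi, mu1, mu2, Sigma

def posterior_c1(x, pi, mu1, mu2, Sigma):
    """p(C1|x) = sigmoid(w^T x + w0), with w and w0 as in the slides."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```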
Probabilistic Generative Models - Discrete Features
- Discrete feature values $x_i \in \{0, 1\}$. A general distribution over x would correspond to a table of $2^D$ entries ($2^D - 1$ independent parameters), so the table size grows exponentially with the number D of features.
- Naïve Bayes assumption: the features are treated as independent, conditioned on the class $C_k$:
  $p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$,
  $a_k(x) = \ln p(x \mid C_k) p(C_k) = \sum_{i=1}^{D} \{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \} + \ln p(C_k)$.
- Again linear with respect to the features, as in the continuous case.

Bayes Decision Boundaries: 2D
[Figure: 2D Bayes decision boundaries, from Pattern Classification, Duda et al., p. 42]

Bayes Decision Boundaries: 3D
[Figure: 3D Bayes decision boundaries, from Pattern Classification, Duda et al., p. 43]

Probabilistic Generative Models - Exponential Family
- For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid (K = 2) or softmax (K > 2) activation functions.
- This generalizes to class-conditional densities from the exponential family, restricted to the subclass for which $u(x) = x$:
  $p(x \mid \lambda_k) = h(x)\, g(\lambda_k) \exp\{ \lambda_k^T u(x) \}$,
  and with a scaling parameter s,
  $p(x \mid \lambda_k, s) = \frac{1}{s}\, h\!\left(\frac{1}{s} x\right) g(\lambda_k) \exp\left\{ \frac{1}{s} \lambda_k^T x \right\}$.
- Two classes: $p(C_1 \mid x) = \sigma(a(x))$ with
  $a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)$.
- K classes: $p(C_k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$ with
  $a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)$.
- Again linear with respect to x.
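A minimal numpy sketch of the naïve Bayes model for binary features (not from the slides), assuming X is an (N, D) array of 0/1 features and y holds integer class labels 0,...,K-1; the Laplace smoothing constant alpha is an added assumption to avoid log(0) and is not part of the ML solution in the slides:

```python
import numpy as np

def fit_bernoulli_nb(X, y, K, alpha=1.0):
    """Estimate mu_ki = p(x_i = 1 | C_k) and the priors p(C_k).
    alpha > 0 applies Laplace smoothing (an assumption beyond plain ML)."""
    y = np.asarray(y)
    N, D = X.shape
    mu = np.empty((K, D))
    prior = np.empty(K)
    for k in range(K):
        Xk = X[y == k]
        mu[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)
        prior[k] = len(Xk) / N
    return mu, prior

def predict_bernoulli_nb(x, mu, prior):
    """a_k(x) = sum_i [x_i ln mu_ki + (1-x_i) ln(1-mu_ki)] + ln p(C_k),
    followed by the softmax to obtain p(C_k|x)."""
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(prior)
    a -= a.max()                 # subtract the max for numerical stability
    p = np.exp(a)
    return p / p.sum()
```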