Week 6

Linear Models for Classification: Probabilistic Methods
Adapted from Seung-Joon Yi, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Recall: Linear Methods for Classification
Problem definition: given the training data $\{x_n, t_n\}$, find a linear model $y_k(x)$ for each class that partitions the feature space into decision regions.
- Deterministic models: discriminant functions
  - Fisher discriminant function
  - Perceptron

Probabilistic Approaches for Classification
- Generative models
  - Inference: model $p(x|C_k)$ and $p(C_k)$
  - Decision: obtain $p(C_k|x)$ from Bayes' theorem
- Discriminative models
  - Model $p(C_k|x)$ directly
  - Use the functional form of the generalized linear model explicitly
  - Determine the parameters directly using maximum likelihood

Logistic Sigmoid Function
- Definition: $\sigma(a) = \frac{1}{1 + \exp(-a)}$
- Originates in models of population growth.
- The cumulative distribution function of a normal random variable has a similar sigmoidal shape (it is the probit function, not the logistic sigmoid itself).
- If the class-conditional densities are normal with a shared covariance matrix, the posteriors become a logistic sigmoid of a linear function of x.

Posterior probabilities can be formulated by
- 2 classes: logistic sigmoid acting on a linear function of x
- K classes: softmax transformation of a linear function of x
- The parameters of the densities, as well as the class priors, can then be determined using maximum likelihood.

Probabilistic Generative Models: 2 Classes
- Recall: given $p(x|C_k)$ and $p(C_k)$, Bayes' theorem gives $p(C_k|x)$.
- The posterior can be expressed by the logistic sigmoid:
  $p(C_1|x) = \frac{p(x|C_1) p(C_1)}{p(x|C_1) p(C_1) + p(x|C_2) p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$
  where $a = \ln \frac{p(x|C_1) p(C_1)}{p(x|C_2) p(C_2)}$.
- a is the log odds $\ln [p(C_1|x) / p(C_2|x)]$. The inverse of the sigmoid, $a = \ln \frac{\sigma}{1 - \sigma}$, is called the logit function.

Probabilistic Generative Models: K Classes
- The posterior can be expressed by the softmax function (normalized exponential), the multi-class generalisation of the logistic sigmoid:
  $p(C_k|x) = \frac{p(x|C_k) p(C_k)}{\sum_j p(x|C_j) p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
  where $a_k = \ln p(x|C_k) p(C_k)$.
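
As a minimal sketch of the two posterior forms above, the following Python snippet evaluates the two-class posterior through the logistic sigmoid and the K-class posterior through the softmax. The function names and the numerical likelihood and prior values are illustrative assumptions, not part of the original slides.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def softmax(a):
    """Normalized exponential of a list of activations a_k."""
    m = max(a)  # subtracting the max improves numerical stability; the result is unchanged
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def log_joint(likelihood, prior):
    """a_k = ln p(x|C_k) p(C_k)."""
    return math.log(likelihood * prior)

# Hypothetical values of p(x|C_k) and p(C_k) at one fixed input x.
likelihoods = [0.3, 0.1]
priors = [0.4, 0.6]

a_k = [log_joint(l, p) for l, p in zip(likelihoods, priors)]
a = a_k[0] - a_k[1]          # log odds for the two-class case
post_sigmoid = sigmoid(a)    # p(C_1|x) via the logistic sigmoid
post_softmax = softmax(a_k)  # [p(C_1|x), p(C_2|x)] via the softmax
```

For two classes the first softmax output coincides with the sigmoid of the log odds, which is why the softmax is described as the multi-class generalisation of the sigmoid.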
Probabilistic Generative Models: Gaussian Class Conditionals for 2 Classes
- Assume all classes share the same covariance matrix $\Sigma$:
  $p(x|C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}$
- The posterior is then
  $p(C_1|x) = \sigma(w^T x + w_0)$
  with
  $w = \Sigma^{-1} (\mu_1 - \mu_2)$
  $w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}$
- Note:
  - The quadratic terms in x from the exponents cancel.
  - The resulting decision boundary is linear in input space.
  - The priors only shift the decision boundary, i.e. they move it parallel to itself.

Probabilistic Generative Models: Gaussian Class Conditionals for K Classes
  $a_k(x) = w_k^T x + w_{k0}$
  $w_k = \Sigma^{-1} \mu_k$
  $w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)$
- When the covariance matrix is shared, the decision boundaries are linear.
- When each class-conditional density has its own covariance matrix, the $a_k$ become quadratic functions of x, giving rise to a quadratic discriminant.

(C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Probabilistic Generative Models: Maximum Likelihood Solution
- Two classes
- Given data set: $\{x_n, t_n\}$, $n = 1, \ldots, N$, with $t_n = 1$ denoting $C_1$ and $t_n = 0$ denoting $C_2$.
- Question: find $P(C_1) = \pi$, $P(C_2) = 1 - \pi$, and the parameters of $p(x|C_k)$: $\mu_1$, $\mu_2$ and $\Sigma$.
- With $P(C_1) = \pi$ and $P(C_2) = 1 - \pi$, the likelihood is
  $p(t | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} [\pi N(x_n | \mu_1, \Sigma)]^{t_n} [(1 - \pi) N(x_n | \mu_2, \Sigma)]^{1 - t_n}$

Probabilistic Generative Models: Maximize the Log Likelihood w.r.t. $\pi$, $\mu_1$, $\mu_2$, $\Sigma$
  $\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$
  $\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n$
  $\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n$
  $\Sigma = S = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2$
  $S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T$

Probabilistic Generative Models: Discrete Features
- Discrete feature values $x_i \in \{0, 1\}$.
- With D inputs, a general distribution corresponds to a table whose size grows exponentially with the number of features, to $2^D$ entries.
- Naïve Bayes assumption: the features are independent when conditioned on the class $C_k$:
  $p(x|C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$
  $a_k(x) = \ln p(x|C_k) p(C_k) = \sum_{i=1}^{D} \{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \} + \ln p(C_k)$
- The $a_k$ are linear in the features, as in the continuous (Gaussian) case.

Bayes Decision Boundaries: 2D
[Figure: Pattern Classification, Duda et al., p. 42]

Bayes Decision Boundaries: 3D
[Figure: Pattern Classification, Duda et al., p. 43]

For both Gaussian-distributed and discrete inputs
- The posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.

Probabilistic Generative Models: Exponential Family
- Recall: the Bernoulli, binomial, multinomial and Gaussian distributions can all be expressed in the general form
  $p(x | \lambda_k) = h(x) g(\lambda_k) \exp\{ \lambda_k^T u(x) \}$
- For two classes the posterior is again $p(C_1|x) = \sigma(a)$.

Probabilistic Generative Models: Exponential Family
- 2 classes: logistic function
  - Restrict attention to the subclass for which $u(x) = x$.
  - For some scaling parameter s:
    $p(x | \lambda_k, s) = \frac{1}{s} h\left(\frac{1}{s} x\right) g(\lambda_k) \exp\left\{ \frac{1}{s} \lambda_k^T x \right\}$
  - Absorbing the factor 1/s into $\lambda_k$, the log odds are linear in x:
    $a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)$
- K classes: softmax function, again linear with respect to x:
  $a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)$
  where $p(C_k|x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$.
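
To tie the generative results above together, here is a minimal sketch for one-dimensional inputs: it fits the prior, the two class means and the shared variance by maximum likelihood, then evaluates the posterior both as a sigmoid of a linear function of x and directly from Bayes' theorem. The data values and all function names are illustrative assumptions, and the scalar variance `var` stands in for the covariance matrix.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def fit_shared_variance(xs, ts):
    """ML estimates for two 1-D Gaussian classes with one shared variance.

    ts[n] = 1 marks class C1 and ts[n] = 0 marks class C2.
    """
    n = len(xs)
    n1 = sum(ts)
    n2 = n - n1
    pi = n1 / n                                              # prior p(C1)
    mu1 = sum(t * x for x, t in zip(xs, ts)) / n1
    mu2 = sum((1 - t) * x for x, t in zip(xs, ts)) / n2
    s1 = sum(t * (x - mu1) ** 2 for x, t in zip(xs, ts)) / n1
    s2 = sum((1 - t) * (x - mu2) ** 2 for x, t in zip(xs, ts)) / n2
    var = (n1 / n) * s1 + (n2 / n) * s2                      # weighted average of S1, S2
    return pi, mu1, mu2, var

def linear_posterior(x, pi, mu1, mu2, var):
    """p(C1|x) = sigmoid(w x + w0) with the shared-variance weights."""
    w = (mu1 - mu2) / var
    w0 = -mu1 ** 2 / (2 * var) + mu2 ** 2 / (2 * var) + math.log(pi / (1 - pi))
    return sigmoid(w * x + w0)

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bayes_posterior(x, pi, mu1, mu2, var):
    """p(C1|x) computed directly from Bayes' theorem, for comparison."""
    num = gaussian_pdf(x, mu1, var) * pi
    return num / (num + gaussian_pdf(x, mu2, var) * (1 - pi))

# Hypothetical training set: four points from C1, six from C2.
xs = [2.1, 1.8, 2.5, 3.0, 0.2, -0.4, 0.9, 0.5, 0.0, 1.1]
ts = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

params = fit_shared_variance(xs, ts)
p_linear = linear_posterior(1.4, *params)
p_bayes = bayes_posterior(1.4, *params)
```

The two posteriors agree up to rounding because the quadratic terms in x cancel when the variance is shared, leaving a linear argument for the sigmoid.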
Probabilistic Discriminative Models
- Goal: find $p(C_k|x)$ directly.
- No inference step.
- Discriminative training: maximize the likelihood defined through $p(C_k|x)$.
- Improves prediction performance when $p(x|C_k)$ is poorly estimated.

Fixed Basis Functions
- Assume a fixed nonlinear transformation $\phi(x)$.
- Transform the inputs using a vector of basis functions.
- The resulting decision boundaries are linear in the feature space: $y(x) = W^T \phi(x)$.

Logistic Regression
- Posterior probability of a class for the two-class problem:
  $p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi)$
- The number of adjustable parameters (M-dimensional feature space, 2 classes):
  - Two Gaussian class-conditional densities (generative model)
    - 2M parameters for the means
    - M(M+1)/2 parameters for the (shared) covariance matrix
    - Grows quadratically with M
  - Logistic regression (discriminative model)
    - M parameters for w
    - Grows linearly with M

Determining the Parameters Using Maximum Likelihood
- Likelihood function:
  $p(t | w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$, where $y_n = \sigma(a_n)$ and $a_n = w^T \phi_n$
- Taking the negative log likelihood gives the cross-entropy error function:
  $E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$
- Recall: the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities if the coding scheme is based on a given probability distribution q rather than the "true" distribution p.

The Gradient of the Error Function w.r.t. w
  $\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \phi_n$
- This has the same form as in linear regression: the error (prediction minus target value) times the basis function vector.

Iterative Reweighted Least Squares
- Recall the linear regression models of ch. 3: the maximum likelihood solution under the assumption of Gaussian noise leads to a closed-form solution, as a consequence of the quadratic dependence of the log likelihood on the parameter w.
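- Logistic regression model
  - There is no longer a closed-form solution.
  - But the error function is convex and has a unique minimum, so an efficient iterative technique can be used.
- The Newton-Raphson update to minimize a function E(w):
  $w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$
  where H is the Hessian matrix, whose elements are the second derivatives of E(w).

Iterative Reweighted Least Squares (Cont'd)
- Case 1: sum-of-squares error function
  $\nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n) \phi_n = \Phi^T \Phi w - \Phi^T t$
  $H = \Phi^T \Phi$
  - Newton-Raphson update: $w^{(new)} = (\Phi^T \Phi)^{-1} \Phi^T t$, the standard least-squares solution, reached in one step.
- Case 2: cross-entropy error function
  $\nabla E(w) = \Phi^T (y - t)$
  $H = \Phi^T R \Phi$, where R is diagonal with $R_{nn} = y_n (1 - y_n)$
  - Newton-Raphson update (iterative reweighted least squares):
    $w^{(new)} = (\Phi^T R \Phi)^{-1} \Phi^T R z$, with $z = \Phi w^{(old)} - R^{-1} (y - t)$
  - Because R depends on w, the update must be applied iteratively.

Multiclass Logistic Regression
- Posterior probability for multiclass classification:
  $p(C_k | \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$, with $a_k = w_k^T \phi$
- We can use maximum likelihood to determine the parameters directly.
- Likelihood function using the 1-of-K coding scheme:
  $p(T | w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}$
- Cross-entropy error function for multiclass classification:
  $E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$

Multiclass Logistic Regression (Cont'd)
- The derivative of the error function:
  $\nabla_{w_j} E = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \phi_n$
  - Same form as before: the error times the basis function vector.
- The Hessian matrix:
  $\nabla_{w_k} \nabla_{w_j} E = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \phi_n \phi_n^T$
- The IRLS algorithm can also be used here as a batch method.

Generalized Linear Models
- Recall: for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables.
- However, this is not the case for all choices of class-conditional density.
- It can be worth exploring other types of discriminative probabilistic model.

Generalized Linear Model: 2 Classes
- For example: $p(t = 1 | a) = f(a)$, with $a = w^T \phi$ and activation function $f(\cdot)$.
- For each input, we evaluate $a_n = w^T \phi_n$ and compare it with a threshold $\theta$: $t_n = 1$ if $a_n \ge \theta$, and $t_n = 0$ otherwise.

Noisy Threshold Model
- The corresponding activation function when $\theta$ is drawn from a density $p(\theta)$:
  $f(a) = \int_{-\infty}^{a} p(\theta) \, d\theta$
- Example: $p(\theta)$ given by a mixture of Gaussians.

Probit Function
- When $p(\theta)$ is a zero-mean, unit-variance Gaussian:
  $\Phi(a) = \int_{-\infty}^{a} N(\theta | 0, 1) \, d\theta$
- Sigmoidal shape.
- The generalized linear model based on a probit activation function is known as probit regression.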
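
The Newton-Raphson (IRLS) update for the cross-entropy error described earlier in this section can be sketched as follows for a model with a bias and a single input, $a_n = w_0 + w_1 x_n$. This is a minimal illustration, assuming a small hypothetical data set that is not linearly separable (so a finite minimum exists). The function names and the explicit 2x2 Hessian inverse are choices made for this sketch only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(w, xs, ts):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }."""
    e = 0.0
    for x, t in zip(xs, ts):
        y = sigmoid(w[0] + w[1] * x)
        e -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return e

def gradient(w, xs, ts):
    """Gradient Phi^T (y - t) for the basis vector phi_n = (1, x_n)."""
    g0 = g1 = 0.0
    for x, t in zip(xs, ts):
        y = sigmoid(w[0] + w[1] * x)
        g0 += y - t
        g1 += (y - t) * x
    return [g0, g1]

def newton_step(w, xs, ts):
    """One update w_new = w_old - H^{-1} grad E, with H = Phi^T R Phi."""
    g = gradient(w, xs, ts)
    h00 = h01 = h11 = 0.0
    for x in xs:
        y = sigmoid(w[0] + w[1] * x)
        r = y * (1 - y)          # diagonal element R_nn
        h00 += r
        h01 += r * x
        h11 += r * x * x
    det = h00 * h11 - h01 * h01  # determinant of the 2x2 Hessian
    d0 = (h11 * g[0] - h01 * g[1]) / det
    d1 = (h00 * g[1] - h01 * g[0]) / det
    return [w[0] - d0, w[1] - d1]

# Hypothetical, non-separable training data.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 0.3, -0.3]
ts = [0, 0, 0, 1, 1, 1, 0, 1]

w_hat = [0.0, 0.0]
for _ in range(20):
    w_hat = newton_step(w_hat, xs, ts)

grad_at_solution = gradient(w_hat, xs, ts)
```

Each step re-evaluates the weights $R_{nn} = y_n (1 - y_n)$ at the current w, which is what makes the least-squares problem "reweighted" at every iteration.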
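Canonical Link Functions
- Recall: the derivative of the error function w.r.t. the parameter w takes the form of the error times the feature vector.
  - Logistic regression model with sigmoid activation function: $\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \phi_n$
  - Logistic regression model with softmax activation function: $\nabla_{w_j} E = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \phi_n$
- This is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function.

Canonical Link Functions (Cont'd)
- Consider conditional distributions of the target variable from the exponential family:
  $p(t | \eta, s) = \frac{1}{s} h\left(\frac{t}{s}\right) g(\eta) \exp\left\{ \frac{\eta t}{s} \right\}$
- Log likelihood:
  $\ln p(t | \eta, s) = \sum_{n=1}^{N} \ln p(t_n | \eta_n, s) = \sum_{n=1}^{N} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} + \text{const}$
- The derivative of the log likelihood:
  $\nabla_w \ln p(t | \eta, s) = \sum_{n=1}^{N} \frac{1}{s} \{ t_n - y_n \} \psi'(y_n) f'(a_n) \phi_n$
  where $a_n = w^T \phi_n$, $y_n = f(a_n)$ and $\eta_n = \psi(y_n)$.
- With the canonical link function $f^{-1}(y) = \psi(y)$, the gradient of the error function $E(w) = -\ln p(t | \eta, s)$ reduces to
  $\nabla E(w) = \frac{1}{s} \sum_{n=1}^{N} \{ y_n - t_n \} \phi_n$

The Laplace Approximation
- Goal: find a Gaussian approximation to a non-Gaussian density, centered on the mode $z_0$ of the distribution.
- Suppose $p(z) = \frac{1}{Z} f(z)$ is non-Gaussian.
- Taylor expansion, around the mode $z_0$, of the logarithm of the target function:
  $\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$, with $A = -\frac{d^2}{dz^2} \ln f(z) \Big|_{z = z_0}$
- Resulting Gaussian approximation:
  $q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$

Laplace Approximation for $p(z) \propto \exp(-z^2/2) \, \sigma(20z + 4)$
[Figure. Left: the normalized distribution p(z) in yellow, together with the Laplace approximation centred on the mode $z_0$ of p(z) in red. Right: the negative logarithms of the corresponding curves.]

Model Comparison and BIC
- Laplace approximation to the normalization constant Z, for an M-dimensional z:
  $Z = \int f(z) \, dz \simeq f(z_0) \frac{(2\pi)^{M/2}}{|A|^{1/2}}$
- This result can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison.
- Consider a set of models $\{M_i\}$ having parameters $\{\theta_i\}$.
- The log of the model evidence can be approximated as
  $\ln p(D) \simeq \ln p(D | \theta_{MAP}) + \ln p(\theta_{MAP}) + \frac{M}{2} \ln 2\pi - \frac{1}{2} \ln |A|$
  where $A = -\nabla \nabla \ln p(D | \theta_{MAP}) p(\theta_{MAP})$.
- A further approximation, assuming a broad Gaussian prior and a full-rank Hessian, gives the Bayesian Information Criterion (BIC):
  $\ln p(D) \simeq \ln p(D | \theta_{MAP}) - \frac{1}{2} M \ln N$

Bayesian Logistic Regression
- Exact Bayesian inference is intractable.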
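- Gaussian prior: $p(w) = N(w | m_0, S_0)$
- Posterior: $p(w | t) \propto p(w) \, p(t | w)$
- Log of the posterior:
  $\ln p(w | t) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \} + \text{const}$
- Laplace approximation of the posterior distribution:
  $q(w) = N(w | w_{MAP}, S_N)$, with $S_N^{-1} = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n) \phi_n \phi_n^T$

Predictive Distribution
- Obtained by marginalizing w.r.t. the posterior distribution $p(w | t)$, which is approximated by the Gaussian $q(w)$:
  $p(C_1 | \phi, t) = \int p(C_1 | \phi, w) \, p(w | t) \, dw \simeq \int \sigma(w^T \phi) \, q(w) \, dw = \int \sigma(a) \, p(a) \, da$
  where $a = w^T \phi$.
- The distribution of a is a marginal of a Gaussian and is therefore also Gaussian:
  $p(a) = N(a | \mu_a, \sigma_a^2)$, with $\mu_a = w_{MAP}^T \phi$ and $\sigma_a^2 = \phi^T S_N \phi$

Predictive Distribution (Cont'd)
- Resulting approximation to the predictive distribution:
  $p(C_1 | \phi, t) \simeq \int \sigma(a) \, N(a | \mu_a, \sigma_a^2) \, da$
- To integrate over a, we make use of the close similarity between the logistic sigmoid function and the probit function: $\sigma(a) \simeq \Phi(\lambda a)$ with $\lambda^2 = \pi / 8$.
- Then
  $\int \sigma(a) \, N(a | \mu, \sigma^2) \, da \simeq \sigma(\kappa(\sigma^2) \mu)$, where $\kappa(\sigma^2) = (1 + \pi \sigma^2 / 8)^{-1/2}$
- Finally we get
  $p(C_1 | \phi, t) \simeq \sigma(\kappa(\sigma_a^2) \mu_a)$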
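
The Bayesian logistic regression steps above can be sketched for a single scalar weight w and scalar feature $\phi$, so that $S_0$ and $S_N$ reduce to scalar variances. The sketch finds $w_{MAP}$ by Newton-Raphson on the negative log posterior, forms the Laplace variance $S_N$, and evaluates the approximate predictive probability $\sigma(\kappa(\sigma_a^2) \mu_a)$. The data, the prior variance `s0 = 10.0` and all function names are assumptions made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_neg_log_posterior(w, phis, ts, m0, s0):
    """d/dw of -ln p(w|t) for the prior N(w|m0, s0) and a sigmoid likelihood."""
    return (w - m0) / s0 + sum((sigmoid(w * p) - t) * p for p, t in zip(phis, ts))

def posterior_precision(w, phis, s0):
    """1/S_N = 1/S_0 + sum_n y_n (1 - y_n) phi_n^2, evaluated at w."""
    total = 1.0 / s0
    for p in phis:
        y = sigmoid(w * p)
        total += y * (1 - y) * p * p
    return total

def fit_laplace(phis, ts, m0=0.0, s0=10.0, iters=50):
    """Return (w_map, s_n): the posterior mode and the Laplace variance."""
    w = m0
    for _ in range(iters):
        # Newton-Raphson step on the negative log posterior
        w -= grad_neg_log_posterior(w, phis, ts, m0, s0) / posterior_precision(w, phis, s0)
    return w, 1.0 / posterior_precision(w, phis, s0)

def predictive(phi, w_map, s_n):
    """Approximate p(C1|phi, t) = sigmoid(kappa(var_a) * mu_a)."""
    mu_a = w_map * phi
    var_a = phi * phi * s_n
    kappa = 1.0 / math.sqrt(1.0 + math.pi * var_a / 8.0)
    return sigmoid(kappa * mu_a)

# Hypothetical, non-separable training data (scalar features, binary targets).
phis = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 0.3, -0.3]
ts = [0, 0, 0, 1, 1, 1, 0, 1]

w_map, s_n = fit_laplace(phis, ts)
phi_new = 1.5
p_bayes = predictive(phi_new, w_map, s_n)
p_plugin = sigmoid(w_map * phi_new)  # point estimate that ignores the uncertainty in w
```

Because $\kappa < 1$ whenever $\sigma_a^2 > 0$, the Bayesian prediction is pulled toward 0.5 relative to the plug-in value `p_plugin`, reflecting the remaining uncertainty in w.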
