ECE 8527: Introduction to Machine Learning and Pattern Recognition

LECTURE 20: LINEAR DISCRIMINANT FUNCTIONS

• Objectives: Linear Discriminant Functions; Gradient Descent; Nonseparable Data
• Resources: SM: Gradient Descent; JD: Optimization; Wiki: Stochastic Gradient Descent; MJ: Linear Programming

Discriminant Functions
• Recall our discriminant function for minimum error rate classification:
  $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$
• For a multivariate normal distribution:
  $g_i(x) = -\tfrac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \tfrac{d}{2}\ln(2\pi) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
• Consider the case $\Sigma_i = \sigma^2 I$ (statistical independence, equal variance, class-independent variance):
  $\Sigma_i = \mathrm{diag}(\sigma^2, \ldots, \sigma^2)$, so $|\Sigma_i| = \sigma^{2d}$ and $\Sigma_i^{-1} = (1/\sigma^2) I$, which is independent of $i$.

Gaussian Classifiers
• The discriminant function can be reduced to:
  $g_i(x) = -\tfrac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \tfrac{d}{2}\ln(2\pi) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
• Since the terms $\tfrac{d}{2}\ln(2\pi)$ and $\tfrac{1}{2}\ln|\Sigma_i|$ are constant w.r.t. the maximization:
  $g_i(x) = -\tfrac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) + \ln P(\omega_i) = -\dfrac{\|x-\mu_i\|^2}{2\sigma^2} + \ln P(\omega_i)$
• We can expand this:
  $g_i(x) = -\dfrac{1}{2\sigma^2}\left(x^t x - 2\mu_i^t x + \mu_i^t \mu_i\right) + \ln P(\omega_i)$
• The term $x^t x$ is a constant w.r.t. $i$, and $\mu_i^t \mu_i$ is a constant that can be precomputed.

Linear Machines
• We can use an equivalent linear discriminant function:
  $g_i(x) = w_i^t x + w_{i0}$, where $w_i = \dfrac{1}{\sigma^2}\mu_i$ and $w_{i0} = -\dfrac{1}{2\sigma^2}\mu_i^t \mu_i + \ln P(\omega_i)$
• $w_{i0}$ is called the threshold or bias for the ith category.
• A classifier that uses linear discriminant functions is called a linear machine.
• The decision surfaces are defined by the equation $g_i(x) - g_j(x) = 0$:
  $-\dfrac{\|x-\mu_i\|^2}{2\sigma^2} + \ln P(\omega_i) + \dfrac{\|x-\mu_j\|^2}{2\sigma^2} - \ln P(\omega_j) = 0$,
  or equivalently $\|x-\mu_i\|^2 - 2\sigma^2 \ln P(\omega_i) = \|x-\mu_j\|^2 - 2\sigma^2 \ln P(\omega_j)$.

Linear Discriminant Functions
• A discriminant function that is a linear combination of the components of x can be written as:
  $g(x) = w^t x + w_0$
• In the general case, there are c discriminant functions for c classes.
• For the two-class case, a single discriminant $g(x) = g_1(x) - g_2(x)$ suffices: decide $\omega_1$ if $g(x) > 0$ and $\omega_2$ otherwise.

Generalized Linear Discriminant Functions
• Rewrite g(x) as: $g(x) = w_0 + \sum_{i=1}^{d} w_i x_i$
• Add quadratic terms: $g(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j$
• Generalize to a functional form: $g(x) = \sum_{i=1}^{\hat{d}} a_i y_i(x) = a^t y$, where the $y_i(x)$ are arbitrary functions of x.
• For example, a quadratic discriminant in one dimension, $g(x) = a_1 + a_2 x + a_3 x^2$, corresponds to the mapping $y = (1, x, x^2)^t$.

A Gradient Descent Solution
• Define a cost function, J(a), and choose the weight vector a that minimizes it.
• Gradient descent: $a(k+1) = a(k) - \eta(k)\,\nabla J(a(k))$
• Approximate J(a) with a second-order Taylor series about a(k):
  $J(a) \approx J(a(k)) + \nabla J^t\,(a - a(k)) + \tfrac{1}{2}(a - a(k))^t H\,(a - a(k))$, where H is the Hessian matrix.
• The optimal learning rate is: $\eta(k) = \dfrac{\|\nabla J\|^2}{\nabla J^t H\,\nabla J}$

Additional Gradient Descent Approaches
• Newton Descent: $a(k+1) = a(k) - H^{-1}\nabla J$
• Relaxation Procedure: a margin-based, squared-error variant of the perceptron update.
• Perceptron Criterion: $J_p(a) = \sum_{y \in \mathcal{Y}} (-a^t y)$, where $\mathcal{Y}$ is the set of misclassified samples.

The Ho-Kashyap Procedure
• The previous algorithms do not converge if the data is nonseparable.
• If the data is linearly separable, we can define a cost function:
  $J_s(a, b) = \|Ya - b\|^2$
  If a and b are allowed to vary (with b > 0), the minimum value of $J_s$ is zero for separable data.
• Computing gradients with respect to a and b:
  $\nabla_a J_s = 2Y^t(Ya - b)$, $\quad \nabla_b J_s = -2(Ya - b)$
• Solving for a and b yields the Ho-Kashyap update rule:
  $b(k+1) = b(k) + 2\eta(k)\,e^{+}(k)$, $\quad a(k+1) = Y^{\dagger} b(k+1)$,
  where $e(k) = Ya(k) - b(k)$ and $e^{+}(k) = \tfrac{1}{2}\left(e(k) + |e(k)|\right)$ is its positive part.

Linear Programming
• A classical linear programming problem can be stated as: find a vector $u = (u_1, u_2, \ldots, u_m)$ that minimizes a linear objective function $z = \alpha^t u$ subject to a set of linear constraints.
• u is additionally constrained such that $u_i \ge 0$.
• The solution to such an optimization problem is not unique: a range of solutions lies in a convex polytope (an n-dimensional polyhedron).
• Solutions can be found in polynomial time, $O(n^k)$.
• Linear programming is useful for problems involving scheduling, asset allocation, or routing.
• Example: an airline has to assign crews to its flights:
  Make sure each flight is covered.
  Meet regulations, such as the number of hours flown each day.
  Minimize costs such as fuel, lodging, etc.
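As a concrete illustration of the linear machine derived for the case $\Sigma_i = \sigma^2 I$, the following sketch computes $w_i = \mu_i/\sigma^2$ and $w_{i0} = -\mu_i^t\mu_i/(2\sigma^2) + \ln P(\omega_i)$ and picks the class with the largest $g_i(x)$. The means, shared variance, and priors are illustrative values, not data from the lecture.

```python
# Minimal sketch of the linear machine for Sigma_i = sigma^2 * I.
# Class means, shared variance, and priors below are made-up examples.
import numpy as np

means  = np.array([[0.0, 0.0], [2.0, 2.0]])   # mu_i, one row per class
priors = np.array([0.7, 0.3])                 # P(omega_i)
sigma2 = 1.0                                  # shared variance sigma^2

# w_i = mu_i / sigma^2,   w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(omega_i)
W  = means / sigma2
w0 = -np.sum(means**2, axis=1) / (2.0 * sigma2) + np.log(priors)

def classify(x):
    """Return the index of the class with the largest g_i(x) = w_i^t x + w_i0."""
    g = W @ x + w0
    return int(np.argmax(g))

print(classify(np.array([1.5, 1.0])))  # picks the class with the larger discriminant
```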
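The gradient descent solution can be illustrated with the perceptron criterion $J_p(a) = \sum_{y \in \mathcal{Y}}(-a^t y)$, whose gradient is the negative sum of the misclassified (normalized, augmented) samples. The sketch below uses toy Gaussian data and a fixed learning rate, both of which are illustrative assumptions rather than values from the lecture.

```python
# Minimal sketch of batch gradient descent on the perceptron criterion.
# Samples from class 2 are negated so that correct classification means a^t y > 0.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal([2, 2], 0.5, size=(20, 2))     # class 1 samples (toy data)
x2 = rng.normal([-2, -2], 0.5, size=(20, 2))   # class 2 samples (toy data)

# Augment with a constant 1 (bias term) and negate class-2 samples.
y1 = np.hstack([np.ones((20, 1)), x1])
y2 = -np.hstack([np.ones((20, 1)), x2])
Y  = np.vstack([y1, y2])

a   = np.zeros(3)   # weight vector a = (w0, w1, w2)
eta = 0.1           # fixed learning rate (illustrative choice)

for k in range(1000):
    mis = Y[Y @ a <= 0]          # misclassified (or zero-margin) samples
    if len(mis) == 0:            # separable data: J_p has reached zero
        break
    grad = -mis.sum(axis=0)      # gradient of J_p with respect to a
    a -= eta * grad              # a(k+1) = a(k) - eta * grad J_p

print(a, "misclassified:", int(np.sum(Y @ a <= 0)))
```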
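The Ho-Kashyap update rule can be sketched in the same way: alternate between the least-squares solution $a = Y^{\dagger} b$ and the margin update $b(k+1) = b(k) + 2\eta(k)\,e^{+}(k)$, which only increases components of b so it stays positive. The toy data and step size below are illustrative assumptions.

```python
# Minimal sketch of the Ho-Kashyap procedure:
#   a(k) = pinv(Y) b(k),  e = Y a - b,  b(k+1) = b(k) + 2*eta*e+,
# where e+ keeps only the positive part of the error.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal([1.5, 1.5], 0.5, size=(15, 2))
x2 = rng.normal([-1.5, -1.5], 0.5, size=(15, 2))
Y  = np.vstack([np.hstack([np.ones((15, 1)), x1]),
                -np.hstack([np.ones((15, 1)), x2])])   # normalized, augmented samples

b   = np.ones(Y.shape[0])     # margin vector, initialized positive
eta = 0.5                     # step size (illustrative choice, 0 < eta <= 1)
Ypinv = np.linalg.pinv(Y)     # pseudoinverse of Y

for k in range(1000):
    a = Ypinv @ b                     # least-squares solution for the current b
    e = Y @ a - b                     # error vector
    e_plus = 0.5 * (e + np.abs(e))    # positive part of the error
    if np.all(np.abs(e) < 1e-6):      # Ya = b > 0: a separating solution was found
        break
    b = b + 2 * eta * e_plus          # b only increases, so it stays positive

print("weights:", a, "min margin:", float((Y @ a).min()))
```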
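To make the linear programming formulation concrete, here is a small sketch in the standard form "minimize $\alpha^t u$ subject to linear inequality constraints and $u \ge 0$", solved with scipy.optimize.linprog. The objective and constraint coefficients are illustrative, not taken from the lecture; a crew-scheduling problem like the airline example would use the same machinery with selection variables and coverage constraints.

```python
# Minimal sketch of a linear program: minimize alpha^t u subject to
# linear inequality constraints and u >= 0, using scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

alpha = np.array([3.0, 2.0])          # objective: minimize 3*u1 + 2*u2
A_ub  = np.array([[-1.0, -1.0],       # u1 + u2 >= 4   (rewritten as -u1 - u2 <= -4)
                  [-2.0, -1.0]])      # 2*u1 + u2 >= 5 (rewritten as -2*u1 - u2 <= -5)
b_ub  = np.array([-4.0, -5.0])

res = linprog(alpha, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
print(res.x, res.fun)                 # optimal u and the minimum objective value
```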
Summary
• Machine learning in its most elementary form is a constrained optimization problem in which we find a weighting vector that minimizes a cost function.
• The solution is only as good as the cost function.
• There are many gradient descent type algorithms that operate using the first or second derivatives of a cost function.
• Convergence of these algorithms can be slow, so selecting a suitable convergence factor (learning rate) is critical.
• Nonseparable data poses additional challenges and makes the use of margin-based classifiers critical.