EE 6885 Statistical Pattern Recognition, Fall 2005
Prof. Shih-Fu Chang, http://www.ee.columbia.edu/~sfchang
Lecture 10 (10/12/05)

Reading
- Distance Metrics: DHS Chap. 4.6
- Linear Discriminant Functions: DHS Chap. 5.1-5.4

Midterm Exam
- Oct. 24th, 2005 (Monday), 1pm-2:30pm (90 mins)
- Open books/notes, no computer

k_n-Nearest-Neighbor
- For classification, estimate the posterior of each class $\omega_i$:
  $p_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$
- Performance bound of the 1-nearest-neighbor rule (Cover & Hart '67):
  $P^* \le \lim_{n \to \infty} P_n(e) \le P^* \left( 2 - \frac{c}{c-1} P^* \right)$
- Combine K-NN with clustering (K-Means, LVQ, GMM) to reduce complexity
- When K increases: what happens to complexity? Decision boundaries become smoother

Distance Metrics
- Nearest-neighbor rules need a distance metric
- Required properties of a metric:
  1. non-negativity: $D(a, b) \ge 0$
  2. reflexivity: $D(a, b) = 0$ iff $a = b$
  3. symmetry: $D(a, b) = D(b, a)$
  4. triangular inequality: $D(a, b) + D(b, c) \ge D(a, c)$, equivalently $D(a, b) \ge D(a, c) - D(b, c)$
- Minkowski metric: $L_k(a, b) = \left( \sum_{i=1}^{d} |a_i - b_i|^k \right)^{1/k}$
  (special cases: Euclidean $k = 2$, Manhattan $k = 1$, $L_\infty$ -- useful in indexing)
- Tanimoto metric, for sets of elements where point-to-point distance is not useful:
  $D_{Tanimoto}(S_1, S_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}} = \frac{(n_1 - n_{12}) + (n_2 - n_{12})}{n_1 + n_2 - n_{12}}$
  where $n_1$, $n_2$ are the sizes of $S_1$, $S_2$ and $n_{12}$ is the size of their intersection

Discriminant Functions Revisited
- Define a discriminant function $g_i(x)$ for each class $\omega_i$; map $x$ to class $\omega_i$ if $g_i(x) \ge g_j(x)\ \forall j \ne i$
- e.g., $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$ (MAP classifier)
- Gaussian case: $P(x \mid \omega_i) = N(\mu_i, \Sigma_i)$,
  $P(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) \right)$
  $g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$
- Case I ($\Sigma_i = \Sigma$): $g_i(x) = w_i^t x + w_{i0}$, a hyperplane with bias $w_{i0}$
- Case II ($\Sigma_i$ arbitrary): $g_i(x) = x^t W_i x + w_i^t x + w_{i0}$
- Decision boundaries may be hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids

Discriminant Functions (Chap. 5)
- Directly define the discriminant functions and optimize them; do not assume parametric forms for $P(x \mid \omega_i)$
- Easy to derive useful classifiers
- Linear functions: $g(x) = w^t x + w_0$, where $w$ is the weight vector and $w_0$ the bias
- Two-category case:
  - map $x$ to class $\omega_1$ if $g(x) > 0$, otherwise to class $\omega_2$
  - decision surface $H$: $g(x) = 0$
  - write $x = x_p + r \frac{w}{\|w\|}$, where $x_p$ is the projection of $x$ onto $H$ (so $g(x_p) = 0$) and $r$ is the distance from $x$ to $H$
  - $g(x) = g\left( x_p + r \frac{w}{\|w\|} \right) = r \frac{w^t w}{\|w\|} = r \|w\| \;\Rightarrow\; r = \frac{g(x)}{\|w\|}$

Multi-category Case
- $c$ categories $\omega_1, \omega_2, \ldots, \omega_c$: how many classifiers are needed?
- Approaches:
  - use a two-class discriminant per class ("does $x$ belong to class $\omega_i$ or not?")
  - use a two-class discriminant for each pair of classes $\Rightarrow$ need $c(c-1)/2$ discriminant functions
  - general approach: one function for each class, $g_i(x) = w_i^t x + w_{i0}$; map $x$ to class $\omega_i$ if $g_i(x) \ge g_j(x)\ \forall j \ne i$;
    decision boundary $H_{ij}$: $g_i(x) = g_j(x)$, i.e., $(w_i - w_j)^t x + (w_{i0} - w_{j0}) = 0$
- Each decision region is convex and singly connected, which is good for unimodal distributions (a small numerical sketch of this decision rule follows below)
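To make the linear decision rule concrete, here is a minimal NumPy sketch (the function names and toy numbers are illustrative assumptions, not from the lecture): it evaluates $g_i(x) = w_i^t x + w_{i0}$ for every class, assigns $x$ to the class with the largest value, and computes the signed distance $r = g(x)/\|w\|$ to a pairwise boundary.

```python
import numpy as np

def classify_linear(x, W, w0):
    """W: (c, d) array of weight vectors w_i, w0: (c,) array of biases w_i0."""
    g = W @ x + w0               # g_i(x) = w_i^t x + w_i0 for every class i
    return int(np.argmax(g))     # map x to the class with the largest discriminant

def signed_distance(x, w, w0):
    """Two-category case: signed distance r = g(x) / ||w|| from x to H: g(x) = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

# toy usage with made-up numbers: three classes in 2-D
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 0.2])
x = np.array([0.8, 0.3])
print(classify_linear(x, W, w0))                        # index of the winning class
print(signed_distance(x, W[0] - W[1], w0[0] - w0[1]))   # signed distance to H_12
```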
Method for Searching Decision Boundaries
- $g_i(x) = w_i^t x + w_{i0}$ $\Rightarrow$ find the weight vector $w_i$ and the bias $w_{i0}$
- Augmented vectors:
  $y = \begin{bmatrix} 1 \\ x \end{bmatrix} = [1, x_1, \ldots, x_d]^t$, $\quad a_i = \begin{bmatrix} w_{i0} \\ w_i \end{bmatrix} = [w_{i0}, w_{i1}, \ldots, w_{id}]^t$ $\;\Rightarrow\; g_i(y) = a_i^t y$
- Decision boundary: $H_{ij}: (w_i - w_j)^t x + (w_{i0} - w_{j0}) = 0 \;\Rightarrow\; H_{ij}: (a_i - a_j)^t y = 0$
- Two-category case: $H: w^t x + w_0 = 0 \;\Rightarrow\; H: a^t y = 0$, a hyperplane in the augmented $y$ space with normal vector $a$ (it passes through the origin)

Search Method for Linear Discriminant
- All sample points reside in the $y_1 = 1$ subspace
- Distance from $x$ to the boundary in $x$ space: $r = \frac{g(x)}{\|w\|}$; in the augmented $y$ space: $r' = \frac{a^t y}{\|a\|} \le r$, and $r'$ has the same sign as $r$, so bounds on $r'$ give bounds on $r$
- Design objective: find $a$ that correctly classifies every sample:
  for all $y_i$ in class $\omega_1$, $a^t y_i > 0$; for all $y_i$ in class $\omega_2$, $a^t y_i < 0$
- Normalization: for all $y_i$ in class $\omega_2$, replace $y_i \leftarrow -y_i$
- New design objective: for all $y_i$ (in class $\omega_1$ or $\omega_2$), $a^t y_i > 0$
- Solution region: the intersection of the positive sides of all the hyperplanes

Searching Linear Discriminant Solutions
- Solution region with margin: for all $y_i$ in class $\omega_1$ or $\omega_2$, $a^t y_i > b$
- Search approaches: gradient descent methods to find a solution in the solution region; maximize the margin

Gradient Descent (GD)
- Choose a criterion function $J(a)$ that is minimized when $a$ is in the solution region
- Examples (with $\mathcal{Y}$ the set of misclassified samples):
  - number of misclassified samples: $J(a) = |\mathcal{Y}|$
  - sum of distances from the misclassified samples to $H$ (perceptron criterion): $J_p(a) = \sum_{y \in \mathcal{Y}} (-a^t y)$
  - quadratic error: $J_q(a) = \sum_{y \in \mathcal{Y}} (a^t y)^2$
  - quadratic error with margin: $J(a) = \frac{1}{2} \sum_{y \in \mathcal{Y}} \frac{(a^t y - b)^2}{\|y\|^2}$, where $\mathcal{Y} = \{ y \mid a^t y < b \}$
- Update rule: repeat $a(k+1) = a(k) - \eta(k) \nabla J(a(k))$, where $\eta(k)$ is the learning rate, until the misclassification set $\mathcal{Y}$ is empty or another stopping criterion is met

Different Criterion Functions
- [Figure: plots of the four criterion functions over weight space. Notes: GD is not applicable to the counting criterion; the perceptron criterion is not differentiable everywhere; with the quadratic criterion, solutions may be trapped on the boundary of the solution region; the margin version moves solutions away from the boundaries.]

GD Based on the Perceptron Criterion
- $J_p(a) = \sum_{y \in \mathcal{Y}} (-a^t y)$, where $\mathcal{Y}$ is the set of misclassified samples, so $\nabla J_p(a) = \sum_{y \in \mathcal{Y}} (-y)$
- Batch perceptron update:
  - initialize $a(1)$, choose the rate $\eta(\cdot)$ and the stopping threshold $\theta$
  - loop: $a(k+1) = a(k) + \eta(k) \sum_{y \in \mathcal{Y}} y$, until $\left\| \eta(k) \sum_{y \in \mathcal{Y}} y \right\| < \theta$
- Example: $a(1) = 0$, $\eta(k) = 1$ -- at each step, add the sum of the misclassified samples
- Theorem: if the samples are linearly separable, a solution can always be found within a finite number of steps
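A minimal NumPy sketch of the batch perceptron loop above (the function name, stopping details, and toy data are assumptions for illustration): it augments each sample with a leading 1, negates the class-$\omega_2$ samples, and repeatedly adds $\eta(k)$ times the sum of the misclassified samples to $a$ until none remain or the update falls below $\theta$.

```python
import numpy as np

def batch_perceptron(X, labels, eta=1.0, theta=1e-6, max_iter=1000):
    """X: (n, d) samples; labels: +1 for class w1, -1 for class w2."""
    # augment: y = [1, x]^t, then "normalize" by negating the class-w2 samples
    Y = np.hstack([np.ones((X.shape[0], 1)), X]) * labels[:, None]
    a = np.zeros(Y.shape[1])                    # a(1) = 0, as in the example
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]           # samples with a^t y <= 0
        if len(misclassified) == 0:             # all samples on the positive side
            break
        update = eta * misclassified.sum(axis=0)   # eta(k) * sum of misclassified y
        a = a + update                          # a(k+1) = a(k) + eta(k) * sum
        if np.linalg.norm(update) < theta:      # stop criterion ||update|| < theta
            break
    return a                                    # augmented weight vector [w0, w]

# toy usage: two linearly separable classes in 2-D (made-up data)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
labels = np.array([1, 1, -1, -1])
a = batch_perceptron(X, labels)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ a))   # recovers the labels
```

For separable data the loop terminates with an empty misclassification set, consistent with the finite-step convergence theorem stated above.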