EE 6885 Statistical Pattern Recognition, Fall 2005
Prof. Shih-Fu Chang
http://www.ee.columbia.edu/~sfchang
Lecture 11 (10/17/05)

Reading:
- Distance Metrics: DHS Chap. 4.6
- Linear Discriminant Functions: DHS Chap. 5.1-5.4

Midterm Exam: Oct. 24th 2005, Monday, 1pm-2:30pm (90 mins). Open books/notes, no computer.
Review Class: Oct. 21st, Friday, 4pm. Location TBA.

k_n-Nearest-Neighbor

For classification, estimate p_n(ω_i | x) for each class ω_i:
    p_n(ω_i | x) = p_n(x, ω_i) / Σ_{j=1..c} p_n(x, ω_j) = k_i / k

Performance bound of the 1-nearest-neighbor rule (Cover & Hart '67):
    P* ≤ lim_{n→∞} P_n(e) ≤ P* (2 − (c / (c − 1)) P*)

Combine k-NN with clustering (K-Means, LVQ, GMM) to reduce complexity.
When k increases: complexity? Smoother decision boundaries.

Distance Metrics

Nearest-neighbor rules need distance metrics. Required properties of a metric:
1. non-negativity: D(a, b) ≥ 0
2. reflexivity: D(a, b) = 0 iff a = b
3. symmetry: D(a, b) = D(b, a)
4. triangle inequality: D(a, b) + D(b, c) ≥ D(a, c), equivalently D(a, b) ≥ D(a, c) − D(b, c)

Minkowski metric:
    L_k(a, b) = ( Σ_{i=1..d} |a_i − b_i|^k )^{1/k}
    k = 2: Euclidean; k = 1: Manhattan; L_∞? Useful in indexing.

Tanimoto metric, for sets of elements (n_1, n_2: sizes of S_1, S_2; n_12: number of shared elements), where a point-to-point distance is not useful:
    D_Tanimoto(S_1, S_2) = (n_1 + n_2 − 2 n_12) / (n_1 + n_2 − n_12)
                         = ((n_1 − n_12) + (n_2 − n_12)) / (n_1 + n_2 − n_12)

Discriminant Functions Revisited

Define a discriminant function g_i(x) for each class ω_i; map x to class ω_i if g_i(x) ≥ g_j(x) ∀ j ≠ i.
E.g., g_i(x) = ln P(x | ω_i) + ln P(ω_i)  (MAP classifier)

Gaussian case: P(x | ω_i) = N(μ_i, Σ_i)
    P(x | ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp( −(1/2) (x − μ_i)^t Σ_i^{−1} (x − μ_i) )
    g_i(x) = −(1/2) (x − μ_i)^t Σ_i^{−1} (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ω_i)

Case I: Σ_i = Σ:
    g_i(x) = w_i^t x + w_i0   (a hyperplane with bias w_i0)
Case II: Σ_i arbitrary:
    g_i(x) = x^t W_i x + w_i^t x + w_i0
    Decision boundaries may be hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.

Discriminant Functions (Chap. 5)

Directly define discriminant functions, without assuming parametric distribution forms for P(x | ω_i). Easy to derive useful classifiers.

Linear functions:
    g(x) = w^t x + w_0,   w: weight vector, w_0: bias

Two-category case: map x to class ω_1 if g(x) > 0, otherwise to class ω_2.
Decision surface H: g(x) = 0. Write
    x = x_p + r · w / ||w||
where x_p is the projection of x onto H (so g(x_p) = 0) and r is the signed distance from x to H. Then
    g(x) = g(x_p + r · w / ||w||) = r · w^t w / ||w|| = r ||w||   ⇒   r = g(x) / ||w||
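To make the k_n-nearest-neighbor posterior estimate p_n(ω_i | x) = k_i / k and the Minkowski metric concrete, here is a minimal Python sketch (not part of the original slides); the function names, parameter choices, and toy Gaussian data are illustrative assumptions. Note that k denotes the neighbor count here, so the Minkowski order (the slide's k) is passed as p.

```python
# Minimal sketch (assumed, not from the lecture) of the k-NN posterior estimate
# p_n(w_i | x) = k_i / k with a Minkowski distance.
import numpy as np

def minkowski_distance(a, b, p=2):
    """L_p(a, b) = (sum_i |a_i - b_i|^p)^(1/p); p=2 Euclidean, p=1 Manhattan."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def knn_posteriors(x, X_train, y_train, k=5, num_classes=2, p=2):
    """Estimate p_n(w_i | x) = k_i / k from the k nearest training samples."""
    dists = np.array([minkowski_distance(x, xi, p) for xi in X_train])
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    counts = np.bincount(y_train[nearest], minlength=num_classes)
    return counts / k                             # posterior estimates k_i / k

# Toy usage: two Gaussian blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
post = knn_posteriors(np.array([2.5, 2.0]), X, y, k=7)
print("p_n(w_i | x) =", post, "-> class", int(np.argmax(post)))
```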
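As a second illustration, this assumed sketch evaluates the two-category linear discriminant g(x) = w^t x + w_0, the signed distance r = g(x)/||w||, and the projection x_p of x onto the decision surface H; the particular boundary and test point are made up for the example.

```python
# Sketch (assumed) of the two-category linear discriminant geometry:
# g(x) = w^t x + w_0, signed distance r = g(x)/||w||, and projection
# x_p = x - r * w/||w|| onto the decision surface H: g(x) = 0.
import numpy as np

def linear_discriminant(x, w, w0):
    """g(x) = w^t x + w_0; g > 0 decides class w_1, g < 0 decides class w_2."""
    return np.dot(w, x) + w0

def signed_distance_and_projection(x, w, w0):
    """Return r = g(x)/||w|| and x_p, the foot of x on H."""
    g = linear_discriminant(x, w, w0)
    r = g / np.linalg.norm(w)
    x_p = x - r * w / np.linalg.norm(w)
    return r, x_p

# Example with an assumed boundary x1 + 2*x2 - 4 = 0.
w, w0 = np.array([1.0, 2.0]), -4.0
x = np.array([3.0, 3.0])
r, x_p = signed_distance_and_projection(x, w, w0)
print("g(x) =", linear_discriminant(x, w, w0))      # 5.0
print("r =", r)                                     # 5 / sqrt(5) ~ 2.236
print("g(x_p) =", linear_discriminant(x_p, w, w0))  # ~0, so x_p lies on H
```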
Multi-Category Case

c categories: ω_1, ω_2, …, ω_c. Number of classifiers needed?

Approaches:
- Use a two-class discriminant for each class ⇒ x belongs to class ω_i or not? (c classifiers)
- Use a two-class discriminant for each pair of classes ⇒ x belongs to class ω_i or ω_j? (c(c − 1)/2 classifiers)

General approach: one function per class,
    g_i(x) = w_i^t x + w_i0
map x to class ω_i if g_i(x) ≥ g_j(x) ∀ j ≠ i.
Decision boundary H_ij: g_i(x) = g_j(x), i.e.
    H_ij: (w_i − w_j)^t x + (w_i0 − w_j0) = 0
Each decision region is convex and singly connected, which is good for unimodal distributions.

Method for Searching Decision Boundaries

g_i(x) = w_i^t x + w_i0  ⇒  find the weight vector w_i and the bias w_i0.

Augmented vectors:
    y = [1, x]^t = [1, x_1, …, x_d]^t
    a_i = [w_i0, w_i]^t = [w_i0, w_i1, …, w_id]^t
    ⇒ g_i(x) = g_i(y) = a_i^t y
Decision boundary:
    H_ij: (w_i − w_j)^t x + (w_i0 − w_j0) = 0  ⇒  H_ij: (a_i − a_j)^t y = 0

Two-category case:
    H: w^t x + w_0 = 0  ⇒  H: a^t y = 0
a hyperplane in the augmented y space, passing through the origin, with normal vector a.

Search Method for Linear Discriminant

All sample points reside in the y_1 = 1 subspace.
Distance from x to the boundary in x space:  r = g(x) / ||w||
Distance from y = [1, x]^t to the boundary in y space:  r' = a^t y / ||a|| ≤ r
i.e., r' and r have the same sign, and r' is a lower bound for r (since ||a|| ≥ ||w||).

Design objective for finding a: find an a that correctly classifies every sample:
    ∀ y_i in class ω_1: a^t y_i > 0
    ∀ y_i in class ω_2: a^t y_i < 0

Normalization: ∀ y_i in class ω_2, replace y_i ← −y_i.
New design objective: ∀ y_i in class ω_1 or ω_2: a^t y_i > 0.
Solution region: the intersection of the positive sides of all hyperplanes.

Searching Linear Discriminant Solutions

Stricter criterion — solution region with margin:
    ∀ y_i in class ω_1 or ω_2: a^t y_i > b,  with b > 0 (vs. b = 0)

Search approaches:
- Gradient descent methods to find a solution in the solution region
- Maximize the margin
- Mapping to a high-dimensional space
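To illustrate the augmented-vector construction y = [1, x]^t and the sign normalization of class-ω_2 samples described above, here is a small assumed Python sketch; the sample points and the candidate weight vector a are hypothetical.

```python
# Sketch (assumed) of augmentation and normalization: after flipping the sign
# of class-2 samples, any a in the solution region satisfies a^t y_i > 0.
import numpy as np

def augment(X):
    """Map each sample x to y = [1, x]^t (prepend the constant 1)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def normalize_samples(Y, labels):
    """Flip the sign of class-2 samples so a correct a gives a^t y_i > 0 for all i."""
    Y = Y.copy()
    Y[labels == 2] *= -1.0
    return Y

# Illustrative data: two samples per class.
X = np.array([[1.0, 2.0], [2.0, 3.0],    # class 1
              [5.0, 1.0], [6.0, 2.0]])   # class 2
labels = np.array([1, 1, 2, 2])
Y = normalize_samples(augment(X), labels)

# A hypothetical weight vector a = [w0, w1, w2] inside the solution region:
a = np.array([3.5, -1.0, 0.0])
print(Y @ a)   # all entries positive, so a classifies every sample correctly
```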
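The slides list gradient descent as one way to find an a in the solution region (with margin b). The sketch below uses a batch perceptron-style criterion as one possible instance; the criterion, learning rate, margin value, and stopping rule are illustrative assumptions, not the lecture's prescribed algorithm. The normalized samples from the previous sketch are re-declared so the snippet stands alone.

```python
# Hedged sketch of one gradient-descent search for a in the margin-b solution
# region: repeatedly step in the direction of the summed violating samples
# (the negative gradient of a perceptron-style criterion).
import numpy as np

def gradient_descent_search(Y, b=0.1, eta=0.1, max_iters=1000):
    """Find a with a^t y_i > b for all normalized augmented samples y_i."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iters):
        margins = Y @ a
        violators = Y[margins <= b]       # samples not yet beyond the margin
        if len(violators) == 0:
            break                         # a lies inside the solution region
        a = a + eta * violators.sum(axis=0)
    return a

# Normalized augmented samples (same toy data as the previous sketch).
Y = np.array([[ 1.0,  1.0,  2.0], [ 1.0,  2.0,  3.0],
              [-1.0, -5.0, -1.0], [-1.0, -6.0, -2.0]])
a = gradient_descent_search(Y)
print("a =", a, " margins =", Y @ a)      # all margins exceed b after convergence
```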