EE 6885 Statistical Pattern Recognition, Fall 2005
Prof. Shih-Fu Chang, http://www.ee.columbia.edu/~sfchang
Lecture 10 (10/12/05)

Reading
- Distance Metrics: DHS Chap. 4.6
- Linear Discriminant Functions: DHS Chap. 5.1-5.4

Midterm Exam
- Oct. 24th, 2005 (Monday), 1pm-2:30pm (90 mins)
- Open books/notes, no computer

k_n-Nearest-Neighbor
- For classification, estimate the posterior of each class $\omega_i$:
  $p_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$
- Performance bound of the 1-nearest-neighbor rule (Cover & Hart '67):
  $P^* \le \lim_{n \to \infty} P_n(e) \le P^* \left( 2 - \frac{c}{c-1} P^* \right)$
- Combine K-NN with clustering (K-Means, LVQ, GMM) to reduce complexity
- When K increases: what happens to complexity? Decision boundaries become smoother

Distance Metrics
- Nearest-neighbor rules need a distance metric
- Required properties of a metric:
  1. non-negativity: $D(a, b) \ge 0$
  2. reflexivity: $D(a, b) = 0$ iff $a = b$
  3. symmetry: $D(a, b) = D(b, a)$
  4. triangular inequality: $D(a, b) + D(b, c) \ge D(a, c)$, equivalently $D(a, b) \ge D(a, c) - D(b, c)$
- Minkowski metric: $L_k(a, b) = \left( \sum_{i=1}^{d} |a_i - b_i|^k \right)^{1/k}$
  (special cases: Euclidean $k = 2$, Manhattan $k = 1$, $L_\infty$ -- useful in indexing)
- Tanimoto metric, for sets of elements where point-to-point distance is not useful:
  $D_{Tanimoto}(S_1, S_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}} = \frac{(n_1 - n_{12}) + (n_2 - n_{12})}{n_1 + n_2 - n_{12}}$
  where $n_1$, $n_2$ are the sizes of $S_1$, $S_2$ and $n_{12}$ is the size of their intersection

Discriminant Functions Revisited
- Define a discriminant function $g_i(x)$ for each class $\omega_i$; map $x$ to class $\omega_i$ if $g_i(x) \ge g_j(x)\ \forall j \ne i$
- e.g., $g_i(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$ (MAP classifier)
- Gaussian case: $P(x \mid \omega_i) = N(\mu_i, \Sigma_i)$,
  $P(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) \right)$
  $g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$
- Case I ($\Sigma_i = \Sigma$): $g_i(x) = w_i^t x + w_{i0}$, a hyperplane with bias $w_{i0}$
- Case II ($\Sigma_i$ arbitrary): $g_i(x) = x^t W_i x + w_i^t x + w_{i0}$
- Decision boundaries may be hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids

Discriminant Functions (Chap. 5)
- Directly define the discriminant functions and optimize them; do not assume parametric forms for $P(x \mid \omega_i)$
- Easy to derive useful classifiers
- Linear functions: $g(x) = w^t x + w_0$, where $w$ is the weight vector and $w_0$ the bias
- Two-category case:
  - map $x$ to class $\omega_1$ if $g(x) > 0$, otherwise to class $\omega_2$
  - decision surface $H$: $g(x) = 0$
  - write $x = x_p + r \frac{w}{\|w\|}$, where $x_p$ is the projection of $x$ onto $H$ (so $g(x_p) = 0$) and $r$ is the distance from $x$ to $H$
  - $g(x) = g\left( x_p + r \frac{w}{\|w\|} \right) = r \frac{w^t w}{\|w\|} = r \|w\| \;\Rightarrow\; r = \frac{g(x)}{\|w\|}$

Multi-category Case
- $c$ categories $\omega_1, \omega_2, \ldots, \omega_c$: how many classifiers are needed?
- Approaches:
  - use a two-class discriminant per class ("does $x$ belong to class $\omega_i$ or not?")
  - use a two-class discriminant for each pair of classes $\Rightarrow$ need $c(c-1)/2$ discriminant functions
  - general approach: one function for each class, $g_i(x) = w_i^t x + w_{i0}$; map $x$ to class $\omega_i$ if $g_i(x) \ge g_j(x)\ \forall j \ne i$;
    decision boundary $H_{ij}$: $g_i(x) = g_j(x)$, i.e., $(w_i - w_j)^t x + (w_{i0} - w_{j0}) = 0$
- Each decision region is convex and singly connected, which is good for unimodal distributions (a small numerical sketch of this decision rule follows below)
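To make the linear decision rule concrete, here is a minimal NumPy sketch (the function names and toy numbers are illustrative assumptions, not from the lecture): it evaluates $g_i(x) = w_i^t x + w_{i0}$ for every class, assigns $x$ to the class with the largest value, and computes the signed distance $r = g(x)/\|w\|$ to a pairwise boundary.

```python
import numpy as np

def classify_linear(x, W, w0):
    """W: (c, d) array of weight vectors w_i, w0: (c,) array of biases w_i0."""
    g = W @ x + w0               # g_i(x) = w_i^t x + w_i0 for every class i
    return int(np.argmax(g))     # map x to the class with the largest discriminant

def signed_distance(x, w, w0):
    """Two-category case: signed distance r = g(x) / ||w|| from x to H: g(x) = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

# toy usage with made-up numbers: three classes in 2-D
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 0.2])
x = np.array([0.8, 0.3])
print(classify_linear(x, W, w0))                        # index of the winning class
print(signed_distance(x, W[0] - W[1], w0[0] - w0[1]))   # signed distance to H_12
```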
Method for Searching Decision Boundaries
- $g_i(x) = w_i^t x + w_{i0}$ $\Rightarrow$ find the weight vector $w_i$ and the bias $w_{i0}$
- Augmented vectors:
  $y = \begin{bmatrix} 1 \\ x \end{bmatrix} = [1, x_1, \ldots, x_d]^t$, $\quad a_i = \begin{bmatrix} w_{i0} \\ w_i \end{bmatrix} = [w_{i0}, w_{i1}, \ldots, w_{id}]^t$ $\;\Rightarrow\; g_i(y) = a_i^t y$
- Decision boundary: $H_{ij}: (w_i - w_j)^t x + (w_{i0} - w_{j0}) = 0 \;\Rightarrow\; H_{ij}: (a_i - a_j)^t y = 0$
- Two-category case: $H: w^t x + w_0 = 0 \;\Rightarrow\; H: a^t y = 0$, a hyperplane in the augmented $y$ space with normal vector $a$ (it passes through the origin)

Search Method for Linear Discriminant
- All sample points reside in the $y_1 = 1$ subspace
- Distance from $x$ to the boundary in $x$ space: $r = \frac{g(x)}{\|w\|}$; in the augmented $y$ space: $r' = \frac{a^t y}{\|a\|} \le r$, and $r'$ has the same sign as $r$, so bounds on $r'$ give bounds on $r$
- Design objective: find $a$ that correctly classifies every sample:
  for all $y_i$ in class $\omega_1$, $a^t y_i > 0$; for all $y_i$ in class $\omega_2$, $a^t y_i < 0$
- Normalization: for all $y_i$ in class $\omega_2$, replace $y_i \leftarrow -y_i$
- New design objective: for all $y_i$ (in class $\omega_1$ or $\omega_2$), $a^t y_i > 0$
- Solution region: the intersection of the positive sides of all the hyperplanes

Searching Linear Discriminant Solutions
- Solution region with margin: for all $y_i$ in class $\omega_1$ or $\omega_2$, $a^t y_i > b$
- Search approaches: gradient descent methods to find a solution in the solution region; maximize the margin

Gradient Descent (GD)
- Choose a criterion function $J(a)$ that is minimized when $a$ is in the solution region
- Examples (with $\mathcal{Y}$ the set of misclassified samples):
  - number of misclassified samples: $J(a) = |\mathcal{Y}|$
  - sum of distances from the misclassified samples to $H$ (perceptron criterion): $J_p(a) = \sum_{y \in \mathcal{Y}} (-a^t y)$
  - quadratic error: $J_q(a) = \sum_{y \in \mathcal{Y}} (a^t y)^2$
  - quadratic error with margin: $J(a) = \frac{1}{2} \sum_{y \in \mathcal{Y}} \frac{(a^t y - b)^2}{\|y\|^2}$, where $\mathcal{Y} = \{ y \mid a^t y < b \}$
- Update rule: repeat $a(k+1) = a(k) - \eta(k) \nabla J(a(k))$, where $\eta(k)$ is the learning rate, until the misclassification set $\mathcal{Y}$ is empty or another stopping criterion is met

Different Criterion Functions
- [Figure: plots of the four criterion functions over weight space. Notes: GD is not applicable to the counting criterion; the perceptron criterion is not differentiable everywhere; with the quadratic criterion, solutions may be trapped on the boundary of the solution region; the margin version moves solutions away from the boundaries.]

GD Based on the Perceptron Criterion
- $J_p(a) = \sum_{y \in \mathcal{Y}} (-a^t y)$, where $\mathcal{Y}$ is the set of misclassified samples, so $\nabla J_p(a) = \sum_{y \in \mathcal{Y}} (-y)$
- Batch perceptron update:
  - initialize $a(1)$, choose the rate $\eta(\cdot)$ and the stopping threshold $\theta$
  - loop: $a(k+1) = a(k) + \eta(k) \sum_{y \in \mathcal{Y}} y$, until $\left\| \eta(k) \sum_{y \in \mathcal{Y}} y \right\| < \theta$
- Example: $a(1) = 0$, $\eta(k) = 1$ -- at each step, add the sum of the misclassified samples
- Theorem: if the samples are linearly separable, a solution can always be found within a finite number of steps
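A minimal NumPy sketch of the batch perceptron loop above (the function name, stopping details, and toy data are assumptions for illustration): it augments each sample with a leading 1, negates the class-$\omega_2$ samples, and repeatedly adds $\eta(k)$ times the sum of the misclassified samples to $a$ until none remain or the update falls below $\theta$.

```python
import numpy as np

def batch_perceptron(X, labels, eta=1.0, theta=1e-6, max_iter=1000):
    """X: (n, d) samples; labels: +1 for class w1, -1 for class w2."""
    # augment: y = [1, x]^t, then "normalize" by negating the class-w2 samples
    Y = np.hstack([np.ones((X.shape[0], 1)), X]) * labels[:, None]
    a = np.zeros(Y.shape[1])                    # a(1) = 0, as in the example
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]           # samples with a^t y <= 0
        if len(misclassified) == 0:             # all samples on the positive side
            break
        update = eta * misclassified.sum(axis=0)   # eta(k) * sum of misclassified y
        a = a + update                          # a(k+1) = a(k) + eta(k) * sum
        if np.linalg.norm(update) < theta:      # stop criterion ||update|| < theta
            break
    return a                                    # augmented weight vector [w0, w]

# toy usage: two linearly separable classes in 2-D (made-up data)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
labels = np.array([1, 1, -1, -1])
a = batch_perceptron(X, labels)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ a))   # recovers the labels
```

For separable data the loop terminates with an empty misclassification set, consistent with the finite-step convergence theorem stated above.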