Functional Bregman Divergence, Bayesian Estimation of Distributions, and Completely Lazy Classifiers

Maya R. Gupta, Dept. of Electrical Engineering, University of Washington
Sergey Feldman, Dept. of Electrical Engineering, University of Washington
Santosh Srivastava, Fred Hutchinson Cancer Research Center
Bela A. Frigyik, Dept. of Mathematics, Purdue University
Eric Garcia, Dept. of Electrical Engineering, University of Washington

This talk: Bregman divergence; functional Bregman divergence; Bayesian estimation of distributions (uniform, Gaussian); Bayesian QDA; local BDA; Gaussian mixture; completely lazy learning.

The mean minimizes average squared error

Let x₁, x₂, …, x_N ∈ Rⁿ and
A* = arg min_{A ∈ Rⁿ} (1/N) Σ_j ‖x_j − A‖₂².
Then A* = (1/N) Σ_j x_j, the sample mean.
Are there other distortion functions that yield the sample mean?

The mean minimizes average Bregman divergence (Banerjee et al., JMLR '05 and IEEE Trans. on Info. Theory '05)

Let x₁, x₂, …, x_N ∈ Rⁿ, let d(x, y) be any Bregman divergence, and
A* = arg min_{A ∈ Rⁿ} (1/N) Σ_j d(x_j, A).
Then A* = (1/N) Σ_j x_j.

Bregman divergence between vectors

A class of distortion functions that includes the sum of squared errors, relative entropy, the Itakura-Saito distance, and others. General formula:
d_φ(x, y) = φ(x) − φ(y) − ∇φ(y)ᵀ(x − y),  x, y ∈ Rⁿ,
where φ is a strictly convex function.
Total squared error: φ(x) = Σ_i x[i]²
Relative entropy: φ(x) = Σ_i x[i] log x[i]

d_φ(x, y) is the tail of the Taylor series expansion of φ around y:
φ(x) = φ(y) + ∇φ(y)ᵀ(x − y) + d_φ(x, y)

Relationship to Bayesian Estimation

Let d(x, y) be any Bregman divergence.
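As a quick numerical check of the mean-minimizer property of such divergences: for the generalized relative-entropy generator φ(x) = Σ_i x[i] log x[i], perturbing away from the sample mean never decreases the average divergence. A toy sketch (the data and the choice of generator are ours):

```python
import numpy as np

def bregman_kl(x, y):
    """Bregman divergence generated by phi(x) = sum_i x_i log x_i
    (generalized relative entropy, for positive vectors)."""
    return np.sum(x * np.log(x / y) - x + y)

rng = np.random.default_rng(0)
data = rng.uniform(0.5, 2.0, size=(100, 3))   # positive vectors
mean = data.mean(axis=0)

def avg_div(a):
    """Average Bregman divergence from the data to a candidate point a."""
    return np.mean([bregman_kl(x, a) for x in data])

# Perturbing away from the mean never decreases the average divergence.
for _ in range(20):
    candidate = mean + rng.normal(scale=0.05, size=3)
    assert avg_div(candidate) >= avg_div(mean)
```

The same check passes for any Bregman generator, which is the content of the Banerjee et al. theorem above.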
Goal: estimate a parameter θ̂ ∈ R, given candidates θ ∈ R and a posterior p(θ). Consider the Bayesian estimate with d as the risk function:
θ* = arg min_{θ̂ ∈ R} ∫ p(θ) d(θ, θ̂) dθ = arg min_{θ̂ ∈ R} E_Θ[d(Θ, θ̂)]
Then the Banerjee et al. theorem says the minimizer is the mean:
θ* = E_Θ[Θ]

Example: say you flip 1 tail and 9 heads, and let θ = P(tails).
[Figure: the likelihood p(θ) of θ given 1 tail and 9 heads, peaked near θ = 0.1]
Then θ* = E_Θ[Θ] = 0.1666, while θ̂_MLE = 0.1.

Estimation of Distributions

Goal: estimate a distribution f̂(x), given candidates f : R → R and a posterior p(f).
Ex: given samples {2, 3, 7, 8}, estimate the generating uniform distribution U[0, a].
Maximum likelihood: f̂(x)_ML = U[0, 8].
Bayesian parameter estimate: arg min_{â ∈ R⁺} E_A[(A − â)²].
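The coin-flip numbers above are consistent with a uniform prior on θ: the posterior after 1 tail and 9 heads is then Beta(2, 10), whose mean is 2/12 ≈ 0.1666. A sketch (the uniform-prior assumption is ours; the slides do not state the prior explicitly):

```python
from math import isclose

# 1 tail, 9 heads; theta = P(tails).
tails, heads = 1, 9

# Maximum-likelihood estimate: the empirical frequency.
theta_mle = tails / (tails + heads)

# Assuming a uniform (Beta(1,1)) prior, the posterior is
# Beta(tails + 1, heads + 1), and its mean is the Bayesian estimate
# under any Bregman-divergence risk:
theta_star = (tails + 1) / (tails + heads + 2)

assert isclose(theta_mle, 0.1)
assert isclose(theta_star, 1 / 6)   # ≈ 0.1666, the value on the slide
```

Any Bregman risk (squared error, relative entropy, …) yields the same posterior-mean estimate, which is the point of the theorem.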
Assume a gamma prior for A. Then the Bayesian parameter estimate arg min_{â ∈ R⁺} E_A[(A − â)²] yields a uniform of the form
U[ 0, (1/γ₀) · P(χ²_{γ₁} < 2/(γ₂ X²_max)) / P(χ²_{γ₃} < 2/(γ₄ X²_max)) ],
where X_max is the largest sample and γ₀, …, γ₄ are constants determined by the gamma prior.

Bayesian Estimation of Distributions (Frigyik, Gupta, Srivastava '06)

Goal: estimate a distribution f̂(x), given candidates f : R → R and a posterior p(f).
Ex: given samples {2, 3, 7, 8}, estimate the generating uniform distribution U[0, a].
f*(x) = arg min_{f̂(x)} E_F[d(F, f̂)] ≡ arg min_{f̂(x)} ∫_{f ∈ U} d(f, f̂) p(f) dU

Does f* ≡ E_F[F] hold if d is a Bregman divergence? And what is a Bregman divergence between functions?

Bregman Divergence Definitions

Bregman divergence, for vectors x, y ∈ Rⁿ and convex φ:
d_φ(x, y) = φ(x) − φ(y) − ∇φ(y)ᵀ(x − y)

Pointwise Bregman divergence, for functions f(t), g(t) (Jones and Byrne 1990; Csiszár 1995):
d_φ(f, g) = ∫ d_φ(f(t), g(t)) dν(t)
Question: does arg min_{f̂(x)} E_F[d(F, f̂)] ≡ E_F[F] still hold?

Functional Bregman Divergence (Frigyik, Srivastava, Gupta, arXiv 2006)

Let f, g : Rⁿ → R with f, g ≥ 0 and f, g ∈ Lᵖ(ν), and let φ : Lᵖ(ν) → R be a strictly convex functional, φ ∈ C². Then
d_φ(f, g) = φ[f] − φ[g] − δφ[g; f − g],
where δφ[g; f − g] is the Fréchet derivative of φ at g in the direction f − g.

Fréchet derivative:
φ[g + a] − φ[g] = δφ[g; a] + ε[g, a] ‖a‖_{Lᵖ(ν)}
for all a ∈ Lᵖ(ν), with ε[g, a] → 0 as ‖a‖_{Lᵖ(ν)} → 0.

Ex: total squared error, φ[g] = ∫ g² dν. (Compare with the vector Bregman divergence generated by φ(x) = Σ_i x[i]².) Then
δφ[g; f − g] = ∫ 2g(f − g) dν,
d_φ(f, g) = ∫ f² dν − ∫ g² dν − ∫ 2g(f − g) dν = ∫ (f − g)² dν = ‖f − g‖²_{L²(ν)}.
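The total-squared-error identity above is easy to verify on a grid. A discretized sketch (the grid and the two example functions are our own):

```python
import numpy as np

# Discretized sketch of the functional Bregman divergence with
# phi[g] = ∫ g² dν, approximated on a uniform grid over [0, 1].
t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]

f = np.exp(-t)          # two example functions
g = 1.0 - 0.5 * t

phi = lambda h: np.sum(h**2) * dt            # phi[h] = ∫ h²
dphi = lambda g, a: np.sum(2 * g * a) * dt   # δphi[g; a] = ∫ 2 g a

d_functional = phi(f) - phi(g) - dphi(g, f - g)
d_l2 = np.sum((f - g) ** 2) * dt             # ‖f − g‖²_{L²}

# phi[f] − phi[g] − δphi[g; f − g] collapses to ∫ (f − g)² exactly.
assert abs(d_functional - d_l2) < 1e-9
```

The equality is exact even after discretization, since the cancellation ∫ f² − g² − 2g(f − g) = ∫ (f − g)² is algebraic, not a quadrature effect.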
Functional Bregman divergences include pointwise Bregman divergences, and more. Ex: squared bias,
φ[g] = (∫ g dν)²  ⇒  d_φ(f, g) = (∫ (f − g) dν)².

The functional Bregman divergence has the same properties as the vector Bregman divergence (Frigyik, Srivastava, Gupta, arXiv 2006):
- non-negativity
- convexity with respect to the first function
- linearity with respect to the φ-functionals
- equivalence classes with respect to the φ-functionals
- dual divergences by Legendre transformation
- a generalized Pythagorean inequality

Bayesian Estimation of Distributions (Frigyik, Srivastava, Gupta, arXiv 2006)

Theorem: for a random function F defined on a finite-dimensional manifold (e.g., a parametric distribution, or one decomposable in terms of finitely many basis functions) with posterior p_F,
f* = arg min_{f̂} E_F[d_φ(F, f̂)] ≡ E_F[F].

Uniform Example (Frigyik, Srivastava, Gupta)

Ex: given samples {2, 3, 7, 8}, estimate the generating uniform distribution U[0, a]. Let F be a random uniform distribution U[0, a], and let p_F be the likelihood of F given N data samples. Then
f*(x) = arg min_{f̂} E_F[d_φ(F, f̂)] ≡ E_F[F],
f*(x) = [ ∫_{a = max(x, X_max)}^∞ (1/a)(1/aᴺ) ‖df/da‖₂ da ] / [ ∫_{a = X_max}^∞ (1/aᴺ) ‖df/da‖₂ da ].
(Actually, we use the Fisher information metric for dU.)

Bayesian Estimation of Distributions (Frigyik, Srivastava, Gupta)

Same uniform example, evaluated with the Fisher information measure dU = da/a:
f* = arg min_{f̂} E_F[d_φ(F, f̂)] ≡ E_F[F]
f*(x) = [ ∫_{max(x, X_max)}^∞ (1/a)(1/aᴺ) (da/a) ] / [ ∫_{X_max}^∞ (1/aᴺ) (da/a) ]

Evaluating the integrals gives the closed form
f*(x)_Bayesian = arg min_{f̂(x)} E_F[d_φ(F, f̂)] ≡ E_F[F] = N (X_max)ᴺ / ( (N + 1) [max(x, X_max)]^{N+1} ).

Compare estimates

Let F be a random uniform distribution U[0, a], and let p_F be the likelihood of F given the data samples.
- Functional Bregman (mean distribution): arg min_{f̂(x)} E_F[d_φ(F, f̂)] = N (X_max)ᴺ / ( (N + 1) [max(x, X_max)]^{N+1} )
- Best uniform in squared error: arg min_{f̂(x) ∈ U} E_F[ (‖F − f̂‖₂)² ] = U[0, X_max 2^{1/N}]
- Bayesian parameter estimate of a, with gamma prior p(a): arg min_{â ∈ R⁺} E_A[(A − â)²] ⇒ U[ 0, (1/γ₀) · P(χ²_{γ₁} < 2/(γ₂ X²_max)) / P(χ²_{γ₃} < 2/(γ₄ X²_max)) ]

Simulation: draw n random samples from U[0, 1]; the metric is the squared error between the estimated distribution and U[0, 1].
[Figure: squared estimation error vs. number of data samples for the maximum-likelihood estimate, the best-uniform estimate, the Bayesian parameter estimate, and the functional Bregman estimate, with the functional Bregman estimate achieving the lowest error.]

This talk (recap): Bregman divergence; functional Bregman divergence; Bayesian estimation of distributions (uniform, Gaussian); Bayesian QDA; local BDA; Gaussian mixture; completely lazy learning.

Classification set-up

Training data T = {X_i, Y_i}: feature vectors X_i ∈ R^d for i = 1, …, n, with associated labels Y_i ∈ G, where G is a finite set of classes. Given a test vector X, estimate its associated label Ŷ.

QDA: classifying with Gaussian models

[Figure: two classes of labeled training points in a two-dimensional feature space (Feature 1 = X[1], Feature 2 = X[2]) with a test point "?"; a Gaussian is fit to each class and the test point is classified by the fitted models.]
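The QDA picture above can be sketched in a few lines: fit a Gaussian to each class, then classify the "?" point by the larger fitted likelihood. A minimal sketch with toy data of our own (equal class priors assumed):

```python
import numpy as np

# Toy two-class data (ours): one cluster near the origin, one near (3, 3).
rng = np.random.default_rng(0)
class1 = rng.normal(loc=[0.0, 0.0], scale=0.7, size=(25, 2))
class2 = rng.normal(loc=[3.0, 3.0], scale=0.7, size=(25, 2))

def fit_gaussian(X):
    """ML Gaussian fit: sample mean and covariance."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_likelihood(x, mu, sigma):
    """Log-density of x under N(mu, sigma)."""
    d = len(x)
    z = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (z @ np.linalg.solve(sigma, z) + logdet + d * np.log(2 * np.pi))

params = [fit_gaussian(class1), fit_gaussian(class2)]
test = np.array([2.5, 2.8])                       # the "?" point
scores = [log_likelihood(test, mu, s) for mu, s in params]
predicted = int(np.argmax(scores)) + 1
assert predicted == 2   # the test point lands under class 2's Gaussian
```

The Bayesian variants on the next slides replace the plug-in ML Gaussian used here with an expectation over Gaussians.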
Bayesian: Minimizing Expected Misclassification Costs

Y = arg min_g Σ_{h=1}^G C(g, h) p(x | Y = h) P(Y = h)

Ŷ = arg min_g E[ Σ_{h=1}^G C(g, h) N_h(x) Θ_h ]
  ≡ arg min_g Σ_{h=1}^G C(g, h) E[N_h(x)] E_Θ[Θ_h]

Two ways to take E[N_h(x)]: the parameter-based E_{μ_h, Σ_h}[N_h(x)] (Geisser 1964), or the distribution-based E_{N_h}[N_h(x)] (Srivastava, Gupta 2006).

Distribution-based Bayesian Minimum Expected Misclassification Cost (Srivastava and Gupta, IEEE ISIT 2006)

For a test point x and class h,
E_{N_h}[N_h(x)] = ∫_M N(x) f(N | T_h) dM,
where N(x) is the probability of the test point under some Gaussian, f(N | T_h) is the probability of that Gaussian given the training data and the prior, and dM is a measure over the space of Gaussians:
dM = dμ dΣ / |Σ|^{(d+2)/2},
a differential element based on the Fisher information matrix (C. R. Rao '45).

The probability of a Gaussian given the training data is
f(N | T_h) = ( Π_{j=1}^k N(X_j) ) p(N),
the likelihood of the iid training samples times the prior probability of that Gaussian.

Prior matters with minimum expected risk

Design goals for the prior p(N) over the Gaussian distributions:
1) Regularize the ill-posed likelihood to reduce estimation variance (not a flat prior).
2) Add sensible bias.
3) Allow the estimation to converge as the number of training samples becomes infinite.
4) Lead to a closed-form solution.
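The inverted-Wishart prior on the slides that follow leads to a closed-form discriminant of multivariate-t type. A numerical sketch of that closed form (toy data, variable names, and the default q = d are ours):

```python
import numpy as np
from math import lgamma, pi

def bda_discriminant(x, X_h, q=None):
    """Sketch of the closed-form E_{N_h}[N_h(x)] from the slides:
    Gamma((n+q+1)/2) (1 + Z'D^{-1}Z/(n+1))^{-(n+q+1)/2}
      / (pi^{d/2} Gamma((n+q-d+1)/2) |((n+1)/n) D|^{1/2}),
    with D = S + B, B = .95 q diag(Sigma_ML) + .05 I, Z = x - mean."""
    n, d = X_h.shape
    q = d if q is None else q
    mean = X_h.mean(axis=0)
    S = (X_h - mean).T @ (X_h - mean)          # scatter matrix
    sigma_ml = S / n
    B = 0.95 * q * np.diag(np.diag(sigma_ml)) + 0.05 * np.eye(d)
    D = S + B
    Z = x - mean
    quad = Z @ np.linalg.solve(D, Z)
    log_num = lgamma((n + q + 1) / 2) \
        - (n + q + 1) / 2 * np.log1p(quad / (n + 1))
    log_den = d / 2 * np.log(pi) + lgamma((n + q - d + 1) / 2) \
        + 0.5 * np.linalg.slogdet((n + 1) / n * D)[1]
    return np.exp(log_num - log_den)

rng = np.random.default_rng(1)
X_h = rng.normal(size=(30, 2))                 # toy class sample
near, far = np.zeros(2), np.array([10.0, 10.0])
# The discriminant behaves like a heavy-tailed density around the class.
assert bda_discriminant(near, X_h) > bda_discriminant(far, X_h) > 0.0
```

Working in log space (`lgamma`, `slogdet`, `log1p`) avoids overflow for larger n and d.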
Proposed Prior

p(N_h) = γ₀ exp(−½ trace(Σ_h⁻¹ B_h)) / |Σ_h|^{q/2}   (an inverted Wishart),
whose mode is Σ_{h,max} = B_h / q. We set B_h = q diag(Σ̂_{h,ML}), so the prior over Σ peaks at diag(Σ̂_ML).
[Figure: the prior p(N) as a function of Σ, peaked at diag(Σ̂_ML); a larger q (q = 5d vs. q = d) gives a sharper peak.]

Distribution-based classifier and closed-form solution

Choose the class Ŷ = g ∈ G that minimizes Σ_{h=1}^G C(g, h) E_{N_h}[N_h(x)] E_Θ[Θ_h]. Closed-form solution:

E_{N_h}[N_h(x)] = Γ((n_h + q + 1)/2) (1 + Z_hᵀ D_h⁻¹ Z_h / (n_h + 1))^{−(n_h + q + 1)/2} / ( π^{d/2} Γ((n_h + q − d + 1)/2) |((n_h + 1)/n_h) D_h|^{1/2} ),

with B_h = .95 q diag(Σ̂_{ML,h}) + .05 I, D_h = S_h + B_h, and Z_h = x − x̄_h.

Compare: the parameter-based Bayesian discriminant (Geisser, 1964) is

E_{μ,Σ}[N_h] = Γ((n_h + q − d − 1)/2) (1 + Z_hᵀ D_h⁻¹ Z_h / (n_h + 1))^{−(n_h + q − d − 1)/2} / ( π^{d/2} |((n_h + 1)/n_h) D_h|^{1/2} Γ((n_h + q − 2d − 1)/2) ).

Difference: for the parameter-based discriminant you need n_h > 2d − q + 1 samples for each class. If you have few samples, you are forced to use a high q, i.e., more bias.

This talk (recap): Bregman divergence; functional Bregman divergence; Bayesian estimation of distributions (uniform, Gaussian); Bayesian QDA (reduce model bias); local BDA; Gaussian mixture; completely lazy learning.

Local BDA

1. Find the k samples from each class nearest to the test sample.
2. Fit a Gaussian to the nearest k samples of each class.
3. Classify as the class that minimizes expected misclassification costs.

Related work: Local Nearest Means (Mitani and Hamamoto, 2000); SVM-KNN (Malik et al., 2006).

[Figure: local BDA with 7 neighbors: Gaussians are fit to the 7 training samples of each class nearest the test point "?".]

How do we choose the neighborhood size?

Standard: cross-validate on the training set. Not truly lazy:
- theoretically sound only if training and test data are iid;
- if the training set is evolving, you must re-train.

Proposed: average over multiple neighborhood sizes = completely lazy. Choose the class Ŷ that solves
arg min_{g = 1, …, G} Σ_{h=1}^G C(g, h) E_{N_h,K}[N_{h,K}(x)] E_Θ[Θ_h],
i.e., average the discriminant with respect to the uncertainty both in the Gaussian fit to the training samples and in the neighborhood size.

Representative misclassification rates (%) on UCI datasets

                      Optical Char. Rec.      Isolet
                      cv k      E_K[·]        cv k      E_K[·]
Local Nearest Means   3.3       3.2           4.3       3.9
Local BDA             2.6       1.7           3.1       3.1
kNN                   3.5       3.5           8.7       6.9
DANN                  4.0       4.3           8.6       8.4
SVM-kNN               2.6       2.7           3.7       3.1
SVM                   2.9                     4.3
GMM                   10.9                    68.4
GMM BQDA              5.5                     68.4

Summary

1. Proposed a functional Bregman divergence for functions f, g: d_φ(f, g) = φ[f] − φ[g] − δφ[g; f − g].
2. Showed that Bayesian distribution estimation with a functional Bregman divergence yields the mean distribution: f*(x) = arg min_{f̂(x)} E_F[d_φ(F, f̂)] ≡ E_F[F].
3. Demonstrated Bayesian distribution estimation on the uniform.
4. Applied Bayesian Gaussian estimates locally for classification.
5. Proposed a Bayesian-estimate neighborhood for completely lazy classifiers.

Papers at idl.ee.washington.edu:
Bayesian Quadratic Discriminant Analysis, Srivastava, Gupta, Frigyik, Journal of Machine Learning Research, Oct. 2007.
Functional Bregman Divergence and Bayesian Estimation of Distributions, Frigyik, Gupta, Srivastava, in review, available on arXiv.

Extra slides

Fisher Information Metric (C. R. Rao '45, Jeffreys '46)

dM = |I(a)|^{1/2} da, where I(a) is the Fisher information matrix. For the 1-d manifold M formed by the set U,
I(a) = E_X[ (d log(1/a) / da)² ] = ∫_{x=0}^a (1/a²)(1/a) dx = 1/a²  →  dM = (1/a) da.

At each point on the statistical manifold U define a tangent, which specifies a tangent space.
If an inner product is defined on each tangent space, the collection of inner products is a Riemannian metric: ⟨·,·⟩ = {⟨·,·⟩_f | f ∈ U}. Together with ⟨·,·⟩, U is a Riemannian manifold, and a Riemannian metric yields a natural volume element, i.e., a measure.

The mean minimizes average squared error, but not average ℓ₂ error

Let x₁, …, x_N ∈ Rⁿ and A* = arg min_{A ∈ Rⁿ} (1/N) Σ_j ‖x_j − A‖₂²; then A* = (1/N) Σ_j x_j. This is not true of the (unsquared) ℓ₂ error:
C* = arg min_{C ∈ Rⁿ} (1/N) Σ_j ‖x_j − C‖₂,  C* ≠ A*.
C* minimizes the length of string needed to connect to the points.

Bayesian Estimation of Distributions (Frigyik, Srivastava, Gupta)

Same uniform example, written out with ‖df/da‖₂ = 1/a^{3/2}:
f*(x) = [ ∫_{max(x, X_max)}^∞ (1/a)(1/aᴺ) (da/a^{3/2}) ] / [ ∫_{X_max}^∞ (1/aᴺ) (da/a^{3/2}) ].

BDA discriminant acts like a regularized covariance estimate

Approximate the (1 + Z_hᵀ D_h⁻¹ Z_h/(n_h + 1)) factor of E_{N_h}[N_h] using 1 + r ≈ e^r:
E_{N_h}[N_h] ≈ Γ((n_h + q + 1)/2) exp[ −½ Z_hᵀ Σ̃_h⁻¹ Z_h ] / ( π^{d/2} Γ((n_h + q − d + 1)/2) |((n_h + 1)/n_h) D_h|^{1/2} ),
where
Σ̃_h = ((n_h + 1)/(n_h + q + 1)) (D_h / n_h) ≈ (1 − q/(n_h + q + 1)) S_h/n_h + (q/(n_h + q + 1)) B_h/q,
roughly a convex combination of the sample covariance and the prior matrix.

[Figure: average distortion (1/5) Σ_{j=1}^5 d(x_j, A) between each point A in the space and five fixed points, for two Bregman divergences: left, squared error d_φ(x_j, A) = ‖x_j − A‖₂² from φ(x) = ‖x‖₂²; right, the divergence from φ(x) = ‖x‖₂⁴, which results in a complicated divergence function d_φ.]
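The C* ≠ A* point above is easy to see numerically: the ℓ₂ minimizer is the geometric median, not the mean. A sketch using Weiszfeld's iteration on toy points of our own:

```python
import numpy as np

# Toy points (ours): three clustered near the origin, one far away.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
mean = pts.mean(axis=0)            # A*: minimizes average squared error

def avg_l2(c):
    """Average unsquared Euclidean distance from c to the points."""
    return np.linalg.norm(pts - c, axis=1).mean()

# Weiszfeld's iteration for the geometric median C*.
c = mean.copy()
for _ in range(500):
    w = 1.0 / np.linalg.norm(pts - c, axis=1)
    c = (pts * w[:, None]).sum(axis=0) / w.sum()

assert avg_l2(c) < avg_l2(mean)    # C* beats A* on average l2 error
assert np.linalg.norm(c - mean) > 0.5   # and C* != A* here
```

The outlier drags the mean far more than the median, which is why the unsquared ℓ₂ error does not fall in the Bregman family of the earlier theorem.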
Bayesian estimate if forced to be uniform

The MER estimate solves arg min_{f̂ ∈ U} E_F[ ‖F − f̂‖₂² ], the expected error weighted by the likelihood of p. With p uniform from zero to a, as in our example, the MER estimate is a uniform q = U[0, b] with b = X_max 2^{1/N}.

Regularized Discriminant Analysis (RDA) (Friedman 1989)

Σ̂_h(λ, γ) = (1 − γ) Σ̂_h(λ) + (γ/d) trace(Σ̂_h(λ)) I,
where γ controls shrinkage towards a multiple of the identity, and
Σ̂_h(λ) = ( (1 − λ) S_h + λ S ) / ( (1 − λ) n_h + λ n ),
where λ controls the degree of shrinkage of the class covariance matrix towards the pooled covariance (λ = 0, γ = 0 gives QDA; λ = 1 gives LDA; γ = 1 gives nearest-means with identity covariance).

Proposed Prior: deriving the mode

p(N_h) = p(μ_h) p(Σ_h) = γ₀ exp(−½ trace(Σ_h⁻¹ B_h)) / |Σ_h|^{q/2}   (inverted Wishart).
Differentiate log p(N_h) with respect to Σ_h to solve for Σ_{h,max}:
(1/2) ∂/∂Σ_h trace(Σ_h⁻¹ B_h) + (q/2) ∂/∂Σ_h log |Σ_h| = 0
−Σ_{h,max}⁻¹ B_h Σ_{h,max}⁻¹ + q Σ_{h,max}⁻¹ = 0
Σ_{h,max} = B_h / q.

[Table: % misclassification error on UCI and Statlog benchmark datasets (Glass, Heart Disease, Ionosphere, Iris, Letter Recognition, Pen Digits, Pima, Sonar, Thyroid, Vowel) for Local BDA with B = I, B = Trace, and B = Diag, Local Nearest Means, and k-NN. For Glass: Local BDA B=I 23.44, B=Trace 23.78, B=Diag 27.67, Local Nearest Means 26.17, k-NN 26.56.]

Apply to Nearest-Neighbor Learning (Gupta, Cazzanti, Srivastava,
IEEE SSP '05)

Goal: classify x based on its k nearest neighbors such that the expected misclassification cost is minimized. Let θ_h be the (unknown) P(class h | neighbors).

Ideal: g* = arg min_g Σ_h Cost(g, h) θ_h
Standard: g* = arg min_g Σ_h Cost(g, h) θ̂_h
Minimize expected cost: g* = arg min_g E_Θ[ Σ_h Cost(g, h) Θ_h ] ≡ arg min_g Σ_h Cost(g, h) E_Θ[Θ_h]

This is equivalent to minimizing an expected Bregman divergence: each E_Θ[Θ_h] is the Bayesian estimate arg min_{θ̂} E_Θ[d(Θ, θ̂)] for any Bregman divergence d, so
g* ≡ arg min_g Σ_h Cost(g, h) ( arg min_{θ̂} E_Θ[d(Θ, θ̂)] ).

How much better is MER than ML?
PMF estimation with 100 training and 100 test samples on a 3-D Kohonen simulation.
[Figure: mean-squared error of the class pmf vs. the number of neighbors k, for k-NN, tricube, and LIME weightings; shapes show the maximum-likelihood estimate, lines the Bayesian MER estimate.]

MER classification results

Classification with 1000 training, 1000 test, and 50,000 validation samples on a 4-D Kohonen simulation.
[Figure: ratio of ML cost to MER cost vs. threshold for k-NN, under equal and unequal costs.]
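The MER-vs-ML gap in the pmf plots can be reproduced in miniature: estimate a class pmf from k neighbors either by raw frequencies (ML) or by the posterior mean under a uniform Dirichlet prior (an MER-style estimate), and compare mean-squared errors. A toy sketch (true pmf, prior choice, and simulation sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
true_pmf = np.array([0.6, 0.3, 0.1])   # toy class probabilities (ours)
k, trials = 10, 5000                   # k neighbors, Monte Carlo trials
G = len(true_pmf)

mse_ml, mse_mer = 0.0, 0.0
for _ in range(trials):
    counts = rng.multinomial(k, true_pmf)          # labels of k neighbors
    theta_ml = counts / k                          # maximum likelihood
    theta_mer = (counts + 1) / (k + G)             # posterior mean, uniform Dirichlet prior
    mse_ml += np.sum((theta_ml - true_pmf) ** 2)
    mse_mer += np.sum((theta_mer - true_pmf) ** 2)

# With few neighbors, the posterior-mean estimate has lower MSE,
# mirroring the "shapes vs. lines" gap in the figure.
assert mse_mer < mse_ml
```

The posterior-mean estimate trades a small bias toward uniform for a large variance reduction, which dominates when k is small.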