Kernel Density Estimation – Theory and Application in Discriminant Analysis
Thomas Ledl, Universität Wien

Contents: Introduction – Theory – Aspects of Application – Simulation Study – Summary

Introduction

25 observations – which distribution?
[Figure: histogram of 25 observations on [0, 4], together with several candidate density curves]

Kernel density estimator model:

    f_h(x) = (1/(n*h)) * sum_{i=1}^{n} K((x - x_i)/h)

K(.) and h are to be chosen.
[Figure: estimates with a triangular and a Gaussian kernel, each for a "small" and a "large" bandwidth h]

Question 1: Which choice of K(.) and h is best for a descriptive purpose?

Classification:
[Figure: two estimated class densities over a two-dimensional grid]

Levelplot – LDA (based on the assumption of a multivariate normal distribution):
[Figure: levelplot of the LDA decision regions for five classes in the (V1, V2) plane]

Levelplot – KDE classificator:
[Figure: levelplot of the KDE-based decision regions for the same five classes]

Question 2: How does classification based on KDE perform in more than 2 dimensions?

Theory

Essential issues:
- Optimization criteria
- Improvements of the standard model
- Resulting optimal choices of the model parameters K(.) and h

Optimization criteria

Lp-distances:
[Figure: true density f(.) and estimate g(.)]
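As a concrete illustration of the kernel density estimator model introduced above, here is a minimal sketch in Python. The sample, grid, and bandwidth values are made up for illustration; only the estimator formula itself comes from the slides.

```python
import numpy as np

def kde(grid, data, h, kernel="gaussian"):
    """f_h(x) = (1/(n*h)) * sum_i K((x - x_i)/h), evaluated on a grid."""
    u = (grid[:, None] - data[None, :]) / h
    if kernel == "gaussian":
        k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    elif kernel == "triangular":
        k = np.clip(1 - np.abs(u), 0.0, None)
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    return k.sum(axis=1) / (len(data) * h)

def trapezoid(y, x):
    """Trapezoidal-rule integral, to check that the estimate is a density."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2)

rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.6, size=25)        # 25 observations, as on the slide
grid = np.linspace(0.0, 4.0, 81)

f_small = kde(grid, data, 0.2)              # "small" h: wiggly estimate
f_large = kde(grid, data, 0.8)              # "large" h: oversmoothed estimate
f_tri   = kde(grid, data, 0.5, "triangular")
print(trapezoid(f_small, grid))             # close to 1 on this grid
```

Plotting the three estimates over the same histogram reproduces the small-h/large-h contrast of the slide: the kernel shape matters little, the bandwidth a lot.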
[Figure: pointwise difference between f(.) and g(.)]

∫ |f(x) − g(x)| dx = IAE, the "integrated absolute error"
∫ (f(x) − g(x))² dx = ISE, the "integrated squared error"

Other ideas:
- Minimize the maximum vertical distance
- Consider horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995)
- Compare the number and position of modes

Overview of some minimization criteria:
- L1-distance = IAE: difficult mathematical tractability
- L∞-distance = maximum difference: does not consider the overall fit
- "Modern" criteria, which include some measure of the horizontal distances: difficult mathematical tractability
- L2-distance = ISE (and MISE, AMISE, ...): the most commonly used

ISE, MISE, AMISE, ...
- The ISE is a random variable
- MISE = E(ISE), the expectation of the ISE
- AMISE = Taylor approximation of the MISE, easier to calculate
[Figure: MISE (= IV + ISB) and AMISE (= AIV + AISB) as functions of log10(h)]

Essential issues:
- Optimization criteria
- Improvements of the standard model
- Resulting optimal choices of the model parameters K(.) and h

The AMISE-optimal bandwidth

    h_AMISE = [ R(K) / ( mu_2(K)^2 * R(f'') * n ) ]^(1/5)

The AMISE-optimal bandwidth depends on the kernel function K(.).
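The error criteria above are easy to probe numerically. The sketch below (an illustration with assumed sample sizes and bandwidths, not the study's settings) draws repeated samples from a standard normal: the ISE changes from sample to sample, while its Monte Carlo average approximates the MISE.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sd=1.0):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def kde_gauss(grid, data, h):
    u = (grid[:, None] - data[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(data) * h)

grid = np.linspace(-4.0, 4.0, 401)
dx = grid[1] - grid[0]
rng = np.random.default_rng(1)

ise, iae = [], []
for _ in range(200):                       # 200 samples from the true density
    data = rng.normal(size=50)
    diff = kde_gauss(grid, data, 0.4) - normal_pdf(grid)
    ise.append(float(np.sum(diff**2) * dx))       # ISE: a random variable
    iae.append(float(np.sum(np.abs(diff)) * dx))  # IAE: the L1 counterpart

mise_hat = float(np.mean(ise))             # Monte Carlo estimate of MISE = E(ISE)
print(mise_hat)
```

Repeating this for a range of bandwidths traces out the U-shaped MISE curve shown on the slide.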
Among all kernels it is minimized by the "Epanechnikov kernel".
[Figure: Epanechnikov kernel K(u) = (3/4)(1 − u²) on [−1, 1]]

The AMISE-optimal bandwidth also depends on the unknown density f(.) – how to proceed?

Data-driven bandwidth selection methods:
- Leave-one-out selectors:
  - Maximum likelihood cross-validation
  - Least-squares cross-validation (Bowman, 1984)
- Criteria based on substituting R(f'') in the AMISE formula:
  - "Normal rule" ("rule of thumb"; Silverman, 1986)
  - Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990)
  - Smoothed bootstrap

Least-squares cross-validation (LSCV):
- The undisputed selector in the 1980s
- Gives an unbiased estimator of the ISE
- Suffers from more than one local minimizer – no agreement about which one to use
- Bad convergence rate for the resulting bandwidth h_opt

Normal rule ("rule of thumb"):
- Assumes f(x) to be N(mu, sigma^2)
- The easiest selector
- Often oversmooths the function
- The resulting bandwidth is given by h = 1.06 * sigma_hat * n^(−1/5)

Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990):
- Do not substitute R(f'') in the AMISE formula, but estimate it via R(f^(IV)), R(f^(IV)) via R(f^(VI)), etc.
- Another parameter i to choose (the number of stages to go back) – one stage is mostly sufficient
- Better rates of convergence
- Do not finally circumvent the problem of the unknown density, either

The multivariate case: h becomes H, the bandwidth matrix.

Issues of generalization in d dimensions:
- d² bandwidth parameters instead of one
- Unstable estimates
- Bandwidth selectors are essentially straightforward to generalize
- For plug-in methods it is "too difficult" to give succinct expressions for d > 2 dimensions

Aspects of Application

Essential issues:
- Curse of dimensionality
- Connection between goodness-of-fit and optimal classification
- Two methods for discriminatory purposes

The "curse of dimensionality":
The data "disappears" into the distribution tails in high dimensions.
[Figure: probability mass NOT in the "tail" of a multivariate normal density, for 1 to 20 dimensions]
As d grows, a good fit in the tails is desired!

Much data is necessary to keep the estimation error constant in high dimensions:

    Dimensionality:        1   2   3    4    5     6      7      8       9       10
    Required sample size:  4  19  67  223  768  2790  10700  43700  187000  842000

Goodness-of-fit vs. optimal classification:
- AMISE-optimal parameter choice is L2-optimal: worse fit in the tails; many observations required for a reasonable fit
- Optimal classification (in high dimensions) is L1-optimal (misclassification rate): estimation of the tails is important; calculation-intensive for large n

Method 1:
- Reduce the data onto a subspace which allows a reasonably accurate estimation yet does not destroy too much information ("trade-off")
- Use the multivariate kernel density concept to estimate the class densities

Method 2:
- Use the univariate concept to "normalize" the data nonparametrically
- Then use the classical methods LDA and QDA for classification
- Drawback: calculation-intensive
[Figure: a) estimated marginal density f(x); b) its cdf F(x) and the normal cdf G(x), defining the transformation t(x)]

Simulation Study

Criticism of former simulation studies:
- Carried out 20–30 years ago
- Outdated parameter selectors
- Restriction to uncorrelated normals
- Fruitless estimation because of high dimensions
- No dimension reduction

The present simulation study: 21 datasets x 14 estimators x 2 error criteria = 588 classification scores – many results.

Each dataset has...
- ...2 classes for distinction
- ...600 observations per class
- ...200 test observations, 100 produced by each class
- ...therefore dimension 1400 x 10

Univariate prototype distributions:
[Figure: normal; normal with small, medium and large noise; exponential(1); bimodal (close); bimodal (far)]

    Nr.  Abbrev.  contains
    1    NN1      10 normal distributions with "small noise"
    2    NN2      10 normal distributions with "medium noise"
    3    NN3      10 normal distributions with "large noise"
    4    SkN1     2 skewed (exp-)distributions and 7 normals
    5    SkN2     5 skewed (exp-)distributions and 5 normals
    6    SkN3     7 skewed (exp-)distributions and 3 normals
    7    Bi1      4 normals, 4 skewed and 2 bimodal (close) distributions
    8    Bi2      4 normals, 4 skewed and 2 bimodal (close) distributions
    9    Bi3      8 skewed and 2 bimodal (far) distributions
    10   Bi4      8 skewed and 2 bimodal (far) distributions
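The marginal "normalization" of method 2 can be sketched as t(x) = Phi^{-1}(F_hat(x)), with F_hat a Gaussian-kernel cdf estimate. A minimal sketch, assuming a normal-rule bandwidth and an exponential sample standing in for one skewed marginal of the SkN datasets (sample size and seed are illustrative):

```python
import numpy as np
from statistics import NormalDist

def smoothed_cdf(x, data, h):
    """Gaussian-kernel cdf estimate: mean of Phi((x - X_i) / h)."""
    nd = NormalDist()
    return float(np.mean([nd.cdf((x - d) / h) for d in data]))

def normalize(col, train):
    """t(x) = Phi^{-1}(F_hat(x)); F_hat is clipped so the inverse stays finite."""
    h = 1.06 * train.std() * len(train) ** (-0.2)   # univariate normal rule
    nd = NormalDist()
    eps = 1e-6
    return np.array([nd.inv_cdf(min(max(smoothed_cdf(x, train, h), eps), 1 - eps))
                     for x in col])

def skewness(v):
    return float(((v - v.mean()) ** 3).mean() / v.std() ** 3)

rng = np.random.default_rng(2)
skewed = rng.exponential(1.0, size=300)   # one skewed marginal
z = normalize(skewed, skewed)             # nonparametrically "normalized"
print(round(skewness(skewed), 2), round(skewness(z), 2))
```

After the transform the marginal is approximately standard normal, so classical LDA/QDA can be applied to the transformed variables; the nested loop over all observations also shows why the method is calculation-intensive.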
10 datasets with equal covariance matrices
+ 10 datasets with unequal covariance matrices
+ 1 insurance dataset
= 21 datasets in total

14 estimators:
- Method 1 (multivariate density estimator): principal-component reduction onto 2, 3, 4 and 5 dimensions (4) x multivariate "normal rule" or multivariate LSCV criterion (2) = 8 estimators
- Method 2 ("marginal normalizations"): univariate normal rule or Sheather-Jones plug-in (2) x subsequent LDA or QDA (2) = 4 estimators
- Classical methods: LDA and QDA = 2 estimators

2 misclassification criteria:
- The classical misclassification rate ("error rate")
- The Brier score

21 x 14 x 2 = 588 classification scores.

Results

The choice of the misclassification criterion is not essential.
[Figure: Brier score vs. error rate across all estimators and datasets]

The choice of the multivariate bandwidth parameter (method 1) is not essential in most cases; LSCV is superior in the case of bimodals with unequal covariance matrices.
[Figure: error rates for method 1, LSCV vs. "normal rule"]

The choice of the univariate bandwidth parameter (method 2) is not essential.
[Figure: error rates for method 2, Sheather-Jones selector vs. "normal rule"]

The best trade-off is a projection onto 2–3 dimensions.
[Figure: error rate for subspaces of 2–5 dimensions, for the NN-, SkN- and Bi-distributions]

Equal covariance matrices: method 1 performs inferior to LDA; method 2 sometimes slightly improves on LDA.
[Figure: error rates per dataset for classical LDA, the normal rule (in method 2) and LSCV(3) (method 1)]

Unequal covariance matrices: method 1 often performs quite poorly; method 2 essentially improves on the classical method, but not for skewed distributions.
[Figure: error rates per dataset for classical QDA, the normal rule (in method 2) and LSCV(3) (method 1)]

Is the additional calculation time justified?
Required calculation time, in increasing order: LDA, QDA < multivariate "normal rule" < preliminary univariate normalizations, LSCV, Sheather-Jones plug-in.

Summary

Summary (1/3) – Classification performance
- Restriction to only a few dimensions
- Improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices)
- Poor performance of the multivariate kernel density classificator
- LDA is undisputed in the case of equal covariance matrices and equal prior probabilities
- The additional computation time seems not to be justified
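For concreteness, a toy version of method 1 might look as follows: project onto a low-dimensional principal-component subspace, then classify by multivariate kernel density estimates with a diagonal bandwidth from Silverman's multivariate rule of thumb. The data, class separation, and dimensions below are made up for illustration and are not the study's datasets.

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the first k principal components."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k]
    return (X - mu) @ W.T, mu, W

def normal_rule_h(data):
    """Silverman's multivariate rule of thumb: per-coordinate bandwidths."""
    n, d = data.shape
    return data.std(axis=0) * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

def kde_logdens(x, data, h):
    """Log of a product-Gaussian-kernel density estimate (diagonal bandwidth)."""
    u = (x - data) / h
    logk = (-0.5 * (u**2).sum(axis=1) - np.log(h).sum()
            - 0.5 * len(h) * np.log(2 * np.pi))
    m = logk.max()
    return m + np.log(np.mean(np.exp(logk - m)))   # stable log-mean-exp

rng = np.random.default_rng(3)
# two 10-dimensional classes, 600 observations each, differing in two coordinates
A = rng.normal(size=(600, 10)); A[:, :2] += 2.0
B = rng.normal(size=(600, 10))
Z, mu, W = pca_project(np.vstack([A, B]), 2)       # reduce onto 2 dimensions
ZA, ZB = Z[:600], Z[600:]
hA, hB = normal_rule_h(ZA), normal_rule_h(ZB)

test = rng.normal(size=(100, 10)); test[:50, :2] += 2.0  # 50 test points per class
Zt = (test - mu) @ W.T
pred = np.array([0 if kde_logdens(z, ZA, hA) > kde_logdens(z, ZB, hB) else 1
                 for z in Zt])
truth = np.array([0] * 50 + [1] * 50)
print("error rate:", float(np.mean(pred != truth)))
```

The per-test-point loop over both full training sets makes the extra calculation cost relative to LDA/QDA directly visible.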
Summary (2/3) – KDE for data description
- Great variety of error criteria, parameter selection procedures and additional model improvements (3 dimensions)
- No agreement about a feasible error criterion
- Nobody knows what is finally optimized ("upper bounds" in the L1-theory; ISE vs. MISE vs. AMISE in the L2-theory; several minima in LSCV, ...)
- Different parameter selectors are of varying quality with respect to different underlying densities

Summary (3/3) – Theory vs. application
- Comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification
- For discriminatory purposes the issue of estimating log-densities is much more important
- Some univariate model improvements are not generalizable
- The widely ignored "curse of dimensionality" forces the user into a trade-off between necessary dimension reduction and information loss
- Dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time

The End