NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 1 Ricardo Gutierrez-Osuna Wright State University Outline g Introduction to Pattern Recognition g Dimensionality reduction g Classification g Validation g Clustering NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 2 Ricardo Gutierrez-Osuna Wright State University SECTION I: Introduction g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 3 Ricardo Gutierrez-Osuna Wright State University Features, patterns and classifiers g Feature n Feature is any distinctive aspect, quality or characteristic g Features may be symbolic (i.e., color) or numeric (i.e., height) n The combination of d features is represented as a d-dimensional column vector called a feature vector g The d-dimensional space defined by the feature vector is called feature space g Objects are represented as points in feature space. This representation is called a scatter plot Feature 2 x1 x 2 x= x d x3 Class 1 Class 3 x x1 x2 Class 2 Feature 1 Feature vector NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Feature space (3D) Statistical Pattern Recognition 4 Scatter plot (2D) Ricardo Gutierrez-Osuna Wright State University Features, patterns and classifiers g Pattern n Pattern is a composite of traits or features characteristic of an individual n In classification, a pattern is a pair of variables {x,ω} where g x is a collection of observations or features (feature vector) g ω is the concept behind the observation (label) g What makes a “good” feature vector? 
n The quality of a feature vector is related to its ability to discriminate examples from different classes g Examples from the same class should have similar feature values g Examples from different classes have different feature values “Good” features NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 “Bad” features Statistical Pattern Recognition 5 Ricardo Gutierrez-Osuna Wright State University Features, patterns and classifiers g More feature properties Linear separability Non-linear separability Multi-modal Highly correlated features g Classifiers n The goal of a classifier is to partition feature space into classlabeled decision regions n Borders between decision regions are called decision boundaries R1 R1 R2 R3 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 R3 R2 Statistical Pattern Recognition 6 R4 Ricardo Gutierrez-Osuna Wright State University Components of a pattern recognition system g A typical pattern recognition system contains n n n n n A sensor A preprocessing mechanism A feature extraction mechanism (manual or automated) A classification or description algorithm A set of examples (training set) already classified or described Feedback / Adaptation The “real world” Sensor / transducer Preprocessing and enhancement Classification algorithm Class assignment Description algorithm Description Feature extraction [Duda, Hart and Stork, 2001] NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 7 Ricardo Gutierrez-Osuna Wright State University An example g Consider the following scenario* n A fish processing plan wants to automate the process of sorting incoming fish according to species (salmon or sea bass) n The automation system consists of g g g g g a conveyor belt for incoming products two conveyor belts for sorted products a pick-and-place robotic arm a vision system with an overhead CCD camera a computer to analyze images and control the robot arm CCD camera Conveyor belt (salmon) computer Conveyor belt Robot arm Conveyor belt (bass) *Adapted from [Duda, Hart and Stork, 2001] NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 8 Ricardo Gutierrez-Osuna Wright State University An example g Sensor n The camera captures an image as a new fish enters the sorting area g Preprocessing n Adjustments for average intensity levels n Segmentation to separate fish from background g Feature Extraction n Suppose we know that, on the average, sea bass is larger than salmon g Classification n Collect a set of examples from both species g Plot a distribution of lengths for both classes n Determine a decision boundary (threshold) that minimizes the classification error g We estimate the system’s probability of error and obtain a discouraging result of 40% Decision boundary count Salmon Sea bass n What is next? length NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 9 Ricardo Gutierrez-Osuna Wright State University An example g Improving the performance of our PR system n Committed to achieve a recognition rate of 95%, we try a number of features g Width, Area, Position of the eyes w.r.t. mouth... g only to find out that these features contain no discriminatory information n Finally we find a “good” feature: average intensity of the scales Decision boundary Sea bass Salmon Avg. 
scale intensity n We combine “length” and “average intensity of the scales” to improve class separability n We compute a linear discriminant function to separate the two classes, and obtain a classification rate of 95.7% NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 10 length count Decision boundary Sea bass Salmon Avg. scale intensity Ricardo Gutierrez-Osuna Wright State University An example g Cost Versus Classification rate n Is classification rate the best objective function for this problem? g The cost of misclassifying salmon as sea bass is that the end customer will occasionally find a tasty piece of salmon when he purchases sea bass g The cost of misclassifying sea bass as salmon is a customer upset when he finds a piece of sea bass purchased at the price of salmon Decision boundary Sea bass New Decision boundary length length n We could intuitively shift the decision boundary to minimize an alternative cost function Sea bass Salmon Salmon Avg. scale intensity NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 11 Avg. scale intensity Ricardo Gutierrez-Osuna Wright State University An example g The issue of generalization g We then design an artificial neural network with five hidden layers, a combination of logistic and hyperbolic tangent activation functions, train it with the Levenberg-Marquardt algorithm and obtain an impressive classification rate of 99.9975% with the following decision boundary length n The recognition rate of our linear classifier (95.7%) met the design specs, but we still think we can improve the performance of the system Sea bass Salmon Avg. scale intensity n Satisfied with our classifier, we integrate the system and deploy it to the fish processing plant g A few days later the plant manager calls to complain that the system is misclassifying an average of 25% of the fish g What went wrong? NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 12 Ricardo Gutierrez-Osuna Wright State University Review of probability theory g Probability n Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed A2 A1 A4 Probability Law probability Sample space A3 A1 A2 A3 A4 event g Conditional Probability n If A and B are two events, the probability of event A when we already know that event B has occurred P[A|B] is defined by the relation P[A | B] = P[A I B] for P[B] > 0 P[B] g P[A|B] is read as the “conditional probability of A conditioned on B”, or simply the “probability of A given B” NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 13 Ricardo Gutierrez-Osuna Wright State University Review of probability theory g Conditional probability: graphical interpretation S A A∩B B S “B has occurred” A A∩B B g Theorem of Total Probability n Let B1, B2, …, BN be mutually exclusive events, then N P[A] = P[A | B1 ]P[B1 ] + ...P[A | BN ]P[B N ] = ∑ P[A | Bk ]P[B k ] k =1 B1 B3 BN-1 A B2 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 BN Statistical Pattern Recognition 14 Ricardo Gutierrez-Osuna Wright State University Review of probability theory g Bayes Theorem n Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj? 
n Using the definition of conditional probability and the Theorem of total probability we obtain P[B j | A] = P[A I B j ] P[A] = P[A | B j ] ⋅ P[B j ] N ∑ P[A | B k ] ⋅ P[Bk ] k =1 n Bayes Theorem is definitely the fundamental relationship in Statistical Pattern Recognition Rev. Thomas Bayes (1702-1761) NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 15 Ricardo Gutierrez-Osuna Wright State University Review of probability theory g For pattern recognition, Bayes Theorem can be expressed as P(ω j | x) = P(x | ω j ) ⋅ P(ω j ) N ∑ P(x | ωk ) ⋅ P(ωk ) = P(x | ω j ) ⋅ P(ω j ) P(x) k =1 n where ωj is the ith class and x is the feature vector g Each term in the Bayes Theorem has a special name, which you should be familiar with n n n n P(ωi) P(ωi|x) P(x|ωi) P(x) Prior probability (of class ωi) Posterior Probability (of class ωi given the observation x) Likelihood (conditional prob. of x given class ωi) A normalization constant that does not affect the decision g Two commonly used decision rules are n Maximum A Posteriori (MAP): choose the class ωi with highest P(ωi|x) n Maximum Likelihood (ML): choose the class ωi with highest P(x|ωi) g ML and MAP are equivalent for non-informative priors (P(ωi)=constant) NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 16 Ricardo Gutierrez-Osuna Wright State University Review of probability theory g Characterizing features/vectors n Complete: Probability mass/density function 1 1 pmf pdf 5/6 4/6 3/6 2/6 1/6 x(lb) pdf for a person’s weight 100 200 300 400 500 x pmf for rolling a (fair) dice 1 2 3 4 5 6 n Partial: Statistics g Expectation n The expectation represents the center of mass of a density g Variance n The variance represents the spread about the mean g Covariance (only for random vectors) n The tendency of each pair of features to vary together, i.e., to co-vary NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 17 Ricardo Gutierrez-Osuna Wright State University Review of probability theory g The covariance matrix (cont.) COV[X] = ∑ = E[(X − ; − T ] E[(x1 − 1 )(x1 − 1 )] ... E[(x1 − ... = E[(xN − N )(x1 − 1 )] ... E[(xN − O )] σ 12 ... c1N = ... ... 2 c 1N ... 
σ N N )(x N − N )] 1 )(xN − N n The covariance terms can be expressed as 2 c ii = σ i and c ik = ρ ikσ iσ k g where ρik is called the correlation coefficient g Graphical interpretation Xk Xk Xk Xi C ik=-σiσk ρik=-1 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Xk Xi C ik=-½σiσ k ρik=-½ Xk Xi C ik=0 ρik=0 Statistical Pattern Recognition 18 Xi C ik=+½σiσk ρik=+½ Xi C ik=σ iσk ρik=+1 Ricardo Gutierrez-Osuna Wright State University Review of probability theory g Meet the multivariate Normal or Gaussian density N(µ,Σ): fX (x) = 1 (2 π )n/2 ∑ 1/2 1 exp − (X − 2 T ∑ −1(X − n For a single dimension, this formula reduces to the familiar expression 1 X - 2 1 f X (x) = exp− 2 σ 2π σ g Gaussian distributions are very popular n The parameters (µ,Σ) are sufficient to uniquely characterize the normal distribution n If the xi’s are mutually uncorrelated (cik=0), then they are also independent g The covariance matrix becomes diagonal, with the individual variances in the main diagonal n Marginal and conditional densities n Linear transformations NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 19 Ricardo Gutierrez-Osuna Wright State University SECTION II: Dimensionality reduction g The “curse of dimensionality” g Feature extraction vs. feature selection g Feature extraction criteria g Principal Components Analysis (briefly) g Linear Discriminant Analysis g Feature Subset Selection NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 20 Ricardo Gutierrez-Osuna Wright State University Dimensionality reduction g The “curse of dimensionality” [Bellman, 1961] n Refers to the problems associated with multivariate data analysis as the dimensionality increases g Consider a 3-class pattern recognition problem n A simple (Maximum Likelihood) procedure would be to g Divide the feature space into uniform bins g Compute the ratio of examples for each class at each bin and, g For a new example, find its bin and choose the predominant class in that bin n We decide to start with one feature and divide the real line into 3 bins x1 g Notice that there exists a large overlap between classes ⇒ to improve discrimination, we decide to incorporate a second feature NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 21 Ricardo Gutierrez-Osuna Wright State University Dimensionality reduction g Moving to two dimensions increases the number of bins from 3 to 32=9 n QUESTION: Which should we maintain constant? g The density of examples per bin? This increases the number of examples from 9 to 27 g The total number of examples? This results in a 2D scatter plot that is very sparse x2 Constant density x2 Constant # examples x2 x1 x1 g Moving to three features … n The number of bins grows to 33=27 n To maintain the initial density of examples, the number of required examples grows to 81 n For the same number of examples the 3D scatter plot is almost empty NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 22 x1 x3 Ricardo Gutierrez-Osuna Wright State University Dimensionality reduction g Implications of the curse of dimensionality n Exponential growth with dimensionality in the number of examples required to accurately estimate a function g How do we beat the curse of dimensionality? 
n By incorporating prior knowledge n By providing increasing smoothness of the target function n By reducing the dimensionality g In practice, the curse of dimensionality means that g In most cases, the information that was lost by discarding some features is compensated by a more accurate mapping in lowerdimensional space performance n For a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve dimensionality NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 23 Ricardo Gutierrez-Osuna Wright State University The curse of dimensionality g Implications of the curse of dimensionality n Exponential growth in the number of examples required to maintain a given sampling density g For a density of N examples/bin and D dimensions, the total number of examples grows as ND n Exponential growth in the complexity of the target function (a density) with increasing dimensionality g “A function defined in high-dimensional space is likely to be much more complex than a function defined in a lower-dimensional space, and those complications are harder to discern” –Jerome Friedman n This means that a more complex target function requires denser sample points to learn it well! n What to do if it ain’t Gaussian? g For one dimension, a large number of density functions are available g For most multivariate problems, only the Gaussian density is employed g For high-dimensional data, the Gaussian density can only be handled in a simplified form! n Human performance in high dimensions g Humans have an extraordinary capacity to discern patterns in 1-3 dimensions, but these capabilities degrade drastically for 4 or higher dimensions NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 24 Ricardo Gutierrez-Osuna Wright State University Dimensionality reduction g Two approaches to perform dim. 
reduction, ℜN → ℜM (M < N)
n Feature selection: choosing a subset of all the features
[x1 x2 ... xN] → [x_i1 x_i2 ... x_iM]
n Feature extraction: creating new features by combining the existing ones
[x1 x2 ... xN] → [y1 y2 ... yM] = f([x1 x2 ... xN])
g In either case, the goal is to find a low-dimensional representation of the data that preserves (most of) the information or structure in the data
g Linear feature extraction
n The "optimal" mapping y=f(x) is, in general, a non-linear function whose form is problem-dependent
g Hence, feature extraction is commonly limited to linear projections y = Wx, i.e., [y1 ... yM]^T = W [x1 ... xN]^T, where W is an M×N matrix of weights w_ij

Signal representation versus classification
g Two criteria can be used to find the "optimal" feature extraction mapping y=f(x)
n Signal representation: the goal of feature extraction is to represent the samples accurately in a lower-dimensional space
n Classification: the goal of feature extraction is to enhance the class-discriminatory information in the lower-dimensional space
g Within the realm of linear feature extraction, two techniques are commonly used
n Based on signal representation: Principal Components Analysis (PCA)
n Based on classification: Fisher's Linear Discriminant Analysis (LDA)
[Scatter plot: a two-class dataset with the PCA (signal representation) and LDA (classification) projection axes, plotted against Feature 1 and Feature 2]

Principal Components Analysis
g Summary
n The optimal* approximation of a random vector x∈ℜN by a linear combination of M (M<N) independent vectors is obtained by projecting the random vector x onto the eigenvectors ϕi corresponding to the largest eigenvalues λi of the covariance matrix Σx
g *Optimality is defined as the minimum of the sum-square magnitude of the approximation error
n Since PCA uses the eigenvectors of the covariance matrix Σx, it is able to find the independent axes of the data under the unimodal Gaussian assumption
g However, for non-Gaussian or multi-modal Gaussian data, PCA simply decorrelates the axes
n The main limitation of PCA is that it does not consider class separability, since it does not take into account the class label of the feature vector
g PCA simply performs a coordinate rotation that aligns the transformed axes with the directions of maximum variance
g There is no guarantee that the directions of maximum variance will contain good features for discrimination!
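As a concrete companion to this summary, the sketch below (an illustration written for these notes, not course code) computes PCA exactly as described: estimate the covariance matrix Σx, take the eigenvectors with the largest eigenvalues, and project the centered data onto them. The function name and numeric values are assumptions; the mean and covariance used in the example are an approximate reading of the 3-D Gaussian on the next slide and should be treated as illustrative only.

```python
import numpy as np

def pca_project(X, n_components):
    """Project data onto the top principal components.

    X: (N, d) data matrix, one example per row.
    Returns the (N, n_components) projections, the projection matrix W,
    and the eigenvalues sorted in decreasing order.
    """
    # Center the data and estimate the covariance matrix Sigma_x
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = np.cov(Xc, rowvar=False)

    # Eigen-decomposition; eigh is used because Sigma is symmetric
    eigvals, eigvecs = np.linalg.eigh(Sigma)

    # Sort eigenvectors by decreasing eigenvalue (largest variance first)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the leading n_components eigenvectors and project
    W = eigvecs[:, :n_components]          # (d, n_components)
    Y = Xc @ W                             # (N, n_components)
    return Y, W, eigvals

# Illustrative 3-D Gaussian (values approximate the example on the next slide).
rng = np.random.default_rng(0)
mu = np.array([0.0, 5.0, 2.0])
Sigma = np.array([[25.0, -1.0,  7.0],
                  [-1.0,  4.0, -4.0],
                  [ 7.0, -4.0, 10.0]])
X = rng.multivariate_normal(mu, Sigma, size=1000)
Y, W, lam = pca_project(X, n_components=2)
print(lam)                      # first PC has the largest variance
print(np.cov(Y, rowvar=False))  # projections are (approximately) de-correlated
```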
NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 27 Ricardo Gutierrez-Osuna Wright State University PCA example g 3D Gaussian distribution 12 n Mean and covariance are = [0 5 2] T and 10 7 25 − 1 = −1 4 − 4 7 − 4 10 8 6 4 3 x g RESULTS 2 0 -2 n First PC has largest variance n PCA projections are de-correlated -4 -6 -8 10 5 -5 -10 0 0 x1 x2 12 15 10 10 5 10 10 5 8 6 2 y 0 3 y 5 4 2 0 -5 0 -5 -10 -15 -10 -5 0 y1 5 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 10 15 -2 -15 -10 -5 0 y1 5 Statistical Pattern Recognition 28 10 15 -10 -8 -6 -4 -2 0 y2 2 Ricardo Gutierrez-Osuna Wright State University 4 6 8 10 Linear Discriminant Analysis, two-classes g The objective of LDA is to perform dimensionality reduction while preserving as much of the class discriminatory information as possible n Assume a set of D-dimensional samples {x1, x2, …, xN}, N1 of which belong to class ω1, and N2 to class ω2 n We seek to obtain a scalar y by projecting the samples x onto a line y=wTx n Of all possible lines we want the one that maximizes the separability of the scalars x2 x2 x1 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 29 x1 Ricardo Gutierrez-Osuna Wright State University Linear Discriminant Analysis, two-classes g In order to find a good projection vector, we need to define a measure of separation between the projections n We could choose the distance between the projected means J( w ) = ~1 − ~2 = w T ( 1 − 2 ) g Where the mean values of the x and y examples are i = 1 Ni ∑x x∈ and i ~ = 1 i Ni 1 ∑y = N ∑w y∈ i y∈ i T x = wT i i n However, the distance between projected means is not a very good measure since it does not take into account the standard deviation within the classes x2 µ1 This axis yields better class separability µ2 x1 This axis has a larger distance between means NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 30 Ricardo Gutierrez-Osuna Wright State University Linear Discriminant Analysis, two-classes n The solution proposed by Fisher is to normalize the difference between the means by a measure of the within-class variance g For each class we define the scatter, an equivalent of the variance, as ~ si2 = ∑ (y − ~ ) 2 i y∈ i g and the quantity (~ s12 + ~ s22 ) is called the within-class scatter of the projected examples n The Fisher linear discriminant is defined as the linear function wTx that maximizes the criterion function ~ −~ 2 1 2 J(w) = ~ 2 ~ 2 s +s 1 2 n In a nutshell: we look for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as farther apart as possible x2 µ1 µ2 x1 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 31 Ricardo Gutierrez-Osuna Wright State University Linear Discriminant Analysis, two-classes n Noticing that (~1 − ~2 )2 = (w T 1 − wT ) 2 2 T T )( 1 − 2 ) w = w SB w 144424443 = wT ( 1 − 2 BETWEEN - CLASS SCATTER SB ~ si2 = ∑ (y − ~ ) = ∑ (w 2 i y∈ i x∈ T x − wT ) 2 i = w T ∑ (x − i 14243 ~ s12 + ~ s22 = w T (S1 + S 2 ) w = w T S W w )(x − i )T w = w TSi w 144424443 x∈ i i CLASS SCATTER S i WITHIN -CLASS SCATTER S W n We can express J(w) in terms of x and w as w T SB w J( w ) = T w SW w n It can be shown that the projection vector w* which maximizes J(w) is w T SB w −1 w* = argmax T = SW ( 1 − w w SW w NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern 
Recognition 32 2 ) Ricardo Gutierrez-Osuna Wright State University Linear Discriminant Analysis, C-classes g Fisher’s LDA generalizes for C-class problems very gracefully n For the C class problem we will seek (C-1) projection vectors wi, which can be arranged by columns into a projection matrix W=[w1|w2|…|wC-1] g The solution uses a slightly different formulation n The within-class scatter matrix is similar as the two-class problem C C S W = ∑ S i = ∑ ∑ (x − i=1 i =1 x∈ i )(x − i )T i n The generalization of the between-class scatter matrix is C SB = ∑ Ni ( i − i=1 )( i − )T g Where µ is the mean of the entire dataset n The objective function becomes ~ SB W T SB W J( W ) = ~ = W TS W W SW g Recall that we are looking for a projection that, in some sense, maximizes the ratio of between-class to within-class scatter: n Since the projection is not scalar (it has C-1 dimensions), we have used the determinant of the scatter matrices to obtain a scalar criterion function NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 33 Ricardo Gutierrez-Osuna Wright State University Linear Discriminant Analysis, C-classes n It can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem [ W* = w | w | * 1 * 2 L| w ] * C −1 W T SB W * = argmax T ⇒ (SB − iS W )w i = 0 W SW W g NOTE n SB is the sum of C matrices of rank one or less and the mean vectors are constrained by ΣNiµi=Nµ g Therefore, SB will be at most of rank (C-1) g This means that only (C-1) of the eigenvalues λi will be non-zero n Therefore, LDA produces at most C-1 feature projections g If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 34 Ricardo Gutierrez-Osuna Wright State University PCA Versus LDA 1 77 axis 2 44 2 29 2 58 93 7 6 8 46 2 1 5 2 9 10710410 10 4 98 1 511 664 3 2 7 7 4 9 9 8 2 1 58 9 710 4 6 89 10 10 77710 7 6 6 2 26 55255899 3285933 10 7 10 7 7 10 4 6 4 1 9 10 10 9 3 10 6 2 0 4 4 7 10 4 46 6 16 22125 19858 9 8 8 8 6 6 9 1 21 95 538 5 3 7 -2 46 2 61 21 2 9 85 10 4 2 5 6 6 5 10 7 4 15 31 2 8 383 1 -4 9 3 335 10 15 1 3 7 3 -6 83 3 2 4 PCA 10 4 10 6 5 axis 3 4 6 0 -5 -10 -8 -5 2 -10 -10 6 -5 7 7 7 10 74 77 4 44 7 1010 7 27 10 41010 4 2 9 7 7 10 4 6 28 10 46 6 109 7 7 7 10107 7 4 10 2 5 7 41 12 5528 22 85 10 6 692 9118 10 6 44 2564 5 10 10 1 2825 51 9 9 29 74410 10 466 5 664 61566229 124 1939 1 3988 93 589 538 1 6 9 6 5 85313 5 8 1769 10 62122985 9 9 8 89 51 2 134 6 2 1 5 8 2 8 8 1 3313 8 3 5 3 3 53 83 5 5 3 3 3 0 0 8 -5 5 0 axis 1 5 10 10 -10 axis 2 axis 1 -3 x 10 7 7 77 7 7777 7 7777 77 7 7 15 5 0 axis 3 LDA axis 2 10 -3 x 10 5 95 999 995555 9 9 59555 99 55555 8 9 9899 8 98 8 5555 888 8 8 5 0 88 338 8 38 33338 8 333 333333 3 -5 3 -0.02 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 2 22 22222 22 2222 2 2 11 11 1 11 12 1 2 11111 11 11 -0.01 -5 -10 -15 4 444 4444444 4 44 4 44 4 6 66 6 666666666 66 6 6 66 6 0 0.01 axis 1 0.02 0.03 9 99 9999 99 9 9 55 55 5252 9 5 9 9 8888 8 5 5 2 9 8 5 2 58 9 55 2 52 2522 5 22 8888 81212522 8888822 11111 11 13111 1 111 6 6 333 6666 666 333 666 6 3 3 3 6666 3 6 33 3 6 3 33 3 7 77 7 7 7 777 7777 7 7 4 444 44 44444444 44 4 4 10 10 10 10 10 10 10 10 10 10 10 1010 10 10 10 15 -0.02 10 10 10 10 10 10 10 10 10 10 1010 10 10 10 10 10 10 
0.04 0.05 10 0 5 0.02 0.04 axis 1 Statistical Pattern Recognition 35 0 -5 axis 2 Ricardo Gutierrez-Osuna Wright State University -3 x 10 Limitations of LDA g LDA assumes unimodal Gaussian likelihoods n If the densities are significantly non-Gaussian, LDA may not preserve any complex structure of the data needed for classification µ1=µ2=µ ω1 ω2 ω2 µ1 ω1 µ1=µ2=µ ω1 ω2 ω1 µ2 g LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data x2 PC A LD A ω2 x1 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 36 Ricardo Gutierrez-Osuna Wright State University Limitations of LDA g LDA has a tendency to overfit training data n To illustrate this problem, we generate an artificial dataset g Three classes, 50 examples per class, with the exact same likelihood: a multivariate Gaussian with zero mean and identity covariance g As we arbitrarily increase the number of dimensions, classes appear to separate better, even though they come from the same distribution 10 dimensions 50 dimensions 3 2 2 2 2 22 1 1 2 12 221 2 3 2 3 1 11 22 3 23 3 2 1 3 313 11 2 31 2 22322 3 1 1 2213 133 1 1 3 3 1 2 1 331 2 23 32 3 213 3 23 13 2 1 1 1311211 1 3 2 22 3 1 2 2 2 31 2 1 2213 1132111 3223323113 2 33 1 3 1 3 3 21 1 1333 2121 3 3 33 1 3 1 1 1 1 1 2 121 1 1 2 1 1 1 1 1 1 3 1 2 1 11 1 1 1 11 2 2 2 1 1 1 2 21 1 3 323 1 31 32 133 2221 2 3 1 11 1 22 3 1 33 3331 1 2 122 22 1 31 2 22 2 2 3 3 3 33 313 2 2 3 3 1 2 222 2 2 2 3 3 3133 122 2 2 2 2 3 3 2 1 2 2 3 33 2 3 3 1 3 3 33 3 32 3 1 3 2 3 2 100 dimensions 150 dimensions 1 1 1 11 1 1 1 111 1 1 1 1 11 3 3 1 111 111 111 1 1 1 3 333 3 33 111 1 1111 1 1 3 3333 1 33 33 3 1 11 3 3 3333 33 33 2 3 33 2 3 2 2 33 12 333 2212 2 2 2 22 2 2 3 3 3 2 2 22 222 3 3 22 22 222222222 2 2 2 2 2 2222 2 22 2 3 3 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 3 22 222 222222222 2 2 2 2 222 22 222222 2222 2 2 2 2 33 33333 333 333333 3 33 3333 3333333 3333333 3 3 33 3 1111111 111111 1 11 1111111 111 11 1111 11 1 1 1 1 Statistical Pattern Recognition 37 Ricardo Gutierrez-Osuna Wright State University Feature Subset Selection g Why Feature Subset Selection? n Feature Subset Selection is necessary in a number of situations g Features may be expensive to obtain n You evaluate a large number of features (sensors) in the test bed and select only a few for the final implementation g You may want to extract meaningful rules from your classifier n When you transform or project, the measurement units (length, weight, etc.) 
of your features are lost
g Features may not be numeric
n A typical situation in the machine learning domain
g Although FSS can be thought of as a special case of feature extraction (think of an identity projection matrix with a few diagonal zeros), in practice it is a quite different problem
n FSS looks at dimensionality reduction from a different perspective
n FSS has a unique set of methodologies

Search strategy and objective function
g Feature Subset Selection requires
n A search strategy to select candidate subsets
n An objective function to evaluate candidates
g Search strategy
n Exhaustive evaluation involves C(N,M) feature subsets for a fixed value of M, and 2^N subsets if M must be optimized as well
g This number of combinations is unfeasible, even for moderate values of M and N
n For example, exhaustive evaluation of 10 out of 20 features involves 184,756 feature subsets; exhaustive evaluation of 10 out of 100 features involves more than 10^13 feature subsets
n A search strategy is needed to explore the space of all possible feature combinations
g Objective function
n The objective function evaluates candidate subsets and returns a measure of their "goodness", a feedback signal that is used by the search strategy to select new candidates
[Diagram: the search module proposes feature subsets from the complete feature set, the objective function scores their "goodness" on the training data, and the final feature subset is passed to the PR algorithm]

Objective function
g Objective functions are divided into two groups
n Filters: the objective function evaluates feature subsets by their information content, typically interclass distance, statistical dependence or information-theoretic measures
n Wrappers: the objective function is a pattern classifier, which evaluates feature subsets by their predictive accuracy (recognition rate on test data) through statistical resampling or cross-validation
[Diagram: filter FSS scores subsets by their information content; wrapper FSS scores subsets by the predictive accuracy of the ML/PR algorithm itself]

Filter types
g Distance or separability measures
n These methods use distance metrics to measure class separability:
g Distance between classes: Euclidean, Mahalanobis, etc.
g Determinant of S_W^{-1} S_B (the LDA eigenvalues)
g Correlation and information-theoretic measures
n These methods assume that "good" subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with each other
g Linear relation measures
n The linear relationship between variables can be measured using the correlation coefficient:
J(Y_M) = ( Σ_{i=1..M} ρ_ic ) / ( Σ_{i=1..M} Σ_{j=i+1..M} ρ_ij )
g where ρ_ic is the correlation between feature i and the class label, and ρ_ij the correlation between features i and j
g Non-linear relation measures
n Correlation is only capable of measuring linear dependence; a small numeric sketch of the correlation-based criterion above is given below.
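A minimal numeric sketch of the correlation-based criterion J(Y_M) (written for these notes, not course code). Absolute Pearson correlations are used so that the denominator stays positive, which is an assumption of this illustration, as are the toy data and the function name.

```python
import numpy as np

def corr_filter_score(X, y, subset):
    """Correlation-based filter criterion for a candidate feature subset.

    Score = sum_i |rho(x_i, class)| / sum_{i<j} |rho(x_i, x_j)|:
    features should be correlated with the class label but uncorrelated
    with each other.
    """
    Xs = X[:, subset]
    M = Xs.shape[1]
    # Correlation of each selected feature with the class label
    num = sum(abs(np.corrcoef(Xs[:, i], y)[0, 1]) for i in range(M))
    # Pairwise correlations between the selected features
    den = sum(abs(np.corrcoef(Xs[:, i], Xs[:, j])[0, 1])
              for i in range(M) for j in range(i + 1, M))
    return num / den if den > 0 else np.inf

# Toy data: x0 and x2 are informative, x1 is a near-copy of x0
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
x0 = y + 0.5 * rng.normal(size=200)
x1 = x0 + 0.1 * rng.normal(size=200)          # redundant feature
x2 = 0.8 * y + 0.5 * rng.normal(size=200)     # informative, less redundant
X = np.column_stack([x0, x1, x2])
print(corr_filter_score(X, y, [0, 1]))   # highly correlated pair: lower score
print(corr_filter_score(X, y, [0, 2]))   # complementary pair: higher score
```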
A more powerful method is the mutual information I(Yk;C) C J(YM ) = I(YM; C) = H(C ) − H(C | YM ) = ∑ ∫ P(YM, c =1 YM c ) lg P(YM, c ) dx P(YM ) P( c ) g The mutual information between feature vector and class label I(YM;C) measures the amount by which the uncertainty in the class H(C) is decreased by knowledge of the feature vector H(C|YM), where H(·) is the entropy function NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 41 Ricardo Gutierrez-Osuna Wright State University Filters Vs. Wrappers g Wrappers n Advantages g Accuracy: wrappers generally achieve better recognition rates than filters since they are tuned to the specific interactions between the classifier and the dataset g Ability to generalize: wrappers have a mechanism to avoid overfitting, since they typically use cross-validation measures of predictive accuracy n Disadvantages g Slow execution: since the wrapper must train a classifier for each feature subset (or several classifiers if cross-validation is used), the method is unfeasible for computationally intensive classifiers g Lack of generality: the solution lacks generality since it is tied to the bias of the classifier used in the evaluation function g Filters n Advantages g Fast execution: Filters generally involve a non-iterative computation on the dataset, which can execute much faster than a classifier training session g Generality: Since filters evaluate the intrinsic properties of the data, rather than their interactions with a particular classifier, their results exhibit more generality: the solution will be “good” for a larger family of classifiers n Disadvantages g Tendency to select large subsets: Since the filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution. 
This forces the user to select an arbitrary cutoff on the number of features to be selected NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 42 Ricardo Gutierrez-Osuna Wright State University Naïve sequential feature selection g One may be tempted to evaluate each individual feature separately and select those M features with the highest scores x2 ωω45 ω3 n Unfortunately, this strategy will very rarely work since it does not account for feature dependence g An example will help illustrate the poor performance of this naïve approach n The scatter plots show a 4-dimensional pattern recognition problem with 5 classes ω2 ω1 x1 x4 g The objective is to select the best subset of 2 features using the naïve sequential FSS procedure n A reasonable objective function will generate the following feature ranking: J(x1)>J(x2)≈J(x3)>J(x4) g The optimal feature subset turns out to be {x1, x4}, because x4 provides the only information that x1 needs: discrimination between classes ω4 and ω5 g If we were to choose features according to the individual scores J(xk), we would choose x1 and either x2 or x3, leaving classes ω4 and ω5 non separable NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 43 ω5 ω1 ω3 ω2 ω4 x3 Ricardo Gutierrez-Osuna Wright State University Search strategies g FSS search strategies can be grouped in three categories* n Sequential g These algorithms add or remove features sequentially, but have a tendency to become trapped in local minima n Sequential Forward Selection n Sequential Backward Selection n Sequential Floating Selection n Exponential g These algorithms evaluate a number of subsets that grows exponentially with the dimensionality of the search space n Exhaustive Search (already discussed) n Branch and Bound n Beam Search n Randomized g These algorithms incorporating randomness into their search procedure to escape local minima n Simulated Annealing n Genetic Algorithms *The subsequent description of FSS is borrowed from [Doak, 1992] NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 44 Ricardo Gutierrez-Osuna Wright State University Sequential Forward Selection (SFS) g Sequential Forward Selection is a simple greedy search 1. 1. Start Start with with the the empty empty set set Y={∅} Y={∅} + 2. 2. Select Select the the next next best best feature feature x = arg max [J(Yk + x )] x∈X − Yk =Y +x; k=k+1 3. 3. Update Update Y Yk+1 k+1=Ykk+x; k=k+1 4. 4. Go Go to to 22 g Notes Empty feature set n SFS performs best when the optimal subset has a small number of features g When the search is near the empty set, a large number of states can be potentially evaluated g Towards the full set, the region examined by SFS is narrower since most of the features have already been selected g The main disadvantage of SFS is that it is unable to remove features that become obsolete with the addition of new features NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 45 Full feature set Ricardo Gutierrez-Osuna Wright State University Sequential Backward Selection (SBS) g Sequential Backward Selection works in the opposite manner as SFS 1. 1. Start Start with with the the full full set set Y=X Y=X − 2. 2. Remove Remove the the worst worst feature feature x = arg max [J(Yk − x )] x∈Yk =Y -x; k=k+1 3. 3. Update Update Y Yk+1 k+1=Ykk-x; k=k+1 4. 4. 
Go Go to to 22 Empty feature set g Notes n SBS works best when the optimal feature subset has a large number of features, since SBS spends most of its time visiting large subsets n The main limitation of SBS is its inability to reevaluate the usefulness of a feature after it has been discarded Full feature set NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 46 Ricardo Gutierrez-Osuna Wright State University Branch and Bound g The Branch and Bound algorithm is guaranteed to find the optimal feature subset under the monotonicity assumption Empty feature set n The monotonicity assumption states that the addition of features can only increase the value of the objective function, this is ( ) ( ) ( ) J x i1 < J x i1 , x i2 < J x i1 , x i2 , x i3 < L < J(x , x ,L, x i1 i2 iN ) n Branch and Bound starts from the full set and removes features using a depth-first strategy g Nodes whose objective function are lower than the current best are not explored since the monotonicity assumption ensures that their children will not contain a better solution NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 47 Full feature set Ricardo Gutierrez-Osuna Wright State University Branch and Bound g Algorithm n The algorithm is better explained by considering the subsets of M’=N-M features already discarded, where N is the dimensionality of the state space and M is the desired number of features n Since the order of the features is irrelevant, we will only consider an increasing ordering i1<i2<...iM’ of the feature indices, this will avoid exploring states that differ only in the ordering of their features n The Branch and Bound tree for N=6 and M=2 is shown below (numbers indicate features that are being removed) g Notice that at the level directly below the root we only consider removing features 1, 2 or 3, since a higher number would not allow sequences (i1<i2<i3<i4) with four indices 1 2 2 4 3 4 NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 5 3 6 5 5 6 6 4 5 6 4 5 5 6 6 Statistical Pattern Recognition 48 3 4 5 6 3 4 4 5 5 5 6 6 6 Ricardo Gutierrez-Osuna Wright State University Branch and Bound 1. 1. 2. 2. 3. 3. Initialize: Initialize: αα==-∞ ∞,, k=0 k=0 Generate successors Generate successors of of the the current current node node and and store store them them in in LIST(k) LIST(k) Select Select new new node node ifif LIST(k) LIST(k) is is empty empty go go to to Step Step 55 else else [( ik = arg max J x i1 , x i2 ,...x ik −1 , j j ∈LIST ( k ) )] delete ik from LIST(k ) 4. 4. Check Check bound bound ifif J x i1 , xi2 ,...x ik < go go to to 55 else else ifif k=M’ k=M’ (we (we have have the the desired desired number number of of features) features) go go to to 66 else else k=k+1 k=k+1 go go to to 22 5. 5. Backtrack Backtrack to to lower lower level level set set k=k-1 k=k-1 ifif k=0 k=0 terminate terminate algorithm algorithm else else go go to to 33 6. 6. 
Last Last level level Set = J x i1 , x i2 ,...x ik −1 , j and YM* ’ = x i1 , x i2 ,...x ik go go to to 55 ( ) ( NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 ) { Statistical Pattern Recognition 49 } Ricardo Gutierrez-Osuna Wright State University Beam Search g Beam Search is a variation of best-first search with a bounded queue to limit the scope of the search n The queue organizes states from best to worst, with the best states placed at the head of the queue n At every iteration, BS evaluates all possible states that result from adding a feature to the feature subset, and the results are inserted into the queue in their proper locations Empty feature set Full feature set g It is trivial to notice that BS degenerates to Exhaustive search if there is no limit on the size of the queue. Similarly, if the queue size is set to one, BS is equivalent to Sequential Forward Selection NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 50 Ricardo Gutierrez-Osuna Wright State University Beam Search g This example illustrates BS for a 4D space and a queue of size 3 n BS cannot guarantee that the optimal subset is found: g in the example, the optimal is 2-3-4(9), which is never explored g however, with the proper queue size, Beam Search can avoid local minimal by preserving solutions from varying regions in the search space root 1(5) 2(6) 3(7) 3(5) 4(3) 4(2) 4(5) 2(3) 3(8) 3(4) 4(6) 4(1) LIST={∅} 4(1) LIST={1(5), 3(4), 2(3)} 4(1) LIST={1-2(6), 1-3(5), 3(4)} 4(1) LIST={1-2-3(7), 1-3(5), 3(4)} 4(5) 4(9) 4(5) root 1(5) 2(6) 3(7) 3(5) 4(3) 4(2) 4(5) 2(3) 3(8) 3(4) 4(6) 4(5) 4(9) 4(5) root 1(5) 2(6) 3(7) 3(5) 4(3) 4(2) 4(5) 2(3) 3(8) 3(4) 4(6) 4(5) 4(9) 4(5) root 1(5) 2(6) 3(7) 3(5) 4(3) 4(5) 4(2) 2(3) 3(8) 3(4) 4(6) 4(5) 4(9) 4(5) NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 51 Ricardo Gutierrez-Osuna Wright State University Genetic Algorithms g Genetic algorithms are optimization techniques that mimic the evolutionary process of “survival of the fittest” n Starting with an initial random population of solutions, evolve new populations by mating (crossover) pairs of solutions and mutating solutions according to their fitness (an objective function) n The better solutions are more likely to be selected for the mating and mutation operators and, therefore, carry their “genetic code” from generation to generation Empty feature set n In FSS, individual solutions are represented with a binary number (1 if the feature is selected, 0 otherwise) 1. 1. 2. 2. 3. 3. Create Create an an initial initial random random population population Evaluate Evaluate initial initial population population Repeat Repeat until until convergence convergence (or (or aa number number of of generations) generations) 3a. 3a. Select Select the the fittest fittest individuals individuals in in the the population population 3b. 3b. Perform Perform crossover crossover on on selected selected individuals individuals to to create create offspring offspring 3c. Perform mutation on selected individuals 3c. Perform mutation on selected individuals 3d. 3d. Create Create new new population population from from old old population population and and offspring offspring 3e. 3e. 
Evaluate Evaluate the the new new population population Full feature set NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 52 Ricardo Gutierrez-Osuna Wright State University Genetic operators g Single-point crossover n Select two individuals (parents) according to their fitness n Select a crossover point n With probability Pc (0.95 is reasonable) create two offspring by combining the parents Crossover point selected randomly Parenti 01001010110 Parentj 11010110000 11011010110 Offspringi 01000110000 Offspringj Crossover g Binary mutation n Select an individual according to its fitness n With probability PM (0.01 is reasonable) mutate each one of its bits Mutated bits Individual NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 11010110000 Mutation Statistical Pattern Recognition 53 11001010111 Offspringi Ricardo Gutierrez-Osuna Wright State University Selection methods g The selection of individuals is based on their fitness g We will describe a selection method called Geometric selection n Several methods are available: Roulette Wheel, Tournament Selection… g Geometric selection g q is the probability of selecting the best individual (0.05 is a reasonable value) n The geometric distribution assigns higher probability to individuals ranked better, but also allows unfit individuals to be selected g In addition, it is typical to carry the best individual of each population to the next one 0.08 0.07 Selection probability n The probability of selecting the rth best individual is given by the geometric r -1 probability mass function P(r ) = q(1 - q) 0.06 q=0.08 0.05 q=0.04 0.04 q=0.02 0.03 q=0.01 0.02 0.01 0 0 10 20 30 40 50 60 70 Individual rank n This is called the Elitist Model NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 54 Ricardo Gutierrez-Osuna Wright State University 80 90 100 GAs, parameter choices for Feature selection g The choice of crossover rate PC is not critical n You will want a value close to 1.0 to have a large number of offspring g The choice of mutation rate PM is very critical n An optimal choice of PM will allow the GA to explore the more promising regions while avoiding getting trapped in local minima g A large value (i.e., PM>0.25) will not allow the search to focus on the better regions, and the GA will perform like random search g A small value (i.e., close to 0.0) will not allow the search to escape local minima g The choice of ‘q’, the probability of selecting the best individual is also critical n An optimal value of ‘q’ will allow the GA to explore the most promising solution, and at the same time provide sufficient diversity to avoid early convergence of the algorithm g In general, poorly selected control parameters will result in suboptimal solutions due to early convergence NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 55 Ricardo Gutierrez-Osuna Wright State University Search Strategies, summary Exhaustive Sequential Randomized Accuracy Complexity Advantages Disadvantages Always finds the optimal solution Exponential High accuracy High complexity Quadratic O(NEX ) Simple and fast Cannot backtrack Generally low Designed to escape local minima Difficult to choose good parameters Good if no backtracking needed Good with proper control parameters 2 g A highly recommended review of this material Justin Doak “An evaluation of feature selection methods and their application to Computer Security” University of 
California at Davis, Tech Report CSE-92-18

SECTION III: Classification
g Discriminant functions
g The optimal Bayes classifier
g Quadratic classifiers
g Euclidean and Mahalanobis metrics
g K Nearest Neighbor classifiers

Discriminant functions
g A convenient way to represent a pattern classifier is in terms of a family of discriminant functions gi(x) with a simple MAX gate as the classification rule
[Block diagram: the features x1 ... xd feed the discriminant functions g1(x), g2(x), ..., gC(x); a "select max" stage (optionally including costs) produces the class assignment]
n Assign x to class i if gi(x) > gj(x) for all j ≠ i
g How do we choose the discriminant functions gi(x)?
n It depends on the objective function to minimize
g Probability of error
g Bayes risk

Minimizing probability of error
g The probability of error P[error|x] is "the probability of assigning x to the wrong class"
n For a two-class problem, P[error|x] is simply
P(error | x) = P(ω1 | x) if we decide ω2, and P(ω2 | x) if we decide ω1
g It makes sense to design the classification rule to minimize the average probability of error P[error] across all possible values of x
P(error) = ∫ P(error, x) dx = ∫ P(error | x) P(x) dx   (integrals over −∞ to +∞)
g To ensure P(error) is minimum, we minimize P(error|x) by choosing the class with maximum posterior P(ωi|x) at each x
n This is called the MAXIMUM A POSTERIORI (MAP) rule
g The associated discriminant functions become gi(x) = P(ωi | x)
g We "prove" the optimality of the MAP rule graphically
n The right plot shows the posterior for each of the two classes
n The bottom plots show P(error) for the MAP rule and for another rule
n Which one has the lower P(error) (the color-filled area)?
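The comparison in the figure that follows can also be checked numerically. The sketch below is an illustration written for these notes (not course code): it assumes two equiprobable classes with 1-D Gaussian likelihoods (means and variances chosen arbitrarily), evaluates P(error|x) for the MAP rule and for an arbitrary threshold rule, and integrates P(error|x)P(x) over x. The MAP rule yields the smaller area.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two equiprobable classes with 1-D Gaussian likelihoods (illustrative values)
P1, P2 = 0.5, 0.5
x = np.linspace(-10.0, 12.0, 20001)
lik1 = gauss_pdf(x, -1.0, 1.0)
lik2 = gauss_pdf(x, 2.0, 1.5)

px = P1 * lik1 + P2 * lik2            # evidence P(x)
post1 = P1 * lik1 / px                # posterior P(w1 | x)
post2 = P2 * lik2 / px                # posterior P(w2 | x)

# P(error | x) is the posterior of the class that is NOT chosen
perr_map = np.minimum(post1, post2)               # MAP rule: pick the larger posterior
perr_other = np.where(x < 0.0, post2, post1)      # "other" rule: decide w1 whenever x < 0

# P(error) = integral of P(error | x) P(x) dx, evaluated numerically
print("P(error), MAP rule  :", np.trapz(perr_map * px, x))
print("P(error), other rule:", np.trapz(perr_other * px, x))
```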
P(wi|x) Minimizing probability of error x THE MAP RULE Choose RED Choose BLUE NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 THE “OTHER” RULE Choose RED Choose RED Statistical Pattern Recognition 60 Choose BLUE Choose RED Ricardo Gutierrez-Osuna Wright State University Quadratic classifiers g Let us assume that the likelihood densities are Gaussian P(x | i )= 1 n/2 (2 ∑i 1/2 1 exp − (x − i )T ∑ i−1(x − i ) 2 g Using Bayes rule, the MAP discriminant functions become gi (x) = P( i | x) = P(x | )P( P(x) i i ) = 1 (2 n/2 ∑i 1/2 1 exp − (x − i )T ∑ i−1(x − i )P( 2 i ) 1 P(x) n Eliminating constant terms gi (x) = ∑ i -1/2 1 exp − (x − i )T ∑ i−1(x − i )P( 2 i ) n We take natural logs (the logarithm is monotonically increasing) 1 1 gi (x) = − (x − i )T ∑ i−1(x − i ) - log( ∑ i ) + log(P( 2 2 i )) g This is known as a Quadratic Discriminant Function g The quadratic term is know as the Mahalanobis distance NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 61 Ricardo Gutierrez-Osuna Wright State University Mahalanobis distance g The Mahalanobis distance can be thought of vector distance that uses a ∑i-1 norm x1 Mahalanobis Distance x-y −1 2 ∑i −1 = (x − y)T ∑ i (x − y) µ x2 xi xi - 2 ∑ −1 2 =K = n ∑-1 can be thought of as a stretching factor on the space n Note that for an identity covariance matrix (∑i=I), the Mahalanobis distance becomes the familiar Euclidean distance g In the following slides we look at special cases of the Quadratic classifier n For convenience we will assume equiprobable priors so we can drop the term log(P(ωi)) NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 62 Ricardo Gutierrez-Osuna Wright State University Special case I: Σi=σ2I g In this case, the discriminant becomes gi (x) = −(x − i )T (x − i ) n This is known as a MINIMUM DISTANCE CLASSIFIER n Notice the linear decision boundaries NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 63 Ricardo Gutierrez-Osuna Wright State University Special case 2: Σi= Σ (Σ diagonal) g In this case, the discriminant becomes 1 gi (x) = − (x − i )T ∑ −1(x − i ) 2 n This is known as a MAHALANOBIS DISTANCE CLASSIFIER n Still linear decision boundaries NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 64 Ricardo Gutierrez-Osuna Wright State University Special case 3: Σi=Σ (Σ non-diagonal) g In this case, the discriminant becomes 1 −1 gi (x) = − (x − i )T ∑ i (x − i ) 2 n This is also known as a MAHALANOBIS DISTANCE CLASSIFIER n Still linear decision boundaries NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 65 Ricardo Gutierrez-Osuna Wright State University Case 4: Σi=σi2I, example g In this case the quadratic expression cannot be simplified any further g Notice that the decision boundaries are no longer linear but quadratic Zoom out NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 66 Ricardo Gutierrez-Osuna Wright State University Case 5: Σi≠Σj general case, example g In this case there are no constraints so the quadratic expression cannot be simplified any further g Notice that the decision boundaries are also quadratic Zoom out NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 67 Ricardo Gutierrez-Osuna Wright State University Limitations of quadratic classifiers g The fundamental limitation is the unimodal 
Gaussian assumption n For non-Gaussian or multimodal Gaussian, the results may be significantly sub-optimal g A practical limitation is associated with the minimum required size for the dataset n If the number of examples per class is less than the number of dimensions, the covariance matrix becomes singular and, therefore, its inverse cannot be computed g In this case it is common to assume the same covariance structure for all classes and compute the covariance matrix using all the examples, regardless of class NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 68 Ricardo Gutierrez-Osuna Wright State University Conclusions g We can extract the following conclusions n The Bayes classifier for normally distributed classes is quadratic n The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier n The minimum Mahalanobis distance classifier is optimum for g normally distributed classes and equal covariance matrices and equal priors n The minimum Euclidean distance classifier is optimum for g normally distributed classes and equal covariance matrices proportional to the identity matrix and equal priors n Both Euclidean and Mahalanobis distance classifiers are linear g The goal of this discussion was to show that some of the most popular classifiers can be derived from decision-theoretic principles and some simplifying assumptions n It is important to realize that using a specific (Euclidean or Mahalanobis) minimum distance classifier implicitly corresponds to certain statistical assumptions n The question whether these assumptions hold or don’t can rarely be answered in practice; in most cases we are limited to posting and answering the question “does this classifier solve our problem or not?” NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 69 Ricardo Gutierrez-Osuna Wright State University K Nearest Neighbor classifier g The kNN classifier is based on non-parametric density estimation techniques n Let us assume we seek to estimate the density function P(x) from a dataset of examples n P(x) can be approximated by the expression V is the volume surroundin g x k where N is the total number of examples P(x) ≅ NV k is the number of examples inside V n The volume V is determined by the D-dim distance RkD(x) between x and its k nearest neighbor V=πR2 P(x) = x P(x) ≅ N R R k k = NV N ⋅ c D ⋅ RDk (x) g Where cD is the volume of the unit sphere in D dimensions NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 70 k Ricardo Gutierrez-Osuna Wright State University 2 K Nearest Neighbor classifier g We use the previous result to estimate the posterior probability n The unconditional density is, again, estimated with P(x | i)= ki Ni V n And the priors can be estimated by P( i ) = Ni N n The posterior probability then becomes k i Ni ⋅ P(x | i )P( i ) Ni V N k i = = P( i | x) = k P(x) k NV n Yielding discriminant functions gi (x) = ki k g This is known as the k Nearest Neighbor classifier NOSE 2nd Summer School Lloret de Mar, Spain, October 2-5, 2000 Statistical Pattern Recognition 71 Ricardo Gutierrez-Osuna Wright State University K Nearest Neighbor classifier g The kNN classifier is a very intuitive method n Examples are classified based on their similarity with training data g For a given unlabeled example xu∈ℜD, find the k “closest” labeled examples in the training data set and assign xu to the class that appears most 
frequently within the k-subset
g The kNN rule only requires
n An integer k
n A set of labeled examples
n A measure of "closeness"
[Scatter plot: a 10-class dataset projected onto two axes, with an unlabeled example "?" to be classified]

kNN in action: example 1
g We generate data for a 2-dimensional, 3-class problem where the class-conditional densities are multi-modal and non-linearly separable
g We used kNN with
n k = 5
n Metric = Euclidean distance

kNN in action: example 2
g We generate data for a 2-dimensional, 3-class problem where the likelihoods are unimodal and distributed in rings around a common mean
n These classes are also non-linearly separable
g We used kNN with
n k = 5
n Metric = Euclidean distance

kNN versus 1NN
[Decision regions obtained with 1-NN, 5-NN and 20-NN on the same dataset]

Characteristics of the kNN classifier
g Advantages
n Analytically tractable, simple implementation
n Nearly optimal in the large-sample limit (N→∞)
g P[error]_Bayes < P[error]_1-NN < 2·P[error]_Bayes
n Uses local information, which can yield highly adaptive behavior
n Lends itself very easily to parallel implementations
g Disadvantages
n Large storage requirements
n Computationally intensive recall
n Highly susceptible to the curse of dimensionality
g 1NN versus kNN
n The use of large values of k has two main advantages
g Yields smoother decision regions
g Provides probabilistic information: the ratio of examples for each class gives information about the ambiguity of the decision
n However, too large a value of k is detrimental
g It destroys the locality of the estimation, since the nearest neighbors may no longer be close to the query point
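Putting the preceding slides together, the sketch below (an illustration written for these notes, not course code) implements the kNN rule exactly as described: for each query, find the k closest labeled examples under the Euclidean metric and assign the majority class. The function name and the toy data are assumptions of the example.

```python
import numpy as np

def knn_classify(X_train, y_train, X_test, k=5):
    """Classify each row of X_test by majority vote among its k nearest
    training examples (Euclidean distance). Returns predicted labels."""
    preds = []
    for x in X_test:
        # Euclidean distances from x to every training example
        d = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest training examples
        nn = np.argsort(d)[:k]
        # Majority vote; ties are resolved by the lowest label value
        labels, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy usage: two 2-D Gaussian classes (illustrative data only)
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(50, 2))
X1 = rng.normal([3, 3], 1.0, size=(50, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_classify(X_train, y_train, np.array([[0.5, 0.2], [2.8, 3.1]]), k=5))
```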
Section IV: Validation
g Motivation
g The holdout method
g Re-sampling techniques
g Three-way data splits

Motivation
g Validation techniques are motivated by two fundamental problems in pattern recognition: model selection and performance estimation
g Model selection
n Almost invariably, all pattern recognition techniques have one or more free parameters
g The number of neighbors in a kNN classification rule
g The network size, learning parameters and weights in MLPs
n How do we select the "optimal" parameter(s) or model for a given classification problem?
g Performance estimation
n Once we have chosen a model, how do we estimate its performance?
g Performance is typically measured by the TRUE ERROR RATE, the classifier's error rate on the ENTIRE POPULATION

Motivation
g If we had access to an unlimited number of examples, these questions would have a straightforward answer
n Choose the model that provides the lowest error rate on the entire population; of course, that error rate is then the true error rate
g In real applications we only have access to a finite set of examples, usually smaller than we would like
n One approach is to use the entire training data to select our classifier and to estimate the error rate
g This naïve approach has two fundamental problems
n The final model will normally overfit the training data
g This problem is more pronounced with models that have a large number of parameters
n The error rate estimate will be overly optimistic (lower than the true error rate)
g In fact, it is not uncommon to achieve 100% correct classification on the training data
n A much better approach is to split the training data into disjoint subsets: the holdout method

The holdout method
g Split the dataset into two groups
n Training set: used to train the classifier
n Test set: used to estimate the error rate of the trained classifier
[Diagram: the total set of examples split into a training set and a test set]
g A typical application of the holdout method is determining a stopping point for training with the back-propagation algorithm
[Figure: training-set and test-set MSE versus training epochs; the stopping point corresponds to the minimum of the test-set error]

The holdout method
g The holdout method has two basic drawbacks
n In problems where we have a sparse dataset, we may not be able to afford the "luxury" of setting aside a portion of the dataset for testing
n Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split
g The limitations of the holdout can be overcome, at the expense of more computation, with a family of resampling methods
n Cross-validation
g Random subsampling
g K-fold cross-validation
g Leave-one-out cross-validation
n Bootstrap
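Before moving on to the resampling methods, the following is a minimal sketch of the single train-and-test holdout estimate described above. The nearest-class-mean classifier is just a stand-in so the sketch stays self-contained; any classifier could take its place, and the function names and split fraction are illustrative assumptions.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, X_test):
    """Assign each test example to the class with the closest sample mean."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

def holdout_error(X, y, test_fraction=0.3, seed=0):
    """Single train-and-test split; returns the test-set error rate."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test, train = perm[:n_test], perm[n_test:]
    y_pred = nearest_mean_predict(X[train], y[train], X[test])
    return np.mean(y_pred != y[test])

# Illustrative data: two Gaussian classes in 2-D
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)
print("holdout error estimate:", holdout_error(X, y))
```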
Random Subsampling
g Random subsampling performs K data splits of the dataset
n Each split randomly selects a (fixed) number of examples without replacement to form the test set
n For each data split we retrain the classifier from scratch with the training examples and estimate Ei with the test examples
[Diagram: several random splits of the total set of examples; the test examples differ from experiment to experiment]
g The true error estimate is obtained as the average of the separate estimates Ei: E = (1/K) Σ_{i=1..K} Ei
n This estimate is significantly better than the holdout estimate

K-Fold Cross-validation
g Create a K-fold partition of the dataset
n For each of the K experiments, use K−1 folds for training and the remaining fold for testing
[Diagram: four experiments on a 4-fold partition; a different fold serves as the test set in each experiment]
g K-fold cross-validation is similar to random subsampling
n The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing
g As before, the true error is estimated as the average error rate: E = (1/K) Σ_{i=1..K} Ei

Leave-one-out Cross Validation
g Leave-one-out is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples
n For a dataset with N examples, perform N experiments
n For each experiment, use N−1 examples for training and the remaining example for testing
[Diagram: N experiments, each holding out a single test example]
g As usual, the true error is estimated as the average error rate on the test examples: E = (1/N) Σ_{i=1..N} Ei
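The following minimal sketch covers both K-fold cross-validation as outlined above and its leave-one-out limit (K equal to the number of examples). The classifier is passed in as a function so the sketch stays generic; its signature, and the 1-NN stand-in used to make the example runnable, are assumptions of this sketch.

```python
import numpy as np

def kfold_error(X, y, classify, K=10, seed=0):
    """Average test error over a K-fold partition of the data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        y_pred = classify(X[train], y[train], X[test])
        errors.append(np.mean(y_pred != y[test]))
    return np.mean(errors)            # E = (1/K) * sum_i E_i

def one_nn(X_train, y_train, X_test):
    """1-NN classifier used only to make the example runnable."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(d, axis=1)]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.repeat([0, 1], 60)
print("10-fold CV error:", kfold_error(X, y, one_nn, K=10))
print("leave-one-out error:", kfold_error(X, y, one_nn, K=len(y)))
```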
How many folds are needed?
g With a large number of folds
(+) The bias of the true error rate estimator will be small (the estimator will be very accurate)
(−) The variance of the true error rate estimator will be large
(−) The computational time will be very large as well (many experiments)
g With a small number of folds
(+) The number of experiments and, therefore, the computation time are reduced
(+) The variance of the estimator will be small
(−) The bias of the estimator will be large (conservative, i.e., larger than the true error rate)
g In practice, the choice of the number of folds depends on the size of the dataset
n For large datasets, even 3-fold cross-validation will be quite accurate
n For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible
g A common choice for K-fold cross-validation is K=10

Three-way data splits
g If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets
n Training set: a set of examples used for learning the parameters of the model
g In the MLP case, we would use the training set to find the "optimal" weights with the back-propagation rule
n Validation set: a set of examples used to tune the meta-parameters of a model
g In the MLP case, we would use the validation set to find the "optimal" number of hidden units or to determine a stopping point for the back-propagation algorithm
n Test set: a set of examples used only to assess the performance of a fully-trained classifier
g In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights)
g After assessing the final model with the test set, YOU MUST NOT further tune the model

Three-way data splits
g Why separate test and validation sets?
n The error rate estimate of the final model on validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model
n After assessing the final model with the test set, YOU MUST NOT tune the model any further
g Procedure outline
1. Divide the available data into training, validation and test sets
2. Select an architecture and training parameters
3. Train the model using the training set
4. Evaluate the model using the validation set
5. Repeat steps 2 through 4 using different architectures and training parameters
6. Select the best model and train it using data from the training and validation sets
7. Assess this final model using the test set
n This outline assumes a holdout method
g If cross-validation or the bootstrap are used, steps 3 and 4 have to be repeated for each of the K folds
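Here is a minimal sketch of the procedure outlined above, using the number of neighbors k of a kNN rule as the meta-parameter to be selected. The split proportions, the candidate values of k and the function names are arbitrary choices for illustration, not prescriptions from the slides.

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    """Majority-vote kNN prediction for a batch of query points."""
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(y_tr[row]).argmax() for row in nearest])

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(2, 1, (150, 2))])
y = np.repeat([0, 1], 150)

# 1. Divide the data into training, validation and test sets (50/25/25 here)
perm = rng.permutation(len(y))
train, val, test = perm[:150], perm[150:225], perm[225:]

# 2-5. For each candidate meta-parameter, train on the training set and
#      evaluate on the validation set
candidates = (1, 3, 5, 11, 21)
val_err = {k: np.mean(knn_predict(X[train], y[train], X[val], k) != y[val])
           for k in candidates}

# 6. Select the best model and "retrain" it on training + validation data
#    (for kNN this simply means pooling the labeled examples)
best_k = min(val_err, key=val_err.get)
pool = np.concatenate([train, val])

# 7. Assess the final model on the test set -- and do not tune it further
test_err = np.mean(knn_predict(X[pool], y[pool], X[test], best_k) != y[test])
print(f"selected k = {best_k}, estimated true error = {test_err:.3f}")
```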
Three-way data splits
[Diagram: each candidate model is trained on the training set and evaluated on the validation set; the model with the minimum validation error is selected as the final model, and its error rate is then estimated on the test set]

Section V: Clustering
g Unsupervised learning concepts
g K-means
g Agglomerative clustering
g Divisive clustering

Unsupervised learning
g In this section we investigate a family of techniques which use unlabeled data: the feature vector X without the class label ω
n These methods are called unsupervised since they are not provided with the correct answer
g Although unsupervised learning methods may appear to have limited capabilities, there are several reasons that make them extremely useful
n Labeling large data sets can be costly (e.g., speech recognition)
n Class labels may not be known beforehand (e.g., data mining)
n Very large datasets can be compressed by finding a small set of prototypes (e.g., for kNN)
g Two major approaches for unsupervised learning
n Parametric (mixture modeling)
g A functional form for the underlying density is assumed, and we seek to estimate the parameters of the model
n Non-parametric (clustering)
g No assumptions are made about the underlying densities; instead we seek a partition of the data into clusters
g These methods are typically referred to as clustering

Unsupervised learning
g Non-parametric unsupervised learning involves three steps
n Defining a measure of similarity between examples
n Defining a criterion function for clustering
n Defining an algorithm to minimize (or maximize) the criterion function
g Similarity measures
n The most straightforward similarity measure is distance: Euclidean, city-block or Mahalanobis
g Euclidean and city-block measures are very sensitive to scaling of the axes, so caution must be exercised
g Criterion function
n The most widely used criterion function for clustering is the sum-of-squared-error criterion
J_MSE = Σ_{i=1..C} Σ_{x∈ωi} ||x − µi||²,  where  µi = (1/Ni) Σ_{x∈ωi} x
g This criterion measures how well the data set X = {x(1), x(2), …, x(N)} is represented by the C cluster centers µ = {µ1, µ2, …, µC}

Unsupervised learning
g Once a criterion function has been defined, we must find a partition of the data set that minimizes the criterion
n Exhaustive enumeration of all partitions, which guarantees the optimal solution, is infeasible
n For example, a problem with 5 clusters and 100 examples yields on the order of 10^67 different ways to partition the data
g The common approach is to proceed in an iterative fashion
n Find some reasonable initial partition, and then
n Move samples from one cluster to another in order to reduce the criterion function
g These iterative methods produce sub-optimal solutions but are computationally tractable
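Before turning to specific algorithms, the following is a minimal sketch of how the J_MSE criterion defined above can be evaluated for a candidate partition; this is the quantity the iterative methods try to reduce. The data and the "good" versus "random" comparison are made up purely for illustration.

```python
import numpy as np

def j_mse(X, assignment):
    """Sum of squared distances from each example to its cluster mean."""
    total = 0.0
    for c in np.unique(assignment):
        members = X[assignment == c]
        mu = members.mean(axis=0)      # mu_i = (1/N_i) * sum of x in cluster i
        total += np.sum(np.linalg.norm(members - mu, axis=1) ** 2)
    return total

# Illustrative comparison: a sensible partition versus a random one
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
good = np.repeat([0, 1], 50)
bad = rng.integers(0, 2, 100)
print("J_MSE (good partition):  ", j_mse(X, good))
print("J_MSE (random partition):", j_mse(X, bad))
```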
g We will consider two groups of iterative methods
n Flat clustering algorithms
g These algorithms produce a set of disjoint clusters (a.k.a. a partition)
g Two algorithms are widely used: k-means and ISODATA
n Hierarchical clustering algorithms
g The result is a hierarchy of nested clusters
g These algorithms can be broadly divided into agglomerative and divisive approaches

The k-means algorithm
g The k-means algorithm is a simple clustering procedure that attempts to minimize the criterion function JMSE iteratively
1. Define the number of clusters
2. Initialize the clusters by
• an arbitrary assignment of examples to clusters, or
• an arbitrary set of cluster centers (examples assigned to the nearest centers)
3. Compute the sample mean of each cluster
4. Reassign each example to the cluster with the nearest mean
5. If the classification of all samples has not changed, stop; else go to step 3
n The k-means algorithm is widely used in the fields of signal processing and communications for Vector Quantization
g Scalar signal values are usually quantized into a number of levels (typically a power of 2 so the signal can be transmitted in binary)
g The same idea can be extended to multiple channels
n However, rather than quantizing each channel separately, we can obtain a more efficient signal coding if we quantize the overall multidimensional vector by finding a number of multidimensional prototypes (cluster centers)
g The set of cluster centers is called a "codebook", and the problem of finding this codebook is normally solved using the k-means algorithm
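The following is a minimal NumPy sketch of the k-means procedure outlined above: initialize the cluster centers from arbitrary examples, then alternate between reassigning each example to the nearest mean and recomputing the means until the assignments no longer change. The initialization strategy, seed and data are illustrative assumptions.

```python
import numpy as np

def k_means(X, n_clusters, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 2: an arbitrary set of cluster centers (here, randomly chosen examples)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    assignment = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 4: reassign each example to the cluster with the nearest mean
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = np.argmin(d, axis=1)
        # Step 5: stop when the classification of all samples has not changed
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 3: recompute the sample mean of each cluster
        for c in range(n_clusters):
            if np.any(assignment == c):
                centers[c] = X[assignment == c].mean(axis=0)
    return centers, assignment

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 0.4, (60, 2)) for m in (0.0, 2.0, 4.0)])
centers, labels = k_means(X, n_clusters=3)
print(np.round(centers, 2))
```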
Hierarchical clustering
g k-means creates disjoint clusters, resulting in a "flat" data representation (a partition)
g Sometimes it is desirable to obtain a hierarchical representation of the data, with clusters and sub-clusters arranged in a tree-structured fashion
n Hierarchical representations are commonly used in the sciences (e.g., biological taxonomy)
g Hierarchical clustering methods can be grouped into two general classes
n Agglomerative (bottom-up, merging)
g Starting with N singleton clusters, successively merge clusters until one cluster is left
n Divisive (top-down, splitting)
g Starting with a unique cluster, successively split clusters until N singleton examples are left

Hierarchical clustering
g The preferred representation for hierarchical clusters is the dendrogram
n The dendrogram is a binary tree that shows the structure of the clusters
g In addition to the binary tree, the dendrogram provides the similarity measure between clusters (the vertical axis)
n An alternative representation is based on sets: {{x1, {x2, x3}}, {{{x4, x5}, {x6, x7}}, x8}}, but, unlike the dendrogram, sets cannot express quantitative information
[Figure: dendrogram over the examples x1–x8; the vertical axis runs from high similarity to low similarity]

Divisive clustering
g Outline
n Define
g NC: number of clusters
g NEX: number of examples
1. Start with one large cluster
2. Find the "worst" cluster
3. Split it
4. If NC < NEX, go to step 2
g How to choose the "worst" cluster
n Largest number of examples
n Largest variance
n Largest sum-of-squared-error
n ...
g How to split clusters
n Mean-median split in one feature direction
n Perpendicular to the direction of largest variance
n ...
g The computations required by divisive clustering are more intensive than for agglomerative clustering methods and, thus, agglomerative approaches are more common

Agglomerative clustering
g Outline
n Define
g NC: number of clusters
g NEX: number of examples
1. Start with NEX singleton clusters
2. Find the nearest pair of clusters
3. Merge them
4. If NC > 1, go to step 2
g How to find the "nearest" pair of clusters
n Minimum distance: dmin(ωi, ωj) = min_{x∈ωi, y∈ωj} ||x − y||
n Maximum distance: dmax(ωi, ωj) = max_{x∈ωi, y∈ωj} ||x − y||
n Average distance: davg(ωi, ωj) = (1 / (Ni·Nj)) Σ_{x∈ωi} Σ_{y∈ωj} ||x − y||
n Mean distance: dmean(ωi, ωj) = ||µi − µj||

Agglomerative clustering
g Minimum distance
n When dmin is used to measure the distance between clusters, the algorithm is called the nearest-neighbor or single-linkage clustering algorithm
n If the algorithm is allowed to run until only one cluster remains, the result is a minimum spanning tree (MST)
n This algorithm favors elongated classes
g Maximum distance
n When dmax is used to measure the distance between clusters, the algorithm is called the farthest-neighbor or complete-linkage clustering algorithm
n From a graph-theoretic point of view, each cluster constitutes a complete subgraph
n This algorithm favors compact classes
g Average and mean distance
n The minimum and maximum distances are extremely sensitive to outliers, since their measurement of between-cluster distance involves minima or maxima
n The average and mean distance approaches are more robust to outliers
n Of the two, the mean distance is computationally more attractive
g Notice that the average distance approach involves the computation of Ni·Nj distances for each pair of clusters
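The following is a minimal sketch of bottom-up agglomerative clustering using the minimum (single-linkage) distance defined above: start from singleton clusters and repeatedly merge the nearest pair until the desired number of clusters remains. This brute-force version recomputes the linkage at every step, which is fine for illustration but not for large datasets; the function name and stopping rule are assumptions of this sketch.

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Agglomerative clustering with the d_min (single-linkage) distance."""
    clusters = [[i] for i in range(len(X))]          # start with N_EX singletons
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        # Find the nearest pair of clusters under d_min
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = pairwise[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best, best_d = (a, b), d
        a, b = best
        clusters[a] = clusters[a] + clusters[b]      # merge the nearest pair
        del clusters[b]
    return clusters

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
for c in single_linkage(X, n_clusters=2):
    print(sorted(c))
```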
Conclusions
g Dimensionality reduction
n PCA may not find the most discriminatory axes
n LDA may overfit the data
g Classifiers
n Quadratic classifiers are efficient for unimodal data
n kNN is very versatile but computationally inefficient
g Validation
n A fundamental subject that is oftentimes overlooked
g Clustering
n Helps understand the underlying structure of the data

ACKNOWLEDGMENTS
This research was supported by NSF/CAREER award 9984426.