240-650 Principles of Pattern Recognition
Montri Karnjanadecha
montri@coe.psu.ac.th
http://fivedots.coe.psu.ac.th/~montri

Chapter 2: Bayesian Decision Theory

Statistical Approach to Pattern Recognition

A Simple Example
• Suppose that we are given two classes $\omega_1$ and $\omega_2$
  – $P(\omega_1) = 0.7$
  – $P(\omega_2) = 0.3$
  – No measurement is given
• Guessing
  – What shall we do to recognize a given input?
  – What is the best we can do statistically? Why?

A More Complicated Example
• Suppose that we are given two classes
  – A single measurement x
  – $P(\omega_1|x)$ and $P(\omega_2|x)$ are given graphically

A Bayesian Example
• Suppose that we are given two classes
  – A single measurement x
  – We are given $p(x|\omega_1)$ and $p(x|\omega_2)$ this time

A Bayesian Example – cont. (figure)

Bayesian Decision Theory
• Bayes formula
  $$P(\omega_j | x) = \frac{p(\omega_j, x)}{p(x)} = \frac{p(x | \omega_j)\, P(\omega_j)}{p(x)}$$
• In the case of two categories
  $$p(x) = \sum_{j=1}^{2} p(x | \omega_j)\, P(\omega_j)$$
• In English, it can be expressed as
  $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

Bayesian Decision Theory – cont.
• Posterior probability
  – The probability of the state of nature being $\omega_j$ given that feature value x has been measured
• Likelihood
  – $p(x | \omega_j)$ is the likelihood of $\omega_j$ with respect to x
• Evidence
  – The evidence factor can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one

Bayesian Decision Theory – cont.
• Whenever we observe a particular x, the probability of error is
  $$P(\text{error} | x) = \begin{cases} P(\omega_1 | x) & \text{if we decide } \omega_2 \\ P(\omega_2 | x) & \text{if we decide } \omega_1 \end{cases}$$
• The average probability of error is given by
  $$P(\text{error}) = \int P(\text{error}, x)\, dx = \int P(\text{error} | x)\, p(x)\, dx$$
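To make the posterior computation concrete, here is a minimal sketch (not from the slides). The priors 0.7 and 0.3 come from the simple example above; the class-conditional densities are only shown graphically in the slides, so Gaussian likelihoods are assumed here purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# Priors from the simple example above
priors = np.array([0.7, 0.3])

# Hypothetical class-conditional densities p(x|w1), p(x|w2); the slides
# only show these graphically, so Gaussians are assumed for illustration
likelihoods = [norm(loc=0.0, scale=1.0), norm(loc=2.0, scale=1.0)]

def posteriors(x):
    """P(w_j|x) = p(x|w_j) P(w_j) / p(x), with p(x) = sum_j p(x|w_j) P(w_j)."""
    joint = np.array([f.pdf(x) for f in likelihoods]) * priors
    evidence = joint.sum()              # the scaling factor ("evidence")
    return joint / evidence             # posteriors sum to one

post = posteriors(1.2)
print(f"P(w1|x)={post[0]:.3f}, P(w2|x)={post[1]:.3f}")
print(f"error probability if we decide the more probable class: {post.min():.3f}")
```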
Bayesian Decision Theory – cont.
• Bayes decision rule
  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$
• Probability of error
  $$P(\text{error}|x) = \min[P(\omega_1|x),\, P(\omega_2|x)]$$
• If we ignore the "evidence", the decision rule becomes:
  Decide $\omega_1$ if $p(x|\omega_1)\, P(\omega_1) > p(x|\omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$

Bayesian Decision Theory – Continuous Features
• Feature space
  – In general, an input can be represented by a vector $\mathbf{x}$, a point in a d-dimensional Euclidean space $R^d$
• Loss function
  – The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
  – Written as $\lambda(\alpha_i | \omega_j)$

Loss Function $\lambda(\alpha_i | \omega_j)$
• Describes the loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$

Conditional Risk
• Suppose we observe a particular x
• We take action $\alpha_i$
• If the true state of nature is $\omega_j$, by definition we will incur the loss $\lambda(\alpha_i|\omega_j)$
• We can minimize our expected loss by selecting the action that minimizes the conditional risk $R(\alpha_i|x)$
  $$R(\alpha_i | x) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j)\, P(\omega_j | x)$$

Bayesian Decision Theory
• Suppose that there are c categories $\{\omega_1, \omega_2, \ldots, \omega_c\}$
• Conditional risk
  $$R(\alpha_i | x) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j)\, P(\omega_j | x)$$
• The overall risk is the average expected loss
  $$R = \int R(\alpha(\mathbf{x}) | \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

Bayesian Decision Theory
• Bayes decision rule
  – For a given x, select the action $\alpha_i$ for which the conditional risk is minimum
    $$\alpha^* = \arg\min_i R(\alpha_i | x)$$
  – The resulting minimum overall risk is called the Bayes risk, denoted $R^*$, which is the best performance that can be achieved

Two-Category Classification
• Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$
• Conditional risk
  $$R(\alpha_1 | x) = \lambda_{11} P(\omega_1 | x) + \lambda_{12} P(\omega_2 | x)$$
  $$R(\alpha_2 | x) = \lambda_{21} P(\omega_1 | x) + \lambda_{22} P(\omega_2 | x)$$
• Fundamental decision rule
  Decide $\omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$
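As a concrete illustration of the two-category risk computation, the sketch below (not from the slides; the loss matrix and posteriors are made-up illustrative values) evaluates $R(\alpha_1|x)$ and $R(\alpha_2|x)$ and takes the action with minimum conditional risk.

```python
import numpy as np

# Hypothetical loss matrix lambda_ij = loss for taking action a_i when the
# true state of nature is w_j (illustrative values only)
loss = np.array([[0.0, 2.0],    # action a_1: correct for w1, costly if truth is w2
                 [1.0, 0.0]])   # action a_2: costly if truth is w1, correct for w2

def min_risk_action(posteriors):
    """Return (best_action_index, conditional_risks) for given P(w_j|x)."""
    # R(a_i|x) = sum_j lambda_ij * P(w_j|x)
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

post = np.array([0.6, 0.4])          # example posteriors P(w1|x), P(w2|x)
action, risks = min_risk_action(post)
print(f"R(a1|x)={risks[0]:.2f}, R(a2|x)={risks[1]:.2f} -> take action a{action + 1}")
```

With these particular losses the classifier takes action $\alpha_2$ even though $P(\omega_1|x) > P(\omega_2|x)$, which is exactly the effect of weighting the two kinds of error unequally.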
Two-Category Classification – cont.
• The decision rule can be written in several equivalent ways
  – Decide $\omega_1$ if any one of the following is true:
    $$(\lambda_{21} - \lambda_{11})\, P(\omega_1 | \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 | \mathbf{x})$$
    $$(\lambda_{21} - \lambda_{11})\, p(\mathbf{x} | \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(\mathbf{x} | \omega_2)\, P(\omega_2)$$
    $$\frac{p(\mathbf{x} | \omega_1)}{p(\mathbf{x} | \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \quad \text{(likelihood ratio)}$$

Minimum-Error-Rate Classification
• A special case of the Bayes decision rule with the following zero-one loss function
  $$\lambda(\alpha_i | \omega_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases}$$
  – Assigns no loss to a correct decision
  – Assigns unit loss to any error
  – All errors are equally costly

Minimum-Error-Rate Classification
• Conditional risk
  $$R(\alpha_i | \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j)\, P(\omega_j | \mathbf{x}) = \sum_{j \neq i} P(\omega_j | \mathbf{x}) = 1 - P(\omega_i | \mathbf{x})$$

Minimum-Error-Rate Classification
• We should select the $\alpha_i$ that maximizes the posterior probability $P(\omega_i | \mathbf{x})$
• For minimum error rate:
  Decide $\omega_i$ if $P(\omega_i | \mathbf{x}) > P(\omega_j | \mathbf{x})$ for all $j \neq i$

Minimum-Error-Rate Classification (figure)

Classifiers, Discriminant Functions, and Decision Surfaces
• There are many ways to represent pattern classifiers
• One of the most useful is in terms of a set of discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, c$
• The classifier assigns a feature vector $\mathbf{x}$ to class $\omega_i$ if
  $$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } j \neq i$$

The Multicategory Classifier (figure)

Classifiers, Discriminant Functions, and Decision Surfaces
• There are many equivalent discriminant functions
  – i.e., the classification results will be the same even though the functions themselves are different
  – For example, if f is a monotonically increasing function, then $\tilde{g}_i(\mathbf{x}) = f(g_i(\mathbf{x}))$ yields the same decisions

Classifiers, Discriminant Functions, and Decision Surfaces
• Some discriminant functions are easier to understand or to compute than others

Decision Regions
• The effect of any decision rule is to divide the feature space into c decision regions $R_1, \ldots, R_c$
  $$\text{If } g_i(\mathbf{x}) > g_j(\mathbf{x}) \text{ for all } j \neq i, \text{ then } \mathbf{x} \in R_i$$
  – The regions are separated by decision boundaries, where ties occur among the largest discriminant functions

Decision Regions – cont. (figure)
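The multicategory rule above is easy to sketch in code. The example below is mine, not from the slides: each class supplies a discriminant $g_i(\mathbf{x})$, and the classifier assigns $\mathbf{x}$ to the class with the largest value. The particular choice $g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$, with assumed Gaussian densities and made-up parameters, is one convenient monotone transform of the posterior for minimum-error-rate classification.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed per-class densities and priors (illustrative values only)
means  = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs   = [np.eye(2), np.eye(2), np.eye(2)]
priors = [0.5, 0.3, 0.2]

def discriminants(x):
    """g_i(x) = ln p(x|w_i) + ln P(w_i); any monotone transform gives the same decisions."""
    return np.array([
        multivariate_normal(mean=m, cov=S).logpdf(x) + np.log(P)
        for m, S, P in zip(means, covs, priors)
    ])

def classify(x):
    g = discriminants(x)
    return int(np.argmax(g)) + 1     # assign x to the w_i with the largest g_i(x)

print(classify(np.array([2.5, 0.5])))   # this point falls in the decision region of w2
```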
Two-Category Case (Dichotomizer)
• The two-category case is a special case
  – Instead of two discriminant functions, a single one can be used:
    $$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$
    $$g(\mathbf{x}) = P(\omega_1 | \mathbf{x}) - P(\omega_2 | \mathbf{x})$$
    $$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} | \omega_1)}{p(\mathbf{x} | \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

The Normal Density
• Univariate Gaussian density
  $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]$$
• Mean
  $$\mu = E[x] = \int x\, p(x)\, dx$$
• Variance
  $$\sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2\, p(x)\, dx$$

The Normal Density (figure)

The Normal Density
• Central Limit Theorem
  – The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution
  – The Gaussian is therefore often a good model for the actual probability distribution

The Multivariate Normal Density
• Multivariate density (in d dimensions)
  $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]$$
• Abbreviation: $p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

The Multivariate Normal Density
• Mean
  $$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$$
• Covariance matrix
  $$\boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t\, p(\mathbf{x})\, d\mathbf{x}$$
• The ij-th component of $\boldsymbol{\Sigma}$
  $$\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$$

Statistical Independence
– If $x_i$ and $x_j$ are statistically independent then $\sigma_{ij} = 0$
– The covariance matrix then becomes a diagonal matrix, with all off-diagonal elements equal to zero

Whitening Transform
$$\mathbf{A}_w = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}$$
where $\boldsymbol{\Phi}$ is the matrix whose columns are the orthonormal eigenvectors of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix of the corresponding eigenvalues

Whitening Transform (figure)

Squared Mahalanobis Distance from x to μ
$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$
• Contours of constant density are hyperellipsoids of constant Mahalanobis distance
• The principal axes of the hyperellipsoids are given by the eigenvectors of $\boldsymbol{\Sigma}$; the lengths of the axes are determined by the eigenvalues of $\boldsymbol{\Sigma}$

Discriminant Functions for the Normal Density
• Minimum-error-rate classification can be achieved with the discriminant functions
  $$g_i(\mathbf{x}) = \ln p(\mathbf{x} | \omega_i) + \ln P(\omega_i)$$
• If the densities $p(\mathbf{x} | \omega_i)$ are multivariate normal, i.e., $p(\mathbf{x} | \omega_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, then we have
  $$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

Discriminant Functions for the Normal Density
• Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$
  – Features are statistically independent and each feature has the same variance
    $$g_i(\mathbf{x}) = -\frac{\| \mathbf{x} - \boldsymbol{\mu}_i \|^2}{2\sigma^2} + \ln P(\omega_i)$$
  – where $\| \cdot \|$ denotes the Euclidean norm:
    $$\| \mathbf{x} - \boldsymbol{\mu}_i \|^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^t (\mathbf{x} - \boldsymbol{\mu}_i)$$

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$ (figure)

Linear Discriminant Function
• It is not necessary to compute distances
  – Expanding the form $(\mathbf{x} - \boldsymbol{\mu}_i)^t (\mathbf{x} - \boldsymbol{\mu}_i)$ yields
    $$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2} \left[ \mathbf{x}^t \mathbf{x} - 2 \boldsymbol{\mu}_i^t \mathbf{x} + \boldsymbol{\mu}_i^t \boldsymbol{\mu}_i \right] + \ln P(\omega_i)$$
  – The term $\mathbf{x}^t \mathbf{x}$ is the same for all i and can be dropped
  – We then have the following linear discriminant function
    $$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$$

Linear Discriminant Function
where
  $$\mathbf{w}_i = \frac{1}{\sigma^2} \boldsymbol{\mu}_i$$
and
  $$w_{i0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_i^t \boldsymbol{\mu}_i + \ln P(\omega_i)$$
($w_{i0}$ is the threshold or bias for the i-th category)
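To make Case 1 concrete, here is a small sketch of my own (means, variance, and priors are made-up values) that builds the linear discriminants $g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$ from the formulas above and checks that they give the same decision as the direct distance form.

```python
import numpy as np

# Assumed parameters for Case 1 (Sigma_i = sigma^2 I); values are illustrative
means  = np.array([[0.0, 0.0], [4.0, 1.0]])   # mu_1, mu_2
sigma2 = 1.5                                  # common variance sigma^2
priors = np.array([0.7, 0.3])

# Linear form: g_i(x) = w_i^t x + w_i0
W  = means / sigma2                                           # w_i = mu_i / sigma^2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)

def g_linear(x):
    return W @ x + w0

def g_distance(x):
    # Equivalent distance form: -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i)
    return -np.sum((x - means)**2, axis=1) / (2 * sigma2) + np.log(priors)

x = np.array([1.8, 0.4])
print(np.argmax(g_linear(x)) + 1, np.argmax(g_distance(x)) + 1)  # same decision
```

The two forms differ only by the $\mathbf{x}^t \mathbf{x}$ term, which is the same for every class, so the argmax (and hence the decision) is identical.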
Linear Machine
• A classifier that uses linear discriminant functions is called a linear machine
• Its decision surfaces are pieces of hyperplanes defined by the linear equations $g_i(\mathbf{x}) = g_j(\mathbf{x})$ for the two categories with the highest posterior probabilities
• For our case this equation can be written as
  $$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0$$

Linear Machine
where
  $$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$$
and
  $$\mathbf{x}_0 = \frac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\| \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
• If $P(\omega_i) = P(\omega_j)$ then the second term vanishes and the classifier is called a minimum-distance classifier

Priors change -> decision boundaries shift (figures)

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$
• The covariance matrices for all of the classes are identical but otherwise arbitrary
• The cluster for the i-th class is centered about $\boldsymbol{\mu}_i$
• Discriminant function:
  $$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)$$
  (the prior term can be ignored if the prior probabilities are the same for all classes)

Case 2: Discriminant Function
  $$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$$
where
  $$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i$$
and
  $$w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$$

For the 2-Category Case
• If $R_i$ and $R_j$ are contiguous, the boundary between them has the equation
  $$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0$$
where
  $$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
and
  $$\mathbf{x}_0 = \frac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln [P(\omega_i) / P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary
• In general, the covariance matrices are different for each category
• The only term that can be dropped is the $(d/2) \ln 2\pi$ term

Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary
• The discriminant functions are
  $$g_i(\mathbf{x}) = \mathbf{x}^t \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^t \mathbf{x} + w_{i0}$$
where
  $$\mathbf{W}_i = -\frac{1}{2} \boldsymbol{\Sigma}_i^{-1}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i$$
and
  $$w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

Two-Category Case
• The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, …)

Example (figure)
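As a final sketch (mine, not from the slides; the class means, covariances, and priors are made-up values), the quadratic discriminant of Case 3 can be evaluated directly from the formulas above.

```python
import numpy as np

# Assumed class parameters for Case 3 (arbitrary covariances); values are illustrative
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs   = [np.array([[2.0, 0.3], [0.3, 0.5]]), np.array([[0.7, 0.0], [0.0, 1.5]])]
priors = [0.6, 0.4]

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for the arbitrary-covariance case."""
    Sinv = np.linalg.inv(Sigma)
    Wi  = -0.5 * Sinv                                        # W_i  = -1/2 Sigma_i^-1
    wi  = Sinv @ mu                                          # w_i  = Sigma_i^-1 mu_i
    wi0 = (-0.5 * mu @ Sinv @ mu
           - 0.5 * np.log(np.linalg.det(Sigma))
           + np.log(prior))                                  # w_i0
    return x @ Wi @ x + wi @ x + wi0

def classify(x):
    g = [quadratic_discriminant(x, m, S, P) for m, S, P in zip(means, covs, priors)]
    return int(np.argmax(g)) + 1

print(classify(np.array([1.0, 2.0])))   # class with the larger quadratic discriminant
```

Because each class keeps its own covariance, the resulting decision surface is a hyperquadric rather than a hyperplane, as noted on the last slide.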