240-650 Principles of Pattern Recognition
Montri Karnjanadecha
montri@coe.psu.ac.th
http://fivedots.coe.psu.ac.th/~montri

Chapter 2: Bayesian Decision Theory

Statistical Approach to Pattern Recognition

A Simple Example
• Suppose that we are given two classes $\omega_1$ and $\omega_2$
  – $P(\omega_1) = 0.7$
  – $P(\omega_2) = 0.3$
  – No measurement is given
• Guessing
  – What shall we do to recognize a given input?
  – What is the best we can do statistically? Why?

A More Complicated Example
• Suppose that we are given two classes
  – A single measurement x
  – $P(\omega_1|x)$ and $P(\omega_2|x)$ are given graphically

A Bayesian Example
• Suppose that we are given two classes
  – A single measurement x
  – We are given $p(x|\omega_1)$ and $p(x|\omega_2)$ this time

A Bayesian Example – cont. (figure)

Bayesian Decision Theory
• Bayes formula
  $$P(\omega_j | x) = \frac{p(\omega_j, x)}{p(x)} = \frac{p(x | \omega_j)\, P(\omega_j)}{p(x)}$$
• In the case of two categories
  $$p(x) = \sum_{j=1}^{2} p(x | \omega_j)\, P(\omega_j)$$
• In English, it can be expressed as
  $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

Bayesian Decision Theory – cont.
• Posterior probability
  – The probability of the state of nature being $\omega_j$ given that feature value x has been measured
• Likelihood
  – $p(x | \omega_j)$ is the likelihood of $\omega_j$ with respect to x
• Evidence
  – The evidence factor can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one

Bayesian Decision Theory – cont.
• Whenever we observe a particular x, the probability of error is
  $$P(\text{error} | x) = \begin{cases} P(\omega_1 | x) & \text{if we decide } \omega_2 \\ P(\omega_2 | x) & \text{if we decide } \omega_1 \end{cases}$$
• The average probability of error is given by
  $$P(\text{error}) = \int P(\text{error}, x)\, dx = \int P(\text{error} | x)\, p(x)\, dx$$
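To make the posterior computation concrete, here is a minimal sketch (not from the slides). The priors 0.7 and 0.3 come from the simple example above; the class-conditional densities are only shown graphically in the slides, so Gaussian likelihoods are assumed here purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# Priors from the simple example above
priors = np.array([0.7, 0.3])

# Hypothetical class-conditional densities p(x|w1), p(x|w2); the slides
# only show these graphically, so Gaussians are assumed for illustration
likelihoods = [norm(loc=0.0, scale=1.0), norm(loc=2.0, scale=1.0)]

def posteriors(x):
    """P(w_j|x) = p(x|w_j) P(w_j) / p(x), with p(x) = sum_j p(x|w_j) P(w_j)."""
    joint = np.array([f.pdf(x) for f in likelihoods]) * priors
    evidence = joint.sum()              # the scaling factor ("evidence")
    return joint / evidence             # posteriors sum to one

post = posteriors(1.2)
print(f"P(w1|x)={post[0]:.3f}, P(w2|x)={post[1]:.3f}")
print(f"error probability if we decide the more probable class: {post.min():.3f}")
```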
Bayesian Decision Theory – cont.
• Bayes decision rule
  Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$
• Probability of error
  $$P(\text{error}|x) = \min[P(\omega_1|x),\, P(\omega_2|x)]$$
• If we ignore the "evidence", the decision rule becomes:
  Decide $\omega_1$ if $p(x|\omega_1)\, P(\omega_1) > p(x|\omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$

Bayesian Decision Theory – Continuous Features
• Feature space
  – In general, an input can be represented by a vector $\mathbf{x}$, a point in a d-dimensional Euclidean space $R^d$
• Loss function
  – The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
  – Written as $\lambda(\alpha_i | \omega_j)$

Loss Function $\lambda(\alpha_i | \omega_j)$
• Describes the loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$

Conditional Risk
• Suppose we observe a particular x
• We take action $\alpha_i$
• If the true state of nature is $\omega_j$, by definition we will incur the loss $\lambda(\alpha_i|\omega_j)$
• We can minimize our expected loss by selecting the action that minimizes the conditional risk $R(\alpha_i|x)$
  $$R(\alpha_i | x) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j)\, P(\omega_j | x)$$

Bayesian Decision Theory
• Suppose that there are c categories $\{\omega_1, \omega_2, \ldots, \omega_c\}$
• Conditional risk
  $$R(\alpha_i | x) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j)\, P(\omega_j | x)$$
• The overall risk is the average expected loss
  $$R = \int R(\alpha(\mathbf{x}) | \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

Bayesian Decision Theory
• Bayes decision rule
  – For a given x, select the action $\alpha_i$ for which the conditional risk is minimum
    $$\alpha^* = \arg\min_i R(\alpha_i | x)$$
  – The resulting minimum overall risk is called the Bayes risk, denoted $R^*$, which is the best performance that can be achieved

Two-Category Classification
• Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$
• Conditional risk
  $$R(\alpha_1 | x) = \lambda_{11} P(\omega_1 | x) + \lambda_{12} P(\omega_2 | x)$$
  $$R(\alpha_2 | x) = \lambda_{21} P(\omega_1 | x) + \lambda_{22} P(\omega_2 | x)$$
• Fundamental decision rule
  Decide $\omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$
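As a concrete illustration of the two-category risk computation, the sketch below (not from the slides; the loss matrix and posteriors are made-up illustrative values) evaluates $R(\alpha_1|x)$ and $R(\alpha_2|x)$ and takes the action with minimum conditional risk.

```python
import numpy as np

# Hypothetical loss matrix lambda_ij = loss for taking action a_i when the
# true state of nature is w_j (illustrative values only)
loss = np.array([[0.0, 2.0],    # action a_1: correct for w1, costly if truth is w2
                 [1.0, 0.0]])   # action a_2: costly if truth is w1, correct for w2

def min_risk_action(posteriors):
    """Return (best_action_index, conditional_risks) for given P(w_j|x)."""
    # R(a_i|x) = sum_j lambda_ij * P(w_j|x)
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

post = np.array([0.6, 0.4])          # example posteriors P(w1|x), P(w2|x)
action, risks = min_risk_action(post)
print(f"R(a1|x)={risks[0]:.2f}, R(a2|x)={risks[1]:.2f} -> take action a{action + 1}")
```

With these particular losses the classifier takes action $\alpha_2$ even though $P(\omega_1|x) > P(\omega_2|x)$, which is exactly the effect of weighting the two kinds of error unequally.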
Two-Category Classification – cont.
• The decision rule can be written in several equivalent ways
  – Decide $\omega_1$ if any one of the following is true:
    $$(\lambda_{21} - \lambda_{11})\, P(\omega_1 | \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 | \mathbf{x})$$
    $$(\lambda_{21} - \lambda_{11})\, p(\mathbf{x} | \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(\mathbf{x} | \omega_2)\, P(\omega_2)$$
    $$\frac{p(\mathbf{x} | \omega_1)}{p(\mathbf{x} | \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \quad \text{(likelihood ratio)}$$

Minimum-Error-Rate Classification
• A special case of the Bayes decision rule with the following zero-one loss function
  $$\lambda(\alpha_i | \omega_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases}$$
  – Assigns no loss to a correct decision
  – Assigns unit loss to any error
  – All errors are equally costly

Minimum-Error-Rate Classification
• Conditional risk
  $$R(\alpha_i | \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i | \omega_j)\, P(\omega_j | \mathbf{x}) = \sum_{j \neq i} P(\omega_j | \mathbf{x}) = 1 - P(\omega_i | \mathbf{x})$$

Minimum-Error-Rate Classification
• We should select the $\alpha_i$ that maximizes the posterior probability $P(\omega_i | \mathbf{x})$
• For minimum error rate:
  Decide $\omega_i$ if $P(\omega_i | \mathbf{x}) > P(\omega_j | \mathbf{x})$ for all $j \neq i$

Minimum-Error-Rate Classification (figure)

Classifiers, Discriminant Functions, and Decision Surfaces
• There are many ways to represent pattern classifiers
• One of the most useful is in terms of a set of discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, c$
• The classifier assigns a feature vector $\mathbf{x}$ to class $\omega_i$ if
  $$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } j \neq i$$

The Multicategory Classifier (figure)

Classifiers, Discriminant Functions, and Decision Surfaces
• There are many equivalent discriminant functions
  – i.e., the classification results will be the same even though the functions themselves are different
  – For example, if f is a monotonically increasing function, then $\tilde{g}_i(\mathbf{x}) = f(g_i(\mathbf{x}))$ yields the same decisions

Classifiers, Discriminant Functions, and Decision Surfaces
• Some discriminant functions are easier to understand or to compute than others

Decision Regions
• The effect of any decision rule is to divide the feature space into c decision regions $R_1, \ldots, R_c$
  $$\text{If } g_i(\mathbf{x}) > g_j(\mathbf{x}) \text{ for all } j \neq i, \text{ then } \mathbf{x} \in R_i$$
  – The regions are separated by decision boundaries, where ties occur among the largest discriminant functions

Decision Regions – cont. (figure)
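The multicategory rule above is easy to sketch in code. The example below is mine, not from the slides: each class supplies a discriminant $g_i(\mathbf{x})$, and the classifier assigns $\mathbf{x}$ to the class with the largest value. The particular choice $g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$, with assumed Gaussian densities and made-up parameters, is one convenient monotone transform of the posterior for minimum-error-rate classification.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed per-class densities and priors (illustrative values only)
means  = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs   = [np.eye(2), np.eye(2), np.eye(2)]
priors = [0.5, 0.3, 0.2]

def discriminants(x):
    """g_i(x) = ln p(x|w_i) + ln P(w_i); any monotone transform gives the same decisions."""
    return np.array([
        multivariate_normal(mean=m, cov=S).logpdf(x) + np.log(P)
        for m, S, P in zip(means, covs, priors)
    ])

def classify(x):
    g = discriminants(x)
    return int(np.argmax(g)) + 1     # assign x to the w_i with the largest g_i(x)

print(classify(np.array([2.5, 0.5])))   # this point falls in the decision region of w2
```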
Two-Category Case (Dichotomizer)
• The two-category case is a special case
  – Instead of two discriminant functions, a single one can be used:
    $$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$
    $$g(\mathbf{x}) = P(\omega_1 | \mathbf{x}) - P(\omega_2 | \mathbf{x})$$
    $$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} | \omega_1)}{p(\mathbf{x} | \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

The Normal Density
• Univariate Gaussian density
  $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]$$
• Mean
  $$\mu = E[x] = \int x\, p(x)\, dx$$
• Variance
  $$\sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2\, p(x)\, dx$$

The Normal Density (figure)

The Normal Density
• Central Limit Theorem
  – The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution
  – The Gaussian is therefore often a good model for the actual probability distribution

The Multivariate Normal Density
• Multivariate density (in d dimensions)
  $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]$$
• Abbreviation: $p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

The Multivariate Normal Density
• Mean
  $$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$$
• Covariance matrix
  $$\boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t\, p(\mathbf{x})\, d\mathbf{x}$$
• The ij-th component of $\boldsymbol{\Sigma}$
  $$\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$$

Statistical Independence
– If $x_i$ and $x_j$ are statistically independent then $\sigma_{ij} = 0$
– The covariance matrix then becomes a diagonal matrix, with all off-diagonal elements equal to zero

Whitening Transform
$$\mathbf{A}_w = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}$$
where $\boldsymbol{\Phi}$ is the matrix whose columns are the orthonormal eigenvectors of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix of the corresponding eigenvalues

Whitening Transform (figure)

Squared Mahalanobis Distance from x to μ
$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$
• Contours of constant density are hyperellipsoids of constant Mahalanobis distance
• The principal axes of the hyperellipsoids are given by the eigenvectors of $\boldsymbol{\Sigma}$; the lengths of the axes are determined by the eigenvalues of $\boldsymbol{\Sigma}$

Discriminant Functions for the Normal Density
• Minimum-error-rate classification can be achieved with the discriminant functions
  $$g_i(\mathbf{x}) = \ln p(\mathbf{x} | \omega_i) + \ln P(\omega_i)$$
• If the densities $p(\mathbf{x} | \omega_i)$ are multivariate normal, i.e., $p(\mathbf{x} | \omega_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, then we have
  $$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

Discriminant Functions for the Normal Density
• Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$
  – Features are statistically independent and each feature has the same variance
    $$g_i(\mathbf{x}) = -\frac{\| \mathbf{x} - \boldsymbol{\mu}_i \|^2}{2\sigma^2} + \ln P(\omega_i)$$
  – where $\| \cdot \|$ denotes the Euclidean norm:
    $$\| \mathbf{x} - \boldsymbol{\mu}_i \|^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^t (\mathbf{x} - \boldsymbol{\mu}_i)$$

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$ (figure)

Linear Discriminant Function
• It is not necessary to compute distances
  – Expanding the form $(\mathbf{x} - \boldsymbol{\mu}_i)^t (\mathbf{x} - \boldsymbol{\mu}_i)$ yields
    $$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2} \left[ \mathbf{x}^t \mathbf{x} - 2 \boldsymbol{\mu}_i^t \mathbf{x} + \boldsymbol{\mu}_i^t \boldsymbol{\mu}_i \right] + \ln P(\omega_i)$$
  – The term $\mathbf{x}^t \mathbf{x}$ is the same for all i and can be dropped
  – We then have the following linear discriminant function
    $$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$$

Linear Discriminant Function
where
  $$\mathbf{w}_i = \frac{1}{\sigma^2} \boldsymbol{\mu}_i$$
and
  $$w_{i0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_i^t \boldsymbol{\mu}_i + \ln P(\omega_i)$$
($w_{i0}$ is the threshold or bias for the i-th category)
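To make Case 1 concrete, here is a small sketch of my own (means, variance, and priors are made-up values) that builds the linear discriminants $g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$ from the formulas above and checks that they give the same decision as the direct distance form.

```python
import numpy as np

# Assumed parameters for Case 1 (Sigma_i = sigma^2 I); values are illustrative
means  = np.array([[0.0, 0.0], [4.0, 1.0]])   # mu_1, mu_2
sigma2 = 1.5                                  # common variance sigma^2
priors = np.array([0.7, 0.3])

# Linear form: g_i(x) = w_i^t x + w_i0
W  = means / sigma2                                           # w_i = mu_i / sigma^2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)

def g_linear(x):
    return W @ x + w0

def g_distance(x):
    # Equivalent distance form: -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i)
    return -np.sum((x - means)**2, axis=1) / (2 * sigma2) + np.log(priors)

x = np.array([1.8, 0.4])
print(np.argmax(g_linear(x)) + 1, np.argmax(g_distance(x)) + 1)  # same decision
```

The two forms differ only by the $\mathbf{x}^t \mathbf{x}$ term, which is the same for every class, so the argmax (and hence the decision) is identical.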
Linear Machine
• A classifier that uses linear discriminant functions is called a linear machine
• Its decision surfaces are pieces of hyperplanes defined by the linear equations $g_i(\mathbf{x}) = g_j(\mathbf{x})$ for the two categories with the highest posterior probabilities
• For our case this equation can be written as
  $$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0$$

Linear Machine
where
  $$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$$
and
  $$\mathbf{x}_0 = \frac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\| \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
• If $P(\omega_i) = P(\omega_j)$ then the second term vanishes and the classifier is called a minimum-distance classifier

Priors change -> decision boundaries shift (figures)

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$
• The covariance matrices for all of the classes are identical but otherwise arbitrary
• The cluster for the i-th class is centered about $\boldsymbol{\mu}_i$
• Discriminant function:
  $$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)$$
  (the prior term can be ignored if the prior probabilities are the same for all classes)

Case 2: Discriminant Function
  $$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}$$
where
  $$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i$$
and
  $$w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$$

For the 2-Category Case
• If $R_i$ and $R_j$ are contiguous, the boundary between them has the equation
  $$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0$$
where
  $$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
and
  $$\mathbf{x}_0 = \frac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln [P(\omega_i) / P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary
• In general, the covariance matrices are different for each category
• The only term that can be dropped is the $(d/2) \ln 2\pi$ term

Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary
• The discriminant functions are
  $$g_i(\mathbf{x}) = \mathbf{x}^t \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^t \mathbf{x} + w_{i0}$$
where
  $$\mathbf{W}_i = -\frac{1}{2} \boldsymbol{\Sigma}_i^{-1}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i$$
and
  $$w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

Two-Category Case
• The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, …)

Example (figure)
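As a final sketch (mine, not from the slides; the class means, covariances, and priors are made-up values), the quadratic discriminant of Case 3 can be evaluated directly from the formulas above.

```python
import numpy as np

# Assumed class parameters for Case 3 (arbitrary covariances); values are illustrative
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs   = [np.array([[2.0, 0.3], [0.3, 0.5]]), np.array([[0.7, 0.0], [0.0, 1.5]])]
priors = [0.6, 0.4]

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for the arbitrary-covariance case."""
    Sinv = np.linalg.inv(Sigma)
    Wi  = -0.5 * Sinv                                        # W_i  = -1/2 Sigma_i^-1
    wi  = Sinv @ mu                                          # w_i  = Sigma_i^-1 mu_i
    wi0 = (-0.5 * mu @ Sinv @ mu
           - 0.5 * np.log(np.linalg.det(Sigma))
           + np.log(prior))                                  # w_i0
    return x @ Wi @ x + wi @ x + wi0

def classify(x):
    g = [quadratic_discriminant(x, m, S, P) for m, S, P in zip(means, covs, priors)]
    return int(np.argmax(g)) + 1

print(classify(np.array([1.0, 2.0])))   # class with the larger quadratic discriminant
```

Because each class keeps its own covariance, the resulting decision surface is a hyperquadric rather than a hyperplane, as noted on the last slide.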