9.913 Pattern Recognition for Vision
Class IV, Part I – Bayesian Decision Theory
Yuri Ivanov, Fall 2004

TOC
• Roadmap to Machine Learning
• Bayesian Decision Making
  – Minimum Error Rate Decisions
  – Minimum Risk Decisions
  – Minimax Criterion
  – Operating Characteristics

Notation
x – scalar variable
\mathbf{x} – vector variable, sometimes written x when clear from context
p(x) – probability density function (continuous variable)
P(x) – probability mass function (discrete variable)
p(a, b, c, d \mid e, f, g, h) – conditional density: a density over the variables before the bar, viewed as a function of the variables after the bar
\int f(\mathbf{x})\,d\mathbf{x} \equiv \int f(x_1, x_2, \dots, x_n)\,dx_1\,dx_2 \dots dx_n \equiv \int\!\int\!\cdots\!\int f(x_1, x_2, \dots, x_n)\,dx_1\,dx_2 \dots dx_n
A \underset{\omega_2}{\overset{\omega_1}{\gtrless}} B – if A > B then the answer is ω1, otherwise ω2

Machine Learning
(diagram: the Generator G produces x, the Supervisor S produces y from x, and the Learning Machine LM outputs \hat{y} = f(\mathbf{x}, w))
G – Generator, or Nature: implements p(x)
S – Supervisor: implements p(y | x)
LM – Learning Machine: implements f(x, w)

Loss and Risk
L(y, f(\mathbf{x}, w)) – loss function: how much penalty we get for deviations from the true y
R(w) = \int L(y, f(\mathbf{x}, w))\, p(\mathbf{x}, y)\,d\mathbf{x}\,dy – expected risk: how much penalty we get on average
The goal of learning is to find f(x, w*) such that R(w*) is minimal.

Quick Illustration
What does it mean?
R(w) = \int\!\int L(y, f(x, w))\, p(x, y)\,dx\,dy
From basic probability: p(x, y) = p(y \mid x)\, p(x)
If there is no noise: p(y \mid x) = \delta(y - g(x)), so
R(w) = \int L(g(x), f(x, w))\, p(x)\,dx

Illustration (cont.)
So, R(w) = \int L(g(x), f(x, w))\, p(x)\,dx
(figure: the target y = g(x) versus the approximation f(x, w))

Example
Classification. Measurements: x \in [-1, 1]; labels: y \in \{0, 1\}.
(figure: the joint density p(x, y) over x for y = 0 and y = 1, with 0 \le p(x, y) \le 1)
Find f(x, w*) such that R(w*) is minimal.

Example
Let's choose f(x, a) = H(x, a) – a step function with threshold a.
L(y, f(x, a)) = 1 - \delta(y - H(x, a)) – the loss adds 1 for every mistake.
R(a) = \sum_{Y=0}^{1} \int \left[ 1 - \delta(Y - H(x, a)) \right] p(x, y = Y)\,dx

Learning in Reality
Fundamental problem: where do we get p(x, y)?
What we want: R(w) = \int L(y, f(\mathbf{x}, w))\, p(\mathbf{x}, y)\,d\mathbf{x}\,dy
Approximate: estimate the risk functional by averaging the loss over the observed (training) data.
What we get: R(w) \leftarrow R_e(w) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, w))
Replace the expected risk with the empirical risk.
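As a purely illustrative sketch of this approximation, the snippet below evaluates the empirical risk of the step-function classifier from the example above on a small synthetic training set. The data-generating rule, the noise level, the sample size, and all names are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

# Synthetic training data (assumed for illustration): x in [-1, 1],
# labels come from an unknown rule g(x) plus some label noise standing in for p(y | x).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = (x > 0.1).astype(int)                  # "nature's" rule, unknown to the learner
flip = rng.random(n) < 0.1                 # 10% label noise
y = np.where(flip, 1 - y, y)

def f(x, a):
    """Step-function classifier H(x, a): predict 1 when x exceeds the threshold a."""
    return (x > a).astype(int)

def empirical_risk(a, x, y):
    """R_e(a) = (1/n) * sum of the 0-1 loss L(y_i, f(x_i, a))."""
    return np.mean(y != f(x, a))

# Minimize the empirical risk over a grid of candidate thresholds.
thresholds = np.linspace(-1.0, 1.0, 201)
risks = [empirical_risk(a, x, y) for a in thresholds]
a_star = thresholds[int(np.argmin(risks))]
print(f"best threshold a* = {a_star:.2f}, empirical risk = {min(risks):.3f}")
```

A grid search is used only because the hypothesis class has a single scalar parameter; the point is the averaging of the loss, which is exactly the empirical-risk recipe stated above.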
What We Call Learning
Taxonomies of machine learning:
• by source of evaluation – supervised, transductive, unsupervised, reinforcement
• by inductive principle – ERM, SRM, MDL, Bayesian estimation
• by objective – classification, regression, vector quantization and density estimation

Taxonomy by Evaluation Source
• Supervised (classification, regression). Evaluation source – the immediate error vector; that is, we get to see the true y.
• Transductive. Evaluation source – the immediate error vector for SOME of the data.
• Unsupervised (clustering, density estimation). Evaluation source – an internal metric; we do not get to see the true y.
• Reinforcement. Evaluation source – the environment; we get to see some scalar value (possibly delayed) that is in some way related to whether the label we chose was correct.

Taxonomy by Inductive Principle
• Empirical Risk Minimization (ERM):
  \min_{w} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, w)), \quad f(\mathbf{x}_i, w) \in \mathcal{F}
• Structural Risk Minimization (SRM):
  \min_{w, h} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, w)) + \Phi(h) \right), \quad f(\mathbf{x}_i, w) \in \mathcal{F}_k, \quad \mathcal{F}_1 \subset \mathcal{F}_2 \subset \dots \subset \mathcal{F}_k \subset \dots
  where the penalty \Phi(h) grows with the "complexity" h of the subset used.
• Minimum Description Length (MDL):
  \ell(D, H) = \ell(D \mid H) + \ell(H) \;\Rightarrow\; \min_{w} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, w)) + \Phi(w) \right)
• Bayesian estimation:
  P(\mathbf{x} \mid C) = \int P(\mathbf{x} \mid \theta)\, p(\theta \mid C)\,d\theta \;\Rightarrow\; \min_{w} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, w)) + \Phi(P(w)) \right)

Taxonomy by Learning Objective
• Classification: L(y, f(\mathbf{x}, w)) = 1 - \delta(y, f(\mathbf{x}, w))
• Regression: L(y, f(\mathbf{x}, w)) = (y - f(\mathbf{x}, w))^2
• Density estimation: L(f(\mathbf{x}, w)) = -\log f(\mathbf{x}, w)
• Clustering / vector quantization: L(f(\mathbf{x}, w)) = (\mathbf{x} - f(\mathbf{x}, w))^{\top} (\mathbf{x} - f(\mathbf{x}, w))

The Lay of the Land
find f (or w) = \arg\min_{f\ (\text{or}\ w)} \left( \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, w)) + \Phi \right)
Objective: classification, regression, density estimation, vector quantization/clustering
Evaluation: supervised, transductive, unsupervised, reinforcement
Inductive principle: ERM, SRM, MDL, Bayesian regularization

Class Priors
Making a decision about an observation x means finding a rule that says: if x is in region A, decide a; if x is in region B, decide b; and so on.
\omega \in \{\omega_i\}_{i=1}^{C} – the state of nature
P(\omega) – prior probability; shorthand P(\omega_i) \equiv P(\omega = \omega_i); \sum_{i=1}^{C} P(\omega_i) = 1
Poor man's decision rule: decide ω1 if P(\omega_1) > P(\omega_2), otherwise ω2.

Class-Conditional Density
p(\mathbf{x} \mid \omega) – class-conditional density function: the density of x given that nature is in state ω
\int_{-\infty}^{\infty} p(x \mid \omega_i)\,dx = 1
How do we decide which class x came from? Deciding on the class-conditional alone is not fair – it ignores the priors.
(figure: two class-conditional densities over x)

Joint Density
p(\mathbf{x}, \omega) = p(\mathbf{x} \mid \omega)\, P(\omega) – joint density function
(figure: the joint densities for P(\omega_1) = 0.2, P(\omega_2) = 0.8)
Good – it is fair.
Bad – not very convenient. It relates to the measurement probabilistically; we need a function of x!
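To make the "fair" comparison concrete, here is a minimal sketch comparing the joint densities p(x, \omega_i) = p(x \mid \omega_i) P(\omega_i) for two classes; dividing each joint by p(x) gives the posteriors introduced on the next slide. The Gaussian class-conditionals, the priors, and all names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities and priors (illustration only).
priors = {"w1": 0.2, "w2": 0.8}
cond = {"w1": norm(loc=-1.0, scale=1.0), "w2": norm(loc=+1.0, scale=1.0)}

def decide(x):
    """Pick the class with the larger joint density p(x, w) = p(x | w) P(w)."""
    joint = {w: cond[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(joint.values())                  # p(x), used here only to report posteriors
    posteriors = {w: joint[w] / evidence for w in joint}
    return max(joint, key=joint.get), posteriors

for x in (-1.5, 0.0, 1.5):
    label, post = decide(x)
    print(f"x = {x:+.1f} -> decide {label}, "
          + ", ".join(f"P({w}|x) = {p:.2f}" for w, p in post.items()))
```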
Bayes Rule and Posterior
We want a probability of ω for each value of x: P(\omega \mid x).
Note that p(x, \omega) = p(x \mid \omega)\, P(\omega) = P(\omega \mid x)\, p(x), which gives Bayes rule:
P(\omega \mid x) = \frac{p(x \mid \omega)\, P(\omega)}{p(x)}
posterior = likelihood × prior / evidence
(figure: the posterior as a continuum of binary distributions over x)

Marginalization
Bayes rule converts the prior into the posterior by using the measurement:
P(\omega \mid x) = \frac{p(x \mid \omega)\, P(\omega)}{p(x)}
What is p(x)?
p(x) = \sum_{i=1}^{C} p(x, \omega_i) = \sum_{i=1}^{C} p(x \mid \omega_i)\, P(\omega_i)
Or, more generally, p(x) = \int p(x, y)\,dy – marginalization.

Making Decisions
With posteriors we can compare class probabilities. Intuitively:
\omega = \arg\max_{i} P(\omega_i \mid x)
What is the probability of error?
P(\text{error} \mid x) = P(\omega_2 \mid x) if we choose ω1, and P(\omega_1 \mid x) if we choose ω2
P(\text{error}) = \int P(\text{error}, x)\,dx = \int P(\text{error} \mid x)\, p(x)\,dx

Errors
How bad is a decision threshold T*?
P(e) = P(x \in R_1, \omega_2) + P(x \in R_2, \omega_1)
Recall that P(a \in A) = \int_{a \in A} p(a)\,da, so
P(e) = \int_{R_1} p(x, \omega_2)\,dx + \int_{R_2} p(x, \omega_1)\,dx = \int_{R_1} p(x \mid \omega_2)\, P(\omega_2)\,dx + \int_{R_2} p(x \mid \omega_1)\, P(\omega_1)\,dx

Bayes Decision Rule
To minimize P(\text{error}) = \int P(\text{error} \mid x)\, p(x)\,dx we need to make P(error | x) as small as possible for every x:
\min P(\text{error}) = \int \min\left[ P(\text{error} \mid x) \right] p(x)\,dx = \int \min\left[ P(\omega_1 \mid x),\, P(\omega_2 \mid x) \right] p(x)\,dx – the Bayes error
Bayes decision rule: decide ω1 if P(\omega_1 \mid x) > P(\omega_2 \mid x); otherwise decide ω2.

Bayes Decision Rule
For the Bayes decision rule (back to the first example):
R1 – the region where we always choose ω1
R2 – the region where we always choose ω2

Loss Function
\{\omega_1, \omega_2, \dots, \omega_c\} – set of classes
\{\alpha_1, \alpha_2, \dots, \alpha_a\} – set of actions
L(\alpha_i \mid \omega_j) – loss function: the penalty sustained for taking action αi when the state of nature is ωj. We make this up.
For \mathbf{x} \in \mathbb{R}^d, the conditional risk for taking action αi is the expected (average) loss for classifying ONE x:
R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} L(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})

Bayes Risk
We will see a lot of x's. To see how well we do, we average again:
R = \int R(\alpha_i \mid \mathbf{x})\, p(\mathbf{x})\,d\mathbf{x} = \int \left[ \sum_{j=1}^{c} L(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}) \right] p(\mathbf{x})\,d\mathbf{x} = \int \left[ \sum_{j=1}^{c} L(\alpha_i \mid \omega_j)\, p(\omega_j, \mathbf{x}) \right] d\mathbf{x}
This is exactly the expression for the expected risk from before.
Similarly to the earlier argument about P(error):
\min R = \int \min_{i} \left[ R(\alpha_i \mid \mathbf{x}) \right] p(\mathbf{x})\,d\mathbf{x} = R^{*} – the Bayes risk

Quick Summary
L(\alpha_i \mid \omega_j) – loss
R(\alpha_i \mid \mathbf{x}) = E_{\omega \mid \mathbf{x}}\left[ L(\alpha_i \mid \omega_j) \right] – conditional risk (expected loss)
R = E_{\mathbf{x}}\left[ R(\alpha_i \mid \mathbf{x}) \right] – total risk (expected conditional risk)
R^{*} = \min R – Bayes risk (minimum risk)
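As a small numerical companion to these definitions, the sketch below evaluates the conditional risk R(\alpha_i \mid x) = \sum_j L(\alpha_i \mid \omega_j) P(\omega_j \mid x) for one observation and picks the minimum-risk action, which is exactly the recipe used on the next slide. The loss matrix and posterior values are assumed for illustration.

```python
import numpy as np

# Assumed loss matrix (illustration only): rows = actions alpha_i, columns = states omega_j,
# entry L[i, j] = L(alpha_i | omega_j).
L = np.array([[0.0, 2.0],
              [1.0, 0.0]])
posterior = np.array([0.3, 0.7])          # P(omega_1 | x), P(omega_2 | x) for one observed x

# Conditional risk R(alpha_i | x) = sum_j L(alpha_i | omega_j) * P(omega_j | x).
cond_risk = L @ posterior
best_action = int(np.argmin(cond_risk))   # the minimum-risk action for this x
print("R(alpha_i | x) =", cond_risk, "-> choose alpha_%d" % (best_action + 1))
```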
Minimum Risk Classification
Two classes and a simple action:
L = \begin{pmatrix} -1 & 0.2 \\ 3 & 0 \end{pmatrix}, \quad \lambda_{ij} = L(\alpha_i \mid \omega_j)
\alpha_i – decide to choose ωi; \omega \in \{\omega_1, \omega_2\}
Then:
R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)
R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)
Obvious decision – decide in favor of the class with the minimal risk: decide ω1 if R(\alpha_1 \mid x) < R(\alpha_2 \mid x), otherwise ω2.

Likelihood Ratio Test
Rewriting the R(\alpha_i \mid x)'s, decide ω1 when:
(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid x)
\Rightarrow (\lambda_{21} - \lambda_{11})\, p(x \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(x \mid \omega_2)\, P(\omega_2)
\Rightarrow \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\, P(\omega_2)}{(\lambda_{21} - \lambda_{11})\, P(\omega_1)}
The left side involves only the "class models", the right side only the losses and "class priors" – the likelihood ratio test.

LRT Example
• You are driving to Blockbuster's to return a video due today.
• It is 5 minutes to midnight.
• You hit a red light.
• You see a car that you are 60% sure looks like a police car.
• The traffic fine is $5 AND you are late.
• Blockbuster's fine is $10.
Should you run the red light?

Minimum Risk
P(\text{police} \mid x) = 0.6, \quad P(\neg\text{police} \mid x) = 0.4
What you pay:
            police   not police
  run        $15        $0
  not run    $10        $10
L = \begin{pmatrix} 15 & 0 \\ 10 & 10 \end{pmatrix}
R(\text{run} \mid x) = \lambda_{11} P(\text{police} \mid x) + \lambda_{12} P(\neg\text{police} \mid x) = \$9
R(\text{wait} \mid x) = \lambda_{21} P(\text{police} \mid x) + \lambda_{22} P(\neg\text{police} \mid x) = \$10
The risk is higher if you wait.

The LRT Way
Let's say we have this "policeness" feature. The decision threshold splits the feature axis into OK and WAIT regions:
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(\lambda_{12} - \lambda_{22})\, P(\omega_2)}{(\lambda_{21} - \lambda_{11})\, P(\omega_1)}
(figure: class-conditional densities of the "policeness" feature with the decision threshold)

LRT Example
The threshold is dependent on the priors: changing P(\omega_1) and P(\omega_2) shifts the OK/WAIT boundary.
(figure: the same densities with two different priors and the resulting thresholds)

LRT Example
The threshold is dependent on the loss, for example
L = \begin{pmatrix} 15 & -60 \\ 10 & 10 \end{pmatrix} \quad \text{versus} \quad L = \begin{pmatrix} 15 & 0 \\ 10 & 10 \end{pmatrix}
(figure: the resulting OK/WAIT thresholds for the two loss matrices)

Minimum Error Rate Classification
Let's simplify the minimum-risk classification with the zero-one loss, which just counts errors:
L = \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{pmatrix}
Then the conditional risk becomes:
R(\alpha_i \mid x) = \sum_{j=1}^{c} L(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \ne i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)
So the ωi with the highest posterior minimizes the risk:
decide ωi if P(\omega_i \mid x) > P(\omega_j \mid x) for all j ≠ i – the good ol' Bayes decision rule.

Minimax Criterion
Is there a decision rule such that the risk is insensitive to the priors?
R = \int_{R_1} \left[ \lambda_{11}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{12}\, p(x \mid \omega_2) P(\omega_2) \right] dx + \int_{R_2} \left[ \lambda_{21}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{22}\, p(x \mid \omega_2) P(\omega_2) \right] dx
After some algebra (using P(\omega_2) = 1 - P(\omega_1)):
R(P(\omega_1)) = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x \mid \omega_2)\,dx + P(\omega_1)\, f(T)
where f(T) = (\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{R_2} p(x \mid \omega_1)\,dx - (\lambda_{12} - \lambda_{22}) \int_{R_1} p(x \mid \omega_2)\,dx – the minimax risk.
Goal – find the decision boundary T such that f(T) = 0. At the boundary for which the minimum risk is maximal, the risk is independent of the priors.
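The minimax boundary can be found numerically. Below is a minimal sketch under assumptions not taken from the lecture: one-dimensional Gaussian class-conditionals, the zero-one loss, and the convention that R1 = {x < T} decides ω1. With the zero-one loss, f(T) = 0 reduces to making the two conditional error probabilities equal.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Assumed one-dimensional Gaussian class-conditionals (illustration only).
p1 = norm(loc=-1.0, scale=1.0)    # p(x | omega_1), decided in R1 = {x < T}
p2 = norm(loc=+2.0, scale=1.0)    # p(x | omega_2), decided in R2 = {x >= T}

def f(T):
    """Zero-one loss: f(T) = P(x in R2 | omega_1) - P(x in R1 | omega_2)."""
    return p1.sf(T) - p2.cdf(T)

# The minimax boundary is the root of f; there the risk no longer depends on P(omega_1).
T_minimax = brentq(f, -10.0, 10.0)
risk = p1.sf(T_minimax)           # equal error probabilities = minimax risk under 0-1 loss
print(f"minimax threshold T = {T_minimax:.3f}, risk = {risk:.3f} for any prior")
```

For these symmetric assumptions the root lands midway between the two means, which is an easy sanity check on the numerics.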
Messy Illustration of the Minimax Solution
(figure: the Bayes risk R(P(\omega_1)) as a function of the prior P(\omega_1) \in [0, 1], together with the line R_{mm} + P(\omega_1)\, f(T))
• The line R_{mm} + P(\omega_1)\, f(T) cannot be smaller than the Bayes risk.
• The Bayes risk is concave down in P(\omega_1).
• For a boundary T with f(T) \ne 0 the risk is a sloped line that touches the Bayes-risk curve at one prior and exceeds it elsewhere.
• For the boundary T with f(T) = 0 the line is horizontal: the worst possible Bayes risk, independent of the priors.

Discriminant Functions and Decision Surfaces
Discriminant functions conveniently represent classifiers:
C = \{g_1(\mathbf{x}), g_2(\mathbf{x}), \dots, g_n(\mathbf{x})\}, \quad \text{decide } \omega_i: \; i = \arg\max_i g_i(\mathbf{x})
E.g.:
g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})
g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})
g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)
Discriminants DO NOT have to relate to probabilities.

Discriminant for Binary Classification
For a two-class problem: g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})
Then (assuming that the classes are encoded as -1 and +1): \omega_i = \operatorname{sign}(g(\mathbf{x}))
E.g.:
g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})
g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}

Boundary for Two Normal Distributions
If we assume a Gaussian for the class model:
p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right)
… and the minimum error rate classifier:
g_i(\mathbf{x}) = \ln\left[ p(\mathbf{x} \mid \omega_i)\, P(\omega_i) \right] = -\tfrac{1}{2} (\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i) - \ln\left[ (2\pi)^{d/2} |\Sigma_i|^{1/2} \right] + \ln P(\omega_i)

Boundary Between Two Normal Distributions
The discriminant (after some algebra):
g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = \mathbf{x}^{\top} W \mathbf{x} + \mathbf{w}^{\top} \mathbf{x} + w_0 – quadratic
where
W = -\tfrac{1}{2} \left( \Sigma_1^{-1} - \Sigma_2^{-1} \right) – a matrix
\mathbf{w} = \Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2 – a vector
w_0 = \dots well, the rest of it – a scalar
Special cases:
1) \Sigma_1 = \Sigma_2 \Rightarrow W = 0 \Rightarrow g(\mathbf{x}) is linear
2) \Sigma_i = \sigma_i I \Rightarrow W = \left( \frac{1}{2\sigma_2} - \frac{1}{2\sigma_1} \right) I = sI \Rightarrow the boundary g(\mathbf{x}) = 0 is a circle

Evaluating Decisions
Is a 70% classification rate good or bad?
(figure: two pairs of class-conditional densities – for one pair 70% = BAD, for the other 70% = GOOD)
Discriminability: d' = \frac{|\mu_1 - \mu_2|}{\sigma}
A high d' means that the classes are easy to discriminate.

ROC
We do not know \mu_1, \mu_2, \sigma, but we can get:
P(x > x^{*} \mid x \in \omega_2) – probability of a hit
P(x > x^{*} \mid x \in \omega_1) – probability of a false alarm
P(x < x^{*} \mid x \in \omega_2) – probability of a miss
P(x < x^{*} \mid x \in \omega_1) – probability of a correct rejection
Each x* corresponds to a point on the hit / false-alarm plane. This is called an ROC curve.
(figure: an ROC curve – hit rate versus false-alarm rate as x* varies)

Reality
In practice, the ROC is built for a single parameter. Using data for which the true ω is known:
• Identify a parameter of interest
• Identify the parameter range
• Vary the parameter within the range
• Compute P(hit) and P(false alarm) empirically for each value
It tells us how well the classifier can deal with the data set.

Homework – Part I
• A "chance" puzzle
  – Try to solve it
  – Understand the solution
  – Simulate it in Matlab
• Build an ROC curve
  – Almost like in class
  – Can you implement it efficiently?
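Related to the second homework item, here is a minimal sketch of the empirical ROC procedure described on the "Reality" slide: sweep a threshold over scores whose true ω is known and record the hit and false-alarm rates. The synthetic score distributions, the sample sizes, the function names, and the area-under-the-curve summary at the end are assumptions added for illustration, not part of the lecture.

```python
import numpy as np

def empirical_roc(scores, labels):
    """Sweep a threshold x* over the scores and return (false-alarm rate, hit rate) pairs.

    labels: 1 for the "target" class (omega_2 in the lecture's notation), 0 otherwise.
    A sample is declared a target when its score exceeds the threshold.
    """
    thresholds = np.unique(scores)
    far, hit = [], []
    for t in thresholds:
        declared = scores > t
        hit.append(np.mean(declared[labels == 1]))   # P(x > x* | x in omega_2)
        far.append(np.mean(declared[labels == 0]))   # P(x > x* | x in omega_1)
    return np.array(far), np.array(hit)

# Assumed synthetic data: two Gaussian score distributions with known labels.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.0, 1.0, 500),      # omega_1 ("noise")
                         rng.normal(1.5, 1.0, 500)])     # omega_2 ("signal")
labels = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

far, hit = empirical_roc(scores, labels)
order = np.argsort(far)
auc = np.trapz(hit[order], far[order])                   # area under the ROC curve
print(f"ROC has {len(far)} points, AUC = {auc:.3f}")
```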