Statistical Decision Theory, Bayes Classifier
Lecture Notes for CMPUT 466/551
Nilanjan Ray

1. Supervised Learning Problem

- Input (random) variable: X; output (random) variable: Y.
- A toy example: X ~ N(\mu, \sigma^2), Y = f(X) + \epsilon, with f unknown.
- A set of realizations of X and Y is available as (input, output) pairs:

    T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}

  This set T is known as the training set.
- Supervised learning problem: given the training set, we want to estimate the value of the output variable Y for a new input value X = x_0. Essentially, we try to learn the unknown function f from T.

2. Basics of Statistical Decision Theory

We want to attack the supervised learning problem from the viewpoint of probability and statistics. Thus, let's consider X and Y as random variables with a joint probability distribution Pr(X, Y). Assume a loss function L(Y, f(X)), such as the squared (L2) loss:

    L(Y, f(X)) = (Y - f(X))^2

Choose f so that the expected prediction error is minimized:

    EPE(f) = E[L(Y, f(X))] = \int (y - f(x))^2 \Pr(dx, dy)

The minimizer is the conditional expectation:

    \hat{f}(x) = E(Y | X = x)

This is known as the regression function; in signal processing it is also known as (non-linear) filtering. So, if we knew Pr(Y|X), we could readily estimate the output. See [HTF] for the derivation of this minimization; also see [Bishop] for a different derivation, which uses a heavier machinery called the calculus of variations.

3. Loss Functions and Regression Functions

- L1 loss: L(Y, f(X)) = |Y - f(X)|. The regression function is the conditional median: \hat{f}(x) = Median(Y | X = x).
- 0-1 loss: L(Y, f(X)) = 0 if Y = f(X), and 1 otherwise. The regression function is the conditional mode: \hat{f}(x) = \arg\max_y \Pr(y | X = x).
- Exponential loss function: used in boosting.
- Huber loss function: robust to outliers.
- Your loss function: your own student award in ...

Observation: the estimator/regression function is always defined in terms of the conditional probability Pr(Y|X). But Pr(Y|X) is typically unknown to us, so what can we do?
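As a quick numerical check of the claim that each loss function picks out its own regression function, here is a small simulation. The quadratic signal and exponential noise below are made-up choices (not from the notes) so that the conditional mean and conditional median of Y given X differ:

```python
import math
import random

# Sketch: different losses favor different regression functions.
# Toy model (our assumption): Y = X^2 + E, with E ~ Exp(1), so that
# E(Y|X=x) = x^2 + 1 while Median(Y|X=x) = x^2 + ln 2.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(50000)]
ys = [x * x + random.expovariate(1.0) for x in xs]

def avg_sq_loss(predict):
    """Empirical expected squared (L2) loss of a predictor."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def avg_abs_loss(predict):
    """Empirical expected absolute (L1) loss of a predictor."""
    return sum(abs(y - predict(x)) for x, y in zip(xs, ys)) / len(xs)

mean_pred = lambda x: x * x + 1.0            # conditional mean E(Y|X=x)
median_pred = lambda x: x * x + math.log(2)  # conditional median

# Squared loss favors the conditional mean; absolute loss the median.
assert avg_sq_loss(mean_pred) < avg_sq_loss(median_pred)
assert avg_abs_loss(median_pred) < avg_abs_loss(mean_pred)
```

With a symmetric noise distribution the two predictors would coincide; the skewed exponential noise is what makes the gap visible.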
4. Create Your Own World

Example strategies:
- Try to estimate the regression function directly from the training data (structure free).
- Assume some model, or some structure on the solution, so that the EPE minimization becomes tractable (highly structured).
- Your own award-winning paper.
- Etc.

5. Nearest-Neighbor Method

Use those observations that are neighbors of the (new) input to predict:

    \hat{Y}(x) = \frac{1}{k} \sum_{i : x_i \in N_k(x)} y_i

where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample. This estimates the regression function directly from the training set:

    \hat{f}(x) = Ave(y_i | x_i \in N_k(x))

The nearest-neighbor method is also (traditionally) known as a nonparametric method.

6. Nearest-Neighbor Fit

Major limitations:
(1) Very inefficient in high dimensions.
(2) Could be unstable (wiggly).
(3) If training data is scarce, it may not be the right choice.

For the data description see [HTF], Section 2.3.3.

7. How to Choose k: Bias-Variance Trade-off

Let's try to answer the obvious question: how can we choose k? One way to get some insight about k is the bias-variance trade-off. Say the input-output model is Y = f(X) + \epsilon, with noise variance \sigma^2. The test error at x_0 is

    EPE_k(x_0) = E_T[(Y - \hat{f}_k(x_0))^2 | X = x_0]
               = \sigma^2 + Bias^2(\hat{f}_k(x_0)) + Var_T(\hat{f}_k(x_0))
               = \sigma^2 + \Big[ f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_{(l)}) \Big]^2 + \frac{\sigma^2}{k}

Here \sigma^2 is the irreducible error (the regressor has no control over it); the squared-bias term will most likely increase as k increases; and the variance term \sigma^2/k will decrease as k increases. So we can find a trade-off between bias and variance to choose k.

8. Bias-Variance Trade-off

[Figure: graphical representation of the bias-variance trade-off.]

9. Assumption on Regression Function: Linear Models

Structural assumption: the output Y is linear in the inputs X = (X_1, X_2, ..., X_p). Predict the output by

    \hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j = X^T \hat\beta

(vector notation, with a 1 included in X). We estimate \beta from the training set by least squares:

    \hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

where

    \mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}

It can be shown that EPE minimization leads to this estimate under the linear model (see the comments in [HTF], Section 2.4). So, for a new input x_0 = (x_{01}, x_{02}, \ldots, x_{0p})^T, the regression output is

    \hat{Y}(x_0) = x_0^T \hat\beta = x_0^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

10. Example of Linear Model Fit

X = (X_1, X_2); Y = (Red, Green), coded as Red = 1, Green = 0. Classification rule:

    \hat{G} = Red if \hat{Y} > 0.5, Green otherwise.

This is very stable (not wiggly) and computationally efficient. However, if the linear model assumption fails, the regression fails miserably.

11. More Regressors

Many regression techniques are based on modifications of linear models and nearest-neighbor techniques:
- Kernel methods
- Local regression
- Basis expansions
- Projection pursuit and neural networks
- ...

Chapter 2 of [HTF] gives a quick overview of them.

12. Bayes Classifier

- The key probability distribution Pr(Y|X) is called the posterior distribution in the Bayesian paradigm.
- EPE is also called Risk in Bayesian terminology; however, Risk is more general than EPE.
- We can show that the EPE is the average error of classification when the Risk uses the 0-1 penalty.
- So the Bayes classifier is the minimum-error classifier; the corresponding error is called the Bayes risk / Bayes rate, etc.
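Looking back at the least-squares estimate \hat\beta = (X^T X)^{-1} X^T y for the linear model: for a single input (p = 1) with an intercept, the closed form reduces to the familiar slope/intercept formulas. A minimal Python sketch, with made-up data points:

```python
# Made-up data, roughly y = 2x + 1 (illustrative, not from the notes)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# (X^T X)^{-1} X^T y for p = 1 with an intercept column of ones
# reduces to slope = S_xy / S_xx, intercept = y_bar - slope * x_bar:
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Regression output for a new input x0: y0_hat = x0^T beta_hat
x0 = 5.0
y0_hat = intercept + slope * x0
```

For general p one would solve the normal equations with a linear-algebra routine rather than expand them by hand; the one-dimensional case is just the smallest instance of the same estimate.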
Materials mostly based on Ch. 2 of the [DHS] book.

13. Bayes' Classifier...

We can associate probabilities with the categories (output classes) of the classifier. This probability is called the prior probability, or simply the prior:

    P_i = \Pr[H_i], \quad i = 1, 2, \ldots, M

We assign a cost C_{ij} to each possible decision outcome: H_i is true, and we choose H_j.

[Figure: a 2-D input space X partitioned into decision regions R_1, R_2, R_3, for M = 3.]

We know the conditional probability densities p(x | H_i), each also called the likelihood of the hypothesis H_i, or simply the likelihood. The total risk is

    Risk = \sum_{j=1}^{M} \sum_{i=1}^{M} C_{ij} \Pr[H_i \text{ is true}] \Pr[H_j \text{ is chosen} | H_i \text{ is true}]
         = \sum_{j=1}^{M} \sum_{i=1}^{M} C_{ij} P_i \int_{R_j} p(x | H_i)\, dx
         = \sum_{j=1}^{M} \int_{R_j} \sum_{i=1}^{M} C_{ij} P_i p(x | H_i)\, dx

14. Bayes' Classifier...

Choose the R_j's in such a way that this risk is minimized. Since the R_j's partition the entire region, any x belongs to exactly one R_j. So we can minimize the risk via the following rule to construct the partitions:

    R_k = \Big\{ x : k = \arg\min_j \sum_{i=1}^{M} C_{ij} P_i p(x | H_i) \Big\}

You should realize that the reasoning is similar to that in [HTF] while minimizing the EPE in Section 2.4.
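The rule R_k = {x : k = argmin_j \sum_i C_ij P_i p(x|H_i)} can be evaluated pointwise. The sketch below assumes two one-dimensional Gaussian classes; the priors, costs, means, and variances are illustrative choices, not from the notes:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = [0.5, 0.5]                       # P_i = Pr[H_i] (assumed equal)
likelihoods = [lambda x: gauss_pdf(x, -1.0, 1.0),   # p(x | H_1)
               lambda x: gauss_pdf(x, +1.0, 1.0)]   # p(x | H_2)
C = [[0.0, 1.0],                          # C[i][j]: cost when H_i is true
     [1.0, 0.0]]                          # and H_j is chosen (0-1 costs)

def decide(x):
    """Return the index k minimizing sum_i C_ij P_i p(x|H_i) over j."""
    risk_j = [sum(C[i][j] * priors[i] * likelihoods[i](x) for i in range(2))
              for j in range(2)]
    return min(range(2), key=lambda j: risk_j[j])

# With equal priors and 0-1 costs the decision boundary sits at x = 0
assert decide(-2.0) == 0
assert decide(2.0) == 1
```

Changing the priors or the costs shifts the boundary, which is exactly the likelihood-ratio-threshold behavior derived for the two-class case below.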
15. Minimum Error Criterion

The 0-1 criterion:

    C_{ij} = 0 if i = j, and 1 if i \neq j

Then

    R_k = \Big\{ x : k = \arg\min_j \sum_{i=1}^{M} C_{ij} P_i p(x | H_i) \Big\}
        = \Big\{ x : k = \arg\min_j \sum_{i=1, i \neq j}^{M} P_i p(x | H_i) \Big\}
        = \Big\{ x : k = \arg\max_j P_j p(x | H_j) \Big\}

(the last step holds because the full sum over i does not depend on j). We are minimizing the Risk with the 0-1 criterion; this is also known as the minimum error criterion. In this case the total risk is the probability of error. Note that

    \arg\max_j P_j p(x | H_j) = \arg\max_j \frac{P_j p(x | H_j)}{p(x)} = \arg\max_j P(H_j | x)

where P(H_j | x) is the posterior probability. This is the same result you get if you minimize the EPE with the 0-1 loss function (see [HTF], Section 2.4).

16. Two-Class Case: Likelihood Ratio Test

We can write the Risk as:

    Risk = C_{11} P_1 \int_{R_1} p(x | H_1) dx + C_{12} P_2 \int_{R_1} p(x | H_2) dx
         + C_{21} P_1 \int_{R_2} p(x | H_1) dx + C_{22} P_2 \int_{R_2} p(x | H_2) dx
         = const - \int_{R_1} \big[ (C_{21} - C_{11}) P_1 p(x | H_1) - (C_{12} - C_{22}) P_2 p(x | H_2) \big] dx

We can immediately see the rule of classification from the above: assign x to H_1 if

    P_1 p(x | H_1) (C_{21} - C_{11}) > P_2 p(x | H_2) (C_{12} - C_{22})

else assign x to H_2. Equivalently, in terms of the likelihood ratio:

    \frac{p(x | H_1)}{p(x | H_2)} > \frac{P_2 (C_{12} - C_{22})}{P_1 (C_{21} - C_{11})}

Special case, the 0-1 criterion:

    \frac{p(x | H_1)}{p(x | H_2)} > \frac{P_2}{P_1}

17. An Example: Bayes Optimal Classifier

[Figure: we know exactly the posterior distributions of the two classes, from which these decision boundaries are created.]

18. Minimax Classification

Bayes rule:

    P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)}

In many real-life applications, the prior probabilities may be unknown, so we cannot have a Bayes optimal classification. However, one may wish to minimize the worst-case overall risk. We can write the Risk as a function of P_1 and R_1:

    Risk(P_1, R_1) = C_{22} + (C_{12} - C_{22}) \int_{R_1} p(x | H_2) dx
                   + P_1 \Big[ (C_{11} - C_{22}) + (C_{21} - C_{11}) \int_{R_2} p(x | H_1) dx - (C_{12} - C_{22}) \int_{R_1} p(x | H_2) dx \Big]

Observation 1: if we fix R_1, then the Risk is a linear function of P_1.
Observation 2: the function g(P_1) = \min_{R_1} [Risk(P_1, R_1)] is concave in P_1.

19. Minimax...

Let's arbitrarily fix P_1 = a and compute

    R_1^a = \arg\min_{R_1} Risk(a, R_1)

By Observation 1, Risk(P_1, R_1^a) is the straight line AB in the figure. Q: why should it be a tangent to g(P_1)?
Claim: R_1^a is not a good classification when P_1 is not known. Why? So, what could be a better classification here? The classification corresponding to the line MN (the horizontal tangent to g(P_1) at its maximum). Why? This is the minimax solution. Why the name? Because we can reach the solution by

    \max_{P_1} \min_{R_1} [Risk(P_1, R_1)]

An aside Q: when can we reverse the order and get the same result?

Another way to solve for R_1 in the minimax problem is from the equation

    (C_{11} - C_{22}) + (C_{21} - C_{11}) \int_{R_2} p(x | H_1) dx - (C_{12} - C_{22}) \int_{R_1} p(x | H_2) dx = 0

that is, by setting the coefficient of P_1 to zero. If you get multiple solutions, choose the one that gives you the minimum Risk.

20. Neyman-Pearson Criterion

Consider a two-class problem.

[Figure: two class-conditional densities p(x | H_1) and p(x | H_2) over a 1-D input, with decision regions R_1 and R_2.]

The following four probabilities can be computed:
- Probability of detection (hit): P(x \in R_2 | H_2)
- Probability of false alarm: P(x \in R_2 | H_1)
- Probability of miss: P(x \in R_1 | H_2)
- Probability of correct rejection: P(x \in R_1 | H_1)

We do not know the prior probabilities, so Bayes optimal classification is not possible. However, we do know that the probability of false alarm must stay below a specified level. Based on this constraint (the Neyman-Pearson criterion) we can design a classifier.

Observation: maximizing the probability of detection and minimizing the probability of false alarm are conflicting goals (in general).

21. Receiver Operating Characteristics

The ROC is a plot of the probability of false alarm vs. the probability of detection.

[Figure: ROC curves for two classifiers.]

The area under the ROC curve is a measure of performance. It is also used to find an operating point for the classifier.
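The ROC curve above can be traced numerically by sweeping the decision threshold that separates R_1 from R_2. A sketch for two Gaussian class-conditional densities, where the means and common variance are illustrative assumptions:

```python
import math

def gauss_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu1, mu2, sigma = 0.0, 2.0, 1.0   # H1: noise only; H2: signal present (assumed)

points = []
for i in range(-80, 81):
    t = i / 10.0                  # decide H2 when x > t, i.e. R2 = (t, inf)
    p_fa = 1.0 - gauss_cdf(t, mu1, sigma)   # P(x in R2 | H1): false alarm
    p_d = 1.0 - gauss_cdf(t, mu2, sigma)    # P(x in R2 | H2): detection
    points.append((p_fa, p_d))

points.sort()                     # order points by false-alarm rate
# Area under the ROC curve by the trapezoid rule
auc = sum((x2 - x1) * (y1 + y2) / 2.0
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
assert 0.9 < auc < 1.0            # a 2-sigma class separation: strong detector
```

Raising the threshold t lowers both the false-alarm and detection probabilities at once, which is the conflicting-goals observation from the Neyman-Pearson slide made quantitative.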