2. Bayes Decision Theory
Prof. A.L. Yuille
Stat 231. Fall 2004.

Decisions with Uncertainty
• Bayes Decision Theory is a theory of how to make decisions in the presence of uncertainty.
• Input data x.
• Salmon y = +1, Sea Bass y = -1.
• Learn a decision rule f(x) taking values in {+1, -1}.

Decision Rule for Fish.
• Classify fish as Salmon or Sea Bass by the decision rule f(x).

Basic Ingredients.
• Assume there are probability distributions for generating the data: P(x|y=1) and P(x|y=-1).
• A loss function L(f(x),y) specifies the loss of making decision f(x) when the true state is y.
• A distribution P(y): the prior probability on y.
• Joint distribution P(x,y) = P(x|y) P(y).

Minimize the Risk
• The risk of a decision rule f(x) is R(f) = Σ_y ∫ L(f(x),y) P(x,y) dx.
• The Bayes Decision Rule f*(x) is the rule that minimizes the risk: f* = arg min_f R(f).
• The Bayes Risk is R(f*), the smallest attainable risk.

Minimize the Risk.
• Write P(x,y) = P(y|x) P(x).
• Then we can write the risk as R(f) = ∫ {Σ_y L(f(x),y) P(y|x)} P(x) dx.
• The best decision for input x is f*(x) = arg min_a Σ_y L(a,y) P(y|x).

Bayes Rule.
• Posterior distribution P(y|x) = P(x|y) P(y) / P(x).
• Likelihood function P(x|y).
• Prior P(y).
• Bayes Rule has been controversial (historically) because of the prior P(y) (subjective?).
• But in Bayes Decision Theory, everything starts from the joint distribution P(x,y).

Risk.
• The Risk is based on averaging over all possible x & y: the average loss.
• Alternatively, one can try to minimize the worst risk over x & y: the Minimax Criterion.
• This course uses the Risk, i.e. the average loss.

Generative & Discriminative.
• Generative methods aim to determine the probability models P(x|y) & P(y).
• Discriminative methods aim directly at estimating the decision rule f(x).
• Vapnik argues for discriminative methods: don't solve a harder problem than you need to. We only care about the probabilities near the decision boundaries.

Discriminant Functions.
• For the two-category case the Bayes decision rule depends on the discriminant function g(x) = log [P(x|y=1)/P(x|y=-1)] + log [P(y=1)/P(y=-1)].
• The Bayes decision rule is of the form: f*(x) = +1 if g(x) > T, and f*(x) = -1 otherwise.
• T is a threshold, which is determined by the loss function.

Two-State Case
• Detect "target" or "non-target".
• Let the loss function pay a penalty of 1 for misclassification, 0 otherwise.
• The Risk becomes the Error, and the Bayes Risk becomes the Bayes Error.
• The Error is the sum of false positives F+ (non-targets classified as targets) and false negatives F- (targets classified as non-targets).

Gaussian Example: 1
• Is a bright light flashing?
• n is the number of photons emitted by a dim or a bright light.

Gaussian Example: 2
• P(n|dim) and P(n|bright) are Gaussians with means μ_dim, μ_bright and standard deviations σ_dim, σ_bright.
• The Bayes decision rule selects "dim" if log [P(n|dim)/P(n|bright)] + log [P(dim)/P(bright)] > T, where the threshold T is set by the loss function (T = 0 for the 0-1 loss).
• Errors: the false positive and false negative rates are the areas of the two Gaussians that fall on the wrong side of the decision boundary.
• A small numerical sketch of this rule is given immediately below.
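The following is a minimal sketch (in Python with NumPy) of the Bayes decision rule for the photon-counting example above. The particular means, standard deviation, and priors are illustrative assumptions, not values from the lecture; with equal priors and a common standard deviation the rule reduces to thresholding n at the midpoint of the two means.

```python
import numpy as np

# Assumed, purely illustrative parameter values (not from the lecture).
mu_dim, mu_bright = 50.0, 70.0    # assumed mean photon counts
sigma = 10.0                      # assumed common standard deviation
p_dim, p_bright = 0.5, 0.5        # assumed prior probabilities

def log_gaussian(n, mu, sigma):
    """Log of the Gaussian density with mean mu and s.d. sigma, evaluated at n."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (n - mu)**2 / (2.0 * sigma**2)

def bayes_decide(n):
    """Bayes rule under 0-1 loss: choose the class with the larger posterior,
    i.e. the larger log-likelihood plus log-prior."""
    score_dim = log_gaussian(n, mu_dim, sigma) + np.log(p_dim)
    score_bright = log_gaussian(n, mu_bright, sigma) + np.log(p_bright)
    return "dim" if score_dim > score_bright else "bright"

# With equal priors and equal standard deviations the decision boundary is the
# midpoint (mu_dim + mu_bright) / 2 = 60 photons.
print(bayes_decide(55))   # -> dim
print(bayes_decide(65))   # -> bright
```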
Example: Multidimensional Gaussian Distributions.
• Suppose the two classes have Gaussian distributions for P(x|y), with different means but the same covariance Σ.
• The discriminant function is then linear in x (of the form w·x + b, with w = Σ^{-1}(μ_{+1} - μ_{-1})), so the decision boundary is a plane.
• Alternatively, seek a planar decision rule without attempting to model the distributions.
• We only care about the data near the decision boundary.

Generative vs. Discriminative.
• The generative approach attempts to estimate the Gaussian distributions from data, and then derives the decision rule.
• The discriminative approach seeks to estimate the decision rule directly by learning the discriminant plane.
• In practice, we will not know the form of the distributions or the form of the discriminant.

Gaussian.
• Gaussian case with unequal covariances: the decision boundary is no longer a plane but a quadratic surface.

Discriminative Models & Features.
• In practice, discriminative methods are usually defined on features extracted from the data (e.g. the length and brightness of a fish).
• Calculate features z = h(x).
• Bayes Decision Theory says that this throws away information.
• It restricts us to a sub-class of possible decision rules: those that can be expressed in terms of the features z = h(x).

Bayes Decision Rule and Learning.
• Bayes Decision Theory assumes that we know, or can learn, the distributions P(x|y).
• This is often not practical, or extremely difficult.
• In real problems, you have a set of classified data {(x_i, y_i): i = 1, ..., N}.
• You can attempt to learn P(x|y=+1) & P(x|y=-1) from these data (next few lectures).
• There are parametric & non-parametric approaches.
• Question: when do you have enough data to learn these probabilities accurately?
• This depends on the complexity of the model.

Machine Learning.
• Replace the Risk by the Empirical Risk R_emp(f) = (1/N) Σ_{i=1}^N L(f(x_i), y_i).
• How does minimizing the empirical risk relate to minimizing the true risk?
• Key issue: when can we generalize? That is, be confident that the decision rule we have learnt on the training data will yield good results on unseen data.

Machine Learning
• Vapnik's theory gives a mathematically elegant way of addressing these issues.
• It assumes that the data are sampled from an unknown distribution.
• Vapnik's theory gives bounds for when we can generalize.
• Unfortunately these bounds are very conservative.
• In practice, train on part of the dataset and test on the other part(s); see the sketch at the end of these notes.

Extensions to Multiple Classes
• Conceptually straightforward; see Duda, Hart & Stork.
• The decision rule partitions the feature space into k subspaces Ω_1, ..., Ω_k, with ∪_{i=1}^k Ω_i covering the whole feature space and Ω_i ∩ Ω_j = ∅ for i ≠ j.
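The following is a minimal sketch (in Python with NumPy) of the empirical-risk and train-and-test idea from the Machine Learning slides: a simple generative plug-in rule is fitted on part of a dataset and its average 0-1 loss is measured on the held-out part. The synthetic two-Gaussian dataset, the nearest-mean classifier, and the 300/100 split are illustrative assumptions, not the lecture's own example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data: two 2-D Gaussian classes with equal covariance,
# labels y in {+1, -1}.
n_per_class = 200
x_pos = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(n_per_class, 2))
x_neg = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(n_per_class, 2))
x = np.vstack([x_pos, x_neg])
y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])

# Split into a training part and a held-out test part.
perm = rng.permutation(len(y))
train_idx, test_idx = perm[:300], perm[300:]

# Generative "plug-in" rule: estimate the two class means from the training
# data and classify each point by its nearer mean (equal covariance, equal
# priors), which gives a planar decision boundary.
mean_pos = x[train_idx][y[train_idx] == 1].mean(axis=0)
mean_neg = x[train_idx][y[train_idx] == -1].mean(axis=0)

def f(xi):
    return 1.0 if np.sum((xi - mean_pos) ** 2) < np.sum((xi - mean_neg) ** 2) else -1.0

def empirical_risk(idx):
    """Average 0-1 loss (error rate) over the indexed examples."""
    preds = np.array([f(xi) for xi in x[idx]])
    return np.mean(preds != y[idx])

print("training error:", empirical_risk(train_idx))
print("held-out error:", empirical_risk(test_idx))
```

Comparing the two printed error rates is the practical check on generalization: a training error much lower than the held-out error signals that the learnt rule does not transfer to unseen data.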