Evaluating Classifiers
Lecture 2
Instructor: Max Welling
Read chapter 5

Evaluation
Given:
• A hypothesis h(x): X → C, in hypothesis space H, mapping attributes x to classes c = [1, 2, ..., C].
• A data sample S(n) of size n.
Questions:
• What is the error of "h" on unseen data?
• If we have two competing hypotheses, which one is better on unseen data?
• How do we compare two learning algorithms in the face of limited data?
• How certain are we about our answers?

Sample and True Error
We can define two errors:
1) Error(h|S) is the error on the sample S:
error(h|S) = \frac{1}{n} \sum_{i=1}^{n} \delta(h(x_i) \neq y_i)
2) Error(h|P) is the true error on unseen data sampled from the distribution P(x):
error(h|P) = \int dx \, P(x) \, \delta(h(x) \neq f(x))
where f(x) is the true hypothesis.

Binomial Distributions
• Assume you toss a coin n times.
• It has probability p of coming up heads (which we will call success).
• What is the probability distribution governing the number of heads in n trials?
• Answer: the Binomial distribution.
P(\#\text{heads} = r \mid p, n) = \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r}

Distribution over Errors
• Consider some hypothesis h(x).
• Draw n samples X ~ P(X).
• Do this k times.
• Compute e1 = n·error(h|X1), e2 = n·error(h|X2), ..., ek = n·error(h|Xk).
• {e1, ..., ek} are samples from a Binomial distribution!
• Why? Imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then He draws x ~ P(x) and observes whether h(x) predicts the label correctly. If it does not, He makes sure the coin lands heads up...
• You have a single sample S, for which you observe e(S) errors. What would be a reasonable estimate for Error(h|P), do you think?

Binomial Moments
mean(r) = E[r \mid n, p] = np
var(r) = E[(r - E[r])^2] = np(1-p)
• If we match the mean, np, with the observed value n·error(h|S) we find:
E[error(h|P)] = E[r/n] = p \approx error(h|S)
• If we match the variance we can obtain an estimate of the width:
var[error(h|P)] = var[r/n] \approx \frac{error(h|S)(1 - error(h|S))}{n}

Confidence Intervals
• We would like to state: with N% confidence we believe that error(h|P) is contained in the interval
error(h|P) \in error(h|S) \pm z_N \sqrt{\frac{error(h|S)(1 - error(h|S))}{n}}
e.g. for N = 80%, z_{0.80} = 1.28 (from the Normal(0,1) distribution).
• In principle z_N is hard to compute exactly, but for np(1-p) > 5 or n > 30 it is safe to approximate a Binomial by a Gaussian, for which we can easily compute "z-values".

Bias-Variance
• The estimator is unbiased if E[error(h|X)] = p.
• Imagine again you have infinitely many sample sets X1, X2, ... of size n.
• Use these to compute estimates E1, E2, ... of p, where Ei = error(h|Xi).
• If the average of E1, E2, ... converges to p, then error(h|X) is an unbiased estimator.
• Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer?
(figure: sampling distributions of the estimators, with p and E_av marked)

Flow of Thought
• Determine the property you want to know about the future data (e.g. error(h|P)).
• Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)).
• Determine the distribution P(E) of E under the assumption that you have infinitely many sample sets X1, X2, ... of some size n (e.g. P(E) = Binomial(p, n), p = error(h|P)).
• Estimate the parameters of P(E) from an actual data sample S (e.g. p = error(h|S)).
• Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution (sums of random variables converge to normal distributions: the central limit theorem).
• State your confidence interval as: with confidence N%, error(h|P) is contained in the interval Y = mean \pm z_N \sqrt{var}.
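To make the recipe above concrete, here is a minimal Python sketch (not part of the lecture) that computes error(h|S) and its N% confidence interval under the Gaussian approximation to the Binomial. The function name error_confidence_interval, the array inputs, and the use of scipy.stats.norm to obtain z_N are my own illustrative choices.

    import numpy as np
    from scipy.stats import norm

    def error_confidence_interval(y_pred, y_true, confidence=0.80):
        """Sample error error(h|S) and a two-sided N% interval for error(h|P).

        Uses the Gaussian approximation to the Binomial, which is reasonable
        when n * e * (1 - e) > 5 or n > 30 (see the Confidence Intervals slide).
        """
        y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
        n = len(y_true)
        e = float(np.mean(y_pred != y_true))       # error(h|S)
        z = norm.ppf(0.5 + confidence / 2.0)       # z_N, e.g. 1.28 for N = 80%
        width = z * np.sqrt(e * (1.0 - e) / n)
        return e, (e - width, e + width)

    # Example: with 13 errors on 100 held-out points,
    # error_confidence_interval(y_pred, y_true) gives roughly 0.13 +/- 0.043.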
Assumptions
• We only consider discrete-valued hypotheses (i.e. classes).
• Training data and test data are drawn IID from the same distribution P(x). (IID: independently & identically distributed.)
• The hypothesis must be chosen independently from the data sample S!
Homework: Consider training a classifier h(x) on data S. Argue why the third assumption is violated.

Evaluating Learned h(x)
• If we learn h(x) from a sample S, it is a bad idea to evaluate it on the same data. (You will be overly confident of yourself.)
• Instead, follow this procedure:
• Split the data into k subsets s1, ..., sk of size n_i = n/k.
• Learn a hypothesis on the complement S - s_i.
• Compute error(h|s_i) on the left-out subset s_i.
• After you have finished, compute the total error as:
error(h|P) \approx error(h|S) = \frac{1}{n} \sum_{i=1}^{k} n_i \, error(h|s_i)
• Compute the variance as:
var[error(h|P)] \approx \frac{error(h|S)(1 - error(h|S))}{n}
• This "avoids" violating assumption 3.

Comparing 2 Hypotheses
• Assume we would like to compare two hypotheses h1 and h2, which we have tested on two independent samples S1 and S2 of size n1 and n2.
• I.e. we are interested in the quantity:
d = error(h1|P) - error(h2|P) = ?
• Define an estimator for d:
\hat{d} = error(h1|X1) - error(h2|X2)
with X1, X2 sample sets of size n1, n2.
• Since error(h1|S1) and error(h2|S2) are both approximately Normal, their difference is approximately Normal with:
mean = error(h1|S1) - error(h2|S2)
var = \frac{error(h1|S1)(1 - error(h1|S1))}{n1} + \frac{error(h2|S2)(1 - error(h2|S2))}{n2}
• Hence, with N% confidence we believe that d is contained in the interval:
d \in mean \pm z_N \sqrt{var}
• Say mean = -0.1, sqrt(var) = 0.5, z_{0.80} = 1.28. Do you want to conclude that h1 is significantly better than h2?

Paired Tests
• It is more likely that we want to compare two hypotheses on the same data.
• E.g. say we have a single data set S and two hypotheses h1, h2. Split the data again into subsets s1, ..., sk:
error(h1|s1) = 0.1    error(h2|s1) = 0.13
error(h1|s2) = 0.2    error(h2|s2) = 0.21
error(h1|s3) = 0.66   error(h2|s3) = 0.68
error(h1|s4) = 0.45   error(h2|s4) = 0.47
... and so on.
• We have var(error(h1)) = large and var(error(h2)) = large. However, h1 is consistently better than h2.
• We should look at the per-subset differences error(h1|s_i) - error(h2|s_i), not at the overall difference error(h1|S) - error(h2|S).
• Problem: we cannot use
var = \frac{error(h1|s_i)(1 - error(h1|s_i))}{n_i} + \frac{error(h2|s_i)(1 - error(h2|s_i))}{n_i}
because that assumes the errors are independent. Since they are estimated on the same data, they are not independent.
• Solution: a "paired t-test" (next slide).

Paired t-test
• Chunk the data up into subsets s1, ..., sk with |s_i| > 30.
• On each subset compute the error difference:
\delta_i = error(h1|s_i) - error(h2|s_i)
• Now compute:
\bar{\delta} = \frac{1}{k} \sum_{i=1}^{k} \delta_i
s_{\bar{\delta}} = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}
• State: with N% confidence the difference in error between h1 and h2 is:
\bar{\delta} \pm t_{N,k-1} \, s_{\bar{\delta}}
• "t" is the t-statistic, which is related to the Student t distribution (table 5.6).
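As an illustration of the paired t-test above, here is a short Python sketch (my own, not from the lecture); the function name paired_error_interval is hypothetical, and scipy.stats.t is used to look up t_{N,k-1} instead of table 5.6.

    import numpy as np
    from scipy.stats import t as student_t

    def paired_error_interval(err_h1, err_h2, confidence=0.95):
        """Paired-difference estimate and N% interval for error(h1|P) - error(h2|P).

        err_h1[i] and err_h2[i] are the errors of h1 and h2 on the same subset s_i.
        """
        deltas = np.asarray(err_h1) - np.asarray(err_h2)          # delta_i
        k = len(deltas)
        delta_bar = deltas.mean()                                 # mean difference
        s_delta = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))
        t_val = student_t.ppf(0.5 + confidence / 2.0, df=k - 1)   # t_{N, k-1}
        half_width = t_val * s_delta
        return delta_bar, (delta_bar - half_width, delta_bar + half_width)

    # With the numbers from the Paired Tests slide:
    # paired_error_interval([0.1, 0.2, 0.66, 0.45], [0.13, 0.21, 0.68, 0.47])
    # gives a mean difference of about -0.02 with a narrow interval around it.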
Comparing Learning Algorithms
• Again, we split the data into k subsets: S → {s1, s2, ..., sk}.
• Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement of each subset, {S - s1, S - s2, ...}, to produce hypotheses {L1(S - s_i), L2(S - s_i)} for all i.
• Compute for all i:
\delta_i = error(L1(S - s_i) \mid s_i) - error(L2(S - s_i) \mid s_i)
• As in the last slide, perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypotheses produced by L1 and L2.
Homework: Are all assumptions for the statistical test respected? If not, find one that is violated. Do you think that this estimate is correct, optimistic or pessimistic?

Evaluation: ROC curves
(figure: class-conditional distributions for class 0 (negatives) and class 1 (positives) with a moving decision threshold)
TP = true positive rate = # positives classified as positive / # positives
FP = false positive rate = # negatives classified as positive / # negatives
TN = true negative rate = # negatives classified as negative / # negatives
FN = false negative rate = # positives classified as negative / # positives
• Identify a threshold in your classifier that you can shift. Plot the ROC curve while you shift that parameter.

Conclusion
Never (ever) draw error curves without confidence intervals.
(The second most important sentence of this course.)

Appendix
The following slide is optional:

Hypothesis Tests
• Consider again comparing two hypotheses h1 and h2 on sample sets S1 and S2 of size n1 and n2.
• Now we would like to answer: will h2 have significantly less error on future data than h1?
• We can formulate this as: is the probability P(d > 0) larger than 95% (say)?
• Since mean(\hat{d}) = d, this amounts to asking: what is the total probability of the observation \hat{d} = 0.1 over all values d > 0?
• This can be computed as:
P(d > 0) = \int_{d > 0} dd \, P(\hat{d} = 0.1 \mid mean = d) = \int_{-\infty}^{0.1} d\hat{d} \, P(\hat{d} \mid mean = 0)
(move the Gaussian to the right: integrating over the mean d with the observation fixed at 0.1 is, by symmetry, the same as integrating a zero-mean Gaussian over \hat{d} up to 0.1).
(figure: Gaussian P(\hat{d}) with mean(\hat{d}) marked and the point \hat{d} = 0.1 indicated)
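To make the optional slide concrete, here is a small Python sketch (my own reading, not from the lecture): under the Gaussian approximation, P(d > 0) is the probability mass of a zero-mean Normal with the estimated standard deviation that lies below the observed difference. The function name prob_d_positive and the example numbers (d_hat = 0.1 with sqrt(var) = 0.5) are illustrative assumptions.

    from scipy.stats import norm

    def prob_d_positive(d_hat, std):
        """P(d > 0) under the Gaussian approximation: the probability mass of a
        Normal(0, std**2) distribution below the observed difference d_hat."""
        return norm.cdf(d_hat / std)

    # Illustration: an observed d_hat = 0.1 with sqrt(var) = 0.5 gives
    # prob_d_positive(0.1, 0.5) ~= 0.58, far below 95%, so we would not
    # conclude that h2 has significantly less error than h1.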