Hypothesis Testing

The goal of hypothesis testing is to set up a procedure that allows us to decide whether a model is acceptable in light of our experimental observations.

Example: A theory predicts a Higgs branching ratio BR = 2×10⁻⁵ and you measure (4 ± 2)×10⁻⁵. The hypothesis we want to test is: "are experiment and theory consistent?"

Hypothesis testing does not have to compare theory and experiment.

Example: CLEO measures the Λc lifetime to be (180 ± 7) fs while SELEX measures (198 ± 7) fs. The hypothesis we want to test is: "are the lifetime results from CLEO and SELEX consistent?"

There are two types of hypothesis tests: parametric and non-parametric.
  Parametric: compares the values of parameters (e.g. is the mass of the proton equal to the mass of the electron?)
  Non-parametric: deals with the shape of a distribution (e.g. is the angular distribution consistent with being flat?)

Consider the case of neutron decay. Suppose two theories both predict the energy spectrum of the electron emitted in the decay of the neutron. Here a parametric test might not be able to distinguish between the two theories, since both might predict the same average energy of the emitted electron. A non-parametric test, however, could distinguish them, because the shape of the energy spectrum differs between the two theories.

880.P20 Winter 2006, Richard Kass

A procedure for hypothesis testing:
a) Measure (or calculate) something.
b) Find something to compare with your measurement (theory, another experiment).
c) Form a hypothesis (e.g. "my measurement x is consistent with the PDG value"):
     H₀: x = x_PDG
   H₀ is called the "null hypothesis".
d) Calculate the confidence level that the hypothesis is true.
e) Accept or reject the hypothesis depending on some minimum acceptable confidence level.

Problems with the above procedure:
a) What is a confidence level?
b) How do you calculate a confidence level?
c) What is an acceptable confidence level?
How would we test the hypothesis "the space shuttle is safe"? Is 1 explosion per 10 launches safe? Or 1 explosion per 1000 launches?

A working definition of the confidence level: the probability of the event happening by chance.

Example: Suppose we measure some quantity X and we know that it is described by a Gaussian pdf with μ = 0 and σ = 1. What is the confidence level for measuring X ≥ 2 (i.e. 2σ from the mean)?

  P(X ≥ 2) = ∫₂^∞ P(μ, σ, x) dx = ∫₂^∞ P(0, 1, x) dx = (1/√(2π)) ∫₂^∞ e^(−x²/2) dx ≈ 0.023

Thus we would say that the confidence level for measuring X ≥ 2 is about 0.023 (roughly 2.5%), and we would expect to get a value X ≥ 2 about one time in 40 if the underlying pdf is Gaussian.

Cautions

There are two types of errors associated with hypothesis testing:
  Type I: reject H₀ when it is true.
  Type II: accept H₀ when it is false.

For the case where H₀: θ = θ₀ and the alternative H₁: θ = θ₁, we can calculate the probability of a Type I or Type II error. Assume we reject H₀ if x > x_c:

  prob. of Type I error  = ∫_{x_c}^∞ f(x | θ₀) dx    (area of f(x|θ₀) above the cut)
  prob. of Type II error = ∫_{−∞}^{x_c} f(x | θ₁) dx  (area of f(x|θ₁) below the cut)

[Figure: the pdfs f(x|θ₀) and f(x|θ₁) with the cut at x_c; the tail areas are the Type I and Type II error probabilities.]

A few cautions about using confidence limits:
a) You must know the underlying pdf to calculate the limits. Example: suppose we have a scale of known accuracy (σ = 10 gm) and we weigh an object to be 20 gm. Assuming a Gaussian pdf, we would calculate a 2.5% chance that our object weighs ≤ 0 gm! We must make sure that the probability distribution is defined in the region where we are trying to extract information.
b) What does a confidence level really mean? Classical vs Bayesian viewpoints.

Hypothesis Testing: Gaussian Variables

We wish to test whether a quantity we have measured (x̄ = average of n measurements) is consistent with a known mean μ₀.

  Test     Conditions    Test statistic            Test distribution
  μ = μ₀   σ² known      z = (x̄ − μ₀)/(σ/√n)      Gaussian with mean 0, σ = 1
  μ = μ₀   σ² unknown    t = (x̄ − μ₀)/(s/√n)      Student's t-distribution with n − 1 DOF
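These tail integrals are easy to check numerically. A short Python sketch using only the standard library (the cut x_c = 1.5 and the alternative mean θ₁ = 3 are made-up illustration values, not from the notes):

```python
from math import erf, erfc, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative Gaussian probability P(X <= x)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Confidence level for X >= 2 with mu = 0, sigma = 1
p_2sigma = 0.5 * erfc(2.0 / sqrt(2.0))   # about 0.023, i.e. roughly 1 in 40

# Type I / Type II errors for H0: theta0 = 0 vs H1: theta1 = 3, reject if x > x_c
x_c = 1.5                                # illustrative cut (assumed value)
type1 = 1.0 - norm_cdf(x_c, mu=0.0)      # tail of f(x|theta0) above x_c
type2 = norm_cdf(x_c, mu=3.0)            # area of f(x|theta1) below x_c
```

With this symmetric choice of cut, both error probabilities come out to about 6.7%; moving x_c trades one type of error against the other.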
Example: Do free quarks exist? Quarks are nature's fundamental building blocks and are thought to have electric charge |q| of either (1/3)e or (2/3)e (e = charge of the electron). Suppose we do an experiment to look for |q| = 1/3 quarks.

We measure: q = 0.90 ± 0.2. This gives x̄ = 0.90 and σ = 0.2.
Quark theory: q = 0.33. This is μ₀.
H₀: our measurement 0.90 ± 0.2 is consistent with μ₀ = 0.33.

We want to test the hypothesis μ = μ₀ when σ is known, so we use the first line in the table:

  z = (x̄ − μ₀)/(σ/√n) = (0.9 − 0.33)/(0.2/√1) = 2.85

We want to calculate the probability of getting z ≥ 2.85, assuming a Gaussian pdf:

  prob(z ≥ 2.85) = ∫_{2.85}^∞ P(μ, σ, x) dx = ∫_{2.85}^∞ P(0, 1, x) dx = (1/√(2π)) ∫_{2.85}^∞ e^(−x²/2) dx ≈ 0.002

The CL here is just 0.2%! What we are saying is that if we repeated our experiment 1000 times, only about 2 of the experiments would measure a value q ≥ 0.9 if the true mean were q = 1/3. This is not strong evidence for q = 1/3 quarks! If the acceptable CL is 5%, we would reject H₀.

Do charge 2/3 quarks exist? If instead of q = 1/3 quarks we tested for q = 2/3, what would we get for the CL? Now we have x̄ = 0.9 and σ = 0.2 as before, but μ₀ = 2/3. We get z = 1.17 and prob(z ≥ 1.17) ≈ 0.12, so the CL ≈ 12%.
H₀: our measurement 0.90 ± 0.2 is consistent with μ₀ = 0.67.
Now free quarks are starting to get believable! If the acceptable CL is 5%, we would accept H₀.

Another variation of the quark problem: suppose we have three measurements of the charge q: q₁ = 1.1, q₂ = 0.7, and q₃ = 0.9. We don't know the variance beforehand, so we must determine it from our data. Thus we use the second test in the table:

  x̄ = (q₁ + q₂ + q₃)/3 = 0.9
  s² = Σᵢ (qᵢ − x̄)²/(n − 1) = [0.2² + (−0.2)² + 0²]/2 = 0.04
  t = (x̄ − μ₀)/(s/√n) = (0.9 − 0.33)/(0.2/√3) = 4.94

H₀: our charge measurements are consistent with q = 1/3 quarks.
In this problem t is described by Student's t-distribution. (Note: "Student" is the pseudonym of the statistician W. S. Gosset, who was employed by a famous English brewery.)
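The three-measurement case can be checked directly. A small Python sketch (stdlib only; for ν = 2 degrees of freedom the Student's-t tail probability happens to have a closed form, which saves the table lookup):

```python
from math import sqrt

q = [1.1, 0.7, 0.9]          # the three charge measurements
n = len(q)
mu0 = 0.33                   # quark-theory charge, rounded as in the notes

mean = sum(q) / n                                   # 0.9
s2 = sum((qi - mean) ** 2 for qi in q) / (n - 1)    # sample variance, 0.04
t = (mean - mu0) / sqrt(s2 / n)                     # about 4.94

# One-sided tail P(T >= t) for Student's t with nu = 2 DOF
# (closed-form CDF for nu = 2: F(t) = 1/2 + t / (2*sqrt(2 + t^2)))
p = 0.5 - t / (2.0 * sqrt(2.0 + t * t))             # about 0.02
```

The tail probability comes out near 0.02, matching the table-lookup value quoted in the next section.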
Just like the Gaussian pdf, to evaluate the t-distribution one must resort to a lookup table (see for example Table 7.2 of Barlow). In this problem we want prob(t ≥ 4.94) for n − 1 = 2 degrees of freedom; the probability is 0.02. This is ~10× greater than in the first part of this example, where we knew the variance ahead of time. If the acceptable CL is 5%, we would reject H₀.

Tests when both means are unknown but come from a Gaussian pdf:

  Test          Conditions                Test statistic                          Test distribution
  μ₁ − μ₂ = 0   σ₁² and σ₂² known         z = (x̄₁ − x̄₂)/√(σ₁²/n + σ₂²/m)        Gaussian with mean 0, σ = 1
  μ₁ − μ₂ = 0   σ₁² = σ₂² = σ² unknown    t = (x̄₁ − x̄₂)/[Q √(1/n + 1/m)], with   Student's t with n + m − 2 DOF
                                          Q² = [(n−1)s₁² + (m−1)s₂²]/(n + m − 2)
  μ₁ − μ₂ = 0   σ₁², σ₂² unknown          z = (x̄₁ − x̄₂)/√(s₁²/n + s₂²/m)        approx. Gaussian with mean 0, σ = 1

(n and m are the numbers of measurements going into each mean.)

Example: Do two experiments agree with each other? CLEO measures the Λc lifetime to be (180 ± 7) fs while SELEX measures (198 ± 7) fs.
H₀: the CLEO result 180 ± 7 is consistent with the SELEX result 198 ± 7.

  z = (μ₂ − μ₁)/√(σ₁² + σ₂²) = (198 − 180)/√(7² + 7²) = 1.82

  P(|z| ≥ 1.82) = 1 − ∫_{−1.82}^{1.82} P(0, 1, x) dx = 1 − (1/√(2π)) ∫_{−1.82}^{1.82} e^(−x²/2) dx = 1 − 0.93 = 0.07

Thus 7% of the time we should expect the two experiments to disagree at this level. If the acceptable CL is 5%, we would accept H₀.

Hypothesis Testing: Non-Gaussian Variables

Assume we have a set of measurements xᵢ that are NOT from a Gaussian pdf (e.g. they could be from a Poisson distribution). We can calculate a quantity that looks like a χ², comparing our data dᵢ with the corresponding predictions PRᵢ from a pdf:

  X² = Σᵢ₌₁ⁿ (dᵢ − PRᵢ)²/PRᵢ

K. Pearson showed that for a wide variety of pdfs (e.g. Poisson) this test statistic is distributed according to a χ² pdf with n − 1 DOF. For example, suppose the data are from a Poisson distribution with mean μ:

  P(m, μ) = e^(−μ) μ^m / m!
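The CLEO/SELEX comparison is a two-line calculation. A Python sketch using only the standard library (erfc(z/√2) gives the two-sided Gaussian tail probability directly):

```python
from math import erfc, sqrt

tau1, err1 = 180.0, 7.0      # CLEO Lambda_c lifetime and error (fs)
tau2, err2 = 198.0, 7.0      # SELEX Lambda_c lifetime and error (fs)

# First line of the two-mean table, with the errors on the two means
z = abs(tau1 - tau2) / sqrt(err1 ** 2 + err2 ** 2)   # about 1.82

# Two-sided tail probability P(|Z| >= z) for a standard Gaussian
p = erfc(z / sqrt(2.0))                              # about 0.07
```

Since p ≈ 7% exceeds a 5% threshold, the two lifetime measurements are judged consistent.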
If we have a total of N events in our sample, then we predict N·P(m, μ) events with a given value of m (m = 0, 1, 2, 3, 4, …). At the same time, we observe N(m) events. If N·P(m, μ) > 5 for all m, then the following is approximately a χ² with n − 1 DOF:

  X² = Σ_{m=0}^{M} [N(m) − N·P(m, μ)]² / [N·P(m, μ)]

Note that for a Poisson the prediction equals the variance, N·P(m, μ) = σₘ², so X² = Σ_{m=0}^{M} [N(m) − N·P(m, μ)]²/σₘ².

Hypothesis Testing: Pearson's χ² Test

The following are the numbers of neutrino events detected in 10-second intervals by the IMB experiment on 23 February 1987, around the time the supernova SN 1987A was first seen by experimenters:

  # events     0     1    2    3   4   5  6  7  8  9
  # intervals  1024  860  307  58  15  3  0  0  0  1

Assume the data are described by a Poisson distribution, and calculate the average number of events expected in an interval:

  μ = Σ (# events)(# intervals) / Σ (# intervals) = 0.777 if we include the interval with 9 events, 0.774 if we exclude it.

We can calculate a χ² assuming the data are described by a Poisson distribution. The predicted number of intervals with n events is N·e^(−μ) μⁿ/n!:

  # events               0     1    2    3   4   5  6    7     8      9
  # intervals predicted  1064  823  318  82  16  2  0.3  0.03  0.003  0.0003

(Note: we use σ² = prediction for a Poisson.)

Excluding the interval with 9 events:

  χ² = Σ_{i=0}^{8} (# intervals − prediction)²/prediction = 3.6

There are 7 (= 9 − 2) DOF here, and the probability of getting χ²/DOF ≥ 3.6/7 is high (≈ 80%), indicating a good fit to a Poisson.

However, if the last data point is included:

  χ² = Σ_{i=0}^{9} (# intervals − prediction)²/prediction = 3335, and χ²/DOF = 3335/8 = 417.

The probability of getting a χ²/DOF this large from a Poisson distribution with μ = 0.774 is ≈ 0. Hence the nine events are most likely coming from the supernova explosion and not just from a Poisson process. Reject H₀: "the interval with 9 events is from a Poisson with μ = 0.77".
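The Pearson χ² for the IMB data can be reproduced with a few lines of Python (stdlib only). The exact values depend on how μ and the per-bin predictions are rounded, so the small-χ² part comes out somewhat different from the 3.6 quoted above; the qualitative conclusion is robust either way: the 9-event bin alone contributes thousands of units of χ², while all other bins together contribute very little.

```python
from math import exp, factorial

counts = [1024, 860, 307, 58, 15, 3, 0, 0, 0, 1]  # IMB intervals with m = 0..9 events
N = sum(counts)                                    # total number of 10 s intervals
mu = 0.777                                         # mean events per interval (value from the notes)

def predicted(m):
    """Expected number of intervals with m events: N * P(m, mu) for a Poisson."""
    return N * exp(-mu) * mu ** m / factorial(m)

def chi2_term(obs, pred):
    """One bin's contribution, (obs - pred)^2 / pred (variance = prediction for a Poisson)."""
    return (obs - pred) ** 2 / pred

chi2_without_9 = sum(chi2_term(counts[m], predicted(m)) for m in range(9))
term_9 = chi2_term(counts[9], predicted(9))        # the 9-event bin alone
chi2_with_9 = chi2_without_9 + term_9
```

Because only about 0.0003 intervals with 9 events are predicted but 1 is observed, term_9 is of order several thousand, which is what drives the rejection of H₀.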