Goodness of Fit Tests for Probability Distributions

We have seen that we can compute a test statistic $\chi^2$ that compares the observed frequencies ($f_i$) with the expected frequencies ($e_i$):

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \sum_{i=1}^{k} \frac{f_i^2}{e_i} - n$$

In the case of a multinomial with k cells, this statistic has an approximate $\chi^2_{k-1}$ distribution as long as each cell has an expected frequency that is "large enough," such as $e_i \ge 5$.

The Discrete Random Variable Case

In this handout we consider how to test whether the values in a population are distributed according to a hypothesized distribution. To do this, let's consider the ways that a probability distribution can be represented. Perhaps the easiest way is to use a table for a discrete distribution. For example, imagine that we believe that the number of cars, C, sold per week by a salesman has the following distribution.

C      0      1      2
p(c)   0.5    0.4    0.1

If a sample of 100 weeks reveals that the salesman sold 0 cars 65 times, 1 car 30 times, and 2 cars 5 times, is this evidence to refute our belief at α = .01? See worksheet Salesman in the file GoF Probability distributions.xlsx. How should we write our hypotheses?

$H_0$: $P\{C = 0\} = .5$, $P\{C = 1\} = .4$, $P\{C = 2\} = .1$
$H_a$: At least one of these probabilities is different

Another way a probability distribution can be represented is through a formula. For example, we could claim that the probability that our salesman sells x cars is given by

$$P\{C = x\} = \binom{2}{x} (.3)^x (.7)^{2-x}, \quad x = 0, 1, 2.$$

Such a distribution is called a binomial distribution, as you learned in DSC 210. We can easily convert such a statement to a table by substituting each value of x into the formula and computing the associated probability:

$$P\{C = 0\} = \binom{2}{0} (.3)^0 (.7)^2 = 1 \cdot 1 \cdot (.49) = .49$$
$$P\{C = 1\} = \binom{2}{1} (.3)^1 (.7)^1 = 2 \cdot (.3) \cdot (.7) = .42$$
$$P\{C = 2\} = \binom{2}{2} (.3)^2 (.7)^0 = 1 \cdot (.09) \cdot 1 = .09$$

Thus our distribution looks like

C      0      1      2
p(c)   0.49   0.42   0.09

In general, if a random variable X has a binomial distribution with parameters n and p, then X represents the number of successes in n trials where the probability of success is p at each trial, and the binomial probabilities can be computed using the formula:

$$P\{X = x\} = \binom{n}{x} p^x (1 - p)^{n-x}, \quad x = 0, 1, 2, \ldots, n.$$

In worksheet Binomial we test whether the number of cars sold by our salesman follows a binomial distribution with $n = 2$ and $p = .3$, as computed above. Note that while I could have copied the probabilities from the table, I computed them in the spreadsheet using the function binom.dist. How could I write the hypotheses?

$H_0$: C follows a binomial distribution with $n = 2$ and $p = .3$
$H_a$: C does not follow a binomial distribution with $n = 2$ and $p = .3$

In general, you can perform a goodness of fit test for any discrete random variable as we have done above. Even if the probability distribution is given as a formula, you (or Excel, if the built-in function exists!) can simply retrieve the probabilities by industriously substituting each value of X into the formula.

The Continuous Random Variable Case

Example 1: A continuous random variable assumes values on a continuum. For example, assume that T = the time I wait until the next subway train arrives takes on values between 0 and 5 minutes. Now it is possible, in fact very likely, that I will not wait exactly 1 or 2 minutes, but instead will wait some fraction of a minute, such as 3.545 minutes. Thus, T can assume values anywhere in the interval [0, 5], and is a continuous random variable. The probability distribution of a continuous random variable can be described by the area under a carefully chosen density function. You have seen examples of density functions when we used the normal, t, $\chi^2$, and F distributions.
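As a brief aside on the discrete case before continuing: the Salesman and Binomial tests described earlier can also be checked outside Excel. The following is a minimal Python sketch, not the worksheets' own method. The names `chi_square_stat`, `chi_square_sf`, and `binomial_pmf` are invented here for illustration, and the p-value routine is an assumption of this sketch: it uses a standard recurrence for the upper tail of the chi-square distribution rather than a statistics library.

```python
import math

def chi_square_stat(observed, expected):
    """Sum of (f_i - e_i)^2 / e_i over all cells."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

def chi_square_sf(x, df):
    """Upper-tail probability P{chi-square with df degrees of freedom > x}.

    With y = x / 2, uses Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1),
    starting from Q(1, y) = e^(-y) for even df
    or Q(1/2, y) = erfc(sqrt(y)) for odd df.
    """
    y = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q

def binomial_pmf(x, n, p):
    """P{X = x} for a binomial random variable with parameters n and p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Observed weeks with 0, 1, and 2 cars sold (from the handout).
observed = [65, 30, 5]
n_weeks = sum(observed)
df = len(observed) - 1  # k - 1 cells

# Salesman worksheet: probabilities given as a table.
table_probs = [0.5, 0.4, 0.1]
expected_table = [n_weeks * p for p in table_probs]
chi2_table = chi_square_stat(observed, expected_table)
p_value_table = chi_square_sf(chi2_table, df)

# Binomial worksheet: probabilities computed from the formula, n = 2, p = .3.
binom_probs = [binomial_pmf(x, 2, 0.3) for x in range(3)]
expected_binom = [n_weeks * p for p in binom_probs]
chi2_binom = chi_square_stat(observed, expected_binom)
p_value_binom = chi_square_sf(chi2_binom, df)

print(f"table:    chi2 = {chi2_table:.4f}, p-value = {p_value_table:.4f}")
print(f"binomial: chi2 = {chi2_binom:.4f}, p-value = {p_value_binom:.4f}")
```

For the table hypothesis the statistic works out to 4.5 + 2.5 + 2.5 = 9.5 on 2 degrees of freedom, and the sketch compares the resulting p-value with α = .01 in the usual way.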
Now imagine that I wish to test the assumption that T assumes values uniformly in [0, 5] (naturally called a uniform distribution). Its density function is $f(t) = 1/5$ for $0 \le t \le 5$: a horizontal line at height 1/5 over the interval from 0 to 5.

Unfortunately, when we collect values of T we will get lots of different values between 0 and 5. How many possibilities are there? How many possibilities existed in the car sales example? This is the fundamental problem with using a goodness of fit test for a continuous random variable. To get around this, we need to group our observations; for example, we might decide to count the number of values of T that are 1 minute or less, more than 1 but at most 2 minutes, etc. When we group our data, we need to make sure that $e_i = np_i$ is at least 5. Fortunately, we can always regroup our data! From this density we can compute

$$p_1 = P\{0 \le T \le 1\} = b \times h = 1 \times \frac{1}{5} = \frac{1}{5}.$$

Or, using an approach that will work when simple geometry will not,

$$P\{0 \le T \le 1\} = \int_0^1 f(t)\,dt = \int_0^1 \frac{1}{5}\,dt = \left.\frac{1}{5}t\,\right|_0^1 = \frac{1}{5}(1) - \frac{1}{5}(0) = \frac{1}{5}.$$

Thus we could split our interval [0, 5] into 5 equal intervals, each with probability 1/5 = .2. Then if we collected n = 100 waiting times, T, we would expect that 20% of the values would fall between 0 and 1, 20% would fall between 1 and 2, etc. In our terminology, $e_1 = np_1 = 100(.2) = 20$. Similarly, $e_2 = e_3 = e_4 = e_5 = 20$.

Assuming our sample of 100 waiting times shows 15 values between 0 and 1 minute, 17 values between 1 and 2 minutes, 26 values between 2 and 3 minutes, 23 values between 3 and 4 minutes, and 19 values between 4 and 5 minutes, can we refute the claim that the waiting time is uniformly distributed at α = .05? See worksheet Uniform.

$H_0$: T is uniformly distributed on [0, 5]
$H_a$: T is not uniformly distributed on [0, 5]

Note that the selection of the groupings, while convenient, was otherwise arbitrary, and that someone else might have used different groups.
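The uniform waiting-time test set up above can be sketched in a few lines of Python. This is an illustration rather than a copy of the Uniform worksheet: the variable names are invented here, and the p-value line assumes exactly 5 cells, since it uses the closed form for 4 degrees of freedom, $P\{\chi^2_4 > x\} = e^{-x/2}(1 + x/2)$.

```python
import math

# Observed counts of waiting times in [0,1], (1,2], (2,3], (3,4], (4,5] minutes.
observed = [15, 17, 26, 23, 19]
n = sum(observed)

# Cell boundaries; under H0 the probability of a cell is its width times 1/5.
edges = [0, 1, 2, 3, 4, 5]
probs = [(b - a) / 5 for a, b in zip(edges, edges[1:])]
expected = [n * p for p in probs]

# Check the "large enough" condition e_i >= 5 before trusting the test.
cells_ok = all(e >= 5 for e in expected)

# Test statistic, computed both ways shown at the start of the handout.
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
chi2_shortcut = sum(f ** 2 / e for f, e in zip(observed, expected)) - n

# Upper-tail probability; this closed form is valid only for df = 4.
df = len(observed) - 1
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Changing `edges` to a different grouping (and replacing the df = 4 closed form with a general chi-square tail probability) would let the same sketch be rerun on other groupings.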
Other possibilities could be ten equally spaced intervals of .5 minutes each, or grouping 0-2 minutes, 2-4 minutes, and 4-5 minutes. As long as $e_i = np_i \ge 5$, the choices are equally valid from a technical point of view.

Example 2: Suppose that it is hypothesized that X = highway speeds (on a specific stretch of an interstate highway passing near a town) have a normal distribution with µ = 55 mph and σ = 5 mph. We wish to test this assertion using α = .05. Data have been collected as follows and appear in the spreadsheet Normal.

$H_0$: X has a normal distribution with $\mu = 55$ and $\sigma = 5$
$H_a$: X does not have a normal distribution with $\mu = 55$ and $\sigma = 5$

Speed                       f_i
< 45 mph                     10
≥ 45 mph but < 50 mph        75
≥ 50 mph but < 55 mph       165
≥ 55 mph but < 60 mph       175
≥ 60 mph but < 65 mph        58
≥ 65 mph but < 70 mph        12
≥ 70 mph                      5

Notice that if we had been given the raw data, we could have categorized the data into cells any way we wish. If we want to compare these observations with the expected frequencies, how do we determine $p_3 = P\{50 \le X < 55\}$? Since X is normal under $H_0$, it is the difference of two cumulative normal probabilities, $P\{X < 55\} - P\{X < 50\}$.

Now notice that $e_7 = 500 \cdot P\{X \ge 70\} = 0.67$, which is less than 5 and even less than 1. To correct for this, just lump cell 7 into cell 6 and keep going! See spreadsheet Normal Corrected. Our $\chi^2$ value becomes $\chi^2 = 505.43 - 500 = 5.43$, with $df = 6 - 1 = 5$. Since our p-value is 0.3652, we do not have a rare value of $\chi^2$ under $H_0$, where rare means falling among the largest 5% of values. Thus we have no reason to reject the null hypothesis.

Final note: If you have to estimate $\mu$ and $\sigma^2$ with $\bar{x}$ and $s^2$, you lose 2 degrees of freedom. Thus, if we had hypothesized that highway speeds are normally distributed but had not specified $\mu$ and $\sigma^2$, we would use $\bar{x}$ and $s^2$ from the sample and perform the test with $df = (6 - 1) - 2 = 3$.

Homework: Use $\alpha = .03$.

(1) Use the Salesman data discussed at the start of this handout to see if C follows the following distribution:

C      0      1      2
p(c)   0.7    0.25   0.05

(2) Use the data on highway speeds to see if $X \sim Normal(\mu = 57, \sigma = 6)$.
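The highway speed test in Example 2, including the lumping of the last two cells, can be sketched in Python as follows. This is an illustration under stated assumptions, not the Normal Corrected worksheet itself: the helper names `chi_square_sf` and `lump_upper_tail` are invented here, the standard library's `statistics.NormalDist` supplies the normal cumulative probabilities, and the chi-square tail probability is computed with a standard recurrence rather than a statistics package.

```python
import math
from statistics import NormalDist

def chi_square_sf(x, df):
    """Upper-tail probability P{chi-square with df degrees of freedom > x}."""
    y = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q

def lump_upper_tail(observed, expected, minimum=5):
    """Merge the last cell into its neighbor while its expected count is too small."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_obs, last_exp = obs.pop(), exp.pop()
        obs[-1] += last_obs
        exp[-1] += last_exp
    return obs, exp

# Hypothesized distribution and the observed table from Example 2.
speeds = NormalDist(mu=55, sigma=5)
cuts = [45, 50, 55, 60, 65, 70]
observed = [10, 75, 165, 175, 58, 12, 5]
n = sum(observed)

# Cell probabilities are differences of cumulative probabilities at the cuts.
cdf = [0.0] + [speeds.cdf(c) for c in cuts] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probs]

# Cell 7 has an expected count below 5, so it is lumped into cell 6.
obs_l, exp_l = lump_upper_tail(observed, expected)

chi2 = sum(f ** 2 / e for f, e in zip(obs_l, exp_l)) - n
df = len(obs_l) - 1  # subtract 2 more if mu and sigma were estimated from the sample
p_value = chi_square_sf(chi2, df)

print(f"e7 = {expected[-1]:.3f}")
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
```

Run on the handout's data, this sketch should agree with the values quoted above ($\chi^2$ of about 5.43 and a p-value of about 0.3652). Note that `lump_upper_tail` only handles a small last cell, which is all this example needs.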