Goodness of Fit Tests for Probability Distributions

We have seen that we can compute a test statistic $\chi^2$ which compares the observed
frequencies ($f_i$) with the expected frequencies ($e_i$) by:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \sum_{i=1}^{k} \frac{f_i^2}{e_i} - n$$
In the case of a multinomial with k cells, this calculation has an approximate $\chi^2_{k-1}$
distribution as long as each cell has an expected frequency that is “large enough,” like $e_i \ge 5$.
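The two forms of the statistic can be checked against each other with a few lines of Python. This is only a sketch: the handout itself works in Excel, and the function names and the small set of counts below are made up for illustration. Note that the shortcut form relies on the observed and expected frequencies both summing to n.

```python
def chi_square_definition(observed, expected):
    """Sum of (f_i - e_i)^2 / e_i over all cells."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


def chi_square_shortcut(observed, expected):
    """Sum of f_i^2 / e_i minus n; valid when both lists total n."""
    n = sum(observed)
    return sum(f ** 2 / e for f, e in zip(observed, expected)) - n


# Illustrative counts only (not from any worksheet): n = 60 spread over 3 cells.
observed = [18, 22, 20]
expected = [20.0, 20.0, 20.0]

stat_definition = chi_square_definition(observed, expected)
stat_shortcut = chi_square_shortcut(observed, expected)
print(stat_definition, stat_shortcut)
```

Both calls should agree up to floating-point rounding, which is why either form can be used in a spreadsheet.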
The Discrete Random Variable Case
In this handout we consider how to perform a test to see if a population has values in it that are
distributed according to a hypothesized distribution. In order to do this, let’s consider the ways
that a probability distribution can be represented. Perhaps the easiest way is to use a table for a
discrete distribution. For example, imagine that we believe that the number of cars, C, sold per
week by a salesman follows the following distribution.
C       0      1      2
p(c)    0.5    0.4    0.1
If a sample of 100 weeks reveals that the salesman sold 0 cars 65 times, 1 car 30 times, and 2 cars
5 times, is this evidence to refute our belief at α = .01? See worksheet Salesman in the file GoF
Probability distributions.xlsx.
How should we write our hypotheses?
𝑯𝟎 : 𝑷{𝑪 = 𝟎} = .𝟓, 𝑷{𝑪 = 𝟏} = .𝟒, 𝑷{𝑪 = 𝟐} = .𝟏
𝑯𝒂 : At least one of these probabilities is different
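For readers who want to see the Salesman test outside of Excel, here is a minimal Python sketch. The variable names are my own, and the p-value line uses the fact that with 2 degrees of freedom the upper-tail chi-square probability is exactly exp(−x/2); nothing here is taken from the worksheet itself.

```python
import math

# Observed weeks with 0, 1, and 2 cars sold, and the hypothesized p(c).
observed = [65, 30, 5]
hypothesized = [0.5, 0.4, 0.1]

n = sum(observed)                          # 100 weeks
expected = [n * p for p in hypothesized]   # e_i = n * p_i

chi_sq = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = len(observed) - 1                     # k - 1 = 2

# With df = 2 the chi-square upper-tail area has the closed form exp(-x / 2).
p_value = math.exp(-chi_sq / 2)

alpha = 0.01
reject_h0 = p_value < alpha
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Under these assumptions the statistic works out to 9.5 with a p-value near 0.0087, which falls below α = .01.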
Another way a probability distribution can be represented is through a formula. For example, we
could claim that the probability that our salesman sells x cars is given by
$$P\{C = x\} = \binom{2}{x} (.3)^x (.7)^{2-x}, \quad x = 0, 1, 2.$$
Such a distribution is called a binomial distribution, as you learned in DSC 210. We can easily
convert such a statement to a table by substituting each value of x into the formula and
computing the associated probability:
$$P\{C = 0\} = \binom{2}{0} (.3)^0 (.7)^2 = 1 \cdot 1 \cdot (.49) = .49$$
$$P\{C = 1\} = \binom{2}{1} (.3)^1 (.7)^1 = 2 \cdot (.3) \cdot (.7) = .42$$
$$P\{C = 2\} = \binom{2}{2} (.3)^2 (.7)^0 = 1 \cdot (.09) \cdot 1 = .09$$
Thus our distribution looks like
C       0       1       2
p(c)    0.49    0.42    0.09
In general, if a random variable X has a binomial distribution with parameters n and p, then X
represents the number of successes in n trials where the probability of success is p at each trial,
and the binomial probabilities can be computed using the formula:
$$P\{X = x\} = \binom{n}{x} p^x (1 - p)^{n-x}, \quad x = 0, 1, 2, \ldots, n.$$
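The substitution step can be automated. The sketch below is a plain Python version of the binomial formula; the function name binomial_pmf is my own choice, and it simply rebuilds the table for n = 2 and p = .3.

```python
import math


def binomial_pmf(x, n, p):
    """P{X = x} for a binomial random variable with parameters n and p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


# Rebuild the table for the salesman example: n = 2 trials, p = .3.
table = {x: binomial_pmf(x, 2, 0.3) for x in range(3)}
for x, prob in table.items():
    print(f"P{{C = {x}}} = {prob:.2f}")
```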
In worksheet Binomial we test to see if the number of cars sold by our salesman follows a
binomial distribution with 𝑛 = 2 and 𝑝 = .3, as computed above. Note that while I could have
copied the probabilities from the table, I computed the probabilities in the spreadsheet using
Excel's BINOM.DIST function.
How could I write the hypotheses?
𝑯𝟎 : 𝑪 follows a binomial distribution with 𝒏 = 𝟐 and 𝒑 = .𝟑
𝑯𝒂 : 𝑪 does not follow a binomial distribution with 𝒏 = 𝟐 and 𝒑 = .𝟑
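Here is a hedged Python sketch of the same kind of test. It assumes the observed counts are the same 100 weeks used in the Salesman example (65, 30, and 5), which is my reading of the worksheet rather than something stated outright, and it computes the probabilities from the formula rather than copying them from the table.

```python
import math

# Assumption: same observed counts as the Salesman example (0, 1, 2 cars).
observed = [65, 30, 5]
n_trials, p_success = 2, 0.3

# Binomial probabilities from the formula, one per possible value of C.
probs = [
    math.comb(n_trials, x) * p_success ** x * (1 - p_success) ** (n_trials - x)
    for x in range(n_trials + 1)
]

n = sum(observed)
expected = [n * p for p in probs]          # roughly 49, 42, 9

chi_sq = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = len(observed) - 1                     # no parameters were estimated

# Closed form for the upper tail when df = 2.
p_value = math.exp(-chi_sq / 2)
print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")
```

Because n and p are specified in the null hypothesis rather than estimated from the data, the degrees of freedom stay at k − 1 = 2.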
In general, you could perform a goodness of fit test for any discrete random variable as we have
performed above. Even if the probability distribution is given as a formula, you (or Excel, if a
built-in function exists!) can simply retrieve the probabilities by industriously substituting each
value of X into the formula.
The Continuous Random Variable Case
Example 1—A continuous random variable assumes values on a continuum. For example,
assume that T = the time I wait until the next subway train arrives takes on values between 0 and
5 minutes. Now it is possible, in fact very likely, that I will not wait exactly 1 or 2 minutes, but
instead will wait some fraction of a minute, such as 3.545 minutes. Thus, T can assume values
anywhere in the interval [0, 5], and is a continuous random variable.
The probability distribution of a continuous random variable can be described by the area under
a carefully chosen density function. You have seen examples of density functions when we used
the normal, t, χ2, and F distributions.
Now imagine that I wish to test the assumption that T assumes values uniformly in [0, 5]
(naturally called a uniform distribution). Its density function would look like
[Figure: the density 𝑓(𝑡) = 1⁄5, a horizontal line of height 1⁄5 over the interval from 0 to 5.]
Unfortunately, when we collect values of T we will get lots of different values between 0 and 5.
How many possibilities are there?
How many possibilities existed in the car sales example?
This is the fundamental problem with using a goodness of fit test for a continuous random
variable. To get around this, we need to group our observations; for example, we might decide
to count the number of values of T that are 1 minute or less, more than 1 but less than 2 minutes, etc.
When we group our data, we need to make sure that 𝑒𝑖 = 𝑛𝑝𝑖 is at least 5. Fortunately, we can
always regroup our data!
From the distribution shown above we can compute
$$p_1 = P\{0 \le T \le 1\} = b \times h = 1 \times \frac{1}{5} = \frac{1}{5}.$$
Or using an approach that will work when simple geometry will not,
$$P\{0 \le T \le 1\} = \int_0^1 f(t)\,dt = \int_0^1 \tfrac{1}{5}\,dt = \tfrac{1}{5}\,t\,\Big|_0^1 = \tfrac{1}{5}(1) - \tfrac{1}{5}(0) = \tfrac{1}{5}.$$
Thus we could split our interval [0, 5] into 5 equal intervals, each with probability 1/5 = .2. Then
if we collected n = 100 waiting times, T, we would expect that 20% of the values would fall
between 0 and 1, 20% would fall between 1 and 2, etc. In our terminology,
𝑒1 = 𝑛𝑝1 = 100(.2) = 20.
Similarly, 𝑒2 = 𝑒3 = 𝑒4 = 𝑒5 = 20.
Assuming our sample of 100 waiting times shows 15 values between 0 and 1 minute, 17 values
between 1 and 2 minutes, 26 values between 2 and 3 minutes, 23 values between 3 and 4
minutes, and 19 values between 4 and 5 minutes, can we refute the claim that the waiting time is
uniformly distributed at α = .05? See worksheet Uniform.
𝑯𝟎 : 𝑻 is uniformly distributed on [𝟎, 𝟓]
𝑯𝒂 : 𝑻 is not uniformly distributed on [𝟎, 𝟓]
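The Uniform worksheet calculation can be sketched in Python as follows. The variable names are mine, and the p-value uses the closed form for the chi-square upper tail with 4 degrees of freedom, exp(−x/2)(1 + x/2), rather than a spreadsheet function.

```python
import math

# Observed counts for the intervals [0,1], (1,2], (2,3], (3,4], (4,5].
observed = [15, 17, 26, 23, 19]
edges = [0, 1, 2, 3, 4, 5]

n = sum(observed)                                   # 100 waiting times
# Under H0 each interval's probability is its width times the height 1/5.
probs = [(b - a) / 5 for a, b in zip(edges, edges[1:])]
expected = [n * p for p in probs]                   # 20 in every cell

chi_sq = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = len(observed) - 1                              # 5 - 1 = 4

# Closed form for the chi-square upper tail when df = 4.
half = chi_sq / 2
p_value = math.exp(-half) * (1 + half)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these counts the statistic is 4.0 and the p-value is about 0.41, so this sketch does not reject the uniform claim at α = .05.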
Note that the selection of the groupings, while convenient, was otherwise arbitrary, and that
someone else might have used different groups. Other possibilities could be ten equally spaced
intervals of .5 minutes each, or grouping 0-2 minutes, 2-4 minutes, and 4-5 minutes. As long as
𝑒𝑖 = 𝑛𝑝𝑖 ≥ 5, the choices are equally valid from a technical point of view.
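To compare candidate groupings before committing to one, a short Python sketch like the following can list the expected frequencies for each. The helper name expected_counts and the dictionary labels are hypothetical; the three groupings are the ones mentioned above, with n = 100.

```python
def expected_counts(edges, n, low=0.0, high=5.0):
    """Expected frequencies n * p_i for a uniform distribution on [low, high]."""
    width = high - low
    return [n * (b - a) / width for a, b in zip(edges, edges[1:])]


groupings = {
    "five 1-minute cells": [0, 1, 2, 3, 4, 5],
    "ten half-minute cells": [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5],
    "0-2, 2-4, 4-5": [0, 2, 4, 5],
}

results = {}
for label, edges in groupings.items():
    counts = expected_counts(edges, n=100)
    # A grouping is usable when every expected frequency is at least 5.
    results[label] = (counts, all(e >= 5 for e in counts))
    print(label, counts, "OK" if results[label][1] else "regroup")
```

Note that the cells do not have to be the same width; only the expected frequencies matter for the rule of thumb.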
Example 2—Suppose that it is hypothesized that 𝑋 = highway speeds (on a specific stretch of an
interstate highway passing near a town) have a normal distribution with µ = 55 mph and σ = 5
mph. We wish to test this assertion using α = .05. Data has been collected as follows and
appears in the spreadsheet Normal.
𝑯𝟎 : 𝑿 has a normal distribution with 𝝁 = 𝟓𝟓 and 𝝈 = 𝟓
𝑯𝒂 : 𝑿 does not have a normal distribution with 𝝁 = 𝟓𝟓 and 𝝈 = 𝟓
Speed                       𝑓𝑖
< 45 mph                    10
≥ 45 mph but < 50 mph       75
≥ 50 mph but < 55 mph       165
≥ 55 mph but < 60 mph       175
≥ 60 mph but < 65 mph       58
≥ 65 mph but < 70 mph       12
≥ 70 mph                    5
Notice that if we had been given the raw data, we could have categorized the data into cells any
way we wish. If we want to compare these observations with the expected frequencies, how do
we determine 𝑝3 = 𝑃{50 ≤ 𝑋 ≤ 55}?
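One way to get this probability without a normal table is the standard-library NormalDist class in Python, shown here as a sketch (the variable names are my own; n = 500 is the total of the observed frequencies in the table).

```python
from statistics import NormalDist

# Hypothesized distribution of highway speeds under H0.
speed = NormalDist(mu=55, sigma=5)

# P{50 <= X <= 55} is the difference of two cumulative probabilities.
p3 = speed.cdf(55) - speed.cdf(50)

n = 500                      # total of the observed frequencies
e3 = n * p3                  # expected frequency for cell 3
print(f"p3 = {p3:.4f}, e3 = {e3:.2f}")
```

Since 50 mph is exactly one standard deviation below the mean, p3 comes out near .3413.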
Now notice that 𝑒7 = 500 × 𝑃{𝑋 ≥ 70} = 0.67, which is less than 5 and even less than 1. To
correct for this, just lump cell 7 into cell 6 and keep going! See spreadsheet Normal Corrected.
Our 𝜒² value becomes 𝜒² = 505.43 − 500 = 5.43. Since our p-value is 0.3652, we do not have
a rare value of 𝜒² under H0, where rare means falling among the largest 5% of values. Thus we
have no reason to reject the null hypothesis.
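The whole Normal Corrected calculation, including the lumping of the last two cells, can be sketched in Python. This is an illustration rather than a copy of the worksheet: the helper chi2_upper_tail is my own, and it uses a closed-form expression for the chi-square upper tail that only works for whole-number degrees of freedom.

```python
import math
from statistics import NormalDist


def chi2_upper_tail(x, df):
    """P{chi-square with integer df exceeds x}, via closed-form sums."""
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (j + 0.5)
    return total


cut_points = [45, 50, 55, 60, 65, 70]
observed = [10, 75, 165, 175, 58, 12, 5]
n = sum(observed)                                  # 500

dist = NormalDist(mu=55, sigma=5)
cdf = [0.0] + [dist.cdf(c) for c in cut_points] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

# Lump the last cell into its neighbor while its expected frequency is below 5.
# (This sketch only handles small cells in the upper tail.)
while expected[-1] < 5:
    observed[-2:] = [observed[-2] + observed[-1]]
    expected[-2:] = [expected[-2] + expected[-1]]

chi_sq = sum(f ** 2 / e for f, e in zip(observed, expected)) - n
df = len(observed) - 1                             # 6 cells, so df = 5
p_value = chi2_upper_tail(chi_sq, df)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}")
```

Under these assumptions the sketch should land on roughly the same statistic and p-value reported above.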
Final note: If you have to estimate 𝜇 and 𝜎² with 𝑥̅ and 𝑠², you lose 2 degrees of freedom.
Thus, if we had hypothesized that highway speeds are normally distributed, but had not specified
𝜇 and 𝜎², we would use 𝑥̅ and 𝑠² from the sample and perform the test with
𝑑𝑓 = (6 − 1) − 2 = 3.
Homework: Use 𝛼 = .03.
(1) Use the Salesman data discussed above to see if C follows the following distribution:
C       0      1       2
p(c)    0.7    0.25    0.05
(2) Use the data on highway speeds to see if 𝑋~𝑁𝑜𝑟𝑚𝑎𝑙(𝜇 = 57, 𝜎 = 6).