Lecture 4

Hypothesis Testing
Hypothesis testing concerns assessing the truth of stated conjectures using formal statistical
methods. The stated conjecture may be a widely held belief, a “status quo” position, or simply an
opinion regarding the true state of things. In many cases, the stated conjecture is advanced even
though it is believed to be false. This is because in hypothesis testing, the stated
conjecture (called the null hypothesis) can be disproved, but it cannot be proved. However, by
disproving the null hypothesis, one proves that the contrary is true. The contrary of the null
hypothesis is called the alternative hypothesis. To make this concrete, we will draw an analogy
with the American legal system.
When someone is brought to trial on criminal charges, it is because the prosecution (or
more accurately, the state) believes a crime has been committed. But it is insufficient for the
prosecution to simply declare a crime has been committed; instead, the prosecution must prove
its case. This is why a person is presumed innocent until the prosecution convinces a jury—
beyond a reasonable doubt—that they are guilty. Observe that proving a defendant’s guilt is
equivalent to disproving a defendant’s innocence.
In hypothesis testing, the null hypothesis is on trial. Like the defendant in a criminal case,
the null hypothesis is presumed innocent (or correct) until we, as statistical prosecutors, disprove
it using sample data. If we are successful in disproving the null hypothesis, then we have
equivalently shown that the alternative hypothesis is true. In both court cases and statistics, we
win some cases, we lose some cases, and sometimes we arrive at the wrong conclusion.
In a criminal case, we are typically more fearful of putting an innocent person in jail than
letting a guilty person go free, and this is why we insist the jury be convinced of a person’s guilt
beyond a reasonable doubt. This is a fairly rigorous standard, but it is meant to ensure that those
who are pronounced guilty really are guilty. In hypothesis testing, we adopt a similar philosophy
regarding the standard needed to disprove the null hypothesis. The sample data is used to
compute a test statistic, whose distribution is known (at least approximately). We then reject the
null hypothesis if the computed value of this test statistic falls in a region that is very improbable
if the null hypothesis is true but very likely if it is false. In other words, the data needs to offer
substantial evidence that the null hypothesis is false before we reject it.
Some of the most common (and useful) hypothesis tests concern population means. Here are the
null ($H_0$) and alternative ($H_A$) hypotheses for three standard tests. The names will make more
sense later.
1. The “Two-Tailed” Test
$H_0: \mu = a$ (Null)
$H_A: \mu \neq a$ (Alternative)
2. The “One-Tailed” Test to the Left
$H_0: \mu \geq a$
$H_A: \mu < a$
or
$H_0: \mu = a$
$H_A: \mu < a$
3. The “One-Tailed” Test to the Right
$H_0: \mu \leq a$
$H_A: \mu > a$
or
$H_0: \mu = a$
$H_A: \mu > a$
Copyright 2007 John Semple
Example: Confiscating Scallops from Illegal Fishing
Here’s a summary of the story published by Arnold Barnett in Interfaces, Vol. 25, March-April,
1995.
In an effort to prevent over-fishing of local scallop populations, US law requires the average
meat per scallop in a fishing vessel’s catch to weigh at least 1/36 of a pound (see footnote 1). One such vessel, arriving
in Massachusetts with 11,000 bags of fresh scallops, was tested for compliance by the US
Fisheries and Wildlife Service (FWS). The FWS took a “large scoop” from each of 18
“randomly selected” bags and computed the scoop’s average scallop meat. Each scoop’s average
was then divided by the 1/36 of a pound US requirement so that a fraction of the standard could
be reported. These 18 fractions are given below.
Average Meat per Scallop in the 18 Scoops
(measured as a fraction of the US Requirement 1/36 lb)
.93    .88    .85    .91    .91    .84
.89    .98    .87    .91    .92    .99
.90    1.14   .98    1.06   .88    .93
What sort of evidence does this sample provide that the fishing outfit has violated the law?
In this case, the FWS should take as its null hypothesis “The mean weight is at least 1/36
of a pound,” versus the alternative hypothesis “The mean weight is less than 1/36 of a pound.”
This is because the FWS may suspect that the law has been violated, but they need to prove that
this is the case. The FWS assumes compliance as its null hypothesis (“innocent”), but if the null
hypothesis is disproved, then the fishing outfit is out of compliance (“guilty”). In short, H0 is
compliance with the law, HA is violation of the law.
Since the data are reported in fractions of the US minimum requirement (1/36 of a
pound), we choose to state the null and alternative hypotheses in terms of these fractions. In the
formal language of hypothesis testing, we would write:
$H_0: \mu \geq 1$
$H_A: \mu < 1$
This is a one-tailed test to the left.
Footnote 1: Here, the overall catch’s average is its population mean. Of course, calculating this mean is impossible because
there are too many scallops.
The Test Statistic
The test statistic is a formula based on the sample data whose distribution is known under the
null hypothesis. Certain values of the test statistic support the null hypothesis, others appear to
contradict it. For the three tests described above, the test statistic most commonly used is
$$\frac{\bar{X} - \mu_0}{s/\sqrt{n}},$$
whose components are defined below.
Symbol      General Meaning                               Value in this problem
$\bar{X}$   The sample mean of the random sample          .931
$\mu_0$     The conjectured value of $\mu$ under $H_0$    1
$s$         The sample standard deviation                 .075
$n$         The sample size                               18
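As a check on these values, the sample quantities can be recomputed directly from the 18 fractions reported in the scallop example. The following is a minimal Python sketch using only the standard library; the variable names are our own and are not part of the original analysis.

```python
import math
import statistics

# The 18 scoop averages, as fractions of the 1/36 lb requirement.
fractions = [0.93, 0.88, 0.85, 0.91, 0.91, 0.84,
             0.89, 0.98, 0.87, 0.91, 0.92, 0.99,
             0.90, 1.14, 0.98, 1.06, 0.88, 0.93]

n = len(fractions)                   # sample size
x_bar = statistics.mean(fractions)   # sample mean
s = statistics.stdev(fractions)      # sample standard deviation (divides by n - 1)
std_error = s / math.sqrt(n)         # denominator of the test statistic

print(f"n = {n}, x_bar = {x_bar:.6f}, s = {s:.6f}, s/sqrt(n) = {std_error:.6f}")
```

Carried to six decimal places, the mean and standard deviation come out to about .931667 and .075323, which are the values used when the test statistic is computed for this sample.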
People often abuse the term test statistic and refer to the value one obtains for a particular sample
as the “test statistic.” Technically speaking, one should say this is the value of the test statistic or
the computed test statistic. Even statisticians abuse this terminology.
If the null hypothesis is true, then, under suitable conditions (namely, the sample consists of
independent draws from a normal distribution), $(\bar{X} - \mu_0)/(s/\sqrt{n})$ follows a t distribution with
17 degrees of freedom. This distributional result should look familiar to you from our work with
confidence intervals. If the value of this test statistic is “not too negative,” then the null
hypothesis is at least plausible. In the language of statistics, we would “fail to reject” the null
hypothesis. But if the value is “too negative” (in a sense refined later), then the null hypothesis
becomes indefensible, and so one should conclude that it is false. In the language of statistics,
when the value of the t-statistic becomes too negative we “reject” the null hypothesis and
conclude that the alternative hypothesis is true.
The level of the Test (α) and the Critical Values
So what values of the test statistic are “so negative” that they demand rejection of the null
hypothesis? In other words, what values of $(\bar{X} - \mu_0)/(s/\sqrt{n})$ lead us to reject the null
hypothesis? The answer depends on you and your tolerance for making a mistake. Let the Greek
letter alpha ($\alpha$) be a small probability that represents your tolerance for rejecting the null
hypothesis when it is actually true. Using the jury analogy, this is your tolerance for convicting
an innocent person. Your tolerance for making a mistake clearly depends on the situation. You
might tolerate a 1 in 10 chance of falsely convicting someone of jaywalking because the
consequences are minimal (a small fine), but you would probably demand a higher standard (say
1 in 100 or 1 in 1000) if the charge is murder. Typical values for $\alpha$ include .10, .05, .01 and
.001.
For your selected value of $\alpha$, determine a region of values for $(\bar{X} - \mu_0)/(s/\sqrt{n})$ that have
probability $\alpha$ and are least supportive of the null hypothesis (and thus most supportive of the
alternative hypothesis). In the context of the scallop example, this means values of
$(\bar{X} - \mu_0)/(s/\sqrt{n})$ that fall in the leftmost “tail” of a t distribution with 17 degrees of freedom.
For $\alpha = .05$ (the default value used by most people), the values that are least supportive of the
null hypothesis are those to the left of $-t_{.05,17} = -1.74$ (see picture below). Note that these values
are unlikely if the null hypothesis is true but much more likely if the alternative hypothesis is
true.
[Figure: t distribution with 17 degrees of freedom. The rejection region for $\alpha = .05$ is the left tail of area $\alpha = .05$, lying to the left of the critical value $-t_{.05,17} = -1.74$, where $t_{.05,17}$ = TINV(.10,17).]
In the language of hypothesis testing, the value of $\alpha$ is called the level of significance or simply
the “level” of the test. The value -1.74 is called the critical value. Note that we calculate its
magnitude in Excel as TINV(.10,17) because TINV treats its probability argument as a two-tailed
area and automatically splits it in half.
Values to the left of the critical value constitute the rejection region. In our example, the value
of the test statistic computed from the random sample of 18 bags is
$$t = \frac{.931667 - 1}{.075323/\sqrt{18}} = -3.849.$$
This value is far to the left of the critical value $-t_{.05,17} = -1.74$, and thus it does not support the
null hypothesis at level $\alpha = .05$. This value does support the alternative hypothesis that the true
mean is smaller. We would therefore reject the null hypothesis and conclude that the
average scallop weight (for the boat) is less than the minimum required by US law. In the formal language
of statistics:
“At level $\alpha = .05$, there is sufficient sample information to reject the null
hypothesis that the mean scallop weight for the catch is at least 1/36 of a pound.
We therefore conclude that the mean scallop weight for the catch is less than 1/36
of a pound.”
If we had been unable to reject the null hypothesis, then the word “sufficient” would be replaced
by the word “insufficient” and we would not be able to conclude that the boat’s catch violated the
law.
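The scallop test can also be reproduced numerically. The sketch below is an illustration under stated assumptions: it uses only the Python standard library, so the tail area of the t distribution is obtained by numerical integration (Simpson's rule) and the critical value by bisection. The helper names t_pdf, t_upper_tail, and t_critical are hypothetical names chosen for this sketch, not functions from any library; in practice a statistics package would supply them.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0: one half minus the area on [0, x] (Simpson's rule)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(alpha, df):
    """Value c with P(T > c) = alpha (for alpha < 0.5), found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha:
            lo = mid   # tail still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2

# Summary statistics from the scallop sample (fractions of the 1/36 lb standard).
x_bar, mu0, s, n = 0.931667, 1.0, 0.075323, 18
alpha = 0.05

t_value = (x_bar - mu0) / (s / math.sqrt(n))
critical = -t_critical(alpha, n - 1)   # left-tail critical value
reject = t_value < critical            # one-tailed test to the left

print(f"t = {t_value:.3f}, critical value = {critical:.2f}, reject H0: {reject}")
```

The decision rule in the last lines is the one described in the text: reject the null hypothesis when the computed t falls to the left of the critical value.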
Example. Suppose you manufacture plastic medical parts that are intended to have a mean
thickness of 15 mm. You want to test the null hypothesis $\mu = 15$ against the alternative $\mu \neq 15$.
You collect a sample of n = 81 observations and calculate the following sample statistics:
$\bar{x} = 14.87$, $s^2 = .25$. Conduct an appropriate hypothesis test at level $\alpha = .05$ to see if your parts
have a mean thickness of 15 mm.
Here, the opposing hypotheses are
$H_0: \mu = 15$
$H_A: \mu \neq 15$
In this test, we again use the test statistic $(\bar{X} - 15)/(s/\sqrt{n})$. If the null hypothesis is true, we would rarely
expect to see values of the test statistic that are “very negative” or “very positive” (i.e., in either
“tail”). Using $\alpha = .05$, this translates into two critical values: $-t_{\alpha/2,80} = -1.990$ and
$t_{\alpha/2,80} = 1.990$ (=TINV(.05,80)). The situation is pictured below.
[Figure: t distribution with 80 degrees of freedom. Each tail of the rejection region has area $\alpha/2 = .025$: one to the left of $-t_{.025,80} = -1.99$ and one to the right of $t_{.025,80} = 1.99$ = TINV(.05,80).]
The rejection region for $\alpha = .05$ consists of two intervals, $(-\infty, -1.99)$ and $(1.99, \infty)$. Notice
that we have divided the small probability $\alpha = .05$ in half in our picture because we have two
directions that do not support our null hypothesis and $\alpha = .05$ must accommodate both
directions. However, we do not split this value when using TINV for a two-tailed test. This is
because the TINV function splits the value of $\alpha$ automatically. In our example, $t_{.025,80}$ =
TINV(.05,80). When doing one-tailed tests, one must remember to anticipate this split by doubling
$\alpha$ before passing it to TINV, as in TINV(.10,17) for $t_{.05,17}$.
The value of our test statistic using the sample data is
$$t = \frac{14.87 - 15}{\sqrt{.25}/\sqrt{81}} = -2.34.$$
This value falls in the portion of the rejection region below -1.99. This leads us to reject the null
hypothesis and conclude that $\mu \neq 15$. In formal statistical language:
“At level $\alpha = .05$, there is sufficient sample information to reject that the mean is
15 mm (and thus conclude that the mean is not 15).”
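To see the TINV convention at work, here is a rough Python sketch that mimics it. The function tinv and its helpers are our own hypothetical stand-ins, built from numerical integration and bisection with the standard library only. They are not Excel's implementation, just a way to check the critical values quoted in the two examples.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0, using Simpson's rule on [0, x] and symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def tinv(p, df):
    """Two-tailed inverse: the value c with P(|T| > c) = p, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 2 * t_upper_tail(mid, df) > p:
            lo = mid   # two-tailed area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2

# One-tailed test at alpha = .05 with 17 df: pass 2 * alpha to anticipate the split.
one_tailed_17 = tinv(0.10, 17)
# Two-tailed test at alpha = .05 with 80 df: pass alpha unchanged.
two_tailed_80 = tinv(0.05, 80)

# Medical-part example: x_bar = 14.87, s^2 = .25, n = 81, H0: mu = 15.
t_value = (14.87 - 15) / (math.sqrt(0.25) / math.sqrt(81))
reject = abs(t_value) > two_tailed_80   # falls in either tail of the rejection region

print(f"t(.05,17) = {one_tailed_17:.2f}, t(.025,80) = {two_tailed_80:.3f}")
print(f"t = {t_value:.2f}, reject H0: {reject}")
```

Writing the two-tailed rule as a comparison of |t| with a single positive critical value is equivalent to checking the two intervals of the rejection region separately.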
P-Values
In each of the previous examples, we could have started with a smaller value of $\alpha$ and still
rejected the null hypothesis. How small could $\alpha$ have been? The smallest value of $\alpha$ that still
allows you to reject the null hypothesis for the computed value of your test statistic is called the
p-value for the hypothesis test. In a very informal sense, people sometimes think of this as the
“probability that the null hypothesis is true,” although that reading is not strictly correct.
Technically speaking, it is the probability of
observing a value of the test statistic at least as extreme as the one computed, given that the null
hypothesis is true. Since this involves calculating a probability, you can use the TDIST function
applied to the value of your test statistic to calculate the p-value.
Observe that the p-value is always between 0 and 1. As the p-value approaches 0, the data offers
less support for the null hypothesis and therefore more support for the alternative hypothesis.
How small should your p-value be before you decide the null hypothesis is false and the
alternative is true? That’s up to you. Typical threshold values are .05, .01, and .001. Some
people prefer to simply report the p-value and let the audience decide. The important thing to
remember is this: the smaller the p-value, the more the data contradicts the null hypothesis and
supports the alternative hypothesis.
Example: Scallop Confiscation (revisited). What is the p-value for the test conducted in the
scallop problem?
Solution. Start by drawing a picture. What is the smallest value of $\alpha$ that still allows us to reject
the null hypothesis given the current value of the test statistic t = –3.849? After a little
discussion, you will realize the p-value in the scallop example is the area to the left of the value
of the test statistic. Remember the test statistic follows a t distribution with 17 degrees of
freedom. Because the t distribution is symmetric, this p-value is also the area to the right of the
positive value t = 3.849, namely TDIST(3.849,17,1) = .000643.
Example. What is the p-value for the medical part problem?
Solution. Start by drawing a picture. What is the smallest value of $\alpha$ that still allows us to
reject the null hypothesis? Recall the value of the test statistic was -2.34 and the test was a
two-tailed test. Because TDIST accepts only nonnegative values, we pass it the magnitude 2.34.
(Answer: TDIST(2.34,80,2) = .02177)
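Both p-values can be checked without Excel. The sketch below is a hedged illustration: tdist is a hypothetical helper of our own that imitates the TDIST(x, df, tails) calls used in these two solutions by integrating the t density numerically with the standard library. It is meant only to confirm the quoted areas, not to replace a proper statistics package.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0, using Simpson's rule on [0, x] and symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def tdist(x, df, tails):
    """Tail area beyond x >= 0, counted in one tail or in both tails."""
    if x < 0 or tails not in (1, 2):
        raise ValueError("x must be nonnegative and tails must be 1 or 2")
    return tails * t_upper_tail(x, df)

# Scallop example: one-tailed test, t = -3.849 with 17 df (use the magnitude).
p_scallop = tdist(3.849, 17, 1)
# Medical-part example: two-tailed test, t = -2.34 with 80 df.
p_parts = tdist(2.34, 80, 2)

alpha = 0.05
print(f"scallop p-value = {p_scallop:.6f}, reject at .05: {p_scallop < alpha}")
print(f"parts p-value   = {p_parts:.5f}, reject at .05: {p_parts < alpha}")
```

Comparing the p-value with $\alpha$ gives the same decision as comparing the test statistic with the critical value, since the p-value is the smallest $\alpha$ at which the null hypothesis would still be rejected.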
Example (Tamir Ayad, SMU Class P43P) A new production process is being considered to
reduce the number of “large” particles (2 microns and higher) that contaminate silicon wafers.
The old process was known to produce an average of 3.01 such particles per wafer. The new
process was tested using a sample of n=81 observations. The following sample statistics were
computed: $\bar{x} = 1.88$, $s^2 = 2.71$. Is the new process better? What is the p-value for your test?
Solution. Start by drawing a picture.
All hypothesis tests in this course work essentially the same way. In summary:
1. State your Null and Alternative Hypotheses.
2. Determine the appropriate test, test statistic, and critical value(s) for the given level of
significance ($\alpha$).
3. Compute a value for your test statistic based on the sample data.
4. Depending on the value of the test statistic, either “reject” or “fail to reject” the null
hypothesis at the given level of $\alpha$.
5. Compute the p-value for your test if required.
Assumptions for the t Test
There is an important assumption underlying the t test: the underlying population we draw our
random sample from is normally distributed (or approximately normally distributed). This is the
same assumption used for constructing confidence intervals based on the t-statistic. The
normality assumption is probably safe in our scallop case since we are actually dealing with 18
averages (one for each scoop), and we know from the CLT that averages follow a normal
distribution for large enough sample sizes. However, the averages do not all have the same
number of observations. Could this create problems? Could the FWS come up with a better
statistical procedure for testing compliance? Could the scoops have been selected in a
nonrandom fashion?
Hypothesis Testing: Hints for Testing a Population mean
Remember that if you want to prove something, it must be your alternative hypothesis ($H_A$). If
you fail to reject (“there is insufficient information at level alpha to reject the null hypothesis”),
you have not proven $H_0$ is true. By analogy, a defendant who is found not guilty is not
necessarily innocent of all charges.
Here are a few tips for identifying the null and alternative hypotheses in word problems. They
are a little cynical, but then so are word problems. In the real world, it’s typically easier to
identify the two hypotheses.
Tip #1: For tests of a population mean, $H_A$ will always be of the form
(1) $\mu < \mu_0$
(2) $\mu > \mu_0$
or (3) $\mu \neq \mu_0$ (equivalently, $\mu < \mu_0$ or $\mu > \mu_0$)
English statements like “less than,” “greater than,” and “not equal to” therefore identify alternative
hypotheses.
Tip #2: For testing a population mean, the corresponding null hypotheses are of the form (1)
$H_0: \mu \geq \mu_0$, (2) $H_0: \mu \leq \mu_0$ or (3) $H_0: \mu = \mu_0$. Thus English statements like “at least,” “no
more than,” etc., must be statements regarding the null hypothesis. However, for actual testing
purposes, a null hypothesis of the form $H_0: \mu \geq \mu_0$ will be replaced by the test $H_0: \mu = \mu_0$
(with alternative $H_A: \mu < \mu_0$). Similarly, a null hypothesis of the form $H_0: \mu \leq \mu_0$ will be
replaced by the test $H_0: \mu = \mu_0$ (with alternative $H_A: \mu > \mu_0$). There are basically three tests:
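In all three tests the test statistic is
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{n-1},$$
a t distribution with $n - 1$ degrees of freedom.
Test 1 (one-tailed to the left)
$H_0: \mu = \mu_0$
$H_A: \mu < \mu_0$
reject $H_0$ if $t < -t_{\alpha,\,n-1}$
Test 2 (one-tailed to the right)
$H_0: \mu = \mu_0$
$H_A: \mu > \mu_0$
reject $H_0$ if $t > t_{\alpha,\,n-1}$
Test 3 (two-tailed)
$H_0: \mu = \mu_0$
$H_A: \mu \neq \mu_0$
reject $H_0$ if $t < -t_{\alpha/2,\,n-1}$
or $t > t_{\alpha/2,\,n-1}$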
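The three tests differ only in which tail (or tails) counts as evidence against $H_0$, so they fit in one small routine. What follows is a sketch under our own assumptions: the function t_test_mean, its argument names, and the labels "less", "greater", and "not_equal" are hypothetical, and the tail area is computed by numerical integration rather than by a statistics library. It is applied to the silicon wafer example, taking $H_0: \mu = 3.01$ against $H_A: \mu < 3.01$ as one reasonable reading of "Is the new process better?"

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0, using Simpson's rule on [0, x] and symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_test_mean(x_bar, s, n, mu0, alternative):
    """Return (t, p_value) for H0: mu = mu0 against the chosen alternative."""
    t = (x_bar - mu0) / (s / math.sqrt(n))
    upper = t_upper_tail(abs(t), n - 1)       # area beyond |t| in one tail
    if alternative == "less":                 # H_A: mu < mu0, left tail
        p = upper if t < 0 else 1 - upper
    elif alternative == "greater":            # H_A: mu > mu0, right tail
        p = upper if t > 0 else 1 - upper
    elif alternative == "not_equal":          # H_A: mu != mu0, both tails
        p = 2 * upper
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'not_equal'")
    return t, p

# Wafer example: x_bar = 1.88, s^2 = 2.71, n = 81, old process mean 3.01.
alpha = 0.05
t_value, p_value = t_test_mean(1.88, math.sqrt(2.71), 81, 3.01, "less")
print(f"t = {t_value:.3f}, p-value = {p_value:.3g}, reject H0: {p_value < alpha}")
```

Rejecting when the p-value is below $\alpha$ is the same decision as the critical-value rules listed for the three tests, because the p-value is the smallest $\alpha$ for which the computed t lands in the rejection region.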
Assignment #4 (Due Saturday, Oct. 20th)
1. Book, 9.27
2. Book, 9.29
3. Book, 9.31
4. Book, 9.32
5. Book, 9.33
6. Book, 9.34