Module 8 - Wharton Statistics Department

STAT 101, Module 8:
Statistical Testing, null hypotheses, test statistics, p-values
(Book: chapter 10)
Motivation
• At the end of the last module we talked about values of μ that are
compatible with the data in terms of their X̄ and s. In this module we
expand on these thoughts and develop the vocabulary and
argumentation methods of statistical testing.
• Example 1: A manufacturer of consumer electronics would like to
know how many households intend to purchase a computer next year.
Management hopes that the proportion is greater than 10% in order to
justify sales projections. Does the survey that shows 14% willingness
to purchase support their claim? Could the good news be the result of
chance?
Since 10% is a crucial threshold it makes sense to use p = 0.10 as
what we call a null hypothesis and check whether the data from their
survey is compatible with this assumption.
• Example 2: Suppose you sample 25 students from the Penn class of
2006 and observe that their average SAT is 1380 with an SD of 125.
An admissions officer claims the average SAT is at least 1420. You
are surprised at the inconsistency, but then again, this assertion might
be compatible with the data. One could use μ = 1420 as a null
hypothesis and check how compatible it is with the numbers from the
sample.
• Example 3: It is claimed that a coin is fair. A sample of 144 tosses
results in 64 heads. Is this compatible with the assumption of a fair
coin? The natural null hypothesis is p=0.5 which defines fairness.
• Example 4: At the end of Module 7 we mentioned elections. Again, a
rate of 0.50 of favorable likely voters is critical to the claim of being
ahead in the polls. Therefore p = 0.50 is a natural null hypothesis.
• Example 5: It is known that a standard surgery requires a mean
hospital stay of 5.4 days. A new and less invasive type of surgery is
said to require only 3.3 days in the hospital on average. How sure are
we that the new method actually does require fewer hospital days?
It would appear natural in this case to play “devil’s advocate” and
check whether the null hypothesis of a population mean of 5.4
hospital days is actually compatible with the data for the new method
that seem to have a sample average of 3.3. If it is compatible with the
data, maybe the case for the new method is not strong enough, or one
needs more data.
Note that because of the long experience with the old type of surgery
one can assume N ≈ ∞, hence 5.4 can be seen as a population mean.
For the new type of surgery there will be much less experience and N
will be small at this exploratory stage, hence 3.3 should be interpreted
as X̄.
The Components of Statistical Testing
• Statistical testing can be used in several ways:
o Statistical testing can be a Socratic game: Allow someone to
make an assertion, and play along till it leads to an apparent
absurdity… or not. Example: the admission officer’s assertion
about SAT averages of students admitted by Penn.
o Statistical testing can be a devil’s advocate game: Assume an
undesirable scenario, and try to show it probably isn’t so.
Example: the assumption that the new type of surgery is no
better than the old type.
o Statistical testing can be used to check whether a norm is
likely to be satisfied. Example: examining fairness of a coin.
Note that we tend to cast statements in vague terms: “probably”, “likely”.
The reason is that statistical testing quantifies uncertainty about
conclusions. Science never deals in absolute certainties, although some
conclusions can reach certainty beyond reasonable doubt.
• Null hypotheses: A null hypothesis is an assumption about a
population quantity. Note the “population” part. Null hypotheses
are never about actually observed statistics computed from data.
Population values are the targets estimated by sample values, and the
sample values are used for inference about population values.
The two fundamental methods of statistical inference are:
1) confidence intervals and
2) statistical tests.
We consider only population means μ and population proportions p
(=probabilities), and the only type of assumption we will consider is
that μ or p take on a specific value of interest.
What these hypothesized values are depends on the context: If it is
about testing fairness of a coin or commanding a majority in the polls,
the natural null hypothesis is p=0.5. If the business plan asks for a
minimum demand of 10% of households, then p=0.1 is the natural
null hypothesis. If a new type of surgery is asserted to shorten
hospital stays, then the devil’s advocate says the mean reduction is
zero.
One could also consider null hypotheses about population standard
deviations σ, and this is done, but it is much less important. Below we
will consider differences in population means and population proportions
between groups, and finally null hypotheses about population slopes in
regression.
Notation for null hypotheses:
H0: μ = μ0
and
H0: p = p0
where μ0 and p0 are the assumed population values.
In the case of the new type of surgery, we could let μ be the
population mean of hospital days with the new procedure, so the null
hypothesis of no improvement over the old type of surgery is that both
types have the same population average:
H0: μ = 5.4
When it comes to testing fairness of coins or claims to majorities in
polls, the null hypothesis is:
H0: p = 0.5
o Reminder: H0: X̄ = 5.4 is completely mistaken. The quantity X̄ will lend
evidence about μ, but it cannot be the subject of a null hypothesis.
o Why “null” hypothesis? There is another type called “alternative
hypothesis”, hence “null” is opposed to “alternative”. The alternative
hypothesis is essentially “not the null hypothesis”. There are subtleties
about alternative hypotheses that we will not discuss here (two-sided
versus one-sided alternatives: Ha: μ ≠ μ0 and Ha: μ > μ0).
• Test Statistics: A test statistic computed from data provides evidence
for or against the null hypothesis. It is not too difficult for us to
devise a test statistic for the above null hypotheses. Similar to
confidence intervals, the ideas center on the deep fact that means vary
across datasets, that their variation can be quantified by the standard
error σ(X̄), and that σ(X̄) can be estimated from a single dataset by
the standard error estimate stderr = s(X)/√N.
Let us have another look at the graphs at the end of Module 7:
To play the game of testing a null hypothesis, we assume that it is true
and that the data have the hypothesized population mean μ0. We then
check how extreme the estimate X̄ of μ0 is in light of the distribution
of X̄:
o In the first graph above, X̄ is less than two standard errors away
from μ0. This is counted as compatible with the null hypothesis
that μ has this particular value.
o In the second figure, X̄ is more than two standard errors away.
One judges this X̄ to be too unlikely under the null hypothesis
and hence incompatible with it.
In light of the CLT, a good test statistic would be a Z-score formed
under the null hypothesis:
$$ z = \frac{\bar{X} - \mu_0}{\sigma(\bar{X})} $$
If z is more extreme than ±2 (that is, > +2 or < –2), we will say:
we reject H0. What we really mean is: H0 (the assumption that μ =
μ0) is not very compatible with the data.
An obvious problem is that while μ0 is specified by the null
hypothesis, the standard deviation σ(X) of the data and hence the
standard error σ(X̄) = σ(X)/√N are not specified and hence need to
be estimated. The result is what is called the t-statistic:
$$ t = \frac{\bar{X} - \mu_0}{\mathrm{stderr}(\bar{X})} $$
where stderr(X̄) = s(X)/√N is the standard error estimate as usual.
Comments:
o We can think of the t-statistic as a change of units in X̄: make
μ0 the new origin of the scale and make stderr the new unit.
If t = 1.5, then X̄ is 1.5 stderr to the right of μ0. Therefore, |t|
measures the distance of X̄ from μ0 in multiples of stderr.
o |t| is a measure of evidence against the null hypothesis: if |t|
> 2, we “reject the null hypothesis” (although see what
follows).
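As a minimal sketch of the t-statistic just defined, the following Python function computes t from the sample mean, the sample standard deviation and N. The function name `t_statistic` and its parameter names are illustrative choices, not part of the course material. The numbers are those of Example 2 in the Motivation section (N = 25, sample mean 1380, s = 125, μ0 = 1420).

```python
import math

def t_statistic(xbar, mu0, s, n):
    """Distance of the sample mean from mu0, in units of the standard error estimate."""
    stderr = s / math.sqrt(n)   # standard error estimate s(X)/sqrt(N)
    return (xbar - mu0) / stderr

# Example 2: sample of 25 students, mean 1380, SD 125, hypothesized mean 1420
t = t_statistic(xbar=1380, mu0=1420, s=125, n=25)
print(t)            # -1.6
print(abs(t) > 2)   # False: compatible with H0 by the rough cut-off of 2
```

The sign of t only says on which side of μ0 the sample mean falls; the rough rule compares |t| with 2.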
• Null Distribution: The probability distribution of the test statistic t
assuming H0: μ = μ0 is called the null distribution. Note it is a
hypothetical distribution, literally. It is used to judge what values of
t and hence of X̄ should be considered as giving evidence for or
against μ0. Large values |t| will count as evidence against μ0.
Now that we have replaced the denominator σ(X̄) of z with the
quantity stderr which is no longer a constant but a random variable,
the probability distribution of the resulting t has changed: If the
observations themselves are normal, the random variable z is normal,
but the random variable t is no longer exactly normal. It has what is
called “Student’s t-distribution” (recall the story of “Student” aka Gosset at
the Guinness Brewery in 1908). The t-distribution becomes very nearly
normal for large N, but for N < 60, the cut-off value, which should be
the 97.5% quantile, is greater than 2 and grows as N gets smaller.
Here is one more time the table from Module 7, where we included
N=∞, which is the normal distribution:
N:        10     15     20     30     40
t0.975:   2.23   2.13   2.09   2.04   2.02

N:        50     60     75     100    ∞
t0.975:   2.01   2.00   1.99   1.98   1.96
Using these “exact” cut-offs, we say we reject H0 when |t| > t0.975. The
union of the two intervals (–∞, –t0.975) and (t0.975, +∞) is called the
rejection region. The interval (–t0.975, t0.975) is called the “non-rejection region”.
Purists are against using the term “acceptance region”, hence it’s “non-rejection region”. Nicer terminology would use the words “incompatible”
and “compatible”, which is what μ0 and X̄ are depending on where t falls.
In the next graph below, the part of the axis with the gray area is the
rejection region, the part in between is the non-rejection region.
The t-statistic is always reported in null hypothesis testing. When you
see it, check it against the rough cut-offs ±2, but be aware that JMP
and all other software use the t-quantiles as in the above table; they
are exact if the observations are normally distributed. If the data are
not normally distributed (as for discrete and skewed distributions),
even the t-distribution is only an approximation. Visually, the t-distribution is indistinguishable from the normal distribution, except
when N is extremely small. The following figure shows the t-density
function for N=20.
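To see where such cut-offs come from, here is a rough sketch that computes quantiles of the t-distribution using only the Python standard library. It integrates the t-density numerically (Simpson's rule) and inverts the result by bisection; statistical software uses more refined algorithms, so this is an illustration, not how JMP does it. The function names are made up for this sketch. The distribution is indexed by its degrees of freedom `df`; for a one-sample test df is usually taken as N − 1, and whether a rounded table is indexed by N or by df can shift the last digit for small samples.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area under the density between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

def t_cutoff(alpha, df):
    """(1 - alpha/2)-quantile of the t-distribution, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (9, 19, 59, 1000):
    print(df, round(t_cutoff(0.05, df), 2))
```

The printed cut-offs shrink toward the normal value 1.96 as df grows, which is the pattern the table shows.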
• Significance Levels: The choice of boundaries at the 2.5% and 97.5%
quantiles of the null distribution amounts to a test at the significance
level α =5%, or simply at the 5% level. The significance level α is
the tail probability that defines the cut-off values, approximately ±2.
In the figures above, the gray areas denote the 5% tail probability α,
divided into two areas of α/2 = 2.5% each.
The choice of 5% is a convention that can be changed. The
significance level of 5% is the most frequent choice, but when the
evidence against the null hypothesis is required to be more stringent
in order to reject it, one chooses a significance level of 1% or even
lower. In this case, the quantiles for the t-distribution are as follows:
N:        10     15     20     30     40
t0.995:   3.25   2.98   2.86   2.76   2.71

N:        50     60     75     100    ∞
t0.995:   2.68   2.66   2.64   2.63   2.58
It appears that cut-offs ±2⅔ are a good and conservative choice for
testing at the 1% significance level. Again, all software, including
JMP, uses the “exact” quantiles of the t-distribution.
In general, for a given significance level α, one uses the (1–α/2)-quantile of the t-distribution as a cut-off. That is:
Reject H0 at the significance level α ⇔ |t| > t1–α/2
Comments:
o The lower the significance level α, the larger is the non-rejection region, and the less likely is rejection of the null
hypothesis.
o It is possible that we can reject at the 5% level, but not at the
1% significance level. For large N this is roughly the case if |t| is between 2 and 2.6:
2 < |t| < 2.6 means rejection at the 5% level, but not at the 1% level.
o The significance level α is also called the “Probability of a
Type 1 Error”. A Type 1 Error is the rejection of the null
hypothesis when it is in fact true. The probability that this
happens is exactly α, by construction:
P( rejection of H0 at the level α | H0 is true )
= P( |t| > t1–α/2 | H0 is true ) = α
See the red box above. So far we have discussed α as a tail
probability, which it is: α is the probability, computed under the
null hypothesis, of seeing a value of the t-statistic more extreme
than the cut-off t1–α/2.
(There is a notion of Type 2 Error, which is not-rejecting H0 when H0 is in
fact false. This is a more difficult concept and we will not explain it.)
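The statement that the Type 1 Error probability equals α can be checked by simulation. The sketch below is an illustration under stated assumptions: it draws many normal datasets of size N = 20 for which H0 is true, applies the 5% cut-off 2.09 from the table for N = 20, and counts how often H0 is rejected. The function names, the number of repetitions and the seed are arbitrary choices.

```python
import math
import random

def rejects(sample, mu0, cutoff):
    """True if |t| exceeds the cut-off for this sample."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    t = (xbar - mu0) / (s / math.sqrt(n))
    return abs(t) > cutoff

def type1_rate(n=20, cutoff=2.09, reps=20000, mu0=0.0, seed=1):
    """Fraction of datasets, generated with H0 true, in which H0 is rejected."""
    rng = random.Random(seed)
    hits = sum(
        rejects([rng.gauss(mu0, 1.0) for _ in range(n)], mu0, cutoff)
        for _ in range(reps)
    )
    return hits / reps

print(type1_rate())              # close to 0.05
print(type1_rate(cutoff=2.86))   # close to 0.01, using the 1% cut-off for N = 20
```

The rejection rates are not exactly 5% and 1% because the simulation has its own sampling variability and the cut-offs are rounded.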
• P-Values: The p-value is the achieved significance level. The idea
behind the p-value starts with the following question:
What would be the significance level for which the observed X̄
and t would be exactly on the cut-off?
That significance level is the p-value. The situation is depicted here:
Comments:
o The p-value is a measure of evidence in favor of H0: μ = μ0.
If the p-value falls below α, we say there is insufficient
evidence in favor of H0: μ = μ0, hence:
Reject H0 at the significance level α ⇔ p-value < α
o The p-value is a random variable, even though it is calculated
as a hypothetical probability assuming H0: μ = μ0 is true.
o The p-value is a transformation of |t| to the 0-1 range:
p-value = 1 ⇔ μ0 = X̄
p-value = 0 ⇔ | μ0 – X̄ | = ∞
o The p-value is the hypothetical probability of observing a value
of t more extreme than the one in hand. If this hypothetical
probability is small, it means the value of t in hand is extreme
under H0. Hence we reject H0.
o Why p-values are so popular: They allow testing a null
hypothesis at all conceivable significance levels. Once we
know the p-value, we know how to answer if someone asks for
a test at the 5% level, at the 1% level, at the 0.5% level… The
answer is always: if the p-value is below the significance level
α, we reject H0 at the significance level α.
o We see, therefore, that a p-value of 0.02 allows us to reject at the
5% level, but not at the 1% level.
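The comments above can be made concrete with a small sketch that turns |t| into a two-sided p-value. It assumes the null distribution is a t-distribution with `df` degrees of freedom and integrates its density numerically with Simpson's rule; the function names and the list of significance levels are illustrative, and software uses more refined algorithms.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, steps=2000):
    """P(|T| > |t|) under H0: one minus the area between -|t| and +|t|."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (total * h / 3)

def decisions(p_value, levels=(0.05, 0.01, 0.005)):
    """For each significance level, report whether H0 is rejected."""
    return {alpha: p_value < alpha for alpha in levels}

print(two_sided_p_value(0.0, 30))   # 1.0, the case where the sample mean equals mu0
print(decisions(0.02))              # {0.05: True, 0.01: False, 0.005: False}
```

One p-value answers the testing question at every significance level at once, which is the reason given above for its popularity.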
• Confused? That’s ok. Here are handy rules for real life:
o Reject H0 at the 5% significance level if the p-value is below
0.05. This never fails.
o If the t-statistic is < –2 or > +2, expect rejection, but in borderline cases where the t-statistic is very near +2 or –2, recall that
the cut-offs ±2 are not exact, hence trust the p-value.
o Keep in mind that statistical testing is a “what if” game. It
starts with “what if μ = μ0?” and checks what the consequences
are in light of the data. Rejection of μ = μ0 means that this
assumption is not compatible with the data.
• Confidence Intervals with Coverage Probability 1–α:
Logically equivalent to rejection at the 5% significance level is μ0
falling outside the “exact” 95% CI (provided by the software). The rough
CI = ( X̄ ± 2 stderr ) is usually correct but may fail in borderline cases
when | X̄ – μ0 | ≈ 2 stderr. The “exact” CI with coverage probability
1–α is:
CI1–α = ( X̄ – t1–α/2 · stderr, X̄ + t1–α/2 · stderr )
Therefore, with α = 1%, a rough 99% confidence interval is X̄ ± 2⅔ · stderr.
The general connection between α-level testing and (1–α)-CIs is:
Reject H0 at the significance level α ⇔ μ0 ∉ CI1–α
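The equivalence between testing and confidence intervals can be checked mechanically. In the sketch below the function names are illustrative, and the numbers are those of Example 2 (sample mean 1380, stderr 25) with the rough cut-off 2 as a stated simplification; software would use the exact t-quantile for N = 25 instead.

```python
def confidence_interval(xbar, stderr, cutoff):
    """Interval of all mu0 values within `cutoff` standard errors of xbar."""
    return (xbar - cutoff * stderr, xbar + cutoff * stderr)

def rejects_by_t(xbar, mu0, stderr, cutoff):
    """Reject H0: mu = mu0 if |t| exceeds the cut-off."""
    return abs((xbar - mu0) / stderr) > cutoff

def rejects_by_ci(xbar, mu0, stderr, cutoff):
    """Reject H0: mu = mu0 if mu0 lies outside the confidence interval."""
    lo, hi = confidence_interval(xbar, stderr, cutoff)
    return not (lo <= mu0 <= hi)

xbar, stderr, cutoff = 1380.0, 25.0, 2.0
print(confidence_interval(xbar, stderr, cutoff))   # (1330.0, 1430.0)
for mu0 in (1300, 1340, 1420, 1450):
    # Both decision rules give the same answer for every mu0 tried here
    print(mu0, rejects_by_t(xbar, mu0, stderr, cutoff),
          rejects_by_ci(xbar, mu0, stderr, cutoff))
```

The hypothesized value 1420 lies inside the rough interval, so it is not rejected by either rule, while 1300 and 1450 are rejected by both.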
• Testing Means in JMP:
Analyze > Distribution > (select Y,Columns) > OK
(click tiny red triangle icon, next to variable name) Test Mean
> (enter the values μ0 or p0 to be tested in the upper field) > OK
Here is Example 3, the problem of testing fairness of a coin (H0:p=.5)
where 64 heads in 144 flips were observed (Sim Dice and Coin
Flips.JMP):
Moments
Mean               0.4444444
Std Dev            0.4986384
Std Err Mean       0.0415532
upper 95% Mean     0.5265823
lower 95% Mean     0.3623066
N                  144

Test Mean=value
Hypothesized Value   0.5
Actual Estimate      0.44444
df                   143
Std Dev              0.49864

t Test
Test Statistic     -1.3370
Prob > |t|          0.1834
Prob > t            0.9083
Prob < t            0.0917

[Figure: null distribution of the mean, horizontal axis from .40 to .60]
JMP gives you a picture of the null distribution with the area of the p-value colored in blue. Note that it is centered at the hypothesized
population mean 0.5, shown also in the numeric output. We see the
mean or proportion twice: among moments and below the
hypothesized value.
o Our two-sided p-value is written as “Prob > |t|”. Its value is
0.1834. Since it is not below 0.05, we do not reject the null
hypothesis.
Our p-value is followed by two one-sided p-values for which we have no
use; they are associated with one-sided alternative hypotheses.
o The “Test Statistic” is the t-statistic (it can be the z-statistic if the
standard deviation is known). Its value –1.337 is between ±2,
hence again no rejection.
o The CI (0.362, 0.527) contains the hypothesized value 0.5, hence
yet again no rejection.
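The moments and the test statistic in the JMP output can be recomputed from the raw data. The sketch below assumes the coin flips are coded as 1 for heads and 0 for tails (the coding in the JMP file is not shown here, so this is an assumption) and uses only the Python standard library.

```python
import math
import statistics

flips = [1] * 64 + [0] * 80          # 64 heads in 144 tosses, coded 1/0
n = len(flips)
mean = statistics.mean(flips)        # sample proportion of heads
sd = statistics.stdev(flips)         # sample SD with N - 1 in the denominator
stderr = sd / math.sqrt(n)           # "Std Err Mean" in the JMP output
t = (mean - 0.5) / stderr            # test statistic for H0: p = 0.5

print(round(mean, 7), round(sd, 7), round(stderr, 7), round(t, 3))
```

The four numbers agree with the Mean, Std Dev, Std Err Mean and Test Statistic entries of the output to the precision shown there.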
• Example 1: Recall the manufacturer’s target is an excess of 10% take
rate, and the survey says the rate of self-declared intent of purchase is
14% of the households. Since 10% is the critical border line, we take
H0: p=0.10 as the null hypothesis, and the question to be answered is
whether the observed proportion p̂ =0.14 lends evidence against H0.
To proceed, we need one more piece of information: the sample size,
which happens to be N= 500. At the end of Module 7 we saw that the
standard error estimate for the proportion is
stderr( p̂ ) = √( p̂ (1 – p̂ ) / N ) = √(0.14·0.86/500) = 0.0155
hence the test statistic is
$$ \frac{\hat{p} - p_0}{\mathrm{stderr}(\hat{p})} = \frac{0.14 - 0.10}{0.0155} = 2.58 . $$
Now this is fortunate: 2.58 is greater than 2. Hence we can reject the
assumption that the true population proportion is 10%.
• Example 2: The null assumption in the Penn student SAT problem is
H0: μ = 1420, the assertion made by the admission official. He/she
may have made the assertion based on the complete census of Penn
students; we wouldn’t know, it’s just a very specific assertion. Our
evidence is rather scant: a random sample of N=25 students with a
sample mean X̄ = 1380 and a standard deviation s=125. Hence the
standard error estimate is s/√N = 125/5 = 25. The test statistic is
$$ t = \frac{\bar{X} - \mu_0}{\mathrm{stderr}(\bar{X})} = \frac{1380 - 1420}{25} = -1.6 $$
The absolute value 1.6 is clearly below 2, hence the assertion that the
population mean of SAT scores is 1420 is compatible with the data.
A problem is of course that the sample is so small. With a larger sample
we’d have a better chance to refute the admission official.
• Example 4: Can a candidate with 56% of likely voters in his/her favor
brag that he/she has a majority? We need to know the sample size of
the survey. If it is N=961, then stderr = √(.56 · .44/961) = 0.016, and
the test statistic is (.56 – .50)/.016 = 3.75. The p-value would be
0.0001874, which is smaller than all conventional significance levels,
and hence the majority is a pretty sure thing. It means that if the truth is still p=0.50,
then one would find a value as extreme as 56% in fewer than 2 out of
10,000 surveys of size N=961.
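The arithmetic for the two proportion examples (Examples 1 and 4) can be sketched in a few lines of Python. The function name `proportion_test` is an illustrative choice; it uses the estimated standard error √( p̂ (1 – p̂ ) / N ) exactly as in the text.

```python
import math

def proportion_test(p_hat, p0, n):
    """Return (stderr, test statistic) for H0: p = p0, using the estimated stderr."""
    stderr = math.sqrt(p_hat * (1 - p_hat) / n)
    return stderr, (p_hat - p0) / stderr

examples = [("Example 1", 0.14, 0.10, 500), ("Example 4", 0.56, 0.50, 961)]
for label, p_hat, p0, n in examples:
    stderr, stat = proportion_test(p_hat, p0, n)
    print(label, round(stderr, 4), round(stat, 2))
```

Both test statistics exceed the rough cut-off 2, in line with the rejections reported in the two examples.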