Module 7:
Hypothesis Testing I
Statistics (OA3102)
Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Reading assignment:
WM&S chapter 10.1-10.5
Goals for this Module
• Introduction to testing hypotheses
– Steps in conducting a hypothesis test
– Types of errors
– Lots of terminology
• Large sample tests for the means and
proportions
– Aka z-tests
– Type II error probability calculations
– Sample size calculations
A Simple Example
• You hypothesize a coin is fair
– To test, take a coin and start flipping it
– If it is fair, you expect that about half of the flips
will be heads and half tails
– After a large number of flips, if you see either a
large fraction of heads or a large fraction of tails
you tend to disbelieve your hypothesis
• This is an informal hypothesis test!
• How to decide when the fraction is too large?
Is the Coin Fair?
• Back to the in-class exercise from Module 1: flip a
coin 10 times and count the number of heads
• Assuming the coin is fair (and flips are independent),
the number of heads has a binomial distribution with
n=10 and p=0.5
• So, if our assumption is true, the distribution of the number of heads is Binomial(n=10, p=0.5), shown in the sketch below
• And, if our assumption is not true, what do we expect to see? Either too few or too many heads
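A minimal R sketch (not from the slides) that computes and plots this Binomial(10, 0.5) distribution:
# Pmf of the number of heads in 10 flips of a fair coin
x <- 0:10
probs <- dbinom(x, size = 10, prob = 0.5)
round(probs, 4)
barplot(probs, names.arg = x, xlab = "Number of heads", ylab = "Probability",
        main = "Binomial(n = 10, p = 0.5)")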
Setting Up a Test for the Coin
• Idea: Make a rule, based on the number of heads
observed, from which we will conclude either that our
assumption (p=0.5) is true or false
• Here’s one rule: Assume p=0.5 if 3 ≤ x ≤ 7
– Otherwise, conclude p≠0.5
• If p=0.5, what’s our chance of making a mistake? ~11%
• If p=0.8, what’s our chance of detecting the biased coin? ~68%
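A quick R check of these two probabilities (a sketch; the acceptance set is 3 through 7 heads):
# Accept "the coin is fair" if 3 <= heads <= 7; otherwise reject
accept <- 3:7
1 - sum(dbinom(accept, size = 10, prob = 0.5))   # Type I error, about 0.11
1 - sum(dbinom(accept, size = 10, prob = 0.8))   # chance of detecting p = 0.8, about 0.68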
CIs vs. Hypothesis Testing
• Previously we would have answered this
question with a confidence interval:
– Say we observed y/n = 0.7: we’re 95% confident that the interval [0.5, 0.9] covers the true p
• Now we look to answer the question using a
hypothesis test:
– If the true probability of heads is p = 0.5 (i.e., the
coin is fair), how unlikely would it be to see 7
heads out of ten flips?
• If we see an outcome inconsistent with our
hypothesis (the coin is fair), then we reject it
Why Do Hypothesis Tests?
• Confidence intervals provide more information
than hypothesis tests
• But often we need to test a particular theory
– E.g., “The mod decreases the mean down-time.”
• Sometimes confidence intervals are
hard/impossible
– When the theory you’re testing has several
dimensions
• E.g., regression slope and intercept
– When “intervals” don’t make sense
• E.g., are two categorical variables independent?
Elements of a Statistical Test
• Null hypothesis
– Denoted H0
– We believe the null unless/until it is proven false
• Alternative hypothesis
– Denoted Ha
– It’s what we would like to prove
• Test statistic
– The test statistic is the empirical evidence from the data
• Rejection region
– If the test statistic falls in the rejection region, we say “reject
the null hypothesis” or “we’ve proven the alternative”
– Otherwise, we “fail to reject the null”
Errors in Hypothesis Testing
                         true situation
conclusion               H0 true            Ha true
don’t reject H0          no error           Type II error
reject H0                Type I error       no error
Probability of a Type I Error
• α = Pr(Type I error) = Pr(H0 is rejected when it is true)
• Also called the significance level or the level of the test
• Experimenter chooses prior to the test
– Conventions: α = 0.10, 0.05, and 0.01
– Can increase or decrease depending on particular problem
• Choice of α defines rejection region (or size of “p-value”) that results in rejection of null
Probability of a Type II Error
• β = Pr(Type II error) = Pr(not rejecting H0 when it is false)
• It is a function of the actual alternative distribution and the sample size
• 1 − β is called the power of a test
– It’s Pr(rejecting H0 when it is false)
• Ideal is both small α and small β (i.e., high power), but for a fixed sample size they trade off
– By convention, we control α by choice and β with sample size (bigger sample, more power)
Choosing the Null and Alternative
Based on the Severity of Error
• In hypothesis testing, we get to control the Type I
error
– So, if one error is more severe than the other, set the test up
so that it’s the Type I error
• E.g., in the US, we consider sending an innocent
person to jail more serious than letting a guilty person
go free
– Hence, the burden of proof of guilt is placed on the
prosecution at trial (“innocent until proven guilty”)
– I.e., the null hypothesis is a person is innocent, and the trial
process controls the chance of sending an innocent person
to jail
Example 10.1
• Consider a political poll of n=15 people
– Want to test H0: p = 0.5 vs. Ha: p < 0.5, where
p is the proportion of the population favoring a
candidate
– The test statistic is Y, the number of people in the
sample favoring the candidate
• If the RR = {y ≤ 2}, find α
• Solution:
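A quick R check of this calculation (a sketch, not the slide’s worked solution):
# alpha = Pr(Y <= 2) when Y ~ Binomial(n = 15, p = 0.5)
pbinom(2, size = 15, prob = 0.5)   # about 0.0037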
Example 10.1 (continued)
Example 10.2
• Continuing with the previous problem, if p=0.3, what is the probability of a Type II error (β)?
– I.e., what’s the probability of concluding that the candidate will win?
• Solution:
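In R (again assuming the rejection region {y ≤ 2} from Example 10.1):
# beta = Pr(fail to reject) = Pr(Y > 2) when Y ~ Binomial(15, 0.3)
1 - pbinom(2, size = 15, prob = 0.3)   # about 0.87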
Example 10.2 (continued)
Example 10.3
• Still continuing with the previous problem, if p=0.1, then what is the probability of a Type II error (β)?
• Solution:
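The analogous R calculation (a sketch):
# beta = Pr(Y > 2) when Y ~ Binomial(15, 0.1)
1 - pbinom(2, size = 15, prob = 0.1)   # about 0.18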
Example 10.3 (continued)
Example 10.4
• Still continuing the problem, if RR = {y ≤ 5}:
– What is the level of the test (α)?
– If p = 0.3, what is β?
• Solution:
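Both quantities in R (a sketch, assuming RR = {y ≤ 5}):
# alpha = Pr(Y <= 5) when p = 0.5; beta = Pr(Y > 5) when p = 0.3
pbinom(5, size = 15, prob = 0.5)       # alpha, about 0.15
1 - pbinom(5, size = 15, prob = 0.3)   # beta, about 0.28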
Example 10.4 (continued)
Terminology
• Null and alternative hypotheses
– We will believe the null until it is proven false
• Acceptance vs. rejection region
– The null is proven false if the test statistic falls in the
rejection region
• Type I vs. Type II error
– Type I: Rejecting the null hypothesis when it is really true
– Type II: Accepting the null hypothesis when it is really false
• Significance level or level of the test (α)
– Probability of a Type I error
• Pr(Type II error) = β, and 1 − β is called the power
– It’s a function of the actual alternative distribution
Another Example
• As a program manager, you want to decrease mean down-time μ for a type of equipment
– Manufacturer suggests modification
– Goal is to see if mod actually does decrease μ
• Implement modification on a sample of equipment (n=25) and measure the down-time (say, per month)
– Currently, equipment down-time is 75 mins/month
– Standard deviation is σ = 9 mins
Intuition Behind the Test
1. We will assume the status quo is true…
a. In this case, that there is no change in the mean down-time
2. …unless there is sufficient evidence to contradict it
a. In this case, meaning we observe a much smaller sample mean than we would expect to see assuming μ = 75
[Figure: sampling distribution of the sample mean X̄, centered at μ = 75 with standard error σ/√n = 9/5]
Steps in Hypothesis Testing
1. Identify the parameter of interest
2. State null and alternative hypotheses
3. Determine form of test statistic
4. Calculate rejection region
5. Calculate test statistic
6. Determine test outcome by comparing test statistic to rejection region
1. Identify Parameter of Interest
• In hypothesis tests, we are testing the parameter of a
distribution
– E.g., μ or σ for a normal distribution
– E.g., p for a binomial distribution
– E.g., α and β for a gamma distribution
• So, the first step is to identify the parameter of
interest
– Often we’ll be testing the mean μ of a normal, since the CLT
often applies to the sample mean
• E.g., in the equipment down-time example, we’re
interested in testing the mean down-time
– For the coin and election examples, we’re testing p
2. State Null and Alternative Hypotheses
• H0: The null hypothesis is a specific theory
about the population that we wish to disprove
– We will believe the null until it is disproved
– Example: Mean down-time is equal to 75
• In notation, H0: μ = 75
• Ha: The alternative hypothesis is what we want
to prove
– What we will believe if the null is rejected
– Example: Mean down-time is less than 75
• In notation, Ha: μ < 75
Null and Alternative Hypotheses are
Fundamentally Different
• The null hypothesis is what you have assumed
– Generally, it’s the status quo or less desirable test outcome
– Failing to prove the alternative does not mean the null is true,
only that you don’t have enough evidence to reject it
• The alternative is proven based on empirical evidence
– It’s the desired test outcome and/or the outcome upon which
the burden of proof rests
– The significance level (α) is set so that the chance of incorrectly
“proving” the alternative is tolerably low
• Having proved the alternative is a much stronger
outcome than failing to reject the null
– Thus, structure the test so the alternative is what needs proving
In the Example
• In the example, what we want to test is whether the mod decreases mean down-time
– So, the null hypothesis is the status quo and the alternative carries the burden of proof to show a decrease
– We write this out as
H0: μ = 75
Ha: μ < 75
• The other possibilities are
H0: μ = 75 vs. Ha: μ > 75
and
H0: μ = 75 vs. Ha: μ ≠ 75
– What would you be testing with these hypotheses?
Expressing the Null as an Equality
• We will express the null hypothesis as an
equality and the alternative as an inequality
– E.g., H0: μ = 75 versus Ha: μ < 75
• In reality, the hypotheses divide the real line
into two separate regions
– E.g., H 0: m  75 versus H a: m  75
• However, the most powerful test occurs when
the null hypothesis is at the boundary of its
region
– Hence, we write the null as an equality
3. Determine Test Statistic
(and its Sampling Distribution)
• The test statistic is (a function of) the sample
statistic corresponding to population
parameter you are testing
– Population and sample statistic examples:
• Population mean → sample mean
• Difference of two population means → difference of two sample means
– It is sometimes “a function of” the sample statistic
as we may rescale the sample statistic
• We use the sampling distribution to determine
whether the observed statistic is “unusual”
In the Example
• In the example, we are testing the mean μ, so the obvious choice for the sample statistic is X̄
• In this case, it’s easier to make the test statistic the rescaled sample statistic,
Z = (X̄ − μ0) / (σ/√n),
where μ0 is the null hypothesis mean
• Why? Because, assuming the null hypothesis is true, we know the sampling distribution of Z: Z ~ N(0,1)
– This is exactly true if X has a normal distribution and approximately true via the CLT for large sample sizes
4. Calculate Rejection Region
• Rejection region depends on the alternative
hypothesis
• Set the significance level α so that the
Pr(fall in rejection region | null hyp. is true) = α
• Means you will have to determine the
appropriate quantile or quantiles of the
sampling distribution
The Way to Think About It
(for a “two-sided” test)
Rejection region – unlikely under the null (i.e., probability α)
If test statistic falls in this region, reject the null
Acceptance region – likely under the null hypothesis
If test statistic falls in this region, fail to reject null
In the Example
• In our example, the test statistic Z~N(0,1)
• In addition, we know Ha: μ < 75, so this is a “one-tailed” or “one-sided” test
• So, we need to find −zα so that Pr(Z < −zα) = α
• Choosing α = 0.05:
– From Table 4 we get Pr(Z < −1.645) = 0.05
– Using Excel: =NORMSINV(0.05)
– And in R: qnorm(0.05)
Picturing the Rejection Region
[Figure: Z ~ N(0,1) pdf with the rejection region z < −1.645 shaded in red]
• For the rescaled test statistic, the rejection region is in red
• Probability of falling in that region, assuming the null is true, is α = 0.05
[Figure: X̄ ~ N(75, 81/25) pdf with the rejection region below 75 − 1.645 × 9/5 ≈ 72 shaded]
• Here’s the equivalent test without rescaling
• Probability of falling in that region, assuming the null is true, is still α = 0.05
5. Calculate the Test Statistic
• Now, plug the necessary quantities into the
formula and calculate the test statistic
• E.g., in the example, imagine we tested 25 pieces of equipment for a month, measuring the total down-time for each
• Calculating the sample mean gives X̄ = 68.6
• Thus, the test statistic is
z = (X̄ − μ0) / (σ/√n) = (68.6 − 75) / (9/√25) = −3.8
6. Determine the Outcome
• Now, compare the test statistic with the
rejection region
– If it falls within the rejection region you have
“rejected the null hypothesis”
• Equivalently, “proven the alternative”
– If it falls in the acceptance region, you have
“failed to reject the null hypothesis”
• Equivalently, “failed to prove the alternative”
In the Example
• We observed z = 3.8 which falls in the
rejection region
• The picture:
Very unusual to see
this, if the null is true
z = 3.8
-1.645
0
• Thus, conclude that the mod is effective at
reducing mean down-time for the equipment
One Way to Think About It
• The test statistic tells us our observation was 3.8 standard errors away from the hypothesized mean
– The calculation:
z = (X̄ − μ0) / (σ/√n) = (68.6 − 75) / (9/√25) = −3.8
• How likely is this?
– If the null were true, it would be very unlikely:
Pr(Z ≤ −3.8 | H0 is true) = Φ(−3.8) = 0.000072
– Another way to think about it: about one chance in 14,000 to see something like this or more extreme (i.e., 1/0.000072)
– Pretty convincing evidence that the mod decreases mean down-time
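This tail probability is easy to check in R:
pnorm(-3.8)        # about 7.2e-05
1 / pnorm(-3.8)    # about one chance in 14,000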
Logic of Hypothesis Testing
It’s proof by contradiction:
– Suppose I am a general/flag officer  [Null hypothesis]
– People would salute me on Tuesdays  [What I expect if null hypothesis is true]
– No one ever salutes me  [What I see is inconsistent with what I expect]
– I must not be a general/flag officer  [My initial assumption must be wrong]
Now, for the Example…
– Assume μ is 75  [Null hypothesis]
– Sample mean (n=25) is normally distributed with mean 75, s.e. = 9/√25 = 1.8  [What you expect to see if null is true]
– Observe a sample mean of 68.6 that is 3.8 s.e.s below the assumed mean  [Not very likely to happen if null is true (p = 0.00007)]
– Real mean μ must be less than 75  [Null must be false]
Large-Sample or z-Tests
• The statistic θ̂ is (approximately) normally distributed
• The rescaled test statistic is Z = (θ̂ − θ0) / σ_θ̂
• The null hypothesis is H0: θ = θ0
• Three possible alternative hypotheses and tests:
Alternative hypothesis               Rejection region for level α test
Ha: θ > θ0 (upper-tailed test)       z ≥ zα
Ha: θ < θ0 (lower-tailed test)       z ≤ −zα
Ha: θ ≠ θ0 (two-tailed test)         z ≥ zα/2 or z ≤ −zα/2
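For concreteness, here is a small R sketch of these three rejection rules (z_test is a hypothetical helper written for this handout, not a built-in function):
# Large-sample z-test: returns the test statistic and the reject / fail-to-reject decision
z_test <- function(theta_hat, theta0, se, alpha = 0.05,
                   alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  z <- (theta_hat - theta0) / se
  reject <- switch(alternative,
                   greater   = z >=  qnorm(1 - alpha),         # upper-tailed
                   less      = z <= -qnorm(1 - alpha),         # lower-tailed
                   two.sided = abs(z) >= qnorm(1 - alpha / 2)) # two-tailed
  list(z = z, reject = reject)
}

# Down-time example: lower-tailed test of H0: mu = 75 with xbar = 68.6, sigma = 9, n = 25
z_test(68.6, 75, 9 / sqrt(25), alpha = 0.05, alternative = "less")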
Picturing z-Tests
[Figures: rejection regions for the upper-tailed, lower-tailed, and two-tailed tests]
* Figure from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
Example 10.5
• VP claims mean contacts/week is less than 15
– Data collected on random sample of n=36 people
– Given ȳ = 17 and s² = 9, is there evidence to refute the claim at a significance level of α = 0.05?
• Specify the hypotheses to be tested:
Example 10.5 (continued)
• Specify the test statistic and the rejection
region
Example 10.5 (continued)
• Conduct the test and state the conclusion
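A sketch of the full calculation in R, assuming the test is set up as H0: μ = 15 vs. Ha: μ > 15 to put the burden of proof on refuting the claim:
# Example 10.5: H0: mu = 15 vs. Ha: mu > 15
ybar <- 17; mu0 <- 15; s2 <- 9; n <- 36; alpha <- 0.05
z <- (ybar - mu0) / sqrt(s2 / n)   # large-sample z statistic
z                                  # 4
z >= qnorm(1 - alpha)              # TRUE: reject H0, evidence refutes the claim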
Large Sample Tests
for Population Proportion (p)
• If we have a large sample, then via the CLT, p̂ = Y/n has an approximate normal distribution
• For the null hypothesis H0: p = p0 there are three possible alternative hypotheses:
Alternative hypothesis               Rejection region for level α test
Ha: p > p0 (upper-tailed test)       z ≥ zα
Ha: p < p0 (lower-tailed test)       z ≤ −zα
Ha: p ≠ p0 (two-tailed test)         z ≥ zα/2 or z ≤ −zα/2
where the test statistic is z = (p̂ − p0) / √(p0(1 − p0)/n)
What’s a “Large Sample”?
• Remember that the Y in p̂ = Y/n is a sum of indicator variables
– If you sum enough, then the CLT “kicks in” and you can assume p̂ has an approximately normal distribution
• So use the same rule we used back in the CLT lecture:
n ≥ 9 × max(p0, q0) / min(p0, q0)
Example 10.6
• Machine must be repaired if it produces more than 10% defectives
– A random sample of n=100 items has 15 defectives
– Is there evidence to support the claim that the
machine needs repairing at a significance level of
α = 0.01?
• Specify the hypotheses to be tested:
Example 10.6 (continued)
• Specify the test statistic and the rejection
region
Example 10.6 (continued)
• Conduct the test and state the conclusion
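A sketch in R, assuming the test is set up as H0: p = 0.10 vs. Ha: p > 0.10:
# Example 10.6 proportion test
phat <- 15 / 100; p0 <- 0.10; n <- 100; alpha <- 0.01
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
z                        # about 1.67
z >= qnorm(1 - alpha)    # FALSE: fail to reject H0 at alpha = 0.01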
Example 10.7
• Reaction time study on men and women
conducted
– Data on independent random samples of 50 men
and 50 women collected giving
ȳ_men = 3.6, s²_men = 0.18, ȳ_women = 3.8, and s²_women = 0.14
– Is there evidence to suggest a difference in the true
mean reaction times between men and women at
the α = 0.01 level?
• Specify the hypotheses to be tested:
Example 10.7 (continued)
• Specify the test statistic and the rejection
region
Example 10.7 (continued)
• Conduct the test and state the conclusion
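A sketch of the two-sample calculation in R, assuming a two-tailed test of no difference in means:
# Example 10.7: H0: mu_men - mu_women = 0 vs. Ha: mu_men - mu_women != 0
ybar_m <- 3.6; s2_m <- 0.18; ybar_w <- 3.8; s2_w <- 0.14
n_m <- 50; n_w <- 50; alpha <- 0.01
z <- (ybar_m - ybar_w) / sqrt(s2_m / n_m + s2_w / n_w)
z                               # -2.5
abs(z) >= qnorm(1 - alpha / 2)  # FALSE: fail to reject H0 at alpha = 0.01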
Calculating Type II Error Probabilities
• The probability of a Type II error (b) is the
probability a test fails to reject the null when
the alternative hypothesis is true
– Note that it depends on a particular alternative hypothesis
– We can write it mathematically as Pr(fail to reject H0 | Ha is true with θ = θa)
– To determine it, first figure out the rejection region (a function of the null hypothesis), then calculate the probability of falling in the acceptance region when θ = θa
Calculating Type II Error
Probabilities, continued
• For example, consider the test H0: θ = θ0 versus Ha: θ > θ0
– Then the rejection region is of the form RR = {θ̂ : θ̂ > k} for some value of k
• So, the probability of a Type II error is
β(θa) = Pr(θ̂ not in RR when Ha true with θ = θa)
      = Pr(θ̂ ≤ k | θ = θa)
      = Pr( (θ̂ − θa)/σ_θ̂ ≤ (k − θa)/σ_θ̂ )
Example 10.8
• Returning to Example 10.5, find β if μa = 16
– Remember that μ0 = 15, α = 0.05, n = 36, ȳ = 17, and s² = 9
• Pictorially, we have:
Example 10.8 (continued)
• Now solving:
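One way to carry out the computation in R (a sketch):
# Example 10.8: beta at mu_a = 16 for the test in Example 10.5
mu0 <- 15; mu_a <- 16; sigma <- sqrt(9); n <- 36; alpha <- 0.05
se <- sigma / sqrt(n)
k  <- mu0 + qnorm(1 - alpha) * se   # reject H0 when ybar >= k (about 15.82)
pnorm((k - mu_a) / se)              # beta = Pr(fail to reject | mu = 16), about 0.36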
Example 10.8 (continued)
• An alternative way to solve:
Example 10.8 (continued)
Alternatively, the Power of a Test
• Remember, the power of a test is the probability the test will reject the null for a particular alternative hypothesis
– It’s just 1 − β
• Why is this important?
– Prior to doing a test, you naturally want to make sure you have sufficient power to prove “interesting” alternatives
– Sometimes, after a test results in a null result, you might want to know the probability of rejecting at the observed level
Sample Size Calculations
• The sample size n for which a level α test also has Type II error probability β at the alternative value μa is
n = [σ(zα + zβ) / (μa − μ0)]²       for a one-tailed test
n = [σ(zα/2 + zβ) / (μa − μ0)]²     for a two-tailed test (approximate sol’n)
– Here zα and zβ are the quantiles of the normal distribution for α and β
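A small R helper illustrating the one- and two-tailed formulas (sample_size_z is a hypothetical name, and the trailing example uses made-up down-time numbers for illustration):
# Sample size for a level-alpha z-test with Type II error beta at mu_a
sample_size_z <- function(sigma, mu0, mu_a, alpha, beta, two_sided = FALSE) {
  z_a <- if (two_sided) qnorm(1 - alpha / 2) else qnorm(1 - alpha)
  z_b <- qnorm(1 - beta)
  ceiling((sigma * (z_a + z_b) / (mu_a - mu0))^2)
}

# Hypothetical illustration: detect a drop in mean down-time from 75 to 72 minutes
# with sigma = 9 and alpha = beta = 0.05 (one-tailed)
sample_size_z(sigma = 9, mu0 = 75, mu_a = 72, alpha = 0.05, beta = 0.05)  # 98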
Example 10.9
• Returning to Example 10.5, assuming σ² = 9, what sample size n is required to test H0: μ = 15 vs. Ha: μ = 16 with α = β = 0.05?
• Solution:
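Plugging into the one-tailed formula in R (a sketch, not the slide’s worked solution):
# Example 10.9: one-tailed test, alpha = beta = 0.05, sigma = 3, mu0 = 15, mu_a = 16
sigma <- 3; mu0 <- 15; mu_a <- 16
n <- (sigma * (qnorm(0.95) + qnorm(0.95)) / (mu_a - mu0))^2
ceiling(n)   # about 98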
Another Example
• Consider a test: H0: μ = 30,000 vs. Ha: μ ≠ 30,000, where
– The desired significance level is α = 0.01
– The population has a normal distribution with σ = 1,500
• Find the required sample size so that at μa = 31,000, the probability of a Type II error is 0.1:
n = [σ(zα/2 + zβ) / (μa − μ0)]² =
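And the corresponding R calculation (a sketch):
# Two-tailed test: alpha = 0.01, beta = 0.10, sigma = 1500, mu0 = 30000, mu_a = 31000
sigma <- 1500; mu0 <- 30000; mu_a <- 31000
n <- (sigma * (qnorm(1 - 0.01 / 2) + qnorm(1 - 0.10)) / (mu_a - mu0))^2
ceiling(n)   # about 34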
Relationship Between Hypothesis
Tests and Confidence Intervals
• The large-sample hypothesis test (z-test) is based on the statistic
Z = (θ̂ − θ0) / σ_θ̂
where the acceptance region (the complement of the rejection region) is
−zα/2 ≤ (θ̂ − θ0)/σ_θ̂ ≤ zα/2
which we can write as
θ̂ − zα/2 σ_θ̂ ≤ θ0 ≤ θ̂ + zα/2 σ_θ̂
Relationship Between Hypothesis
Tests and Confidence Intervals
• Now, remember the general form for a two-sided large-sample confidence interval:
100(1 − α)% CI = [θ̂ − zα/2 σ_θ̂, θ̂ + zα/2 σ_θ̂]
• Note the similarities to the acceptance region:
θ̂ − zα/2 σ_θ̂ ≤ θ0 ≤ θ̂ + zα/2 σ_θ̂
• So, if θ0 ∈ [θ̂ − zα/2 σ_θ̂, θ̂ + zα/2 σ_θ̂], we would fail to reject H0 in the hypothesis test
– Can interpret the 100(1 − α)% CI as the set of all values of θ0 for which H0: θ = θ0 is “acceptable” at level α
What We Covered in this Module
• Introduced hypothesis testing
– Steps in conducting a hypothesis test
– Types of errors
– Lots of terminology
• Large sample tests for the means and
proportions
– Aka z-tests
– Type II error probability calculations
– Sample size calculations
Homework
• WM&S chapter 10
– Required: 2, 3, 20, 21, 27, 34, 38, 41, 42
– Extra credit: None
• Useful hints:
– Always first write out the null and alternative hypotheses. I also recommend drawing the null distribution and then highlighting the rejection region. This can be particularly helpful when calculating β...
– Exercises 21 and 27: We didn’t do these types of problems in class, but just use what you learned with two-sample confidence intervals…
• The relevant hypotheses are
H0: μ1 − μ2 = 0 vs. Ha: μ1 − μ2 ≠ 0
and
H0: p1 − p2 = 0 vs. Ha: p1 − p2 ≠ 0