Chapter 12: Introduction to statistical inference

Chapter 12: Testing hypotheses about single means (z and t)
Example: Suppose you hypothesize that UW undergrads have a higher average IQ
than the US population. You know that IQs of the whole US population are
normally distributed with a mean of 100 and a standard deviation of 15.
How would you test your hypothesis?
The solution is to obtain a random sample of IQs from the UW population,
calculate the mean, and compare it to 100.
Let’s say we measure the IQs of 25 students and obtain a mean of 106 points.
Is a mean of 106 points really different from 100?
We need to compare the mean from our sample to the US population mean of 100
and ask the question:
What if the population of UW students really has a mean IQ of 100, like the US
population? How unlikely would it be for us to make this observation by chance?
More specifically, how likely would it be for us to draw a sample mean that
exceeds 100 by 6 points or more by chance?
If it’s sufficiently unlikely, we’d consider this evidence in favor of our hypothesis that
UW students have higher IQs.
More formally, we call the thing we’re trying to prove wrong the null hypothesis
(H0), and the thing we’re trying to show to be true the alternative hypothesis (HA).
In our example, the null hypothesis is that there is not a difference between the
mean IQ scores of UW students and the US population. The alternative
hypothesis is that UW students have a higher mean IQ than the US population.
We compute a statistic from a sample and determine how probable a statistic at
least as extreme as the one we observed would be if the null hypothesis were true.
If this probability is sufficiently low, we reject the null hypothesis.
Our criterion probability for rejection is called the ‘alpha (α) value’.
Choosing a value of alpha is both complicated and somewhat arbitrary (more on
this later). But typical values are α = 0.05 or α = 0.01. An alpha value of 0.05
means that we reject the null hypothesis only if there is less than a .05
probability (1 in 20) of observing a sample statistic at least as extreme as ours
if the null hypothesis were true.
Here is a step-by-step recipe for hypothesis testing using our UW IQ example.
Step 1: Define the target population. This is the population that we want to
make an inference about. In this case, we want to make an inference about UW
undergrad IQs
Step 2: Specify the null hypothesis (H0). This is the hypothesis we hope to reject.
In our example, our null hypothesis is that UW students have a mean IQ of 100.
We write this as:
H0: μ_X = 100
Step 3: Specify the alternative hypothesis (HA). We must choose between a
directional (‘one-tailed’) or non-directional (‘two-tailed’) test here.
In our example we are expecting (hoping for) a mean IQ that is greater than that
of the US population, so this is a directional, or one-tailed, test.
HA: μ_X > 100
Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for decision.
This is the probability that we will reject the null hypothesis by chance when it
is actually true. We’ll choose α = .05 for this example.
Step 5: Decide on a sample size (n) and draw a random sample from our target
population.
In our example, our sample had 25 students.
Step 6: Calculate your statistic on your sample (the mean in our example).
In our example, we obtained a mean IQ of 106 points
Step 7: Convert your statistic into standard units with respect to your null
hypothesis. In our example, we’ll calculate the z-score with a standard error of the
mean:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{25}} = 3$$
$$z = \frac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}} = \frac{106 - 100}{3} = 2$$
Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. For the
standard normal (z) distribution, the region of rejection is the upper tail containing a
proportion of area equal to α = .05. Looking this up in Table A (Column C), this
corresponds to a value of z = 1.645.
[Figure: standard normal distribution (z score axis) showing the upper-tail rejection region (area = α = .05) and the observed z = 2.]
Our observed mean corresponds to z = 2, which is within the region of rejection. This
means that our observation would be unlikely if the null hypothesis were true. We
therefore reject the null hypothesis. We say that “our study shows that UW students
have statistically significantly higher IQs than the US population using a criterion
value of α = .05.”
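The eight steps above can be sketched in a few lines of Python. This is only an illustration using the standard library's `statistics.NormalDist` in place of Table A; the variable names are our own and not part of the original recipe.

```python
from math import sqrt
from statistics import NormalDist

mu_hyp = 100        # mean under the null hypothesis (Step 2)
sigma = 15          # known population standard deviation
n = 25              # sample size (Step 5)
sample_mean = 106   # observed sample mean (Step 6)
alpha = 0.05        # level of significance (Step 4)

# Step 7: convert the sample mean to a z-score
sem = sigma / sqrt(n)                # standard error of the mean
z = (sample_mean - mu_hyp) / sem

# Step 8: one-tailed (upper) critical value, replacing the table lookup
z_crit = NormalDist().inv_cdf(1 - alpha)

print(f"standard error = {sem:.2f}")
print(f"z = {z:.2f}, critical z = {z_crit:.3f}")
print("reject H0" if z > z_crit else "fail to reject H0")
```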
What if we had chosen a criterion of α = .01 instead of .05?
This corresponds to a rejection region for values of z greater than 2.33.
Our observation of z = 2 does not fall into this region, so in this case we would fail to
reject H0.
[Figure: standard normal distribution (z score axis) showing the upper-tail rejection region (area = α = .01) and the observed z = 2.]
If our choice of criterion (α) seems arbitrary, that’s because it is. To give the reader
more information, we can report the probability of our observation under the null
hypothesis. In our example, this is the area under the curve above z = 2, or
Pr(z > 2) = .0228.
This value is often called the p-value, and we write p = .0228. Note that this p-value
falls between our two α values of 0.05 and 0.01.
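As a rough check on the table lookup, the one-tailed p-value can be computed from the standard normal cumulative distribution. This is a minimal sketch; `p_value` and the loop over the two criteria are our own illustration.

```python
from statistics import NormalDist

z = 2.0  # observed z-score from the UW IQ example

# Area under the standard normal curve above z (upper tail)
p_value = 1 - NormalDist().cdf(z)
print(f"p = {p_value:.4f}")

# Compare the same p-value against both criteria discussed above
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```

Reporting the p-value lets the reader apply whatever criterion they prefer, rather than only learning the outcome for one value of alpha.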
Another example: Suppose we have a drug that we think can influence IQ values.
How would we test if this drug has an effect?
Step 1: Define the target population. We’ll be randomly sampling from the US
population this time.
Step 2: Specify the null hypothesis (H0). Like before, our null hypothesis is
H0: μ_X = 100
Step 3: Specify the alternative hypothesis (HA). By ‘influence’ we’re not specifically
predicting an increase (or decrease) in IQ. So we’ll use a two-tailed test and write:
HA: μ_X ≠ 100.
Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for decision.
We’ll choose α = .05 again.
Step 5: Decide on a sample size (n)
Let’s run our experiment on 100 subjects.
Step 6: Calculate your statistic on your sample (the mean in our example).
Suppose we obtained a mean IQ of 96 from our 100 subjects.
Step 7: Convert your statistic into standard units with respect to your null
hypothesis. In our example, we’ll calculate the z-score with a standard error of the
mean:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{100}} = 1.5$$
$$z = \frac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}} = \frac{96 - 100}{1.5} = -2.67$$
Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. We
want to find the values of z that have an area of α/2 = .025 in each tail. This
corresponds to values of z = ±1.96.
[Figure: standard normal distribution (z score axis) showing rejection regions of area α/2 = .025 in each tail and the observed z = -2.67.]
Our observed mean corresponds to z = -2.67, which is within the region of rejection.
We therefore reject the null hypothesis. We conclude that our drug has a significant
influence on IQ values at a criterion level of α = .05.
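The two-tailed version of the test differs from the one-tailed version only in how the critical value is found and in comparing the absolute value of z. A minimal sketch, again with `statistics.NormalDist` standing in for the table and with variable names of our own choosing:

```python
from math import sqrt
from statistics import NormalDist

mu_hyp, sigma = 100, 15     # null-hypothesis mean and known population sd
n, sample_mean = 100, 96    # sample size and observed mean
alpha = 0.05

sem = sigma / sqrt(n)
z = (sample_mean - mu_hyp) / sem

# Two-tailed test: split alpha into two tails of area alpha/2 each
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z) > z_crit

print(f"z = {z:.2f}, critical values = +/-{z_crit:.2f}")
print("reject H0" if reject else "fail to reject H0")
```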
p-values: Calculating a p-value for a two-tailed test corresponds to calculating the
area under the standard normal curve in both tails: the area above the absolute
value of our observed z plus the area below its negative.
[Figure: standard normal distribution (z score axis) showing the observed z = -2.67 and its mirror image z = +2.67, each cutting off a tail area of .0038.]
The area below z=-2.67 is .0038, which is the same as the area above z=+2.67. So
our p-value is p=.0038 x 2 = .0076.
The p-value is the probability of obtaining a result at least as extreme as the one we
observed if H0 is actually true.
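The doubling of the tail area can be written out directly. In this sketch z is left unrounded, so the result may differ in the fourth decimal place from a table lookup at z = 2.67.

```python
from statistics import NormalDist

z = (96 - 100) / 1.5   # observed z-score from the drug example

# Area in one tail beyond |z|, then doubled for the two-tailed test
one_tail = NormalDist().cdf(-abs(z))
p_value = 2 * one_tail

print(f"one-tail area = {one_tail:.4f}")
print(f"two-tailed p  = {p_value:.4f}")
```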
Note that if we had decided ahead of time to use a one-tailed test, with an
alternative hypothesis of HA: μ_X > 100, our region of rejection for α = .05 would
include values of z greater than 1.645.
In this case, we would have failed to reject H0 and would conclude that our drug did
not significantly increase IQs at a criterion level of α = .05.
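To see how much the direction of a one-tailed test matters, here is a small sketch that checks the same z-score against an upper-tail and a lower-tail rejection region. The lower-tail alternative (HA: mean < 100) is included only for contrast and is not part of the example above.

```python
from statistics import NormalDist

z = (96 - 100) / 1.5
alpha = 0.05

z_crit = NormalDist().inv_cdf(1 - alpha)   # one-tailed critical value

reject_upper = z > z_crit     # HA: mean > 100 (the test described above)
reject_lower = z < -z_crit    # HA: mean < 100 (hypothetical, for contrast)

print(f"critical z = {z_crit:.3f}")
print(f"HA: mean > 100 -> {'reject' if reject_upper else 'fail to reject'} H0")
print(f"HA: mean < 100 -> {'reject' if reject_lower else 'fail to reject'} H0")
```

This is why the direction has to be chosen before looking at the data: the same observation leads to opposite decisions.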
[Figure: standard normal distribution (z score axis) showing the upper-tail rejection region (area = α = .05) and the observed z = -2.67.]
The t-distribution: when we don’t know σ
What if we don’t know the standard deviation of the population from which we
obtained our sample? This is a much more common situation. How do we estimate
this value?
Common sense says that we’d use the standard deviation of our sample as an
estimate of the population’s standard deviation (and therefore use the standard error
of the mean of our sample as an estimate of the population’s standard error of the
mean).
This is generally correct, but we have to make two changes:
1) We need to change our formula for the standard deviation to use n-1 instead of n:
$$s_X = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
To get our estimate of the population’s standard error of the mean, we still divide
by the square root of our sample size:
$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}}$$
2) Our standardized measure no longer comes from a normal distribution. Instead, it
comes from what is called a ‘t-distribution’:
$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}}$$
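The three formulas above can be tried on a small data set. The scores below are made up purely for illustration. Note that Python's `statistics.stdev` divides by n-1, while `statistics.pstdev` divides by n.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

scores = [98, 104, 110, 95, 107, 101]   # hypothetical sample, not real data
mu_hyp = 100                            # hypothesized population mean

n = len(scores)
x_bar = mean(scores)

s = stdev(scores)   # sample standard deviation, divides by n - 1
# The same quantity written out from the formula
s_manual = sqrt(sum((x - x_bar) ** 2 for x in scores) / (n - 1))

sem = s / sqrt(n)               # estimated standard error of the mean
t = (x_bar - mu_hyp) / sem      # t statistic with n - 1 degrees of freedom

print(f"s (n-1) = {s:.3f}, s (n) = {pstdev(scores):.3f}")
print(f"standard error = {sem:.3f}, t = {t:.3f}, df = {n - 1}")
```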
What happened to our normal distribution? Note that now the mean and the standard
error of the mean both vary from sample to sample. This increases the probability of
very high and very low values, which fattens the tails of the distribution compared
to the normal.
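One way to see the fatter tails is by simulation. The sketch below assumes a normal population with mean 100 and standard deviation 15 (any normal population would do) and counts how often |t| exceeds 1.96, the cutoff that leaves 5% in the tails of a normal distribution. The function names, sample sizes, and seed are our own choices.

```python
import random
from math import sqrt

def t_statistic(sample, mu_hyp):
    """t statistic for one sample, using the n - 1 standard deviation."""
    n = len(sample)
    x_bar = sum(sample) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    return (x_bar - mu_hyp) / (s / sqrt(n))

def tail_fraction(n, trials=20000, cutoff=1.96, seed=0):
    """Fraction of simulated samples of size n with |t| beyond the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(100, 15) for _ in range(n)]
        if abs(t_statistic(sample, 100)) > cutoff:
            hits += 1
    return hits / trials

frac_small = tail_fraction(4)    # df = 3
frac_large = tail_fraction(30)   # df = 29

print(f"n = 4:  fraction with |t| > 1.96 = {frac_small:.3f}")
print(f"n = 30: fraction with |t| > 1.96 = {frac_large:.3f}")
```

With small samples, noticeably more than 5% of the t values land beyond ±1.96, which is why the critical values for t are larger than those for z.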
[Figure: t-distributions for n = 2, n = 4, and n = 12 compared with the normal distribution (z), which corresponds to n = ∞; horizontal axis: t.]
Unlike our standard normal distribution, our t-distributions are a ‘family’ of curves,
one for each sample size (n).
We label each family member not by sample size but by ‘degrees of freedom (df)’,
which is equal to n-1 for the examples we’re doing here (comparing a single mean to
an expected population mean).
[Figure: t-distributions labeled by degrees of freedom (df = 1, df = 3, df = 11) compared with the normal distribution (z) (n = ∞); horizontal axis: t.]
Example: The mean height of the 72 women in our class is 64.5 inches with a
standard deviation of 3.28 inches. Is this significantly taller than 64 inches,
which is the average height of a woman in the US?
Step 1: Define the target population. We are interested in the heights of the
women in our class.
Step 2: Specify the null hypothesis (H0).
Our null hypothesis is H0: μ_X = 64
Step 3: Specify the alternative hypothesis (HA).
We’ll use a one-tailed test, since we’re asking if our mean is taller: HA: μ_X > 64.
Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for decision.
We’ll choose α = .05 again.
Step 5: Decide on a sample size (n)
We have a sample of 72 women
Step 6: Calculate your statistics on your sample (the mean in our example).
Our sample mean is 64.5 inches and our sample standard deviation is 3.28 inches
Step 7: Convert your statistic into standard units with respect to your null
hypothesis.
Since we don’t know the population standard deviation, we’ll use our sample
standard deviation and the t-distribution with 72-1 =71 degrees of freedom:
$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}} = \frac{3.28}{\sqrt{72}} = .3866$$
$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}} = \frac{64.5 - 64}{.3866} = 1.29$$
Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. We will use
table D, which contains rejection regions for the t-distribution. We need the critical value
that leaves an area of α = .05 in one tail for df = 71. The nearest df in the table is 70.
Our region of rejection is for values of t greater than 1.667.
[Figure: t-distribution showing the upper-tail rejection region (area = α = .05) and the observed t = 1.29.]
Our observed mean is not within the region of rejection. We therefore fail to reject
the null hypothesis. We conclude that the average height of women in our class is
not significantly greater than that of the US population at a criterion of α = .05.
Calculating p-values using table D is pretty crude since we only have a limited set of
alpha values to choose from. It turns out that our value of t = 1.29 with df = 71 is
very close to the critical value of t for α = .10 (one-tailed). For this example, our
p-value is close to 0.1. But we can always use our t-statistic calculator in the Excel
spreadsheet to get a more exact value.
This means that if we drew a random sample of 72 heights from women in the US
population, there is about a 10% chance that we’d observe a mean as high as or
higher than the mean of our class. Our mean is therefore above average, but not
exceptionally so.
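For readers without the spreadsheet, the one-tailed p-value can also be approximated by numerically integrating the t-distribution's density. This is a rough sketch using only Python's standard library: the helper names `t_pdf` and `t_upper_tail` are our own, and a statistics package (for example SciPy's `scipy.stats.t.sf`) would be the usual choice in practice.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3        # area between 0 and |t|
    upper = 0.5 - area          # area above |t|
    return upper if t >= 0 else 1 - upper

# Women's heights example: n = 72, mean 64.5, sd 3.28, H0 mean = 64
n, x_bar, s, mu_hyp = 72, 64.5, 3.28, 64
sem = s / sqrt(n)
t = (x_bar - mu_hyp) / sem
p_value = t_upper_tail(t, n - 1)

print(f"standard error = {sem:.4f}, t = {t:.2f}, df = {n - 1}")
print(f"one-tailed p = {p_value:.3f}")
```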
[Figure: t-distribution (df = 71) showing an area of 0.1006 above the observed t = 1.29.]
Example: The 21 men in our class have a mean height of 70.3 inches with a standard
deviation of 2.61 inches. Is this significantly different from 69.5 inches, the average
height of a man in the US?
Step 1: Define the target population. We are interested in the heights of the men
in our class.
Step 2: Specify the null hypothesis (H0).
Our null hypothesis is H0: μ_X = 69.5
Step 3: Specify the alternative hypothesis (HA).
We’ll use a two-tailed test, since we’re asking if our mean is different: HA: μ_X ≠ 69.5.
Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for decision.
We’ll choose α = .05 again.
Step 5: Decide on a sample size (n)
We have a sample of 21 men
Step 6: Calculate your statistics on your sample (the mean in our example).
Our sample mean is 70.3 inches and our sample standard deviation is 2.61 inches
Step 7: Convert your statistic into standard units with respect to your null
hypothesis.
Since we don’t know the population standard deviation, we’ll use our sample
standard deviation and the t-distribution with 21-1=20 degrees of freedom:
$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}} = \frac{2.61}{\sqrt{21}} = .5695$$
$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}} = \frac{70.3 - 69.5}{.5695} = 1.40$$
Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. We will
use table D, which contains rejection regions for the t-distribution. We need the critical
value for an area of α = .05 covering two tails with df = 20.
The critical t-value for two tails with α = .05 is the same as the critical t-value for one
tail with α = .05/2 = .025. This is because for two tails, our total area of .05 needs
to be split into two halves.
Our region of rejection is for values of t greater than 2.09 or less than -2.09.
[Figure: t-distribution (df = 20) with rejection regions of area 0.025 in each tail, below t = -2.09 and above t = 2.09, and the observed t = 1.4.]
Our observed mean is not within the region of rejection. We therefore fail to reject
the null hypothesis. We conclude that the average height of men in our class is not
significantly different from that of the US population at a criterion of α = .05.
Looking at table D, our observed value of t = 1.40 for df = 20 falls inside the rejection
region for an alpha value of 0.5 (two-tailed).
This means that our p-value is less than 0.5.
The true p-value from our t-test calculator is .0884+.0884 = 0.1768
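The meaning of this two-tailed p-value can be checked by brute force: draw many samples of 21 heights from a population in which H0 is true and count how often |t| is at least as large as the value we observed. The sketch below assumes heights are normally distributed; the population standard deviation, seed, and number of trials are arbitrary choices of ours, and the result is only approximate.

```python
import random
from math import sqrt

def t_statistic(sample, mu_hyp):
    """t statistic for one sample, using the n - 1 standard deviation."""
    n = len(sample)
    x_bar = sum(sample) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    return (x_bar - mu_hyp) / (s / sqrt(n))

n, mu_hyp = 21, 69.5
t_obs = (70.3 - 69.5) / (2.61 / sqrt(n))   # observed t from the men's heights

rng = random.Random(1)
trials = 50000
extreme = 0
for _ in range(trials):
    # Simulate a class of 21 men drawn from a population where H0 holds
    sample = [rng.gauss(mu_hyp, 2.61) for _ in range(n)]
    if abs(t_statistic(sample, mu_hyp)) >= abs(t_obs):
        extreme += 1

p_sim = extreme / trials
print(f"observed t = {t_obs:.2f}")
print(f"simulated two-tailed p = {p_sim:.3f}")
```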
[Figure: t-distribution (df = 20) showing areas of 0.0884 below t = -1.4 and above t = 1.4.]
Example: in the news.
"Freshman 15" weight gain is a myth, new study finds
Reuters - The idea that college freshmen gain an average of 15 pounds in their first
year of school is a myth -- the average is really between 2.4 pounds for women and
3.4 pounds for men, the co-author of a new study said Tuesday.
"Not only is there not a 'Freshman 15,' there doesn't appear to be even a 'college 15'
for most students," said Jay Zagorsky, research scientist at Ohio State University's
Center for Human Resource Research and co-author of a study on college weight
gain.
Here’s a table of weight gain (in pounds) from the actual publication: Zagorsky &
Smith, Social Science Quarterly, 2011
                 Male Freshman    Female Freshman
Mean (pounds)    3.1              3.5
sd (pounds)      10.1             10.3
n                2536             2151
Let’s look at the women. We don’t know the population standard deviation, so we’ll
use a t-test.
H0: μ_X = 15
HA: μ_X ≠ 15
$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}} = \frac{10.3}{\sqrt{2151}} = .2221$$
$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}} = \frac{3.5 - 15}{.2221} = -51.8$$
If we use α = .01, then with n-1 = 2150 degrees of freedom, our critical value of t for
a nondirectional (two-tailed) test is ±2.58.
Our observed value of t falls way into the rejection region, so we conclude that
female college freshmen do not gain an average of 15 lbs.
“Our results indicate that the “Freshman 15” is
a media myth. While freshmen do gain
weight, the observed average increase of 2.5
to 3.5 pounds falls far short of the ominous 15
pounds.”
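The test on the women's weight gain can be reproduced from the three numbers in the table. This is a sketch under one stated assumption: with 2150 degrees of freedom the t-distribution is nearly identical to the normal, so the critical value here is approximated with `statistics.NormalDist` rather than looked up in a t table.

```python
from math import sqrt
from statistics import NormalDist

# Female freshmen, from the table: mean gain, sd, and sample size
n, x_bar, s = 2151, 3.5, 10.3
mu_hyp = 15      # the 'Freshman 15' null hypothesis
alpha = 0.01

sem = s / sqrt(n)
t = (x_bar - mu_hyp) / sem

# Two-tailed critical value; normal approximation to t with df = 2150
crit = NormalDist().inv_cdf(1 - alpha / 2)

print(f"standard error = {sem:.4f}")
print(f"t = {t:.1f}, critical values = +/-{crit:.2f}")
print("reject H0" if abs(t) > crit else "fail to reject H0")
```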