Independent (two) Sample t-test

Objectives 7.2: Inference for comparing means of two populations where the samples are independent
- Two-sample t significance test (we give three examples)
- Two-sample t confidence interval
- http://onlinestatbook.com/2/tests_of_means/difference_means.html
Standard errors

We have learnt that standard errors are crucial in constructing both confidence intervals and statistical tests. Do not confuse the standard error of an estimator with the standard deviation of the sample.

The variability of the sample mean (roughly, the typical distance between the estimate and the population mean) is measured by its standard error, which is

s.e. = (standard deviation of the sample) / (square root of the sample size) = s/√n
- You can imagine that the unknown population mean should be in some proximity to the known sample mean. The proximity is measured by the standard error.
- The sample mean tends to get closer to the population mean as you increase the sample size.
- As we continue with the course, the standard errors will become more complex, but the underlying ideas are the same.
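The formula above can be sketched in a few lines of Python (the height values below are invented purely for illustration):

```python
import math

def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample standard deviation (divide by n - 1)
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return s / math.sqrt(n)

heights = [5.2, 5.8, 5.5, 6.0, 5.4, 5.9]  # hypothetical sample, in feet
print(standard_error(heights))  # about 0.128
```

Doubling the sample size (with the same spread) shrinks the standard error by a factor of √2, which is the "closer with more data" point above.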
Comparisons everywhere!

You will see comparisons being made all over the place. Just look at some of the products you have at home:

"Dentex floss picks are clinically proven to remove more plaque than regular floss."

What does this mean – how on earth do they prove this? This is an example of where they prove the results statistically. It is done via clinical trials, by collecting data:
- Aim: to see if it is possible to prove that on average the amount of plaque removed using floss picks is more than the average amount of plaque removed using regular floss.
- They state their hypotheses as H0: µFP − µF ≤ 0 against HA: µFP − µF > 0, where µFP = mean plaque removed using floss picks and µF = mean plaque removed using regular floss.
- A simple random sample of individuals is chosen and the amount of plaque removed using both regular floss and floss picks is measured (we will discuss the grouping a little later on). It is found that, based on 100 individuals, the average amount removed using floss picks is 3mg whereas the average amount removed using regular floss is 2.8mg.
Does this automatically prove that floss picks are better?

No it doesn't. The product's makers cannot use this as their statistical proof, because this difference could be explained by random chance (variation between samples). This is when they need to know the reliability of the estimator (measured by the standard error) and use a statistical test to calculate the chance of observing a difference of 3mg − 2.8mg by random chance alone. The probability of such a difference under the null, that there is no difference, is calculated. This is the p-value.

If the p-value is large (over a pre-determined significance level, say 5%), then we cannot reject the null. This means that there isn't any evidence in the collected data that floss picks work better.
On the other hand, if the p-value is less than the pre-determined significance level, then we can say there is evidence to suggest that floss picks remove more plaque than regular floss. In other words, because the p-value is small, the data does not appear to be consistent with the null hypothesis (the chance of obtaining a sample which behaves in this way, when globally there is no difference between floss picks and regular floss, is small). Therefore we conclude that the null is incorrect and we choose the alternative hypothesis.

It would appear that Dentex found a statistically significant difference and thus can make their claims. Their p-value is small.
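To build intuition for what the p-value measures, here is a small simulation. Only the observed 3mg − 2.8mg = 0.2mg difference comes from the example; the population mean (2.9mg) and standard deviation (0.7mg) used under the null are invented for illustration:

```python
import random

random.seed(1)

# Assumption: under the null, both flossing methods remove plaque with the
# same population mean (2.9 mg) and sd (0.7 mg) -- invented numbers.
def sample_mean_diff(n=100, mu=2.9, sigma=0.7):
    picks = [random.gauss(mu, sigma) for _ in range(n)]
    floss = [random.gauss(mu, sigma) for _ in range(n)]
    return sum(picks) / n - sum(floss) / n

# How often does pure sampling variation produce a difference at least as
# large as the observed 0.2 mg?
diffs = [sample_mean_diff() for _ in range(2000)]
p_sim = sum(d >= 0.2 for d in diffs) / len(diffs)
print(p_sim)  # a small proportion, around 2% under these assumed numbers
```

A small simulated proportion plays the same role as a small p-value: the observed difference is hard to explain by random chance alone.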
The nitty gritty of what is done

The above conclusions require a standard error, which needs to be calculated from the data. However, the standard error and the exact method used to do the test depend on how the data was collected.
Designing the floss study

There are two ways the data could have been collected:

Either one simple random sample of individuals is taken (a hundred in the example discussed). Each individual (probably on separate days, to ensure independence) is asked to use a floss pick and regular floss, and the amount of plaque removed under each treatment is measured. This is an example of a matched pairs study, where the same individual is used in both treatments. In this case a matched pairs t-test is done – which we covered in the previous lectures.
- The advantage of this design is that it avoids confounding, because the same individual is used for both treatments.
- The disadvantage is that it takes time and effort, because we need to do it over several days.
Alternatively, a simple random sample is taken and randomly split into two groups. Some are asked to use regular floss and others are asked to use floss picks. The individuals in both groups are completely independent of each other and there isn't any matching.
- The advantage is that it is quick to do this experiment.
- The disadvantage: larger standard errors.
Independent samples inference

The purpose of most studies is to compare the effects of different treatments or conditions.

Using matching to design an experiment is a very useful way to make comparisons between populations, since it tends to reduce confounding factors. If we have reason to believe that there is matching between subjects, then we should use a matched pairs t-test. However, in many situations it is impossible to have any matching between the samples.

For example, suppose we want to see whether a drug works. In this case we need one SRS of patients to give the drug and another SRS to give the placebo.

In both the situations discussed above the samples are completely independent of each other – there isn't any matching. In this situation we need to use an independent samples t-test, which we describe below.
Often the subjects are observed separately under the different conditions, resulting in samples that are independent. That is, the subjects of each sample are obtained and observed separately from, and without any regard to, the subjects of the other samples.

As in the matched pairs design, subjects should be randomized – assigned to the samples at random – when the study is an experiment.

By the end of the class you should be able to identify which test to apply given the situation:
- Look to see if there is any matching in the data. If there is matching, never do an independent sample t-test (this will give the wrong standard errors and can lead to unreliable results).
- If the samples appear to be completely independent of each other, use an independent sample t-test.
Example 1: Heights

To motivate the independent sample t-test, we consider a problem that we already know the answer to: in general, do male students tend to be taller than female students?

In terms of a hypothesis test, we want to see if there is evidence to support:
H0: µM − µF ≤ 0 against HA: µM − µF > 0.

- A matched design is possible, by randomly sampling male and female student siblings. But it is usually not feasible to obtain this data. In addition, we exclude the sub-population of people with same-sex or no siblings.
- Instead, a random sample of students was drawn and an independent sample t-test is done.
- Statcrunch instructions: Stat -> T-stat -> Two Sample -> With data. Then place the relevant columns in each box and uncheck the box that says pooled variance. You have the option of doing a test (one or two sided) or constructing a confidence interval.

In this sample there were 27 males and 34 females; there is clearly no matching. The difference in sample means is 0.45 feet. To see whether H0: µM − µF ≤ 0 against HA: µM − µF > 0 we do the test in Statcrunch. In this data set, from the boxplot, female heights tend to be less than male heights.
We see that the p-value is less than 0.01% (we do the test at the 5% level), which means there is strong evidence to suggest that males are on average taller than females.

t-value = (0.466 − 0) / 0.064 = 7.27

We can use the same output to construct a 99% confidence interval for the mean difference. The only difference is that the degrees of freedom is unusual – it is 48.29. However, we do exactly the same as before: we either look up tables (provided by me in the exam paper) or use software such as the Statcrunch output:

0.466 ± 2.68 × 0.064 = [0.29, 0.64]

Thus with 99% confidence we believe the mean difference lies between 0.29 and 0.64 feet.
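As a check, the test statistic and 99% interval above can be reproduced with a few lines of Python; the difference, standard error, and critical value 2.68 are the numbers quoted from the Statcrunch output:

```python
# Numbers quoted from the Statcrunch output above.
diff = 0.466   # difference in sample means (feet)
se = 0.064     # standard error of the difference
t_stat = (diff - 0) / se
print(round(t_stat, 2))  # 7.28 (the slides show 7.27, from an unrounded SE)

# 99% confidence interval: estimate +/- t* x SE, with t* = 2.68 for ~48.29 df
t_crit = 2.68
ci = (round(diff - t_crit * se, 2), round(diff + t_crit * se, 2))
print(ci)  # (0.29, 0.64)
```

The only ingredient software must supply is the critical value t*, which depends on the (unusual) degrees of freedom.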
Example 2: Diets

We want to know whether there is any difference between two different diets. 20 randomly sampled people are randomly placed into two groups of 10. The first group goes on Diet 1 and the second group on Diet 2. The weight loss for each group is given below.

Superficially there appears to be a matching in the data. Don't be fooled by this: the people in both groups were randomly allocated (we can see this from how the data was collected) and there is no matching between, say, 2.9 and 3.5. Thus we need to use an independent sample t-procedure.

As we have no reason to believe one diet is better than another, our hypothesis of interest is:
H0: µ1 − µ2 = 0 against HA: µ1 − µ2 ≠ 0
The boxplot gives the impression there are differences between the diets, but are these statistically significant (or can they be explained by sampling differences)?

The 95% confidence interval is [−2.23, 0.598]. This tells us with 95% confidence the mean difference between the diets is somewhere in this interval.

If we test the hypothesis H0: µ1 − µ2 = 0 against HA: µ1 − µ2 ≠ 0, the t-transform is

t-value = (−0.82 − 0) / 0.6692 = −1.22

Using Statcrunch we see that the smallest area is the area to the LEFT of −1.22, which is 12%. Thus the p-value for the two-sided test is 24%.

From the data there is no evidence to suggest there is any difference between the means of the diets. In other words, there is no evidence to suggest that there is a difference in weight loss between the two diets.
Example 3: Does calcium interact with iron absorption?

It is believed that too much calcium in a diet can reduce the absorption of iron. To test this, 20 randomly sampled people were put into two groups of 10. One group was given a calcium-high diet and the iron absorption recorded. The other group was given a calcium-low diet and the iron absorption recorded. The difference from their previous level is given below (this is why you see some negative numbers).

The data and summary statistics are given below. We observe that in this sample those in the calcium-low group absorb more iron; is this statistically significant?

The hypothesis of interest is H0: µCH − µCL ≥ 0 against HA: µCH − µCL < 0.

The hypothesis given in the output above is the opposite of what we want. However, from this output we immediately see that the p-value for H0: µCH − µCL ≥ 0 against HA: µCH − µCL < 0 is the area to the LEFT of −3.19, which is 1 − 0.9974 = 0.26%. As this p-value is less than 5%, there is evidence to reject the null and conclude that high calcium decreases iron absorption (compared with low calcium).

The 95% confidence interval for the mean difference is

[−1.991 ± 2.1 × 0.623]
Example 4: Calf treatments

Comparing the weights of calves under different treatments.

We start by seeing if there is evidence to suggest there is a difference between treatments A and B. This means we are testing H0: µA − µB = 0 against HA: µA − µB ≠ 0. We use the independent sample t-test as both samples are completely independent of each other (the calves were randomly allocated to each group). We have to be a bit wary, as the sample size is small, so using the t-distribution may not give completely reliable p-values.

Note: To analyze the calf data in Statcrunch you need to split each group into their own columns. To do this go to Data -> Arrange -> Split -> Select Column data you want to analyze (for example Wt 8) and Select the group you want (for example TRT).
Treatment A vs D

You may have thought the conclusions of the previous test were quite clear, since the sample means of 138.9 and 139.54 are quite close – but the most important factor is that the standard error is quite large. The closeness of the sample means and the largeness of the standard error meant that it is quite easy to explain this difference by random chance, and there is no evidence to suggest there is a true difference in the populations.

From the summary statistics, the difference between treatments A and D appears quite large (7.7); can this difference be explained by random chance? We test the hypothesis H0: µA − µD = 0 against HA: µA − µD ≠ 0.

The mean difference may be −7.7 but the p-value is 34%. This tells us there is over a 1/3 chance of observing a difference of 7.7 in the sample means when there is in fact no difference in the population means. This is quite large – over the 5% significance level – so there is no evidence to reject the null.
We now construct a 95% confidence interval. To do this we use Statcrunch to find the critical value of a t-distribution with 19.09 df.

The 95% confidence interval for the difference in mean weights for the treatments is

[−7.7 − 2.09 × 8.09, −7.7 + 2.09 × 8.09] = [−24.6, 9.2].

This is an interval where we believe the mean difference should lie – and it explains why we were not able to reject the null, despite 7.7 being subjectively large. The reason this interval is wide is that the standard error is large, due to the small sample size and the large standard deviation of calf weights.
The idea: The difference in sample means

We illustrate the idea with the female and male height example.

- For every sample, the difference in sample means X̄M − X̄F will vary.
- If the sample size is large enough, X̄M − X̄F will have a normal distribution (thanks to the central limit theorem).
- The normal distribution will be centered about the true mean µM − µF (population male mean minus population female mean), but will have a complicated standard error:

√(σM²/27 + σF²/34)

where σM = standard deviation of male heights and σF = standard deviation of female heights.

Therefore, just like in the one-sample case, in order to do the test we simply take the z-transform under the null that the mean male and female height is the same (µM − µF = 0):

z = (5.91 − 5.46) / √(σM²/27 + σF²/34)

- At this point we encounter a problem. We do not know the population standard deviations σM and σF.
- But we see from the summary statistics that we do have estimates of them. Thus we can replace the true population standard deviations by their estimates, and obtain the transform:

t = (5.91 − 5.46) / √(0.27²/27 + 0.21²/34)
The distribution of this ratio?

Having exchanged the unknown true standard deviations for their estimators (calculated from the data), it seems reasonable to suppose that extra variability has been added to this ratio, and we need to correct for it by changing from a normal distribution to another distribution. Previously, in the one-sample case, the new distribution which took account of this variability was the t-distribution.

In the two-sample case, the ratio

t = (X̄M − X̄F) / √(sM²/27 + sF²/34)

has approximately a t-distribution with a very strange number of degrees of freedom:

df = (s1²/n1 + s2²/n2)² / [ (1/(n1−1))·(s1²/n1)² + (1/(n2−1))·(s2²/n2)² ]

This is why using software is important – you don't want to calculate this stuff!
We are testing H0: µM − µF = 0 against HA: µM − µF > 0 and have the t-transform

t = (5.91 − 5.46) / √(0.27²/27 + 0.21²/34) = 7.11

which we know has 48.045 degrees of freedom. Now going to Statcrunch -> Stat -> Calculators -> T we get:

The area to the right of 7.11 for a t-distribution with 48.045 degrees of freedom is tiny. So at both the 5% and 1% significance levels we would reject the null. This means there is plenty of evidence to reject the null and conclude the mean height of males is greater than that of females.

Remember: if the sample sizes are both over 15, and the data not too skewed, using the t-distribution is reasonable.
Summary of Analysis: Significant effect

Remember: Significance means the evidence of the data is sufficient to reject the null hypothesis (at our stated level α). Only data, and the statistics we calculate from the data, can be statistically "significant". We can say that the sample means are "significantly different" or that the observed effect is "significant". But the conclusion about the population means is simply "they are different."

The observed effect of 0.46 between male and female height is significant, so we conclude that the true effect µM − µF is greater than zero. Having made this conclusion, or even if we have not, we can always estimate the difference using the confidence interval [0.33, 0.58].
Standard errors

- In the one-sample case the standard error is

s/√n = √(s²/n)   (s = sample standard deviation, n = sample size)

- In the independent two-sample case the standard error is

√(s1²/n + s2²/m)   (s1², s2² = the variances of samples one and two; n, m = the two sample sizes)

These two different standard errors are for different situations, but the ideas are the same. Remember that a smaller standard error leads to more reliable estimators. Therefore, if we are designing the experiment to decrease the standard error, we observe that:
- For the one-sample case, we can decrease the standard error by increasing the sample size (it is usually impossible to decrease the standard deviation).
- For the two-sample case, we can decrease the standard error by increasing the size of both samples (again, it is usually impossible to decrease the standard deviation of the populations).
Choosing the sample size

We now consider how to distribute the sample sizes in the case that the standard deviations for both samples are about the same (call the common value s). In this case the standard error is:

√(s²/n + s²/m) = s·√(1/n + 1/m)

Remember the standard deviation is fixed; we cannot change this value. Suppose that we only have enough funds to include 200 subjects in our experiment. How do we distribute them amongst the two groups?
- It makes no sense to have one subject in group 1 and 199 in group 2. For example, if we are comparing male and female heights, this would be using one male height to estimate the mean height of males and 199 female heights to estimate the mean height of females. Clearly this is wrong, and we can understand why from the standard error, which is

s·√(1/1 + 1/199) = 1.002s

- On the other hand, if we distribute them evenly, 100 and 100, the standard error is a lot smaller:

s·√(1/100 + 1/100) = 0.141s
Which type of test? One sample, paired samples, or two independent samples?

- Comparing vitamin content of bread immediately after baking vs. 3 days later (the same loaves are used on day one and 3 days later).
- Comparing vitamin content of bread immediately after baking vs. 3 days later (tests made on independent loaves).
- Average fuel efficiency for 2005 vehicles is 21 miles per gallon. Is average fuel efficiency higher in the new generation "green vehicles"?
- Is blood pressure altered by use of an oral contraceptive? Comparing a group of women not using an oral contraceptive with a group taking it.
- Review insurance records for dollar amount paid after fire damage in houses equipped with a fire extinguisher vs. houses without one. Was there a difference in the average dollar amount paid?
Cautions about the two sample t-test or interval

- Using the correct standard error and degrees of freedom is critical.
- As in the one sample t-test, the method assumes simple random samples.
- Likewise, it also assumes the populations have normal distributions.
- Skewness and outliers can make the methods inaccurate (that is, having confidence/significance levels other than what they are supposed to have). The larger the sample sizes, the less this is a problem. It is also less of a problem if the populations have similar skewness and the two samples are close to the same size.
- "Significant effect" merely means we have sufficient evidence to say the two true means are different. It does not explain why they are different or how meaningful/important the difference is.
- A confidence interval is needed to determine how big the effect is.
Summary: Distribution of two sample means

In order to do statistical inference, we must know a few things about the sampling distribution of our statistic.

- The sampling distribution of x̄1 − x̄2 has standard deviation

√(σ1²/n1 + σ2²/n2)

(mathematically, the variance of the difference is the sum of the variances of the two sample means).

- This is estimated by the standard error

SE = √(s1²/n1 + s2²/n2).

- If the sample sizes are both over 15, and the data not too skewed, using the t-distribution is reasonable.
- The two-sample t statistic is

t = ((x̄1 − x̄2) − (µ1 − µ2)) / √(s1²/n1 + s2²/n2)

This statistic has an approximate t-distribution on which we will base our inferences. But the degrees of freedom is complicated…
Two-sample t confidence interval

Recall that we have two independent samples and we use the difference between the sample averages (x̄1 − x̄2) to estimate (µ1 − µ2). This estimate has standard error

SE = √(s1²/n1 + s2²/n2).

- The margin of error for a confidence interval of µ1 − µ2 is

m = t* × √(s1²/n1 + s2²/n2) = t* × SE

- t* is found using the computer. The confidence interval is then computed as

(x̄1 − x̄2) ± m.

The interpretation of "confidence" is the same as before: it is the proportion of possible samples for which the method leads to a true statement about the parameters.
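This recipe can be sketched as a small Python function. The summary statistics and the critical value t* = 2.68 are taken from the 99% height interval discussed earlier (t* itself must still come from software or tables):

```python
import math

def two_sample_ci(xbar1, s1, n1, xbar2, s2, n2, t_star):
    """CI for mu1 - mu2: (xbar1 - xbar2) +/- t* x sqrt(s1^2/n1 + s2^2/n2)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    m = t_star * se          # margin of error
    est = xbar1 - xbar2      # single value estimate
    return est - m, est + m

# Height example (99% interval), t* = 2.68 for the Welch df
lo, hi = two_sample_ci(5.91, 0.27, 27, 5.46, 0.21, 34, t_star=2.68)
print(round(lo, 2), round(hi, 2))
```

The endpoints here differ slightly from the slides' [0.29, 0.64] because the means and standard deviations above are rounded summary values.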
Two-sample t significance test

The null hypothesis is that both population means µ1 and µ2 are equal; thus their difference is equal to zero.

H0: µ1 = µ2 ⇔ H0: µ1 − µ2 = 0.

Either a one-sided or a two-sided alternative hypothesis can be tested. Using the value (µ1 − µ2) = 0 given in H0, the test statistic becomes

t = ((x̄1 − x̄2) − 0) / √(s1²/n1 + s2²/n2).

To find the P-value, we look up the appropriate probability of the t-distribution using the df given by Statcrunch or me.
Summary for testing µ1 = µ2 with independent samples

- The hypotheses are identified before collecting/observing data.
- To test the null hypothesis H0: µ1 = µ2, use

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2).

- The P-value is obtained from the t-distribution (or t-table) with the unpooled degrees of freedom (computed).
- For a one-sided test with Ha: µ1 < µ2, the P-value is the area to the left of t.
- For a one-sided test with Ha: µ1 > µ2, the P-value is the area to the right of t.
- For a two-sided test with Ha: µ1 ≠ µ2, the P-value is twice the value for a one-sided test.
- If P-value < α then H0 is rejected and Ha is accepted. Otherwise, H0 is not rejected, even if the evidence seems to prefer Ha.
- Report the P-value as well as your conclusion.
- You must decide what α you will use before the study, or else it is meaningless.
Summary for making a confidence interval for µ1 − µ2 with independent samples

- The single value estimate is x̄1 − x̄2.
- This has standard error

√(s1²/n1 + s2²/n2).

- The margin of error for an interval with confidence level C is

m = t* × √(s1²/n1 + s2²/n2),

where t* is the critical value for the level C.
- The confidence interval is then (x̄1 − x̄2) ± m.
- You must decide what C you will use before the study, or else it is meaningless.
- For both hypothesis tests and confidence intervals, the key is to use the correct standard error (which depends on what is being estimated and how the data are obtained).
Statistics in the media

Look at this article and the data they describe:

http://www.economist.com/news/science-and-technology/21676754-curious-result-hints-possibility-dementia-caused-fungal

- What is the data that Dr. Carrasco has?
- If we did an independent sample t-test to see whether those with Alzheimer's had more fungal cells than those who did not have Alzheimer's, what would be the p-value (give a rough estimate)?
Accompanying problems associated with this Chapter
- Quiz 14
- Homework 7 (Questions 5, 6 and 7)