Assumptions

Psych 5500/6500
Assumptions Underlying the
t Test for Independent Means
Fall, 2008
Assumptions
We will look at four assumptions; the first three are
mathematical in nature:
1. Independence of scores.
2. Both populations are normally distributed.
3. The two populations have the same variance.
And we will add one more:
4. The mean is meaningful.
Assumption of the
Independence of Scores
This is a critical assumption: violating
it has severe consequences for the validity
of the analysis.
Independence (cont.)
To discuss it, let’s set up a simple experiment.
Group 1: Fred, Ethel
Group 2: Sally, George
The assumption of independence for this t test is
that Fred’s score is independent of Ethel’s
(within-group independence) and is also
independent of Sally’s and George’s (between-group
independence).
Independence (cont.)
Group 1: Fred, Ethel
Group 2: Sally, George
Technically, ‘independence’ means that how
far Fred’s score falls above or below the
mean of his group cannot be predicted by
knowing the same about Ethel’s, Sally’s,
or George’s scores.
Assumption of Normality
This t test is based upon the assumption that both populations
are normally distributed. The effects of failure to meet this
assumption are somewhat different here than they were for
the t test of a single group mean, as the sampling
distribution of interest now, the one whose shape will be
influenced by whether the assumptions are true or not, is
that of Y1  Y2
t test for a single group
t test for two independent groups
Determining Normality
In a previous lecture we examined ways of
determining whether or not a population is
normally distributed based upon a sample.
These techniques apply here. For each
group we consider whether or not the
population from which it was sampled is
normally distributed.
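As a rough illustrative sketch (not one of the formal techniques from that lecture), you can screen each group for skew by averaging the cubed z-scores of the sample; the function name and the sample data below are mine, chosen only for illustration:

```python
import statistics

def sample_skewness(scores):
    """Rough skewness index: the mean of the cubed z-scores.
    Values near 0 are consistent with a symmetrical (possibly
    normal) population; large absolute values suggest skew."""
    m = statistics.mean(scores)
    s = statistics.pstdev(scores)  # population SD of the sample
    return sum(((y - m) / s) ** 3 for y in scores) / len(scores)

# A small hypothetical sample: mostly centered, one high score.
print(round(sample_skewness([5, 6, 4, 1, 7, 9, 5, 13]), 3))
```

A positive value here reflects the single high score pulling the upper tail out; a perfectly symmetrical sample gives exactly zero.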
Effects of Non-normality
Boneau, C. A. (1960). The effects of violations of
assumptions underlying the t test. Psychological
Bulletin, 57, 49-64.
Deviations from normality in the populations are
not serious if the two populations are of the
same shape, or nearly so. There seems to be
little problem if both populations are
symmetrical. If they are skewed then there is a
problem if they are of different skewness or have
different variances, as this will cause the
sampling distribution to be skewed. This
problem lessens under all but the most extreme
cases as the size of the groups approaches 30.
There are limitations to Boneau’s approach. He only
examined two-tailed tests, and as we have seen in a
previous lecture the deviations of alpha from .05 due
to skewness are much more severe with one-tailed
tests. Also, Boneau didn’t examine the effect of
violations of assumptions on the power of the
experiments. In looking over the recent literature it
appears that most authors advocate that when N is
relatively small and the data are clearly non-normal
in a way that involves skewness, you consider
either 1) normalizing the data through
transformations (covered soon) or 2) using a test
less affected by violations of normality (e.g. a
non-parametric test, also covered soon).
Assumption of Equal Variances
The t test for independent groups assumes
that both populations have the same
variance, σ²1 = σ²2 (homogeneity of
variance), rather than different variances,
σ²1 ≠ σ²2 (heterogeneity of variance).
Detecting Heterogeneity
The easiest way to detect heterogeneity is to
simply compare the variances (or standard
deviations) of the two groups. Of course, any
difference in the variances could just be due to
chance (i.e. the population variances are the
same but the sample variances are not). There
are statistical tests that can be used to
determine from the data in the two groups
whether or not the population variances differ.
H0: σ1² = σ2²
Ha: σ1² ≠ σ2²
Tests for Heterogeneity
Two of the popular tests are the F ratio and
Levene’s F. We will be covering these tests later
in the semester when we have enough
information to understand how they work (SPSS
automatically provides Levene’s when you do a t
test). The tests have the familiar problem that
we can ‘reject H0’ (prove the variances differ)
but we can’t ‘accept H0’ (prove the variances are
the same). There is also the related issue of low
power when the N’s are small and
hypersensitivity to differences in variance when
the N’s are very large.
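Although we cover Levene’s F properly later in the semester, its core computation is simple enough to sketch now: replace each score with its absolute deviation from its group’s mean, then compare the two groups on those deviations. This is a minimal sketch for two groups, assuming mean-centering (SPSS’s default); the function name is mine, and the resulting W is compared against an F distribution with (1, N1 + N2 − 2) degrees of freedom:

```python
import statistics

def levene_W(group1, group2):
    """Levene's test statistic for two groups, computed from the
    absolute deviations of each score from its group's mean."""
    groups = [group1, group2]
    # Transform each score into |score - group mean|.
    z = []
    for g in groups:
        m = statistics.mean(g)
        z.append([abs(y - m) for y in g])
    n = [len(g) for g in groups]
    N, k = sum(n), len(groups)
    zbar_i = [statistics.mean(zi) for zi in z]         # per-group means of z
    zbar = sum(sum(zi) for zi in z) / N                # grand mean of z
    between = sum(ni * (zb - zbar) ** 2 for ni, zb in zip(n, zbar_i))
    within = sum((zij - zb) ** 2
                 for zi, zb in zip(z, zbar_i) for zij in zi)
    return ((N - k) / (k - 1)) * between / within

control = [5, 6, 4, 1, 7, 9, 5, 13]
treatment = [7, 9, 6, 11, 10, 13, 12, 6, 6]
print(round(levene_W(control, treatment), 3))
```

With the small N’s used later in this lecture, W comes out tiny, which illustrates the low-power problem described above: the sample variances clearly differ, yet the test finds no significant evidence of it.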
Graphing Options
There are at least a couple of good graphing
options to see how much the variances of
the two groups differ:
1. Scatter plots
2. Box plots
We covered both of these in the lecture on
normality, we will turn to them again but
this time we will have them show both
groups on the same graph, which makes
it easier to see if the two groups differ in
terms of variability of scores.
Graphing Options (cont.)
The next slide presents the data we will be
graphing. To understand these graphic
techniques we will first look at the control
group by itself, then look at how both
groups can be displayed on the same
graph.
Data
Control Group
5
6
4
1
7
9
5
13
Treatment Group
7
9
6
11
10
13
12
6
6
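Before graphing, it is worth simply computing each group’s mean and variance estimate from these data (a quick sketch using Python’s standard library; `statistics.variance` uses the N − 1 denominator, matching est.σ²):

```python
import statistics

control = [5, 6, 4, 1, 7, 9, 5, 13]
treatment = [7, 9, 6, 11, 10, 13, 12, 6, 6]

for name, scores in [("control", control), ("treatment", treatment)]:
    mean = statistics.mean(scores)        # group mean
    var = statistics.variance(scores)     # est. variance (N - 1 denominator)
    print(f"{name}: N={len(scores)}, mean={mean:.2f}, est.var={var:.2f}")
```

The numbers foreshadow what the graphs will show: the treatment mean is higher, and the treatment variance estimate is smaller than the control’s.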
Scatter Plot
First we will look at a scatter plot of the data in the
control group. I particularly like this type of
graph when the N of the samples is rather small.
The disadvantage of the way SPSS does this
type of graph is that when a score occurs more
than once, the circles representing that score
overlap, and you can’t tell from the graph how
many times the score occurred. For example, in
the following graph it is easy to see the spread
of scores, but the fact that a score of ‘5’ occurred
twice is concealed.
Scatter Plot of Control Group
[Scatter plot: Y scores (axis roughly 2.5 to 12.5) for the control group.]
We can see in the scatter plot on the previous
slide that the scores were more or less
grouped around the center value, with one
rather low score and one rather high score.
Next we will have SPSS plot both groups on the
scatter plot (next slide). Now we can see the
spread of both groups; notice that the mean
of the treatment group looks to be higher than
the mean of the control group, and that the
treatment group might have less variance
than the control group.
Scatter Plot Both Groups
[Scatter plot: Y scores for both the control and treatment groups.]
Box Plot
Now we will look at box plots. Again we will
start by looking at just the control group
and then look at displaying both groups on
the same graph. I have decided to repeat
a couple of slides from when we covered
box plots before to remind you how they
work.
Elements of the Box Plot
The next slide shows the elements of a box plot. The plot divides
the scores into quartiles (i.e. the lowest 25% of the scores, the
second 25%, the third 25% and the highest 25%). Usually the
lower bound marks the lowest score and the upper bound
marks the highest score, but SPSS does something slightly
different. The height of the box (the distance from the 25th
percentile to the 75th percentile) is called the 'interquartile
range'; to simplify the following description we will call this one
'step'. SPSS draws the upper boundary at the highest score
that is not more than 1.5 steps above the box. Any point above
that is marked as either an 'outlier' (if it is between 1.5 and 3
'steps' above the box) or as an 'extreme score' (if it is more
than 3 'steps' above the box). The same thing is done for the
lower boundary.
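The SPSS rules just described can be sketched in code. This is a rough sketch: Python’s `statistics.quantiles` will not necessarily place the quartiles exactly where SPSS does, so treat the boundaries as approximate, and the function name is mine:

```python
import statistics

def box_plot_summary(scores):
    """Quartiles, whiskers, and flagged scores in the SPSS style:
    whiskers reach the most extreme scores within 1.5 'steps'
    (interquartile ranges) of the box; scores between 1.5 and 3
    steps out are 'outliers'; scores beyond 3 steps are 'extreme'."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    step = q3 - q1                      # one 'step' = the interquartile range
    in_range = [y for y in scores
                if q1 - 1.5 * step <= y <= q3 + 1.5 * step]
    outliers = [y for y in scores
                if 1.5 * step < max(q1 - y, y - q3) <= 3 * step]
    extreme = [y for y in scores if max(q1 - y, y - q3) > 3 * step]
    return {"q1": q1, "median": median, "q3": q3,
            "lower_whisker": min(in_range), "upper_whisker": max(in_range),
            "outliers": outliers, "extreme": extreme}

control = [5, 6, 4, 1, 7, 9, 5, 13]
print(box_plot_summary(control))
```

For the control group no score falls more than 1.5 steps outside the box, so the whiskers run to the lowest and highest scores (1 and 13), matching the description of the box plot two slides ahead.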
Box Plot
[Diagram: the elements of a box plot, labeled as described above.]
The next slide shows the box plot of the
control group in our example. There are
no outliers or extreme scores, so the
upper and lower boundaries represent
the highest and lowest score.
Note that we can see that the sample is
slightly positively skewed (the spread of
scores above the median is greater than
the spread of scores below the median).
Box Plot Control Group
[Box plot: Y scores for the control group.]
Two Groups
We can ask SPSS to display the box plots for
both groups on the same graph (see the next
slide). We can see that the medians of the
two groups are quite different (and the means
probably will be too), that the groups differ in
how spread out their scores are, and we can
see to what degree each group is skewed.
Note that the bottom whisker is missing from the
treatment group; this is because the lowest
score is also the 25th percentile (look back at
the data, there are not many scores and the
lowest three scores are all 6’s, making ‘6’
both the lowest score as well as the 25th
percentile).
Box Plot Both Groups
[Box plots: Y scores for the control and treatment groups.]
What to do About Heterogeneity
Fortunately, Monte Carlo studies have
shown that the assumption of
homogeneity can be violated with little
effect on the validity of the t test so long as
the two groups have the same or very
similar sizes (i.e. when N1 N2). If the N’s
differ greatly, however, then heterogeneity
can seriously affect the validity of the t
test. We will take a look at what to do in
that case.
Student’s t test & Welch’s t test
The standard t test, the one we have already
covered, is sometimes known as
Student’s t test. It is based upon the
assumption of homogeneity of variances.
A t test that does not depend upon that
assumption is Welch’s t test, sometimes
known as the t test for unequal
variances.
Standard Error: Student’s t
A difference in the two t tests is evident in
their respective formulas for computing the
standard error. We will begin by repeating
the formula from Student’s t:
est.σ²pooled = [(N1 − 1)est.σ²1 + (N2 − 1)est.σ²2] / (N1 + N2 − 2)

then...

est.σ(Ȳ1−Ȳ2) = √(est.σ²p/N1 + est.σ²p/N2) = √[est.σ²p(1/N1 + 1/N2)]
Standard Error: Welch’s t
Welch’s t doesn’t assume that the two
estimates are of the same variance, and
thus doesn’t pool them:
Standard error, Welch’s t′:

est.σ(Ȳ1−Ȳ2) = √(est.σ²1/N1 + est.σ²2/N2)

Standard error, Student’s t:

est.σ(Ȳ1−Ȳ2) = √(est.σ²p/N1 + est.σ²p/N2)
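The two standard-error computations can be written out directly (a sketch; the function names are mine):

```python
import statistics
from math import sqrt

def student_se(g1, g2):
    """Pooled standard error of (Ybar1 - Ybar2), as in Student's t."""
    n1, n2 = len(g1), len(g2)
    pooled = ((n1 - 1) * statistics.variance(g1) +
              (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
    return sqrt(pooled / n1 + pooled / n2)

def welch_se(g1, g2):
    """Unpooled standard error of (Ybar1 - Ybar2), as in Welch's t'."""
    return sqrt(statistics.variance(g1) / len(g1) +
                statistics.variance(g2) / len(g2))

# With the example data from the graphing slides:
control = [5, 6, 4, 1, 7, 9, 5, 13]
treatment = [7, 9, 6, 11, 10, 13, 12, 6, 6]
print(round(student_se(control, treatment), 3),
      round(welch_se(control, treatment), 3))
```

For these data the two standard errors are close but not identical, because the N’s and the variance estimates both differ across groups.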
Welch’s t formula (t’)
t′obt = [(Ȳ1 − Ȳ2) − μ(Ȳ1−Ȳ2)] / est.σ(Ȳ1−Ȳ2)
      = [(Ȳ1 − Ȳ2) − μ(Ȳ1−Ȳ2)] / √(est.σ²1/N1 + est.σ²2/N2)

When N1 = N2 the standard error comes out to be the same in Student’s t and in
Welch’s t′, and thus the tobt is the same as well.
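That equal-N equivalence is easy to verify numerically (a sketch; to get equal N’s, the second group here is a hypothetical one made from the first eight treatment scores):

```python
import statistics
from math import sqrt

def t_obt(g1, g2, pooled=True):
    """t (pooled SE, Student) or t' (unpooled SE, Welch),
    testing H0: mu1 - mu2 = 0."""
    n1, n2 = len(g1), len(g2)
    v1, v2 = statistics.variance(g1), statistics.variance(g2)
    if pooled:
        vp = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = sqrt(vp / n1 + vp / n2)
    else:
        se = sqrt(v1 / n1 + v2 / n2)
    return (statistics.mean(g1) - statistics.mean(g2)) / se

# With equal N's the two standard errors coincide, so t equals t'.
a = [5, 6, 4, 1, 7, 9, 5, 13]
b = [7, 9, 6, 11, 10, 13, 12, 6]
print(round(t_obt(a, b, pooled=True), 4),
      round(t_obt(a, b, pooled=False), 4))
```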
Welch’s t Degrees of Freedom
The new standard error, however, leads to a different
degrees of freedom formula. When the N’s are
equal and the est.σ²’s are equal then this formula
gives the same df’s as Student’s t, but when the N’s
differ or the est.σ²’s differ then this leads to a
smaller df than in Student’s t (and the Welch df is
often not a whole number).
df = (u/N1 + 1/N2)² / [u²/(N1²(N1 − 1)) + 1/(N2²(N2 − 1))]

where u = est.σ²1/est.σ²2
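The df formula can be sketched as code (`welch_df` is my name for it). Note that when the N’s and the variance estimates are both equal, it reduces to N1 + N2 − 2, the Student’s t df:

```python
import statistics

def welch_df(g1, g2):
    """Welch's degrees of freedom, with u = est.var1 / est.var2
    as in the formula above; the result is often not a whole number."""
    n1, n2 = len(g1), len(g2)
    u = statistics.variance(g1) / statistics.variance(g2)
    num = (u / n1 + 1 / n2) ** 2
    den = u ** 2 / (n1 ** 2 * (n1 - 1)) + 1 / (n2 ** 2 * (n2 - 1))
    return num / den

control = [5, 6, 4, 1, 7, 9, 5, 13]
treatment = [7, 9, 6, 11, 10, 13, 12, 6, 6]
print(round(welch_df(control, treatment), 2))
```

For these data the Welch df comes out fractional and smaller than the Student’s df of N1 + N2 − 2 = 15, as the text leads you to expect.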
Type 1 Error Rate (Monte Carlo)
From Coombs et al. (1996) as adapted and cited by Ruxton (2006)
N1   N2   σ1   σ2   Student’s t   Welch’s t′
11   11   1    1    .052          .051
11   11   4    1    .064          .054
11   21   1    1    .052          .051
11   21   4    1    .155          .051
11   21   1    4    .012          .046
25   25   1    1    .049          .049
25   25   4    1    .052          .048
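Monte Carlo results like these are easy to approximate yourself. The sketch below is my own code, not Coombs et al.’s procedure: both groups are sampled from normal populations with equal means, so every rejection is a Type 1 error, and the critical value 2.042 assumes a two-tailed α of .05 with df = 30 (i.e. N1 + N2 − 2 for the 11/21 rows):

```python
import random
import statistics
from math import sqrt

def student_t(g1, g2):
    """tobt with the pooled standard error (Student's t)."""
    n1, n2 = len(g1), len(g2)
    vp = ((n1 - 1) * statistics.variance(g1) +
          (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
    return (statistics.mean(g1) - statistics.mean(g2)) / sqrt(vp / n1 + vp / n2)

def type1_rate(n1, n2, sd1, sd2, reps=10_000, t_crit=2.042):
    """Proportion of true-H0 samples where |t| exceeds the
    two-tailed .05 critical value."""
    random.seed(1)                      # reproducible sketch
    hits = 0
    for _ in range(reps):
        g1 = [random.gauss(0, sd1) for _ in range(n1)]
        g2 = [random.gauss(0, sd2) for _ in range(n2)]
        if abs(student_t(g1, g2)) > t_crit:
            hits += 1
    return hits / reps

# Unequal N's with the larger variance in the smaller group should
# inflate Student's Type 1 error rate well above .05 (cf. .155 above).
print(type1_rate(11, 21, 4, 1))
```

Swapping in the unpooled (Welch) standard error and df would bring the rate back near .05, which is exactly the pattern in the table.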
Which to Use?
1. If σ²1 = σ²2 then you may use either Student’s or
Welch’s t test (but they may differ in power).
2. If σ²1 ≠ σ²2 and N1 = N2 then you may use
either Student’s or Welch’s t test (but they may
differ in power).
3. If σ²1 ≠ σ²2 and N1 ≠ N2 (differing more than a little)
then you have to use Welch’s.
The challenge is that we don’t know the values of
σ²1 and σ²2; we only have the population
variance estimates from the samples (groups),
which will almost always differ at least a little.
Strategies
In my review of the literature I have run
across two suggested strategies:
1. Use the more familiar Student’s t test
unless there is sufficient evidence that the
two population variances differ, in which
case use Welch’s t’.
2. Always use Welch’s t’, then you don’t
have to worry about violating the
assumption of homogeneity of variance.
Strategy 1
With this strategy you usually first do a Levene’s
test to see if there is sufficient evidence that
the population variances are not equal. If
Levene’s test is statistically significant (i.e. its
p ≤ .05) then you have proof the variances are
different and so you use Welch’s t′; otherwise
use Student’s t. The logical problems
associated with this (possible lack of power or
hypersensitivity, and the inability to prove H0
is true) were discussed earlier.
Strategy 1 (cont.)
There is also a problem associated with
using one statistical test (e.g. Levene’s) to
determine which other statistical test is
appropriate to use (e.g. Student’s vs.
Welch’s). When tests are linked in that
way the issue of the overall probability of
making an error in the analysis becomes
complex. See Ruxton (2006) for
references if you would like to know more.
Strategy 2
The second strategy is to simply always use
Welch’s t’, as it doesn’t depend upon the
assumption that the variances are equal.
Power Considerations
Playing around with SPSS this is what I found
(keeping the difference between the two
group means the same in all cases):
1. When N1 = N2 and est.σ²1 = est.σ²2 then both
t tests had the same p value.
2. When N1 = N2 and est.σ²1 ≠ est.σ²2 then
Student’s t had the lower p value.
3. When N1 ≠ N2 and est.σ²1 = est.σ²2 then
Student’s t had the lower p value.
4. When N1 ≠ N2 and est.σ²1 ≠ est.σ²2 then
Welch’s t′ had the lower p value.
Selecting a Strategy
It is a complicated choice; the advantage of
Strategy 2 is that you never have to worry about
whether the variances are the same or not. But,
for example, if you have greatly different N’s,
and the population variances are indeed the
same, then Student’s t would be appropriate and
it would have more power.
Whichever strategy you choose, the choice must
be made a priori; you can’t look at the analysis and
then select the approach that had the lower p
value.
Final Assumption: The Mean is
Meaningful
Quote from one of my mentors (a behaviorist
involved in single-subject research): “I study the
behavior of individuals, you study the behavior
of means.”
I’d like to end on a bit of a philosophical note.
There is some reason to question whether or not
the study of means is meaningful. There are
decades’ (a century’s?) worth of precedent for
saying that the mean is a meaningful thing to
study, so I’ll take the time only to present the
case against it.
Santa Claus
Point one: the mean is exactly as real as Santa
Claus, they are both important cultural ideas that
influence our behavior but neither really exists.
There are no means out there in the territory,
they exist only in our models.
Counterpoint: if we take anthropologist Gregory
Bateson’s definition of ideas, then we would say
that ideas are far more important in models of
nature than are such ‘real’ things as physical
forces and objects.
Individual Differences
Point two: the focus on means takes us away
from something much more important, individual
differences.
To explain this point I will pick on studies that examine
gender differences. We could look at, for
example, a study that shows that males have
better spatial abilities than females. The point
here is that the statement I just made is
fundamentally inaccurate in a very important
way: I should have said that the mean score of
males is higher than the mean score of females.
We often forget to include that information in our
statements but its presence is crucial.
The t test for independent means may lead you to say that the
mean of the females is lower than the mean of the males, but
look at the hypothetical graph, many many females have
spatial ability scores that are above the mean score of the
males and many many males have scores that are below the
mean of the females. The mean is a huge generalization,
and like all generalizations is useful in that it makes the world
much simpler but at the expense of hiding all the specific
cases in which it does not apply.
Counterpoint: analyses that focus on
differences between means might limit us
to just understanding the behavior of
means. Multiple regression (a topic we
cover in the section on the Model
Comparison Approach), however, provides
an approach for focusing more on
individual differences.
Recommended Reading
Boneau, C. A. (1960). The effects of violations of
assumptions underlying the t test. Psychological
Bulletin, 57, 49-64.
Legendre, P., & Borcard, D. Appendix: t-test with Welch
correction, in Statistical comparison of univariate tests of
homogeneity of variances.
http://biol10.biol.umontreal.ca/BIO2041e/Correction_Welch.pdf
Ruxton, G. D. (2006). The unequal variance t-test is an
underused alternative to Student’s t-test and the Mann–
Whitney U test. Behavioral Ecology, 17, 688-690.
Welch, B. L. (1938). The significance of the difference between
two means when the population variances are unequal.
Biometrika, 29, 350-362.