Topic 9 – Multiple Comparisons of Treatment Means
Reading: 17.7–17.8
Overview
- Brief Review of One-Way ANOVA
- Pairwise Comparisons of Treatment Means
- Multiplicity of Testing
- Linear Combinations & Contrasts of Treatment Means
Review: One-Way ANOVA
- Analysis of Variance (ANOVA) models provide an efficient way to compare multiple groups.
- In a single-factor ANOVA, the model F-test tests the equality of all group means at the same time.
- If this test is significant, then our next goal is to identify the specific differences. This is the big topic for this lesson.
Review: Cell Means Model
- The basic ANOVA (cell means) model is
  $$Y_{ij} = \mu_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2)$$
- Notation:
  - The "i" subscript indicates the level of the factor: $i = 1, 2, \ldots, a$.
  - The "j" subscript indicates the observation number within the group: $j = 1, 2, \ldots, n_i$.
Review: Factor Effects Model
- The model is
  $$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, n_i, \quad \epsilon_{ij} \sim N(0, \sigma^2)$$
- Constraint: $\sum_i \tau_i = 0$.
- Relationship to cell means: $\mu_i = \mu + \tau_i$.
Review: Notation
- A DOT subscript indicates "sum over that index."
- A BAR indicates "average," i.e., divide by the cell/sample size.
- $\bar{Y}_{\cdot\cdot}$ is the mean of all observations.
- $\bar{Y}_{i\cdot}$ is the mean of the observations in level $i$ of Factor A.
- Sometimes we omit the dots for brevity, but the meaning is the same.
Review: Components of Variation
- Variation between groups gets "explained" by allowing the groups to have different means. This variation contributes to MSA.
- Variation within groups is unexplained, and it contributes to MSE.
- The ratio F = MSA / MSE forms the basis for testing the hypothesis that all group means are the same.
Review: Components of Variation (2)
- Of course the individual deviations would sum to zero, so we must square them. It turns out that all cross-product terms cancel, and we have:
  $$\underbrace{\sum_{i,j} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2}_{\text{SST}} \;=\; \underbrace{\sum_{i,j} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}_{\text{SSA (between groups)}} \;+\; \underbrace{\sum_{i,j} (Y_{ij} - \bar{Y}_{i\cdot})^2}_{\text{SSE (within groups)}}$$
Review: ANOVA Table

Source     SS    DF     MS    F
Factor A   SSA   a - 1  MSA   MSA/MSE
Error      SSE   N - a  MSE
Total      SST   N - 1
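As a concrete check (using the blood type example that appears later in this topic, with a = 4 groups of n = 3 observations, so N = 12): df for Factor A = a - 1 = 3, df for Error = N - a = 8, and df Total = N - 1 = 11. This matches the "Error Degrees of Freedom 8" line in the SAS output shown later.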
Review: Model F Test
- Null hypothesis (cell means): $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$
- Alternative hypothesis: $H_a$: there exists some pair of population means that are not equal.
- If we conclude the alternative, then it makes sense to try to determine the specific differences.
- For the factor effects model: $H_0: \tau_i = 0$ for all $i$.
Further Comparisons
The F-test is significant... what next?
Pairwise Comparisons
- Generally our next step is to find out more about the actual differences between the treatment groups: which groups are actually different?
- We can compare two groups by looking at the difference between their means:
  $$H_0: \mu_i = \mu_j \qquad \text{vs.} \qquad H_a: \mu_i \neq \mu_j$$
Pairwise Comparisons (2)
- We can rewrite the null hypothesis as $\mu_i - \mu_j = 0$ and so proceed to look at the difference between means.
- Estimate the difference by $\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}$. (Note that up to this point, it is the same as a two-sample t-test.)
- A critical value and a standard error are all we need for a confidence interval.
Variance for the Difference
- Recall that the variance associated with the mean of any given sample is $\sigma^2 / n$.
- So if we take the difference in means for two of our samples, the variance will be
  $$\mathrm{Var}(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) = \frac{\sigma^2}{n_i} + \frac{\sigma^2}{n_j}$$
- Remember, we have assumed equal population variances, but we don't know $\sigma^2$.
SE for the Difference in Means
- Estimate $\sigma^2$ by the MSE and then take the square root to get the SE:
  $$SE(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) = \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
- If the cell sizes happen to be equal:
  $$SE(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) = \sqrt{\frac{2\,MSE}{n}}$$
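As a numeric illustration, plugging in the values from the blood type example shown later (MSE = 2.0833, equal cell sizes n = 3):
  $$SE = \sqrt{\frac{2(2.0833)}{3}} \approx 1.178$$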
Confidence Interval
- So the confidence interval will be
  $$(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) \pm t_{CRIT} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
- Is the use of a t critical value appropriate?
- What critical value should be used?
Multiple Comparisons
- We need to compare all of the treatment means. How many comparisons is this?
- Suppose we decide to just look at the "largest" difference. Does this mean we don't need to adjust for multiple comparisons?
Multiple Comparisons (2)
- The fact that we are effectively doing a large number of pairwise comparisons means:
  - Each test takes a 5% chance of making a Type I error (showing a difference where in reality none exists).
  - The overall Type I error rate (the chance of at least one Type I error) will be much larger than 5%.
- Effectively, the testing procedure becomes biased in favor of rejecting at least one $H_0$.
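To see the scale of the problem, here is a rough calculation assuming (idealistically) that the tests are independent. The chance of at least one Type I error among $m$ tests is
  $$1 - (1 - \alpha)^m, \qquad \text{e.g., } m = 15,\ \alpha = 0.05: \quad 1 - 0.95^{15} \approx 0.54$$
So with 15 pairwise tests, the nominal "5%" procedure would flag at least one false difference more than half the time.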
Valid Approaches
- How do we adjust for this multiplicity issue?
  - Least Significant Differences (LSD) procedure (unadjusted!) – relies on a significant F-test.
  - Bonferroni adjustment – turns out to be too conservative for all pairwise comparisons.
  - Tukey adjustment – best for all pairwise comparisons; usually best for our class because we usually will compare all pairs.
  - Dunnett adjustment – appropriate for comparing each treatment to a control (fewer tests).
Least Significant Differences
- No adjustment! The LSD procedure goes as follows:
  - Verify that the model F-test is significant, to confirm the existence of differences.
  - Unadjusted differences are used (t-tests). At a minimum, the means that are furthest apart are presumed to be different.
- Note: Our textbook mislabels this (what it calls LSD is actually Bonferroni-adjusted LSD).
LSD (2)
- If we use the LSD procedure, we are NOT making any formal adjustment.
  - The Type I error rate IS inflated by the number of tests.
- Some things we can do:
  - Use strict requirements on the F-test: use α = 0.005 instead of α = 0.05.
  - Additionally, we could strengthen the requirements on the t-tests, using α = 0.01.
- Neither is a formal adjustment; the Type I error rate is uncontrolled.
LSD – Why it Works
- Some p-values are stronger than others. When the F-test is "very" significant, we can be more confident that some groups really do have different means, and LSD will find those.
- We are informally adjusting for multiplicity by "strengthening" our requirements for alpha.
- Works well when we are exploring, perhaps to be followed by a more rigorous study, and we are not too concerned about Type I errors.
Example
- Means:
  - Treatment 1: 13
  - Treatment 2: 27
  - Treatment 3: 14
  - Treatment 4: 24
- Overall F-test: significant, p < 0.001
- Pairwise tests (p-values):
  - 1 vs 2: < 0.001
  - 1 vs 3: 0.8721
  - 1 vs 4: < 0.001
  - 2 vs 3: < 0.001
  - 2 vs 4: 0.0473
  - 3 vs 4: < 0.001
Example (2)
- There are two clear groups here, (1, 3) and (2, 4). Between these groups the differences are clear.
- Because the p-value for 2 vs 4 (0.0473) is so borderline, we should not consider these two treatments to be different.
Lines Plot (Example)
- A convenient way to represent this information is via a "lines" plot:

Treatment   Mean   Grouping
TRT 2       27     A
TRT 4       24     A
TRT 3       14     B
TRT 1       13     B
Lines Plot (2)
- There can be overlapping groups. For example, we might wind up with something like:

Treatment   Mean   Grouping
TRT 2       27     A
TRT 4       24     A
TRT 5       19     A B
TRT 3       14     B
TRT 1       13     B
TRT 6       1      C
Bonferroni Adjustment
- Still uses a t critical value, but we formally adjust our t-tests and use a Bonferroni t.
- There are a(a – 1)/2 pairwise tests. Divide alpha by this number for the pairwise comparisons (this can be expensive):
  - 6 treatments, 15 pairs: effective α = 0.05/15 ≈ 0.00333
  - 8 treatments, 28 pairs: effective α = 0.05/28 ≈ 0.00179
- We are formally adjusting the t critical value to avoid Type I error inflation.
Bonferroni (2)
- The advantage here is that you don't need to worry about the F-test. (It is possible to have significant t-tests without a significant F-test!)
- Bonferroni works best when:
  - you are only interested in a few of the comparisons (not all pairs are being compared, so you don't have to split up α as much!), and
  - you have planned your tests in advance (you know which ones you want to compare before the analysis).
Comparison: LSD vs. Bonferroni
- Control of the Type I error rate?
- Power?
Tukey's Method
- Concept: the pairwise comparisons are dependent (they involve the same means). We can take advantage of that dependence to get more power than a Bonferroni adjustment (at the same alpha).
- The change is in the critical value. Instead of a t distribution, we use the studentized range distribution (Q).
- Critical values are in Table A-6 (similar to the F tables); to get a usable critical value $Q$, we must divide the tabled $q$ by $\sqrt{2}$:
  $$Q = \frac{q_{\alpha,\, a,\, N-a}}{\sqrt{2}}$$
Tukey's Method (2)
- Our CI becomes:
  $$(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) \pm Q \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
- This CI will be narrower than the Bonferroni intervals, but still wider than the LSD intervals, since it does take care of the overall Type I error rate.
- The Tukey method can only be used for pairwise comparisons of means.
  - It also works better when cell sizes are equal!
  - It is best for all pairwise comparisons!
Tukey vs. Bonferroni
- Remember, the only thing that changes is the critical value!
- Tukey is always better if you are doing ALL pairwise comparisons.
  - If you only need a small number of comparisons (planned in advance), Bonferroni can be superior.
- So by comparing the critical values (Bonferroni t vs. Tukey Q) you can see which method is advantageous (you'll do this in the homework; a worked comparison follows below).
- The smaller critical value gives more power!
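As an illustration, here is the comparison for the blood type example used later in this topic (a = 4 groups, 8 error df, all 6 pairs tested at overall α = 0.05):
  $$t_{Bon} = t_{0.05/(2 \cdot 6),\, 8} \approx 3.479 \qquad \text{vs.} \qquad Q = \frac{q_{0.05,\, 4,\, 8}}{\sqrt{2}} = \frac{4.5288}{\sqrt{2}} \approx 3.202$$
Since 3.202 < 3.479, Tukey has the smaller critical value, hence narrower intervals and more power, for all pairwise comparisons here. (Both values appear in the SAS output later: 3.47888 and 4.52880.)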
Minimum Significant Differences
- Because of the structure of the confidence interval, zero will be included in the interval if and only if the difference in means is less than:
  $$CRIT \times \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
- Or, if the cell sizes are the same:
  $$CRIT \times \sqrt{\frac{2\,MSE}{n}}$$
Minimum Significant Difference (2)
- This is the half-width of the CI, and it is called the minimum significant difference (MSD).
- Any two means that differ by a larger value will be considered statistically different.
- Note that this value is generally shown in the SAS output, and it depends on the comparison method in use (a numeric check follows below).
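As a check, the Tukey MSD reported in the blood type output later in this topic follows directly from this formula. With q = 4.5288, MSE = 2.0833, and n = 3:
  $$MSD = \frac{4.5288}{\sqrt{2}} \times \sqrt{\frac{2(2.0833)}{3}} \approx 3.202 \times 1.178 \approx 3.774$$
which matches the "Minimum Significant Difference 3.774" line in the SAS output.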
Example
- Suppose that you have six treatment groups and the treatment means are:
  - TRT 1: 52
  - TRT 2: 76
  - TRT 3: 58
  - TRT 4: 54
  - TRT 5: 83
  - TRT 6: 46
- Suppose we want to compare all 6 treatments. Which adjustment is appropriate? ______
- From this adjustment, we calculate the minimum significant difference as 10. Which groups are significantly different? Construct a "lines" plot.
Example (2)
- First sort the means (increasing or decreasing order):

Treatment   Mean   Grouping
TRT 5       83
TRT 2       76
TRT 3       58
TRT 4       54
TRT 1       52
TRT 6       46
Example (3)
- Now, starting at the top, form the first group (remember the Tukey MSD is 10):

Treatment   Mean   Grouping
TRT 5       83     A
TRT 2       76     A
TRT 3       58     B
TRT 4       54
TRT 1       52
TRT 6       46
Example (4)
- Continue down the table (algorithmically):

Treatment   Mean   Grouping
TRT 5       83     A
TRT 2       76     A
TRT 3       58     B
TRT 4       54     B C
TRT 1       52     B C
TRT 6       46     C
Example (5)
- Notice that when a group ends, you simply drop down to the next group's starting mean and begin comparing again.
- It is not unusual at all to have some overlap between groups, so you may have to check backward against the groups above.
- Remember, this process only works for cell sizes that are the same (or very similar). WHY?
Dunnett's Method
- Specifically designed for comparing each treatment to a control group! It is based on another distribution (similar to Tukey's) that reflects the dependence among these a – 1 tests.
- Just as Tukey is the most powerful method for all pairwise comparisons, Dunnett is the most powerful method for "treatment vs. control" comparisons.
- Our book does not have these critical values, but it is easy to use Dunnett in SAS (and it will provide you with the minimum significant difference as well).
Example
- Suppose that in our previous example, treatment 6 was a control. We should have used Dunnett's method instead of Tukey's.
- We calculate the Dunnett MSD as 7.

Treatment   Mean
TRT 5       83
TRT 2       76
TRT 3       58
TRT 4       54
TRT 1       52
CONTROL     46

- Which groups are now different?
Summary: Pairwise Comparisons
- For pairwise comparisons of treatments:
  - Dunnett is the most powerful if considering treatments versus a control.
  - Tukey is the most powerful if considering ALL pairwise comparisons.
  - Bonferroni should only be used if you have a relatively small number of pre-planned comparisons of interest.
  - LSD is appropriate for exploratory studies (to be followed up by a more well-planned study).
SAS Code & Output
- A MEANS statement is added to PROC GLM in order to compare the levels of a variable listed in the CLASS statement.

proc glm data=bloodtype;
  class type;
  model resp = type / solution;
  means type / tukey lines;
run;
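As an aside (not covered in these notes), the LSMEANS statement offers an equivalent route to adjusted pairwise comparisons in PROC GLM; a minimal sketch:

proc glm data=bloodtype;
  class type;
  model resp = type;
  * adjusted p-values and CIs for all pairwise differences of least-squares means;
  lsmeans type / pdiff adjust=tukey cl;
run;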
Other Options / Formatting
- BON – use Bonferroni instead of Tukey (will produce full output, but you should want only part of it, right?)
- ALPHA=??? – changes your significance level
- CLM – requests CIs for the means (BON would apply)
- CLDIFF – requests CIs for the differences
- DUNNETT <'xxx'> – uses Dunnett's method, where xxx is the name of the control group
- DUNNETTU / DUNNETTL – one-sided comparisons (strictly better or worse than the control)
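For example, a sketch of the Dunnett option in use, assuming (for illustration only) that type 'O' is designated as the control group:

proc glm data=bloodtype;
  class type;
  model resp = type;
  * compare each blood type to the control level O, with CIs for the differences;
  means type / dunnett('O') cldiff alpha=0.05;
run;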
Output (Tukey, Lines)

Blood Type Example

Alpha                                  0.05
Error Degrees of Freedom               8
Error Mean Square                      2.083333
Critical Value of Studentized Range    4.52880
Minimum Significant Difference         3.774

Means with the same letter are not significantly different.

Tukey Grouping   Mean     N   type
A                33.667   3   B
A                32.667   3   AB
B                27.667   3   A
C                22.667   3   O
Output (CLDIFF, BON)

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons.

Alpha                             0.05
Error Degrees of Freedom          8
Error Mean Square                 2.083333
Critical Value of t               3.47888
Minimum Significant Difference    4.0999

Comparisons significant at the 0.05 level are indicated by ***.
Output (CLDIFF, BON) (2)

type          Difference
Comparison    Between Means    Simultaneous 95% Confidence Limits
B - AB           1.000            -3.100    5.100
B - A            6.000             1.900   10.100   ***
B - O           11.000             6.900   15.100   ***
AB - B          -1.000            -5.100    3.100
AB - A           5.000             0.900    9.100   ***
AB - O          10.000             5.900   14.100   ***
A - B           -6.000           -10.100   -1.900   ***
A - AB          -5.000            -9.100   -0.900   ***
A - O            5.000             0.900    9.100   ***
O - B          -11.000           -15.100   -6.900   ***
O - AB         -10.000           -14.100   -5.900   ***
O - A           -5.000            -9.100   -0.900   ***
Different Sample Sizes
- For illustration, we just delete one of the points from Type B.
- The sample sizes are now 3, 2, 3, 3.
- What will happen to the CIs?
Different Sample Sizes (2)

Tukey's Studentized Range (HSD) Test for resp

Alpha                                  0.05
Error Degrees of Freedom               7
Error Mean Square                      2
Critical Value of Studentized Range    4.68124
Minimum Significant Difference         4.0541
Harmonic Mean of Cell Sizes            2.666667

NOTE: Cell sizes are not equal.

Means with the same letter are not significantly different.

Tukey Grouping   Mean     N   type
A                33.000   2   B
A                32.667   3   AB
B                27.667   3   A
C                22.667   3   O
Confidence Limits

Tukey's Studentized Range (HSD) Test for resp

Alpha                                  0.05
Error Degrees of Freedom               7
Error Mean Square                      2
Critical Value of Studentized Range    4.68124

type          Difference
Comparison    Between Means    Simultaneous 95% Confidence Limits
B - AB           0.333            -3.940    4.607
B - A            5.333             1.060    9.607   ***
B - O           10.333             6.060   14.607   ***
AB - A           5.000             1.178    8.822   ***
AB - O          10.000             6.178   13.822   ***
A - O            5.000             1.178    8.822   ***
Confidence Limits (2)
- The confidence limits involving Type B are of width 8.55. Those not involving Type B are of width 7.64. Why?
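A quick check of those widths, using the values in the output above (MSE = 2, so Q = 4.68124/√2 ≈ 3.310): comparisons with Type B use the smaller cell size n = 2, giving width
  $$2\,Q\sqrt{MSE\left(\tfrac{1}{2} + \tfrac{1}{3}\right)} \approx 8.55,$$
while the others use two cells of size 3, giving width
  $$2\,Q\sqrt{MSE\left(\tfrac{1}{3} + \tfrac{1}{3}\right)} \approx 7.64.$$
The smaller cell size inflates the variance term, hence the wider intervals.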
Questions?
Beyond Pairwise Comparisons
We may want to compare "groupings" of the means rather than individual means. This involves linear combinations and contrasts of means.
Linear Combination of Means
- A linear combination of means is a sum of means that have been multiplied by constants.
- The constants may be anything you like. Sometimes some of them will be zero.
- If the constants sum to zero, then we call the linear combination a contrast.
- Note that any pairwise comparison is a contrast.
Linear Combination (2)
- Consider the fixed effects model
  $$Y_{ij} = \mu + \tau_i + \epsilon_{ij} = \mu_i + \epsilon_{ij}$$
- It is not difficult to conduct a hypothesis test for any linear combination of means that we choose:
  $$L = \sum c_i \mu_i$$
Linear Combinations (Examples)
  $$H_0: \mu_1 + \mu_2 = \mu_3 + \mu_4$$
  $$H_0: 3\mu_1 = \mu_2 + \mu_3 + \mu_4$$
  $$H_0: 2\mu_1 - 7\mu_3 + 0.69\mu_4 = 0$$
  $$H_0: \tfrac{1}{a}\textstyle\sum_{i=1}^{a} \mu_i = \mu_a$$
  $$H_0: \mu_i = \mu_j$$
Linear Combinations (Example)
- Take one example:
  $$H_0: \mu_1 + \mu_2 = \mu_3 + \mu_4$$
- Let's put it in "standard" form:
  $$H_0: 1\mu_1 + 1\mu_2 - 1\mu_3 - 1\mu_4 = 0$$
- Do the constants sum to zero? What does this mean?
  - Contrasts are "fair" comparisons.
  - Not all linear combinations are contrasts.
Linear Combinations (Examples)
  $$H_0: \mu_1 + \mu_2 = \mu_3 + \mu_4$$
  $$H_0: 3\mu_1 = \mu_2 + \mu_3 + \mu_4$$
  $$H_0: 2\mu_1 - 7\mu_3 + 0.69\mu_4 = 0$$
  $$H_0: \tfrac{1}{a}\textstyle\sum_{i=1}^{a} \mu_i = \mu_a$$
  $$H_0: \mu_i = \mu_j$$
- Which of these are contrasts?
Construction of the t-test
- Our statistic under $H_0$ has a t distribution with $N - a$ (error) degrees of freedom:
  $$t_0 = \frac{\hat{L} - L_0}{\sqrt{\widehat{\mathrm{Var}}(\hat{L})}}, \qquad \hat{L} = \sum c_i \bar{Y}_{i\cdot}$$
- The variance estimate is
  $$\widehat{\mathrm{Var}}(\hat{L}) = \widehat{\mathrm{Var}}\left(\sum c_i \bar{Y}_{i\cdot}\right) = \sum c_i^2\, \widehat{\mathrm{Var}}(\bar{Y}_{i\cdot}) = MSE \sum \frac{c_i^2}{n_i}$$
Linear Combinations (Example)
- Take one example: $H_0: 1\mu_1 + 1\mu_2 - 1\mu_3 - 1\mu_4 = 0$, so $L_0 = 0$ and
  $$\hat{L} = 1\bar{Y}_{1\cdot} + 1\bar{Y}_{2\cdot} - 1\bar{Y}_{3\cdot} - 1\bar{Y}_{4\cdot}, \qquad t_0 = \frac{\hat{L} - L_0}{\sqrt{\widehat{\mathrm{Var}}(\hat{L})}}$$
  $$\widehat{\mathrm{Var}}(\hat{L}) = MSE\left(\frac{(1)^2}{n_1} + \frac{(1)^2}{n_2} + \frac{(-1)^2}{n_3} + \frac{(-1)^2}{n_4}\right)$$
Linear Combinations (Pairwise Example)
- Another example: $H_0: \mu_i - \mu_j = 0$, so $L_0 = 0$ and $\hat{L} = \bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}$:
  $$t_0 = \frac{\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}}{\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$$
  $$\widehat{\mathrm{Var}}(\hat{L}) = MSE\left(\frac{(1)^2}{n_i} + \frac{(-1)^2}{n_j}\right) = MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)$$
Why t-tests Instead of the Overall F-test?
- With t-tests, you can address the specific hypotheses you are interested in, rather than just testing the overall equality of means.
- A note on the F-test: the ANOVA F-test in reality jointly tests all possible contrasts.
  - This decreases the power relative to testing only those contrasts of interest to the experiment.
  - This is why, on occasion, individual t-tests may be significant while the overall F-test is not: the ANOVA F-test just can't look closely enough to see what is going on!
Multiplicity Issues
- Because we are often looking at multiple tests or confidence intervals, if we use standard t critical values, the overall Type I error rate (as we've seen in the past) will not be well controlled.
- Another issue is that not all of the linear combinations you test will be independent.
  - This actually turns out to be a good thing, because it is possible to take advantage of the dependencies in developing, e.g., the Tukey or Dunnett adjustments.
Multiplicity Issues (2)
- Another issue of particular importance is data snooping. This is often done in an exploratory study where we want to search for differences. In this case, we'll probably decide what to test after seeing the sample means.
- By doing this, we effectively perform all possible tests, and as we've discussed before, the testing procedure becomes biased in favor of rejecting the null for at least one test.
Can We Data Snoop?
- It turns out that in some cases we can "data snoop" in a fair and reasonable manner.
- We've already seen that the Tukey adjustment may be used to perform all pairwise comparisons (we sacrifice a bit of power for control of alpha).
- It is possible to expand this to all possible contrasts using a Scheffé adjustment.
Scheffé's Method
- Scheffé's method obtains a critical value $S$ that may be used to set up simultaneous CIs for all contrasts. Again, we sacrifice power for control of the significance level.
- The critical value is based on the F distribution:
  $$S = \sqrt{(a - 1)\, F_{\alpha,\, a-1,\, N-a}}$$
- The CI is given by:
  $$\sum c_i \bar{Y}_{i\cdot} \pm S \sqrt{MSE \sum \frac{c_i^2}{n_i}}$$
Scheffé's Method (2)
- Remember, to apply Scheffé you MUST have a contrast. That is, you must have
  $$\sum c_i = 0$$
- Chosen when you have unplanned contrasts.
- Also chosen AND recommended even for pairwise comparisons if you have vastly different cell sizes.
Comparison of Methods
- The LSD procedure will always have the most power (but won't control the Type I error rate).
  - Usually for exploratory studies, to be followed by a more well-planned experiment.
- Bonferroni will be the most powerful for a few pre-planned comparisons while controlling the Type I error rate.
- Tukey will be the most powerful for all pairwise comparisons while controlling the Type I error rate.
Comparison of Methods (2)
- Dunnett will be the most powerful for comparing treatments to a control while controlling the Type I error rate.
- Scheffé will usually be the least powerful, but it will control the Type I error rate for ALL CONTRASTS.
  - It allows data snooping!
  - It is also useful if cell sizes are vastly different.
General Form of Test / CI
- A confidence interval for any linear combination may be obtained by considering:
  $$\sum c_i \bar{Y}_{i\cdot} \pm CRIT \times \sqrt{MSE \sum \frac{c_i^2}{n_i}}$$
- As long as we make an appropriate choice for the critical value, everything else is identical.
Contrasts in SAS
- Consider testing whether the B/AB groups are the same as the A/O groups in the blood type example:
  $$L_1 = \mu_A - \mu_{AB} - \mu_B + \mu_O$$
  $$L_2 = \tfrac{1}{2}\mu_A - \tfrac{1}{2}\mu_{AB} - \tfrac{1}{2}\mu_B + \tfrac{1}{2}\mu_O$$
- (The coefficients below are applied in the sort order of the CLASS levels: A, AB, B, O.)

proc glm data=bloodtype;
  class type;
  model resp = type;
  contrast 'L1' type 1 -1 -1 1;
  contrast 'L2' type 0.5 -0.5 -0.5 0.5;
  estimate 'L1' type 1 -1 -1 1;
  estimate 'L2' type 0.5 -0.5 -0.5 0.5;
run;
SAS Output

Contrast   DF   Contrast SS   Mean Square   F Value   Pr > F
L1         1    192.00000     192.00000     92.16     <.0001
L2         1    192.00000     192.00000     92.16     <.0001

Parameter   Estimate   Std Error   t Value   Pr > |t|
L1          -16.000    1.66667     -9.60     <.0001
L2           -8.000    0.83333     -9.60     <.0001
SAS Output (comments)
- The only difference is that one set of estimates is double the other. The test statistics and p-values are the same.
- Scheffé is not (and for some reason cannot be) applied here. So if this is an unplanned comparison, you would need to use the estimate and SE, along with the appropriate Scheffé critical value, to develop your CI.
Example
- Suppose we did want to use Scheffé.
- Get the F critical value on 3 and 8 DF from the tables: 4.07.
- Take
  $$S = \sqrt{(a - 1) F} = \sqrt{3(4.07)} \approx 3.49$$
- The Scheffé-adjusted CI is
  $$-16 \pm 3.49(1.667) = (-21.8,\ -10.2)$$
- Conclude the groupings are different.
Summary: Scheffé
- Scheffé's main advantage is that it works for UNPLANNED contrasts. It effectively "allows" data snooping.
- It is still better to have a few PLANNED contrasts or linear combinations. Then you can apply Bonferroni with a little more power.
Questions?
CLG Activity