One-way ANOVA and Block Designs

advertisement
7 - One-way Analysis of Variance (ANOVA)
Suppose we wish to compare k population means ( k  2 ). This situation can arise in two
ways. If the study is observational, we are obtaining independently drawn samples from
k distinct populations and we wish to compare the population means for some numerical
response of interest. If the study is experimental, then we are using a completely
randomized design to obtain our data from k distinct treatment groups. In a completely
randomized design the experimental units are randomly assigned to one of k treatments
and the response value from each unit is obtained. The mean of the numerical response
of interest is then compared across the different treatment groups.
There two main questions of interest:
1) Are there at least two population means that differ?
2) If so, which population means differ and how much do they differ by?
More formally:
H o : 1   2  ...   k
H a : at least two population means differ, i.e.  i   j for some i  j.
If we reject the null then we use comparative methods to answer question 2 above.
Assumptions:
1. Samples are drawn independently (completely randomized design)
2. Population variances are equal, i.e.  12   22     k2 .
3. Populations are normally distributed.
We examine checking these assumptions in the examples that follow.
The test procedure compares the variation in observations between samples to the
variation within samples. If the variation between samples is large relative to the
variation within samples we are likely to conclude that the population means are not all
equal. The diagrams below illustrate this idea...
Between Group Variation >> Within Group Variation
(Conclude population means differ)
Between Group Variation  Within Group Variation
(Fail to conclude the population means differ)
The name analysis of variance gets its name because we are using variation to
decide whether the population means differ.
101
Hypothetical Example: Comparing k = 9 Treatments
102
Within Group Variation
To measure the variation within groups we use the following formula:
(n1  1) s12  (n2  1) s22    (nk  1) sk2
s 
n1  n2    nk  k
2
W
Note: This is an extension of pooled variance estimate from the two-sample t-Test
2
for independent samples ( s p ) to more than two populations.
This is an estimate of the variance common to all k populations (  2 ).
Between Group Variation
To measure the variation between groups we use the following formula:
n1 ( x1  x ) 2  n2 ( x2  x ) 2    nk ( xk  x ) 2
s 
k 1
where,
2
B
xi  sample mean response for the i th treatment group
x  grand mean of the response from all treatment groups
The larger the differences between the k sample means for the treatment groups are, the
larger s B2 will be.
If the null hypothesis is true, i.e. the k treatments have the same population mean, we
expect the between group variation to the same as the within group variation ( s B2  sW2 ).
If s B2  sW2 we have evidence against the null hypothesis in support of the alternative, i.e.
we have evidence to suggest that at least two of the population means are different.
Test Statistic
F
n = overall sample size
s B2
~ F-distribution with numerator df = k -1 and denominator df = n – k
sW2
The larger F is the strong the evidence against the null hypothesis, i.e. Big FReject Ho.
103
EXAMPLE 1 - Weight Gain in Anorexia Patients
Data File: Anorexia.JMP
These data give the pre- and post-weights of patients being treated for anorexia nervosa.
There are actually three different treatment plans being used in this study, and we wish to
compare their performance in terms of the mean weight gain of the patients. The patients
in this study were randomly assigned to one of three therapies.
The variables in the data file are:
group - treatment group (1 = Family therapy, 2 = Standard therapy, 3 = Cognitive
Behavioral therapy)
Treatment – treatment group by name (Behavioral, Family, Standard)
prewt - weight at the beginning of treatment
postwt - weight at the end of the study period
Weight Gain - weight gained (or lost) during treatment (postwt-prewt)
We begin our analysis by examining comparative displays for the weight gained across
the three treatment methods. To do this select Fit Y by X from the Analyze menu and
place the grouping variable, Treatment, in the X box and place the response, Weight
Gain, in the Y box and click OK. Here boxplots, mean diamonds, normal quantile plots,
and comparison circles have been added.
Things to consider from this graphical display:

Do there appear to be differences in the mean weight gain?

Are the weight changes normally distributed?

Is the variation in weight gain equal across therapies?
104
Checking the Equality of Variance Assumption
To test whether it is reasonable to assume the population variances are equal for these
three therapies select UnEqual Variances from the Oneway Analysis pull down-menu.
Equality of Variance Test
H o :  12   22     k2
H a : the population variances are not
all equal
We have no evidence to conclude that the variances/standard deviations of the weight
gains for the different treatment programs differ (p >> .05).
ONE-WAY ANOVA TEST FOR COMPARING THE THERAPY MEANS
To test the null hypothesis that the mean weight gain is the same for each of the therapy
methods we will perform the standard one-way ANOVA test. To do this in JMP select
Means, Anova/t-Test from the Oneway Analysis pull-down menu. The results of the
test are shown in the Analysis of Variance box.
s B2  307.22
sW2  56.68
The between group variation is 5.42 times larger than the
within group variation! This provides strong evidence against
the null hypothesis (p = .0065).
The p-value contained in the ANOVA table is .0065, thus we reject the null hypothesis at
the .05 level and conclude statistically significant differences in the mean weight gain
experienced by patients in the different therapy groups exist.
MULTIPLE COMPARISONS
Because we have concluded that the mean weight gains across treatment method are not
all equal it is natural to ask the secondary question:
Which means are significantly different from one another?
We could consider performing a series of two-sample t-Tests and constructing confidence
intervals for independent samples to compare all possible pairs of means, however if the
number of treatment groups is large we will almost certainly find two treatment means as
being significantly different. Why? Consider a situation where we have k = 7 different
treatments that we wish to compare. To compare all possible pairs of means (1 vs. 2, 1
k (k  1)
 21 two-sample t-Tests. If
vs. 3, …, 6 vs. 7) would require performing a total of
2
we used   P(Type I Error)  .05 for each test we expect to make 21(.05)  1 Type I
Error, i.e. we expect to find one pair of means as being significantly different when in
fact they are not. This problem only becomes worse as the number of groups, k , gets
larger.
105
Experiment-wise Error Rate
Another way to think about this is to consider the probability of making no Type I Errors
when making our pair-wise comparisons. When k = 7 for example, the probability of
making no Type I Errors is (.95) 21  .3406 , i.e. the probability that we make at least one
Type I Error is therefore .6596 or a 65.96% chance. Certainly this unacceptable! Why
would conduct a statistical analysis when you know that you have a 66% of making an
error in your conclusions? This probability is called the experiment-wise error rate.
Bonferroni Correction
There are several different ways to control the experiment-wise error rate. One of the
easiest ways to control experiment-wise error rate is use the Bonferroni Correction. If
we plan on making m comparisons or conducting m significance tests the Bonferroni
Correction is to simply use 
as our significance level rather than  . This simple
m
correction guarantees that our experiment-wise error rate will be no larger than  . This
correction implies that our p-values will have to be less than  rather than  to be
m
considered statistically significant.
Multiple Comparison Procedures for Pair-wise Comparisons of k Population Means
When performing pair-wise comparison of population means in ANOVA there are
several different methods that can be employed. These methods depend on the types of
pair-wise comparisons we wish to perform. The different types available in JMP are
summarized briefly below:




Compare each pair using the usual two-sample t-Test for independent samples. This
choice does not provide any experiment-wise error rate protection! (DON’T USE)
Compare all pairs using Tukey’s Honest Significant Difference (HSD) approach. This
is best choice if you are interested comparing each possible pair of treatments.
Compare with the means to the “Best” using Hsu’s method. The best mean can either
be the minimum (if smaller is better for the response) or maximum (if bigger is better for
the response).
Compare each mean to a control group using Dunnett’s method. Compares each
treatment mean to a control group only. You must identify the control group in JMP by
clicking on an observation in your comparative plot corresponding to the control group
before selecting this option.
Multiple Comparison Options in JMP
106
EXAMPLE 1 - Weight Gain in Anorexia Patients (cont’d)
For these data we are probably interested in comparing each of the treatments to one
another. For this we will use Tukey’s multiple comparison procedure for comparing all
pairs of population means. Select Compare Means from the Oneway Analysis menu
and highlight the All Pairs, Tukey HSD option. Beside the graph you will now notice
there are circles plotted. There is one circle for each group and each circle is centered at
the mean for the corresponding group. The size of the circles are inversely proportional
to the sample size, thus larger circles will drawn for groups with smaller sample sizes.
These circles are called comparison circles and can be used to see which pairs of means
are significantly different from each other. To do this, click on the circle for one of the
treatments. Notice that the treatment group selected will be appear in the plot window
and the circle will become red & bold. The means that are significantly different from
the treatment group selected will have circles that are gray. These color differences will
also be conveyed in the group labels on the horizontal axis. In the plot below we have
selected Standard treatment group.
107
The results of the pair-wise comparisons are also contained the output window.
The matrix labeled Comparison for all pairs using Tukey-Kramer HSD identifies pairs of
means that are significantly different using positive entries in this matrix. Here we see
only treatments 2 and 3 significantly differ.
The next table conveys the same information by using different letters to represent
populations that have significantly different means. Notice treatments 2 and 3 are not
connected by the same letter so they are significantly different.
Finally the CI’s in the Ordered Differences section give estimates for the differences in
the population means. Here we see that mean weight gain for patients in treatment 3 is
estimated to be between 2.09 lbs. and 13.34 lbs. larger than the mean weight gain for
patients receiving treatment 2 (see highlighted section below).
108
EXAMPLE 2 – Motor Skills in Children and Lead Exposure
These data come from a study of children living in close proximity to a lead smelter in El
Paso, TX. A study was conducted in 1972-73 where children living near the lead smelter
were sampled and their blood lead levels were measured in both 1972 and in1973. IQ
tests and motor skills tests were conducted on these children as well. In this example we
will examine difference in finger-wrist tap scores for three groups of children. The
groups were determined using their blood lead levels from 1972 and 1973 as follows:



Lead Group 1 = the children in this group had blood levels below 40
micrograms/dl in both 1972 and 1973 (we can think of these children as a “control
group”).
Lead Group 2 = the children in this group had lead levels above 40 micrograms/dl
in 1973 (we can think of these children as the “currently exposed” group).
Lead Group 3 = the children in this group had lead levels above 40 microgram/dl
in 1972, but had levels below 40 in 1973 (we can think of these children as the
“previously exposed” group).
The response that was measured (MAXFWT) was the maximum finger-wrist tapping
score for the child using both their left and right hands to do the tapping (to remove hand
dominance). These data are contained in the JMP file: Maxfwt Lead.JMP.
Select Fit Y by X and place Lead Group in the X,Factor box and MAXFWT in the
Y,Response box. After selecting Quantiles and Normal Quantile Plots we obtain the
following comparative display.
The wrist-tap scores appear to be approximately normally distributed. The variance in
the wrist-tap scores may appear to be a bit larger for Lead Group 1, however because this
is the largest group we expect to see more observations in the extremes. The interquartile
ranges for the three lead groups appear to be approximately equal. We can check the
equality of variance assumption formally by selecting the UnEqual Variance option.
109
Formally Checking the Equality of the Population Variances
None of the equality of variance
tests suggests that the population
variances are unequal (p > .05).
ONE-WAY ANOVA FOR COMPARING THE MEAN WRIST-TAP SCORES
ACROSS LEAD GROUP
To test the null hypothesis that the mean finger wrist-tap score is the same for each of the
lead exposure groups we will perform the standard one-way ANOVA test. To do this in
JMP select Means, Anova from the Oneway Analysis pull-down menu. The results of
the test are shown in the Analysis of Variance box.
The p-value contained in the ANOVA table is .0125, thus we reject the null hypothesis at
the .05 level and conclude that statistically significant differences in the mean finger wrist
tap scores of the children in the different lead exposure groups exist (p < .05).
Multiple Comparisons using Tukey’s HSD
Only the mean finger wrist-tap scores of lead
groups 1 and 2 significantly differ, i.e.
children who currently have high lead levels
have a significantly smaller mean than the
children who did not have high lead levels in
either year of the study.
110
Tukey’s HSD Pair-wise Comparisons (cont’d)
The results of the pair-wise comparisons are also contained the output window.
The matrix labeled Comparison for all pairs using Tukey-Kramer HSD identifies pairs of
means that are significantly different using positive entries in this matrix. Here we see
only lead groups 1 and 2 significantly differ.
The next table conveys the same information by using different letters to represent
populations that have significantly different means. Notice lead groups 1 and 2 are not
connected by the same letter so they are significantly different.
Finally the CI’s in the Ordered Differences section give estimates for the differences in
the population means. Here we see that the mean finger wrist tap score for children who
currently have a high blood lead level is estimated to be between .83 and 14.18 taps
smaller than the mean finger wrist tap score for children who did not have high lead
levels in either year of the study.
111
Nonparametric Approach: Kruskal-Wallis Test
If the normality assumption is suspect or the sample sizes from each of the k populations
are too small to assess normality we can use the Kruskal-Wallis Test to compare the size
of the values drawn from the different populations. There are two basic assumptions for
the Kruskal-Wallis test:
1) The samples from the k populations are independently drawn.
2) The null hypothesis is that all k populations are identical in shape, with the
only potential difference being in the location of the typical values (medians).
Hypotheses:
H o : All k populations have the same median or location.
H a : At least one of the populations has a median different from other others
or more simply stated…
At least one population is shifted away from the others.
To perform to the test we rank all of the data from smallest to largest and compute the
rank sum for each of the k samples. The test statistic looks at the difference between the
R 
 N 1
average rank for each group  i  and average rank for all observations 
 . If
2
n


i
 
there are differences in the populations we expect some groups will have an average rank
much larger than the average rank for all observations and some to have smaller average
ranks.
2
k
 R N 1
12
 ~  k21 (Chi-square distribution with df = k-1)
H
ni  i 

N ( N  1) i 1  ni
2 
The larger H is the stronger the evidence we have against the null hypothesis that the
populations have the same location/median. Large values of H lead to small p-values!
112
Example – Performance of Hearing Impaired Young Children
Julia Davis in her paper “Performance of Young Hearing-Impaired Children on a Test of Basic Concepts”
in the Journal of Speech & Hearing Research she investigated the performance of hard-of-hearing school
children on a task involving knowledge of 50 basic concepts considered necessary for satisfactory
academic achievement during kindergarten, first grade, and second grade. The raw scores made by 24
hard-of-hearing school children (K to 3rd grade) on the Boehm Test of Basic Concepts are given below by
age. Do these data provide sufficient evidence to indicate that the scores differ with age?
Age 6: 17
20
24
34
34
38
Age 7: 23
25
27
34
38
47
Age 8: 22
23
26
32
34
34
36
38
38
42
48
50
In JMP
Enter these data into two columns, one
denoting the age group the other containing
the Boehm test score for each student.
113
Kruskal-Wallis Test Results
R1  53.5, R2  74, R3  172.5 and H  2.4205 (p-value = .2981).
There is insufficient evidence to suggest
across these age groups (p = .2981).
that the typical Boehm test score of basic concepts differs
Example 2 – Role of Films in Childhood Aggression (Film Aggression.JMP)
In a study of the potential effects on children of films portraying aggression, a group of psychologists
randomly assigned a group of third-grade students to view one of three films with varying degrees of
aggressive content. Group I saw a film with no aggressive content, the content of Group II’s film was
moderately aggressive, and Group III saw a film with highly aggressive content. After the children viewed
the films, investigators observed each group separately for a period of one hour and kept a record of the
number of aggressive acts engaged in by each child. The resulting data are given below. Is there evidence
that the typical number of aggressive acts differed significantly across the three groups?
Group I
15
18
13
19
25
20
Group II
28
32
26
22
30
24
Group III
21
40
12
42
39
36
17
10
16
23
Conclusion:
114
Multiple Comparisons for Kruskal-Wallis
To determine which groups significantly differ we can use the procedure outlined below. To determine if
group i significantly differs from group j we compute
z ij 
Ri R j

ni n j
and then compute p-value = P( Z  z ij ) .
N ( N  1)  1 1 

n n 
12
j 
 i
If the p-value is less then

where m  # of pair-wise comparisons to be made which would typically be
2m
k 
  if all pair-wise comparisons are of interest. For this example, we can make a total of m =
 2
.05
 3
 .00833 .
   3 pair-wise comparisons so we compare our p-values to
2(3)
 2
Comparing Group III vs. Group I
z13 
15.667  6.90
= 2.66  P(Z > 2.66) = .0039 < .00833 so conclude the number of aggressive
22(23)  1 1 
  
12  6 11 
acts performed by children in the two groups significantly differ. In particular, we conclude that children
that have viewed the film with the largest amount of violence performed significantly more acts of
aggression.
115
116
8 - Randomized Complete Block (RCB) Designs
Example 1 – Comparing Methods of Determining Blood Serum Level
Data File: Serum-Meth.JMP
The goal of this study was determine if four different methods for determining blood
serum levels significantly differ in terms of the readings they give. Suppose we plan to
have 6 readings from each method which we will then use to make our comparisons.
One approach we could take would be to find 24 volunteers and randomly allocate six
subjects to each method and compare the readings obtained using the four methods. This
is called a completely randomized design, and the results would be analyzed using oneway ANOVA. There is one major problem with this approach, what is it?
Instead of taking this approach it would clearly be better to use each method on the same
subject. This removes subject to subject variation from the results and will allow us to
get a clearer picture of the actual differences in the methods. Also if we truly only wish
to have 6 readings for each method, this approach will only require the use of 6 subjects
versus the 24 subjects the completely randomized approach discussed above requires,
thus reducing the “cost” of the experiment.
The experimental design where each patient’s serum level is determined using each
method is called a randomized complete block (RCB) design. Here the patients serve as
the blocks; the term randomized refers to the fact that the methods will be applied to the
patients blood sample in a random order, and complete refers to the fact that each method
is used on each patients blood sample. In some experiments where blocking is used it is
not possible to apply each treatment to each block resulting in what is called an
incomplete block design. These are less common and we will not discuss them in this
class.
The table below contains the raw data from the RCB experiment to compare the serum
determination methods.
Method
Subject
1
2
3
4
1
360 435 391 502
2
1035 1152 1002 1230
3
632 750 591 804
4
581 703 583 790
5
463 520 471 502
6
1131 1340 1144 1300
117
Visualizing the need for Blocking
Select Fit Y by X from the Analyze menu and place Serum Level in the Y, Response box
and Method in the X, Factor box. The resulting comparative plot is shown below. Does
there appear to be any differences in the serum levels obtained from the four methods?
This plot completely ignores the fact that the same six blood samples were used for each
method. We can incorporate this fact visually by selecting Oneway Analysis >
Matching Column... > then highlight Patient in the list. This will have the following
effect on the plot.
Now we can clearly see that ignoring the fact the same six blood samples were used for
each method is a big mistake!
On the next page we will show how to correctly analyze these data.
118
Correct Analysis of RCB Design Data in JMP
First select Fit Y by X from the Analyze menu and place Serum Level in the Y,
Response box, Method in the X, Factor box, and Patient in the Block box. The results
from JMP are shown below.
Notice the Y axis is “Serum Level – Block Centered”. This means that what we are
seeing in the display is the differences in the serum level readings adjusting for the fact
that the readings for each method came from the same 6 patients. Examining the data in
this way we can clearly see that the methods differ in the serum level reported when
measuring blood samples from the same patient.
The results of the ANOVA clearly show we have strong evidence that the four methods
do not give the same readings when measuring the same blood sample (p < .0001).
The tables below give the block corrected mean for each method and the block means
used to make the adjustment.
119
As was the case with one-way ANOVA (completely randomized) we may still wish to
determine which methods have significantly different mean serum level readings when
measuring the same blood sample. We can again use multiple comparison procedures.
Select Compare Means... > All Pairs, Tukey’s HSD.
We can see that methods 4 & 2 differ significantly from methods 1 & 3 but not each
other. The same can be said for methods 1 & 3 when compared to methods 4 & 2. The
confidence intervals quantify the size of the difference we can expect on average when
measuring the same blood samples. For example, we see that method 4 will give
between 90.28 and 255.05 higher serum levels than method 3 on average when
measuring the same blood sample. Other comparisons can be interpreted in similar
fashion.
120
Example 2 – Remotivation of Psychiatric Patients (Remotivation.JMP)
A remotivation team in a psychiatric hospital conducted an experiment to compare five
methods for remotivating patients. Patients were group according to level of initial
motivation. Patients in each group were randomly assigned to the five methods. At the
end of the experimental period the patients were evaluated by a team composed of a
psychiatrist, a psychologist, a nurse, and a social worker, none of whom was aware of the
method to which patients were assigned. The team assigned each patient a composite
score as a measure of his or her level of motivation. The results were as follows
Level of
Initial
Motivation
Nil
Very Low
Low
Average
A
58
62
67
70
B
68
70
78
81
C
60
65
68
70
D
68
80
81
89
E
64
69
70
74
Do these data provide sufficient evidence to indicate a difference in mean scores among
the remotivation methods?
Oneway Analysis of Motivation Score By Remotivation Method
90
Motivation Score
85
80
75
70
65
60
55
A
B
C
D
E
Remotivation Method
Here we have connected observations from the same initial motivation level by a line
segment. What does this plot suggest about the effectiveness of or need for blocking on
initial motivation?
121
In JMP, select Analyze > Fit Y by X and place the response, motivation score in the Y,
Response box, the remotivation method (treatment) in the X, Factor box, and initial
motivation level (blocking variable) in the Block box as shown below.
The resulting block adjusted display clearly shows the difference in the remotivation
methods.
Analysis of Variance Table
122
Multiple Comparisons Using Tukey’s HSD
Discussion:
How might the study have been improved?
123
ONE-WAY ANOVA REGRESSION MODEL
In statistics we often use models to represent the mechanism that generates the observed
response values. Before we introduce the one-way ANOVA model we first must
introduce some notation.
Let,
ni  sample size from group i , (i  1,..., k )
Yij  the j th response value for i th group , ( j  1,..., ni )
  mean common to all groups assuming the null hypothesis is true
 i  shift in the mean due to the fact the observation came from group i
 ij  random error for the j th observatio n from the i th group .
Note: The test procedure we use requires that the random errors are normally distributed
with mean 0 and variance  2 . Equivalently the test procedure requires that the response
is normally distributed with a common standard deviation  for all k groups.
One-way ANOVA Model:
Yij     i   ij i  1,...,k and j  1,...,ni
The null hypothesis equivalent to saying that the  i are all 0, and the alternative says that
at least one  i  0 , i.e. H o :  i  0 for all i vs. H a :  i  0 for some i . We must decide
on the basis of the data whether we evidence against the null. We display this graphically
as follows:
Estimates of the Model Parameters:
̂  Y ~ the grand mean
ˆi  Yi  Y ~ the estimate ith treatment effect
Yˆ  ˆ  ˆ  Y ~ these are called the fitted values.
ij
i
i
Note:
Yij  ˆ  ˆi  ˆij
Yij  Y  (Yi  Y )  (Yij  Yi )
ˆij  Yij  Yˆij  Yij  Yi ~ these are called the residuals.
124
Using the Fit Model Approach to One-way ANOVA in JMP
Select Analyze > Fit Model and set up the dialog box as shown below.
Next turn off the following options in the Response pull-down menu…
125
The result output is shown below with the two options off is shown below
Saving the residuals as a variable to the spread sheet allows one to more accurately assess
normality of the response by examining their distribution directly.
Then use Analyze > Distribution and a normal quantile plot to assess normality
126
Download