Lecture 6

Design and Analysis of Experiments
Professor Daniel Houser
Noteset 6 - Repeated measures.
I. Introduction
Repeated measures designs refer to situations in which some experimental units
are observed in more than one treatment condition. Complete repeated measures
designs, our focus, occur when each experimental unit is observed under every
treatment. It is important to point out that repeated measures designs are not
meant to handle the following situations.
(a) Multiple measurements taken on the same fixed quantity. For
example, the weight of a particular sub-atomic particle. In this case
the value of multiple measurements arises because the effect of
measurement error is averaged out over a large number of
observations, thus increasing the precision of the weight’s estimate.
More will be said about this below.
(b) Multiple measurements taken on each experimental unit but within a
single treatment. For example, in a simple CRD fertilizer experiment,
one might measure the height of each plant each day. The
measurements within a single plant will likely be positively correlated
across time (if it is taller than average one day it will likely be taller
than average the next) but valuable information might be gained if
these daily values are compared, for example, to weather patterns. It
might be found that one type of fertilizer does best overall while
another seems to provide the best foundation for growth during
particularly dry periods.
When the sequence in which treatments are applied is unimportant, as may be the
case in some industrial manufacturing experiments, repeated measures designs are
often just special cases of randomized block designs. This is also the case when a
particular sequence is of interest and is used by all subjects, as is the case in
many learning experiments. In these cases the goal is to increase the sensitivity of
the experiment to differences between treatments by controlling for variation due
to inherent differences between individuals. The idea is that each subject is a
block, and therefore acts as their own control. The response of a subject to each
treatment is measured in relation to their overall mean response to all treatments.
Sequence and order dependencies pose special problems. If there is a fairly large
number of experimental units then randomizing the order in which they
experience treatments, perhaps subject to the constraint that all sequences of
interest are observed the same number of times, might control for such effects.
This and other ways to handle such problems are discussed below.
The main difference between completely randomized and repeated measures
designs is that in the former it is often reasonable to assume independence across
all observations, while in the latter case observations are likely to be dependent.
If this dependence is not taken into account then inference about treatment effects
may be biased or may fail to reveal the true significance of treatment contrasts.
II. Standard ANOVA for repeated measures without order dependencies.
In some cases repeated measures designs can be analyzed using the same type of
ANOVA used in randomized complete block designs. This section reviews that
ANOVA, provides the necessary and sufficient condition for its validity, and
provides two models that, if valid, allow a standard ANOVA analysis.
A. Experiment’s data and analysis:
Person    Treatment 1   …   Treatment k   Total   Mean
1         X_{11}        …   X_{1k}        P_1     \bar{P}_1
…         …             …   …             …       …
n         X_{n1}        …   X_{nk}        P_n     \bar{P}_n
Total     T_1           …   T_k           G
Mean      \bar{T}_1     …   \bar{T}_k             \bar{G}
The total sum of squares for this data is
SS_{total} = \sum_i \sum_j (X_{ij} - \bar{G})^2.
The total sum of squares can be decomposed in two useful ways. One is the
standard way for block designs. This is:
(1)  SS_{total} = SS_{treatment} + SS_{b.people} + SS_{res}
where
SS_{treat} = n \sum_{j=1}^{k} (\bar{T}_j - \bar{G})^2,   (df = k-1)
SS_{b.people} = k \sum_{i=1}^{n} (\bar{P}_i - \bar{G})^2,   (df = n-1)
SS_{res} = \sum_i \sum_j ([X_{ij} - \bar{G}] - [\bar{T}_j - \bar{G}] - [\bar{P}_i - \bar{G}])^2,   (df = (n-1)(k-1))
This is the standard randomized block design decomposition we saw in lecture
four. A second way to decompose the total sum of squares is:
(2)  SS_{total} = SS_{b.people} + SS_{w.people}
where
SS_{w.people} = \sum_i \sum_j (X_{ij} - \bar{P}_i)^2,   (df = n(k-1)).
This is similar to the CRD sum of squares decomposition, where we saw that the
total sum of squares was equal to the sum of within and between sums of squares.
From (1) and (2) it follows that
SS_{w.people} = SS_{treat} + SS_{res}.
This is summarized by the following standard ANOVA.
Source of variation     SS                                                                                df
Between people          k \sum_{i=1}^{n} (\bar{P}_i - \bar{G})^2                                          n-1
Within people           \sum_i \sum_j (X_{ij} - \bar{P}_i)^2                                              n(k-1)
  Between treatments    n \sum_{j=1}^{k} (\bar{T}_j - \bar{G})^2                                          k-1
  Residual              \sum_i \sum_j ([X_{ij} - \bar{G}] - [\bar{T}_j - \bar{G}] - [\bar{P}_i - \bar{G}])^2   (n-1)(k-1)
Total                   \sum_i \sum_j (X_{ij} - \bar{G})^2                                                nk-1
We would like to test the null hypothesis that different treatments have no effect
on outcomes by running the standard F-test. That is, under the null hypothesis, we would
like it to be the case that:
SStreat /( k  1)
~ F ( k  1, ( n  1)( k  1)).
SSres /[( n  1)(k  1)]
B. Ensuring the validity of the F-test.
A necessary and sufficient condition for the validity of the standard F-test is that the experiment’s covariance matrix is circular in form.
An experiment’s covariance matrix \Sigma_X has the variance of each treatment’s observations along the main diagonal, and the covariance between the observations in the off-diagonal cells. Hence, in the case of a repeated measures design that includes k treatments and n experimental units,
\sigma(j,j) = \mathrm{var}(X_j); \quad \hat{\sigma}(j,j) = \widehat{\mathrm{var}}(X_j) = \frac{\sum_{i=1}^{n} (X_{ij} - \bar{T}_j)^2}{n-1},
\sigma(j,j') = \mathrm{cov}(X_j, X_{j'}); \quad \hat{\sigma}(j,j') = \widehat{\mathrm{cov}}(X_j, X_{j'}) = \frac{\sum_{i=1}^{n} (X_{ij} - \bar{T}_j)(X_{ij'} - \bar{T}_{j'})}{n-1}.
We say that the covariance matrix is circular if:
\sigma(j,j) + \sigma(j',j') - 2\sigma(j,j') = \lambda, \quad (j \neq j', \ \lambda > 0).
Hence, a circular covariance matrix ensures that (a) the variance of the difference
between any two X variates is a constant and (b) the average variance and the
average covariance differ by a constant.
Notice that any scalar multiple of a k × k identity matrix is circular. In fact, this
sort of matrix is called “spherical.”
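A quick empirical check of this condition: under circularity, the sample variance of every pairwise difference X_j - X_{j'} estimates the same constant \lambda. A minimal sketch, assuming the same n × k data layout as above:

```python
import itertools
import numpy as np

def pairwise_difference_variances(X):
    """var(X_j - X_j') for every pair of treatment columns of an n x k matrix X."""
    _, k = X.shape
    return {(j, jp): float(np.var(X[:, j] - X[:, jp], ddof=1))
            for j, jp in itertools.combinations(range(k), 2)}
```

With real data these sample variances will differ somewhat even when circularity holds, so in practice one relies on a correction such as the one discussed in part D below rather than an informal comparison.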
C. Two models that can generate circular covariance matrices.
The strictly additive model.
X_{ij} = \mu + \pi_i + \tau_j + \varepsilon_{ij}. Here,
\mu = grand mean.
\pi_i = random individual effect, \pi_i \sim N(0, \sigma_\pi^2).
\tau_j = fixed treatment effect, \sum_j \tau_j = 0.
\varepsilon_{ij} = homoskedastic error term, \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2).
Note that the distribution of the error is the same for all treatments and
uncorrelated with everything else in the model.
Since the random individual effects and the error terms are assumed to be
uncorrelated, it follows that:
\sigma(j,j) = \sigma_\pi^2 + \sigma_\varepsilon^2, all j.
\sigma(j,j') = \sigma_\pi^2, all j \neq j'.
Hence, this covariance matrix is circular and the usual ANOVA analysis applies.
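A small simulation sketch, with hypothetical parameter values, illustrating this covariance structure:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5000, 4                          # many simulated subjects so sample moments are precise
sigma_pi, sigma_eps = 2.0, 1.0          # assumed standard deviations of pi_i and eps_ij
tau = np.array([0.0, 0.5, 1.0, -1.5])   # fixed treatment effects summing to zero

pi  = rng.normal(0.0, sigma_pi, size=(n, 1))    # random individual effects
eps = rng.normal(0.0, sigma_eps, size=(n, k))   # homoskedastic errors
X = 10.0 + pi + tau + eps                       # mu + pi_i + tau_j + eps_ij

# Diagonal entries are close to sigma_pi^2 + sigma_eps^2 = 5, off-diagonals to sigma_pi^2 = 4.
print(np.cov(X, rowvar=False).round(2))
```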
A model with interaction that can generate a circular covariance matrix.
X_{ij} = \mu + \pi_i + \tau_j + \pi\tau_{ij} + \varepsilon_{ij}
where everything is as above, except that
\pi\tau_{ij} is a random person-by-treatment interaction with
\sum_j \pi\tau_{ij} = 0
so that
\pi\tau_{ij} \sim N\left(0, \frac{k-1}{k}\sigma_{\pi\tau}^2\right).
Because of the restriction that within each person the interaction effects must sum
to zero, it follows that the interaction effects are not independent, but rather
\mathrm{cov}(\pi\tau_{ij}, \pi\tau_{ij'}) = -\frac{1}{k}\sigma_{\pi\tau}^2,
although independence across subjects still holds:
\mathrm{cov}(\pi\tau_{ij}, \pi\tau_{i'j}) = 0.
The expected mean square statistics for this model are as follows.
Source of Variation     Df            MS                        E(MS)
Between people          n-1           SS_{b.people}/(n-1)       \sigma_\varepsilon^2 + k\sigma_\pi^2
Between treatments      k-1           SS_{treat}/(k-1)          \sigma_\varepsilon^2 + \sigma_{\pi\tau}^2 + n\sum_j \tau_j^2/(k-1)
Residual                (n-1)(k-1)    SS_{res}/[(n-1)(k-1)]     \sigma_\varepsilon^2 + \sigma_{\pi\tau}^2
If the covariance matrix is circular then the usual F-statistic is exact. However, this
model does not necessarily lead to a circular covariance matrix. If not then the F-test is
only approximate.
D. The Box correction
In the interaction model the circularity assumption does not necessarily hold, but
if it did the usual F-test would be appropriate and exact. In such situations Box
(1954) derived a way to modify the degrees of freedom of the F-test so that the
test remains accurate even if circularity is violated in an arbitrary way.
Continue to assume that there are k treatments in the experiment, so that the
experiment’s covariance matrix is k × k. If the true covariance matrix were
known, then a measure of the extent to which it departed from circularity could be
constructed as follows.
\varepsilon = \frac{k^2 (\bar{\sigma}_{jj} - \bar{\sigma}_{..})^2}{(k-1) \sum_j \sum_{j'} (\sigma_{jj'} - \bar{\sigma}_{j.} - \bar{\sigma}_{.j'} + \bar{\sigma}_{..})^2}, where
\bar{\sigma}_{j.} = \sum_i \sigma_{ji}/k (the mean of row j), \quad \bar{\sigma}_{.j} = \sum_i \sigma_{ij}/k (the mean of column j),
\bar{\sigma}_{jj} = main diagonal mean,
\bar{\sigma}_{..} = \sum_i \sum_j \sigma_{ij}/k^2 (the mean of all entries).
Of course, \sigma_{ij} indicates the entry in row i and column j in the covariance matrix.
It can be shown that 1/(k-1) \le \varepsilon \le 1.0, with circularity being achieved when \varepsilon takes the
value one, and the maximum departure from circularity occurring when
\varepsilon = 1/(k-1).
Typically, one will need to estimate \varepsilon using the sample covariance matrix. This
can be done as follows:
\hat{\varepsilon} = \frac{k^2 (\bar{s}_{jj} - \bar{s}_{..})^2}{(k-1)\left(\sum_j \sum_{j'} s_{jj'}^2 - 2k \sum_j \bar{s}_{j.}^2 + k^2 \bar{s}_{..}^2\right)}
where the s terms denote the elements, and corresponding means, of the estimated covariance matrix.
This estimate is biased. An alternative biased estimator that seems to perform
better is
\varepsilon^* = \min\left(1, \frac{n(k-1)\hat{\varepsilon} - 2}{(k-1)[n - 1 - (k-1)\hat{\varepsilon}]}\right).
Box (1954) has shown that, under the null hypothesis, the distribution of the F statistic in
repeated measures designs is well approximated by
F((k-1)\hat{\varepsilon}, \ (n-1)(k-1)\hat{\varepsilon}).
Hence, in the event that the covariance matrix is not circular one can conduct the
standard F-test by simply adjusting the degrees of freedom appropriately.
If the covariance matrix is not circular, but no adjustment to the degrees of
freedom is made, the test will have positive bias. This means that it will be “too
easy” to reject the null hypothesis of no treatment effects. Another way to think
about this is that the experimenter may think they are rejecting the null at a 5%
significance level when the true level of significance is, say, 15%. Note that if the
null is not rejected under this test, then one need not do any more work since
adjusting the degrees of freedom only makes rejecting the null more difficult.
That is, the unadjusted test has the most power.
Since it is difficult to know what the true form of the covariance matrix is, and
since in small samples estimates of the covariance matrix will be very imprecise,
some argue that one should take a conservative approach and test the null under
the “worst case” scenario. If the covariance matrix deviates maximally from
circularity, then \hat{\varepsilon} = 1/(k-1), so the F statistic should be compared to an F(1, n-1)
distribution. This test is conservative in the sense that the true significance level
of the test will not be greater than the significance level implied by the F(1,n-1)
distribution. Thus, if we find evidence of treatment effects we can be confident
that they do exist.
If the unadjusted F leads to rejection of the null, and the conservative F does not,
then one has no choice but to estimate the correction factor and report the results
from that test.
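A minimal sketch of the correction factors and the adjusted and conservative comparisons described in this section, assuming S is the k × k sample covariance matrix of the treatment columns and F is the usual statistic from the ANOVA above (function names are illustrative):

```python
import numpy as np
from scipy import stats

def box_epsilon(S):
    """Estimate of Box's epsilon from a k x k sample covariance matrix S."""
    k = S.shape[0]
    s_bar = S.mean()                  # mean of all entries
    d_bar = np.diag(S).mean()         # mean of the main diagonal
    row_means = S.mean(axis=1)
    num = k ** 2 * (d_bar - s_bar) ** 2
    den = (k - 1) * ((S ** 2).sum() - 2 * k * (row_means ** 2).sum() + k ** 2 * s_bar ** 2)
    return num / den

def adjusted_tests(F, n, k, S):
    """p-values from the epsilon-adjusted and the conservative F comparisons."""
    eps_hat = box_epsilon(S)
    eps_star = min(1.0, (n * (k - 1) * eps_hat - 2) /
                        ((k - 1) * (n - 1 - (k - 1) * eps_hat)))
    return {"eps_hat": eps_hat,
            "eps_star": eps_star,     # could be used in place of eps_hat below
            "p_adjusted": stats.f.sf(F, (k - 1) * eps_hat, (n - 1) * (k - 1) * eps_hat),
            "p_conservative": stats.f.sf(F, 1, n - 1)}
```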
III. Comments on repeated measures designs that address order and sequencing effects.
Earlier we considered examples where the order in which treatments are
encountered is expected to have an effect on outcomes. This problem arises
particularly when we are concerned that “learning” resulting from experience with
one treatment will affect outcomes on other treatments.
A “brute-force” way to handle order or sequence effects is to block them. That is,
if there are k distinct treatments, then there are M=k! distinct orders in which the
treatments can be encountered. One might model this by assuming a strictly
additive model:
y_{ijm} = \mu + \pi_i + \tau_j + \gamma_m + \varepsilon_{ijm}
where \gamma_m is the effect due to sequence m, or one might specify a model that
allows for interactions (modeled as random effects) between the treatments and
order, such as
y_{ijm} = \mu + \pi_i + \tau_j + \gamma_{jm} + \varepsilon_{ijm}.
The ANOVA analysis of these models follows the arguments set out above
directly as long as the assumptions of those models are met and the design is
complete, in the sense that all possible sequences are observed and that there is an
equal number of subjects within each sequence. Note that if there is only one
subject within each sequence then in the strictly additive model the sequence
effect is subsumed within the individual effect, while in the interactive model the
sequence effect is subsumed within the residual. Finally, it should be pointed out
that this same sort of argument applies to factorial experiments, where in this case
each distinct sequence becomes an additional factor (called, it turns out, a
“sequence factor.”)
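A minimal sketch of this complete sequence blocking, with hypothetical treatment labels and an arbitrary (equal) number of subjects per sequence:

```python
import itertools
import random

treatments = ["T1", "T2", "T3"]            # k = 3, so there are k! = 6 distinct orders
subjects_per_sequence = 2                  # equal number of subjects in every sequence
sequences = list(itertools.permutations(treatments))

subjects = [f"subject_{i + 1}" for i in range(subjects_per_sequence * len(sequences))]
random.shuffle(subjects)                   # randomize which subject gets which sequence

assignment = dict(zip(subjects, sequences * subjects_per_sequence))
for subject, order in assignment.items():
    print(subject, "->", order)
```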
The main drawback to this sort of blocking is that it can be very expensive – both
in pecuniary terms and with respect to the experimenter’s time. The reason is that
k! is rather large unless k is fairly small, say less than four or five. As a practical
matter, one often needs to use a different approach. One such approach is based
on a Latin square.
Example:
Suppose one is doing research that makes use of FMRI technology. Suppose one
wants to shed light on the areas of the brain that are important in solving certain
types of economic problems. Suppose there are three games of interest: a trust
game, a dictator game and a punishment game, a1, a2, a3. Moreover, there are
three “players” against which each game might be played: a human, computer1
or computer2, denoted b1, b2, b3. The difference between the two computers is
that one plays a stochastic strategy, while the other plays in a known,
nonstochastic way. Hence, there are a total of 9 experimental conditions of
interest. If we are concerned about order effects, and if we want each subject to
be measured in each condition, we must run at least 9!=362,880 subjects through
the experiment.
Suppose that we are concerned that subjects will become tired or bored if they
stay in the FMRI too long, so we decide that each subject will experience only
three of the conditions. Still, we are also concerned that the sequence in which
they experience the conditions may affect their responses (in this case, brain
activation.) This still requires at least 9*8*7=504 subjects if we want to block out
sequence effects. An alternative way to control for these effects using a Latin
square, and that requires fewer subjects, is as follows.
One assigns n subjects to each of three groups randomly. All subjects within a
group play the same three conditions. For example, group 1 plays a1b3 (trust-computer2), a2b1 (dictator-human) and a3b2 (punishment-computer1).
To try to remove possible order effects, the sequence in which each
subject plays the three treatments is randomized. For instance, one could
assign 6 players to each group, and then randomly assign each of the six
possible sequences to one member of the group. In this form, the
experiment requires only 18 subjects.
Group/Game   a1   a2   a3
G1           b3   b1   b2
G2           b1   b2   b3
G3           b2   b3   b1
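A sketch of how this design might be generated in code, mirroring the table above (the labels and the cyclic construction of the square are illustrative):

```python
import itertools
import random

games = ["a1", "a2", "a3"]        # trust, dictator, punishment
players = ["b1", "b2", "b3"]      # human, computer1, computer2

# Cyclic assignment reproducing the table: G1 plays a1b3, a2b1, a3b2;
# G2 plays a1b1, a2b2, a3b3; G3 plays a1b2, a2b3, a3b1.
latin_square = {f"G{g + 1}": [(games[j], players[(j + g + 2) % 3]) for j in range(3)]
                for g in range(3)}

# Six subjects per group: each of the 3! = 6 possible orders of the group's three
# conditions is assigned at random to exactly one subject, giving 18 subjects in total.
for group, conditions in latin_square.items():
    orders = list(itertools.permutations(conditions))
    random.shuffle(orders)
    for subject, order in enumerate(orders, start=1):
        print(group, f"subject {subject}:", order)
```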
The model corresponding to this is:
E(X_{ijkm}) = \mu + \gamma_k + \pi_{m(k)} + \alpha_i + \beta_j + \alpha\beta_{ij}
where \gamma_k is a group effect, \pi_{m(k)} is an individual (subject within group) effect, and the others are the
effects from a, b and their interaction.
The ANOVA is:
Source of variation          Degrees of freedom    E(mean square)
Between subjects             np-1
  Groups                     p-1                   \sigma_\varepsilon^2 + p\sigma_\pi^2 + np\sigma_\gamma^2
  Subjects within groups     p(n-1)                \sigma_\varepsilon^2 + p\sigma_\pi^2
Within subjects              np(p-1)
  A                          p-1                   \sigma_\varepsilon^2 + np\sigma_\alpha^2
  B                          p-1                   \sigma_\varepsilon^2 + np\sigma_\beta^2
  AB                         (p-1)(p-2)            \sigma_\varepsilon^2 + np\sigma_{\alpha\beta}^2
  Error                      p(n-1)(p-1)           \sigma_\varepsilon^2
If the model is valid then, for example, testing the null hypothesis that there is no
effect due to the opponent in the game can be accomplished by running the usual
F-test, MS(B)/MS(Error)~F(p-1,p(n-1)(p-1)). As before, a conservative version
of this test can be run by comparing the F-statistic to the F(1,p(n-1)) distribution.
IV. Closely related issues.
A. A test for skill homogeneity
There may be cases in which it is useful to assess whether the group of subjects is
homogeneous with respect to a certain observable characteristic. The question is
how to form a measure of that characteristic for each person, and then test
whether the measures seem different.
Suppose that one is interested in some characteristic \theta, say, a subject’s ability to
perform a certain type of task (typing, for example). Each subject has true ability
\theta_i but one can only measure this ability with error:
X_{ij} = \theta_i + \varepsilon_{ij}
where \varepsilon_{ij} is pure measurement error. Assuming that each subject’s skill level is
measured k times, the following table results.
Person/measure   1        …   k        Total   Mean
1                X_{11}   …   X_{1k}   P_1     \bar{P}_1
…                …        …   …        …       …
n                X_{n1}   …   X_{nk}   P_n     \bar{P}_n
Total            T_1      …   T_k      G
A test for homogeneity of skill levels can be conducted by forming the within and
between sum of squares and comparing the ratio of mean squares to the
F(n-1,n(k-1)) distribution.
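A minimal sketch of this homogeneity test, again assuming an n × k data matrix with one row per person and one column per repeated measurement:

```python
import numpy as np
from scipy import stats

def skill_homogeneity_test(X):
    """F-test of equal true skill levels across the n people, compared to F(n-1, n(k-1))."""
    n, k = X.shape
    G = X.mean()
    P = X.mean(axis=1)                            # person means
    ss_between = k * ((P - G) ** 2).sum()         # between-people sum of squares
    ss_within = ((X - P[:, None]) ** 2).sum()     # within-people sum of squares
    F = (ss_between / (n - 1)) / (ss_within / (n * (k - 1)))
    return F, stats.f.sf(F, n - 1, n * (k - 1))
```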
B. A simple test for learning when the data are dichotomous.
Example:
Suppose that we are interested in whether subjects “learn” to play a Nash
equilibrium. We have 10 subjects play 5 games against a robot that plays an
optimal strategy. Suppose this leads to the following data set, where a “1”
indicates that the subject played a Nash strategy.
Subject   Game1   Game2   Game3   Game4   Game5   Total
1         0       0       0       0       0       0
2         0       0       1       1       0       2
3         0       0       1       1       1       3
4         0       1       1       1       1       4
5         0       0       0       0       1       1
6         0       1       0       1       1       3
7         0       0       1       1       1       3
8         1       0       0       1       1       3
9         1       1       1       1       1       5
10        1       1       1       1       1       5
Total     3       4       6       8       8       29
Intuitively, the test statistic compares the variability across games to the
variability within subjects. In general, if there are n subjects and k games, the
appropriate test statistic is (Cochran (1950)):
Q = \frac{n(k-1) SS_{game}}{SS_{w.people}}.
Under the null hypothesis that there is no learning, the Q statistic has a chi-square
distribution with (k-1) degrees of freedom.
In this example, Q=10.95 while the 5% critical value for the chi-square(4)
distribution is 9.5. Hence, we find at the 5% significance level that learning did
occur.
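A sketch reproducing this calculation for the data set above, using the formula for Q given earlier:

```python
import numpy as np
from scipy import stats

# Rows are subjects 1-10, columns are games 1-5 (1 = played the Nash strategy).
X = np.array([[0, 0, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 1, 1],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1]])

n, k = X.shape
G = X.mean()
T = X.mean(axis=0)                        # game means
P = X.mean(axis=1)                        # subject means

ss_game = n * ((T - G) ** 2).sum()
ss_within = ((X - P[:, None]) ** 2).sum()
Q = n * (k - 1) * ss_game / ss_within

# Prints roughly 10.95 and the chi-square(4) 5% critical value of about 9.49.
print(Q, stats.chi2.ppf(0.95, k - 1))
```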