Design and Analysis of Experiments
Professor Daniel Houser
Noteset 6 - Repeated measures.

I. Introduction

Repeated measures designs refer to situations in which some experimental units are observed in more than one treatment condition. Complete repeated measures designs, our focus, occur when each experimental unit is observed under every treatment.

It is important to point out that repeated measures designs are not meant to handle the following situations.

(a) Multiple measurements taken on the same fixed quantity, for example the weight of a particular sub-atomic particle. In this case the value of multiple measurements arises because the effect of measurement error is averaged out over a large number of observations, thus increasing the precision of the estimate of the weight. More will be said about this below.

(b) Multiple measurements taken on each experimental unit but within a single treatment. For example, in a simple CRD fertilizer experiment, one might measure the height of each plant each day. The measurements within a single plant will likely be positively correlated across time (if it is taller than average one day, it will likely be taller than average the next), but valuable information might be gained if these daily values are compared, for example, to weather patterns. It might be found that one type of fertilizer does best overall while another seems to provide the best foundation for growth during particularly dry periods.

When the sequence in which treatments are applied is unimportant, as may be the case in some industrial manufacturing experiments, repeated measures designs are often just special cases of randomized block designs. This is also the case when a particular sequence is of interest and is used by all subjects, as in many learning experiments. In these cases the goal is to increase the sensitivity of the experiment to differences between treatments by controlling for variation due to inherent differences between individuals.
The idea is that each subject is a block, and therefore acts as its own control. The response of a subject to each treatment is measured in relation to that subject's overall mean response to all treatments.

Sequence and order dependencies pose special problems. If there is a fairly large number of experimental units, then randomizing the order in which they experience treatments, perhaps subject to the constraint that all sequences of interest are observed the same number of times, might control for such effects. This and other ways to handle such problems are discussed below.

The main difference between completely randomized and repeated measures designs is that in the former it is often reasonable to assume independence across all observations, while in the latter the observations are likely to be dependent. If this dependence is not taken into account, then inference about treatment effects may be biased or may fail to reveal the true significance of treatment contrasts.

II. Standard ANOVA for repeated measures without order dependencies.

In some cases repeated measures designs can be analyzed using the same type of ANOVA used in randomized complete block designs. This section reviews that ANOVA, provides the necessary and sufficient condition for its validity, and presents two models that, if valid, allow a standard ANOVA analysis.

A. Experiment's data and analysis.

The data consist of one observation X_ij for each person i = 1, ..., n under each treatment j = 1, ..., k:

                   Person
  Treatment    1    ...    n     Total   Mean
      1      X_11   ...  X_n1     T_1    T̄_1
     ...      ...   ...   ...     ...    ...
      k      X_1k   ...  X_nk     T_k    T̄_k
  Total      P_1    ...   P_n      G
  Mean       P̄_1   ...   P̄_n     Ḡ

Here T̄_j, P̄_i and Ḡ denote the treatment, person and grand means. The total sum of squares for these data is

  SS_total = Σ_i Σ_j (X_ij − Ḡ)².

The total sum of squares can be decomposed in two useful ways. One is the standard way for block designs:

(1)  SS_total = SS_treat + SS_b.people + SS_res,

where

  SS_treat = n Σ_{j=1}^{k} (T̄_j − Ḡ)²,   (df = k − 1)

  SS_b.people = k Σ_{i=1}^{n} (P̄_i − Ḡ)²,   (df = n − 1)

  SS_res = Σ_i Σ_j ([X_ij − Ḡ] − [T̄_j − Ḡ] − [P̄_i − Ḡ])².   (df = (n − 1)(k − 1))

This is the standard randomized block design decomposition we saw in lecture four.
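As a quick numerical illustration, decomposition (1) can be computed directly from a data matrix. The following sketch uses a small hypothetical 3-person, 2-treatment data set (the numbers are invented for illustration) and verifies that the three components sum to the total sum of squares.

```python
# Sketch of decomposition (1): X[i][j] = response of person i under treatment j.
# Hypothetical 3-person x 2-treatment data.
X = [[3.0, 5.0],
     [1.0, 4.0],
     [2.0, 6.0]]
n, k = len(X), len(X[0])
G = sum(sum(row) for row in X) / (n * k)                     # grand mean
T = [sum(X[i][j] for i in range(n)) / n for j in range(k)]   # treatment means
P = [sum(row) / k for row in X]                              # person means

SS_total = sum((X[i][j] - G) ** 2 for i in range(n) for j in range(k))
SS_treat = n * sum((Tj - G) ** 2 for Tj in T)                # df = k - 1
SS_people = k * sum((Pi - G) ** 2 for Pi in P)               # df = n - 1
SS_res = sum((X[i][j] - T[j] - P[i] + G) ** 2
             for i in range(n) for j in range(k))            # df = (n-1)(k-1)

# Decomposition (1) holds exactly.
assert abs(SS_total - (SS_treat + SS_people + SS_res)) < 1e-9
```

Note that the residual term [X_ij − Ḡ] − [T̄_j − Ḡ] − [P̄_i − Ḡ] simplifies to X_ij − T̄_j − P̄_i + Ḡ, which is the form used in the code.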
A second way to decompose the total sum of squares is

(2)  SS_total = SS_b.people + SS_w.people,

where

  SS_w.people = Σ_i Σ_j (X_ij − P̄_i)².   (df = n(k − 1))

This is similar to the CRD sum of squares decomposition, where we saw that the total sum of squares was equal to the sum of the within and between sums of squares. From (1) and (2) it follows that

  SS_w.people = SS_treat + SS_res.

This is summarized by the following standard ANOVA table.

  Source of variation    SS                                                  df
  Between people         k Σ_i (P̄_i − Ḡ)²                                   n − 1
  Within people          Σ_i Σ_j (X_ij − P̄_i)²                              n(k − 1)
    Between treatments   n Σ_j (T̄_j − Ḡ)²                                   k − 1
    Residual             Σ_i Σ_j ([X_ij − Ḡ] − [T̄_j − Ḡ] − [P̄_i − Ḡ])²   (n − 1)(k − 1)
  Total                  Σ_i Σ_j (X_ij − Ḡ)²                                 nk − 1

We would like to test the null hypothesis that different treatments have no effect on outcomes by running the standard F-test. That is, under the null hypothesis, we would like it to be the case that

  [SS_treat / (k − 1)] / [SS_res / ((n − 1)(k − 1))] ~ F(k − 1, (n − 1)(k − 1)).

B. Ensuring the validity of the F-test.

A necessary and sufficient condition for the validity of the standard F-test is that the experiment's covariance matrix is circular in form. An experiment's covariance matrix Σ_X has the variance of each treatment's observations along the main diagonal and the covariance between the observations in the off-diagonal cells. Hence, in the case of a repeated measures design that includes k treatments and n experimental units,

  σ(j, j) = var(X_j);   σ̂(j, j) = Σ_{i=1}^{n} (X_ij − T̄_j)² / (n − 1),

  σ(j, j′) = cov(X_j, X_j′);   σ̂(j, j′) = Σ_{i=1}^{n} (X_ij − T̄_j)(X_ij′ − T̄_j′) / (n − 1).

We say that the covariance matrix is circular if

  σ(j, j) + σ(j′, j′) − 2σ(j, j′) = 2λ   for all j ≠ j′, with λ > 0.

Hence, a circular covariance matrix ensures that (a) the variance of the difference between any two X variates is a constant and (b) the average variance and the average covariance differ by a constant. Notice that any scalar multiple of a k x k identity matrix is circular. In fact, this sort of matrix is called "spherical."
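The circularity condition can be checked mechanically: for every pair of treatments, the difference-variance σ(j, j) + σ(j′, j′) − 2σ(j, j′) must equal the same constant 2λ. A minimal sketch, using a hypothetical 3 x 3 compound-symmetric matrix (equal variances on the diagonal, equal covariances off it), which is one way to satisfy the condition:

```python
# Circularity check: sigma(j,j) + sigma(j',j') - 2*sigma(j,j') must equal the
# same constant 2*lambda for every pair j != j'. Hypothetical 3x3 matrix.
Sigma = [[5.0, 3.0, 3.0],
         [3.0, 5.0, 3.0],
         [3.0, 3.0, 5.0]]
k = len(Sigma)
diffs = [Sigma[j][j] + Sigma[jp][jp] - 2 * Sigma[j][jp]
         for j in range(k) for jp in range(j + 1, k)]
# All pairwise difference-variances equal -> circular.
is_circular = max(diffs) - min(diffs) < 1e-9
```

Here every pairwise difference-variance equals 4, so the matrix is circular (with λ = 2) and the standard F-test would be exact.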
C. Two models that can generate circular covariance matrices.

The strictly additive model:

  X_ij = μ + π_i + τ_j + ε_ij.

Here,

  μ = grand mean,
  π_i = random individual effect, π_i ~ N(0, σ_π²),
  τ_j = fixed treatment effect, Σ_j τ_j = 0,
  ε_ij = homoskedastic error term, ε_ij ~ N(0, σ_ε²).

Note that the distribution of the error is the same for all treatments, and the error is uncorrelated with everything else in the model. Since the random individual effects and the error terms are assumed to be uncorrelated, it follows that

  σ(j, j) = σ_π² + σ_ε²,   all j,
  σ(j, j′) = σ_π²,   all j ≠ j′.

Hence this covariance matrix is circular and the usual ANOVA analysis applies.

A model with interaction that can generate a circular covariance matrix:

  X_ij = μ + π_i + τ_j + (πτ)_ij + ε_ij,

where everything is as above, except that (πτ)_ij is a random person-by-treatment interaction with

  Σ_j (πτ)_ij = 0,   so that   (πτ)_ij ~ N(0, ((k − 1)/k) σ_πτ²).

Because of the restriction that within each person the interaction effects must sum to zero, it follows that the interaction effects are not independent, but rather

  cov((πτ)_ij, (πτ)_ij′) = −(1/k) σ_πτ²,

although independence across subjects still holds:

  cov((πτ)_ij, (πτ)_i′j) = 0.

The expected mean square statistics for this model are as follows.

  Source of variation    df               MS                          E(MS)
  Between people         n − 1            SS_b.people / (n − 1)       σ_ε² + k σ_π²
  Between treatments     k − 1            SS_treat / (k − 1)          σ_ε² + σ_πτ² + n Σ_j τ_j² / (k − 1)
  Residual               (n − 1)(k − 1)   SS_res / [(n − 1)(k − 1)]   σ_ε² + σ_πτ²

If the covariance matrix is circular, then the usual F-statistic is exact. However, this model does not necessarily lead to a circular covariance matrix. If it does not, then the F-test is only approximate.

D. The Box correction.

In the interaction model the circularity assumption does not necessarily hold, but if it did, the usual F-test would be appropriate and exact. In such situations Box (1954) devised a way to modify the degrees of freedom of the F-test so that, even if circularity is violated in an arbitrary way, the test is extremely accurate. Continue to assume that there are k treatments in the experiment, so that the experiment's covariance matrix is k x k.
If the true covariance matrix were known, then a measure of the extent to which it departs from circularity could be constructed as follows:

  ε = k² (σ̄_jj − σ̄..)² / [(k − 1)(Σ_j Σ_j′ σ_jj′² − 2k Σ_j σ̄_j.² + k² σ̄..²)],

where

  σ̄_jj = mean of the entries on the main diagonal,
  σ̄_j. = mean of the entries in row j,
  σ̄.. = mean of all k² entries,

and, of course, σ_jj′ indicates the entry in row j and column j′ of the covariance matrix. It can be shown that 1/(k − 1) ≤ ε ≤ 1.0, with circularity being achieved when ε takes the value one, and the maximum departure from circularity occurring when ε = 1/(k − 1).

Typically one will need to estimate ε using the sample covariance matrix. This can be done as follows:

  ε̂ = k² (s̄_jj − s̄..)² / [(k − 1)(Σ_j Σ_j′ s_jj′² − 2k Σ_j s̄_j.² + k² s̄..²)],

where the s's represent the elements of the estimated covariance matrix. This estimate is biased. An alternative biased estimator that seems to perform better is

  ε* = min(1, [n(k − 1)ε̂ − 2] / {(k − 1)[n − 1 − (k − 1)ε̂]}).

Box (1954) has shown that the F statistic in the repeated measures design is well approximated by F((k − 1)ε̂, (n − 1)(k − 1)ε̂). Hence, in the event that the covariance matrix is not circular, one can conduct the standard F-test by simply adjusting the degrees of freedom appropriately.

If the covariance matrix is not circular but no adjustment to the degrees of freedom is made, the test will have positive bias. This means that it will be "too easy" to reject the null hypothesis of no treatment effects. Another way to think about this is that the experimenter may believe they are rejecting the null at a 5% significance level when the true level of significance is, say, 15%. Note that if the null is not rejected under the unadjusted test, then one need not do any more work, since adjusting the degrees of freedom only makes rejecting the null more difficult. That is, the unadjusted test has the most power.
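The estimate ε̂ and the correction ε* translate directly into code. A sketch (the function and variable names are mine; S stands for a hypothetical sample covariance matrix):

```python
# Estimate of Box's epsilon from a sample covariance matrix S (list of lists),
# following the formula in the notes, plus the alternative estimator eps*.
def epsilon_hat(S):
    k = len(S)
    diag_mean = sum(S[j][j] for j in range(k)) / k      # s-bar_jj
    grand_mean = sum(map(sum, S)) / k**2                # s-bar_..
    row_means = [sum(row) / k for row in S]             # s-bar_j.
    num = k**2 * (diag_mean - grand_mean)**2
    den = (k - 1) * (sum(x**2 for row in S for x in row)
                     - 2 * k * sum(m**2 for m in row_means)
                     + k**2 * grand_mean**2)
    return num / den

def epsilon_star(eps_hat, n, k):
    # eps* = min(1, [n(k-1)e - 2] / {(k-1)[n - 1 - (k-1)e]})
    return min(1.0, (n * (k - 1) * eps_hat - 2)
                    / ((k - 1) * (n - 1 - (k - 1) * eps_hat)))

# A circular (compound-symmetric) matrix should give epsilon = 1.
S = [[5.0, 3.0, 3.0], [3.0, 5.0, 3.0], [3.0, 3.0, 5.0]]
```

On this circular example ε̂ = 1, so the adjusted degrees of freedom reduce to the unadjusted ones, as the theory requires.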
Since it is difficult to know the true form of the covariance matrix, and since in small samples estimates of the covariance matrix will be very imprecise, some argue that one should take a conservative approach and test the null under the "worst case" scenario. If the covariance matrix deviates maximally from circularity, then ε = 1/(k − 1), so the F statistic should be compared to an F(1, n − 1) distribution. This test is conservative in the sense that the true significance level of the test will not be greater than the significance level implied by the F(1, n − 1) distribution. Thus, if we find evidence of treatment effects, we can be confident that they do exist. If the unadjusted F leads to rejection of the null and the conservative F does not, then one has no choice but to estimate the correction factor and report the results from that test.

III. Comments on repeated measures designs that address order and sequencing effects.

Earlier we considered examples where the order in which treatments are encountered is expected to have an effect on outcomes. This problem arises particularly when we are concerned that "learning" resulting from experience with one treatment will affect outcomes on other treatments.

A "brute-force" way to handle order or sequence effects is to block on them. That is, if there are k distinct treatments, then there are M = k! distinct orders in which the treatments can be encountered. One might model this by assuming a strictly additive model,

  y_ijm = μ + π_i + τ_j + γ_m + ε_ijm,

where γ_m is the effect due to sequence m, or one might specify a model that allows for interactions (modeled as random effects) between the treatments and order, such as

  y_ijm = μ + π_i + τ_j + (τγ)_jm + ε_ijm.

The ANOVA analysis of these models follows directly from the arguments set out above, as long as the assumptions of those models are met and the design is complete, in the sense that all possible sequences are observed and there is an equal number of subjects within each sequence.
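The cost of this brute-force blocking is easy to see by enumerating the sequences. A small sketch (k = 4 is a hypothetical choice):

```python
# Blocking on order requires observing every one of the k! treatment
# sequences; with k = 4 treatments there are already 24 of them.
from itertools import permutations

k = 4
sequences = list(permutations(range(k)))   # all distinct treatment orders
n_sequences = len(sequences)               # k! = 24
```

With k = 5 the count is 120, and with k = 9 (as in the example below) it is 362,880, which is why a complete blocking on sequences quickly becomes impractical.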
Note that if there is only one subject within each sequence, then in the strictly additive model the sequence effect is subsumed within the individual effect, while in the interactive model the sequence effect is subsumed within the residual. Finally, it should be pointed out that this same sort of argument applies to factorial experiments, where each distinct sequence becomes an additional factor (called, it turns out, a "sequence factor").

The main drawback to this sort of blocking is that it can be very expensive, both in pecuniary terms and with respect to the experimenter's time. The reason is that k! is rather large unless k is fairly small, say less than four or five. As a practical matter, one often needs to use a different approach. One such approach is based on a Latin square.

Example: Suppose one is doing research that makes use of FMRI technology, and wants to shed light on the areas of the brain that are important in solving certain types of economic problems. Suppose there are three games of interest: a trust game, a dictator game and a punishment game, denoted a_1, a_2, a_3. Moreover, there are three "players" against which each game might be played: a human, computer 1 or computer 2, denoted b_1, b_2, b_3. The difference between the two computers is that one plays a stochastic strategy, while the other plays in a known, nonstochastic way. Hence, there are a total of 9 experimental conditions of interest. If we are concerned about order effects, and if we want each subject to be measured in each condition, we must run at least 9! = 362,880 subjects through the experiment.

Suppose that we are concerned that subjects will become tired or bored if they stay in the FMRI too long, so we decide that each subject will experience only three of the conditions. Still, we are also concerned that the sequence in which they experience the conditions may affect their responses (in this case, brain activation).
This still requires at least 9 * 8 * 7 = 504 subjects if we want to block out sequence effects. An alternative way to control for these effects using a Latin square, and one that requires fewer subjects, is as follows. One assigns n subjects to each of three groups randomly. Each group plays the same conditions. For example, group 1 plays a_1 b_3 (trust - computer 2), a_2 b_1 (dictator - human) and a_3 b_2 (punishment - computer 1). To try to remove possible order effects, the sequence in which each subject plays the three treatments is randomized. For instance, one could assign six players to each group, and then randomly assign each of the six possible sequences to one member of the group. In this form, the experiment requires only 18 subjects.

  Group/Game   a_1   a_2   a_3
  G1           b_3   b_1   b_2
  G2           b_1   b_2   b_3
  G3           b_2   b_3   b_1

The model corresponding to this design is

  E(X_ijkm) = μ + γ_k + π_m(k) + α_i + β_j + (αβ)_ij,

where γ is a group effect, π is an individual (within-group) effect, and the remaining terms are the effects of a, b and their interaction. With p treatment levels (here p = 3) and n subjects per group, the ANOVA is:

  Source of variation         df                E(mean square)
  Between subjects            np − 1
    Groups                    p − 1             σ_ε² + p σ_π² + np σ_γ²
    Subjects within groups    p(n − 1)          σ_ε² + p σ_π²
  Within subjects             np(p − 1)
    A                         p − 1             σ_ε² + np θ_A²
    B                         p − 1             σ_ε² + np θ_B²
    AB                        (p − 1)(p − 2)    σ_ε² + np θ_AB²
    Error                     p(n − 1)(p − 1)   σ_ε²

(Here θ² denotes the fixed-effect analogue of a variance component.) If the model is valid then, for example, testing the null hypothesis that there is no effect due to the opponent in the game can be accomplished by running the usual F-test, MS(B)/MS(Error) ~ F(p − 1, p(n − 1)(p − 1)). As before, a conservative version of this test can be run by comparing the F-statistic to the F(1, p(n − 1)) distribution.

IV. Closely related issues.

A. A test for skill homogeneity.

There may be cases in which it is useful to assess whether the group of subjects is homogeneous with respect to a certain observable characteristic. The question is how to form a measure of that characteristic for each person, and then test whether the measures seem different.
Suppose that one is interested in some characteristic τ, say a subject's ability to perform a certain type of task (typing, for example). Each subject has true ability τ_i, but one can only measure this ability with error:

  X_ij = τ_i + ε_ij,

where ε is pure measurement error. Assuming that each subject's skill level is measured k times, the following table results.

                   Person
  Measure      1    ...    n     Total   Mean
      1      X_11   ...  X_n1     T_1    T̄_1
     ...      ...   ...   ...     ...    ...
      k      X_1k   ...  X_nk     T_k    T̄_k
  Total      P_1    ...   P_n      G
  Mean       P̄_1   ...   P̄_n     Ḡ

A test for homogeneity of skill levels can be conducted by forming the within- and between-people sums of squares and comparing the ratio of the corresponding mean squares to the F(n − 1, n(k − 1)) distribution.

B. A simple test for learning when the data are dichotomous.

Example: Suppose that we are interested in whether subjects "learn" to play a Nash equilibrium. We have 10 subjects play 5 games against a robot that plays an optimal strategy. Suppose this leads to the following data set, where a "1" indicates that the subject played a Nash strategy.

  Subject    1   2   3   4   5   6   7   8   9   10   Total
  Game 1     0   0   0   0   0   0   0   1   1   1      3
  Game 2     0   0   0   1   0   1   0   0   1   1      4
  Game 3     0   1   1   1   0   0   1   0   1   1      6
  Game 4     0   1   1   1   0   1   1   1   1   1      8
  Game 5     0   0   1   1   1   1   1   1   1   1      8
  Total      0   2   3   4   1   3   3   3   5   5     29

Intuitively, the test statistic compares the variability across games to the variability within subjects. In general, if there are n subjects and k games, the appropriate test statistic is (Cochran (1950)):

  Q = n(k − 1) SS_game / SS_w.people,

where SS_game is the between-games sum of squares (games playing the role of treatments in the formulas of Section II) and SS_w.people is the within-people sum of squares. Under the null hypothesis that there is no learning, the Q statistic has a chi-square distribution with (k − 1) degrees of freedom. In this example Q = 10.95, while the 5% critical value of the chi-square(4) distribution is 9.49. Hence we reject the null at the 5% significance level and conclude that learning did occur.
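Cochran's Q for the data above can be computed directly from the sum-of-squares form. The sketch below uses the data from the table (rows of X are subjects, columns are games) and reproduces Q = 10.95:

```python
# Cochran's Q for the Nash-learning data: Q = n(k-1) * SS_game / SS_w.people.
# Rows are subjects 1..10, columns are games 1..5 (from the table above).
X = [[0, 0, 0, 0, 0],   # subject 1
     [0, 0, 1, 1, 0],
     [0, 0, 1, 1, 1],
     [0, 1, 1, 1, 1],
     [0, 0, 0, 0, 1],
     [0, 1, 0, 1, 1],
     [0, 0, 1, 1, 1],
     [1, 0, 0, 1, 1],
     [1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1]]   # subject 10
n, k = len(X), len(X[0])
G = sum(map(sum, X)) / (n * k)                               # grand mean
T = [sum(X[i][j] for i in range(n)) / n for j in range(k)]   # game means
P = [sum(row) / k for row in X]                              # person means
SS_game = n * sum((Tj - G) ** 2 for Tj in T)
SS_w = sum((X[i][j] - P[i]) ** 2 for i in range(n) for j in range(k))
Q = n * (k - 1) * SS_game / SS_w
```

This gives Q = 10.95, exceeding 9.49, the 5% critical value of the chi-square(4) distribution, so the null of no learning is rejected.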