Chapter 14: Repeated Measures Analysis of Variance (ANOVA)

First of all, you need to recognize the difference between a repeated measures (or dependent groups) design and a between groups (or independent groups) design. In an independent groups design, each participant is exposed to only one of the treatment levels and then provides one response on the dependent variable. In a repeated measures design, however, each participant is exposed to every treatment level and provides a response on the dependent variable after each treatment. Thus, if a participant has provided more than one score on the dependent variable, you know that you're dealing with a repeated measures design.

Comparing the Independent Groups ANOVA and the Repeated Measures ANOVA

The fact that the scores in each treatment condition come from the same participants has an important impact on the between-treatment variability found in the MSBetween (MSTreatment). In an independent groups design, the variability in the MSBetween arises from three sources: treatment effects, individual differences, and random variability. Imagine, for instance, a single-factor independent groups design with three levels of the factor. As seen below, the three group means vary.

      Scores                     Mean
a1    3   5   2   6   4   3      3.83
a2    7   6   9   7   8   7      7.33
a3    9   8   9   7   9   8      8.33

As you should recall, the variability among the group means determines the MSBetween. In this case, MSBetween = 33.5, which is the variance of the group means (5.583) times the sample size (6).

Why do the group means differ? One source of variability, individual differences, emerges because the scores in each group come from different people. Thus, even with random assignment to conditions, the group means could differ from one another because of individual differences. And the more variability due to individual differences in the population, the greater the variability both within groups and between groups.
Another source of variability, random effects, should play a fairly small role. Nonetheless, because there will be some random variability, it could influence the three group means. Finally, you should imagine that your treatment will have an impact on the means, which is the treatment effect that you set out to examine in your experiment.

Given the sources of variability in the MSBetween, you need to construct an MSError that involves individual differences and random variability. Thus, your F ratio would be:

$$F = \frac{\text{Treatment Effect} + \text{Individual Differences} + \text{Random Variability}}{\text{Individual Differences} + \text{Random Variability}}$$

When treatment effects are absent, your F ratio would be roughly 1.0. As the treatment effects increase, your F ratio grows larger than 1.0. In the case of these data, the F ratio would be fairly large, as seen in the StatView source table below:

ANOVA Table for Score
            DF   Sum of Squares   Mean Square   F-Value   P-Value   Lambda   Power
A            2           67.000        33.500    25.769    <.0001   51.538   1.000
Residual    15           19.500         1.300

Means Table for Score
Effect: A
      Count    Mean   Std. Dev.   Std. Err.
a1        6   3.833       1.472        .601
a2        6   7.333       1.033        .422
a3        6   8.333        .816        .333

Imagine, now, that you have the same three conditions and the same 18 scores, but presume that they come from only 6 participants in a repeated measures design. Even though the MSBetween would be identical, in a repeated measures design that variability is not influenced by individual differences. Thus, the MSBetween of 33.5 would come from treatment effects and random effects. In order to construct an appropriate F ratio, you now need to develop an error term that contains only random variability.
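The independent groups source table above can be reproduced by hand or with a few lines of code. The following is a minimal Python sketch, not StatView's procedure; the function name `independent_groups_anova` and the dictionary keys are illustrative names chosen here, and the scores are the 18 values from the example.

```python
# Sketch: one-way independent groups ANOVA for the three-group example.
# Function and key names are illustrative, not taken from StatView.
scores = {
    "a1": [3, 5, 2, 6, 4, 3],
    "a2": [7, 6, 9, 7, 8, 7],
    "a3": [9, 8, 9, 7, 9, 8],
}


def independent_groups_anova(groups):
    """Return SS, df, MS, and F for a one-way independent groups design."""
    all_scores = [x for g in groups.values() for x in g]
    n_total = len(all_scores)                  # N
    grand_total = sum(all_scores)              # G
    correction = grand_total ** 2 / n_total    # G^2 / N

    # SSBetween = sum(T^2 / n) - G^2 / N
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups.values()) - correction
    # SSWithin = sum of the SS inside each group
    ss_within = sum(
        sum(x ** 2 for x in g) - sum(g) ** 2 / len(g) for g in groups.values()
    )

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


result = independent_groups_anova(scores)
for name, value in result.items():
    print(f"{name:>11}: {value:.3f}")
```

The printed values should agree with the source table (SSBetween = 67, SSWithin = 19.5, F of about 25.77), up to rounding.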
The logic of the procedure we will use is to take the error term that would be constructed were these data from an independent groups design (and would include individual differences and random variability) and remove the portion due to individual differences, which leaves behind the random variability that we want in our error term. Conceptually, then, our F ratio would be composed of the components seen below:

$$F = \frac{\text{Treatment Effect} + \text{Random Variability}}{\text{Random Variability}}$$

Remember, however, that even though the components in the numerator of the F ratio differ in the independent groups and repeated measures ANOVAs, the computations are identical. That is, regardless of the nature of the design, the formula for SSBetween is:

$$SS_{Treatment} = \sum \frac{T^2}{n} - \frac{G^2}{N}$$

And the formula for dfBetween is:

$$df_{Treatment} = k - 1$$

Furthermore, you'll still need to compute the SSError for the independent groups ANOVA (which is just the sum of the SS for each condition) and the dfError for the independent groups ANOVA (which is just n - 1 for each condition times the number of conditions). However, because this "old" error term contains both individual differences and random variability, we need to estimate and remove the contribution of individual differences.

We estimate the contribution of individual differences using the same logic as we use when computing the variability among treatments. That is, we treat each participant as a level of a factor (think of the factor as "Subject" or "Participant"). If you think of the computation this way, you'll immediately notice that the formulas for SSBetween and SSSubject are identical, with the SSBetween working on columns while the SSSubject works on rows. The actual formula would be:

$$SS_{Subject} = \sum \frac{P^2}{k} - \frac{G^2}{N}$$

If you'll look at our data again, to complete your computation you would need to sum across each of the participants and then square those sums before adding them and dividing by the number of treatments.
      P1   P2   P3   P4   P5   P6   Mean
a1     3    5    2    6    4    3   3.83
a2     7    6    9    7    8    7   7.33
a3     9    8    9    7    9    8   8.33
P     19   19   20   20   21   18

Your computation of SSSubject would be:

$$SS_{Subject} = \frac{19^2 + 19^2 + 20^2 + 20^2 + 21^2 + 18^2}{3} - \frac{117^2}{18} = \frac{2287}{3} - 760.5 = 1.83$$

You would then enter the SSSubject into the source table and subtract it from the SSWithin (which is the error term from the independent groups design). As seen in the source table below, when you subtract that SSSubject, you are left with SSError = 17.67. The SS in the denominator of the repeated measures design will always be less than that found in an independent groups design for the same scores.

Source             SS      df    MS      F
Between            67       2    33.5    18.93
Within Groups      19.5    15
   Subject          1.83    5
   Error           17.67   10     1.77
Total              86.5    17

Of course, you need to apply the same procedure to the degrees of freedom. The dfWithinGroups for the independent groups design must be reduced by the dfSubject. The dfSubject is simply:

$$df_{Subject} = n - 1$$

Just as you should note the parallel between the SSBetween and the SSSubject, you should also note the parallel between the dfBetween and the dfSubject. Because you remove the dfSubject, the df in the error term for the repeated measures design will always be less than the df in the error term for an independent groups design for the same scores. Furthermore, it will always be true that the dfError in a repeated measures design is the product of the dfBetween and the dfSubject.

For completeness, below is the source table that StatView would generate for these data using a repeated measures ANOVA:

ANOVA Table for A
                            DF   Sum of Squares   Mean Square   F-Value   P-Value   Lambda   Power
Subject                      5            1.833          .367
Category for A               2           67.000        33.500    18.962     .0004   37.925    .999
Category for A * Subject    10           17.667         1.767

Means Table for A
Effect: Category for A
      Count    Mean   Std. Dev.   Std. Err.
a1        6   3.833       1.472        .601
a2        6   7.333       1.033        .422
a3        6   8.333        .816        .333

You should note the differences between the source table that you would generate doing the analysis as shown in your Gravetter & Wallnau textbook and the one generated by StatView. First of all, the SS and df columns are reversed. But more important, you need to note that the first row is the Subject effect, the second row is the Treatment Effect (called Category for A), and the third row is the Error Effect (Random), which appears as Category for A * Subject. Thus, the F ratio appears in the second row, but is the expected ratio of the MSBetween and the MSError.

You should also note a perplexing result. Generally speaking, the repeated measures design is more powerful than the independent groups design. Thus, you should expect that the F ratio would be larger for the repeated measures design than it is for the independent groups design. For these data, however, that's not the case. Note that for the independent groups ANOVA, F = 25.8 and for the repeated measures ANOVA, F = 18.9. (For the repeated measures analysis, the difference between the StatView F and the calculator-computed F is due to rounding error.) What happened?

Think, first of all, of the formula for the F ratio. The numerator is identical, whether the analysis is for an independent groups design or a repeated measures design. So for any difference in the F ratio to emerge, it has to come from the denominator. Generally speaking, as seen in the formula below, larger F ratios would come from larger dfError and smaller SSError.

$$F = \frac{MS_{Treatment}}{SS_{Error} / df_{Error}}$$

But, for identical data, the dfError will always be smaller for a repeated measures analysis! So, how does the increased power emerge? Again, for identical data, it's also true that the SSError will always be smaller for a repeated measures analysis. As long as the SSSubject is substantial, the F ratio will be larger for the repeated measures analysis.
For these data, however, the SSSubject is actually fairly small, resulting in a smaller F ratio. Thus, the power of the repeated measures design emerges from the presumption that people will vary. That is, you're betting on substantial individual differences. As you look at the people around you, that presumption is not all that unreasonable.

Use the source table below to determine the break-even point for this data set. What SSSubject would need to be present to give you the exact same F ratio as for the independent groups ANOVA?

Source             SS      df    MS      F
Between            67       2    33.5    25.8
Within Groups      19.5    15
   Subject         ____     5
   Error           ____    10    ____
Total              86.5    17

So, as long as you had more than that level of SSSubject you would achieve a larger F ratio using the repeated measures design.

Testing the Null Hypothesis and Post Hoc Tests for Repeated Measures ANOVAs

You would set up and test the null hypothesis for a repeated measures design just as you would for an independent groups design. That is, for this example, the null and alternative hypotheses would be identical for the two designs:

H0: μ1 = μ2 = μ3
H1: Not H0

To test the null hypothesis for a repeated measures design, you would look up the FCritical with the dfBetween and the dfError found in your source table. That is, for this example, FCrit(2,10) = 4.10.

If you reject H0, as you would in this case, you would then need to compute a post hoc test to determine exactly which of the conditions differed. Again, the computation of Tukey's HSD would parallel the procedure you used for an independent groups analysis. In this case, for the independent groups design, your Tukey's HSD would be:

$$HSD = 3.67\sqrt{\frac{1.3}{6}} = 1.71$$

For the repeated measures design, your Tukey's HSD would be:

$$HSD = 3.88\sqrt{\frac{1.77}{6}} = 2.1$$

Ordinarily, of course, your HSD would be smaller for the repeated measures design, due to the typical reduction in the MSError. For this particular data set, given the lack of individual differences, that's not the case.
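The repeated measures partition and the two HSD values can be checked with a short script. This is a minimal Python sketch under stated assumptions: the function names `repeated_measures_anova` and `tukey_hsd` are illustrative, each list is assumed to be ordered by participant, and the q values (3.67 and 3.88) are simply the ones used in the text rather than values looked up by the code.

```python
import math

# Scores from the example; each list is assumed to be ordered by participant.
treatments = {
    "a1": [3, 5, 2, 6, 4, 3],
    "a2": [7, 6, 9, 7, 8, 7],
    "a3": [9, 8, 9, 7, 9, 8],
}


def repeated_measures_anova(data):
    """Partition SS and df for a single-factor repeated measures design."""
    columns = list(data.values())
    k = len(columns)                 # number of treatments
    n = len(columns[0])              # number of participants
    grand_total = sum(sum(c) for c in columns)
    correction = grand_total ** 2 / (k * n)          # G^2 / N

    ss_between = sum(sum(c) ** 2 for c in columns) / n - correction
    ss_within = sum(sum(x ** 2 for x in c) - sum(c) ** 2 / n for c in columns)
    # Participant totals (P) are the row sums across treatments.
    participant_totals = [sum(row) for row in zip(*columns)]
    ss_subject = sum(p ** 2 for p in participant_totals) / k - correction
    ss_error = ss_within - ss_subject                # remove individual differences

    df_between, df_subject = k - 1, n - 1
    df_error = df_between * df_subject
    ms_between = ss_between / df_between
    ms_error = ss_error / df_error
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "ss_subject": ss_subject, "ss_error": ss_error,
        "df_error": df_error, "ms_error": ms_error,
        "f": ms_between / ms_error,
    }


def tukey_hsd(q, ms_error, n):
    """HSD = q * sqrt(MSError / n); q must be supplied from a table."""
    return q * math.sqrt(ms_error / n)


rm = repeated_measures_anova(treatments)
hsd_independent = tukey_hsd(3.67, 1.3, 6)            # MSError from independent groups
hsd_repeated = tukey_hsd(3.88, rm["ms_error"], 6)    # MSError from repeated measures
print(rm)
print(round(hsd_independent, 2), round(hsd_repeated, 2))
```

Because the script carries MSError at full precision rather than the rounded 1.77, its F matches the StatView value (18.962) rather than the calculator value (18.93).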
A Computational Example

RESEARCH QUESTION: Does behavior modification (response-cost technique) reduce the outbursts of unruly children?

EXPERIMENT: Randomly select 6 participants, who are tested before treatment, then one week, one month, and six months after treatment. The IV is the duration of the treatment. The DV is the number of unruly acts observed.

STATISTICAL HYPOTHESES:
H0: μBefore = μ1Week = μ1Month = μ6Months
H1: Not H0

DECISION RULE: If FObt ≥ FCrit, reject H0. FCrit(3,15) = 3.29

DATA:

           Before   1 Week   1 Month   6 Months   P / Sum
P1              8        2         1          1        12
P2              4        1         1          0         6
P3              6        2         0          2        10
P4              8        3         4          1        16
P5              7        4         3          2        16
P6              6        2         1          1        10
Mean          6.5      2.3       1.7        1.2
T (ΣX)         39       14        10          7        70
ΣX²           265       38        28         11       342
SS           11.5      5.3      11.3        2.8      30.9

SOURCE TABLE:

SOURCE           Formula                                  SS    df    MS    F
Between          Σ(T²/n) - G²/N
Within grps      Σ SS in each group
Between subjs    Σ(P²/k) - G²/N
Error            SSWithin Groups - SSBetween subjects
Total            ΣX² - G²/N

DECISION:

POST HOC TEST:

INTERPRETATION:

EFFECT SIZE:

Suppose that you continued to assess the amount of unruly behavior in the children after the treatment was withdrawn. You assess the number of unruly acts after 12 months, 18 months, 24 months, and 30 months. Suppose that you obtain the following data. What could you conclude?

           12 Months   18 Months   24 Months   30 Months   P / Sum
P1                 1           2           2           5        10
P2                 2           2           3           4        11
P3                 1           3           3           4        11
P4                 3           4           4           6        17
P5                 2           2           3           5        12
P6                 1           2           4           4        11
T (ΣX)            10          15          19          28        72
ΣX²               20          41          63         134

SOURCE           Formula                                  SS    df    MS    F
Between          Σ(T²/n) - G²/N
Within grps      Σ SS in each group
Between subjs    Σ(P²/k) - G²/N
Error            SSWithin Groups - SSBetween subjects
Total            ΣX² - G²/N

DECISION:

POST HOC TEST:

INTERPRETATION:

EFFECT SIZE:

An Example to Compare Independent Groups and Repeated Measures ANOVAs

Independent Groups ANOVA

            A1     A2     A3     A4     Sum
             1      2      3      4
             1      3      4      5
             2      3      4      6
             4      3      5      6
T (ΣX)       8     11     16     21     56 (G)
ΣX²         22     31     66    113
SS           6    .75      2   2.75     11.5
s²           2    .25    .67    .92

SOURCE      SS    df    MS    F
Between
Error
Total

Repeated Measures ANOVA

Data for A1, A2, A3, and A4: exactly the same as above.

SOURCE           SS    df    MS    F
Between
Within Groups
Between Subjs
Error
Total

Repeated Measures Analyses: The Error Term

In a repeated measures analysis, the MSError is actually the interaction between participants and treatment. However, that won't make much sense to you until we've talked about two-factor ANOVA. For now, we'll simply look at the data that would produce different kinds of error terms in a repeated measures analysis, to give you a clearer understanding of the factors that influence the error term.

These examples are derived from the example in your textbook (G&W, p. 464). Imagine a study in which rats are given each of three types of food rewards (2, 4, or 6 grams) when they complete a maze. The DV is the time to complete the maze. As you can see in the graph below, Participant1 is the fastest and Participant6 is the slowest. The differences in average performance represent individual differences. If the 6 lines were absolutely parallel, the MSError would be 0, so an F ratio could not be computed. So, I've tweaked the data to be sure that the lines were not perfectly parallel. Nonetheless, if performance was as illustrated below, the MSError would be quite small. The data are seen below in tabular form and then in graphical form.
Small MSError

         2 grams   4 grams   6 grams      P
P1           1.0       1.5       2.0     4.5
P2           2.0       2.5       3.5     8.0
P3           3.0       3.5       5.0    11.5
P4           4.0       5.0       6.0    15.0
P5           5.0       6.5       7.0    18.5
P6           6.0       7.5       9.0    22.5
Mean         3.5      4.42      5.42
s²           3.5      5.44      6.24

[Figure: Small MSError. Speed of Response plotted against Amount of Reward (grams), with one line for each of Participant1 through Participant6.]

The ANOVA on these data would be as seen below. Note that the F-ratio would be significant.

ANOVA Table for Reward
                                 DF   Sum of Squares   Mean Square   F-Value   P-Value   Lambda   Power
Subject                           5           74.444        14.889
Category for Reward               2           11.028         5.514    37.453    <.0001   74.906   1.000
Category for Reward * Subject    10            1.472          .147

Means Table for Reward
Effect: Category for Reward
             Count    Mean   Std. Dev.   Std. Err.
Reward 2g        6   3.500       1.871        .764
Reward 4g        6   4.417       2.333        .952
Reward 6g        6   5.417       2.498       1.020

Moderate MSError

Next, keeping all the data the same (so SSTotal would be unchanged), and only rearranging data within a treatment (so that the s² for each treatment would be unchanged), I've created greater interaction between participants and treatment. Note that the participant means would now be closer together, which means that the SSSubject is smaller. In the data table below, you'll note that the sums across participants (P) are more similar than in the earlier example.

         2 grams   4 grams   6 grams      P
P1           1.0       1.5       3.5     6.0
P2           2.0       3.5       5.0    10.5
P3           3.0       2.5       2.0     7.5
P4           4.0       6.5       6.0    16.5
P5           5.0       5.0       9.0    19.0
P6           6.0       7.5       7.0    20.5
Mean         3.5      4.42      5.42
s²           3.5      5.44      6.24

[Figure: Moderate MSError. Speed of Response plotted against Amount of Reward, with one line for each of Participant1 through Participant6.]

Note that the F-ratio is still significant, though it is much reduced. Note, also, that the MSTreatment is the same as in the earlier example.

ANOVA Table for Reward
                                 DF   Sum of Squares   Mean Square   F-Value   P-Value   Lambda   Power
Subject                           5           63.111        12.622
Category for Reward               2           11.028         5.514     4.306     .0448    8.612    .606
Category for Reward * Subject    10           12.806         1.281

Means Table for Reward
Effect: Category for Reward
             Count    Mean   Std. Dev.   Std. Err.
Reward 2g        6   3.500       1.871        .764
Reward 4g        6   4.417       2.333        .952
Reward 6g        6   5.417       2.498       1.020

Large MSError

Next, using the same procedure, I'll rearrange the scores even more, which will produce an even larger MSError. Note, again, that the SSSubject grows smaller (as the Participant means grow closer to one another) and the SSError grows larger.

         2 grams   4 grams   6 grams      P
P1           1.0       3.5       6.0    10.5
P2           2.0       6.5       9.0    17.5
P3           3.0       7.5       3.5    14.0
P4           4.0       1.5       5.0    10.5
P5           5.0       2.5       7.0    14.5
P6           6.0       5.0       2.0    13.0
Mean         3.5      4.42      5.42
s²           3.5      5.44      6.24

[Figure: Large MSError. Speed of Response plotted against Amount of Reward, with one line for each of Participant1 through Participant6.]

ANOVA Table for Reward
                                 DF   Sum of Squares   Mean Square   F-Value   P-Value   Lambda   Power
Subject                           5           11.778         2.356
Category for Reward               2           11.028         5.514      .860     .4524    1.719    .155
Category for Reward * Subject    10           64.139         6.414

Means Table for Reward
Effect: Category for Reward
             Count    Mean   Std. Dev.   Std. Err.
Reward 2g        6   3.500       1.871        .764
Reward 4g        6   4.417       2.333        .952
Reward 6g        6   5.417       2.498       1.020

Varying Individual Differences

It is possible to keep the MSError constant while increasing the MSSubject, as the two examples below illustrate. As you see in the first example, the SSSubject is fairly small and the MSError is quite small.
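Both ideas in this section (rearranging scores within a treatment moves variability from SSSubject into SSError, and shifting a participant's scores by a constant changes SSSubject but not SSError) can be checked numerically. The following is a minimal Python sketch. The function name `partition` is illustrative, the first two data sets are the Small and Moderate MSError tables above, and the shift of 1.0 applied to the first two and last two participants is a hypothetical value chosen only for illustration.

```python
def partition(columns):
    """Split SSTotal into treatment, subject, and error components.

    `columns` holds one list of scores per treatment, ordered by participant.
    """
    k = len(columns)                 # number of treatments
    n = len(columns[0])              # number of participants
    grand_total = sum(sum(col) for col in columns)
    correction = grand_total ** 2 / (k * n)          # G^2 / N

    ss_total = sum(x ** 2 for col in columns for x in col) - correction
    ss_treatment = sum(sum(col) ** 2 for col in columns) / n - correction
    participant_totals = [sum(row) for row in zip(*columns)]
    ss_subject = sum(p ** 2 for p in participant_totals) / k - correction
    ss_error = ss_total - ss_treatment - ss_subject
    return {
        "total": ss_total,
        "treatment": ss_treatment,
        "subject": ss_subject,
        "error": ss_error,
    }


# Small MSError data (2, 4, and 6 grams), from the table in the text.
small_data = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.5, 2.5, 3.5, 5.0, 6.5, 7.5],
    [2.0, 3.5, 5.0, 6.0, 7.0, 9.0],
]
# Moderate MSError data: same scores, rearranged within the 4 g and 6 g columns.
moderate_data = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.5, 3.5, 2.5, 6.5, 5.0, 7.5],
    [3.5, 5.0, 2.0, 6.0, 9.0, 7.0],
]
# Hypothetical constant shifts per participant (not the values behind the graphs).
shifts = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0]
shifted_data = [[x + s for x, s in zip(col, shifts)] for col in small_data]

small = partition(small_data)
moderate = partition(moderate_data)
shifted = partition(shifted_data)

for label, ss in (("small", small), ("moderate", moderate), ("shifted", shifted)):
    print(label, {name: round(value, 3) for name, value in ss.items()})
```

In the output, the rearranged data keep the same total and treatment SS while the subject SS falls and the error SS rises; the shifted data keep the same error SS while the subject SS rises.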
Small Individual Differences

[Figure: Small Individual Differences. Speed of Response plotted against Amount of Reward (grams), with one line for each of Participant1 through Participant6.]

ANOVA Table for Reward
                                 DF   Sum of Squares   Mean Square   F-Value   P-Value    Lambda   Power
Subject                           5           54.125        10.825
Category for Reward               2           15.250         7.625   305.000    <.0001   610.000   1.000
Category for Reward * Subject    10             .250          .025

Means Table for Reward
Effect: Category for Reward
             Count    Mean   Std. Dev.   Std. Err.
Reward 2g        6   4.500       1.871        .764
Reward 4g        6   5.500       1.871        .764
Reward 6g        6   6.750       1.969        .804

Next, I've decreased the first two participants' scores by a constant amount and increased the last two participants' scores by a constant amount. Because the interaction between participant and treatment is the same, the MSError is unchanged. However, because the means for the 6 participants are more different than before, the SSSubject increases.

Moderate Individual Differences

[Figure: Moderate Individual Differences. Speed of Response plotted against Amount of Reward (grams), with one line for each of Participant1 through Participant6.]

ANOVA Table for Reward
                                 DF   Sum of Squares   Mean Square   F-Value   P-Value    Lambda   Power
Subject                           5          114.125        22.825
Category for Reward               2           15.250         7.625   305.000    <.0001   610.000   1.000
Category for Reward * Subject    10             .250          .025

Means Table for Reward
Effect: Category for Reward
             Count    Mean   Std. Dev.   Std. Err.
Reward 2g        6   4.500       2.739       1.118
Reward 4g        6   5.500       2.739       1.118
Reward 6g        6   6.750       2.806       1.146

StatView for Repeated Measures ANOVA: G&W 465

First, enter as many columns (variables) as you have levels of your independent variable. Below left are the data, with each column containing scores for a particular level of the IV. The next step is to highlight all 3 columns and then click on the Compact button. You'll then get the window seen below on the right, which allows you to name the IV (as a compact variable).
Note that your data window will now reflect the compacting process, with Stimulus appearing above the 3 columns. To produce the analysis, choose Repeated Measures ANOVA from the Analyze menu. Move your compacted variable to the Repeated measure box on the left, as seen below left. Then, click on OK to produce the actual analysis, seen below right.

ANOVA Table for Stimulus
                                   DF   Sum of Squares   Mean Square   F-Value   P-Value   Lambda   Power
Subject                             4            6.000         1.500
Category for Stimulus               2           30.000        15.000     4.286     .0543    8.571    .568
Category for Stimulus * Subject     8           28.000         3.500

Means Table for Stimulus
Effect: Category for Stimulus
            Count    Mean   Std. Dev.   Std. Err.
Neutral         5   3.000        .707        .316
Pleasant        5   6.000       2.121        .949
Aversive        5   3.000       1.871        .837

Note that these results are not quite significant, so you would not ordinarily compute a post hoc test. Nonetheless, just to show you how to compute a post hoc test, choose Tukey/Kramer from the Post-hoc tests found on the left, under ANOVA. That will produce the table seen below. It's no surprise that none of the comparisons are significant, given that the overall ANOVA did not produce any significant results.

Tukey/Kramer for Stimulus
Effect: Category for Stimulus
Significance Level: 5 %
                      Mean Diff.   Crit. Diff.
Neutral, Pleasant         -3.000         3.380
Neutral, Aversive          0.000         3.380
Pleasant, Aversive         3.000         3.380

Practice Problems

Drs. Dewey, Stink, & Howe were interested in memory for various odors. They conducted a study in which 6 participants were exposed to 10 common food odors (orange, onion, etc.) and 10 common non-food odors (motor oil, skunk, etc.) to see if people are better at identifying one type of odorant or the other. The 20 odors were presented in a random fashion, so that both classes of odors occurred equally often at the beginning of the list, at the end of the list, etc. (Thus, this randomization is a strategy that serves the same function as counterbalancing.)
The dependent variable is the number of odors of each class correctly identified by each participant. The data are seen below. Analyze the data and fully interpret the results of this study.

                  Scores                     ΣX (T)    ΣX²     SS
Food Odors        7   8   6   9   7   5          42    304     10
Non-Food Odors    4   6   4   7   5   3          29    151   10.8

Suppose that Dr. Belfry was interested in conducting a study about the auditory capabilities of bats, looking at bats' abilities to avoid wires of varying thickness as they traverse a maze. The DV is the number of times that the bat touches the wires. (Thus, higher numbers indicate an inability to detect the wire.) Complete the source table below and fully interpret the results.

Dr. Richard Noggin is interested in the effect of different types of persuasive messages on a person's willingness to engage in socially conscious behaviors. To that end, he asks his participants to listen to each of four different types of messages (Fear Invoking, Appeal to Conscience, Guilt, and Information Laden). After listening to each message, the participant rates how effective the message was on a scale of 1-7 (1 = very ineffective and 7 = very effective). Complete the source table and analyze the data as completely as you can.

Dr. Beau Peep believes that pupil size increases during emotional arousal. He was interested in testing whether the increase in pupil size was a function of the type of arousal (pleasant vs. aversive). A random sample of 5 participants is selected for the study. Each participant views all three stimuli: neutral, pleasant, and aversive photographs. The neutral photograph portrays a plain brick building. The pleasant photograph consists of a young man and woman sharing a large ice cream cone. Finally, the aversive stimulus is a graphic photograph of an automobile accident. Upon viewing each photograph, the pupil size is measured in millimeters.
An incomplete source table resulting from analysis of these data is seen below. Complete the source table and analyze the data as completely as possible.

Means Table for Stimulus
Effect: Category for Stimulus
            Count    Mean   Std. Dev.   Std. Err.
Neutral         5   2.600        .548        .245
Pleasant        5   6.400       1.517        .678
Aversive        5   4.400       1.140        .510

Suppose you are interested in studying the impact of duration of exposure to faces on the ability of people to recognize faces. To finesse the issue of the actual durations used, I'll call them Short, Medium, and Long durations. Participants are first exposed to a set of 30 faces for one duration and then tested on their memory for those faces. Then they are exposed to another set of 30 faces for a different duration and then tested. Finally, they are given a final set of 30 faces for the final duration and then tested. The DV for this analysis is the percent Hits (saying Old to an Old item). Suppose that the results of the experiment come out as seen below. Complete the analysis and interpret the results as completely as you can. If the results turned out as seen below, what would they mean to you? [15 pts]

Means Table for Duration
Effect: Category for Duration
          Count     Mean   Std. Dev.   Std. Err.
Short        24   43.833       7.257       1.481
Medium       24   47.792       7.342       1.499
Long         24   49.917       6.978       1.424