1 ANOVA PSY 211 4-14-09 TERM PAPERS DUE THURSDAY (4/16) INCLUDE SPSS OUTPUT INCLUDE 1 COPY OF ALL SOURCES A. Types of Analyses (some review) Continuous variables: Values are numbers that have meaning, such as rating scales or other numeric measurements (IQ, 9-pt attractiveness scale, height) Categorical variables: Groups, any numbers assigned are meaningless (gender, ethnicity, experimental condition, treatment group) Independent Variable(s) Dependent Variable Correlation (r) One continuous One continuous Multiple Regression (R) Multiple continuous One continuous Between-group t-test (t) Categorical (2 categories) One continuous ANOVA (F) Categorical (2+ categories) One continuous Categorical Categorical Type Chi square (χ2) 2 B. Review of t-tests Two commonly used t-tests Between-group t-test: two groups of people are compared on some continuous outcome variable Repeated-measures t-test: same group of people compared on some outcome variable across two time points C. Introduction to ANOVA ANOVA = analysis of variance Two types o ANOVA: compares multi-category variable to a continuous variable; like the betweengroup t-test but can involve 2+ groups o Repeated-measures ANOVA: examines how scores change across multiple time points; like repeated-measures t-test but can involve 2+ time points Thus, ANOVA are like a more powerful t-tests ANOVA t-test Note: If you have two groups or two time points, technically you could do some type of t-test or some type of ANOVA, but t-tests are simpler, so we do them when we can. 3 D. How does it work? “Variance” indicates amount of variability or differences Experimental manipulations (e.g. treatment group) or demographics (e.g. ethnicity) tend to increase variability – they cause people to get different scores on some outcome variable (e.g. anxiety) ANOVA examines whether the amount of variability we obtain is greater than what we’d expect by chance. If so, there are probably some interesting group differences Assume we are examining quality of K-12 education across several ethnic groups. Just by chance, we would expect some small differences to show up in our sample. ANOVA is used to examine whether there is more variability than what we’d expect by chance. E. ANOVA Lingo Factor: the categorical independent variable - Favorite color Level: specific value/condition within the factor - Blue A researcher wants to compare three treatments (placebo pill, medication, and surgery) to see which is most effective in improving heart health. - What is the IV? DV? - What is the factor? Levels? 4 F. Hypothesis Testing Hypothesis testing with ANOVA can be a bit peculiar. Let’s examine hypothesis testing, where the IV has three categories H0: μ1 = μ2 = μ3 or At the population level, groups 1, 2, and 3 all have the same mean on some variable. 10.00 Mean Health 8.00 6.00 4.00 2.00 0.00 Placebo Medication Surgery HA: At the population level, at least one mean differs from the others. HA will be true under a wide variety of conditions… μ1 ≠ μ2 ≠ μ3 (all means are different) μ1 = μ3, but μ2 is different (at least one mean differs) 5 8.00 Mean Health 6.00 4.00 2.00 10.00 0.00 Placebo Medication Surgery Placebo Medication Surgery Mean Health 8.00 6.00 4.00 2.00 0.00 8.00 Mean Health 6.00 4.00 2.00 0.00 Placebo Medication Surgery Alternative hypothesis is true if there are any mean differences at the population level 6 Because of sampling error, we always expect some mild group differences in scores at the sample level Based on the sample size, are the amount of differences observed in the sample so reliable that they would be expected to occur at the population level? To determine this, ANOVA relies on the F statistic G. Significance Testing with the F Statistic F ranges from 0 to ∞ Measures whether obtained variability is greater than chance When F is small, go with null hypothesis When F is large, go with alternative hypothesis Like t-tests, the cut score is based on degrees of freedom, so the critical F value needed for significance varies from study to study As F gets bigger and bigger, the observed difference is greater and greater; less likely due to chance 7 For PSY 211, we will not hand-calculate F. We will learn to understand it conceptually and let SPSS take care of the grunt work Some conceptual definitions of F: F = variance (differences) between groups variance (differences) within groups F = total amount of variability in scores variability due to chance Huh? Examples: F = differences in IQ across SES groups differences in IQ within SES groups F = differences in liberalism across music genres differences in liberalism within music genres Basically, we are checking to see whether the differences across groups are larger than the chance or meaningless differences we can observe within groups 8 Example with Data: Where the null hypothesis is supported… Health scores across groups 7 different people per group Placebo Medication 1 1 1 2 2 3 3 3 5 5 6 6 8 8 Surgery 2 2 2 3 5 6 8 Health scores across groups 7 different people per group Placebo Medication 1 1 1 2 2 3 3 3 5 5 6 6 8 8 Surgery 2 2 2 3 5 6 8 Health scores across groups 7 different people per group Placebo Medication 1 1 1 2 2 3 3 3 5 5 6 6 8 8 Surgery 2 2 2 3 5 6 8 Differences within each group are chance differences (bottom of fraction for F equation) Differences across groups (top of fraction for F equation) 9 F = total variability / chance variability Here, total variability and chance variability are about the same, so F will be near one Why? Any number divided by itself equals one o 100/100 = 1 o 2.76/2.76 = 1 o 0.38/0.38 = 1 When the total amount of variability (across groups) is about the same as the chance variability (within groups), F will be small What if there was a treatment effect? 10 Example with Data: Where the alternative hypothesis is supported… Health scores across groups 7 different people per group Placebo Medication 1 4 1 4 1 4 1 4 2 5 2 5 2 5 Surgery 8 8 8 9 9 9 9 Health scores across groups 7 different people per group Placebo Medication 1 4 1 4 1 4 1 4 2 5 2 5 2 5 Surgery 8 8 8 9 9 9 9 Health scores across groups 7 different people per group Placebo Medication 1 4 1 4 1 4 1 4 2 5 2 5 2 5 Surgery 8 8 8 9 9 9 9 Differences within each group are chance differences (bottom of fraction for F equation) Differences across groups (top of fraction for F equation) 11 F = total variability / chance variability Here, total variability is much greater than chance variability, so F will be large Why? Any big number divided by a much smaller number yields a big number o 100/10 = 10 o 2.76/0.21 = 13.14 o 0.38/0.11 = 3.45 As the total amount of variability (across groups) gets bigger than chance variability (within groups), F will get bigger Thus, F increases as group differences account for more and more of the variability F is small (near 1.0) when most of the variability is due to chance Distribution of F scores has a different shape than the z or t distributions. Also, the critical F value for determining significance depends on sample size and the number of groups. 12 H. Post-hoc tests The F test can be used to tell us whether the null hypothesis should be rejected Weakness: F test tells us that at least one of the groups has a significantly different mean, but doesn’t tell us which group(s) significantly differ This is about as useful as saying “one of the treatment conditions had some effect” Obviously, we want to know which condition or conditions differed, and whether scores were significantly higher or lower Post hoc test: Post hoc is Latin for “after this.” The F test tells us whether any groups differ, and the post hoc test tells test us how the groups differ (e.g. whether surgery is better than medication, or vice versa) There are many post hoc tests. They all differ slightly in how they calculate error terms; some are liberal, some conservative in judging which groups significantly differ 13 5 Mean 76. Pro-Bush 4 3 2 1 0 9/11 Oil Spread Democracy 28. Reason for the War Mean 40. Cigarettes Smoked per Day 3 2.5 2 1.5 1 0.5 0 Shame Anger Sadness 22. Negative Emotion (Childhood) Anxiety 14 Mean 103. Extraversion 7 6 5 4 3 Friends Family Love Success 19. Top Life Priority We’ll use the LSD (Least Significant Difference) post hoc test The Bonferroni and Scheffe tests are also commonly used The LSD test is very liberal; it will pick up on any major group differences, but has the highest risk of Type I errors Other tests are more conservative about labeling group differences as significant, and nerds fight about which test is best to use Different kind of LSD 15 I. Examples Using SPSS Is top life priority (friends, family, love, or success) related to extraversion (9-point rating)? De scri ptives 103. Extravers ion Friends Family Love Success Total N 23 188 56 12 279 Mean 4.83 5.38 5.50 6.00 5.39 St d. Deviation 1.825 2.110 2.098 1.595 2.067 St d. Error .381 .154 .280 .461 .124 95% Confidenc e Int erval for Mean Lower Upper Bound Bound 4.04 5.62 5.08 5.69 4.94 6.06 4.99 7.01 5.14 5.63 Minimum 1 1 2 4 1 ANOVA 103. Extraversion Between Groups W ithin Groups Total Sum of Squares 12.464 1175.730 1188.194 df 3 275 278 Mean Square 4.155 4.275 F .972 Sig. .406 1) Check the descriptives 2) Look at the p-value “Sig” 3) If p < .05, the result is significant, so look at the post-hoc test (later in Output). If p < .05, the result is not significant, so ignore any additional Output. - ANOVA uses two different types of degrees of freedom (reference numbers) to determine the shape of the F-distribution, and thus the critical F value. One is based on sample size; the other on the number of groups. - F is used to determine the p-value Life priority did not significantly predict level of extraversion, F(3,275) = 0.97, p = .41. Maximum 7 9 9 9 9 16 The relationship between eyewear and GPA using ANOVA. De scri ptives 41. GPA (High School) Glasses Contac t Lenses Neither Total N 93 85 101 279 Mean 3.141 3.493 3.448 3.359 St d. Deviation .6604 .4154 .5013 .5578 St d. Error .0685 .0451 .0499 .0334 95% Confidenc e Int erval for Mean Lower Upper Bound Bound 3.005 3.277 3.403 3.583 3.349 3.546 3.293 3.425 Minimum 1.3 2.3 1.7 1.3 Maximum 4.0 4.0 4.0 4.0 ANOVA 41. GPA (High School) Between Groups W ithin Groups Total Sum of Squares 6.742 79.752 86.494 df 2 276 278 Mean Square 3.371 .289 F 11.666 Sig. .000 This time there is a significant effect, so we need to examine the post-hoc test to see which groups reliably differ. Multiple Comparisons Dependent Variable: 41. GPA (High School) LSD (I) 29. Lens es Glasses Contact Lenses Neither (J) 29. Lens es Contact Lenses Neither Glasses Neither Glasses Contact Lenses Mean Difference (I-J) -.3521* -.3067* .3521* .0454 .3067* -.0454 Std. Error .0807 .0773 .0807 .0791 .0773 .0791 *. The mean difference is s ignificant at the .05 level. Sig. .000 .000 .000 .566 .000 .566 95% Confidence Interval Lower Bound Upper Bound -.511 -.193 -.459 -.155 .193 .511 -.110 .201 .155 .459 -.201 .110 17 Multiple Comparisons Dependent Variable: 41. GPA (High School) LSD (I) 29. Lens es Glasses Contact Lenses Neither (J) 29. Lens es Contact Lenses Neither Glasses Neither Glasses Contact Lenses Mean Difference (I-J) -.3521* -.3067* .3521* .0454 .3067* -.0454 Std. Error .0807 .0773 .0807 .0791 .0773 .0791 Sig. .000 .000 .000 .566 .000 .566 95% Confidence Interval Lower Bound Upper Bound -.511 -.193 -.459 -.155 .193 .511 -.110 .201 .155 .459 -.201 .110 *. The mean difference is s ignificant at the .05 level. The post-hoc test is basically like running t-tests for all possible comparisons. Blue box: Compared glasses to contacts. The mean difference is .35, which is significant. Red box: Compared glasses to neither (no corrective lenses). The mean difference is .31, which is significant. Green box: Compared contact lenses to neither. The mean difference is .045, which is not significant. APA-style: Eyewear was significantly related to GPA, F(2,276) = 11.67, p < .001. The results of a post-hoc LSD test indicated that people who wore glasses had reliably lower grades than people who wore contacts or no lenses. The contact and no lens groups did not significantly differ. Thus, wearing glasses was related to lower GPA. 18 [Examples from old data; skip in class] Is Favorite Fast Food Place related to ability to Delay Gratification? Arby’s, BK, McD’s, T-Bell, Wendy’s, Other Delay of Gratification (patience): 1=low, 7=high Correlation? T-test? ANOVA? Output is fairly easy to understand! De scri ptives 87. Delaying Gratification Arby's Burger King Mc Donald's Taco Bell W endy 's Ot her Total N 54 26 37 120 55 34 326 Mean 4.556 4.346 4.216 4.417 4.364 4.500 4.411 St d. Deviation 1.8801 1.6719 1.8125 1.6783 1.8193 1.7965 1.7532 St d. Error .2558 .3279 .2980 .1532 .2453 .3081 .0971 95% Confidenc e Int erval for Mean Lower Upper Bound Bound 4.042 5.069 3.671 5.021 3.612 4.821 4.113 4.720 3.872 4.855 3.873 5.127 4.220 4.602 Minimum 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ANOVA 87. Delaying Gratification Between Groups W ithin Groups Total Sum of Squares 3.038 995.882 998.920 df 5 320 325 Mean Square .608 3.112 F .195 Sig. .964 Notice that all of the means are about the same. No major differences, so F is small, and there are no statistically significant differences Maximum 7.0 7.0 7.0 7.0 7.0 7.0 7.0 19 There are no significant differences, so no post hoc test is needed. If the F-value was significant, what would the post hoc tests tell us? Ability to delay gratification was unrelated to favorite fast food place, F(5, 320) = 0.20, ns. Thus, customers are probably equally patient across fast food locations. Is sleeping position related crying frequency? Sleeping positions: Back, Side, Stomach, Fetal Crying Frequency: Compared to others I cry a lot… 1 = Disagree – 7 = Agree De scri ptives 85. Cry ing Back Side St omach Fetal Position Total N 22 127 82 95 326 Mean 1.864 2.480 2.695 3.168 2.693 St d. Deviation 1.5825 1.6849 1.7263 1.8022 1.7535 St d. Error .3374 .1495 .1906 .1849 .0971 95% Confidenc e Int erval for Mean Lower Upper Bound Bound 1.162 2.565 2.184 2.776 2.316 3.074 2.801 3.536 2.502 2.884 Step 1: Look at the means for each group. Who cries the most? Who cries the least? ANOVA 85. Cry ing Between Groups W ithin Groups Total Sum of Squares 42.350 956.975 999.325 df 3 322 325 Mean Square 14.117 2.972 F 4.750 Sig. .003 Minimum 1.0 1.0 1.0 1.0 1.0 Maximum 7.0 7.0 7.0 7.0 7.0 20 Step 2: Check for significance. What is the F value? Is it statistically significant (reliable)? Post Hoc Tests Multiple Comparisons Dependent Variable: 85. Crying LSD (I) 30. Sleeping Pos ition Back Side Stomach Fetal Position (J) 30. Sleeping Pos ition Side Stomach Fetal Position Back Stomach Fetal Position Back Side Fetal Position Back Side Stomach Mean Difference (I-J) -.6167 -.8315* -1.3048* .6167 -.2148 -.6881* .8315* .2148 -.4733 1.3048* .6881* .4733 Std. Error .3981 .4139 .4079 .3981 .2442 .2338 .4139 .2442 .2599 .4079 .2338 .2599 *. The mean difference is s ignificant at the .05 level. Step 3: Exam the post hoc results. Which sleep positions significantly differ from each other in terms of crying? Which sleep positions do not differ significantly from each other? (More details on next page) Sig. .122 .045 .002 .122 .380 .003 .045 .380 .069 .002 .003 .069 95% Confidence Interval Lower Upper Bound Bound -1.400 .167 -1.646 -.017 -2.107 -.502 -.167 1.400 -.695 .266 -1.148 -.228 .017 1.646 -.266 .695 -.985 .038 .502 2.107 .228 1.148 -.038 .985 21 Multiple Comparisons Dependent Variable: 85. Crying LSD (I) 30. Sleeping Pos ition Back Side Stomach Fetal Position (J) 30. Sleeping Pos ition Side Stomach Fetal Position Back Stomach Fetal Position Back Side Fetal Position Back Side Stomach Mean Difference (I-J) -.6167 -.8315* -1.3048* .6167 -.2148 -.6881* .8315* .2148 -.4733 1.3048* .6881* .4733 Std. Error .3981 .4139 .4079 .3981 .2442 .2338 .4139 .2442 .2599 .4079 .2338 .2599 Sig. .122 .045 .002 .122 .380 .003 .045 .380 .069 .002 .003 .069 *. The mean difference is s ignificant at the .05 level. Comparing Back to Side (pink box), the Mean Difference is small and non-significant. Comparing Back to Fetal Position (blue box), the mean difference in crying is much larger. The negative sign indicates that Back sleepers cry less than Fetal Position sleeper. The Sig. value (p-value) shows that this is statistically significant (reliable). Note that the post hoc table allows us to compare any two groups with each other (kind of like a big table of t-tests all at once), and some of the information is redundant (see gray box). Step 4: Write up the results in APA format. Sleeping positions is related to crying frequency, F(3, 322) = 4.75, p < .05. People who sleep in the fetal position cry the most, and people who sleep in their back cry the least. 95% Confidence Interval Lower Upper Bound Bound -1.400 .167 -1.646 -.017 -2.107 -.502 -.167 1.400 -.695 .266 -1.148 -.228 .017 1.646 -.266 .695 -.985 .038 .502 2.107 .228 1.148 -.038 .985 22 J. Repeated-Measures ANOVA ANOVA is similar to a between-group t-test Repeated-measures ANOVA is similar to a repeated-measures t-test, but with more time points 10 9 8 7 6 Pain 5 4 3 2 1 0 Intake Discharge Follow-up Ignore the dashed lines. The red line shows a repeatedmeasures ANOVA for life satisfaction across 10 different time points among people diagnosed with a disability. 23 Comparison Group High Schizotypy Two different repeated-measures ANOVAs: The comparison group (blue line) has a nonsignificant ANOVA, whereas the Schizotypy group (red line) has a significant ANOVA. Example of hypothesis testing, where a researcher is examining how the dependent variable (e.g. pain score) differs across three time points: H0: μ1 = μ2 = μ3 or At the population level, average scores will not differ across any time point. 10 9 8 7 6 5 4 3 2 1 0 Intake Discharge Follow-up The subscripts 1, 2, and 3 signify different time points. 24 HA: At the population level, the mean for at least one time point will significantly differ from the others Note: HA will be true under a wide variety of conditions… μ1 ≠ μ2 ≠ μ3 (all means are different) μ1 = μ3, but μ2 is different (at least one mean differs) 10 9 8 7 6 5 4 3 2 1 0 10 9 8 7 6 5 4 3 2 1 0 Intake Discharge Follow-up 10 9 8 7 6 5 4 3 2 1 0 Intake Discharge Follow-up 10 9 8 7 6 5 4 3 2 1 0 Intake Discharge Follow-up Intake Discharge Follow-up Alternative hypothesis is true if any of the means significantly (reliably) differ from the others Just like ANOVA uses the F statistic to test whether there are between-group differences, the repeated-measures ANOVA uses the F statistic to examine whether means differ across time points F = total amount of variability in scores variability due to chance 25 Repeated-Measures ANOVA Similarities to Rationale for the F-test is the ANOVA same: compare total variability to the amount expected by chance Post-hoc tests are the same Differences from ANOVA Controls for some individual differences, so tends to be more powerful Smaller sample needed to obtain significant results K. SPSS Examples for Repeated-Measures ANOVA Repeated-measures ANOVA is a bit complicated to run in SPSS, so it is not a requirement of PSY 211 See the following web site if you would ever like to run one: http://academic.reed.edu/psychology/RDDAweb site/spssguide/anova.html Reading and understanding the Output is less difficult, so I will give you some Output to interpret 26 27 Descriptive Statistics Mean 7.8500 5.3000 5.4500 Intake Discharge Follow-up Std. Deviation 1.56525 2.36421 2.06410 N 20 20 20 Tests of Within-Subjects Effects Measure: MEASURE_1 Source factor1 Error(factor1) Sphericity Assumed Greenhous e-Geisser Huynh-Feldt Lower-bound Sphericity Assumed Greenhous e-Geisser Huynh-Feldt Lower-bound Type III Sum of Squares 81.900 81.900 81.900 81.900 56.767 56.767 56.767 56.767 df 2 1.348 1.414 1.000 38 25.607 26.860 19.000 Pa irw ise Com parisons Mean Square 40.950 60.768 57.933 81.900 1.494 2.217 2.113 2.988 F 27.412 27.412 27.412 27.412 Post hoc test Measure: MEA SURE_1 (I) fact or1 1 2 3 (J) fact or1 2 3 1 3 1 2 Mean Difference (I-J) St d. 2.550* 2.400* -2. 550* -.150 -2. 400* .150 E rror .489 .387 .489 .244 .387 .244 a Sig. .000 .000 .000 .545 .000 .545 95% Confidenc e Interval for a Difference Lower Bound Upper Bound 1.526 3.574 1.591 3.209 -3. 574 -1. 526 -.660 .360 -3. 209 -1. 591 -.360 .660 Based on estimated marginal means *. The mean differenc e is significant at the .05 level. a. Adjust ment for multiple comparisons: Least Signific ant Differenc e (equivalent to no adjustments). Sig. .000 .000 .000 .000 28 Red Box: Overview of descriptive statistics for each time point Green Box: These are the degree of freedom (df) values. Remember, the critical t value depended on the df value. The critical F value depends on two df values, one which is related to group size, and one which is related to sample size. These values help determine the critical F value, used to determine whether a finding will be significant, and they are used in the APA-style write-up too. Blue Box: F value and p value. If p is less than .05 (p<.05), the finding is significant, indicating that there is at least one reliable mean difference across time points. Orange Box: These values indicate the mean differences across two time points. For example, the first indicates a mean differences of 2.55 points between time 1 (Intake) and time 2 (Discharge). The next number indicates that there is a 2.40 point difference between time 1 (Intake) and time 3 (Follow-up). Stars (*) indicate a statistically significant (reliable) difference. Pink Box: These are the precise p values for the post hoc tests. If p<.05, the difference is statistically significant. All mean differences in the Orange Box that had a star also have a p<.05. APA Style Write-up: Pain scores significantly differed across time, F(2, 38) = 27.41, p < .05. A post hoc LSD test indicated that pain scores at discharge and follow-up were significantly lower than at intake. 29 A non-significant result…. 30 Descriptive Statistics Mean 7.8000 7.3000 7.2000 Intake Discharge Follow-up Std. Deviation 1.70448 1.65752 1.82382 N 20 20 20 Tests of Within-Subjects Effects Measure: MEASURE_1 Source factor1 Error(factor1) Sphericity Assumed Greenhous e-Geisser Huynh-Feldt Lower-bound Sphericity Assumed Greenhous e-Geisser Huynh-Feldt Lower-bound Type III Sum of Squares 4.133 4.133 4.133 4.133 81.867 81.867 81.867 81.867 df 2 1.613 1.741 1.000 38 30.653 33.075 19.000 Mean Square 2.067 2.562 2.374 4.133 2.154 2.671 2.475 4.309 F .959 .959 .959 .959 Pa irw ise Com parisons Measure: MEA SURE_1 (I) fact or1 1 2 3 (J) fact or1 2 3 1 3 1 2 Mean Difference (I-J) .500 .600 -.500 .100 -.600 -.100 St d. E rror .516 .520 .516 .332 .520 .332 a Sig. .344 .263 .344 .766 .263 .766 95% Confidenc e Interval for a Difference Lower Bound Upper Bound -.579 1.579 -.489 1.689 -1. 579 .579 -.594 .794 -1. 689 .489 -.794 .594 Based on estimated marginal means a. Adjust ment for multiple comparisons: Least Signific ant Differenc e (equivalent to no adjustments). Pain scores did not significantly differ across time, F(2, 38) = 0.96, ns. Thus, the treatment was unsuccessful. Sig. .392 .377 .383 .340