BLUE = USED TO CHECK FOR ERRORS
YELLOW/BOXED = USED TO REPORT AND INTERPRET RESULTS

ANALYSIS OF VARIANCE

Example question: Among young daily smokers in NESARC, is nicotine dependence (TAB12MDX; IV) associated with smoking quantity (S3AQ3C1; DV)?

# syntax
> data.aov <- aov(DV ~ IV, data=title_of_data_set)
> summary(data.aov)

> ND.aov <- aov(S3AQ3C1 ~ TAB12MDX, data=nesarc.subset)
> summary(ND.aov)
              Df Sum Sq Mean Sq F value    Pr(>F)
TAB12MDX       1   2935 2935.37  38.406 7.458e-10 ***
Residuals   1461 111664   76.43
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3923 observations deleted due to missingness

When examining the association between current number of cigarettes smoked per day (S3AQ3C1 - quantitative) and past-year nicotine dependence (TAB12MDX - categorical), an Analysis of Variance (ANOVA) revealed that among young adult daily smokers (my sample), there is a significant association between nicotine dependence and number of cigarettes smoked per day (F(1, 1461) = 38.406, p = 7.5e-10).

Additional commands can tell you more about the association. To get the mean and s.d. of the DV at each level of the IV (i.e. the mean number of cigarettes smoked at each level of nicotine dependence), use the by() function you learned in Week 4 on S3AQ3C1, grouped by TAB12MDX:

> by(nesarc.subset$S3AQ3C1, nesarc.subset$TAB12MDX, mean, na.rm=TRUE)
nesarc.subset$TAB12MDX: 0
[1] 11.80211
---------------------------------------------------------------------
nesarc.subset$TAB12MDX: 1
[1] 14.64794

> by(nesarc.subset$S3AQ3C1, nesarc.subset$TAB12MDX, sd, na.rm=TRUE)
nesarc.subset$TAB12MDX: 0
[1] 8.173523
---------------------------------------------------------------------
nesarc.subset$TAB12MDX: 1
[1] 9.185908

Participants without nicotine dependence smoke, on average, 11.8 cigarettes per day (s.d. = 8.2), whereas participants with nicotine dependence smoke more cigarettes per day (mean = 14.6; s.d. = 9.2).
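As a quick sanity check on the summary(ND.aov) table earlier in this section, the F value and its p-value can be recomputed from the Mean Sq and Df columns. This is only a sketch: the numbers are typed in by hand from the printed (rounded) table, and the variable names are my own, so the result matches the printed output only to rounding.

```r
# Values copied by hand from the printed summary(ND.aov) table (rounded)
ms_between <- 2935.37   # Mean Sq for TAB12MDX
ms_resid   <- 76.43     # Mean Sq for Residuals
df_between <- 1         # Df for TAB12MDX
df_resid   <- 1461      # Df for Residuals

# F is the ratio of the between-group to the residual mean square
f_value <- ms_between / ms_resid

# p-value is the upper tail of the F distribution with (1, 1461) df
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

print(round(f_value, 3))
print(signif(p_value, 3))
```

If the recomputed F differs noticeably from the F value column, the most likely cause is a typo when copying the table.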
To determine exactly which levels of the IV differ from each other, which is necessary when the IV has more than 2 levels, perform a post-hoc test:

> TukeyHSD(ND.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = S3AQ3C1 ~ TAB12MDX, data = nesarc.subset)

$TAB12MDX
        diff      lwr    upr p adj
1-0 2.845825 1.945051 3.7466     0

Tukey’s post-hoc test reveals that the difference between level 1 and level 0 of TAB12MDX is positive and significant (the adjusted p-value is displayed as 0 after rounding); therefore having nicotine dependence (level 1) is associated with smoking a higher number of cigarettes.

CHI-SQUARE TEST OF INDEPENDENCE

# syntax
> data.chisq <- chisq.test(title_of_data_set$DV, title_of_data_set$IV)

> DeprND.chisq <- chisq.test(nesarc.subset$MAJORDEPLIFE, nesarc.subset$TAB12MDX)
> DeprND.chisq

        Pearson's Chi-squared test with Yates' continuity correction

data:  nesarc.subset$MAJORDEPLIFE and nesarc.subset$TAB12MDX
X-squared = 64.6933, df = 1, p-value = 8.751e-16

When examining the association between lifetime major depression (MAJORDEPLIFE - categorical) and past-year nicotine dependence (TAB12MDX - categorical), a Chi-square analysis revealed that among young adult smokers (my sample), there is a significant relationship between nicotine dependence and major depression (X-squared = 64.69, d.f. = 1, p = 8.8e-16).

The differences between the expected and observed counts contribute to the Chi-square statistic. You can view the observed counts and the expected (by chance) counts from the chi-square test:

> DeprND.chisq$observed
                          nesarc.subset$TAB12MDX
nesarc.subset$MAJORDEPLIFE   0   1
                         0 553 513
                         1 114 289

> DeprND.chisq$expected
                          nesarc.subset$TAB12MDX
nesarc.subset$MAJORDEPLIFE        0        1
                         0 484.0177 581.9823
                         1 182.9823 220.0177

Although I can’t tell for certain from $observed and $expected which particular cells are significant, they are helpful if you are looking at pairwise comparisons between cells. For example, I am interested in the two cells for which MAJORDEPLIFE=1 (2nd row).
Since the overall chi-square test is significant, and the cell (MAJORDEPLIFE=1, TAB12MDX=0) has a lower count than expected by chance while the cell (MAJORDEPLIFE=1, TAB12MDX=1) has a higher count than expected by chance, these cells may differ significantly from chance. A post-hoc test is needed to be sure.

To determine the exact nature of this relationship, i.e. which particular cells are significantly different from the count that is expected by chance, look at the Pearson residuals from the chi-square test. A Pearson residual is like a z-score, so if its magnitude is greater than 1.96, that cell is significant at the p = .05 significance level.

> DeprND.chisq$residuals
                          nesarc.subset$TAB12MDX
nesarc.subset$MAJORDEPLIFE         0         1
                         0  3.135502 -2.859452
                         1 -5.099565  4.650599

I find that, in the significant Chi-square test between lifetime major depression and past-year nicotine dependence, all cells are significantly different from the counts that are expected by chance. The cell (MAJORDEPLIFE=1, TAB12MDX=1) has a Pearson residual of 4.65. Since this is like a z-score, I look it up in a z-score table and find that its (one-tailed) p-value is 1.7 x 10^-6. Its positive sign means that there are more people in this cell than expected by chance. On the other hand, the cell (MAJORDEPLIFE=1, TAB12MDX=0) has a Pearson residual of -5.10, corresponding to a p-value of 1.8 x 10^-7. Its negative sign means that there are fewer people in this cell than expected by chance.
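The expected counts and Pearson residuals above can be reproduced by hand from the observed table, which may make it clearer what chisq.test() is reporting. The sketch below retypes the four observed counts from DeprND.chisq$observed into a matrix (the object names observed, expected, pearson and fit are my own), then compares the hand calculation with what chisq.test() returns for the same matrix. The p-values quoted in the text appear to be one-tailed normal probabilities; that is my reading of the numbers, not something the handout states, and a two-tailed p-value would be twice as large.

```r
# Observed counts retyped from DeprND.chisq$observed
# rows = MAJORDEPLIFE (0, 1), columns = TAB12MDX (0, 1); filled column by column
observed <- matrix(c(553, 114, 513, 289), nrow = 2,
                   dimnames = list(MAJORDEPLIFE = c("0", "1"),
                                   TAB12MDX     = c("0", "1")))

# Expected count for a cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual = (observed - expected) / sqrt(expected)
pearson <- (observed - expected) / sqrt(expected)

# One-tailed normal p-value for each residual (what a z-table lookup gives)
p_one_tailed <- pnorm(-abs(pearson))

# Cross-check against chisq.test() run on the same 2x2 table
fit <- chisq.test(observed)

print(round(expected, 4))
print(round(pearson, 3))
print(signif(p_one_tailed, 2))
print(all.equal(as.vector(fit$residuals), as.vector(pearson)))
```

Cells with a residual beyond +/-1.96 are the ones flagged as different from chance at the .05 level, which here is all four cells.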
PEARSON CORRELATION

# syntax
> cor.test(title_of_data_set$IV, title_of_data_set$DV)

> cor.test(nesarc.subset$S3AQ3C1, nesarc.subset$NDScount)

        Pearson's product-moment correlation

data:  nesarc.subset$S3AQ3C1 and nesarc.subset$NDScount
t = 5.472, df = 1461, p-value = 5.228e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.09112835 0.19157443
sample estimates:
      cor
0.1417162

Among young adult daily smokers (my sample), the correlation between number of cigarettes smoked per day (S3AQ3C1 - continuous) and number of nicotine dependence symptoms experienced in the past year (NDScount - continuous) was 0.14 (p = 5.2e-8). Although this association is statistically significant, it is weak: only 2% (i.e. 0.14^2) of the variance in number of past-year nicotine dependence symptoms can be explained by number of cigarettes smoked per day (or vice versa).
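The "2% of the variance" figure and the t statistic in the cor.test() output both follow directly from the correlation coefficient and the degrees of freedom. The sketch below recomputes them from the printed values; r and df are copied by hand from the output above, and the variable names are my own.

```r
# Values copied by hand from the cor.test() output
r  <- 0.1417162   # sample correlation (cor)
df <- 1461        # degrees of freedom (n - 2)

# Proportion of variance shared by the two variables
r_squared <- r^2

# t statistic for testing whether the true correlation is 0
t_value <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df, lower.tail = FALSE)

print(round(r_squared, 4))
print(round(t_value, 3))
print(signif(p_value, 3))
```

With a sample this large, even a small correlation produces a very small p-value, which is why the r-squared value is worth reporting alongside it.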