R Bivariate Models Handout

BLUE=USED TO CHECK FOR ERRORS
YELLOW/BOXED=USED TO REPORT AND INTERPRET RESULTS
ANALYSIS OF VARIANCE
Example question: Among young daily smokers in NESARC, is nicotine dependence (TAB12MDX; IV)
associated with smoking quantity (S3AQ3C1; DV)?
# syntax
> data.aov <- aov(DV ~ IV, data=title_of_data_set)
> summary(data.aov)
> ND.aov <- aov(S3AQ3C1 ~ TAB12MDX, data=nesarc.subset)
> summary(ND.aov)
              Df Sum Sq Mean Sq F value    Pr(>F)
TAB12MDX       1   2935 2935.37  38.406 7.458e-10 ***
Residuals   1461 111664   76.43
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3923 observations deleted due to missingness
When examining the association between current number of cigarettes smoked (S3AQ3C1 -
quantitative) and past year nicotine dependence (TAB12MDX - categorical), an Analysis of
Variance (ANOVA) revealed that among young, daily adult smokers (my sample), there is a
significant association between nicotine dependence and number of cigarettes smoked per day
(F(1, 1461) = 38.406, p = 7.5e-10). Additional commands can tell you more about the association.
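
As a check on how the F statistic and p-value in the ANOVA table fit together, the sketch below recomputes them from the Sum Sq and Df columns reported above. The variable names (ss_between, ss_resid, and so on) are made up for illustration; only the numbers come from the handout.

```r
# Values copied from the summary(ND.aov) table in this handout
ss_between <- 2935.37   # Sum Sq for TAB12MDX
df_between <- 1         # Df for TAB12MDX
ss_resid   <- 111664    # Sum Sq for Residuals
df_resid   <- 1461      # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_resid   <- ss_resid / df_resid

# F is the ratio of the two mean squares
f_value <- ms_between / ms_resid

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```

The result should agree with the F value and Pr(>F) columns up to rounding, since the table itself reports rounded sums of squares.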

To get the means of each level (i.e. mean number of cigarettes smoked for each level of
nicotine dependence), use the by() function you learned in Week 4 to calculate mean and
s.d. of S3AQ3C1 for each level of TAB12MDX:
> by(nesarc.subset$S3AQ3C1, nesarc.subset$TAB12MDX, mean, na.rm=T)
nesarc.subset$TAB12MDX: 0
[1] 11.80211
---------------------------------------------------------------------
nesarc.subset$TAB12MDX: 1
[1] 14.64794
> by(nesarc.subset$S3AQ3C1, nesarc.subset$TAB12MDX, sd, na.rm=T)
nesarc.subset$TAB12MDX: 0
[1] 8.173523
---------------------------------------------------------------------
nesarc.subset$TAB12MDX: 1
[1] 9.185908
Participants with no nicotine dependence smoke, on average, 11.8 cigarettes per day
(s.d. 8.2) compared to the participants with nicotine dependence, who smoke more
cigarettes per day (mean = 14.6; s.d. = 9.2).
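
If you prefer the mean and s.d. side by side in one table, tapply() is one possible alternative to by(). The sketch below uses a small made-up data frame (toy, with columns cigs and dep), not the NESARC data, so the numbers it prints are illustrative only.

```r
# Hypothetical toy data: cigarettes per day (cigs) and a 0/1 dependence flag (dep)
toy <- data.frame(
  cigs = c(10, 12, NA, 20, 15, 25, 8, 30),
  dep  = factor(c(0, 0, 0, 1, 1, 1, 0, 1))
)

# na.rm = TRUE is passed through to mean() and sd(), so the NA is dropped
group_means <- tapply(toy$cigs, toy$dep, mean, na.rm = TRUE)
group_sds   <- tapply(toy$cigs, toy$dep, sd, na.rm = TRUE)

# One row per level of dep, one column per statistic
print(cbind(mean = group_means, sd = group_sds))
```

With your own data you would substitute nesarc.subset$S3AQ3C1 and nesarc.subset$TAB12MDX for the toy columns.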

To determine exact differences between levels of the IV, especially if the IV has more
than 2 levels, a post-hoc test must be performed:
> TukeyHSD(ND.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = S3AQ3C1 ~ TAB12MDX, data = nesarc.subset)
$TAB12MDX
        diff      lwr    upr p adj
1-0 2.845825 1.945051 3.7466     0
Tukey’s post-hoc test reveals that the difference between level 1 and level 0 of
TAB12MDX is positive and significant; therefore having nicotine dependence (level
1) is positively associated with smoking a higher number of cigarettes.
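
Because TAB12MDX has only two levels, the Tukey output above has a single row. To see what the post-hoc test looks like when the IV has more than 2 levels, here is a sketch using PlantGrowth, a small data set that ships with R (plant weight under a control and two treatment conditions). It is not part of NESARC and is used only to illustrate the shape of the output.

```r
# PlantGrowth is built into R: weight (quantitative) by group (3-level factor)
fit <- aov(weight ~ group, data = PlantGrowth)
print(summary(fit))

# Tukey's HSD compares every pair of group levels
posthoc <- TukeyHSD(fit)
print(posthoc)

# The comparisons are stored as a matrix: one row per pair of levels,
# with columns diff, lwr, upr and p adj
pairs_tab <- posthoc$group
```

Each row is read the same way as the 1-0 row above: a positive diff with a confidence interval that excludes 0 indicates that the first-named level has a significantly higher mean.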
CHI-SQUARE TEST OF INDEPENDENCE
# syntax
> data.chisq <- chisq.test(title_of_data_set$DV, title_of_data_set$IV)
> DeprND.chisq <- chisq.test(nesarc.subset$MAJORDEPLIFE, nesarc.subset$TAB12MDX)
> DeprND.chisq
Pearson's Chi-squared test with Yates' continuity correction
data: nesarc.subset$MAJORDEPLIFE and nesarc.subset$TAB12MDX
X-squared = 64.6933, df = 1, p-value = 8.751e-16
When examining the association between lifetime major depression (MAJORDEPLIFE -
categorical) and past year nicotine dependence (TAB12MDX - categorical), a Chi-square
analysis revealed that among young adult smokers (my sample), there is a significant relationship
between nicotine dependence and major depression (X-squared = 64.69, d.f. = 1, p = 8.8e-16).
The differences between the expected and observed counts contribute to the Chi-square
statistic. You can view the expected (by chance) and observed counts from the chi-square test:
> DeprND.chisq$observed
                          nesarc.subset$TAB12MDX
nesarc.subset$MAJORDEPLIFE   0   1
                         0 553 513
                         1 114 289
> DeprND.chisq$expected
                          nesarc.subset$TAB12MDX
nesarc.subset$MAJORDEPLIFE        0        1
                         0 484.0177 581.9823
                         1 182.9823 220.0177
Although I can’t tell for certain from $observed and $expected which particular cells are
significant, they are helpful if you are looking at pairwise comparisons between cells.
For example, I am interested in the two cells for which MAJORDEPLIFE=1 (2nd row).
Since the overall chi-square test is significant, and the cell (MAJORDEPLIFE=1,
TAB12MDX=0) has a lower count than expected by chance while the cell
(MAJORDEPLIFE=1, TAB12MDX=1) has a higher count than expected by chance,
these cells may differ significantly from chance. A post-hoc test is needed to be sure.
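
To see where the expected counts come from, the sketch below rebuilds the observed table by hand from the four counts shown in the $observed output and computes each expected count as (row total x column total) / grand total. The object names (observed, expected) are chosen for illustration.

```r
# Observed counts typed in from the $observed output in this handout.
# matrix() fills column by column: first the TAB12MDX=0 column, then TAB12MDX=1.
observed <- matrix(c(553, 114, 513, 289), nrow = 2,
                   dimnames = list(MAJORDEPLIFE = c("0", "1"),
                                   TAB12MDX     = c("0", "1")))

n <- sum(observed)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / n
print(round(expected, 4))

# Positive entries have more people than expected by chance, negative fewer
print(round(observed - expected, 4))
```

These hand-computed values should match what chisq.test() stores in $expected.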

To determine the exact nature of this relationship, i.e. which particular cells are
significantly different from the count that is expected by chance, a post-hoc test must be
performed. The Pearson residuals from the chi-square test tell you which cells are
significantly different from chance. A Pearson residual behaves approximately like a z-score,
so if its magnitude is greater than 1.96, that cell is significant at the .05 significance level.
> DeprND.chisq$residuals
                          nesarc.subset$TAB12MDX
nesarc.subset$MAJORDEPLIFE         0         1
                         0  3.135502 -2.859452
                         1 -5.099565  4.650599
I find that, in the significant Chi-square test between lifetime major depression and past-year
nicotine dependence, all cells are significantly different from the counts that are
expected by chance. The cell (MAJORDEPLIFE=1, TAB12MDX=1) has a Pearson
residual of 4.65. Since this is like a z-score, I look this up on a z-score table and find that
its p-value is 1.7 x 10^-6. Its positive value means that there are more people in this cell
than is expected by chance. On the other hand, the cell (MAJORDEPLIFE=1,
TAB12MDX=0) has a Pearson residual of -5.10, corresponding to a p-value of 1.8 x 10^-7.
Its negative sign means that there are fewer people in this cell than is expected by chance.
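
Instead of a printed z-score table, pnorm() can convert each residual to a p-value directly. The sketch below re-enters the observed counts from this handout, reruns the test, and treats each Pearson residual as a z-score, as the text does. The p-values quoted above appear to correspond to a one-tailed lookup, so both one- and two-tailed versions are computed; that reading is an assumption, not something the handout states.

```r
# Observed counts typed in from the $observed output in this handout
observed <- matrix(c(553, 114, 513, 289), nrow = 2,
                   dimnames = list(MAJORDEPLIFE = c("0", "1"),
                                   TAB12MDX     = c("0", "1")))

# For a 2x2 table chisq.test() applies Yates' continuity correction by default
test <- chisq.test(observed)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- test$residuals

# Tail probabilities from the standard normal distribution
p_one_tailed <- pnorm(-abs(pearson))
p_two_tailed <- 2 * p_one_tailed

print(round(pearson, 3))
print(signif(p_one_tailed, 2))
print(signif(p_two_tailed, 2))

# chisq.test() also returns standardized (adjusted) residuals
adjusted <- test$stdres
print(round(adjusted, 3))
```

The $stdres values are scaled to have variance closer to 1 than the Pearson residuals, so some analysts prefer them when applying the 1.96 cutoff. Either way, when several cells are tested at once, the cutoff can be made stricter to account for multiple comparisons.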
PEARSON CORRELATION
# syntax
> cor.test(title_of_data_set$IV, title_of_data_set$DV)
> cor.test(nesarc.subset$S3AQ3C1, nesarc.subset$NDScount)
Pearson's product-moment correlation
data: nesarc.subset$S3AQ3C1 and nesarc.subset$NDScount
t = 5.472, df = 1461, p-value = 5.228e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.09112835 0.19157443
sample estimates:
cor
0.1417162
Among young, daily adult smokers (my sample), the correlation between number of cigarettes
smoked per day (S3AQ3C1 - continuous) and number of nicotine dependence symptoms
experienced in the past year (NDScount - continuous) was 0.14 (p = 5.2e-8), suggesting that only
2% (i.e. 0.14^2 = 0.02) of the variance in number of current nicotine dependence symptoms can be
explained by number of cigarettes smoked per day (or vice versa), although the association
between cigarettes smoked and nicotine dependence symptoms was statistically significant.
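
The pieces of the cor.test() output are linked by a few standard formulas. As a sketch, the code below starts from the reported correlation and degrees of freedom and recomputes the variance explained, the t statistic, the p-value, and the 95% confidence interval (via the Fisher z transformation). Variable names are illustrative; only r and df are taken from the output above.

```r
# Values copied from the cor.test() output in this handout
r  <- 0.1417162
df <- 1461
n  <- df + 2          # cor.test uses df = n - 2

# Proportion of variance shared by the two variables
r_squared <- r^2

# t statistic for testing that the true correlation is 0
t_value <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(c(r_squared = r_squared, t = t_value, p = p_value))
print(ci)
```

The recomputed values should agree with the t, p-value, and confidence interval lines of the output up to rounding.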