BIO 4118 APPLIED BIOSTATISTICS
ASSIGNMENT 1 SOLUTIONS

As with any problem, a good place to start is by inspecting the data. If we plot the proportion of mothers in each age class who had a child with Down’s syndrome (Prob_Downs = Downs/Births, i.e. Down’s births divided by total births), we get Fig. 1. (To get this graph, I created a new variable AGE, defined as the midpoint of the age class interval, and did a scatterplot of Prob_Downs (risk) versus AGE.)

[Figure: scatterplot of PROB_DOWNS (0.0 to 0.025) versus AGE (10 to 50)]

Fig. 1. The relationship between risk (Prob_Downs) and maternal age

We see that (1) there is a general increase in the probability of having a Down’s child (risk) with age, but that (2) the increase in risk is very small between the younger age classes. Moreover, across age classes 35-39, 40-44 and > 44, there is a large increase in risk, from 0.003 to 0.010 (a 3-fold increase) to 0.022 (a 2-fold increase). But is this increase significant?

The statistical analysis begins with a test of the null hypothesis that risk is independent of age. To do this, set up a 2-dimensional contingency table from a data file with three columns: Age$, Cat$ (Downs or No_Downs, where No_Downs = No_Births – Downs) and Count, and test for independence. The results are:

Frequencies
AGE (rows) by CAT$ (columns)

              Downs    No_Downs       Total
        +-------------------------+
   15   |        15       35540   |    35555
   22   |       128      207803   |   207931
   27   |       208      253242   |   253450
   32   |       194      170776   |   170970
   37   |       297       85749   |    86046
   42   |       240       24258   |    24498
   48   |        37        1670   |     1707
        +-------------------------+
Total          1119      779038       780157

Test statistic                     Value       df     Prob
Pearson Chi-square              2128.989    6.000    0.000
Likelihood ratio Chi-square     1071.446    6.000    0.000
Cochran's Linear Trend          1071.068    1.000    0.000

This result is as expected: we reject the null hypothesis that risk is independent of age. (Ignore the Cochran’s Linear Trend for the moment; I’ll get to that later.)
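The risk values and the first two test statistics can be checked by hand from the frequency table. The following is a minimal sketch in plain Python, not the SYSTAT procedure itself; the names `TABLE`, `risks` and `independence_stats` are made up for this illustration. It uses the usual formulas: expected count = row total × column total / N, Pearson = Σ(O – E)²/E, and likelihood ratio G = 2 Σ O ln(O/E).

```python
import math

# Rows of the frequency table: (AGE midpoint, Downs, No_Downs).
TABLE = [
    (15, 15, 35540),
    (22, 128, 207803),
    (27, 208, 253242),
    (32, 194, 170776),
    (37, 297, 85749),
    (42, 240, 24258),
    (48, 37, 1670),
]


def risks(table):
    """Proportion of births with Down's syndrome in each age class."""
    return [downs / (downs + no_downs) for _, downs, no_downs in table]


def independence_stats(counts):
    """Return (Pearson chi-square, likelihood-ratio G, df) for an r x c table."""
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    n = sum(row_tot)
    pearson = 0.0
    g = 0.0
    for i, row in enumerate(counts):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            pearson += (obs - expected) ** 2 / expected
            if obs > 0:  # a zero cell contributes nothing to G
                g += 2.0 * obs * math.log(obs / expected)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return pearson, g, df


counts = [[downs, no_downs] for _, downs, no_downs in TABLE]
pearson, g, df = independence_stats(counts)

for (age, _, _), risk in zip(TABLE, risks(TABLE)):
    print(f"AGE {age}: risk = {risk:.5f}")
print(f"Pearson = {pearson:.3f}, G = {g:.3f}, df = {df}")
```

With df = (7 – 1)(2 – 1) = 6, the 5% critical value of chi-square is 12.59, so either statistic rejects independence by a very wide margin.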
There are two issues that now must be addressed: (1) we have fitted a model with seven age categories and shown that an interaction (AGE*CAT$) is present, but based on the graph shown above, we suspect that a simpler model might be appropriate, for example, one in which the lower age classes are combined; (2) the null hypothesis tested above was 2-tailed, but our biological theory predicts that risk increases with maternal age. That is, it specifies a directional lack of independence.

We can address the first problem by the process of finding homogeneous subtables. For example, if we take only the first two age classes and test them, we get:

Frequencies
CAT$ (rows) by AGE (columns)

                    15         22        Total
            +----------------------+
Downs       |       15        128  |       143
No_Downs    |    35540     207803  |    243343
            +----------------------+
Total            35555     207931       243486

Test statistic                      Value       df     Prob
Pearson Chi-square                  1.941    1.000    0.164
Likelihood ratio Chi-square         2.119    1.000    0.146
Yates corrected Chi-square          1.625    1.000    0.202
Fisher exact test (two-tail)                          0.192

So we do not reject the null, and conclude that there is no evidence that the risk varies between these two age classes. (Note the use of the Yates correction and the Fisher exact test, since this is a 2 X 2 table!) We now add the third class, 25-29 (AGE = 27), and try again:

Frequencies
CAT$ (rows) by AGE (columns)

                    15         22         27        Total
            +---------------------------------+
Downs       |       15        128        208  |       351
No_Downs    |    35540     207803     253242  |    496585
            +---------------------------------+
Total            35555     207931     253450       496936

Test statistic                      Value       df     Prob
Pearson Chi-square                 11.196    2.000    0.004
Likelihood ratio Chi-square        11.766    2.000    0.003
Cochran's Linear Trend             11.193    1.000    0.001

So now we reject the null. Therefore, it seems that the first 2 classes can be combined, but the 3rd must remain separate.
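The subtable tests above can be reproduced with a few lines of arithmetic. This is a rough sketch rather than SYSTAT's own code; the helper names (`chi2_sf`, `pearson_and_g`, `yates_2x2`) are invented here, and the p-value helper only covers the two cases needed, using the closed forms P(χ² > x) = erfc(√(x/2)) for df = 1 and exp(–x/2) for df = 2. The Fisher exact test is left out.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def pearson_and_g(counts):
    """Pearson chi-square and likelihood-ratio G for a table with CAT$ as rows."""
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    n = sum(row_tot)
    pearson = g = 0.0
    for i, row in enumerate(counts):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            pearson += (obs - expected) ** 2 / expected
            g += 2.0 * obs * math.log(obs / expected)
    return pearson, g


def yates_2x2(counts):
    """Yates continuity-corrected chi-square for a 2 x 2 table."""
    (a, b), (c, d) = counts
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Age classes 15 and 22 (df = 1), then 15, 22 and 27 (df = 2).
first_two = [[15, 128], [35540, 207803]]
first_three = [[15, 128, 208], [35540, 207803, 253242]]

p2, g2 = pearson_and_g(first_two)
y2 = yates_2x2(first_two)
p3, g3 = pearson_and_g(first_three)

print(f"2 classes: Pearson {p2:.3f} (p = {chi2_sf(p2, 1):.3f}), "
      f"G {g2:.3f} (p = {chi2_sf(g2, 1):.3f}), "
      f"Yates {y2:.3f} (p = {chi2_sf(y2, 1):.3f})")
print(f"3 classes: Pearson {p3:.3f} (p = {chi2_sf(p3, 2):.3f}), "
      f"G {g3:.3f} (p = {chi2_sf(g3, 2):.3f})")
```

The same two helpers can be pointed at any other pair or run of adjacent age classes by changing the two rows of counts.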
Can we combine the 3rd and 4th classes?

Frequencies
CAT$ (rows) by AGE (columns)

                    27         32        Total
            +----------------------+
Downs       |      208        194  |       402
No_Downs    |   253242     170776  |    424018
            +----------------------+
Total           253450     170970       424420

Test statistic                      Value       df     Prob
Pearson Chi-square                 10.640    1.000    0.001
Likelihood ratio Chi-square        10.462    1.000    0.001
Yates corrected Chi-square         10.311    1.000    0.001
Fisher exact test (two-tail)                          0.001

Apparently not, because again we reject the null that the risk is independent of age across these two age classes. Clearly, the differences among the successive remaining age classes are even greater than between these two classes (refer to the graph above), and sample sizes are still very large, so power is high. So there is no need to continue.

We conclude, therefore, that there is no evidence of an increase in risk until age 25. In other words, we can collapse the data into 6 age classes (< 25, plus the five remaining classes) and refit the model to get:

Test statistic                     Value       df     Prob
Pearson Chi-square              2128.194    5.000    0.000
Likelihood ratio Chi-square     1069.327    5.000    0.000
Cochran's Linear Trend          1139.554    1.000    0.000

Note that compared to the original model, the reduction in goodness of fit is 1071.446 – 1069.327 = 2.119 with df = 6 – 5 = 1, p >> .05, i.e. combining the first two age classes does not significantly reduce the model fit. On the other hand, if we fit a model with the first 3 age classes combined, we get:

Test statistic                     Value       df     Prob
Pearson Chi-square              2123.472    4.000    0.000
Likelihood ratio Chi-square     1059.680    4.000    0.000
Cochran's Linear Trend          1440.214    1.000    0.000

Note that relative to the original model, we have reduced the goodness of fit by 1071.446 – 1059.680 = 11.766 with df = 6 – 4 = 2, p = 0.003. So in this case, we have significantly reduced model fit.
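The comparison of collapsed and uncollapsed models comes down to differences between likelihood-ratio statistics, which can be sketched as follows. This is only an illustration in plain Python under the assumption that the counts are those of the original seven-class table; `g_stat` and `collapse_first` are hypothetical helper names, and the p-values use the closed forms erfc(√(x/2)) for df = 1 and exp(–x/2) for df = 2.

```python
import math

# (Downs, No_Downs) for AGE = 15, 22, 27, 32, 37, 42, 48.
ROWS = [
    (15, 35540),
    (128, 207803),
    (208, 253242),
    (194, 170776),
    (297, 85749),
    (240, 24258),
    (37, 1670),
]


def g_stat(rows):
    """Likelihood-ratio chi-square G = 2 * sum(O * ln(O / E)) for an r x 2 table."""
    col_tot = [sum(col) for col in zip(*rows)]
    n = sum(col_tot)
    g = 0.0
    for row in rows:
        row_tot = sum(row)
        for obs, ctot in zip(row, col_tot):
            expected = row_tot * ctot / n
            g += 2.0 * obs * math.log(obs / expected)
    return g


def collapse_first(rows, k):
    """Pool the first k age classes into a single row; keep the rest as they are."""
    pooled = tuple(sum(col) for col in zip(*rows[:k]))
    return [pooled] + list(rows[k:])


g7 = g_stat(ROWS)                      # all 7 classes, df = 6
g6 = g_stat(collapse_first(ROWS, 2))   # first 2 classes pooled, df = 5
g5 = g_stat(collapse_first(ROWS, 3))   # first 3 classes pooled, df = 4

drop_2 = g7 - g6                       # loss of fit, df = 1
drop_3 = g7 - g5                       # loss of fit, df = 2
p_2 = math.erfc(math.sqrt(drop_2 / 2.0))
p_3 = math.exp(-drop_3 / 2.0)

print(f"G: 7 classes {g7:.3f}, 6 classes {g6:.3f}, 5 classes {g5:.3f}")
print(f"pooling 2 classes: drop = {drop_2:.3f}, df = 1, p = {p_2:.3f}")
print(f"pooling 3 classes: drop = {drop_3:.3f}, df = 2, p = {p_3:.3f}")
```

Each drop in G equals the likelihood-ratio statistic of the subtable formed by the pooled classes, which is why 2.119 and 11.766 already appeared in the subtable tests.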
So the best model is the one that includes interactions involving only 6 age classes. The coefficients of this model (the log-linear parameters) are the lambdas obtained by fitting a log-linear model with AGE, CAT$ and AGE*CAT$ included. Note that because we have reduced the number of age classes, these lambdas (and therefore the model predictions) will be somewhat different from those obtained from an analysis where all 7 classes were included separately.

The second issue is the “tailedness” of the biological hypothesis. From Fig. 1, it is clear that risk increases with age. Because tests for independence in contingency analysis are always 2-tailed, it is entirely possible that we could reject the null and still not have a pattern that is consistent with the biological prediction. For example, if risk decreased with age, or first increased and then decreased, we would still reject the null (of no interaction), but neither of these patterns would be consistent with (i.e. support) our biological hypothesis.

SYSTAT does in fact have a way of testing for linear trends across ordered (sequential) categories (like age classes), and this is the Cochran’s test of linear trend noted above. If we take the log10 transform of risk and plot it as a function of age, we get:

[Figure: scatterplot of LOGRISK (-4 to -1) versus AGE2 (10 to 50)]

which, although not perfect, shows a strong linearity. In fact, this is exactly what Cochran’s test is doing: it is more or less equivalent to a linear regression of the log-transformed proportions on age (or “AGE2” in the above figure). So, not surprisingly, the null hypothesis of no linear trend is strongly rejected, and we conclude that risk does in fact increase with maternal age.
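The trend argument can be illustrated numerically. The sketch below makes two assumptions that are not taken from SYSTAT: the age-class midpoints are used as scores, and the trend statistic is the textbook Cochran-Armitage form, χ² = T² / Var(T) with T = Σ xᵢ(aᵢ – nᵢp). The value it gives therefore need not match SYSTAT's 1071.068, which may rest on different scores. The function names are hypothetical. The sign of T (or of the regression slope) is what carries the direction of the trend.

```python
import math

AGE = [15, 22, 27, 32, 37, 42, 48]               # class midpoints, used as scores
DOWNS = [15, 128, 208, 194, 297, 240, 37]
BIRTHS = [35555, 207931, 253450, 170970, 86046, 24498, 1707]


def least_squares(x, y):
    """Slope, intercept and correlation of an ordinary least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


def cochran_armitage(scores, cases, totals):
    """Return (T, chi-square on 1 df) for a linear trend in proportions."""
    n = sum(totals)
    p = sum(cases) / n
    t = sum(x * (a - m * p) for x, a, m in zip(scores, cases, totals))
    sx = sum(m * x for x, m in zip(scores, totals))
    sxx = sum(m * x * x for x, m in zip(scores, totals))
    var_t = p * (1.0 - p) * (sxx - sx * sx / n)
    return t, t * t / var_t


# Regression of log10(risk) on age, as in the LOGRISK plot.
logrisk = [math.log10(d / b) for d, b in zip(DOWNS, BIRTHS)]
slope, intercept, r = least_squares(AGE, logrisk)

# Directional check: T > 0 means risk rises with the age score.
t, trend_chi2 = cochran_armitage(AGE, DOWNS, BIRTHS)

print(f"log10(risk) = {intercept:.3f} + {slope:.4f} * AGE, r = {r:.3f}")
print(f"trend: T = {t:.1f}, chi-square (1 df) = {trend_chi2:.1f}")
```

A positive slope and a positive T both indicate that the departure from independence is in the direction predicted by the biological hypothesis, which the 2-tailed tests of independence cannot show on their own.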