BIO 4118 APPLIED BIOSTATISTICS ASSIGNMENT 1 SOLUTIONS

As with any problem, a good place to start is by inspecting the data. If we
plot the proportion of mothers in each age class who had a child with Down’s
syndrome (Prob_Downs = Downs/Births), we get Fig. 1. (To get this graph, I created
a new variable AGE, defined as the midpoint of the age class interval, and did a
scatterplot of Prob_Downs (risk) versus AGE.)
[Figure: scatterplot of PROB_DOWNS (y-axis, 0.0 to 0.025) against AGE (x-axis, 10 to 50)]
Fig. 1. The relationship between risk (Prob_Downs) and maternal age
We see that (1) there is a general increase in the probability of having a Down’s child
(risk) with age, but that (2) the increase in risk is very small among the younger age
classes. Moreover, across age classes 35-39, 40-44 and > 44, there is a large increase
in risk, from 0.003 to 0.010 (a 3-fold increase) to 0.022 (a 2-fold increase). But is this
increase significant?
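The risks and fold increases quoted above can be recomputed from the counts in the frequency table of this solution. The following is a minimal sketch in plain Python; the variable names (`ages`, `downs`, `births`, `risk`, `fold`) are chosen here for illustration and are not part of the original SYSTAT analysis.

```python
# Counts copied from the frequency table in this solution.
# ages holds the midpoint of each age class (the AGE variable).
ages = [15, 22, 27, 32, 37, 42, 48]
downs = [15, 128, 208, 194, 297, 240, 37]
births = [35555, 207931, 253450, 170970, 86046, 24498, 1707]

# Risk = proportion of births in each age class with Down's syndrome.
risk = [d / b for d, b in zip(downs, births)]

# Fold increase in risk from each age class to the next one.
fold = [risk[i + 1] / risk[i] for i in range(len(risk) - 1)]

for age, r in zip(ages, risk):
    print(f"AGE {age}: risk = {r:.4f}")
print("fold increases:", [round(f, 2) for f in fold])
```

The last two entries of `fold` correspond to the roughly 3-fold and 2-fold increases mentioned in the text.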
The statistical analysis begins with a test of the null hypothesis that risk is
independent of age. To do this, set up a 2-dimensional contingency table from three
columns: AGE$, CAT$ (Downs or No_Downs, where No_Downs = Births – Downs) and COUNT, and test
for independence. The results are:
Frequencies
AGE (rows) by CAT$ (columns)
           Downs  No_Downs      Total
       +--------------------+
    15 |      15     35540  |   35555
    22 |     128    207803  |  207931
    27 |     208    253242  |  253450
    32 |     194    170776  |  170970
    37 |     297     85749  |   86046
    42 |     240     24258  |   24498
    48 |      37      1670  |    1707
       +--------------------+
 Total      1119    779038     780157
Test statistic                     Value        df      Prob
Pearson Chi-square              2128.989     6.000     0.000
Likelihood ratio Chi-square     1071.446     6.000     0.000
Cochran's Linear Trend          1071.068     1.000     0.000
This result is as expected: we reject the null hypothesis that risk is independent of age.
(Ignore the Cochran’s Linear Trend for the moment – I’ll get to that later).
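The Pearson and likelihood-ratio statistics reported above can be recomputed directly from the frequency table. What follows is a minimal pure-Python sketch of the standard formulas, not SYSTAT's own procedure; the function names `expected_counts`, `pearson_chi2` and `lr_chi2` are made up for this illustration.

```python
import math

# Rows are age classes (midpoints 15 to 48); columns are (Downs, No_Downs).
table = [
    [15, 35540],
    [128, 207803],
    [208, 253242],
    [194, 170776],
    [297, 85749],
    [240, 24258],
    [37, 1670],
]

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[rt * ct / n for ct in col_tot] for rt in row_tot]

def pearson_chi2(table):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def lr_chi2(table):
    """Likelihood-ratio chi-square (G): 2 * sum of O * ln(O / E)."""
    exp = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, exp)
                   for o, e in zip(obs_row, exp_row) if o > 0)

pearson = pearson_chi2(table)
lr = lr_chi2(table)
df = (len(table) - 1) * (len(table[0]) - 1)

print(f"Pearson chi-square = {pearson:.3f}, df = {df}")
print(f"Likelihood ratio chi-square = {lr:.3f}, df = {df}")
```

Both statistics are far above 12.592, the 5% critical value of the chi-square distribution with 6 degrees of freedom, which is why the reported probabilities print as 0.000.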
There are two issues that now must be addressed. (1) We have fitted a model with seven
age categories and shown that an interaction (AGE*CAT$) is present. Based on the graph
shown above, however, we suspect that a simpler model might be appropriate, for example
one in which the lower age classes are combined. (2) The null hypothesis tested above was
2-tailed, but our biological theory predicts that risk increases with maternal age. That is,
it specifies a directional lack of independence.
We can address the first problem by the process of finding homogeneous subtables. For
example, if we take only the first two age classes and test them, we get:
Frequencies
CAT$ (rows) by AGE (columns)
                 15        22       Total
         +--------------------+
   Downs |       15       128 |       143
No_Downs |    35540    207803 |    243343
         +--------------------+
   Total      35555    207931      243486
Test statistic                     Value        df      Prob
Pearson Chi-square                 1.941     1.000     0.164
Likelihood ratio Chi-square        2.119     1.000     0.146
Yates corrected Chi-square         1.625     1.000     0.202
Fisher exact test (two-tail)                           0.192
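The Pearson and Yates-corrected statistics for this 2 x 2 subtable, together with their probabilities, can be recomputed with a few lines of plain Python. This is a sketch of the textbook formulas only: the helper names `chi2_2x2` and `p_value_df1` are invented here, and the Fisher exact test is not reproduced.

```python
import math

# Rows: AGE 15 and AGE 22; columns: (Downs, No_Downs).
sub = [[15, 35540], [128, 207803]]

def chi2_2x2(table, yates=False):
    """Chi-square for a 2 x 2 table, optionally with Yates' continuity correction."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = row_tot[i] * col_tot[j] / n
            diff = abs(table[i][j] - e)
            if yates:
                # Yates' correction shrinks each |O - E| by 0.5 (not below zero).
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / e
    return total

def p_value_df1(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

pearson_2x2 = chi2_2x2(sub)
yates_2x2 = chi2_2x2(sub, yates=True)

print(f"Pearson: {pearson_2x2:.3f}, p = {p_value_df1(pearson_2x2):.3f}")
print(f"Yates:   {yates_2x2:.3f}, p = {p_value_df1(yates_2x2):.3f}")
```

The `erfc` shortcut is valid only for 1 degree of freedom, which is all a 2 x 2 table needs.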
So we fail to reject the null, and conclude that there is no evidence that the risk varies between
these two age classes. (Note the use of the Yates correction and the Fisher exact test, since this
is a 2 x 2 table!) We now add the third class, 25-29 (AGE = 27), and try again:
Frequencies
CAT$ (rows) by AGE (columns)
                 15        22        27       Total
         +------------------------------+
   Downs |       15       128       208 |       351
No_Downs |    35540    207803    253242 |    496585
         +------------------------------+
   Total      35555    207931    253450      496936
Test statistic                     Value        df      Prob
Pearson Chi-square                11.196     2.000     0.004
Likelihood ratio Chi-square       11.766     2.000     0.003
Cochran's Linear Trend            11.193     1.000     0.001
So now we reject the null. It therefore seems that the first two classes can be combined,
but the third must remain separate. Can we combine the 3rd and 4th classes?
Frequencies
CAT$ (rows) by AGE (columns)
                 27        32       Total
         +--------------------+
   Downs |      208       194 |       402
No_Downs |   253242    170776 |    424018
         +--------------------+
   Total     253450    170970      424420
Test statistic                     Value        df      Prob
Pearson Chi-square                10.640     1.000     0.001
Likelihood ratio Chi-square       10.462     1.000     0.001
Yates corrected Chi-square        10.311     1.000     0.001
Fisher exact test (two-tail)                           0.001
Apparently not, because again we reject the null that the risk is independent of age across
these two age classes.
The differences among the successive remaining age classes are even greater
than between these two classes (refer to Fig. 1), and sample sizes are still very
large, so power is high. There is therefore no need to continue. We conclude that
there is no evidence of an increase in risk until age 25. In other words, we can collapse
the data into 6 age classes (< 25, plus the five remaining classes) and refit the model to get:
Test statistic                     Value        df      Prob
Pearson Chi-square              2128.194     5.000     0.000
Likelihood ratio Chi-square     1069.327     5.000     0.000
Cochran's Linear Trend          1139.554     1.000     0.000
Note that compared to the original model, the reduction in goodness of fit is 1071.446 –
1069.327 = 2.119 with df = 6 – 5 = 1, p = 0.146 (>> 0.05), i.e. combining the first two age
classes does not significantly reduce the model fit. On the other hand, if we fit a model with the
first 3 age classes combined, we get:
Test statistic                     Value        df      Prob
Pearson Chi-square              2123.472     4.000     0.000
Likelihood ratio Chi-square     1059.680     4.000     0.000
Cochran's Linear Trend          1440.214     1.000     0.000
Note that relative to the original model, we have reduced the goodness of fit by 1071.446
– 1059.680 = 11.766 with df = 6 – 4 = 2, p = 0.003. So in this case, we have significantly
reduced model fit. The best model is therefore the one that includes interactions involving only 6
age classes. The coefficients of this model (the log-linear parameters) are the lambdas
obtained by fitting a log-linear model with AGE, CAT$ and AGE*CAT$ included. Note
that because we have reduced the number of age classes, these lambdas (and therefore
the model predictions) will be somewhat different from those obtained from an analysis
in which all 7 classes were included separately.
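The two model comparisons above can be checked by collapsing the youngest age classes, recomputing the likelihood-ratio chi-square, and testing the drop in fit. The sketch below uses plain Python; `lr_chi2` and `collapse_first` are hypothetical helper names, and the p-value shortcuts used are the closed forms that hold for 1 and 2 degrees of freedom only.

```python
import math

# Rows are age classes (midpoints 15 to 48); columns are (Downs, No_Downs).
table = [
    [15, 35540], [128, 207803], [208, 253242], [194, 170776],
    [297, 85749], [240, 24258], [37, 1670],
]

def lr_chi2(table):
    """Likelihood-ratio chi-square (G) for a test of independence."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    g = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n
            if o > 0:
                g += o * math.log(o / e)
    return 2 * g

def collapse_first(table, k):
    """Merge the first k rows (youngest age classes) into a single row."""
    merged = [sum(col) for col in zip(*table[:k])]
    return [merged] + table[k:]

g_full = lr_chi2(table)                     # 7 age classes, df = 6
g_six = lr_chi2(collapse_first(table, 2))   # 6 age classes, df = 5
g_five = lr_chi2(collapse_first(table, 3))  # 5 age classes, df = 4

drop_1 = g_full - g_six    # reduction in fit, df = 1
drop_2 = g_full - g_five   # reduction in fit, df = 2

p_1 = math.erfc(math.sqrt(drop_1 / 2))  # chi-square upper tail, df = 1
p_2 = math.exp(-drop_2 / 2)             # chi-square upper tail, df = 2

print(f"combine 2 classes: drop = {drop_1:.3f}, p = {p_1:.3f}")
print(f"combine 3 classes: drop = {drop_2:.3f}, p = {p_2:.3f}")
```

The drops in fit equal the likelihood-ratio statistics of the corresponding homogeneous subtables, which is why the same values (2.119 and 11.766) appear in both places.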
The second issue is the “tailedness” of the biological hypothesis. From Fig. 1, it is clear
that risk increases with age. Because in contingency analysis, tests for independence are
always 2-tailed, it is entirely possible that we could reject the null, and still not have a
pattern that is consistent with the biological prediction. For example, if risk decreased
with age, or first increased and then decreased, we would still reject the null (of no
interaction), but neither of these patterns would be consistent with (i.e. support) our
biological hypothesis.
SYSTAT does in fact have a way of testing for linear trends across ordered (sequential)
categories (like age classes), and this is the Cochran’s test of linear trend noted above. If
we take the log10 transform of risk, and plot it as a function of age, we get:
[Figure: scatterplot of LOGRISK (y-axis, -4 to -1) against AGE2 (x-axis, 10 to 50)]
which, although not perfect, shows a strong linearity. In fact, this is exactly what Cochran’s
test is doing: it is more or less equivalent to a linear regression of the log-transformed
proportions on age (or “age2” in the above figure). So, not surprisingly, the
null hypothesis of no linear trend is strongly rejected, and we conclude that in fact risk
does increase with maternal age.
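For readers who want to see the arithmetic behind a trend test of this kind, here is a minimal sketch of one common formulation, the Cochran-Armitage test for a linear trend in proportions. It is not taken from SYSTAT's documentation: the function name `cochran_trend` is invented, and the use of equally spaced integer scores (1, 2, 3, ...) for the ordered age classes is an assumption, chosen because it appears to reproduce the values reported in the tables above.

```python
import math

downs = [15, 128, 208, 194, 297, 240, 37]
births = [35555, 207931, 253450, 170970, 86046, 24498, 1707]

def cochran_trend(events, totals, scores=None):
    """Cochran-Armitage chi-square (df = 1) for a linear trend in proportions.

    scores defaults to 1, 2, ..., k (equally spaced), which is an assumption.
    """
    if scores is None:
        scores = range(1, len(totals) + 1)
    scores = list(scores)
    n = sum(totals)
    p = sum(events) / n  # overall proportion under the null hypothesis
    # Score-weighted sum of (observed - expected) event counts.
    t = sum(s * (r - m * p) for s, r, m in zip(scores, events, totals))
    # Variance of t under the null hypothesis of no trend.
    sx = sum(m * s for s, m in zip(scores, totals))
    sxx = sum(m * s * s for s, m in zip(scores, totals))
    var_t = p * (1 - p) * (sxx - sx * sx / n)
    return t * t / var_t

trend_all = cochran_trend(downs, births)            # all 7 age classes
trend_first3 = cochran_trend(downs[:3], births[:3])  # first 3 age classes

print(f"all classes:   chi-square = {trend_all:.3f}")
print(f"first 3 only:  chi-square = {trend_first3:.3f}")
# Upper-tail probability for df = 1.
print(f"p (first 3) = {math.erfc(math.sqrt(trend_first3 / 2)):.3f}")
```

Passing the class midpoints as `scores` instead of the defaults gives a somewhat different statistic, so the choice of scores should be stated whenever a trend test is reported.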