5719 - Emerson Statistics

Biost 518 / 515, Winter 2015
Homework #5
February 20, 2014, Page 1 of 9
Biost 518: Applied Biostatistics II
Biost 515: Biostatistics II
Emerson, Winter 2015
Homework #5
February 20, 2015
Written problems: To be submitted as an MS-Word compatible file to the class Catalyst dropbox by 9:30 am
on Friday, February 27, 2015. See the instructions for peer grading of the homework that are posted on the
web pages.
On this (as all homeworks) Stata / R code and unedited Stata / R output is TOTALLY unacceptable.
Instead, prepare a table of statistics gleaned from the Stata output. The table should be appropriate
for inclusion in a scientific report, with all statistics rounded to a reasonable number of significant
digits. (I am interested in how statistics are used to answer the scientific question.)
Unless explicitly told otherwise in the statement of the problem, in all problems requesting
“statistical analyses” (either descriptive or inferential), you should present both
1. Methods: A brief sentence or paragraph describing the statistical methods you used. This
should be using wording suitable for a scientific journal, though it might be a little more
detailed. A reader should be able to reproduce your analysis. DO NOT PROVIDE Stata
OR R CODE.
2. Inference: A paragraph providing full statistical inference in answer to the question.
Please see the supplementary document relating to “Reporting Associations” for details.
All problems of the homework relate to the clinical trial of DFMO and suppression of polyamines. In this
homework I ask you to sometimes use dummy variables to analyze the data. There are two approaches to
performing this analysis:
o Manually create indicator variables dose0, dose075, dose200, dose400 that are 1 if the
subject received the corresponding dose, and 0 otherwise. (You will only need to include three of
these variables in any particular model, because the intercept will model the remaining group.)
o Use Stata’s facility to automatically create such dummy variables in a regression analysis by
prefixing a variable name with “i.”:
o To use this feature, you must have a variable with only integer values. So you might want to
create a variable g doseint= 1000 * dose
o Then, to perform an analysis of mean spermidine at 12 months across dose groups using
dose 0 group as reference: regress spd12 i.doseint, robust
o To perform an analysis of mean spermidine at 12 months across dose groups using dose
0.075 group as reference: regress spd12 ib75.doseint, robust
o (In R you can use the dummy( ) function to do these same things.)
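The indicator-variable approach described above can also be sketched outside of Stata. The Python fragment below is only an illustration, assuming doses are stored as plain numbers; the function names `dose_indicators` and `dose_int` and the example `doses` list are hypothetical and are not part of the course materials.

```python
# Hypothetical illustration: 0/1 indicators for each DFMO dose level.
DOSE_LEVELS = {"dose0": 0.0, "dose075": 0.075, "dose200": 0.2, "dose400": 0.4}

def dose_indicators(dose):
    """Return a dict with one 0/1 indicator per dose level."""
    return {name: int(abs(dose - level) < 1e-9)
            for name, level in DOSE_LEVELS.items()}

def dose_int(dose):
    """Integer coding analogous to doseint = 1000 * dose."""
    return round(1000 * dose)

doses = [0.0, 0.075, 0.2, 0.4]  # made-up example: one subject per dose group
for d in doses:
    print(d, dose_int(d), dose_indicators(d))
```

Only three of the four indicators would enter a regression model; the omitted one defines the reference group whose mean is carried by the intercept.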
1. Provide suitable descriptive statistics.
Table 1. Baseline Descriptive Statistics by DFMO Dose Groups and Total
Difluoromethylornithine (DFMO) Dose (in g/sq m/day). Continuous variables are shown as mean (standard deviation; minimum – maximum).

Variable (missing*)         0 (n=32)                  0.075 (n=29)              0.2 (n=25)                0.4 (n=28)                All Doses (n=114)
Age (1)                     65.9 (8.51; 45.5 – 77.2)  61.35 (7.7; 47.8 – 76.9)  62.83 (8.3; 45.4 – 77.6)  63.9 (7.8; 48.5 – 81)     63.6 (8.16; 45.4 – 81)
Female                      18.8%                     17.2%                     0%                        21.4%                     14.9%
Putrescine (µm/mg protein)  0.66 (0.44; .061 – 1.98)  0.65 (.52; 0.009 – 2.59)  0.61 (0.42; 0 – 1.963)    0.65 (0.57; 0 – 2.3)      0.649 (0.49; 0 – 2.59)
Spermidine (µm/mg protein)  3.26 (1.45; 1.4 – 7.05)   3.47 (1.6; 1.5 – 7.02)    3.35 (1.33; 1.7 – 6.22)   3.56 (1.88; .66 – 7.6)    3.41 (1.6; 0.66 – 7.6)
Spermine (µm/mg protein)    8.22 (5.5; 1.46 – 35.5)   8.43 (5.86; 4.13 – 37.7)  9.02 (7.04; 2.54 – 41.7)  8.08 (5.5; 2.28 – 34.04)  8.42 (5.9; 1.46 – 41.68)
*No indication means no missing values
Descriptive statistics are provided here for variables age, putrescine, spermidine, and spermine at
baseline, as continuous variables (mean, standard deviation, and range). The “female” variable is
listed as the proportion of females for the listed group. We take note that there are no missing values
except for one value for the age variable. Descriptive statistics are listed across the four DFMO
dosage groups, namely, 0, 0.075, 0.2, and 0.4 g/sq m/day. We see that the age is higher at dosage
zero, then drops at 0.075 and climbs through the highest dose level of 0.4. Notably, there is
imbalance in the distribution of females across dose groups (zero females exist in the 0.2 g/sq m/day
dosage group).
2. For each of the following models, provide inference (P values, and where appropriate, 95%
confidence intervals with scientific interpretation of the parameters) regarding the effect of
DFMO on the mucosal spermidine levels after 12 months of treatment. (Recall that when
multiple modeled covariates are derived from the same scientific factor, you need to test
all those covariates simultaneously. When no other covariates are in the model, the
“overall F” or “overall chi squared” test can do this for us.) Note that part h asks you to
provide a table of predicted values for each of these models.
a. Model dose as dummy variables using the dose 0 group as the reference group.
Methods: We use multiple linear regression with dummy variables created with the “i.” prefix in Stata
and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: With dose modeled as dummy variables and dose 0 (placebo) as the reference
group, and with all results referring to spermidine after 12 months of treatment, the estimated mean
spermidine level in the placebo group is 3.255 µm/g protein (p < 0.001). The 0.075 g/sq m/day group is
estimated to have a mean spermidine level 0.3363 µm/g protein lower than placebo (not statistically
significant, p = 0.291; unadjusted CI: 0.9649 lower to 0.2924 higher µm/g protein). The 0.2 g/sq m/day
group is estimated to have a mean spermidine level 0.5443 µm/g protein lower than placebo (not
statistically significant, p = 0.169; unadjusted CI: 1.324 lower to 0.2358 higher µm/g protein). The 0.4
g/sq m/day group is estimated to have a mean spermidine level 1.306 µm/g protein lower than placebo
(p < 0.001; unadjusted CI: 1.914 to 0.6983 lower µm/g protein). The overall F-test is significant
(p = 0.0001), so we can reject the null hypothesis that mean spermidine at 12 months is the same in all
four dose groups.
b. Model dose as dummy variables using the dose 0.075 group as the reference group.
You do not have to provide a formal description of the methods or inference for this
part. Instead comment on how the regression parameters from this model relate to
those obtained in part a. Suppose we were to completely ignore the major multiple
comparison issues and to instead trust the individual p values listed in the coefficient
table, what conclusions would we reach about differences among the dose groups in
part a vs in part b?
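The two models are reparameterizations of one another: the fitted group means and the overall F-test
are exactly the same as those obtained in part a, but each coefficient is now a difference from the
0.075 group rather than from the placebo group. Ignoring major multiple comparison issues and trusting
the individual p-values listed in the coefficient table, we would conclude equality.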
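To illustrate how the choice of reference group changes the coefficients without changing the fitted values, here is a minimal Python sketch. It assumes only that, in a model containing nothing but dose dummy variables, the least-squares fitted values are the group means; the means used are the rounded model A values from Table 2, and the helper names `dummy_coefficients` and `fitted` are hypothetical.

```python
# Rounded 12-month fitted group means (Table 2, model A), keyed by dose.
means = {0.0: 3.255, 0.075: 2.919, 0.2: 2.711, 0.4: 1.949}

def dummy_coefficients(group_means, reference):
    """Intercept and dummy coefficients implied by a chosen reference group."""
    intercept = group_means[reference]
    slopes = {g: m - intercept for g, m in group_means.items() if g != reference}
    return intercept, slopes

def fitted(intercept, slopes, reference, group):
    """Fitted mean for a group under the given parameterization."""
    return intercept if group == reference else intercept + slopes[group]

int_a, coef_a = dummy_coefficients(means, 0.0)     # part a: dose 0 as reference
int_b, coef_b = dummy_coefficients(means, 0.075)   # part b: dose 0.075 as reference

for g in means:
    print(g, round(fitted(int_a, coef_a, 0.0, g), 3),
          round(fitted(int_b, coef_b, 0.075, g), 3))
```

Under these assumptions the part b coefficient for any group equals its part a coefficient minus the part a coefficient for the 0.075 group, while both parameterizations return the same fitted mean for every dose.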
c. Model dose continuously as a linear predictor.
Methods: We use linear regression with dose modeled continuously as a linear predictor
and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: With dose modeled as a linear predictor, and with all results referring to spermidine after
12 months of treatment, the mean spermidine level tends to be 3.125 µm/g protein lower (p < 0.001; CI:
4.504 to 1.747 µm/g protein lower) for each 1 g/sq m/day higher dose. The overall F-test is
significant (p < 0.0001).
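As a rough check on how the linear model translates into fitted group means, the following Python sketch combines the reported slope with the fitted value at dose 0 from Table 2 (model C). Both inputs are rounded, so the results should only agree with the table to about the third decimal place; the function name `fitted_linear` is hypothetical.

```python
INTERCEPT = 3.234   # fitted mean at dose 0 (Table 2, model C), rounded
SLOPE = -3.125      # reported change in mean spermidine per 1 g/sq m/day

def fitted_linear(dose):
    """Fitted mean spermidine (µm/g protein) at a given dose under the linear model."""
    return INTERCEPT + SLOPE * dose

for dose in (0.0, 0.075, 0.2, 0.4):
    print(f"dose {dose}: fitted mean {fitted_linear(dose):.3f}")

# The slope is stated per 1 g/sq m/day, which is beyond the largest dose studied (0.4).
# A 0.1 g/sq m/day difference corresponds to one tenth of the slope.
print(f"difference per 0.1 g/sq m/day: {SLOPE * 0.1:.4f}")
```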
d. Model dose as two variables: a continuous linear predictor along with a quadratic
term (so an additional predictor equal to the square of dose).
Methods: We use multiple linear regression with dose as a continuous linear predictor plus a quadratic
term (dose squared) and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: The estimated coefficient for the linear dose term is -2.58 (CI: -9.366 to 4.205; p = 0.452),
and the estimated coefficient for the quadratic term is -1.35 (CI: -16.75 to 14.04; p = 0.861). Neither
term is individually statistically significant. The overall F-test is significant (p < 0.001).
e. Model dose as a binary variable indicating whether dose was greater than 0.
Methods: We use linear regression with a generated binary variable indicating whether dose was
greater than zero and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: Comparing the placebo group with the pooled groups receiving any dose above zero,
mean spermidine at 12 months is 0.691 µm/g protein lower in the treated groups than in the placebo
group (CI: 1.227 to 0.1553 µm/g protein lower; p = 0.012).
f. Model dose as two variables: a binary variable indicating whether dose was greater
than 0 and a continuous linear term.
Methods: We use multiple linear regression with a generated binary variable indicating whether dose
was greater than zero together with a continuous linear dose term and report the corresponding fitted
values with CIs and p-values where pertinent.
Inference: The estimated coefficient for the continuous linear dose term is -3.017 (CI: -4.614 to
-1.421; p < 0.001). The binary indicator has an estimated coefficient of -0.0538 (CI: -0.7301 to
0.6225; p = 0.875), which is not statistically significant.
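The way the two dose terms combine into a fitted value can be sketched in Python as follows. This is an illustration only, assuming the intercept equals the fitted value at dose 0 reported for model F in Table 2; all inputs are rounded, so agreement with the table is approximate, and the function name `fitted_f` is hypothetical.

```python
INTERCEPT = 3.255   # fitted mean at dose 0 (Table 2, model F), rounded
TREATED = -0.0538   # reported coefficient for the indicator "dose > 0"
SLOPE = -3.017      # reported coefficient for the linear dose term

def fitted_f(dose):
    """Fitted mean spermidine (µm/g protein) with an indicator plus a linear term."""
    treated = 1 if dose > 0 else 0
    return INTERCEPT + TREATED * treated + SLOPE * dose

for dose in (0.0, 0.075, 0.2, 0.4):
    print(f"dose {dose}: fitted mean {fitted_f(dose):.3f}")
```

The indicator allows a jump between placebo and any treatment, and the linear term then describes the trend among all doses; here the estimated jump is small relative to the slope.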
g. Model dose as three variables: a continuous linear predictor, a quadratic term, and a
cubic term (a term equal to dose raised to the third power).
Methods: We use multiple linear regression with a continuous linear predictor, a generated quadratic
term, and a generated cubic term. We report the corresponding fitted values with CIs and p-values
where pertinent.
Inference: The individual p-values for the three dose terms are all non-significant.
The overall F-test is significant (p = 0.0001).
h. Provide a table of the fitted values for each dose group from the above models.
Comment on the similarities / differences between those fitted values (and the
descriptive statistics).
Table 2. Fitted values for each dose group from above models
Values (µm/g protein)*

DFMO Dose       Sample Means
(g/sq m/day)    (baseline)      A       B       C       D       E       F       G
0               3.26            3.255   3.255   3.234   3.212   3.255   3.255   3.255
0.075           3.47            2.919   2.919   2.999   3.011   2.564   2.975   2.919
0.2             3.35            2.711   2.711   2.609   2.642   2.564   2.598   2.711
0.4             3.56            1.949   1.949   1.983   1.963   2.564   1.995   1.949
*all values are for spermidine at 12 months except for sample mean column
At dose 0, the baseline sample mean in the first “sample mean” column (3.26) is nearly
identical to the fitted 12-month values from models A and B (3.255). The two diverge with
increasing dose. Models A and B give identical fitted values, as does Model G. Model E
returns the same fitted value (2.564) at every dose above 0, which is a consequence of
pooling the treated groups into a single indicator variable. Models C and D give very
similar values.
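One way to see why some models share identical fitted values: with four dose groups, any model that spends four free parameters on dose (an intercept plus three dummy variables, or an intercept plus linear, quadratic, and cubic terms) is saturated, so its least-squares fitted values are the group means. The Python sketch below illustrates this with the cubic that passes through the four model A values from Table 2; the function name `cubic_through_points` is hypothetical, and Lagrange interpolation is used here only as a convenient way to evaluate that cubic, not as the method used for the homework.

```python
doses = [0.0, 0.075, 0.2, 0.4]
means = [3.255, 2.919, 2.711, 1.949]  # model A fitted values from Table 2

def cubic_through_points(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

for d, m in zip(doses, means):
    print(d, m, round(cubic_through_points(doses, means, d), 3))
```

A cubic has enough flexibility to hit all four group means exactly, so it leaves no residual between-group variation for the dummy-variable model to improve on.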
3. Repeat the analyses in problem 2 adjusting for the baseline mucosal spermidine levels.
(Note that the Stata functions "test" and "testparm" can be used to perform Wald tests of
multiple parameters adjusted for other covariates.) You do not need to consider the
descriptive statistics or the fitted values for this problem.
a. Model dose as dummy variables using the dose 0 group as the reference group.
Methods: We use multiple linear regression with dummy variables created with the “i.” prefix in Stata,
adjusting for baseline spermidine, and report the corresponding fitted values with CIs and p-values
where pertinent.
Inference: With dose modeled as dummy variables and dose 0 (placebo) as the reference
group, and with all results referring to spermidine after 12 months of treatment adjusted for baseline
spermidine, the estimated intercept is 2.666 µm/g protein (p < 0.001); in this adjusted model the
intercept corresponds to a placebo subject with a baseline spermidine level of 0. The 0.075 g/sq m/day
group is estimated to have a mean spermidine level 0.3424 µm/g protein lower than placebo (not
statistically significant, p = 0.252; unadjusted CI: 0.9324 lower to 0.248 higher µm/g protein). The 0.2
g/sq m/day group is estimated to have a mean spermidine level 0.5411 µm/g protein lower than placebo
(not statistically significant, p = 0.16; unadjusted CI: 1.298 lower to 0.216 higher µm/g protein). The
0.4 g/sq m/day group is estimated to have a mean spermidine level 1.379 µm/g protein lower than
placebo (p < 0.001; unadjusted CI: 2.01 to 0.7484 lower µm/g protein). The overall F-test is significant
(p = 0.0002), so we can reject the null hypothesis.
b. Model dose as dummy variables using the dose 0.075 group as the reference group.
You do not have to provide a formal description of the methods or inference for this
part. Instead comment on how the regression parameters from this model relate to
those obtained in part a. Suppose we were to completely ignore the major multiple
comparison issues and to instead trust the individual p values listed in the coefficient
table, what conclusions would we reach about differences among the dose groups in
part a vs in part b?
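The two models are reparameterizations of one another: the fitted values and the overall F-test are
exactly the same as those obtained in part a, but each dose coefficient is now a difference from the
0.075 group rather than from the placebo group. Ignoring major multiple comparison issues and trusting
the individual p-values listed in the coefficient table, we would conclude equality. Here we adjust for
baseline spermidine.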
c. Model dose continuously as a linear predictor.
Methods: We use multiple linear regression with dose modeled continuously as a linear predictor,
adjusting for baseline spermidine, and report the corresponding fitted values with CIs and p-values
where pertinent.
Inference: With dose modeled as a linear predictor, and with all results referring to spermidine after
12 months of treatment, the mean spermidine level tends to be 3.29 µm/g protein lower (p < 0.001; CI:
4.722 to 1.859 µm/g protein lower) for each 1 g/sq m/day higher dose.
d. Model dose as two variables: a continuous linear predictor along with a quadratic
term (so an additional predictor equal to the square of dose).
Methods: We use multiple linear regression with dose as a continuous linear predictor plus a quadratic
term, adjusting for baseline spermidine, and report the corresponding fitted values with CIs and
p-values where pertinent.
Inference: The estimated coefficient for the linear dose term is -2.41 (CI: -9.019 to 4.200; p = 0.471),
and the estimated coefficient for the quadratic term is -2.19 (CI: -17.45 to 13.06; p = 0.776). Neither
term is individually statistically significant. The overall F-test is significant (p = 0.0001).
e. Model dose as a binary variable indicating whether dose was greater than 0.
Methods: We use multiple linear regression with a generated binary variable indicating whether dose
was greater than zero, adjusting for baseline spermidine, and report the corresponding fitted values
with CIs and p-values where pertinent.
Inference: Comparing the placebo group with the pooled groups receiving any dose above zero,
mean spermidine at 12 months is 0.711 µm/g protein lower in the treated groups than in the placebo
group (CI: 1.254 to 0.1679 µm/g protein lower; p = 0.011). The overall F-test p-value is 0.0072.
f. Model dose as two variables: a binary variable indicating whether dose was greater
than 0 and a continuous linear term.
Methods: We use multiple linear regression with a generated binary variable indicating whether dose
was greater than zero together with a continuous linear dose term, adjusting for baseline spermidine,
and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: The estimated coefficient for the continuous linear dose term is -3.225 (CI: -4.898 to
-1.553; p < 0.001). The binary indicator has an estimated coefficient of -0.03256 (CI: -0.676 to
0.611; p = 0.920), which is not statistically significant. The overall F-test p-value is 0.0001.
g. Model dose as three variables: a continuous linear predictor, a quadratic term, and a
cubic term (a term equal to dose raised to the third power).
Methods: We use multiple linear regression with a continuous linear predictor, a generated quadratic
term, and a generated cubic term, adjusting for baseline spermidine. We report the corresponding
fitted values with CIs and p-values where pertinent.
Inference: The individual p-values for the three dose terms are all non-significant.
The overall F-test is significant (p = 0.0002).
h. Provide a table of the fitted values for each dose group from the above models.
Comment on the similarities / differences between those fitted values (and the
descriptive statistics).
Table 3. Fitted values for each dose group from above models
Values (µm/g protein)*

DFMO Dose       Sample Means
(g/sq m/day)    (baseline)      A       B       C       D       E       F       G
0               3.26            3.249   3.249   3.236   3.202   3.250   3.249   3.249
0.075           3.47            2.945   2.945   3.027   3.046   2.572   3.0125  2.945
0.2             3.35            2.724   2.724   2.594   2.647   2.553   2.587   2.724
0.4             3.56            1.924   1.924   1.973   1.940   2.586   1.980   1.9246
*all values are for spermidine at 12 months except for sample mean column
At dose 0, the baseline sample mean in the first “sample mean” column (3.26) is nearly
identical to the fitted 12-month values from models A and B (3.249). The two diverge with
increasing dose. Models A and B give identical fitted values. Model E returns nearly the
same fitted value at every dose above 0, which is a consequence of pooling the treated
groups into a single indicator variable. Models C and D give very similar values.
4. For each of the following models, provide inference (P values, and where appropriate, 95%
confidence intervals with scientific interpretation of the parameters) regarding the effect of
DFMO on the odds of decreased spermidine levels after 12 months of treatment (i.e., a
lower spermidine level at 12 months than at baseline). Note that in part g you are asked to
provide a table of predicted values for the odds of decreased spermidine as well as the
probability of decreased spermidine for each of these models.
a. Model dose as dummy variables.
Methods: We use logistic regression with dummy variables created with the “i.” prefix in Stata and
report the corresponding fitted values with CIs and p-values where pertinent.
Inference: Modeling dose as dummy variables, the differences among the dose groups in the odds of
decreased spermidine after 12 months of treatment are not greater than what we would expect to
observe if there were no true effect of DFMO dose (p = 0.1594). The estimated odds of decreased
spermidine in the 0.075 g/sq m/day group are 1.846 times those in the placebo group (CI: 0.62 to
5.49), and the corresponding odds ratios for the 0.2 and 0.4 g/sq m/day groups are 1.875 (CI: 0.588
to 5.97) and 4.615 (CI: 1.22 to 17.46), respectively.
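The link between fitted probabilities, odds, and the odds ratios quoted above can be sketched in Python. The probabilities are the model A values from Table 4; the function names `odds` and `odds_ratio` are hypothetical helpers introduced only for this illustration.

```python
# Fitted probabilities of decreased spermidine by dose (Table 4, model A).
probs = {0.0: 0.4642857, 0.075: 0.6153846, 0.2: 0.6190476, 0.4: 0.8}

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def odds_ratio(p, p_ref):
    """Odds ratio comparing probability p with a reference probability p_ref."""
    return odds(p) / odds(p_ref)

for dose, p in probs.items():
    print(f"dose {dose}: odds {odds(p):.4f}, "
          f"odds ratio vs placebo {odds_ratio(p, probs[0.0]):.3f}")
```

An odds ratio of 1.846 means the odds are 1.846 times as large as in the placebo group, which is not the same as the probability being 1.846 times as large.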
b. Model dose continuously as a linear predictor.
Methods: We use logistic regression with dose as a continuous linear predictor and a generated
binary outcome indicating decreased spermidine, and report the corresponding fitted values with CIs
and p-values where pertinent.
Inference: Modeling dose as a linear predictor, the estimated odds ratio for decreased spermidine
after 12 months of treatment is quite large (30.85; CI: 1.399 to 682.05) for each 1 g/sq m/day
difference in dose. This association is larger than would reasonably be expected if there were no
true effect (P = 0.0299).
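Because the odds ratio of 30.85 refers to a 1 g/sq m/day difference, which is larger than any dose studied, it can help to rescale it. In a logistic model that is linear in dose, the odds ratio for a difference of delta is the per-unit odds ratio raised to the power delta. The Python sketch below assumes the fitted odds at dose 0 from Table 5 (model B); the function names are hypothetical and the inputs are rounded.

```python
OR_PER_UNIT = 30.85          # reported odds ratio per 1 g/sq m/day
ODDS_AT_ZERO = 0.973133031   # fitted odds at dose 0 (Table 5, model B)

def odds_ratio_for_difference(delta):
    """Odds ratio for a dose difference of `delta` g/sq m/day."""
    return OR_PER_UNIT ** delta

def fitted_odds(dose):
    """Fitted odds of decreased spermidine at a given dose."""
    return ODDS_AT_ZERO * odds_ratio_for_difference(dose)

def fitted_probability(dose):
    """Fitted probability of decreased spermidine at a given dose."""
    o = fitted_odds(dose)
    return o / (1.0 + o)

print(f"odds ratio per 0.1 g/sq m/day: {odds_ratio_for_difference(0.1):.3f}")
for dose in (0.0, 0.075, 0.2, 0.4):
    print(f"dose {dose}: odds {fitted_odds(dose):.3f}, "
          f"probability {fitted_probability(dose):.3f}")
```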
c. Model dose as two variables: a continuous linear predictor along with a quadratic
term (so an additional predictor equal to the square of dose).
Methods: We use logistic regression and model dose as two variables, one as a continuous linear
predictor and another as a quadratic term with the generated variable addressing the decreased
spermidine levels, and report out on fitted values with CI and p-values where pertinent.
Inference: In investigation of the odds for decreased spermidine levels after 12 months of treatment,
we see that when modeling the doses as a linear predictor with a quadratic term, what is detected
here is not outside of what is reasonably expected to be observed if there is no true effect
(P=0.0931).
d. Model dose as a binary variable indicating whether dose was greater than 0.
Methods: We use logistic regression and model dose as an indicator of whether it was greater than
zero and examine the effect on odds of spermidine decrease, and report out on fitted values with CI
and p-values where pertinent.
Inference: In investigation of the odds for decreased spermidine levels after 12 months of treatment,
we see that when modeling the doses as an indicator variable, what is detected here is not outside of
what is reasonably expected to be observed if there is no true effect (P=0.0661).
e. Model dose as two variables: a binary variable indicating whether dose was greater
than 0 and a continuous linear term.
Methods: We use logistic regression and model dose as an indicator of whether it was greater than
zero alongside the continuous linear term and examine the effect on odds of spermidine decrease,
and report out on fitted values with CI and p-values where pertinent.
Inference: In investigation of the odds for decreased spermidine levels after 12 months of treatment,
we see that when modeling the doses as an indicator variable alongside the continuous linear term,
what is detected here is not outside of what is reasonably expected to be observed if there is no true
effect (P=0.0554).
f. Model dose as three variables: a continuous linear predictor, a quadratic term, and a
cubic term (a term equal to dose raised to the third power).
Methods: We use logistic regression and model dose as three variables (a continuous linear
predictor, a quadratic term, and a cubic term) with a generated binary outcome indicating decreased
spermidine, and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: Modeling dose as a linear predictor with quadratic and cubic terms, the association with
the odds of decreased spermidine after 12 months of treatment is not outside of what is reasonably
expected to be observed if there is no true effect (P = 0.0992).
g. Provide a table of the fitted values for each dose group from the above models.
Comment on the similarities / differences between those fitted values (and the
descriptive statistics).
Table 4. Fitted probabilities of decreased spermidine for each dose group from above models

DFMO Dose
(g/sq m/day)    A            B            C            D            E            F
0               0.4642857    0.4931918    0.4910965    0.4642857    0.4642857    0.4642857
0.075           0.6153846    0.5572381    0.5585266    0.6716418    0.5924772    0.6153846
0.2             0.6190476    0.65895      0.6619449    0.6716418    0.6651351    0.6190476
0.4             0.8          0.7932245    0.791338     0.6716418    0.7813877    0.8

Table 5. Fitted odds of decreased spermidine for each dose group from above models (odds = probability / (1 - probability))

DFMO Dose
(g/sq m/day)    A              B              C              D              E              F
0               0.866666617    0.973133031    0.965009083    0.866666617    0.866666617    0.866666617
0.075           1.599999896    1.25855025     1.265142135    2.045454629    1.453850435    1.599999896
0.2             1.624999869    1.93212139     1.958097659    2.045454629    1.986278944    1.624999869
0.4             4              3.836162892    3.792439448    2.045454629    3.574308033    4
These fitted values all correspond with the models above. Where sample means are not linear here,
the models would not lie on a straight line. Where we have the pooled treatment variable operating
(Model D), we should see identical values in the dosage groups above 0, and we do.
5. Which of the above analyses would you prefer a priori to test for an effect of DFMO on
mucosal levels of polyamines?
Based on the material thus far in this course, including homework keys from previous iterations of this class
(including the HW #4 key from Biostat 518 in 2007), I conclude that adjusting for baseline is important in an
RCT, as it is here. We therefore eliminate the analyses from problem 2 as our preference, since they do not
adjust for baseline; that adjustment is only introduced in problem 3. Importantly, we would only act on
judgments made before looking at the data. The freedom to look for multiple types of trends
simultaneously is desirable in a model, so models that look only for a first-order trend may be inadequate in
this regard. Therefore, a model of the effect of DFMO dose on mucosal levels of polyamines might include
both an indicator variable for treatment and a continuous linear term for dose. Other terms, such as
quadratic and cubic terms, would depend on a priori knowledge about dose-response relationships and
the biological underpinnings, which may or may not suggest a non-linear relationship.