Biost 518 / 515, Winter 2015 Homework #5 February 20, 2014, Page 1 of 9

Biost 518: Applied Biostatistics II
Biost 515: Biostatistics II
Emerson, Winter 2015

Homework #5
February 20, 2015

Written problems: To be submitted as a MS-Word compatible file to the class Catalyst dropbox by 9:30 am on Friday, February 27, 2015. See the instructions for peer grading of the homework that are posted on the web pages.

On this (as all homeworks) Stata / R code and unedited Stata / R output is TOTALLY unacceptable. Instead, prepare a table of statistics gleaned from the Stata output. The table should be appropriate for inclusion in a scientific report, with all statistics rounded to a reasonable number of significant digits. (I am interested in how statistics are used to answer the scientific question.)

Unless explicitly told otherwise in the statement of the problem, in all problems requesting “statistical analyses” (either descriptive or inferential), you should present both

1. Methods: A brief sentence or paragraph describing the statistical methods you used. This should be using wording suitable for a scientific journal, though it might be a little more detailed. A reader should be able to reproduce your analysis. DO NOT PROVIDE Stata OR R CODE.
2. Inference: A paragraph providing full statistical inference in answer to the question. Please see the supplementary document relating to “Reporting Associations” for details.

All problems of the homework relate to the clinical trial of DFMO and suppression of polyamines.

In this homework I ask you to sometimes use dummy variables to analyze the data. There are two approaches to performing this analysis:

o Manually create indicator variables dose0, dose075, dose200, dose400 that are 1 if the subject received the corresponding dose, and 0 otherwise. (You will only need to include three of these variables in any particular model, because the intercept will model the remaining group.)
o Use Stata’s facility to automatically create such dummy variables in a regression analysis by prefixing a variable name with “i.”:
    o To use this feature, you must have a variable with only integer values. So you might want to create a variable
          g doseint = 1000 * dose
    o Then, to perform an analysis of mean spermidine at 12 months across dose groups using dose 0 group as reference:
          regress spd12 i.doseint, robust
    o To perform an analysis of mean spermidine at 12 months across dose groups using dose 0.075 group as reference:
          regress spd12 ib75.doseint, robust
    o (In R you can use the dummy( ) function to do these same things.)

1. Provide suitable descriptive statistics.

Table 1. Descriptive Statistics by DFMO Dose Groups and Total (Baseline)

                            Difluoromethylornithine (DFMO) Dose (in g/sq m/day)
Variables (missing*)        0 (n=32)                   0.075 (n=29)               0.2 (n=25)                 0.4 (n=28)                 All Doses (n=114)
Age (1)                     65.9 (8.51; 45.5 – 77.2)   61.35 (7.7; 47.8 – 76.9)   62.83 (8.3; 45.4 – 77.6)   63.9 (7.8; 48.5 – 81)      63.6 (8.16; 45.4 – 81)
Female                      18.8%                      17.2%                      0%                         21.4%                      14.9%
Putrescine (µm/mg protein)  0.66 (0.44; 0.061 – 1.98)  0.65 (0.52; 0.009 – 2.59)  0.61 (0.42; 0 – 1.963)     0.65 (0.57; 0 – 2.3)       0.649 (0.49; 0 – 2.59)
Spermidine (µm/mg protein)  3.26 (1.45; 1.4 – 7.05)    3.47 (1.6; 1.5 – 7.02)     3.35 (1.33; 1.7 – 6.22)    3.56 (1.88; 0.66 – 7.6)    3.41 (1.6; 0.66 – 7.6)
Spermine (µm/mg protein)    8.22 (5.5; 1.46 – 35.5)    8.43 (5.86; 4.13 – 37.7)   9.02 (7.04; 2.54 – 41.7)   8.08 (5.5; 2.28 – 34.04)   8.42 (5.9; 1.46 – 41.68)

*No indication means no missing values

Descriptive statistics are provided here for the variables age, putrescine, spermidine, and spermine at baseline as continuous variables (mean, standard deviation, and range). The “female” variable is listed as the percentage of females in each group. We note that there are no missing values except for one value of the age variable.
Descriptive statistics are listed across the four DFMO dosage groups, namely 0, 0.075, 0.2, and 0.4 g/sq m/day. Mean age is highest in the dose 0 group, drops at 0.075, and then climbs through the highest dose level of 0.4. Notably, there is an imbalance in the distribution of females across dose groups (there are no females in the 0.2 g/sq m/day dosage group).

2. For each of the following models, provide inference (P values, and where appropriate, 95% confidence intervals with scientific interpretation of the parameters) regarding the effect of DFMO on the mucosal spermidine levels after 12 months of treatment. (Recall that when multiple modeled covariates are derived from the same scientific factor, you need to test all those covariates simultaneously. When no other covariates are in the model, the “overall F” or “overall chi squared” test can do this for us.) Note that part h asks you to provide a table of predicted values for each of these models.

a. Model dose as dummy variables using the dose 0 group as the reference group.

Methods: We use multiple linear regression with dummy variables, created with the “i.” prefix in Stata, and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: With dose modeled as dummy variables and dose 0 (placebo) as the reference group, and after 12 months of treatment (as for all following results), the estimated mean spermidine level in the placebo group is 3.255 µm/g protein (p < 0.001). The dose of 0.075 g/sq m/day is estimated to have a mean spermidine level 0.3363 µm/g protein lower than placebo (p = 0.291, not statistically significant; unadjusted 95% CI: 0.9649 lower to 0.2924 higher µm/g protein). The dose of 0.2 g/sq m/day is estimated to have a mean spermidine level 0.5443 µm/g protein lower than placebo (p = 0.169, not statistically significant; unadjusted 95% CI: 1.324 lower to 0.2358 higher µm/g protein).
The dose of 0.4 g/sq m/day is estimated to have a mean spermidine level 1.306 µm/g protein lower than placebo (p < 0.001; unadjusted 95% CI: 1.914 to 0.6983 µm/g protein lower). The overall F-test is significant (p = 0.0001), so we can reject the null hypothesis of no difference in mean spermidine level across the dose groups.

b. Model dose as dummy variables using the dose 0.075 group as the reference group. You do not have to provide a formal description of the methods or inference for this part. Instead comment on how the regression parameters from this model relate to those obtained in part a. Suppose we were to completely ignore the major multiple comparison issues and to instead trust the individual p values listed in the coefficient table, what conclusions would we reach about differences among the dose groups in part a vs in part b?

The fitted group means from this model are exactly the same as those obtained in part a. Only the parameterization changes: each coefficient is now a difference from the 0.075 group rather than from the dose 0 group. Ignoring major multiple comparison issues and trusting the individual p-values listed in the coefficient table, we would conclude equality.

c. Model dose continuously as a linear predictor.

Methods: We use multiple linear regression with dose modeled continuously as a linear predictor and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: With dose modeled as a linear predictor, and after 12 months of treatment (as for all following results), the mean spermidine level tends to decrease by 3.125 µm/g protein for each 1 g/sq m/day increase in dose (p < 0.001; 95% CI: decrease of 4.504 to 1.747 µm/g protein). The overall F-test is significant (p < 0.0001).

d. Model dose as two variables: a continuous linear predictor along with a quadratic term (so an additional predictor equal to the square of dose).

Methods: We use multiple linear regression with a continuous linear predictor and a quadratic term and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: In this analysis the estimated coefficient for the continuous linear dose term is -2.58 µm/g protein per unit of dose (95% CI: -9.366 to 4.205; p = 0.452, not statistically significant). The estimated coefficient for the quadratic term is -1.35 µm/g protein per squared unit of dose (95% CI: -16.75 to 14.04; p = 0.861, not statistically significant). The overall F-test is significant (p < 0.001).

e. Model dose as a binary variable indicating whether dose was greater than 0.

Methods: We use multiple linear regression with a newly generated binary variable indicating whether dose was greater than zero and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: When we compare the placebo group to the pooled groups receiving a dose above zero, the estimated mean spermidine level in the treated groups is 0.691 µm/g protein lower than in the control group (95% CI: 1.227 to 0.1553 µm/g protein lower; p = 0.012).

f. Model dose as two variables: a binary variable indicating whether dose was greater than 0 and a continuous linear term.

Methods: We use multiple linear regression with a newly generated binary variable indicating whether dose was greater than zero and a continuous linear term, and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: In this analysis the estimated coefficient for the continuous linear dose term is -3.017 µm/g protein per unit of dose (95% CI: -4.614 to -1.421; p < 0.001, statistically significant). The estimated coefficient for the binary indicator is -0.0538 µm/g protein (95% CI: -0.7301 to 0.6225; p = 0.875, not statistically significant).

g. Model dose as three variables: a continuous linear predictor, a quadratic term, and a cubic term (a term equal to dose raised to the third power).

Methods: We use multiple linear regression with a continuous linear predictor, a quadratic term we generated, and a cubic term we generated.
We report the corresponding fitted values with CIs and p-values where pertinent.

Inference: This analysis produces individual p-values that are not statistically significant for any of the three variables. The overall F-test is significant (p = 0.0001).

h. Provide a table of the fitted values for each dose group from the above models. Comment on the similarities / differences between those fitted values (and the descriptive statistics).

Table 2. Fitted values for each dose group from above models

                           DFMO Dose (in g/sq m/day)
Values (µm/g protein)*     0       0.075   0.2     0.4
Sample Means (baseline)    3.26    3.47    3.35    3.56
A                          3.255   2.919   2.711   1.949
B                          3.255   2.919   2.711   1.949
C                          3.234   2.999   2.609   1.983
D                          3.212   3.011   2.642   1.963
E                          3.255   2.564   2.564   2.564
F                          3.255   2.975   2.598   1.995
G                          3.255   2.919   2.711   1.949

*all values are for spermidine at 12 months except for sample mean column

We see here that the baseline sample mean for the dose 0 group is nearly identical to the fitted 12-month values from Models A and B. The two diverge with increasing dose. Models A and B are identical to one another. In Model E the fitted value is the same at every dose level above 0, which could be due to the pooled indicator variable. In Models C and D we see very similar values.

3. Repeat the analyses in problem 2 adjusting for the baseline mucosal spermidine levels. (Note that the Stata functions "test" and "testparm" can be used to perform Wald tests of multiple parameters adjusted for other covariates.) You do not need to consider the descriptive statistics or the fitted values for this problem.

a. Model dose as dummy variables using the dose 0 group as the reference group.

Methods: We use multiple linear regression with dummy variables, created with the “i.” prefix in Stata, adjust for baseline spermidine level, and report the corresponding fitted values with CIs and p-values where pertinent.
Inference: With dose modeled as dummy variables and dose 0 (placebo) as the reference group, adjusting for baseline spermidine level, and after 12 months of treatment (as for all following results), the intercept, which is the estimated mean spermidine level for a placebo subject with a baseline spermidine level of 0, is 2.666 µm/g protein (p < 0.001). Among subjects with the same baseline spermidine level, the dose of 0.075 g/sq m/day is estimated to have a mean spermidine level 0.3424 µm/g protein lower than placebo (p = 0.252, not statistically significant; 95% CI, unadjusted for multiple comparisons: 0.9324 lower to 0.248 higher µm/g protein). The dose of 0.2 g/sq m/day is estimated to have a mean spermidine level 0.5411 µm/g protein lower than placebo (p = 0.16, not statistically significant; 95% CI, unadjusted for multiple comparisons: 1.298 lower to 0.216 higher µm/g protein). The dose of 0.4 g/sq m/day is estimated to have a mean spermidine level 1.379 µm/g protein lower than placebo (p < 0.001; 95% CI, unadjusted for multiple comparisons: 2.01 to 0.7484 µm/g protein lower). The overall F-test is significant (p = 0.0002), so we can reject the null hypothesis.

b. Model dose as dummy variables using the dose 0.075 group as the reference group. You do not have to provide a formal description of the methods or inference for this part. Instead comment on how the regression parameters from this model relate to those obtained in part a. Suppose we were to completely ignore the major multiple comparison issues and to instead trust the individual p values listed in the coefficient table, what conclusions would we reach about differences among the dose groups in part a vs in part b?

The fitted group means from this model are exactly the same as those obtained in part a. Only the parameterization changes: each dose coefficient is now a difference from the 0.075 group rather than from the dose 0 group. Ignoring major multiple comparison issues and trusting the individual p-values listed in the coefficient table, we would conclude equality. Here we adjust for baseline spermidine level.

c. Model dose continuously as a linear predictor.
Methods: We use multiple linear regression with dose modeled continuously as a linear predictor, adjust for baseline spermidine level, and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: With dose modeled as a linear predictor, and after 12 months of treatment (as for all following results), the mean spermidine level tends to decrease by 3.29 µm/g protein for each 1 g/sq m/day increase in dose among subjects with the same baseline spermidine level (p < 0.001; 95% CI: decrease of 4.722 to 1.859 µm/g protein).

d. Model dose as two variables: a continuous linear predictor along with a quadratic term (so an additional predictor equal to the square of dose).

Methods: We use multiple linear regression with a continuous linear predictor and a quadratic term, adjust for baseline spermidine level, and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: In this analysis the estimated coefficient for the continuous linear dose term is -2.41 µm/g protein per unit of dose (95% CI: -9.019 to 4.200; p = 0.471, not statistically significant). The estimated coefficient for the quadratic term is -2.19 µm/g protein per squared unit of dose (95% CI: -17.45 to 13.06; p = 0.776, not statistically significant). The overall F-test is significant (p = 0.0001).

e. Model dose as a binary variable indicating whether dose was greater than 0.

Methods: We use multiple linear regression with a newly generated binary variable indicating whether dose was greater than zero, adjust for baseline spermidine level, and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: When we compare the placebo group to the pooled groups receiving a dose above zero, the estimated mean spermidine level in the treated groups is 0.711 µm/g protein lower than in the control group (95% CI: 1.254 to 0.1679 µm/g protein lower; p = 0.011). The overall F-test gives p = 0.0072.

f. Model dose as two variables: a binary variable indicating whether dose was greater than 0 and a continuous linear term.
Methods: We use multiple linear regression with a newly generated binary variable indicating whether dose was greater than zero and a continuous linear term, adjust for baseline spermidine level, and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: In this analysis the estimated coefficient for the continuous linear dose term is -3.225 µm/g protein per unit of dose (95% CI: -4.898 to -1.553; p < 0.001, statistically significant). The estimated coefficient for the binary indicator is -0.03256 µm/g protein (95% CI: -0.676 to 0.611; p = 0.920, not statistically significant). The overall F-test gives p = 0.0001.

g. Model dose as three variables: a continuous linear predictor, a quadratic term, and a cubic term (a term equal to dose raised to the third power).

Methods: We use multiple linear regression with a continuous linear predictor, a quadratic term we generated, and a cubic term we generated. We adjust for baseline spermidine level and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: This analysis produces individual p-values that are not statistically significant for any of the three variables. The overall F-test is significant (p = 0.0002).

h. Provide a table of the fitted values for each dose group from the above models. Comment on the similarities / differences between those fitted values (and the descriptive statistics).

Table 3. Fitted values for each dose group from above models

                           DFMO Dose (in g/sq m/day)
Values (µm/g protein)*     0       0.075   0.2     0.4
Sample Means (baseline)    3.26    3.47    3.35    3.56
A                          3.249   2.945   2.724   1.924
B                          3.249   2.945   2.724   1.924
C                          3.236   3.027   2.594   1.973
D                          3.202   3.046   2.647   1.940
E                          3.250   2.572   2.553   2.586
F                          3.249   3.0125  2.587   1.980
G                          3.249   2.945   2.724   1.9246

*all values are for spermidine at 12 months except for sample mean column

We see here that the baseline sample mean for the dose 0 group is nearly identical to the fitted 12-month values from Models A and B. The two diverge with increasing dose.
Models A and B are identical to one another. In Model E the fitted values are nearly the same at every dose level above 0, which could be due to the pooled indicator variable. In Models C and D we see very similar values.

4. For each of the following models, provide inference (P values, and where appropriate, 95% confidence intervals with scientific interpretation of the parameters) regarding the effect of DFMO on the odds of decreased spermidine levels after 12 months of treatment (i.e., a lower spermidine level at 12 months than at baseline). Note that in part g you are asked to provide a table of predicted values for the odds of decreased spermidine as well as the probability of decreased spermidine for each of these models.

a. Model dose as dummy variables.

Methods: We use logistic regression with dummy variables, created with the “i.” prefix in Stata, and report the corresponding fitted values with CIs and p-values where pertinent.

Inference: In investigating the odds of decreased spermidine levels after 12 months of treatment with dose modeled as dummy variables, the differences we observe among the dose groups are not greater than what we would expect to observe if there were no true effect of DFMO dosage (p = 0.1594). The 0.075 g/sq m/day dose group has estimated odds of decreased spermidine levels 1.846 times those of the placebo group (95% CI: 0.62 to 5.49), and the corresponding odds ratios for the 0.2 and 0.4 g/sq m/day groups are 1.875 (95% CI: 0.588 – 5.97) and 4.615 (95% CI: 1.22 – 17.46), respectively.

b. Model dose continuously as a linear predictor.

Methods: We use logistic regression with dose modeled as a continuous linear predictor and a generated binary outcome indicating decreased spermidine levels, and report the fitted values with CIs and p-values where pertinent.
Inference: In investigating the odds of decreased spermidine levels after 12 months of treatment with dose modeled as a linear predictor, the estimated odds ratio for decreased spermidine levels is quite large (30.85; 95% CI: 1.399 to 682.05) for each one unit difference in dose. This is outside what would reasonably be expected if there were no true effect (P = 0.0299).

c. Model dose as two variables: a continuous linear predictor along with a quadratic term (so an additional predictor equal to the square of dose).

Methods: We use logistic regression with dose modeled as two variables, a continuous linear predictor and a quadratic term, and a generated binary outcome indicating decreased spermidine levels, and report the fitted values with CIs and p-values where pertinent.

Inference: In investigating the odds of decreased spermidine levels after 12 months of treatment with dose modeled as a linear predictor plus a quadratic term, what is detected is not outside what would reasonably be expected if there were no true effect (P = 0.0931).

d. Model dose as a binary variable indicating whether dose was greater than 0.

Methods: We use logistic regression with dose modeled as an indicator of whether it was greater than zero, examine the effect on the odds of a spermidine decrease, and report the fitted values with CIs and p-values where pertinent.

Inference: In investigating the odds of decreased spermidine levels after 12 months of treatment with dose modeled as an indicator variable, what is detected is not outside what would reasonably be expected if there were no true effect (P = 0.0661).

e. Model dose as two variables: a binary variable indicating whether dose was greater than 0 and a continuous linear term.
Methods: We use logistic regression with dose modeled as an indicator of whether it was greater than zero alongside a continuous linear term, examine the effect on the odds of a spermidine decrease, and report the fitted values with CIs and p-values where pertinent.

Inference: In investigating the odds of decreased spermidine levels after 12 months of treatment with dose modeled as an indicator variable alongside a continuous linear term, what is detected is not outside what would reasonably be expected if there were no true effect (P = 0.0554).

f. Model dose as three variables: a continuous linear predictor, a quadratic term, and a cubic term (a term equal to dose raised to the third power).

Methods: We use logistic regression with dose modeled as three variables, a continuous linear predictor, a quadratic term, and a cubic term, and a generated binary outcome indicating decreased spermidine levels, and report the fitted values with CIs and p-values where pertinent.

Inference: In investigating the odds of decreased spermidine levels after 12 months of treatment with dose modeled as a linear predictor plus quadratic and cubic terms, what is detected is not outside what would reasonably be expected if there were no true effect (P = 0.0992).

g. Provide a table of the fitted values for each dose group from the above models. Comment on the similarities / differences between those fitted values (and the descriptive statistics).

Table 4. Fitted proportion values for each dose group from above models - probabilities

DFMO Dose              Fitted probability of decreased spermidine, by model
(in g/sq m/day)        A           B           C           D           E           F
0                      0.4642857   0.4931918   0.4910965   0.4642857   0.4642857   0.4642857
0.075                  0.6153846   0.5572381   0.5585266   0.6716418   0.5924772   0.6153846
0.2                    0.6190476   0.65895     0.6619449   0.6716418   0.6651351   0.6190476
0.4                    0.8         0.7932245   0.791338    0.6716418   0.7813877   0.8

Table 5. Fitted odds values for each dose group from above models – odds (probability / (1 - probability))

DFMO Dose              Fitted odds of decreased spermidine, by model
(in g/sq m/day)        A             B             C             D             E             F
0                      0.866666617   0.973133031   0.965009083   0.866666617   0.866666617   0.866666617
0.075                  1.599999896   1.25855025    1.265142135   2.045454629   1.453850435   1.599999896
0.2                    1.624999869   1.93212139    1.958097659   2.045454629   1.986278944   1.624999869
0.4                    4             3.836162892   3.792439448   2.045454629   3.574308033   4

These fitted values all correspond with the models above. Where the sample means are not linear in dose, the fitted values would not lie on a straight line. Where we have the pooled treatment variable operating (Model D), we should see identical values in the dosage groups above 0, and we do.

5. Which of the above analyses would you prefer a priori to test for an effect of DFMO on mucosal levels of polyamines?

Based on the material thus far in this course, including homework keys from previous iterations of this class (including the HW #4 key from Biostat 518 in 2007), I conclude that adjusting for baseline is important in an RCT, as it is here. With that said, we eliminate the analyses from problem 2 as our preference, since they do not control for the baseline, which is only introduced in problem 3. Importantly, we would only operate on judgments made before looking at the data. The freedom to look for multiple types of trends simultaneously is desirable in a model, so models looking only for a first-order trend may be inadequate in this regard. Therefore, a model of the effect of DFMO dosage on mucosal levels of polyamines may have both an indicator variable for treatment and a continuous linear term. Other terms, such as the quadratic and cubic terms, would depend on a priori knowledge about dose-response relationships and the biological underpinnings, which may or may not suggest a non-linear scale.
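
As a check on the arithmetic linking Table 4, Table 5, and the odds ratios quoted in problem 4a, the conversion can be sketched as follows. This is a minimal Python sketch, not part of the Stata analysis: it takes the Model A fitted probabilities from Table 4 as given, and the names prob_model_a, odds, odds_model_a, and odds_ratio_vs_placebo are illustrative assumptions.

```python
# Fitted probabilities of decreased spermidine from Model A (Table 4),
# keyed by DFMO dose in g/sq m/day. Values are copied from the table.
prob_model_a = {0.0: 0.4642857, 0.075: 0.6153846, 0.2: 0.6190476, 0.4: 0.8}


def odds(p):
    """Convert a probability p to odds, p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("probability must be in [0, 1)")
    return p / (1 - p)


# Odds for each dose group (compare with the Model A column of Table 5).
odds_model_a = {dose: odds(p) for dose, p in prob_model_a.items()}

# Odds ratio of each dose group relative to the dose 0 (placebo) group
# (compare with the odds ratios reported in problem 4a).
odds_ratio_vs_placebo = {
    dose: group_odds / odds_model_a[0.0]
    for dose, group_odds in odds_model_a.items()
}

for dose in prob_model_a:
    print(
        f"dose {dose}: odds = {odds_model_a[dose]:.3f}, "
        f"odds ratio vs placebo = {odds_ratio_vs_placebo[dose]:.3f}"
    )
```

Because Model A fits each dose group separately, these odds ratios are simply ratios of the observed group odds; the same conversion applies to the fitted probabilities from the other models.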