Stat 565                    Final Exam                    Name: ____________

Instructions: This is a closed book exam, but you can use the formula sheet attached to this exam. You are not allowed to use any other books or notes. Write your answers on other paper; we did not leave enough room to write your answers on this exam. If you are asked to perform a test or compute some quantity, show the formulas you used. It is not necessary to obtain a final numerical answer to receive full credit if you show that you know how to apply an appropriate procedure. Be sure to write your name on every piece of paper that you submit. Please staple the exam questions to your solutions. You can keep the formula sheets. The questions will be returned to you with your graded solutions.

1. The following is an analysis of the survival time after breast cancer diagnosis for n = 239 young women (aged 25-45 years). Women have been followed for up to 14 years, and more than half of the women were still living at the time of data analysis. Interest is in whether the measurement of two proteins in tumor biopsies can be used to predict survival. The proteins of interest are:

    P27     = 1 if abnormal, 0 if normal
    CYCLINE = 1 if abnormal, 0 if normal

In addition, the following variables were used as predictors:

    NODES   = 1 if cancer has spread to lymph nodes, 0 otherwise
    SIZE(1) = 1 if tumor size is less than 2cm, 0 otherwise
    SIZE(2) = 1 if tumor size is between 2cm and 4cm, 0 otherwise
    SIZE(3) = 1 if tumor size exceeds 4cm, 0 otherwise
    AGE     = 1 if age at diagnosis is between 36 and 45 years,
              0 if age at diagnosis is between 25 and 35 years
    YEAR    = 1 if breast cancer was diagnosed between 1989 and 1992,
              0 if breast cancer was diagnosed between 1983 and 1988

Consider the following maximum partial likelihood estimates for two Cox regression (proportional hazards) models:

Model 1:  -2 Log Likelihood = 836.111

    .................... Parameter Estimates ....................
                                    Wald
    Variable        B      S.E.   Statistic   df   p-value   Exp(B)
    P27           1.0672   .2526    17.8458    1    .0000    2.9073
    CYCLINE       0.7256   .2159    11.2980    1    .0008    2.0659
    NODES         1.2040   .2288    27.6830    1    .0000    3.3334
    SIZE(2)       0.6680   .2594     6.6323    1    .0100    1.9504
    SIZE(3)       0.9024   .3687     5.9902    1    .0144    2.4654
    AGE          -0.1195   .2328     0.2638    1    .6075    0.8873
    YEAR          0.1980   .2437     0.6604    1    .4164    1.2190

Model 2:  -2 Log Likelihood = 835.797

    ------------------ Parameter Estimates ------------------
                                Std.      Wald
    Variable        Estimate    Error   Statistic   df   p-value
    P27              1.0884     0.438     6.183      1    0.013
    CYCLINE          0.8961     0.372     5.809      1    0.016
    NODES            1.3577     0.544     6.220      1    0.013
    SIZE(2)          0.6641     0.261     6.495      1    0.011
    SIZE(3)          0.8674     0.376     5.325      1    0.021
    AGE             -0.1253     0.232     0.291      1    0.590
    YEAR             0.1979     0.244     0.659      1    0.420
    NODES*P27       -0.0274     0.533     0.003      1    0.960
    NODES*CYCLINE   -0.2589     0.462     0.314      1    0.580

1(a) Write out the formula for the Cox regression model that corresponds to Model 1.

1(b) What are the key assumptions that are made when using Cox regression and making inferences using maximum partial likelihood estimates?

1(c) Give an interpretation of the coefficient for CYCLINE in Model 2. Show how to construct an approximate 95 percent confidence interval for the corresponding hazard ratio.

1(d) Give an interpretation of the coefficient for the NODES*P27 term in Model 2.

1(e) What test statistic would you use to determine if the NODES*P27 and NODES*CYCLINE interactions are significant? What is its value, and how many degrees of freedom are associated with this statistic?

1(f) If the 5-year survival probability for a woman who has a zero value for each covariate is estimated as 0.93, then based on Model 2, what would be the estimate of 5-year survival for a woman who has an abnormal P27 measurement and an abnormal CYCLINE measurement (P27=1 and CYCLINE=1), but had values of 0 for all other covariates (i.e., NODES=0, SIZE(2)=0, SIZE(3)=0, AGE=0, and YEAR=0)?

1(g) For Model 1, describe two plots you could use to assess the proportional hazards assumption for P27.
1(h) For Model 1, describe how you could obtain a p-value for a hypothesis test of the proportional hazards assumption for P27.

2. In a study of the possible effects of diet on the prevention of certain infections, 210 subjects were randomly assigned to one of three diets, with 70 subjects assigned to each diet. The subjects were examined at 4, 8, 12, 16, 20, and 24 months to determine if they suffered from at least one of the specified set of infections. A binary response was recorded for each subject at each inspection time:

    $Y_{ijk} = 1$ if the j-th subject in the i-th diet group has an infection at the k-th inspection time
    $Y_{ijk} = 0$ if the j-th subject in the i-th diet group has no infection at the k-th inspection time

A plot of the observed infection rates is shown below:

    [Figure: observed infection rates by inspection time; vertical axis: Probability of Infection]

Using the information provided by the previous plot, the following logistic regression model was fit to the data:

    $\log\left(\frac{\pi_{ijk}}{1-\pi_{ijk}}\right) = \beta_{0i} + \beta_{1i}(X_k - 4) + \beta_{2i}(X_k - 4)^2$

where $\pi_{ijk} = E(Y_{ijk} \mid \text{Diet } i \text{ and time } X_k)$ and $Y_{ijk}$ is the infection status at the k-th inspection time for the j-th subject assigned to the i-th diet. SAS code for fitting this model is displayed below along with some of the output.
    proc genmod data=set1 descending;
      class diet subject;
      model Y = diet diet*x diet*x*x / noint dist=binomial link=logit;
      repeated subject=subject(diet) / type=un modelse covb corrw;
    run;

The GENMOD Procedure

    Criteria For Assessing Goodness Of Fit

    Criterion               DF       Value     Value/DF
    Deviance              1251      2032.6       1.6248
    Scaled Deviance       1251      2032.5       1.6248
    Pearson Chi-Square    1251      2405.9       1.9264
    Scaled Pearson X2     1251      2405.9       1.9264
    Log Likelihood                 -1016.3

    Analysis Of Initial Parameter Estimates

                              Standard   Wald 95%
    Parameter      Estimate   Error      Confidence Limits   ChiSquare   Pr>ChiSq
    diet       1
    diet       2
    diet       3
    x*diet     1
    x*diet     2
    x*diet     3
    x*x*diet   1
    x*x*diet   2
    x*x*diet   3
    Scale
    [numeric entries not legible in the source]

    GEE Model Information

    Working Correlation Matrix

           Col1   Col2   Col3   Col4   Col5   Col6
    [numeric entries not legible in the source]

    Analysis Of GEE Parameter Estimates
    Empirical Standard Error Estimates

                              Standard   95%
    Parameter      Estimate   Error      Confidence Limits   Z   Pr>|Z|
    diet       1    -4.6272
    diet       2    -0.8668
    diet       3    -2.8005
    x*diet     1     0.1848
    x*diet     2     0.2305
    x*diet     3     0.2506
    x*x*diet   1    -0.0050
    x*x*diet   2    -0.0111
    x*x*diet   3    -0.0085

    Covariance Matrix (Empirical)

                                                                  x*x*     x*x*     x*x*
             diet1   diet2   diet3   x*diet1  x*diet2  x*diet3   diet1    diet2    diet3
    diet1    0.9981  0       0       -0.0409  0        0         0.0029   0        0
    [remaining rows not legible in the source]

2(a) The GEE estimate of the parameter associated with Diet 2 is $\hat{\beta}_{02} = -0.8668$ with standard error 0.3345. Give an interpretation of this estimate with respect to the effect of diet on the odds of infection.

2(b) Using an appropriate odds ratio, estimate the relative risk of infection at 12 months versus 8 months when diet 2 is used. Show how to construct an approximate confidence interval for this relative risk.

2(c) Show how you would test the null hypothesis $H_0: \beta_{21} = \beta_{22} = \beta_{23}$, that is, the slopes on the quadratic trends across time are all the same.

2(d) Initial parameter estimates are given near the top of page 5. These estimates are obtained by maximizing a likelihood function based on the assumptions that $Y_{ijk} \sim \text{Bin}(1, \pi_{ijk})$, where

    $\log\left(\frac{\pi_{ijk}}{1-\pi_{ijk}}\right) = \beta_{0i} + \beta_{1i}(X_k - 4) + \beta_{2i}(X_k - 4)^2$,

and the $Y_{ijk}$'s are independent. Let $\hat{\beta} = (\hat{\beta}_{01}\ \hat{\beta}_{02}\ \hat{\beta}_{03}\ \hat{\beta}_{11}\ \hat{\beta}_{12}\ \hat{\beta}_{13}\ \hat{\beta}_{21}\ \hat{\beta}_{22}\ \hat{\beta}_{23})^T$ denote the vector of initial estimates. If these assumptions are correct, $\hat{\beta}$ would approximately have a normal distribution with mean $\beta$ and covariance matrix [expression not legible in the source], in which $D_{ik}$ is the matrix of first partial derivatives of $\pi_{ik} = (\pi_{i1k}\ \pi_{i2k}\ \pi_{i3k}\ \pi_{i4k}\ \pi_{i5k}\ \pi_{i6k})^T$ with respect to the elements of $\beta = (\beta_{01}\ \beta_{02}\ \beta_{03}\ \beta_{11}\ \beta_{12}\ \beta_{13}\ \beta_{21}\ \beta_{22}\ \beta_{23})^T$. What can you say about the distribution of $\hat{\beta}$ if the elements of $Y_{ik} = (Y_{i1k}\ Y_{i2k}\ Y_{i3k}\ Y_{i4k}\ Y_{i5k}\ Y_{i6k})^T$, the repeated measurements of infection status on the k-th subject in the i-th diet group, are not independent?

2(e) How do the parameter estimates shown on the top of page 6 differ from the initial parameter estimates shown on page 5? Which are the better estimates? Explain. (Define what you mean by better.)

3. Forty rabbits were used in an experiment that examined the effects of the drug MDL on controlling blood pressure. The rabbits were randomly divided into two groups with 20 rabbits in each group. The 20 rabbits in one group (the treatment group) were all given the same dose of the drug MDL. The 20 rabbits in the other group (the control group) were not given MDL. The baseline blood pressure of each rabbit was measured just before a stimulant (PBG) was given to each rabbit. Then, each rabbit was injected with one unit of PBG and the increase in blood pressure was measured. Two hours later each rabbit was injected with 2 units of PBG and the increase in blood pressure relative to the baseline measurement was recorded for each rabbit. This continued at two-hour intervals until increases in blood pressure were measured for each rabbit for exposure to 1, 2, 3, 4, 5 units of PBG.
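Let $Y_{1jk}$ denote the measured increase in blood pressure when the j-th rabbit in the group treated with MDL is exposed to k units of PBG. Let $Y_{2jk}$ denote the measured increase in blood pressure when the j-th rabbit in the control group is exposed to k units of PBG. Consider the model

    $Y_{ijk} = \beta_{0i} + \beta_{1i} X_k + \gamma_{0ij} + \gamma_{1ij} X_k + \epsilon_{ijk}$    (model A)

where $i = 1, 2$, $j = 1, 2, \ldots, 20$, $k = 1, 2, 3, 4, 5$, and $X_k$ denotes the k-th level of PBG. The $\epsilon_{ijk} \sim NID(0, \sigma_\epsilon^2)$ are independent random errors, the

    $\begin{pmatrix} \gamma_{0ij} \\ \gamma_{1ij} \end{pmatrix} \sim NID\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{\gamma_0}^2 & \sigma_{\gamma_0,\gamma_1} \\ \sigma_{\gamma_0,\gamma_1} & \sigma_{\gamma_1}^2 \end{pmatrix} \right)$

are independent vectors of random coefficients, and any $\epsilon_{ijk}$ is independent of any $\gamma_{ij}$.

3(a) Show how to write this model in the form $Y = X\beta + Zu + e$, where $\beta$ is a vector of non-random parameters, $u$ is a vector of random effects, and $e$ is a vector of random errors.

3(b) Suppose the values of $\sigma_\epsilon^2$, $\sigma_{\gamma_0}^2$, $\sigma_{\gamma_1}^2$, $\sigma_{\gamma_0,\gamma_1}$ are known. Give a formula for the best linear unbiased estimator of $\beta$.

3(c) Describe how REML estimates of $\sigma_\epsilon^2$, $\sigma_{\gamma_0}^2$, $\sigma_{\gamma_1}^2$, $\sigma_{\gamma_0,\gamma_1}$ are obtained. What is the motivation for using REML estimates of the variance components instead of maximum likelihood estimates?

3(d) Suppose the REML estimates of $\sigma_\epsilon^2$, $\sigma_{\gamma_0}^2$, $\sigma_{\gamma_1}^2$, $\sigma_{\gamma_0,\gamma_1}$ are inserted into your formula for the estimator of $\beta$ from part (b). What can you say about the distributional properties of the resulting estimator?

A second model that could be used is

    $\begin{pmatrix} Y_{ij1} \\ Y_{ij2} \\ Y_{ij3} \\ Y_{ij4} \\ Y_{ij5} \end{pmatrix} = \begin{pmatrix} \mu_{i1} \\ \mu_{i2} \\ \mu_{i3} \\ \mu_{i4} \\ \mu_{i5} \end{pmatrix} + \begin{pmatrix} \epsilon_{ij1} \\ \epsilon_{ij2} \\ \epsilon_{ij3} \\ \epsilon_{ij4} \\ \epsilon_{ij5} \end{pmatrix}$    (model B)

where $\mu_i = (\mu_{i1}\ \mu_{i2}\ \mu_{i3}\ \mu_{i4}\ \mu_{i5})^T$ is a vector of mean responses to the five levels of PBG for the i-th treatment group, and $\epsilon_{ij} = (\epsilon_{ij1}\ \epsilon_{ij2}\ \epsilon_{ij3}\ \epsilon_{ij4}\ \epsilon_{ij5})^T \sim NID(0, R)$. Compound symmetry, AR(1) and a general unstructured covariance matrix were considered for R. Results for REML and maximum likelihood estimation (ML) are shown below.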
                                       ML                    REML
    Model     Model for R              -2(log-likelihood)    -2(log-likelihood)
    Model A                            267.35                243.15
    Model B   Compound symmetry        257.21                236.14
    Model B   AR(1)                    236.53                218.88
    Model B   Unstructured             232.18                214.37

3(e) What are the AIC values corresponding to the REML log-likelihoods for the three versions of model B? What does this information tell you?

3(f) If possible, show how to test the hypothesis that R satisfies the AR(1) covariance structure for model B. Give degrees of freedom for your test.

3(g) If possible, show how to test the null hypothesis that model A is appropriate for these data. Give degrees of freedom for your test.
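
The Wald Statistic and Exp(B) columns of the Model 1 output in Question 1 can be recomputed from the B and S.E. columns, assuming the usual conventions Wald = (B / S.E.)^2 and Exp(B) = e^B. The Python sketch below is only a reading aid for that table; the helper name `wald_and_hazard_ratio` and the list `model1` are hypothetical and not part of the exam. Small differences from the printed Wald values are expected because the printed B and S.E. are rounded.

```python
import math

# (variable, B, S.E.) transcribed from the Model 1 output in Question 1.
model1 = [
    ("P27", 1.0672, 0.2526),
    ("CYCLINE", 0.7256, 0.2159),
    ("NODES", 1.2040, 0.2288),
    ("SIZE(2)", 0.6680, 0.2594),
    ("SIZE(3)", 0.9024, 0.3687),
    ("AGE", -0.1195, 0.2328),
    ("YEAR", 0.1980, 0.2437),
]


def wald_and_hazard_ratio(b, se):
    """Return (Wald statistic, hazard ratio) for one coefficient.

    Assumes Wald = (b / se)^2 and hazard ratio = exp(b).
    """
    wald = (b / se) ** 2
    hazard_ratio = math.exp(b)
    return wald, hazard_ratio


for name, b, se in model1:
    wald, hr = wald_and_hazard_ratio(b, se)
    print(f"{name:8s}  Wald = {wald:8.4f}  Exp(B) = {hr:.4f}")
```

Comparing the printed lines with the table gives a quick check that a row's variable label has been matched to the right estimates.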