
Stat 565
Final Exam
Name
Instructions: This is a closed book exam, but you can use the formula sheet attached
to this exam. You are not allowed to use any other books or notes. Write your answers
on other paper; we did not leave enough room to write your answers on this exam. If
you are asked to perform a test or compute some quantity, show the formulas you
used. It is not necessary to obtain a final numerical answer to receive full credit if you
show that you know how to apply an appropriate procedure. Be sure to write your name
on every piece of paper that you submit. Please staple the exam questions to your
solutions. You can keep the formula sheets. The questions will be returned to you
with your graded solutions.
The following is an analysis of the survival time after breast cancer diagnosis for
n = 239 young women (aged 25-45 years). Women have been followed for up to 14 years
and more than half of the women were still living at the time of data analysis. Interest is in
whether the measurement of two proteins in tumor biopsies can be used to predict survival.
The proteins of interest are:
P27 = 1 if abnormal, 0 if normal
CYCLINE = 1 if abnormal, 0 if normal
In addition, the following variables were used as predictors:
NODES = 1 if cancer has spread to lymph nodes, 0 otherwise
SIZE(1) = 1 if tumor size is less than 2cm, 0 otherwise
SIZE(2) = 1 if tumor size is between 2cm and 4cm, 0 otherwise
SIZE(3) = 1 if tumor size exceeds 4cm, 0 otherwise
AGE = 1 if age at diagnosis is between 36 and 45 years,
0 if age at diagnosis is between 25 and 35 years
YEAR = 1 if breast cancer was diagnosed between 1989 and 1992,
0 if breast cancer was diagnosed between 1983 and 1988
Consider the following maximum partial likelihood estimates for two Cox regression
(proportional hazards) models:
Model 1:
-2 Log Likelihood = 836.111

Parameter Estimates

Variable      B         S.E.     Wald Statistic   df   p-value   Exp(B)
P27           1.0672    .2526    17.8458          1    .0000     2.9073
CYCLINE       0.7256    .2159    11.2980          1    .0008     2.0659
NODES         1.2040    .2288    27.6830          1    .0000     3.3334
SIZE(2)       0.6680    .2594     6.6323          1    .0100     1.9504
SIZE(3)       0.9024    .3687     5.9902          1    .0144     2.4654
AGE          -0.1195    .2328     0.2638          1    .6075     0.8873
YEAR          0.1980    .2437     0.6604          1    .4164     1.2190
Model 2:
-2 Log Likelihood = 835.797

Parameter Estimates

Variable         Estimate   Std. Error   Wald Statistic   df   p-value
P27               1.0884    0.438        6.183            1    0.013
CYCLINE           0.8961    0.372        5.809            1    0.016
NODES             1.3577    0.544        6.220            1    0.013
SIZE(2)           0.6641    0.261        6.495            1    0.011
SIZE(3)           0.8674    0.376        5.325            1    0.021
AGE              -0.1253    0.232        0.291            1    0.590
YEAR              0.1979    0.244        0.659            1    0.420
NODES*P27        -0.0274    0.533        0.003            1    0.960
NODES*CYCLINE    -0.2589    0.462        0.314            1    0.580
1(a) Write out the formula for the Cox regression model that corresponds to Model 1.
1(b) What are the key assumptions that are made when using Cox regression and
making inferences using maximum partial likelihood estimates?
1(c) Give an interpretation of the coefficient for CYCLINE in Model 2. Show how to construct an
approximate 95 percent confidence interval for the corresponding hazard ratio.
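The interval construction asked for in 1(c) can be sketched numerically. This is a minimal illustration, assuming a Wald interval built on the log-hazard scale from the CYCLINE row of Model 2 (estimate 0.8961, standard error 0.372) and the usual normal multiplier 1.96; the function name `hazard_ratio_ci` is hypothetical. Because Model 2 contains a NODES*CYCLINE term, this coefficient describes women with NODES = 0.

```python
import math

# Values read from the CYCLINE row of the Model 2 table.
beta_hat = 0.8961  # estimated log hazard ratio (abnormal vs. normal CYCLINE, NODES = 0)
se = 0.372         # standard error of the estimate
z = 1.96           # approximate 97.5th percentile of the standard normal (assumed multiplier)


def hazard_ratio_ci(beta, std_err, multiplier=1.96):
    """Return (hazard ratio, lower limit, upper limit).

    The interval is formed for the coefficient first and then exponentiated,
    so the limits are not symmetric around the hazard ratio.
    """
    lower = math.exp(beta - multiplier * std_err)
    upper = math.exp(beta + multiplier * std_err)
    return math.exp(beta), lower, upper


hr, lo, hi = hazard_ratio_ci(beta_hat, se, z)
print(f"HR = {hr:.3f}, approximate 95% CI = ({lo:.3f}, {hi:.3f})")
```

Exponentiating the limits of the interval for the coefficient keeps both limits positive, which a symmetric interval computed directly on the hazard-ratio scale would not guarantee.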
1(d) Give an interpretation of the coefficient for the NODES*P27 term in Model 2.
1(e) What test statistic would you use to determine if the NODES*P27 and NODES*CYCLINE
interactions are significant? What is its value, and how many degrees of freedom are
associated with this statistic?
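One candidate for 1(e) is a partial likelihood ratio test comparing the two fitted models. The sketch below assumes that approach, reads the Model 1 value as 836.111, and uses the fact that the chi-square upper tail probability with 2 degrees of freedom is exp(-x/2); the variable names are illustrative only.

```python
import math

# -2 log partial likelihood values reported for the two models.
neg2loglik_model1 = 836.111  # reduced model: no interaction terms
neg2loglik_model2 = 835.797  # full model: adds NODES*P27 and NODES*CYCLINE

# Likelihood ratio statistic: drop in -2 log likelihood when the two terms are added.
lrt_statistic = neg2loglik_model1 - neg2loglik_model2

# One degree of freedom per added parameter.
degrees_of_freedom = 2

# For a chi-square variable with exactly 2 df, P(X > x) = exp(-x / 2).
p_value = math.exp(-lrt_statistic / 2)

print(f"LRT = {lrt_statistic:.3f} on {degrees_of_freedom} df, p = {p_value:.3f}")
```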
1(f) If the 5-year survival probability for a woman that has a zero value for each
covariate is estimated as 0.93, then based on Model 2, what would be the
estimate of 5-year survival for a woman who has an abnormal P27 measurement and
an abnormal CYCLINE measurement (P27=1 and CYCLINE=1), but had values of 0
for all other covariates (i.e. NODES=0, SIZE(2)=0, SIZE(3)=0, AGE=0, and YEAR=0)?
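Under proportional hazards, the survival probability for a covariate pattern is the baseline survival raised to the power of the relative hazard. A minimal sketch of that arithmetic for 1(f), assuming the baseline 5-year survival of 0.93 and the Model 2 coefficients for P27 and CYCLINE (the interaction terms drop out because NODES = 0):

```python
import math

baseline_survival = 0.93  # estimated 5-year survival when every covariate is 0

# Model 2 coefficients for the covariates that equal 1 for this woman.
coefficients = {"P27": 1.0884, "CYCLINE": 0.8961}

# Linear predictor: all other covariates are 0, and both interaction
# terms involve NODES, which is 0 here, so they contribute nothing.
linear_predictor = sum(coefficients.values())

# Relative hazard compared with the all-zero covariate pattern.
relative_hazard = math.exp(linear_predictor)

# S(t | x) = S0(t) ** exp(x'beta) under the proportional hazards model.
survival = baseline_survival ** relative_hazard

print(f"relative hazard = {relative_hazard:.3f}, 5-year survival = {survival:.3f}")
```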
1(g) For Model 1, describe two plots you could use to assess the proportional hazards
assumption for P27.
1(h) For Model 1, describe how you could obtain a p-value for a hypothesis test of the
proportional hazards assumption for P27.
2. In a study of the possible effects of diet on the prevention of certain infections, 210 subjects
were randomly assigned to one of three diets, with 70 subjects assigned to each diet. The
subjects were examined at 4, 8, 12, 16, 20, and 24 months to determine if they suffered from
at least one of the specified set of infections. A binary response was recorded for each
subject at each inspection time:
Yijk = 1 if the j-th subject in the i-th diet group has an infection at the k-th inspection time,
       0 if the j-th subject in the i-th diet group has no infection at the k-th inspection time.
A plot of the observed infection rates is shown below:
[Figure: observed infection rates; vertical axis: Probability of Infection]
Using the information provided by the previous plot, the following logistic regression
model was fit to the data:
$$\log\left(\frac{\pi_{ijk}}{1-\pi_{ijk}}\right) = \beta_{0i} + \beta_{1i}(X_k - 4) + \beta_{2i}(X_k - 4)^2$$
where $\pi_{ijk} = E(Y_{ijk} \mid \text{Diet } i \text{ and time } X_k)$ and $Y_{ijk}$ is the infection status at the k-th
inspection time for the j-th subject assigned to the i-th diet. SAS code for fitting this
model is displayed below along with some of the output.
proc genmod data=set1 descending;
   class diet subject;
   model Y = diet diet*x diet*x*x / noint dist=binomial link=logit;
   repeated subject=subject(diet) / type=un modelse covb corrw;
run;
The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion             DF      Value      Value/DF
Deviance              1251    2032.6     1.6248
Scaled Deviance       1251    2032.5     1.6248
Pearson Chi-Square    1251    2405.9     1.9264
Scaled Pearson X2     1251    2405.9     1.9264
Log Likelihood                -1016.3
Analysis Of Initial Parameter Estimates

Parameter   diet   Estimate   Standard Error   Wald 95% Confidence Limits   ChiSquare   Pr>ChiSq
diet        1
diet        2
diet        3
x*diet      1
x*diet      2
x*diet      3
x*x*diet    1
x*x*diet    2
x*x*diet    3
Scale
GEE Model Information

Working Correlation Matrix

        Col1   Col2   Col3   Col4   Col5   Col6
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter   diet   Estimate   Standard Error   95% Confidence Limits   Z   Pr>|Z|
diet        1      -4.6272
diet        2      -0.8668
diet        3      -2.8005
x*diet      1       0.1848
x*diet      2       0.2305
x*diet      3       0.2506
x*x*diet    1      -0.0050
x*x*diet    2      -0.0111
x*x*diet    3      -0.0085
Covariance Matrix (Empirical)

              diet1    diet2   diet3   x*diet1   x*diet2   x*diet3   x*x*diet1   x*x*diet2   x*x*diet3
diet 1        0.9981   0       0       -0.0409   0         0         0.0029      0           0
diet 2
diet 3
x*diet 1
x*diet 2
x*diet 3
x*x*diet 1
x*x*diet 2
x*x*diet 3
2(a) The GEE estimate of the parameter associated with Diet 2 is $\hat{\beta}_{02} = -0.8668$ with standard
error 0.3345. Give an interpretation of this estimate with respect to the effect of diet on
the odds of infection.
2(b) Using an appropriate odds ratio, estimate the relative risk of infection at 12 months
versus 8 months when diet 2 is used. Show how to construct an approximate confidence
interval for this relative risk.
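One way to set up the calculation in 2(b) is sketched below. The symbols $x_8$ and $x_{12}$ (the values of the covariate x at 8 and 12 months), the contrast vector $c$, and $\hat{\Sigma}$ (the 2x2 block of the empirical covariance matrix for the diet 2 linear and quadratic coefficients) are notation introduced here for illustration, not taken from the exam. The covariate values are left symbolic because the coding of x is not restated at this point.

```latex
% Log odds ratio, 12 months versus 8 months, diet 2 (the intercept cancels):
\log \widehat{OR} = \hat{\beta}_{12}\,(x_{12} - x_{8}) + \hat{\beta}_{22}\,(x_{12}^{2} - x_{8}^{2})
                  = c^{T} \begin{pmatrix} \hat{\beta}_{12} \\ \hat{\beta}_{22} \end{pmatrix},
\qquad
c = \begin{pmatrix} x_{12} - x_{8} \\ x_{12}^{2} - x_{8}^{2} \end{pmatrix}

% Estimated variance of the log odds ratio from the empirical covariance block:
\widehat{\mathrm{Var}}\left(\log \widehat{OR}\right) = c^{T}\, \hat{\Sigma}\, c

% Approximate 95 percent interval, formed on the log scale and exponentiated:
\exp\left( \log \widehat{OR} \pm 1.96 \sqrt{c^{T}\, \hat{\Sigma}\, c} \right)
```

The odds ratio approximates the relative risk only when the infection probabilities being compared are small, so the approximation should be checked against the fitted probabilities.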
2(c) Show how you would test the null hypothesis $H_0: \beta_{21} = \beta_{22} = \beta_{23}$, that is, that the slopes on
the quadratic trends across time are all the same.
2(d) Initial parameter estimates are given near the top of page 5. These estimates are
obtained by maximizing a likelihood function based on the assumptions that
$Y_{ijk} \sim \mathrm{Bin}(1, \pi_{ijk})$, where
$$\log\left(\frac{\pi_{ijk}}{1-\pi_{ijk}}\right) = \beta_{0i} + \beta_{1i}(X_k - 4) + \beta_{2i}(X_k - 4)^2,$$
and the $Y_{ijk}$'s are independent. Let
$\hat{\beta} = (\hat{\beta}_{01}\ \hat{\beta}_{02}\ \hat{\beta}_{03}\ \hat{\beta}_{11}\ \hat{\beta}_{12}\ \hat{\beta}_{13}\ \hat{\beta}_{21}\ \hat{\beta}_{22}\ \hat{\beta}_{23})^T$
denote the vector of initial estimates. If these assumptions are correct, $\hat{\beta}$ would approximately have
a normal distribution with mean $\beta$ and covariance matrix
[covariance formula not legible in the source]
and $D_{ik}$ is the matrix of first partial derivatives of
$\pi_{ik} = (\pi_{i1k}\ \pi_{i2k}\ \pi_{i3k}\ \pi_{i4k}\ \pi_{i5k}\ \pi_{i6k})^T$ with respect to the elements of
$\beta = (\beta_{01}\ \beta_{02}\ \beta_{03}\ \beta_{11}\ \beta_{12}\ \beta_{13}\ \beta_{21}\ \beta_{22}\ \beta_{23})^T$.
What can you say about the distribution of $\hat{\beta}$ if the
elements of $Y_{ik} = (Y_{i1k}\ Y_{i2k}\ Y_{i3k}\ Y_{i4k}\ Y_{i5k}\ Y_{i6k})^T$, the repeated measurements of
infection status on the k-th subject in the i-th diet group, are not independent?
2(e) How do the parameter estimates shown on the top of page 6 differ from the initial
parameter estimates shown on page 5? Which are the better estimates? Explain. (Define
what you mean by better.)
3. Forty rabbits were used in an experiment that examined the effects of the drug MDL on
controlling blood pressure. The rabbits were randomly divided into two groups with 20 rabbits
in each group. The 20 rabbits in one group (the treatment group) were all given the same
dose of the drug MDL. The 20 rabbits in the other group (the control group) were not given
MDL. The baseline blood pressure of each rabbit was measured just before a stimulant
(PBG) was given to each rabbit. Then, each rabbit was injected with one unit of PBG and the
increase in blood pressure was measured. Two hours later each rabbit was injected with 2
units of PBG and the increase in blood pressure relative to the baseline measurement was
recorded for each rabbit. This continued at two-hour intervals until increases in blood
pressure were measured for each rabbit for exposure to 1, 2, 3, 4, 5 units of PBG. Let Y1jk
denote the measured increase in blood pressure when the j-th rabbit in the group treated with
MDL is exposed to k units of PBG. Let Y2jk denote the measured increase in blood
pressure when the j-th rabbit in the control group is exposed to k units of PBG. Consider the
model
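$$Y_{ijk} = \beta_{0i} + \beta_{1i} X_k + \gamma_{0ij} + \gamma_{1ij} X_k + \epsilon_{ijk} \qquad \text{(model A)}$$
where $i = 1, 2$, $j = 1, 2, \ldots, 20$, $k = 1, 2, 3, 4, 5$, and $X_k$ denotes the k-th level of PBG,
$\epsilon_{ijk} \sim NID(0, \sigma_\epsilon^2)$ are independent random errors,
$$\gamma_{ij} = \begin{pmatrix} \gamma_{0ij} \\ \gamma_{1ij} \end{pmatrix} \sim NID\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{\gamma_0}^2 & \sigma_{\gamma_0,\gamma_1} \\ \sigma_{\gamma_0,\gamma_1} & \sigma_{\gamma_1}^2 \end{pmatrix} \right)$$
are independent vectors of random coefficients, and any $\epsilon_{ijk}$ is independent of any $\gamma_{ij}$.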
3(a) Show how to write this model in the form $Y = X\beta + Zu + \epsilon$, where $\beta$ is a vector of
non-random parameters, $u$ is a vector of random effects, and $\epsilon$ is a vector of random errors.
3(b) Suppose the values of $\sigma_\epsilon^2$, $\sigma_{\gamma_0}^2$, $\sigma_{\gamma_1}^2$, $\sigma_{\gamma_0,\gamma_1}$ are known. Give a formula for
the best linear unbiased estimator of $\beta$.
3(c) Describe how REML estimates of $\sigma_\epsilon^2$, $\sigma_{\gamma_0}^2$, $\sigma_{\gamma_1}^2$, $\sigma_{\gamma_0,\gamma_1}$ are obtained.
What is the motivation for using REML estimates of the variance
components instead of maximum likelihood estimates?
3(d) Suppose the REML estimates of $\sigma_\epsilon^2$, $\sigma_{\gamma_0}^2$, $\sigma_{\gamma_1}^2$, $\sigma_{\gamma_0,\gamma_1}$ are inserted
into your formula for the estimator of $\beta$ from part (b). What can you
say about the distributional properties of the resulting estimator?
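A second model that could be used is
$$\begin{pmatrix} Y_{ij1} \\ Y_{ij2} \\ Y_{ij3} \\ Y_{ij4} \\ Y_{ij5} \end{pmatrix} = \begin{pmatrix} \mu_{i1} \\ \mu_{i2} \\ \mu_{i3} \\ \mu_{i4} \\ \mu_{i5} \end{pmatrix} + \begin{pmatrix} \epsilon_{ij1} \\ \epsilon_{ij2} \\ \epsilon_{ij3} \\ \epsilon_{ij4} \\ \epsilon_{ij5} \end{pmatrix} \qquad \text{(model B)}$$
where $\mu_i = (\mu_{i1}\ \mu_{i2}\ \mu_{i3}\ \mu_{i4}\ \mu_{i5})^T$ is a vector of mean responses to the five levels of
PBG for the i-th treatment group, and $\epsilon_{ij} = (\epsilon_{ij1}\ \epsilon_{ij2}\ \epsilon_{ij3}\ \epsilon_{ij4}\ \epsilon_{ij5})^T \sim NID(0, R)$.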
Compound symmetry, AR(1) and a general unstructured covariance matrix were
considered for R. Results for REML and maximum likelihood estimation (ML) are shown
below.
                           ML                    REML
Model for R                -2(log-likelihood)    -2(log-likelihood)
Model A                    267.35                243.15
Model B
  Compound symmetry        257.21                236.14
  AR(1)                    236.53                218.88
  Unstructured             232.18                214.37
3(e) What are the AIC values corresponding to the REML log-likelihoods for the three
versions of model B? What does this information tell you?
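The AIC arithmetic for 3(e) can be sketched as follows. Two assumptions are made and should be checked against the formula sheet: that the second column of values in the table is the REML column, and that AIC is computed as -2(REML log-likelihood) plus twice the number of covariance parameters (fixed effects are not counted under REML). The dictionary names are illustrative.

```python
# Number of repeated measurements per rabbit (five PBG levels).
n_levels = 5

# -2(REML log-likelihood) for the three versions of model B,
# assuming the second column of values in the table is the REML column.
neg2_reml = {
    "Compound symmetry": 236.14,
    "AR(1)": 218.88,
    "Unstructured": 214.37,
}

# Covariance parameters in R for each structure.
n_cov_params = {
    "Compound symmetry": 2,                       # one variance, one common covariance
    "AR(1)": 2,                                   # one variance, one autocorrelation
    "Unstructured": n_levels * (n_levels + 1) // 2,  # 5 variances + 10 covariances
}

# Assumed convention: AIC = -2 logL + 2 * (number of covariance parameters).
aic = {name: neg2_reml[name] + 2 * n_cov_params[name] for name in neg2_reml}

# Smaller AIC indicates the preferred covariance structure under this criterion.
preferred = min(aic, key=aic.get)

for name, value in aic.items():
    print(f"{name:18s} AIC = {value:.2f}")
print("Smallest AIC:", preferred)
```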
3(f) If possible, show how to test the hypothesis that R satisfies the AR(1) covariance
structure for model B. Give degrees of freedom for your test.
3(g) If possible, show how to test the null hypothesis that model A is appropriate for these
data. Give degrees of freedom for your test.
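Both 3(f) and 3(g) can be approached with likelihood ratio tests against model B with unstructured R. This is only a sketch under several assumptions: the first column of values in the table is ML and the second is REML; REML likelihoods are compared only when the mean structure is the same (3(f)); ML likelihoods are used when the mean structures differ (3(g)); and model A is treated as nested in the unstructured version of model B. The helper `likelihood_ratio_test` is hypothetical, and the critical values are standard chi-square table entries.

```python
def likelihood_ratio_test(neg2_reduced, neg2_full, n_params_reduced, n_params_full):
    """Return (statistic, degrees of freedom) for nested models."""
    statistic = neg2_reduced - neg2_full
    df = n_params_full - n_params_reduced
    return statistic, df


# 3(f): AR(1) versus unstructured R, same mean structure, REML values.
# AR(1) has 2 covariance parameters; unstructured 5x5 has 15.
stat_f, df_f = likelihood_ratio_test(218.88, 214.37, 2, 15)

# 3(g): model A versus model B (unstructured), ML values.
# Model A: 4 fixed effects (two intercepts, two slopes) + 4 variance components.
# Model B: 10 means + 15 covariance parameters.
stat_g, df_g = likelihood_ratio_test(267.35, 232.18, 4 + 4, 10 + 15)

# Upper 5 percent points of the chi-square distribution (table values).
critical_value = {13: 22.362, 17: 27.587}

for label, stat, df in [("3(f)", stat_f, df_f), ("3(g)", stat_g, df_g)]:
    reject = stat > critical_value[df]
    print(f"{label}: statistic = {stat:.2f}, df = {df}, reject at 0.05: {reject}")
```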