
STAT 557
ASSIGNMENT 7
Name ________________
Reading Assignment:
Lloyd, Chapter 5, review Section 6.2.
Written Assignment:
Solutions will be distributed on December 8.
Final Exam:
Thursday, December 14, 9:45-11:45 a.m., in 171 Durham.
Bring a calculator, pencils, erasers, and formula sheets.
1. Table 6.1 on Page 301 in Lloyd’s book gives the following data on the relationship of age
and marital status of Danes. An individual is classified as divorced if he or she had been
divorced at any time prior to the survey, regardless of whether or not that individual is
currently remarried. Hence, an individual is classified as single if the individual has never
married and an individual is classified as married if the individual has married once and has
never been divorced. Lloyd does not say how individuals who are widowed and never
divorced are classified. Furthermore, Lloyd treats these data as a simple random sample of
185 individuals from the population of Danes who are at least 16 years old. At the time of
the survey, the legal age of marriage in Denmark was 16.
                           Observed Counts
Age Group     Xi      Single   Married   Divorced    Total
17 – 21       19        17        1         0          18
21 – 25       23        16        8         0          24
25 – 30       27.5       8       17         1          26
30 – 40       35         6       22         4          32
40 – 50       45         5       21         6          32
50 – 60       55         3       17         8          28
60 – 70       65         2        8         6          16
70+           75         1        3         5           9
Lloyd uses the Xi variable to model trends across age. We will do the same, although it
would be better to have the actual age of each respondent. These data have been stored in the
file dmstatus.dat. S-PLUS code for completing this problem has been stored in the file
dmstatus.ssc on the course web page.
A. Using marriage as the baseline category, fit the following polychotomous logistic
regression model.
\[
\log\left(\frac{\pi_{\text{single},i}}{\pi_{\text{married},i}}\right) = \alpha_1 + \beta_1 (X_i - 16)
\]
\[
\log\left(\frac{\pi_{\text{divorced},i}}{\pi_{\text{married},i}}\right) = \alpha_2 + \beta_2 (X_i - 16)
\]
Report the value of the log-likelihood ratio $G^2$ statistic, along with its degrees of freedom and p-value,
for testing the fit of this model against the general alternative of eight different and
independent multinomial distributions for the eight age groups. Examine the plot of the
estimated curves against age. This plot also shows the observed proportions of
respondents in the three response categories at each level. Does this model appear to be
appropriate for the data?
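The likelihood ratio computation requested in part A can be sketched in Python (the assignment itself uses S-PLUS). This is a minimal illustration, not the course code in dmstatus.ssc: the helper names g_squared and chi2_upper_tail_even_df are made up for this sketch, and the fitted probabilities come from a stand-in model with one common distribution for all eight age groups rather than from model A. The closed-form tail probability used here holds only for even degrees of freedom.

```python
import math

# Observed counts (single, married, divorced) for the eight age groups in Table 6.1.
COUNTS = [
    (17, 1, 0), (16, 8, 0), (8, 17, 1), (6, 22, 4),
    (5, 21, 6), (3, 17, 8), (2, 8, 6), (1, 3, 5),
]

def g_squared(counts, fitted_probs):
    """Likelihood ratio statistic G^2 = 2 * sum(obs * log(obs / expected))."""
    total = 0.0
    for row, probs in zip(counts, fitted_probs):
        n = sum(row)
        for obs, p in zip(row, probs):
            if obs > 0:  # a zero count contributes nothing to the sum
                total += obs * math.log(obs / (n * p))
    return 2.0 * total

def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Stand-in fitted model (NOT model A): one common distribution for all age groups.
grand_total = sum(sum(row) for row in COUNTS)
pooled = tuple(sum(row[j] for row in COUNTS) / grand_total for j in range(3))
fitted = [pooled] * len(COUNTS)

saturated_params = len(COUNTS) * 2    # two free probabilities per age group
model_params = 2                      # the pooled model has two free probabilities
df = saturated_params - model_params  # for model A this would be 16 - 4 = 12
g2 = g_squared(COUNTS, fitted)
print(f"G^2 = {g2:.2f} on {df} df, p = {chi2_upper_tail_even_df(g2, df):.3g}")
```

For model A, the same function would be applied to the fitted probabilities from the logistic fit, with 16 - 4 = 12 degrees of freedom.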
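B. Does the model
\[
\log\left(\frac{\pi_{\text{single},i}}{\pi_{\text{married},i}}\right) = \alpha_1 + \beta_1 (X_i - 16) + \gamma_1 (X_i - 16)^2
\]
\[
\log\left(\frac{\pi_{\text{divorced},i}}{\pi_{\text{married},i}}\right) = \alpha_2 + \beta_2 (X_i - 16) + \gamma_2 (X_i - 16)^2
\]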
appear to be appropriate for these data?
C. Does the model
\[
\log\left(\frac{\pi_{\text{single},i}}{\pi_{\text{married},i}}\right) = \alpha_1 + \beta_1 \log(X_i - 16)
\]
\[
\log\left(\frac{\pi_{\text{divorced},i}}{\pi_{\text{married},i}}\right) = \alpha_2 + \beta_2 \log(X_i - 16)
\]
appear to be appropriate for these data?
D. Does the model
\[
\log\left(\frac{\pi_{\text{single},i}}{\pi_{\text{married},i}}\right) = \alpha_1 + \beta_1 \log(X_i - 16) + \gamma_1 [\log(X_i - 16)]^2
\]
\[
\log\left(\frac{\pi_{\text{divorced},i}}{\pi_{\text{married},i}}\right) = \alpha_2 + \beta_2 \log(X_i - 16) + \gamma_2 [\log(X_i - 16)]^2
\]
appear to fit these data?
E. Which of the models from parts A-D, if any, provides an adequate description of the
relationships between marital status and age in Denmark? Describe these relationships.
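When comparing the four candidate models, it can help to lay out the covariate terms each one uses and the resulting residual degrees of freedom. The following Python sketch does only that bookkeeping; the function names model_terms and residual_df are hypothetical, and no model is fitted here.

```python
import math

AGES = [19, 23, 27.5, 35, 45, 55, 65, 75]  # the X_i values from Table 6.1

def model_terms(x):
    """Covariate terms (besides the intercept) used by models A-D at age x."""
    shifted = x - 16                 # years past the legal age of marriage
    log_shifted = math.log(shifted)
    return {
        "A": [shifted],
        "B": [shifted, shifted ** 2],
        "C": [log_shifted],
        "D": [log_shifted, log_shifted ** 2],
    }

def residual_df(n_groups, n_terms, n_categories=3):
    """df for G^2 against the saturated model: free probabilities minus parameters."""
    n_logits = n_categories - 1      # two logits, each with married as baseline
    return n_groups * n_logits - n_logits * (1 + n_terms)

for name, terms in model_terms(AGES[0]).items():
    print(name, terms, "residual df =", residual_df(len(AGES), len(terms)))
```

Because model A is nested in model B and model C is nested in model D, the difference between the two G^2 values within each pair can be referred to a chi-square distribution with 2 degrees of freedom.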
2. One alternative to searching for parametric curves to describe the Danish marital status data,
from Problem 1, is to fit non-parametric curves. In Section 5.2, Lloyd describes how the
generalized additive modeling (gam) function in S-PLUS can be used to do this. GAM
models have the form
\[
\text{link function} = \beta_0 + \sum_{j=1}^{K} \beta_j X_{ji} + \sum_{j=K+1}^{P} f_j(X_{ji})
\]
where $(X_{1i}, \ldots, X_{Pi})$ are the values of the P explanatory variables for the i-th case (or
respondent). The functions $\{ f_j(\cdot) : j = K+1, \ldots, P \}$ are not specified; they are estimated with
smoothing routines such as loess, kernel smoothers, or smoothing splines. This approach does
not yield parametric formulas for curves, but the predict.gam command can be
used to compute a list of “smoothed” values at points on a grid that can be used to display the
“smoothed” curve. This can be helpful in detecting trends or patterns.
The S-PLUS code in the file dmstatus.ssc uses the gam( ) function to fit the model
\[
\log\left(\frac{\pi_{\text{single},i}}{\pi_{\text{married},i}}\right) = \alpha_1 + \beta_1 \log(X_i - 16) + f_1(X_i - 16)
\]
\[
\log\left(\frac{\pi_{\text{divorced},i}}{\pi_{\text{married},i}}\right) = \alpha_2 + \beta_2 \log(X_i - 16) + f_2(X_i - 16)
\]
to the Danish marital status data from Problem 1. Examine the resulting plots. What do they
reveal? (Note that this approach may have performed better if the data file contained the
actual age of each respondent instead of only 8 unique age values.)
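As a rough illustration of what a smoother does with only eight distinct ages, the following Python sketch applies a Gaussian kernel smoother to the observed proportion of single respondents and evaluates it on a grid. It is not the gam( ) fit from dmstatus.ssc: it smooths raw proportions rather than logits, and the bandwidth of 8 years and the function name kernel_smooth are arbitrary choices made for this example.

```python
import math

AGES = [19, 23, 27.5, 35, 45, 55, 65, 75]   # X_i values from Table 6.1
SINGLE = [17, 16, 8, 6, 5, 3, 2, 1]         # single counts by age group
TOTAL = [18, 24, 26, 32, 32, 28, 16, 9]     # respondents by age group

def kernel_smooth(x0, xs, ys, weights, bandwidth):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel and case weights."""
    num = den = 0.0
    for x, y, w in zip(xs, ys, weights):
        k = w * math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2)
        num += k * y
        den += k
    return num / den

proportions = [s / n for s, n in zip(SINGLE, TOTAL)]
grid = [20 + 5 * i for i in range(12)]      # ages 20, 25, ..., 75
for age in grid:
    fit = kernel_smooth(age, AGES, proportions, TOTAL, bandwidth=8.0)
    print(f"age {age:2d}: smoothed proportion single = {fit:.3f}")
```

Weighting each age group by its sample size keeps the small groups at the oldest ages from dominating the curve. A smaller bandwidth would track the eight observed proportions more closely, and a larger one would flatten the curve.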
3. The data for this problem come from a randomized clinical trial involving treatment of
patients with squamous carcinoma (cancer) at various sites in the mouth and throat. Patients
who agreed to participate in the study were randomly assigned to one of two treatment
groups, radiation therapy alone or radiation therapy together with a chemotherapeutic agent.
The primary objective of this analysis is to compare the two treatments with respect to
patient survival at one year after treatment was initiated.
Although certain criteria had to be met for a patient to be eligible for this study, which
eliminated extremes in the extent of the disease, many factors are not controlled. The study
included measurement of many covariates that could be related to survival. Seven of those
variables are included in the data. The year that the patient entered the study is also recorded
on the data file.
The data are stored in the file oncology.dat. There is one line of data for each of the 195
patients examined in the study. The variables appear in the following order.
Columns   Variable   Codes
1-3       CASE       A 3 digit patient identification number
5         SEX        1 = male, 2 = female
7         TREAT      1 = radiation only, 2 = radiation and chemotherapy
9         GRADE      1 = well differentiated
                     2 = moderately differentiated
                     3 = poorly differentiated
11-12     AGE        age in years at start of treatment
14        SITE       1 = faucial arch
                     2 = tonsillar fossa
                     3 = posterior pillar
                     4 = pharyngeal tongue
                     5 = posterior wall
16        TSTAGE     1 = primary tumor less than 2 cm. in diameter
                     2 = primary tumor between 2 cm. and 4 cm.
                     3 = primary tumor more than 4 cm. in diameter
                     4 = massive invasive tumor
18        NSTAGE     0 = no clinical evidence of node metastases
                     1 = single positive node 3 cm. or less in diameter
                     2 = single positive node more than 3 cm. in diameter
                     3 = multiple positive nodes
20-21     YEAR       Year patient started treatment (1978 through 1982)
23        STATUS     0 = died within one year of beginning treatment
                     1 = survived for at least one year
The tumor stage (TSTAGE) and node stage (NSTAGE) classifications provide an indication of
the prevalence of tumors at the primary site and neighboring lymph nodes. TSTAGE=1 refers
to a small isolated tumor at the primary site, whereas TSTAGE=4 refers to a massive tumor
extending into the adjoining tissue. TSTAGE levels 2 and 3 refer to intermediate cases.
NSTAGE=0 indicates no evidence of lymph node metastasis and NSTAGE values 1, 2, and 3
indicate increasing lymph node involvement. Patients with classifications (1,0), (1,1), (2,0), or
(2,1) for (TSTAGE, NSTAGE) were excluded from the study. The variable GRADE provides
an indication of the degree of differentiation of the tumor (the degree to which the tumor cells
resemble the host cells) from 1 (well differentiated) to 3 (poorly differentiated).
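Reading the fixed-column layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the course code: the names LAYOUT, parse_record, and is_eligible are hypothetical, the sample line is invented rather than taken from oncology.dat, and YEAR is assumed to be stored as a two-digit value because it occupies two columns.

```python
# (name, first column, last column), using the 1-based columns from the layout
LAYOUT = [
    ("CASE", 1, 3), ("SEX", 5, 5), ("TREAT", 7, 7), ("GRADE", 9, 9),
    ("AGE", 11, 12), ("SITE", 14, 14), ("TSTAGE", 16, 16),
    ("NSTAGE", 18, 18), ("YEAR", 20, 21), ("STATUS", 23, 23),
]

# (TSTAGE, NSTAGE) pairs that the text says were excluded from the study
EXCLUDED_STAGES = {(1, 0), (1, 1), (2, 0), (2, 1)}

def parse_record(line):
    """Convert one fixed-column line into a dict of integer fields."""
    return {name: int(line[first - 1:last]) for name, first, last in LAYOUT}

def is_eligible(record):
    """True if the (TSTAGE, NSTAGE) pair is not one of the excluded combinations."""
    return (record["TSTAGE"], record["NSTAGE"]) not in EXCLUDED_STAGES

# Hypothetical line, invented for illustration; it is not taken from oncology.dat.
sample = "001 1 2 2 58 3 3 1 80 1"
record = parse_record(sample)
print(record, is_eligible(record))
```

Running is_eligible over every parsed line is a quick check that the file contains none of the excluded stage combinations.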
In addition to the primary question of whether the combined radiation and chemotherapy
treatment provides higher one year survival rates than the radiation treatment, it is of
considerable interest to determine the extent to which the covariates (SEX, GRADE, AGE,
SITE, TSTAGE, NSTAGE, YEAR) are associated with one year survival rates.
Code for applying the SAS version of the CHAID algorithm is stored in the file oncology.sas
on the course web page. Code for applying the S-PLUS tree ( ) function to these data is
stored in the file oncology.ssc. Use this code to investigate associations between the covariates
and the one year survival rate. Under what circumstances, if any, do the one year survival rates
differ for the two treatments? Which treatment is better? How would you use these results to
help formulate a logistic regression model for predicting one year survival?
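Tree-building procedures choose splits by measuring the association between a candidate predictor and the response. As a hedged sketch of that idea (not the CHAID or tree( ) code from the course files), the Python function below computes the Pearson chi-square and likelihood ratio statistics for a two-way table. The function name is hypothetical, and the counts are invented for illustration rather than taken from oncology.dat.

```python
import math

def pearson_and_deviance(table):
    """Pearson X^2 and likelihood ratio G^2 for independence in a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (obs - expected) ** 2 / expected
            if obs > 0:  # a zero count contributes nothing to G^2
                g2 += 2.0 * obs * math.log(obs / expected)
    return x2, g2

# Hypothetical counts, invented for illustration
# (rows: TREAT = 1, 2; columns: STATUS = 0, 1).
made_up = [[30, 20], [20, 30]]
x2, g2 = pearson_and_deviance(made_up)
print(f"X^2 = {x2:.3f}, G^2 = {g2:.3f}")
```

Applying the same function within subsets of patients (for example, within each TSTAGE level) gives a rough picture of where the treatment and survival association is strongest, which is the kind of pattern a tree is designed to surface.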