STAT 405 – Biostatistics ~ Assignment #9 (102 points)

1. Mayo Clinic Lung Cancer Data

Description: Survival in patients with lung cancer at Mayo Clinic. Karnofsky performance scores rate how well the patient can perform usual daily activities.

Variable Names:
inst:       Institution code (DO NOT USE IN YOUR ANALYSIS)
time:       Survival time in days
status:     Censoring status, 0=censored, 1=dead
age:        Age in years
sex:        Male=1, Female=2
ph.ecog:    ECOG performance score (0=good, 5=dead)
ph.karno:   Karnofsky performance score (bad=0, good=100) rated by physician
pat.karno:  Karnofsky performance score rated by patient
meal.cal:   Calories consumed at meals
wt.loss:    Weight loss in last six months

Download Lung.txt from the course website and then source it into R. Then do the following to make sex a factor variable and fix the censoring variable so that censored = 0 and dead = 1.

> source(file.choose())    # locate the Lung.txt file in your directory
> Lung$sex = as.factor(Lung$sex)
> Lung$status = Lung$status - 1

a) Perform a test to see if the survival experience differs across gender. Obtain and plot the Kaplan-Meier estimates for both genders. Discuss. (4 pts.)

b) Fit a model using age, sex, ph.ecog, ph.karno, pat.karno, meal.cal, and wt.loss. Briefly summarize the model (don’t compute HRs). Which variables appear to be significant? (4 pts.)

c) Test the proportional hazards assumption for the significant covariates (p < .10). Which variables appear to violate the proportional hazards assumption? Include plots from the cox.zph function in support of the p-values from the PH test. (4 pts.)

Notice: There is only one patient classified as having an ECOG performance score of 3 or more. We might want to collapse categories so ECOG would be coded 0, 1, 2+. This can be done using the following commands.
> table(Lung$ph.ecog)
 0  1  2  3
47 81 38  1
> Lung$ph.ecog[Lung$ph.ecog==3] = 2
> table(Lung$ph.ecog)
 0  1  2
47 81 39

We could then treat ph.ecog as a factor variable with levels 0, 1, 2 (which is 2+ now).

d) Using sex as a stratification variable, refit the model from part (b) with the recoded ph.ecog instead of the original one. Remove any predictors from your model with p-value > .20.
   i) Confirm that your reduced model is OK using a general Chi-square test. (3 pts.)
   ii) Assess proportional hazards (cox.zph) for your reduced model. Discuss. (3 pts.)
   iii) Examine martingale residual plots for the variables that are treated as numeric. Do you see any evidence of the need to transform any of the predictors? (3 pts.)
   iv) Examine DFBETAS for all estimated coefficients in your final model. Do you see any problems with undue influence? Discuss. (4 pts.)

e) Interpret the coefficients in your final model from part (d). If the LCL for the HR is below one and the UCL is not, discuss how large the HR could be by focusing on the UCL. (4 pts.)

f) Construct plots of the survival curves for different cohorts of patients. Choose these curves so that they illustrate the effect of the continuous covariates in your final model. Label the plots so the cohorts are identifiable from the plots. For any particular set of values for the predictors in your model you will get two curves, one for males and one for females (i.e. one for each stratum), and thus you will always get two cohorts determined by gender automatically. Discuss each of your plots. (8 pts.)

2. A Clinical Trial in the Treatment of Carcinoma of the Oropharynx

Datafile: Pharynx.R
SIZE: 192 observations, 13 variables. Use Source R Code… to read it into R.

DESCRIPTIVE ABSTRACT: The Pharynx.R file gives the data for a part of a large clinical trial carried out by the Radiation Therapy Oncology Group in the United States.
The full study included patients with squamous carcinoma of 15 sites in the mouth and throat, with 16 participating institutions, though only data on three sites in the oropharynx reported by the six largest institutions are considered here. Patients entering the study were randomly assigned to one of two treatment groups, radiation therapy alone or radiation therapy together with a chemotherapeutic agent. One objective of the study was to compare the two treatment policies with respect to patient survival.

LIST OF VARIABLES:

Variable   Description
_______________________________________________________________________
CASE       Case Number
INST       Participating Institution
SEX        1=male, 2=female
TX         (DO NOT USE AS A COVARIATE!!!) (DO NOT USE AS A COVARIATE!!) Treatment: 1=standard, 2=test
GRADE      1=well differentiated, 2=moderately differentiated,
           3=poorly differentiated, 9=missing (missing cases deleted!)
AGE        In years at time of diagnosis
COND       Condition: 1=no disability, 2=restricted work,
           3=requires assistance with self care, 4=bed confined,
           9=missing (missing cases deleted!)
SITE       1=faucial arch, 2=tonsillar fossa, 4=pharyngeal tongue
T_STAGE    1=primary tumor measuring 2 cm or less in largest diameter,
           2=primary tumor measuring 2 cm to 4 cm in largest diameter with
             minimal infiltration in depth,
           3=primary tumor measuring more than 4 cm,
           4=massive invasive tumor
N_STAGE    0=no clinical evidence of node metastases,
           1=single positive node 3 cm or less in diameter, not fixed,
           2=single positive node more than 3 cm in diameter, not fixed,
           3=multiple positive nodes or fixed positive nodes
ENTRY_DT   Date of study entry: day of year and year, dddyy
STATUS     0=censored, 1=dead
TIME       Survival time in days from day of diagnosis
_______________________________________________________________________

STORY BEHIND THE DATA: Approximately 30% of the survival times are censored owing primarily to patients surviving to the time of analysis.
Some patients were lost to follow-up because the patient moved or transferred to an institution not participating in the study, though these cases were relatively rare. From a statistical point of view, an important feature of these data is the considerable lack of homogeneity between the individuals being studied. Of course, as part of the study design, certain criteria for patient eligibility had to be met, which eliminated extremes in the extent of disease, but still many factors are not controlled.

This study included measurements of many covariates which would be expected to relate to survival experience. Six such variables are given in the data (sex, T staging, N staging, age, general condition, and grade). The site of the primary tumor and possible differences between participating institutions require consideration as well.

The T, N staging classification gives a measure of the extent of the tumor at the primary site and at regional lymph nodes. T=1 refers to a small primary tumor, 2 centimeters or less in largest diameter, whereas T=4 is a massive tumor with extension to adjoining tissue. T=2 and T=3 refer to intermediate cases. N=0 refers to there being no clinical evidence of a lymph node metastasis, and N=1, N=2, N=3 indicate, in increasing magnitude, the extent of existing lymph node involvement. Patients with classifications T=1,N=0; T=1,N=1; T=2,N=0; or T=2,N=1, or with distant metastases, were excluded from the study.

The variable general condition gives a measure of the functional capacity of the patient at the time of diagnosis (1 refers to no disability whereas 4 denotes bed confinement; 2 and 3 measure intermediate levels).
The variable grade is a measure of the degree of differentiation of the tumor (the degree to which the tumor cell resembles the host cell) from 1 (well differentiated) to 3 (poorly differentiated).

In addition to the primary question of whether the combined treatment mode is preferable to the conventional radiation therapy, it is of considerable interest to determine the extent to which the several covariates relate to subsequent survival. It is also imperative in answering the primary question to adjust the survivals for possible imbalance that may be present in the study with regard to the other covariates. Such problems are similar to those encountered in the classical theory of linear regression and the analysis of covariance. Again, the need to accommodate censoring is an important distinguishing point. In many situations it is also important to develop nonparametric and robust procedures since there is frequently little empirical or theoretical work to support a particular family of failure time distributions.

Analyze these data and summarize your findings. Make sure to address whether the treatment is related to patient survival, which means you DO NOT remove Tx from the model! (15 pts.)

3. AIDS Survival

Conduct an analysis of time to AIDS diagnosis or death (time) with censor indicator (censor). The data are contained in the file AidsIDV.R, which you can read into R using “Source R code…” from the File pull-down menu. Do NOT use the variables id, time_d, censor_d, txgrp, cd4, or cd4ind in your analysis. Summarize your findings. Interest centers on whether or not the addition of IDV (tx) to their treatment regimen increases “survival”. (20 pts.)

Nonparametric Methods Problems

4. Hypertension (Rosner pg. 347) – Problems 9.19 – 9.23 (1, 2, 1, 2, 2 pts.) Enter these data yourself!

5. Health Promotion (Rosner pg. 348) – Problems 9.28 – 9.32 (Smoke.JMP) (2 pts. each)

6. Table 6.12
Conduct an appropriate test to answer the question of interest.
If you conclude there are differences between these groups in terms of typical protoporphyrin level, then use multiple comparisons to determine which groups significantly differ. (6 pts.)

7. Table 7.12
Conduct an appropriate test to answer the question of interest. If you conclude there are differences in the peak ft/lb readings across treatment groups, then use multiple comparisons to determine which groups significantly differ. (6 pts.)
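
Note on the workflow for Problems 6 and 7: the handout does not name the test, so the following is only a minimal sketch of one candidate approach, assuming the groups are independent samples. It runs a Kruskal-Wallis test and, if that rejects, pairwise Wilcoxon rank-sum tests with a Holm adjustment. The data frame toy, its columns group and y, and the 0.05 cutoff are hypothetical; the values are simulated and are NOT those of Table 6.12 or Table 7.12.

```r
# Hypothetical data: simulated for illustration only, NOT the values from
# Table 6.12 or Table 7.12. Replace with the real table entries.
set.seed(405)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 10)),
  y     = c(rexp(10, rate = 1/20),
            rexp(10, rate = 1/35),
            rexp(10, rate = 1/60))
)

# Overall nonparametric test for a difference among the groups
# (assumes independent groups; a blocked design would need another test).
kw <- kruskal.test(y ~ group, data = toy)
print(kw)

# Follow up only if the overall test rejects. The Holm adjustment controls
# the family-wise error rate across the pairwise rank-sum tests.
if (kw$p.value < 0.05) {
  pw <- pairwise.wilcox.test(toy$y, toy$group, p.adjust.method = "holm")
  print(pw)
}
```

With the real data, group would hold the group labels from the table and y the measured response. If the design in a table turns out to be paired or blocked rather than independent groups, this sketch does not apply as written.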