Statistical Methods in Epi II (171:242) The Cox Regression Model: Functional Form of the Covariates Brian J. Smith, Ph.D. February 24, 2003 Functional Form of the Covariates In this section we will cover different parameterizations of the covariates included in the regression model t , x i 0 t exp βx i 0 t exp 1 x1i K xKi Implicitly, we assume that the predictors are continuous covariates. Although the relative risk component is generally written as an additive function of the predictors, interaction terms and transformed versions of the covariates can be included in the model. Interaction: Categorical Variables Recall the bone marrow transplant study of Non-Hodgkin’s and Hodgkin’s lymphoma patients. The primary goal was to compare diseases-free survival between allogeneic or autologous transplant recipients. Also measured, were the patient’s Karnofsky score and waiting time to receive the transplant. The data are summarized in Table 1 and Table 2. Table 1. Summary of the categorical variables in the lymphoma study. Variable Levels Event Indicator 0 = Disease-Free 1 = Death or Relapse Graft 0 = Allogeneic 1 = Autologous Disease 0 = Non-Hodgkin’s 1 = Hodgkin’s 1 N 17 26 16 27 23 20 Percents 40% 60% 37% 63% 53% 47% Table 2. Summary of the continuous variables in the lymphoma study. Variable Disease-Free Survival in Days Karnofsky Score Wait Time Mean Std. Dev. Min Max 429.7 570.5 2 2144 76.3 37.7 21.0 33.6 20 5 100 171 A Cox regression model was fit with main effects for graft, type of disease, Karnofsky score, and wait time; the resulting estimates are ˆ i se ̂ i p-value Graft -0.2332 0.4430 0.60 Disease 0.9806 0.5226 0.061 Karnofsky -0.05584 0.01216 < 0.0001 Wait -0.00786 0.00788 0.32 No interactions between variables were specified in the model. Thus, the effect of a covariate, say graft, is the same across all levels of the remaining variables. For instance, the estimated risk of relapse for autologous patients, relative to allogeneic patients is t, x1 1 exp ˆ1 1 exp ˆ1 t, x1 0 exp ˆ1 0 exp 0.2332 0.792 . Because only main effects were included in the model, it is assumed that this hazard ratio is the same among both NonHodgkin’s and Hodgkin’s patients. However, in our stratified logrank analysis, we saw that the hazard rates were reversed in order when going from one disease type to the other. Disease was, in fact, found to be an effect modifier that interacted with graft type in its effect on survival. Therefore, an interaction term is needed in the Cox model. 2 If an interaction term is formed by multiplying together the indicator variables for graft and disease, the result is Variable Graft*Disease Levels 0 = Otherwise 1 = Autologous and Hodgkin’s N 28 15 Percents 65% 35% and the model becomes t , x i 0 t exp 1 x1i 2 x2i 1, 2 x1i x2i 3 x3i 4 x4i . The resulting estimates are given below. ˆi se ̂ i p-value Graft 0.6394 0.5937 0.2800 Disease 2.7603 0.9474 0.0036 Graft*Disease Karnofsky -2.3709 -0.0495 1.0355 0.0124 0.0220 0.0001 Wait -0.0166 0.0102 0.1000 The interaction term is highly significant (p = 0.0001). Notice the differences in the parameter estimates, particularly for graft and disease, as compared to the model without an interaction term. In particular, the coefficient for graft type has changed sign, and the coefficient for disease is about three times larger than before. With the interaction term in the model, the hazard ratio for graft type is allowed to vary between the two disease types. Autologous vs. Allogeneic (Non-Hodgkin’s): t , x1 1, x2 0 exp ˆ1 1 ˆ2 0 ˆ1, 2 0 exp ˆ1 ˆ ˆ ˆ t , x1 0, x2 0 exp 1 0 2 0 1, 2 0 exp 0.6394 1.895 Autologous vs. Allogeneic (Hodgkin’s): 3 t , x1 1, x2 1 exp ˆ1 1 ˆ2 1 ˆ1, 2 1 exp ˆ1 ˆ1, 2 t , x1 0, x2 1 exp ˆ1 0 ˆ2 1 ˆ1, 2 0 exp 0.6394 2.3709 0.177 Interaction: Continous Variables We could likewise have an interaction term that involves a continuous variable. Consider the breast-feeding example. We originally estimated coefficients for the following main effects. White ˆ1 Black Poverty Smoke Alcohol ˆ 2 -0.305 -0.111 ̂ 3 ˆ 4 ̂ 5 -0.211 0.249 0.168 Care Age Educ ̂ 6 ̂ 7 ̂ 8 -0.027 0.020 -0.056 Again, the main effects model specifies that the effect of a given covariate is the same across all levels of the remaining variables. For example, the effect of education in this model is the same for both blacks and whites. However, differences in educational environments might suggest that an incremental increase in years of education has a different effect for whites and blacks. Thus, we might construct two interaction terms by separately multiplying the education variable by the white and black race indicators. The resulting model is t , x i 0 t exp 1 x1i 2 x2i 8 x8i 1,8 x1i x8i 2,8 x2i x8i . The regression estimates for the race and education variables, with the interaction terms in the model, are ˆi se ̂ i p-value White 0.1018 0.5030 0.840 Black -0.2853 0.9441 0.760 Educ -0.0336 0.0373 0.370 4 White*Educ Black*Educ -0.0344 0.0125 0.0421 0.0769 0.410 0.870 The likelihood ratio test for the two interaction terms indicates that they are not significant (p = 0.6172). Nevertheless, we can use these estimates to illustrate hazard ratio estimation in the presence of a continuous interaction term. Education 16 years vs. 12 years (White Mothers): exp 4ˆ ˆ t , x1 1, x8 16 exp ˆ1 1 ˆ8 16 ˆ1,8 16 t , x1 1, x8 12 exp ˆ1 1 ˆ8 12 ˆ1,8 12 8 1,8 exp 4 0.0336 0.0334 0.762 Education 16 years vs. 12 years (Black Mothers): exp 4ˆ ˆ t , x2 1, x8 16 exp ˆ2 1 ˆ8 16 ˆ2,8 16 t , x2 1, x8 12 exp ˆ2 1 ˆ8 12 ˆ2,8 12 8 2 ,8 exp 4 0.0336 0.0125 0.919 Indicator Variables for Categorical Covariates Indicator variables should be used to quantify the effects of a categorical predictor in the model. An indicator variable is constructed for each level of the categorical predictor. For level L, the indicator takes on a value of one if a subject falls within that level; otherwise, the indicator is zero. The variables “white” and “black” in the breast-feeding example are indicator variables for the categorical predictor race (white, black, or other). We did not, however, include in the model another indicator for mothers of race “other”. The reason for this is that a variable cannot be included if it 5 is a linear function of other variables already in the model. Since the indicator variable other = 1 – black – white, it cannot be included. In effect, it becomes the reference group in the model. Transformations for Continuous Covariates Veteran’s Administration Lung Cancer Trial The effectiveness of a new drug for lung cancer was tested by the Verteran’s Administrations. A summary of the variables collected in the study is given in Table 3 and Table 4. Table 3. Summary of the categorical variables in the VA lung cancer study. Variable Event Indicator Treatment Cell Type * Small Cell * Adenocarcinoma * Large Cell Prior Therapy Levels 0 = Censored 1 = Death 1 = Standard 2 = Test 1 = Squamous 2 = Small Cell 3 = Adenocarcinoma 4 = Large Cell 0 = No 1 = Yes 0 = No 1 = Yes 0 = No 1 = Yes 0 = No 1 = Yes N 9 128 69 68 35 48 27 27 48 89 27 110 27 110 97 40 * Indicator variables for use in regression models. 6 Percents 7% 93% 50% 50% 26% 35% 20% 20% 71% 29% Table 4. Summary of the continuous variables in the VA lung cancer study. Variable Survival in Days Age in years Karnofsky Score Months from Diagnosis to Study Entry Mean 121.6 58.3 58.6 Std. Dev. 157.8 10.5 20.0 Min 1 34 10 Max 999 81 99 8.77 1.07 1 87 Assessing Linearity For illustrative purposes, a Cox regression model was fit with age as the only predictor in the model. The estimates are as follows: Variable Age ˆ se ˆ 0.0075 0.00957 p-value 0.43 One might be tempted to conclude that age does not have a significant effect on survival. However, the model that was fit t, xi 0 t exp xi assumes that the relative risk component is linear in the continuous variable age. It is very important that we try to determine whether the effect of age is truly linear, or whether there is some nonlinearity to it. Categorical Method One method of checking for departures from linearity is to categorize x into three or more categories. For example, the age variable could be dived into categories defined as 7 Categorical Age Variables Age1 Age2 Age3 Levels N 1 = (Age 45) 0 = Otherwise 1 = (45 < Age 60) 0 = Otherwise 1 = (Age > 60) 0 = Otherwise 23 114 37 100 77 60 Two of these indicator variables could be used in the Cox model instead of the continuous variable for age. The likelihood ratio test indicates that age is significant in the Cox model as a categorical variable (p = 0.0466); whereas, the Wald test is not significant for age as a linear variable in the model (p = 0.43). This indicates only that the relative risk is not linearly related to age. It does not imply that age has no effect on survival. The small p-value for the categorical measure of age and the non-significance of the linear effect suggests that a nonlinear relationship exists. Graphical Method Another method for detecting nonlinearity is to construct a scatter plot of the survival times by age. A smoothing function can be employed to fit a line to the data. Curves in the fitted line suggest departures from linearity, as is the case in Figure 1. 8 1000 800 600 Survival Time 400 200 0 40 50 60 70 80 Age Figure 1. Scatter plot of the survival times by age in the VA lung cancer trial. Adding Polynomial Terms To address the problem of nonlinearity, we can try adding polynomial terms to the model. The previous scatter plot suggests the quadratic relationship t , xi 0 t exp 1 xi 2 xi2 . which gives the estimates Variable Age Age2 ˆi se ̂ i -0.2190 0.00205 0.0819 0.000734 p-value 0.0075 0.0053 The likelihood ratio test indicates a significant effect of age as a quadratic effect in the model (p = 0.0258). 9 0.0 0.2 0.4 x^2 0.6 0.8 1.0 Great care must be exercised in adding polynomial terms to a regression model. Polynomial terms are highly correlated (see Figure 2), which proves to be problematic in the parameter estimation routines. 0.0 0.2 0.4 0.6 0.8 1.0 x Figure 2 In linear regression we address this problem by centering the covariate about its mean to get xi x . There is very little correlation between polynomial terms computed from these centered variables (see Figure 3). Centering the age measurements about the mean age of 58.3 and refitting the survival model gives Variable Age – 58.3 (Age – 58.3)2 ˆi se ̂ i 0.01963 0.00205 0.00944 0.000734 10 p-value 0.0380 0.0053 -0.4 -0.2 x - mean(x) 0.0 Figure 3 11 0.2 0.4 0.00 0.05 0.15 (x - mean(x))^2 0.10 0.20 0.25