External validation and updating of a prognostic survival model

Patrick Royston & Mahesh K B Parmar
Hub for Trials Methodology Research
MRC Clinical Trials Unit and University College London
222 Euston Road,
Douglas G. Altman
Centre for Statistics in Medicine, University of Oxford
17 March 2010
We propose the view that external validation of a prognostic survival model means evaluating its performance in an independent (secondary) dataset obtained separately from the data
in the original (primary) study, with a minimum of change or reconstruction of the model. The
primary and secondary datasets should consist of patients with a set of prognostically relevant
predictors in common, comparably defined time-to-event outcomes with similar follow-up times,
and the same clinical condition observed in similar settings. We suggest graphical and analytical techniques for evaluating the fit of a proposed model in the secondary dataset. The analyses
involve evaluating by how much minor changes to the model in the secondary dataset improve
the fit there. The ‘ideal’ model does not require any such modification. Other assessments
of performance we discuss include a goodness of fit test and a measure of explained variation.
We note that full validation of a model requires the model to provide a complete probabilistic
description of the data, sufficient to predict the survival probabilities at any relevant timepoint
and for any combination of values of the prognostic factors. The Cox proportional hazards
model was designed to estimate the effects of covariates on the hazard function, but not to
estimate or predict survival probabilities; by definition, it does not satisfy our full validation
requirement. Instead, we recommend use of adequately flexible parametric survival models,
both for estimation and validation. We focus on the families of models proposed by Royston &
Parmar. We show how the techniques may be used in the analysis of two datasets in cancer.
(Research report No. 307, Department of Statistical Science, University College London. Date: March 2010.)

Techniques for validation of regression models may be classified as internal or external. The key
element of internal validation is that only one dataset (the primary dataset) is used. Internal
validation always involves some form of data re-use. One older technique is to split the data
at random, develop a model on one portion and evaluate it on the other. More recent and
arguably more appropriate techniques include k-fold cross-validation and bootstrap resampling
[1]. While internal validation can throw light on possible fragilities (such as ‘over-optimism’) of
the model, it can never be an adequate substitute for validation on independent data (external
validation). Unvalidated models are likely to be “quickly forgotten” [2].
In external validation, two independent datasets (primary and secondary) are required.
Typically, the primary dataset is collected first and a prognostic model is derived from it.
Later, further relevant data are assembled to form the secondary dataset and the performance
of the model is evaluated. In this paper, we propose the view that external validation of a
prognostic survival model derived on the primary dataset means evaluating its performance in
the secondary dataset with a minimum of change or reconstruction of the model. Prerequisites
of such analyses are that the primary and secondary datasets comprise patients having a set of
prognostically relevant predictors in common, comparably defined time-to-event outcomes with
roughly similar follow-up times, and the same clinical condition observed in similar settings.
The acceptable degree of difference between the populations from which the samples were drawn
is a matter of debate [3]; for example, it may be unreasonable to expect a model derived on a
sample of adults to perform well in children. However, we do not address that issue here. The
populations and samples under study are taken as given.
External validation of prognostic models with a binary outcome (usually, logistic regression
models) has become quite a well-established practice in the medical literature. An example is
reference [4]. However, the task there is much simpler than for survival models, since predictions
are essentially estimates of the patient- or group-specific probability of the event of interest,
conditional on covariates, at a predefined, fixed point in time. Estimating and comparing
survival probabilities across the whole epoch since study inception are avoided. By contrast,
there is little statistical literature on formal techniques for external validation of models for time-to-event data. Authors of medical papers tend to confine themselves to creating a prediction
score or nomogram (a graphical version of the score) on the primary data (or using an existing
score), generating prognostic groups from the score, and plotting Kaplan-Meier survival curves
for the prognostic groups in the secondary data (for example, references [5], [6]). A measure
of discrimination (see below) of the score, such as Harrell’s concordance or c-index [7], may be
evaluated on the primary and secondary datasets, and compared. Sometimes, a significance
test, based on the logrank test or a Cox model, of the relative hazards among the prognostic
groups in the secondary dataset is performed. However, these approaches do not adequately
address the question of how well the model works in the secondary dataset, the main topic of
the present paper.
A prognostic survival model has a systematic component, typically several predictors relevant to the disease in question whose effects are combined linearly into a risk score or prognostic index, and a stochastic part describing the probabilistic element, almost always represented
through the (unspecified) hazard function in a Cox proportional hazards (PH) model. We agree
with van Houwelingen, who states “Generally speaking, validation of well-specified (quantitative) models is better defined than validation of vague (qualitative) models that only produce
subgroups.” [8] More precisely, we regard the characteristic aim of a survival model, which
typically operates with right-censored time-to-event data, to be to estimate or predict survival
probabilities at given times after the temporal origin at either the individual or the group
level, conditional on covariates or prognostic factors. The terms ‘covariates’, ‘predictors’ and
‘prognostic factors’ are used interchangeably in the present context. While in our validation
activities we largely focus on this aspect of model performance, we also look at global measures
of explained variation and goodness of fit.
The predictors used in the model for the primary dataset must somehow be selected from
a set of candidate variables. The process can be done in many ways (e.g. use prespeci…ed
predictors based on literature evidence and/or medical knowledge, use all available predictors,
select variables by criteria of statistical importance, etc.) We do not consider in any detail the
method of selecting predictors in the primary dataset. We concentrate on external validation,
interpreted as the model’s performance in a separate, secondary dataset.
The concept of validation appears at face value to be straightforward: either a model ‘validates’ or it does not. The dichotomy is deceptive. In practice, assessment of model performance
in new data does not give a simple ‘yes/no’ answer. Many performance measures are available,
and each of them is a value on a continuous scale. Subjective judgment of performance is
therefore always needed, guided by the clinical aims of the model [9]. Traditionally, performance
measures are classified in terms of ‘calibration’ or ‘discrimination’. Calibration relates to bias
in predictions, and discrimination to the ability of a model to distinguish between outcomes of
patients with putatively different risks of an event. However, the attributes are not necessarily independent. Further, measures of calibration or discrimination do not directly reflect the
goodness of fit of a model.
Validation is becoming an increasingly important issue in prognostic studies, as interest in
prognostic and ‘predictive’ molecular and genetic markers grows, e.g. references [10], [11]. (A
predictive marker is one that predicts response to treatment. For example, in primary breast
cancer, oestrogen receptor status of the tumour predicts whether a patient’s cancer responds
to the hormonal drug tamoxifen.) Practical guidance on the assessment of such markers is
however very limited.
The Cox proportional hazards model is a near-universal tool in the analysis of prognostic
models in clinical medicine [12]. However, it was not formulated with our concept of full external
validation in mind. The main issue is that a parametric estimate of the baseline survival function
is not available (note that ‘baseline’ refers to zero values of the covariates, not to time t = 0).
Prediction in new data of survival probabilities from a Cox model is therefore problematic.
Further, covariate e¤ects must conform to the proportional hazards assumption, which is quite
limiting, particularly in data with medium- or long-term follow-up. We also work in some detail
with three families of flexible parametric survival (‘Royston-Parmar’) models [13]. The three
families assume proportional (cumulative) hazards, proportional (cumulative) odds and probit-scale covariate effects, respectively. By the ‘scale’ of a model we mean the transformation of
the survival function on which covariate e¤ects are assumed linear. For example, covariates in
the PH model are assumed to act linearly on the log-minus-log transformation of the survival
function, referred to as hazards scaling. In Royston-Parmar models, the scale transformation
(or link function) of the baseline survival function is approximated parametrically by a restricted
cubic spline function (see section 2.2). In the simplest cases, the spline function reduces to a
linear function of log survival time. The models then simplify to the Weibull, log-logistic and
lognormal distributions, respectively, and so are seen as generalizations of these more familiar
survival models.
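As a quick numerical illustration of the hazards-scale special case (the parameter values below are arbitrary, not from the paper), the following sketch checks that a Weibull cumulative hazard H(t) = λ t^p yields a log cumulative hazard that is exactly linear in log time, which is the situation in which the spline reduces to a linear function:

```python
import math

def weibull_log_cum_hazard(t, lam, p):
    # Weibull: S(t) = exp(-lam * t**p), so H(t) = lam * t**p
    # and ln H(t) = ln(lam) + p * ln(t), i.e. linear in ln t.
    return math.log(lam * t ** p)

# The slope of ln H against ln t should equal the shape parameter p everywhere.
lam, p = 0.2, 1.5
pts = [(math.log(t), weibull_log_cum_hazard(t, lam, p)) for t in (0.5, 1.0, 2.0, 4.0)]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print(all(abs(s - p) < 1e-12 for s in slopes))  # True
```

The analogous linear special cases on the odds and probit scales give the log-logistic and lognormal distributions, respectively.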
The goal of the paper is to propose and demonstrate a practical approach to full validation
of a model for time-to-event data. To keep things relatively simple, we do not consider models
with time-dependent covariate effects, that is, either time-varying effects of static covariates or (a
much less common case) dynamic covariates whose values change over time. We aim to show
that flexible parametric modelling is necessary if validation of both the systematic and the
stochastic components of a model is required (as it should be). We propose a technique which
teases out three aspects of possible lack of fit of a model on a secondary dataset: the location
of the baseline survival function, its shape, and calibration of the linear predictor or prognostic
index.
The plan of the paper is as follows. Section 2 outlines the Cox and Royston-Parmar models.
Section 3 describes the methodology, discussing model assessment, goodness of …t, validation,
prediction of survival probabilities, graphical display of results and explained variation. Section
4 describes two cancer datasets and prognostic models used to predict time-to-event outcomes
in them. Section 5 presents analyses of the datasets using the performance measures and
approaches to validation described in section 3. Section 6 is a discussion.
Cox model
In a Cox model with covariate vector x and parameter vector β, the stochastic component is
the (hypothetical) baseline hazard function, h0(t) = h(t; 0), and the systematic part is the
prognostic index, η (eta) = xβ. The baseline hazard function is ‘hypothetical’ in the
sense of depending on how the covariates x are coded and/or centered; there is no guarantee or
necessity that any patient in the sample has x = 0. Estimation of β by maximizing the partial
log likelihood does not require an explicit formulation of h0(t), which is not itself estimated,
hence the description of the model as ‘semi-parametric’. See section 3.5 for the implications of
the semi-parametric nature of the Cox model for validation.
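To make the partial likelihood concrete, here is a minimal sketch (not from the paper) of the partial log likelihood for data without tied event times, with the risk set at each event time formed explicitly:

```python
import math

def cox_partial_loglik(times, events, eta):
    """Cox partial log likelihood given the prognostic index eta = x.beta for each
    subject. Assumes no tied event times; ties would need e.g. Breslow's adjustment."""
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i:  # only observed events contribute terms
            # risk set: subjects still under observation at t_i
            log_denom = math.log(sum(math.exp(eta[j])
                                     for j, t_j in enumerate(times) if t_j >= t_i))
            ll += eta[i] - log_denom
    return ll

# With all eta equal, each event contributes -log(risk-set size):
print(cox_partial_loglik([1.0, 2.0, 3.0], [1, 1, 0], [0.0, 0.0, 0.0]))  # -log(3) - log(2)
```

Note that the baseline hazard h0(t) appears nowhere in this computation, which is precisely why the fitted Cox model carries no transportable estimate of the baseline survival function.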
Hazards-scaled Royston-Parmar models may be seen as parametric extensions of the Cox
model that retain the stochastic/systematic decomposition. Writing the Cox model for the
hazard function h(t; x) as

h(t; x) = h0(t) exp(xβ),

and integrating with respect to time gives the equivalent model

H(t; x) = H0(t) exp(xβ),

a proportional cumulative hazards model. Royston-Parmar models include a parametric representation of H0(t), whereas in the Cox model H0(t) is left unspecified. See section 2.2 for a
general de…nition of Royston-Parmar models.
Royston-Parmar survival models
Royston & Parmar [13] described families of flexible parametric survival models resembling
generalized linear models with various link functions. The models rely on transformation of the
survival function by a link function g(·),

g[S(t; x)] = g[S0(t)] + xβ,    (1)

where S0(t) = S(t; 0) is the baseline survival function and β is a vector of parameters to be
estimated for covariate vectors x. In earlier work, Younes & Lachin [14] used the parameterised
link function g(z; θ) = ln[(z^(−θ) − 1)/θ] proposed by Aranda-Ordaz [15], in which θ = 1
corresponds to the proportional odds model and θ → 0 to the proportional hazards (PH) model.
Royston & Parmar [13] investigated these two models and also considered the probit model, for
which g(s) = −Φ^(−1)(s), where Φ^(−1)(s) is the inverse standard Normal distribution function.
By letting θ → 0 and exponentiating (1), proportional cumulative hazards models are
obtained, namely

H(t; x) = H0(t) exp(xβ),    (2)

where H(t; x) is the cumulative hazard function and H0(t) = H(t; 0) is the baseline cumulative
hazard function. This is also a PH model. Royston & Parmar [13] approximated ln H0(t) by
a restricted cubic spline function of log time with two ‘boundary’ knots and m interior knots,

ln H0(t) = γ0 + γ1 ln t + γ2 v1(ln t) + … + γm+1 vm(ln t),    (3)

where v(ln t) = [ln t, v1(ln t), …, vm(ln t)]′ is a vector of spline basis functions (see Royston &
Parmar [13] for details). Ignoring the constant term γ0, a spline with m knots has m + 1 degrees
of freedom (d.f.). Writing γ = (γ1, …, γm+1)′, the PH model obtained by substituting (3) for
ln H0(t) in (2) is fully specified parametrically and may be written

ln H(t; x) = γ0 + γ′v(ln t) + xβ.    (4)
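The spline basis and the survival function implied by model (4) can be sketched as follows (an illustrative simplification, not the authors' software; the knot values and parameters are arbitrary, and vj uses the restricted cubic spline basis form of Royston & Parmar):

```python
import math

def rcs_basis(x, knots):
    """Restricted cubic spline basis [x, v1(x), ..., vm(x)] with boundary knots
    knots[0], knots[-1] and interior knots knots[1:-1] (Royston-Parmar form)."""
    kmin, kmax = knots[0], knots[-1]
    cube = lambda u: max(u, 0.0) ** 3
    basis = [x]
    for kj in knots[1:-1]:
        lam = (kmax - kj) / (kmax - kmin)
        basis.append(cube(x - kj) - lam * cube(x - kmin) - (1 - lam) * cube(x - kmax))
    return basis

def predicted_survival(t, eta, gamma0, gamma, knots):
    """Model (4): ln H(t; x) = gamma0 + gamma'v(ln t) + eta, hence S = exp(-H)."""
    v = rcs_basis(math.log(t), knots)
    ln_H = gamma0 + sum(g * vj for g, vj in zip(gamma, v)) + eta
    return math.exp(-math.exp(ln_H))

# With no interior knots the spline is linear in ln t (a Weibull model):
knots = [math.log(0.1), math.log(10.0)]
print(predicted_survival(1.0, 0.0, math.log(0.2), [1.5], knots))  # exp(-0.2) ~ 0.819
```

With interior knots present, the same two functions give the fully parametric baseline that the Cox model lacks.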
If we write the prognostic index as η = xβ then model (4) may be extended slightly or ‘recalibrated’ as

ln H(t; x) = γ0 + α0 + γ′v(ln t) + θη.    (5)

Since the term γ′v(ln t) in (5) has not changed, the shape of the baseline cumulative hazard
function is unaltered, but its location changes through α0 + θη when α0 ≠ 0 or θ ≠ 1.
Estimation of θ is relevant in validation studies where recalibration is of interest (see reference
[8], for example). Regression on the prognostic index in the primary (derivation) data yields
θ = 1. In the secondary (validation) data, an estimate of θ much less than 1 may indicate a
model with poor performance. If the estimate is 0, the model is probably useless.
Preliminary investigation
As a first step before any modelling starts, it is worthwhile plotting the overall Kaplan-Meier
survival curves by dataset. If survival varies noticeably across the datasets, inaccurate prediction of survival probabilities from the primary to the secondary datasets may result. If
differences in the shapes of the survival curves are present, systematic differences in the study
populations may be indicated.

Among possible exploratory analyses, it is also worth comparing the distributions of the
covariates between the primary and secondary datasets. Substantial differences here may also
indicate important differences in the study populations.
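The Kaplan-Meier curves would normally be produced with standard statistical software; purely for illustration, a self-contained product-limit estimator (a sketch, not the authors' code) might look like:

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate; returns (time, S(t)) at each event time.
    events[i] = 1 for an observed event, 0 for a censored observation."""
    data = sorted(zip(times, events))
    n = len(data)
    s, at_risk, curve = 1.0, n, []
    i = 0
    while i < n:
        t = data[i][0]
        deaths = removed = 0
        while i < n and data[i][0] == t:  # pool tied times at t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            s *= 1.0 - deaths / at_risk
            curve.append((t, s))
        at_risk -= removed
    return curve

print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 0]))  # [(1, 0.75), (3, 0.375)]
```

Plotting one such curve per dataset on common axes gives the comparison suggested above.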
Components of parametric survival models
The prognostic index, η, is of central importance in validation and assessment of model performance. The spread or variance of η across patients indicates the degree of discrimination,
predictive ability or variation explained by the model [16]. A large variance suggests that a
model has a (relatively) good ability to discriminate between patients who have different risks
of an event. The role of the prognostic index has been widely recognised, but that of the
baseline survival function, S0(t), has often been ignored, presumably because it is not part of
the Cox model. However, S0(t) is arguably of equal importance, since it determines the pattern
of absolute survival probabilities over time, whereas the prognostic index expresses the (log)
relative risk (hazard, etc.) of one individual in the sample relative to another, or relative to a
hypothetical baseline individual for whom all covariate values are zero. An appraisal of a survival model is incomplete without considering both components. For a model to be clinically
useful, it must estimate actual probabilities.
We present an approach to assessing model performance in independent data that takes into
account both the prognostic index and the baseline distribution function. In the proportional
cumulative hazards model (4), the baseline distribution function is the baseline log cumulative
hazard function. Two aspects of the baseline distribution function, represented by the parameters γ0 and γ in (4), may be distinguished. The first, γ0, is a type of intercept and indicates the
position in ‘log relative risk space’ of the baseline distribution function. A higher value of γ0
indicates generally higher risk and hence lower survival. As already mentioned, the parameter
vector γ controls the shape of the baseline distribution function and hence of the predicted survival and hazard functions. The prognostic index raises (when η > 0) or lowers (when η < 0)
the risk for an individual relative to γ0. In a validation study, interest lies in whether values
of γ0, γ or β from the primary dataset still hold good in the secondary dataset. Sometimes a
simple recalibration of η markedly improves model fit in the secondary dataset.
Model validation and updating
In the secondary dataset, the prognostic index is calculated using values of the parameter
vector β estimated on the primary dataset and applied to the covariate values of the secondary
dataset. The components of β are not re-estimated on the secondary dataset. Similarly, the
baseline survival function is calculated in the secondary dataset using the estimates of the
parameters (γ0, γ) obtained on the primary dataset, together with the (log) time values in the secondary
dataset and the set of spline knots used in the primary dataset.
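The key operation here, applying the primary-data coefficient estimates unchanged to secondary-data covariates, is simply a linear combination (a sketch; the variable names are illustrative):

```python
def prognostic_index(X, beta_hat):
    """eta_i = x_i . beta_hat for each secondary-dataset patient, with beta_hat
    estimated on the primary dataset and NOT re-estimated on the secondary one."""
    return [sum(xj * bj for xj, bj in zip(row, beta_hat)) for row in X]

# Two hypothetical patients, two covariates:
print(prognostic_index([[1.0, 2.0], [0.0, 1.0]], [0.5, -1.0]))  # [-1.5, -1.0]
```

The resulting η values, combined with the primary-data baseline parameters and knots, give the Case 1 predicted survival probabilities discussed next.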
It is possible that the predicted survival probabilities, derived from the primary-data estimates
of γ0, γ and β, are sufficiently accurate in the secondary dataset to remove the need to ‘tune’ any of the model
parameters. This is the ideal scenario regarding validation; we refer to it as Case 1. We now
describe six additional cases, which comprise a structured, ordered set of models allowing one
to assess and if necessary improve the fit of the original model in the secondary dataset. The
adjustments provide a route to updating a given model in a simple way, thus avoiding complete
reconstruction (see van Houwelingen [8] for a related approach). Of course, in some situations
reconstruction may be unavoidable. In any case, an updated model should be re-validated in
further independent data.
‘Case 1’ (see section 3.2) is the situation that the parameters of model (4) estimated on the
primary dataset also provide an adequate fit to the secondary dataset. Going beyond that, we
consider four general and parsimonious ways to update the model on the primary dataset using
data from the secondary dataset. These are summarized in Table 1. [TABLE 1 NEAR HERE]
The need for such updating, if detected, indicates different types of lack of fit of the original
model.

                               θ = 1 (not estimated)   θ estimated
  γ0, γ not re-estimated       Case 1                  Case 4
  γ0 re-estimated              Case 2a                 Case 5a
  γ re-estimated               Case 2b                 [Case 5b]*
  γ0 and γ re-estimated        Case 3                  Case 6

Table 1: Definition of Cases 1-6 in structured updating of a flexible parametric survival model
on a secondary dataset. *Infeasible (see text).
1. Re-estimate γ0 (Case 2a). Altering γ0 moves the entire baseline distribution function
up or down. The shift is sometimes referred to as ‘recalibration in the large’ [1]. A
situation in which γ0 changes is where the susceptibility of the patients to an event differs
systematically between the primary and secondary datasets.

2. Re-estimate γ (but not γ0) (Case 2b). This is a more radical revision, changing
the shape of the baseline distribution, but not its general level.

3. Re-estimate γ0 and γ (Case 3). Similar to Case 2b, except that the general level is also
re-estimated.

4. Regarding η as a continuous covariate with a linear effect in the secondary dataset, estimate its slope, θ (Cases 4, 5 and 6). This idea, which was proposed for the logistic
model by Miller [17], posits that the effect of η is linear but may be mis-calibrated (i.e.
θ ≠ 1). Since the estimate of θ is often < 1, θ is sometimes described as a shrinkage factor.

Table 1 shows how Cases 1-6 relate to the types of model revision listed above. The reason
why there is no Case 5b is that estimating θ and re-estimating γ with γ0 fixed is infeasible:
multiplying η by θ affects the value of γ0.
A nice feature of this schema is that almost all of the pairs of models moving down each
column and across each row of Table 1 are nested. The exception is that Case 2a is not nested
in Case 2b, although Cases 2a and 2b nest Case 1 and are nested in Case 3. The deviances
(minus twice the maximised log likelihood) of the 7 models may be tabulated and compared.
The d.f. of the 7 models corresponding to Cases 1-6 are 0, 1, 1 + m, 2 + m, 1, 2, 3 + m,
respectively. Likelihood ratio tests based on 2 approximations may be performed to compare
pairs of nested models.
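For instance, comparing Case 3 with Case 6, which differ only by the single parameter θ, is a 1-d.f. likelihood ratio test, and the chi-squared tail probability on 1 d.f. has the closed form erfc(sqrt(x/2)). A sketch with hypothetical deviance values:

```python
import math

def lr_test_1df(deviance_restricted, deviance_full):
    """P-value of a likelihood ratio test on 1 d.f.
    For X ~ chi-squared(1), P(X > x) = erfc(sqrt(x/2))."""
    x = deviance_restricted - deviance_full
    return math.erfc(math.sqrt(x / 2.0))

# A deviance reduction of 3.84 on 1 d.f. corresponds to p close to 0.05:
print(round(lr_test_1df(503.84, 500.00), 3))  # 0.05
```

Comparisons involving the spline coefficients (d.f. of m + 1 or more) would use the chi-squared distribution with the appropriate degrees of freedom in the same way.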
These simple adjustments to the original model may produce major or minor changes in the
goodness of fit (deviance). Such information may guide the interpretation of the results and,
where appropriate, the choice of a ‘final’ updated model for the secondary dataset.
Goodness of fit
Attention should also be paid to goodness of fit of the model in the primary and secondary
datasets. A common cause of poor fit of a survival model is failure of the proportionality
assumption (e.g. non-proportional hazards). A simple ‘global’ or ‘average’ assessment of non-proportionality of the prognostic index on the chosen scale of a Royston-Parmar model is to
test the interaction between the spline basis functions and the prognostic index. Such tests are
discussed in principle in section 4.1 of Royston & Parmar [13].

The first step is to estimate η̂ on the primary dataset using model (4). In the next step, (4)
is extended to include the interaction between η̂ and the spline basis functions v(ln t), giving
(for cumulative hazards scaling) the following model:

ln H(t; x) = γ0 + γ′v(ln t) + θη̂ + interaction,    (6)

where interaction = δ1 η̂ ln t + δ2 η̂ v1(ln t) + … + δm+1 η̂ vm(ln t). The interaction induces non-proportionality of cumulative hazards by allowing the shape of the distribution function to vary
with η̂. All the parameters in (6) are estimated by maximum likelihood in the primary dataset.
To test for lack of fit, the difference in deviance between (4) and (6) is compared with a χ²
distribution on m + 1 d.f.
Given η̂ from the primary dataset (now evaluated on the secondary dataset), the same
approach to testing an interaction is applied in the secondary dataset. Since the parameters of
(6) are re-estimated in the secondary dataset, first without and then with the interaction term,
the two models that are compared resemble Case 6 (see Table 1). Thus there is recalibration
of the prognostic index and re-estimation of the distribution function. The goodness of fit
assessment in the secondary dataset, therefore, is of a revised model.
Since categorisation of the prognostic index is avoided, the approach is parsimonious, which
should confer good power. However, the properties of the test await detailed evaluation.
In the Cox model, a test of non-proportional hazards in the same spirit may be performed
by applying Grambsch-Therneau tests [18] of standardized Schoenfeld residuals for η in the
primary and secondary datasets. Other possibilities are described by Sauerbrei et al [19].
Validation of a Cox model
Principles similar to those just described for parametric survival models may be applied to
validating a Cox PH model, but with restrictions. As already mentioned, the Cox model does
not include a transportable estimate of the baseline survival function, making prediction of
the latter from the primary to the secondary dataset infeasible. Two updating options remain
from the list given in Table 1, equivalent to Cases 3 and 6 in section 3.3: either θ = 1 (Case
3, no recalibration) or θ is estimated (Case 6, recalibration is performed). In each case, the
(possibly recalibrated) prognostic index is applied to the baseline survival function, estimated
by a standard method (such as Lawless [20]) at η = 0 in the secondary dataset. To test the
hypothesis θ = 1, i.e. to assess the need for recalibration, the deviance difference between
Cases 3 and 6, based on the partial likelihood, may be compared with a χ² distribution on 1
d.f. Further investigation of possible model weaknesses of the type available with a parametric
model cannot be done with the Cox model.
An alternative approach to validating a Cox model is to regard a Royston-Parmar PH model
as a Cox model with a parametric baseline distribution function. Validation is done exactly as
described in section 3.3. The steps of the analysis are
1. Fit the Cox model to the primary data and estimate the prognostic index, η;

2. In Royston-Parmar PH models for the primary data with η as an offset, search for a parsimonious spline function for the baseline cumulative hazard function;

3. Apply the methods of section 3.3 to the secondary dataset.

In fact, since Cox and Royston-Parmar PH models yield almost identical parameter estimates [13], one could dispense with fitting the Cox model and regard validating a Royston-Parmar model as equivalent to validating a Cox model with the same covariates.

Assessment of the discrimination of η in a secondary dataset is straightforward, and is
discussed in section 3.7.
Prediction of survival probabilities and graphical display
A cardinal strength of parametric survival models is their ability to predict survival probabilities
for individuals and groups in the primary and secondary datasets. Survival probabilities in
groups may be estimated by averaging the survival curves for individuals at each of the
observed failure and censoring times. The group survival probabilities may then be compared
with Kaplan-Meier curves.
Risk groups must first be created by categorising the (continuous) prognostic index, η.
Clearly many such schemes are possible. For the purpose of validation (but not necessarily
more generally), we suggest forming three groups using cutpoints at the first quartile and third
quartile of the distribution of η across the individuals in the primary dataset. (If validation
of more extreme predicted risks is important, the cutpoints can be adjusted accordingly.) The
predicted survival curves are averaged over all individuals in a given group at the observed
failure and censoring times. Predicted and observed (Kaplan-Meier) curves may be compared
graphically. As seen in the examples to come, the resulting graph, with all groups displayed
in the same plot, indicates the predictive performance of the model over the entire time range.
We strongly recommend producing such plots, which with parametric models may be done for
Cases 1-6.
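The grouping and averaging steps above can be sketched as follows (illustrative code, not the authors' implementation; q1 and q3 are the prognostic-index quartiles taken from the primary dataset):

```python
def mean_survival_by_group(eta, surv_curves, q1, q3):
    """Average individual predicted survival curves within three risk groups formed
    by cutting eta at q1 and q3; surv_curves[i] is patient i's predicted curve
    evaluated on a common grid of times."""
    groups = {"low risk": [], "mid risk": [], "high risk": []}
    for e, curve in zip(eta, surv_curves):
        key = "low risk" if e < q1 else ("high risk" if e > q3 else "mid risk")
        groups[key].append(curve)
    # pointwise mean over the patients in each non-empty group
    return {k: [sum(col) / len(col) for col in zip(*cs)]
            for k, cs in groups.items() if cs}

curves = [[1.0, 0.9], [1.0, 0.5], [1.0, 0.1]]
print(mean_survival_by_group([-2.0, 0.0, 2.0], curves, -1.0, 1.0))
```

Each averaged curve would then be overlaid on the Kaplan-Meier curve of the corresponding group for the graphical comparison described above.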
For the Cox model, similar plots may be produced, but as discussed in section 3.5, they are
limited to Cases 3 and 6. The predicted survival probabilities are again obtained by averaging
individual survival curves in risk groups. The predicted survival curve for an individual with
prognostic index η is S0(t)^exp(η), where S0(t) is the baseline survival distribution estimated in
the secondary dataset.
Similar graphical comparisons of observed and predicted survival probabilities may also be
produced for the model for the primary dataset. They are quite helpful in assessing the accuracy
of the predicted survival probabilities over time. Comparison of the plots between the primary
and secondary datasets is also informative. For example, the plots may reveal lack of model fit
on the primary or secondary dataset (or both), or that model fit is worse (or indeed better) in
the secondary than the primary dataset.
Measure of explained variation suitable for validation studies
The explained variation statistic R² for the linear regression model y = η + ε, with residual
variance var(ε) = σ², may be written as

R² = var(η) / (σ² + var(η)),    (7)

where var(η) is the variance of the distribution of η across the sample (see eqn. (9) of
Helland [21]). Several authors have proposed versions of explained variation statistics for use
with censored survival data. These measures have not necessarily been extensions of the linear
regression case. For example, Graf et al [22] based their proposed measure on the Brier score
for survival data (a measure of predictive accuracy at the individual level), while others (e.g.
Schemper [23]) have worked with the survival curves estimated from a model.

Royston & Sauerbrei [16] proposed a measure called R²_D, a transformation of their D
measure of discrimination, for use with the Cox model and other fairly general models for
censored survival data. R²_D is defined analogously to R² in (7) as

R²_D = (D²/κ²) / (σ² + D²/κ²),    (8)

where κ² = 8/π and D is the slope of the regression of the survival outcome on scaled Normal
order statistics corresponding to η. See Royston & Sauerbrei [16] for justification and for
explanation of the motivation, interpretation and properties of D. Under the assumption of
normality of the distribution of η across the sample, D²/κ² is an estimate of var(η) that is
nevertheless reasonably robust to departures from normality of η. Furthermore, D and hence
R²_D are insensitive to outliers in η when the underlying distribution of η is normal but possibly
contaminated by a small number of extreme values.

For present purposes, R²_D has the advantages that it was designed for use with Cox models
and Royston-Parmar models of all three types (hazard, odds and probit scaling), and that it is
suitable for assessing explained variation in a secondary dataset (albeit in a ‘Case 6’ scenario,
since the outcome must be regressed on η in the secondary dataset). Although D changes
according to the scaling, the scaling is allowed for in R²_D by altering the value of σ². As a
result, R²_D does not usually change much across different scalings. See Appendix A of Royston
& Sauerbrei [16] for further details, including the values of σ² that are used in (8) with each
type of model. Stata software to compute D and R²_D is available [24].
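Equation (8) is straightforward to evaluate once D is in hand. A sketch (the per-scale σ² values should be taken from Appendix A of Royston & Sauerbrei [16]; π²/6 used below as the hazards-scale value is our assumption):

```python
import math

def rsq_d(D, sigma2):
    """R2_D = (D^2/kappa^2) / (sigma2 + D^2/kappa^2), with kappa^2 = 8/pi.
    sigma2 depends on the model scale (see Royston & Sauerbrei, Appendix A)."""
    kappa2 = 8.0 / math.pi
    v = D * D / kappa2  # estimates var(eta) when eta is approximately Normal
    return v / (sigma2 + v)

print(rsq_d(0.0, math.pi ** 2 / 6))   # 0.0: no discrimination, no explained variation
print(rsq_d(1.0, math.pi ** 2 / 6))   # strictly between 0 and 1, increasing in |D|
```

The statistic lies in [0, 1) and is monotone in |D|, so models can be ranked on it within a given scaling.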
Example datasets and models
Breast cancer
From July 1984 to December 1989, the German Breast Cancer Study Group recruited 720
patients with primary node positive breast cancer into a 2 × 2 trial investigating the effectiveness
of three versus six cycles of chemotherapy and of additional hormonal treatment with tamoxifen.
We analysed the recurrence-free survival time of the 686 patients who had complete data for the
prognostic factors age (x1), menopausal status (x2), tumour size (x3), tumour grade (2 dummy
variables x4a, x4b), number of positive lymph nodes (x5), progesterone (x6) and oestrogen (x7)
receptor status. With updated follow-up [25], 398 patients had had an event recorded. This is
the primary dataset. The secondary dataset is from a trial of 303 patients (158 events) with
identical inclusion criteria, procedures and recorded prognostic factors, but studying a di¤erent
intervention. For comparability, we truncated follow-up time at 9 years in both datasets, thus
censoring 17 (3.1%) of the total of 556 events.
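Truncating follow-up at a common horizon amounts to administrative censoring: observations beyond the cutoff are censored at the cutoff. A sketch (using a (time, event) tuple convention with event = 1 for an observed event):

```python
def truncate_followup(time, event, cutoff):
    """Administratively censor at `cutoff`: observations beyond the cutoff are
    censored at the cutoff, so later events no longer count as events."""
    if time > cutoff:
        return cutoff, 0
    return time, event

print(truncate_followup(10.2, 1, 9.0))  # (9.0, 0): event after 9 years becomes censored
print(truncate_followup(5.0, 1, 9.0))   # (5.0, 1): unchanged
```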
All Cox models for the primary dataset we investigated had significant non-proportional
hazards (see also section 5.1). Using multivariable fractional polynomial (MFP) methodology
[26], which combines backward elimination of weakly influential predictors with selection of fractional polynomial functions for continuous predictors, we developed a Royston-Parmar model
with probit scaling and m = 1 knot (see section 2.2). It included the following predictors: x1^(-1),
x1^(-0.5), x3, x4a, log x5, log(x6+1). This model was validated (i.e. evaluated) on the secondary
dataset. We also evaluated the Cox model III reported in Table 4 of reference [26].
Figure 1 shows Kaplan-Meier curves for the two studies (datasets). [FIGURE 1 NEAR
HERE] The survival curves are extremely similar up to about 3 years and not much different
after that.
Figure 1: Breast cancer datasets. Kaplan-Meier curves for recurrence-free survival by dataset
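Curves like those in Figure 1 can be reproduced from raw (time, event) data with the product-limit estimator. A minimal, illustrative Python sketch (function name ours; ties at an event time are handled by grouping):

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of S(t).

    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns a list of (event time, estimated survival probability).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = at_t = 0
        while i < len(data) and data[i][0] == t:  # group tied times
            d += data[i][1]
            at_t += 1
            i += 1
        if d > 0:  # survival drops only at event times
            s *= 1 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= at_t
    return curve

# toy data: events at t = 1, 3, 4; censoring at t = 2, 5
print(kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]))
```

Plotting one such curve per dataset, truncated at a common follow-up time, gives the comparison shown in the figure.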
Bladder cancer
Over the past two decades, the European Organisation for the Research and Treatment of
Cancer (EORTC) and the UK Medical Research Council (MRC) Working Party on Super…cial
Bladder Cancer have performed several phase III randomised controlled trials for the prophylactic treatment of stage T1 /Ta super…cial bladder cancer following transurethral resection.
Further clinical details are given by Royston et al [27] and references therein. Royston et al
[27] developed a Cox model for the progression-free survival of the 3441 patients (721 events)
in all 9 trials, adjusting for study di¤erences. We combined the seven EORTC trials to form
the primary dataset and the two MRC trials to constitute the secondary dataset. Since the aim
of the analysis was to illustrate the methodology of validation, the small overall di¤erences in
progression-free survival between the studies within each combined dataset were ignored. We
truncated the follow-up at 10 years, thus censoring 21 (2:9%) events overall.We took the prognostic model (10 terms) described in Table 3 of reference [27], …tted it to the primary dataset
and evaluated it on the secondary dataset. A Royston-Parmar proportional cumulative hazards
model with m = 1 (i.e. one interior knot) and a Cox model were used.
Figure 2 shows Kaplan-Meier curves for the two sets of trials (datasets). [FIGURE 2 NEAR
HERE] The survival curves differ in shape, with a much higher hazard in the early follow-up
period in the primary dataset.
Figure 2: Bladder cancer data. Kaplan-Meier curves for progression-free survival by dataset
Notice also the high attrition rates in each dataset, with many
patients being lost to follow-up at all stages of each study. Attrition has the potential to induce
bias if it is not at random with respect to the survival outcome.
Breast cancer example
Column (a) of Table 2 shows the results of the validation analysis for the breast cancer datasets.
[TABLE 2 NEAR HERE] Note that the Royston-Parmar models had probit link functions,
whereas the Cox model assumed proportional hazards.
The Royston-Parmar models corresponding to Cases 1-6 all fit about equally well in the
secondary dataset, suggesting that the model estimated on the primary dataset may be used
to predict in the secondary dataset without the need to modify the baseline distribution or the
prognostic index (Case 1). This is the benchmark or ideal validation scenario. The goodness of
fit test for the prognostic index of the probit model is not significant at the 5% level in either the
primary or the secondary dataset. Figure 3 compares the observed and predicted recurrence-free
survival probabilities for the primary data and Cases 1-6. [FIGURE 3 NEAR HERE]

                                    (a) Breast cancer    (b) Bladder cancer
Royston-Parmar model
  Calibration slope b (SE)            0.99 (0.13)          1.26 (0.13)
  R²_D (SE), primary data             0.23 (0.03)          0.23 (0.02)
  R²_D (SE), secondary data           0.22 (0.05)          0.30 (0.05)
Cox model (Cases 3 and 6)
  Calibration slope b (SE)            0.95 (0.14)          1.27 (0.14)
  R²_D (SE), primary data             0.20 (0.03)          0.23 (0.02)
  R²_D (SE), secondary data           0.17 (0.04)          0.31 (0.05)

Table 2: Model deviances, calibration constants, goodness of fit and explained variation for
Cases 1-6 for (a) breast and (b) bladder cancer datasets. Probit and proportional-hazards
Royston-Parmar models were fitted to the breast and bladder cancer datasets, respectively.
GOF denotes goodness of fit. Deviances for Cases 1-6 are based on maximized likelihoods;
deviances for Cases 3 and 6 are based on maximized partial likelihoods.

Although
Figure 3: Breast cancer data. Plots of observed and predicted recurrence-free survival probabilities from a Royston-Parmar probit model in 3 prognostic groups for the primary data and
Cases 1-6 for the secondary data (see text for details). Dashed lines, predicted probabilities;
solid lines and shaded areas, Kaplan-Meier estimates with their 95% pointwise confidence intervals. Note that CIs for the central group sometimes overlap and obscure those for the outer
groups.
the predicted survival probabilities are not a perfect fit in the poor prognosis group in the
secondary dataset, the number of events beyond 4 years is very small. The results for Case 6 (4
additional parameters fitted) are similar to those for Case 1 (no additional parameters fitted).
The R²_D values are similar between the two datasets. The value of b is near 1.
The results from the Cox model are similar to those from the Royston-Parmar probit model
(see Table 2). However, the PH assumption for the prognostic index is violated in the primary
dataset (P = 0.004, Grambsch-Therneau test). Figure 4 shows the observed and predicted
survival probabilities. [FIGURE 4 NEAR HERE] Slight lack of fit is visible in the extreme
groups, the pattern being similar between the primary and secondary datasets.
Finally, as described in section 3.5, we validated a Cox model with the same covariates as the
probit model through the 'equivalent' Royston-Parmar PH model with 2 df (one interior knot).
Figure 4: Breast cancer data. Plots of observed and predicted recurrence-free survival probabilities from a Cox model in 3 prognostic groups for the primary dataset and for Cases 3 and 6
in the secondary dataset (see text for details). Dashed lines, predicted probabilities; solid lines
and shaded areas, Kaplan-Meier estimates with their 95% pointwise confidence intervals.
The comparison of Cases 1-6 was essentially as given in Table 2, except that the deviances of
all the models were around 660, compared with around 650 for the probit models in Table 2.
This shows that the probit models are a rather better fit.
Bladder cancer example
Column (b) of Table 2 shows the results for Cases 1-6 for the bladder cancer datasets and
Cases 3 and 6 for the Cox model. The pattern of deviance differences suggests that estimating
both the baseline intercept and the baseline spline parameters is required in the Royston-Parmar
model, pointing to a difference in shape and position of the baseline distributions between the
datasets. The fit of the Case 3 model is not significantly worse than that of Case 6, although
Case 6 is marginally better (P = 0.054). The calibration slope in Case 6 is 1.26 (SE 0.13).
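The calibration slope b is the coefficient obtained when the prognostic index, computed from the primary-data model, is entered as the sole covariate in a survival model refitted on the secondary data. A minimal sketch of that refit for the Cox case (Breslow partial likelihood, untied event times assumed, damped Newton-Raphson; function name ours, not a published implementation):

```python
import math

def calibration_slope(times, events, pi, iters=25):
    """Estimate the calibration slope b by fitting a one-covariate Cox model
    with the prognostic index pi as the only predictor.

    b near 1 indicates a well-calibrated prognostic index in the new data.
    """
    order = sorted(range(len(times)), key=lambda i: -times[i])  # descending time
    t = [times[i] for i in order]
    d = [events[i] for i in order]
    x = [pi[i] for i in order]
    b = 0.0
    for _ in range(iters):
        score = info = 0.0
        s0 = s1 = s2 = 0.0  # running risk-set sums of w, w*x, w*x^2
        for i in range(len(t)):
            w = math.exp(b * x[i])
            s0 += w
            s1 += w * x[i]
            s2 += w * x[i] * x[i]
            if d[i]:  # contribution of each event to score and information
                mean = s1 / s0
                score += x[i] - mean
                info += s2 / s0 - mean * mean
        step = score / info
        step = max(-0.5, min(0.5, step))  # damp the Newton step for stability
        b += step
    return b
```

In practice the refit would use the same model family as the original (e.g. a Royston-Parmar probit model for the breast cancer example); the Cox version above conveys the idea with the least machinery.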
Figure 5 shows the observed and predicted survival probabilities in 3 prognostic groups
for the primary data and Cases 1-6 for the secondary data. [FIGURE 5 NEAR HERE] The
Figure 5: Bladder cancer data. Plots of observed and predicted progression-free survival probabilities from a Royston-Parmar proportional hazards model in 3 prognostic groups for the
primary data and Cases 1-6 for the secondary data (see text for details). Dashed lines, predicted probabilities; solid lines and shaded areas, Kaplan-Meier estimates with 95% pointwise
confidence intervals. Note that the vertical axes begin at 0.4.
observed (Kaplan-Meier) and predicted probabilities agree well in the primary data but rather
poorly in all cases in the secondary data. The goodness of fit tests align with this observation,
the P-values being 0.4 and 0.0009 in the primary and secondary data, respectively (Table 2).
Explained variation (R²_D) is actually higher in the secondary data (0.31 vs. 0.23).
The prognostic index was split at its 25th and 75th centiles in the primary dataset. In
the secondary dataset, the risk groups defined using the same cutpoints contain approximately
39/48/13 percent of patients. The case mix is different, with about half as many poor-prognosis
patients (see also Figure 2).
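The grouping just described is a simple percentile cut: the cutpoints are estimated once in the primary data and then applied unchanged to the secondary data. A sketch (helper names ours):

```python
def centile_cuts(values, centiles=(25, 75)):
    """Empirical centiles by linear interpolation between order statistics."""
    v = sorted(values)
    cuts = []
    for c in centiles:
        pos = (len(v) - 1) * c / 100
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(v) - 1)
        cuts.append(v[lo] * (1 - frac) + v[hi] * frac)
    return cuts

def risk_groups(pi, cuts):
    """Assign each prognostic index value to a group:
    0 = good, 1 = intermediate, 2 = poor prognosis."""
    return [sum(p > c for c in cuts) for p in pi]

# illustrative values only: cutpoints from the primary data are
# applied unchanged to the secondary data, so a shifted case mix
# shows up as unequal group proportions there
primary_pi = [0.1, 0.4, 0.5, 0.6, 0.9, 1.2, 1.5, 2.0]
cuts = centile_cuts(primary_pi)
secondary_pi = [0.2, 0.55, 1.0, 1.8]
print(risk_groups(secondary_pi, cuts))
```

By construction the primary-data groups contain roughly 25/50/25 percent of patients; departures from these proportions in the secondary data (here, 39/48/13) directly measure the difference in case mix.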
Results for the Cox model are also given in Table 2. Where available, they mirror closely
the results for the Royston-Parmar model. Figure 6 shows the observed and predicted survival
probabilities. [FIGURE 6 NEAR HERE] Apart from the absence of smoothness of the predicted
Figure 6: Bladder cancer data. Plots of observed and predicted progression-free survival probabilities from a Cox model. For details, see Figure 4.
survival functions, the results are similar to those in Figure 5.
The main conclusion from this example is that the prognostic index of the prognostic model
validates well, but that the baseline survival distributions are substantially different between the
primary and secondary datasets. A possible explanation could be that the patient populations
in the two trial datasets were somewhat dissimilar. The MRC patients had not yet had any
recurrence of their cancer, whereas a substantial proportion of the EORTC patients had had
recurrences. This may explain the higher hazard of early disease progression seen among the
EORTC patients. However, the shape of the survival distribution in the EORTC (primary)
patients is similar between patients who had and had not experienced a recurrence (data not shown).
In our view, validation is not about refitting the predictors in a postulated model on the
secondary dataset and comparing the estimated β's, as some have done. It is about predicting
relevant quantities (such as survival distributions) from a model fully specified on the primary
dataset to a secondary dataset (our Case 1), and examining its accuracy. Only parametric
models support this approach for time-to-event data. Since the Cox model does not estimate
the baseline distribution, it cannot be fully validated. To throw further light on possible lack
of fit, we suggest modifying the model in minor ways corresponding to our Cases 2a-6. Tests
related to these models may point to differences in model fit between the primary and secondary
datasets. We describe the differences in terms of the parameters of the baseline distribution
and of the prognostic index. It is also
informative to apply a simple goodness of fit test; we suggest a proportionality test based on
the prognostic index. A model for which Case 1 fitted as well as Cases 2a-6 and which passed
the goodness of fit test would validate as well as it possibly could, within the limitations of our
method. That is so with the breast cancer example, but not with the bladder cancer example.
Most prognostic survival models in the (medical) literature are reported as estimated β's with SEs or CIs
from a Cox model. Kaplan-Meier survival curves in groups obtained from the prognostic index
are sometimes presented, but no attempt is made to report formal estimates of the baseline
survival distribution. We propose that the baseline survival distribution and the prognostic
index are equally important components of a survival model. Without the baseline distribution,
model specification is incomplete, so both should be estimated and reported. Many parametric
survival models are available (e.g. exponential, Weibull, etc.), but they are little used in clinical
research. Presumably the reason is lack of flexibility, and the concern that the fit may be
poor. Due to the incorporation of spline modelling of the baseline cumulative distribution, the
Royston-Parmar family can successfully model arbitrarily complex distributions. We suggest
that they have potential value in prognostic research, particularly in better reporting and
validation of survival models.
Due to the pre-eminence of the Cox model, models with other link functions are not used
very often. The Royston-Parmar family offers three different link functions. In the breast
cancer data, which has moderately long follow-up, we found that a probit model provided a
much better fit than a proportional hazards model (see also [28]), and also validated well.
We have concentrated here on external validation of survival models. Internal validation
is no substitute, of course, but our methods could be used, for example, with internal data-splitting. With the latter, the data are split (usually unequally) at random into a primary
and secondary dataset. Model development is done on the primary data and validation on the
secondary data, once testing and 'tuning' of the model on the primary data has been completed.
Although our approach is formal in the sense that it provides estimates and significance
tests relevant to the validation task, we regard the graphical component as equally important.
The plots of observed and predicted survival probabilities, in both the primary and secondary
datasets, are a useful tool for illustrating the amount of discrimination offered by the model in
each dataset and the nature of the survival distributions, and may point to areas of lack of fit.
Varying prognostic classification schemes could be used, depending on the clinical context. For
example, particular categorisations of the prognostic index in the secondary dataset could be
used to study predictions at the extremes of the risk distribution.
Perhaps the most challenging questions in validation are what do we expect to see, and what
is it reasonable to expect? A suggested benchmark of 'optimal' performance is for a model to
give satisfactory performance in Case 1 (i.e. neither the baseline distribution nor the prognostic
index requiring revision), to have the calibration slope b near 1, and, in both datasets, to have
comparable R²_D values and to pass a goodness of fit test. The
probit model in the breast cancer example conforms to the benchmark. The bladder cancer
model does not; the predicted survival probabilities do not match the observed probabilities
very closely and Case 6 is a significantly better fit than Case 1, both indicating lack of fit.
Nevertheless, the model's discrimination, according to the R²_D statistic, is actually better in
the secondary dataset. The clinical aim(s) are paramount. If the aim is accurate prediction of
survival probabilities, the bladder cancer model does not perform very well. If the aim is to find
prognostic factors leading to a risk score that discriminates well, the opposite is the case. There
is no absolute standard of performance here. Nevertheless, our proposed benchmark may be
helpful in that it supplies a practical framework in which different aspects of model validation
may be assessed systematically.
A situation we have not considered here is whether it is possible to validate (in our sense) a
published model for which the original (primary) data are not available. With the Cox model,
the answer is clearly 'no', since the baseline survival function is not available. With parametric
models, the answer is a provisional 'yes', depending on publication of the full specification of
the model and its estimated parameters. For Royston-Parmar models, for example, one would
need to know the scale (hazards, odds or probit) of the model, the knot position(s) on t, and
the estimates of the spline coefficients γ and the regression coefficients β. Precisely how
parametric survival models should be reported
with possible validation (and other uses) in mind is a topic for further research.
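One possibility is a machine-readable report of the fitted model. The following is a hypothetical sketch only: the field names are ours and the values are illustrative, not taken from either example, but the fields correspond to the items listed above as necessary for later validation:

```python
# Hypothetical machine-readable specification of a fitted Royston-Parmar
# model: everything a later validator would need (scale, knot positions
# on log time, spline coefficients gamma, regression coefficients beta).
model_spec = {
    "family": "royston-parmar",
    "scale": "probit",                   # one of: hazard, odds, probit
    "time_unit": "years",
    "log_time_knots": [-0.5, 1.1, 2.2],  # boundary and interior knot(s), illustrative
    "gamma": [0.12, 1.05, -0.08],        # spline coefficients, illustrative
    "beta": {"x3": 0.014, "x4a": 0.52},  # regression coefficients, illustrative
}

def is_complete(spec):
    """Check that a published specification carries the minimum
    information needed to reconstruct predicted survival curves."""
    required = {"family", "scale", "log_time_knots", "gamma", "beta"}
    return required <= spec.keys()

print(is_complete(model_spec))
```

A reporting convention of this kind would make Case 1 validation possible even when the primary data themselves cannot be shared.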
Do we recommend the routine use of (flexible) parametric survival models, both for modelling and for validation? The answer is 'yes', provided the chosen model fits the data. In many
datasets (but by no means all) this rules out too-simple models, such as the exponential and
Weibull. Although the Cox model is convenient, its strength lies mainly in estimating relative
hazards for covariate effects. It is 'semi-parametric' in nature and was never intended for the
task of comprehensive prediction of survival probabilities and other important quantities. Although we favour Royston-Parmar models, other flexible parametric models exist, for example
those based on the Poisson distribution (e.g. reference [29]). The original implementation of
Royston-Parmar models is through the stpm user-written command for Stata [30]; an improved
version, stpm2, is available [31]. An undocumented feature of the mfp command in version 11
of Stata [32] allows users to develop MFP models [26] using stpm or stpm2. Features of mfp include elimination of weakly influential variables by backward elimination and determination of
functional form for continuous predictors using fractional polynomial functions. Further Stata
software is available on request from the first author to carry out the computations described
in the paper.
[1] F. E. Harrell. Regression modeling strategies, with applications to linear models, logistic
regression, and survival analysis. Springer, New York, 2001.
[2] J.C. Wyatt and D. G. Altman. Prognostic models: clinically useful or quickly forgotten?
British Medical Journal, 311:1539–1541, 1995.
[3] K. G.M. Moons, D. G. Altman, P. Royston, and Y. T. Vergouwe. Generalisability, impact
and application of prediction models in clinical practice. 2007. In preparation.
[4] Y. Vergouwe, E. W. Steyerberg, R. S. Foster, J. D. F. Habbema, and J. P. Donohue.
Validation of a prediction model and its predictors for the histology of residual masses in
nonseminomatous testicular cancer. Journal of Urology, 165:84–88, 2001.
[5] J. Hermans, A. D. G. Krol, K. van Groningen, P. M. Kluin, J. C. Kluin-Nelemans, M. H. H.
Kramer, E. M. Noordijk, F. Ong, and P. W. Wijermans. International Prognostic Index
for aggressive non-Hodgkin's lymphoma is valid for all malignancy grades. Blood, 86:1460–
1463, 1995.
[6] V. Ficarra, G. Martignoni, C. Lohse, G. Novara, M. Pea, S. Cavalleri, and W. Artibani.
External validation of the Mayo Clinic stage, size, grade and necrosis (SSIGN) score to
predict cancer specific survival using a European series of conventional renal cell carcinoma.
Journal of Urology, 175:1235–1239, 2006.
[7] F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: issues in
developing models, evaluating assumptions and accuracy, and measuring and reducing
errors. Statistics in Medicine, 15:361–387, 1996.
[8] H. C. van Houwelingen. Validation, calibration, revision and combination of prognostic
survival models. Statistics in Medicine, 19:3401–3415, 2000.
[9] D. G. Altman and P. Royston. What do we mean by validating a prognostic model?
Statistics in Medicine, 19:453–473, 2000.
[10] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, F. L. Baehner, M. G. Walker,
D. Watson, T. Park, W. Hiller, E. R. Fisher, D. L. Wickerham, J. Bryant, and N. Wolmark.
A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer.
New England Journal of Medicine, 351:2817–2826, 2004.
[11] M. Buyse, S. Loi, L. van’t Veer, G. Viale, M. Delorenzi, A. M. Glas, M. Saghatchian
d’Assignies, J. Bergh, R. Lidereau, P. Ellis, A. Harris, J. Bogaerts, P. Therasse, A. Floore,
M. Amakrane, F. Piette, E. Rutgers, C. Sotiriou, F. Cardoso, and M. J. Piccart on behalf
of the TRANSBIG Consortium. Validation and clinical utility of a 70-gene prognostic
signature for women with node-negative breast cancer. Journal of the National Cancer
Institute, 98:1183–1192, 2006.
[12] A. Burton and D. G. Altman. Missing covariate data within cancer prognostic studies: a
review of current reporting and proposed guidelines. British Journal of Cancer, 91:4–8, 2004.
[13] P. Royston and M. K. B. Parmar. Flexible proportional-hazards and proportional-odds
models for censored survival data, with application to prognostic modelling and estimation
of treatment effects. Statistics in Medicine, 21:2175–2197, 2002.
[14] N. Younes and J. Lachin. Link-based models for survival data with interval and continuous
time censoring. Biometrics, 53:1199–1211, 1997.
[15] F. J. Aranda-Ordaz. On two families of transformations to additivity for binary response
data. Biometrika, 68:357–363, 1981.
[16] P. Royston and W. Sauerbrei. A new measure of prognostic separation in survival data.
Statistics in Medicine, 23:723–748, 2004.
[17] M. E. Miller and S. L. Hui. Validation techniques for logistic regression models. Statistics
in Medicine, 10:1213–1226, 1991.
[18] P. M. Grambsch and T. M. Therneau. Proportional hazards tests and diagnostics based
on weighted residuals. Biometrika, 81:515–526, 1994.
[19] W. Sauerbrei, P. Royston, and M. Look. A new proposal for multivariable modelling of
time-varying e¤ects in survival data based on fractional polynomial time-transformation.
Biometrical Journal, 49:453–473, 2007.
[20] J. F. Lawless. Statistical models and methods for lifetime data. Wiley, New York, 1982.
[21] I. Helland. On the interpretation and use of R2 in regression analysis. Biometrics, 43:61–69, 1987.
[22] E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher. Assessment and comparison of
prognostic classification schemes for survival data. Statistics in Medicine, 18:2529–2545, 1999.
[23] M. Schemper. The explained variation in proportional hazards regression. Biometrika,
77:216–218, 1990.
[24] P. Royston. Explained variation for survival models. Stata Journal, 6:83–96, 2006.
[25] W. Sauerbrei, G. Bastert, H. Bojar, C. Beyerle, R. L. A. Neumann, C. Schmoor, and
M. Schumacher for the German Breast Cancer Study Group. Randomized 2×2 trial
evaluating hormonal treatment and the duration of chemotherapy in node-positive breast
cancer patients: An update based on 10 years’ follow-up. Journal of Clinical Oncology,
49:453–473, 2000.
[26] W. Sauerbrei and P. Royston. Building multivariable prognostic and diagnostic models:
transformation of the predictors by using fractional polynomials. Journal of the Royal
Statistical Society (Series A), 162:71–94, 1999.
[27] P. Royston, M. K. B. Parmar, and R. Sylvester. Construction and validation of a prognostic
model across several studies, with an application in super…cial bladder cancer. Statistics
in Medicine, 23:907–926, 2004.
[28] P. Royston. The lognormal distribution as a model for survival time in cancer, with an
emphasis on prognostic factors. Statistica Neerlandica, 55:89–104, 2001.
[29] P. W. Dickman, A. Sloggett, M. Hills, and T. Hakulinen. Regression models for relative
survival. Statistics in Medicine, 23:51–64, 2004.
[30] P. Royston. Flexible parametric alternatives to the Cox model, and more. Stata Journal,
1:1–28, 2001.
[31] P. C. Lambert and P. Royston. Further development of flexible parametric models for
survival analysis. Stata Journal, 9:265–290, 2009.
[32] StataCorp. Stata Reference Manual, version 11. Stata Press, 2009.