External validation and updating of a prognostic survival model

Patrick Royston & Mahesh K B Parmar
Hub for Trials Methodology Research
MRC Clinical Trials Unit and University College London
222 Euston Road,
Douglas G. Altman
Centre for Statistics in Medicine, University of Oxford
17 March 2010
We propose the view that external validation of a prognostic survival model means evaluating its performance in an independent (secondary) dataset obtained separately from the data
in the original (primary) study, with a minimum of change or reconstruction of the model. The
primary and secondary datasets should consist of patients with a set of prognostically relevant
predictors in common, comparably defined time-to-event outcomes with similar follow-up times,
and the same clinical condition observed in similar settings. We suggest graphical and analytical techniques for evaluating the fit of a proposed model in the secondary dataset. The analyses
involve evaluating by how much minor changes to the model in the secondary dataset improve
the fit there. The ‘ideal’ model does not require any such modification. Other assessments
of performance we discuss include a goodness of fit test and a measure of explained variation.
We note that full validation of a model requires the model to provide a complete probabilistic
description of the data, sufficient to predict the survival probabilities at any relevant timepoint
and for any combination of values of the prognostic factors. The Cox proportional hazards
model was designed to estimate the effects of covariates on the hazard function, but not to
estimate or predict survival probabilities; by definition, it does not satisfy our full validation
requirement. Instead, we recommend use of adequately flexible parametric survival models,
both for estimation and validation. We focus on the families of models proposed by Royston &
Parmar. We show how the techniques may be used in the analysis of two datasets in cancer.
(Research report No. 307, Department of Statistical Science, University College London. Date: March 2010.)

Techniques for validation of regression models may be classified as internal or external. The key
element of internal validation is that only one dataset (the primary dataset) is used. Internal
validation always involves some form of data re-use. One older technique is to split the data
at random, develop a model on one portion and evaluate it on the other. More recent and
arguably more appropriate techniques include k-fold cross-validation and bootstrap resampling
[1]. While internal validation can throw light on possible fragilities (such as ‘over-optimism’) of
the model, it can never be an adequate substitute for validation on independent data (external
validation). Unvalidated models are likely to be “quickly forgotten” [2].
In external validation, two independent datasets (primary and secondary) are required.
Typically, the primary dataset is collected first and a prognostic model is derived from it.
Later, further relevant data are assembled to form the secondary dataset and the performance
of the model is evaluated. In this paper, we propose the view that external validation of a
prognostic survival model derived on the primary dataset means evaluating its performance in
the secondary dataset with a minimum of change or reconstruction of the model. Prerequisites
of such analyses are that the primary and secondary datasets comprise patients having a set of
prognostically relevant predictors in common, comparably defined time-to-event outcomes with
roughly similar follow-up times, and the same clinical condition observed in similar settings.
The acceptable degree of difference between the populations from which the samples were drawn
is a matter of debate [3]; for example, it may be unreasonable to expect a model derived on a
sample of adults to perform well in children. However, we do not address that issue here. The
populations and samples under study are taken as given.
External validation of prognostic models with a binary outcome (usually, logistic regression
models) has become quite a well-established practice in the medical literature. An example is
reference [4]. However, the task there is much simpler than for survival models, since predictions
are essentially estimates of the patient- or group-specific probability of the event of interest,
conditional on covariates, at a predefined, fixed point in time. Estimating and comparing
survival probabilities across the whole epoch since study inception are avoided. By contrast,
there is little statistical literature on formal techniques for external validation of models for time-to-event data. Authors of medical papers tend to confine themselves to creating a prediction
score or nomogram (a graphical version of the score) on the primary data (or using an existing
score), generating prognostic groups from the score, and plotting Kaplan-Meier survival curves
for the prognostic groups in the secondary data (for example, references [5], [6]). A measure
of discrimination (see below) of the score, such as Harrell’s concordance or c-index [7], may be
evaluated on the primary and secondary datasets, and compared. Sometimes, a significance
test, based on the logrank test or a Cox model, of the relative hazards among the prognostic
groups in the secondary dataset is performed. However, these approaches do not adequately
address the question of how well the model works in the secondary dataset, the main topic of
the present paper.
A prognostic survival model has a systematic component, typically several predictors relevant to the disease in question whose effects are combined linearly into a risk score or prognostic index, and a stochastic part describing the probabilistic element, almost always represented
through the (unspecified) hazard function in a Cox proportional hazards (PH) model. We agree
with van Houwelingen, who states “Generally speaking, validation of well-specified (quantitative) models is better defined than validation of vague (qualitative) models that only produce
subgroups.” [8] More precisely, we regard the characteristic aim of a survival model, which
typically operates with right-censored time-to-event data, to be to estimate or predict survival
probabilities at given times after the temporal origin at either the individual or the group
level, conditional on covariates or prognostic factors. The terms ‘covariates’, ‘predictors’ and
‘prognostic factors’ are used interchangeably in the present context. While in our validation
activities we largely focus on this aspect of model performance, we also look at global measures
of explained variation and goodness of fit.
The predictors used in the model for the primary dataset must somehow be selected from
a set of candidate variables. The process can be done in many ways (e.g. use prespeci…ed
predictors based on literature evidence and/or medical knowledge, use all available predictors,
select variables by criteria of statistical importance, etc.) We do not consider in any detail the
method of selecting predictors in the primary dataset. We concentrate on external validation,
interpreted as the model’s performance in a separate, secondary dataset.
The concept of validation appears at face value to be straightforward: either a model ‘validates’ or it does not. The dichotomy is deceptive. In practice, assessment of model performance
in new data does not give a simple ‘yes/no’ answer. Many performance measures are available,
and each of them is a value on a continuous scale. Subjective judgment of performance is
therefore always needed, guided by the clinical aims of the model [9]. Traditionally, performance
measures are classified in terms of ‘calibration’ or ‘discrimination’. Calibration relates to bias
in predictions, and discrimination to the ability of a model to distinguish between outcomes of
patients with putatively different risks of an event. However, the attributes are not necessarily independent. Further, measures of calibration or discrimination do not directly reflect the
goodness of fit of a model.
Validation is becoming an increasingly important issue in prognostic studies, as interest in
prognostic and ‘predictive’ molecular and genetic markers grows, e.g. references [10], [11]. (A
predictive marker is one that predicts response to treatment. For example, in primary breast
cancer, oestrogen receptor status of the tumour predicts whether a patient’s cancer responds
to the hormonal drug tamoxifen.) Practical guidance on the assessment of such markers is
however very limited.
The Cox proportional hazards model is a near-universal tool in the analysis of prognostic
models in clinical medicine [12]. However, it was not formulated with our concept of full external
validation in mind. The main issue is that a parametric estimate of the baseline survival function
is not available (note that ‘baseline’ refers to zero values of the covariates, not to time t = 0).
Prediction in new data of survival probabilities from a Cox model is therefore problematic.
Further, covariate e¤ects must conform to the proportional hazards assumption, which is quite
limiting, particularly in data with medium- or long-term follow-up. We also work in some detail
with three families of flexible parametric survival (‘Royston-Parmar’) models [13]. The three
families assume proportional (cumulative) hazards, proportional (cumulative) odds and probit-scale covariate effects, respectively. By the ‘scale’ of a model we mean the transformation of
the survival function on which covariate e¤ects are assumed linear. For example, covariates in
the PH model are assumed to act linearly on the log-minus-log transformation of the survival
function, referred to as hazards scaling. In Royston-Parmar models, the scale transformation
(or link function) of the baseline survival function is approximated parametrically by a restricted
cubic spline function (see section 2.2). In the simplest cases, the spline function reduces to a
linear function of log survival time. The models then simplify to the Weibull, log-logistic and
lognormal distributions, respectively, and so are seen as generalizations of these more familiar
survival models.
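As a quick numerical illustration of the hazards-scale special case (the parameter values below are arbitrary, not from the paper), the following sketch checks that a Weibull cumulative hazard H(t) = λ t^p yields a log cumulative hazard that is exactly linear in log time, which is the situation in which the spline reduces to a linear function:

```python
import math

def weibull_log_cum_hazard(t, lam, p):
    # Weibull: S(t) = exp(-lam * t**p), so H(t) = lam * t**p
    # and ln H(t) = ln(lam) + p * ln(t), i.e. linear in ln t.
    return math.log(lam * t ** p)

# The slope of ln H against ln t should equal the shape parameter p everywhere.
lam, p = 0.2, 1.5
pts = [(math.log(t), weibull_log_cum_hazard(t, lam, p)) for t in (0.5, 1.0, 2.0, 4.0)]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print(all(abs(s - p) < 1e-12 for s in slopes))  # True
```

The analogous linear special cases on the odds and probit scales give the log-logistic and lognormal distributions, respectively.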
The goal of the paper is to propose and demonstrate a practical approach to full validation
of a model for time-to-event data. To keep things relatively simple, we do not consider models
with time-dependent covariate effects, that is, either time-varying effects of static covariates or (a
much less common case) dynamic covariates whose values change over time. We aim to show
that flexible parametric modelling is necessary if validation of both the systematic and the
stochastic components of a model is required (as it should be). We propose a technique which
teases out three aspects of possible lack of fit of a model on a secondary dataset: the location
of the baseline survival function, its shape, and calibration of the linear predictor or prognostic
index.
The plan of the paper is as follows. Section 2 outlines the Cox and Royston-Parmar models.
Section 3 describes the methodology, discussing model assessment, goodness of …t, validation,
prediction of survival probabilities, graphical display of results and explained variation. Section
4 describes two cancer datasets and prognostic models used to predict time-to-event outcomes
in them. Section 5 presents analyses of the datasets using the performance measures and
approaches to validation described in section 3. Section 6 is a discussion.
Cox model
In a Cox model with covariate vector x and parameter vector β, the stochastic component is
the (hypothetical) baseline hazard function, h0(t) = h(t; 0), and the systematic part is the
prognostic index, η (eta) = xβ. The baseline hazard function is ‘hypothetical’ in the
sense of depending on how the covariates x are coded and/or centered; there is no guarantee or
necessity that any patient in the sample has x = 0. Estimation of β by maximizing the partial
log likelihood does not require an explicit formulation of h0(t), which is not itself estimated,
hence the description of the model as ‘semi-parametric’. See section 3.5 for the implications of
the semi-parametric nature of the Cox model for validation.
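To make the partial likelihood concrete, here is a minimal sketch (not from the paper) of the partial log likelihood for data without tied event times, with the risk set at each event time formed explicitly:

```python
import math

def cox_partial_loglik(times, events, eta):
    """Cox partial log likelihood given the prognostic index eta = x.beta for each
    subject. Assumes no tied event times; ties would need e.g. Breslow's adjustment."""
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i:  # only observed events contribute terms
            # risk set: subjects still under observation at t_i
            log_denom = math.log(sum(math.exp(eta[j])
                                     for j, t_j in enumerate(times) if t_j >= t_i))
            ll += eta[i] - log_denom
    return ll

# With all eta equal, each event contributes -log(risk-set size):
print(cox_partial_loglik([1.0, 2.0, 3.0], [1, 1, 0], [0.0, 0.0, 0.0]))  # -log(3) - log(2)
```

Note that the baseline hazard h0(t) appears nowhere in this computation, which is precisely why the fitted Cox model carries no transportable estimate of the baseline survival function.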
Hazards-scaled Royston-Parmar models may be seen as parametric extensions of the Cox
model that retain the stochastic/systematic decomposition. Writing the Cox model for the
hazard function h(t; x) as

h(t; x) = h0(t) exp(xβ),

and integrating with respect to time gives the equivalent model

H(t; x) = H0(t) exp(xβ),

a proportional cumulative hazards model. Royston-Parmar models include a parametric representation of H0(t), whereas in the Cox model H0(t) is left unspecified. See section 2.2 for a
general de…nition of Royston-Parmar models.
Royston-Parmar survival models
Royston & Parmar [13] described families of flexible parametric survival models resembling
generalized linear models with various link functions. The models rely on transformation of the
survival function by a link function g(·),

g[S(t; x)] = g[S0(t)] + xβ,    (1)

where S0(t) = S(t; 0) is the baseline survival function and β is a vector of parameters to be
estimated for covariate vectors x. In earlier work, Younes & Lachin [14] used the parameterised
link function g(z; θ) = ln[(z^(−θ) − 1)/θ] proposed by Aranda-Ordaz [15], in which θ = 1
corresponds to the proportional odds model and θ → 0 to the proportional hazards (PH) model.
Royston & Parmar [13] investigated these two models and also considered the probit model, for
which g(s) = −Φ^(−1)(s), where Φ^(−1)(s) is the inverse standard Normal distribution function.
By letting θ → 0 and exponentiating (1), proportional cumulative hazards models are
obtained, namely

H(t; x) = H0(t) exp(xβ),    (2)

where H(t; x) is the cumulative hazard function and H0(t) = H(t; 0) is the baseline cumulative
hazard function. This is also a PH model. Royston & Parmar [13] approximated ln H0(t) by
a restricted cubic spline function of log time with two ‘boundary’ knots and m interior knots,

ln H0(t) = γ0 + γ1 ln t + γ2 v1(ln t) + … + γm+1 vm(ln t),    (3)

where v(ln t) = [ln t, v1(ln t), …, vm(ln t)]′ is a vector of spline basis functions (see Royston &
Parmar [13] for details). Ignoring the constant term γ0, a spline with m knots has m + 1 degrees
of freedom (d.f.). Writing γ = (γ1, …, γm+1)′, the PH model obtained by substituting (3) for
ln H0(t) in (2) is fully specified parametrically and may be written

ln H(t; x) = γ0 + γ′v(ln t) + xβ.    (4)
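The spline basis and the survival function implied by model (4) can be sketched as follows (an illustrative simplification, not the authors' software; the knot values and parameters are arbitrary, and vj uses the restricted cubic spline basis form of Royston & Parmar):

```python
import math

def rcs_basis(x, knots):
    """Restricted cubic spline basis [x, v1(x), ..., vm(x)] with boundary knots
    knots[0], knots[-1] and interior knots knots[1:-1] (Royston-Parmar form)."""
    kmin, kmax = knots[0], knots[-1]
    cube = lambda u: max(u, 0.0) ** 3
    basis = [x]
    for kj in knots[1:-1]:
        lam = (kmax - kj) / (kmax - kmin)
        basis.append(cube(x - kj) - lam * cube(x - kmin) - (1 - lam) * cube(x - kmax))
    return basis

def predicted_survival(t, eta, gamma0, gamma, knots):
    """Model (4): ln H(t; x) = gamma0 + gamma'v(ln t) + eta, hence S = exp(-H)."""
    v = rcs_basis(math.log(t), knots)
    ln_H = gamma0 + sum(g * vj for g, vj in zip(gamma, v)) + eta
    return math.exp(-math.exp(ln_H))

# With no interior knots the spline is linear in ln t (a Weibull model):
knots = [math.log(0.1), math.log(10.0)]
print(predicted_survival(1.0, 0.0, math.log(0.2), [1.5], knots))  # exp(-0.2) ~ 0.819
```

With interior knots present, the same two functions give the fully parametric baseline that the Cox model lacks.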
If we write the prognostic index as η = xβ then model (4) may be extended slightly or ‘recalibrated’ as

ln H(t; x) = γ0 + α0 + γ′v(ln t) + θη.    (5)

Since the term γ′v(ln t) in (5) has not changed, the shape of the baseline cumulative hazard
function is unaltered, but its location changes through α0 + θη when α0 ≠ 0 or θ ≠ 1.
Estimation of θ is relevant in validation studies where recalibration is of interest (see reference
[8], for example). Regression on the prognostic index in the primary (derivation) data yields
θ = 1. In the secondary (validation) data, an estimate of θ much less than 1 may indicate a
model with poor performance. If the estimate is 0, the model is probably useless.
Preliminary investigation
As a first step before any modelling starts, it is worthwhile plotting the overall Kaplan-Meier
survival curves by dataset. If survival varies noticeably across the datasets, inaccurate prediction of survival probabilities from the primary to the secondary datasets may result. If
differences in the shapes of the survival curves are present, systematic differences in the study
populations may be indicated.

Among possible exploratory analyses, it is also worth comparing the distributions of the
covariates between the primary and secondary datasets. Substantial differences here may also
indicate important differences in the study populations.
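The Kaplan-Meier curves would normally be produced with standard statistical software; purely for illustration, a self-contained product-limit estimator (a sketch, not the authors' code) might look like:

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate; returns (time, S(t)) at each event time.
    events[i] = 1 for an observed event, 0 for a censored observation."""
    data = sorted(zip(times, events))
    n = len(data)
    s, at_risk, curve = 1.0, n, []
    i = 0
    while i < n:
        t = data[i][0]
        deaths = removed = 0
        while i < n and data[i][0] == t:  # pool tied times at t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            s *= 1.0 - deaths / at_risk
            curve.append((t, s))
        at_risk -= removed
    return curve

print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 0]))  # [(1, 0.75), (3, 0.375)]
```

Plotting one such curve per dataset on common axes gives the comparison suggested above.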
Components of parametric survival models
The prognostic index, η, is of central importance in validation and assessment of model performance. The spread or variance of η across patients indicates the degree of discrimination,
predictive ability or variation explained by the model [16]. A large variance suggests that a
model has a (relatively) good ability to discriminate between patients who have different risks
of an event. The role of the prognostic index has been widely recognised, but that of the
baseline survival function, S0(t), has often been ignored, presumably because it is not part of
the Cox model. However, S0(t) is arguably of equal importance, since it determines the pattern
of absolute survival probabilities over time, whereas the prognostic index expresses the (log)
relative risk (hazard, etc.) of one individual in the sample relative to another, or relative to a
hypothetical baseline individual for whom all covariate values are zero. An appraisal of a survival model is incomplete without considering both components. For a model to be clinically
useful, it must estimate actual probabilities.
We present an approach to assessing model performance in independent data that takes into
account both the prognostic index and the baseline distribution function. In the proportional
cumulative hazards model (4), the baseline distribution function is the baseline log cumulative
hazard function. Two aspects of the baseline distribution function, represented by the parameters γ0 and γ in (4), may be distinguished. The first, γ0, is a type of intercept and indicates the
position in ‘log relative risk space’ of the baseline distribution function. A higher value of γ0
indicates generally higher risk and hence lower survival. As already mentioned, the parameter
vector γ controls the shape of the baseline distribution function and hence of the predicted survival and hazard functions. The prognostic index raises (when η > 0) or lowers (when η < 0)
the risk for an individual relative to γ0. In a validation study, interest lies in whether values
of γ0, γ or β from the primary dataset still hold good in the secondary dataset. Sometimes a
simple recalibration of η markedly improves model fit in the secondary dataset.
Model validation and updating
In the secondary dataset, the prognostic index is calculated using values of the parameter
vector β estimated on the primary dataset and applied to the covariate values of the secondary
dataset. The components of β are not re-estimated on the secondary dataset. Similarly, the
baseline survival function is calculated in the secondary dataset using the estimates of the
parameters (γ0, γ) obtained on the primary dataset, together with the (log) time values in the secondary
dataset and the set of spline knots used in the primary dataset.
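The key operation here, applying the primary-data coefficient estimates unchanged to secondary-data covariates, is simply a linear combination (a sketch; the variable names are illustrative):

```python
def prognostic_index(X, beta_hat):
    """eta_i = x_i . beta_hat for each secondary-dataset patient, with beta_hat
    estimated on the primary dataset and NOT re-estimated on the secondary one."""
    return [sum(xj * bj for xj, bj in zip(row, beta_hat)) for row in X]

# Two hypothetical patients, two covariates:
print(prognostic_index([[1.0, 2.0], [0.0, 1.0]], [0.5, -1.0]))  # [-1.5, -1.0]
```

The resulting η values, combined with the primary-data baseline parameters and knots, give the Case 1 predicted survival probabilities discussed next.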
It is possible that the predicted survival probabilities, derived from the primary-data estimates
of γ0, γ and β, are sufficiently accurate in the secondary dataset to remove the need to ‘tune’ any of the model
parameters. This is the ideal scenario regarding validation; we refer to it as Case 1. We now
describe six additional cases, which comprise a structured, ordered set of models allowing one
to assess and if necessary improve the fit of the original model in the secondary dataset. The
adjustments provide a route to updating a given model in a simple way, thus avoiding complete
reconstruction (see van Houwelingen [8] for a related approach). Of course, in some situations
reconstruction may be unavoidable. In any case, an updated model should be re-validated in
further independent data.
‘Case 1’ (see section 3.2) is the situation that the parameters of model (4) estimated on the
primary dataset also provide an adequate fit to the secondary dataset. Going beyond that, we
consider four general and parsimonious ways to update the model on the primary dataset using
data from the secondary dataset. These are summarized in Table 1. [TABLE 1 NEAR HERE]
The need for such updating, if detected, indicates different types of lack of fit of the original
model.

                               θ = 1 (not estimated)   θ estimated
  γ0, γ not re-estimated       Case 1                  Case 4
  γ0 re-estimated              Case 2a                 Case 5a
  γ re-estimated               Case 2b                 [Case 5b]*
  γ0 and γ re-estimated        Case 3                  Case 6

Table 1: Definition of Cases 1-6 in structured updating of a flexible parametric survival model
on a secondary dataset. *Infeasible (see text).
1. Re-estimate γ0 (Case 2a). Altering γ0 moves the entire baseline distribution function
up or down. The shift is sometimes referred to as ‘recalibration in the large’ [1]. A
situation in which γ0 changes is where the susceptibility of the patients to an event differs
systematically between the primary and secondary datasets.

2. Re-estimate γ (but not γ0) (Case 2b). This is a more radical revision, changing
the shape of the baseline distribution, but not its general level.

3. Re-estimate γ0 and γ (Case 3). Similar to Case 2b, except that the general level is also
re-estimated.

4. Regarding η as a continuous covariate with a linear effect in the secondary dataset, estimate its slope, θ (Cases 4, 5 and 6). This idea, which was proposed for the logistic
model by Miller [17], posits that the effect of η is linear but may be mis-calibrated (i.e.
θ ≠ 1). Since the estimate of θ is often < 1, θ is sometimes described as a shrinkage factor.

Table 1 shows how Cases 1-6 relate to the types of model revision listed above. The reason
why there is no Case 5b is that estimating θ and re-estimating γ with γ0 fixed is infeasible:
multiplying η by θ affects the value of γ0.
A nice feature of this schema is that almost all of the pairs of models moving down each
column and across each row of Table 1 are nested. The exception is that Case 2a is not nested
in Case 2b, although Cases 2a and 2b nest Case 1 and are nested in Case 3. The deviances
(minus twice the maximised log likelihood) of the 7 models may be tabulated and compared.
The d.f. of the 7 models corresponding to Cases 1-6 are 0, 1, 1 + m, 2 + m, 1, 2, 3 + m,
respectively. Likelihood ratio tests based on 2 approximations may be performed to compare
pairs of nested models.
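For instance, comparing Case 3 with Case 6, which differ only by the single parameter θ, is a 1-d.f. likelihood ratio test, and the chi-squared tail probability on 1 d.f. has the closed form erfc(sqrt(x/2)). A sketch with hypothetical deviance values:

```python
import math

def lr_test_1df(deviance_restricted, deviance_full):
    """P-value of a likelihood ratio test on 1 d.f.
    For X ~ chi-squared(1), P(X > x) = erfc(sqrt(x/2))."""
    x = deviance_restricted - deviance_full
    return math.erfc(math.sqrt(x / 2.0))

# A deviance reduction of 3.84 on 1 d.f. corresponds to p close to 0.05:
print(round(lr_test_1df(503.84, 500.00), 3))  # 0.05
```

Comparisons involving the spline coefficients (d.f. of m + 1 or more) would use the chi-squared distribution with the appropriate degrees of freedom in the same way.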
These simple adjustments to the original model may produce major or minor changes in the
goodness of fit (deviance). Such information may guide the interpretation of the results and,
where appropriate, the choice of a ‘final’ updated model for the secondary dataset.
Goodness of fit
Attention should also be paid to goodness of fit of the model in the primary and secondary
datasets. A common cause of poor fit of a survival model is failure of the proportionality
assumption (e.g. non-proportional hazards). A simple ‘global’ or ‘average’ assessment of non-proportionality of the prognostic index on the chosen scale of a Royston-Parmar model is to
test the interaction between the spline basis functions and the prognostic index. Such tests are
discussed in principle in section 4.1 of Royston & Parmar [13].

The first step is to estimate η̂ on the primary dataset using model (4). In the next step, (4)
is extended to include the interaction between η̂ and the spline basis functions v(ln t), giving
(for cumulative hazards scaling) the following model:

ln H(t; x) = γ0 + γ′v(ln t) + θη̂ + interaction,    (6)

where interaction = δ1 η̂ ln t + δ2 η̂ v1(ln t) + … + δm+1 η̂ vm(ln t). The interaction induces non-proportionality of cumulative hazards by allowing the shape of the distribution function to vary
with η̂. All the parameters in (6) are estimated by maximum likelihood in the primary dataset.
To test for lack of fit, the difference in deviance between (4) and (6) is compared with a χ²
distribution on m + 1 d.f.
Given η̂ from the primary dataset (now evaluated on the secondary dataset), the same
approach to testing an interaction is applied in the secondary dataset. Since the parameters of
(6) are re-estimated in the secondary dataset, first without and then with the interaction term,
the two models that are compared resemble Case 6 (see Table 1). Thus there is recalibration
of the prognostic index and re-estimation of the distribution function. The goodness of fit
assessment in the secondary dataset, therefore, is of a revised model.
Since categorisation of the prognostic index is avoided, the approach is parsimonious, which
should confer good power. However, the properties of the test await detailed evaluation.
In the Cox model, a test of non-proportional hazards in the same spirit may be performed
by applying Grambsch-Therneau tests [18] of standardized Schoenfeld residuals for η in the
primary and secondary datasets. Other possibilities are described by Sauerbrei et al [19].
Validation of a Cox model
Principles similar to those just described for parametric survival models may be applied to
validating a Cox PH model, but with restrictions. As already mentioned, the Cox model does
not include a transportable estimate of the baseline survival function, making prediction of
the latter from the primary to the secondary dataset infeasible. Two updating options remain
from the list given in Table 1, equivalent to Cases 3 and 6 in section 3.3: either θ = 1 (Case
3, no recalibration) or θ is estimated (Case 6, recalibration is performed). In each case, the
(possibly recalibrated) prognostic index is applied to the baseline survival function, estimated
by a standard method (such as Lawless [20]) at η = 0 in the secondary dataset. To test the
hypothesis θ = 1, i.e. to assess the need for recalibration, the deviance difference between
Cases 3 and 6, based on the partial likelihood, may be compared with a χ² distribution on 1
d.f. Further investigation of possible model weaknesses of the type available with a parametric
model cannot be done with the Cox model.
An alternative approach to validating a Cox model is to regard a Royston-Parmar PH model
as a Cox model with a parametric baseline distribution function. Validation is done exactly as
described in section 3.3. The steps of the analysis are
1. Fit the Cox model to the primary data and estimate the prognostic index, η;

2. In Royston-Parmar PH models for the primary data with η as an offset, search for a parsimonious spline function for the baseline cumulative hazard function;

3. Apply the methods of section 3.3 to the secondary dataset.

In fact, since Cox and Royston-Parmar PH models yield almost identical parameter estimates [13], one could dispense with fitting the Cox model and regard validating a Royston-Parmar model as equivalent to validating a Cox model with the same covariates.

Assessment of the discrimination of η in a secondary dataset is straightforward, and is
discussed in section 3.7.
Prediction of survival probabilities and graphical display
A cardinal strength of parametric survival models is their ability to predict survival probabilities
for individuals and groups in the primary and secondary datasets. Survival probabilities in
groups may be estimated by averaging the survival curves for individuals at each of the
observed failure and censoring times. The group survival probabilities may then be compared
with Kaplan-Meier curves.
Risk groups must first be created by categorising the (continuous) prognostic index, η.
Clearly many such schemes are possible. For the purpose of validation (but not necessarily
more generally), we suggest forming three groups using cutpoints at the first quartile and third
quartile of the distribution of η across the individuals in the primary dataset. (If validation
of more extreme predicted risks is important, the cutpoints can be adjusted accordingly.) The
predicted survival curves are averaged over all individuals in a given group at the observed
failure and censoring times. Predicted and observed (Kaplan-Meier) curves may be compared
graphically. As seen in the examples to come, the resulting graph, with all groups displayed
in the same plot, indicates the predictive performance of the model over the entire time range.
We strongly recommend producing such plots, which with parametric models may be done for
Cases 1-6.
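The grouping and averaging steps above can be sketched as follows (illustrative code, not the authors' implementation; q1 and q3 are the prognostic-index quartiles taken from the primary dataset):

```python
def mean_survival_by_group(eta, surv_curves, q1, q3):
    """Average individual predicted survival curves within three risk groups formed
    by cutting eta at q1 and q3; surv_curves[i] is patient i's predicted curve
    evaluated on a common grid of times."""
    groups = {"low risk": [], "mid risk": [], "high risk": []}
    for e, curve in zip(eta, surv_curves):
        key = "low risk" if e < q1 else ("high risk" if e > q3 else "mid risk")
        groups[key].append(curve)
    # pointwise mean over the patients in each non-empty group
    return {k: [sum(col) / len(col) for col in zip(*cs)]
            for k, cs in groups.items() if cs}

curves = [[1.0, 0.9], [1.0, 0.5], [1.0, 0.1]]
print(mean_survival_by_group([-2.0, 0.0, 2.0], curves, -1.0, 1.0))
```

Each averaged curve would then be overlaid on the Kaplan-Meier curve of the corresponding group for the graphical comparison described above.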
For the Cox model, similar plots may be produced, but as discussed in section 3.5, they are
limited to Cases 3 and 6. The predicted survival probabilities are again obtained by averaging
individual survival curves in risk groups. The predicted survival curve for an individual with
prognostic index η is S0(t)^exp(η), where S0(t) is the baseline survival distribution estimated in
the secondary dataset.
Similar graphical comparisons of observed and predicted survival probabilities may also be
produced for the model for the primary dataset. They are quite helpful in assessing the accuracy
of the predicted survival probabilities over time. Comparison of the plots between the primary
and secondary datasets is also informative. For example, the plots may reveal lack of model fit
on the primary or secondary dataset (or both), or that model fit is worse (or indeed better) in
the secondary than the primary dataset.
Measure of explained variation suitable for validation studies
The explained variation statistic R² for the linear regression model y = η + ε, with residual
variance var(ε) = σ², may be written as

R² = var(η) / (σ² + var(η)),    (7)

where var(η) is the variance of the distribution of η across the sample (see eqn. (9) of
Helland [21]). Several authors have proposed versions of explained variation statistics for use
with censored survival data. These measures have not necessarily been extensions of the linear
regression case. For example, Graf et al [22] based their proposed measure on the Brier score
for survival data (a measure of predictive accuracy at the individual level), while others (e.g.
Schemper [23]) have worked with the survival curves estimated from a model.

Royston & Sauerbrei [16] proposed a measure called R²_D, a transformation of their D
measure of discrimination, for use with the Cox model and other fairly general models for
censored survival data. R²_D is defined analogously to R² in (7) as

R²_D = (D²/κ²) / (σ² + D²/κ²),    (8)

where κ² = 8/π and D is the slope of the regression of the survival outcome on scaled Normal
order statistics corresponding to η. See Royston & Sauerbrei [16] for justification and for
explanation of the motivation, interpretation and properties of D. Under the assumption of
normality of the distribution of η across the sample, D²/κ² is an estimate of var(η) that is
nevertheless reasonably robust to departures from normality of η. Furthermore, D and hence
R²_D are insensitive to outliers in η when the underlying distribution of η is normal but possibly
contaminated by a small number of extreme values.

For present purposes, R²_D has the advantages that it was designed for use with Cox models
and Royston-Parmar models of all three types (hazard, odds and probit scaling), and that it is
suitable for assessing explained variation in a secondary dataset (albeit in a ‘Case 6’ scenario,
since the outcome must be regressed on η in the secondary dataset). Although D changes
according to the scaling, the scaling is allowed for in R²_D by altering the value of σ². As a
result, R²_D does not usually change much across different scalings. See Appendix A of Royston
& Sauerbrei [16] for further details, including the values of σ² that are used in (8) with each
type of model. Stata software to compute D and R²_D is available [24].
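Equation (8) is straightforward to evaluate once D is in hand. A sketch (the per-scale σ² values should be taken from Appendix A of Royston & Sauerbrei [16]; π²/6 used below as the hazards-scale value is our assumption):

```python
import math

def rsq_d(D, sigma2):
    """R2_D = (D^2/kappa^2) / (sigma2 + D^2/kappa^2), with kappa^2 = 8/pi.
    sigma2 depends on the model scale (see Royston & Sauerbrei, Appendix A)."""
    kappa2 = 8.0 / math.pi
    v = D * D / kappa2  # estimates var(eta) when eta is approximately Normal
    return v / (sigma2 + v)

print(rsq_d(0.0, math.pi ** 2 / 6))   # 0.0: no discrimination, no explained variation
print(rsq_d(1.0, math.pi ** 2 / 6))   # strictly between 0 and 1, increasing in |D|
```

The statistic lies in [0, 1) and is monotone in |D|, so models can be ranked on it within a given scaling.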
Example datasets and models
Breast cancer
From July 1984 to December 1989, the German Breast Cancer Study Group recruited 720
patients with primary node positive breast cancer into a 2 × 2 trial investigating the effectiveness
of three versus six cycles of chemotherapy and of additional hormonal treatment with tamoxifen.
We analysed the recurrence-free survival time of the 686 patients who had complete data for the
prognostic factors age (x1), menopausal status (x2), tumour size (x3), tumour grade (2 dummy
variables x4a, x4b), number of positive lymph nodes (x5), progesterone (x6) and oestrogen (x7)
receptor status. With updated follow-up [25], 398 patients had had an event recorded. This is
the primary dataset. The secondary dataset is from a trial of 303 patients (158 events) with
identical inclusion criteria, procedures and recorded prognostic factors, but studying a di¤erent
intervention. For comparability, we truncated follow-up time at 9 years in both datasets, thus
censoring 17 (3.1%) of the total of 556 events.
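Truncating follow-up at a common horizon amounts to administrative censoring: observations beyond the cutoff are censored at the cutoff. A sketch (using a (time, event) tuple convention with event = 1 for an observed event):

```python
def truncate_followup(time, event, cutoff):
    """Administratively censor at `cutoff`: observations beyond the cutoff are
    censored at the cutoff, so later events no longer count as events."""
    if time > cutoff:
        return cutoff, 0
    return time, event

print(truncate_followup(10.2, 1, 9.0))  # (9.0, 0): event after 9 years becomes censored
print(truncate_followup(5.0, 1, 9.0))   # (5.0, 1): unchanged
```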
All Cox models for the primary dataset we investigated had significant non-proportional
hazards (see also section 5.1). Using multivariable fractional polynomial (MFP) methodology
[26], which combines backward elimination of weakly influential predictors with selection of fractional polynomial functions for continuous predictors, we developed a Royston-Parmar model
with probit scaling and m = 1 knot (see section 2.2). It included the following predictors: x1^(-1),
x1^(-0.5), x3, x4a, log x5, log(x6+1). This model was validated (i.e. evaluated) on the secondary
dataset. We also evaluated the Cox model III reported in Table 4 of reference [26].
Figure 1 shows Kaplan-Meier curves for the two studies (datasets). [FIGURE 1 NEAR
HERE] The survival curves are extremely similar up to about 3 years and not much different
after that.
Figure 1: Breast cancer datasets. Kaplan-Meier curves for recurrence-free survival by dataset
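Curves like those in Figure 1 can be reproduced from raw (time, event) data with the product-limit estimator. A minimal, illustrative Python sketch (function name ours; ties at an event time are handled by grouping):

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of S(t).

    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns a list of (event time, estimated survival probability).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = at_t = 0
        while i < len(data) and data[i][0] == t:  # group tied times
            d += data[i][1]
            at_t += 1
            i += 1
        if d > 0:  # survival drops only at event times
            s *= 1 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= at_t
    return curve

# toy data: events at t = 1, 3, 4; censoring at t = 2, 5
print(kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]))
```

Plotting one such curve per dataset, truncated at a common follow-up time, gives the comparison shown in the figure.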
Bladder cancer
Over the past two decades, the European Organisation for the Research and Treatment of
Cancer (EORTC) and the UK Medical Research Council (MRC) Working Party on Super…cial
Bladder Cancer have performed several phase III randomised controlled trials for the prophylactic treatment of stage T1 /Ta super…cial bladder cancer following transurethral resection.
Further clinical details are given by Royston et al [27] and references therein. Royston et al
[27] developed a Cox model for the progression-free survival of the 3441 patients (721 events)
in all 9 trials, adjusting for study di¤erences. We combined the seven EORTC trials to form
the primary dataset and the two MRC trials to constitute the secondary dataset. Since the aim
of the analysis was to illustrate the methodology of validation, the small overall di¤erences in
progression-free survival between the studies within each combined dataset were ignored. We
truncated the follow-up at 10 years, thus censoring 21 (2:9%) events overall.We took the prognostic model (10 terms) described in Table 3 of reference [27], …tted it to the primary dataset
and evaluated it on the secondary dataset. A Royston-Parmar proportional cumulative hazards
model with m = 1 (i.e. one interior knot) and a Cox model were used.
Figure 2 shows Kaplan-Meier curves for the two sets of trials (datasets). [FIGURE 2 NEAR
HERE] The survival curves differ in shape, with a much higher hazard in the early follow-up
period in the primary dataset.
Figure 2: Bladder cancer data. Kaplan-Meier curves for progression-free survival by dataset
Notice also the high attrition rates in each dataset, with many
patients being lost to follow-up at all stages of each study. Attrition has the potential to induce
bias if it is not at random with respect to the survival outcome.
Breast cancer example
Column (a) of Table 2 shows the results of the validation analysis for the breast cancer datasets.
[TABLE 2 NEAR HERE] Note that the Royston-Parmar models had probit link functions,
whereas the Cox model assumed proportional hazards.
The Royston-Parmar models corresponding to Cases 1-6 all fit about equally well in the
secondary dataset, suggesting that the model estimated on the primary dataset may be used
to predict in the secondary dataset without the need to modify the baseline distribution or the
prognostic index (Case 1). This is the benchmark or ideal validation scenario. The goodness of
fit test for the prognostic index of the probit model is not significant at the 5% level in either the
primary or the secondary dataset. Figure 3 compares the observed and predicted recurrence-free
survival probabilities for the primary data and Cases 1-6. [FIGURE 3 NEAR HERE]

                                    (a) Breast cancer    (b) Bladder cancer
Royston-Parmar model
  Calibration slope b (SE)            0.99 (0.13)          1.26 (0.13)
  R²_D (SE), primary data             0.23 (0.03)          0.23 (0.02)
  R²_D (SE), secondary data           0.22 (0.05)          0.30 (0.05)
Cox model (Cases 3 and 6)
  Calibration slope b (SE)            0.95 (0.14)          1.27 (0.14)
  R²_D (SE), primary data             0.20 (0.03)          0.23 (0.02)
  R²_D (SE), secondary data           0.17 (0.04)          0.31 (0.05)

Table 2: Model deviances, calibration constants, goodness of fit and explained variation for
Cases 1-6 for (a) breast and (b) bladder cancer datasets. Probit and proportional-hazards
Royston-Parmar models were fitted to the breast and bladder cancer datasets, respectively.
GOF denotes goodness of fit. Deviances for Cases 1-6 are based on maximized likelihoods;
deviances for Cases 3 and 6 are based on maximized partial likelihoods.

Although
Figure 3: Breast cancer data. Plots of observed and predicted recurrence-free survival probabilities from a Royston-Parmar probit model in 3 prognostic groups for the primary data and
Cases 1-6 for the secondary data (see text for details). Dashed lines, predicted probabilities;
solid lines and shaded areas, Kaplan-Meier estimates with their 95% pointwise confidence intervals. Note that CIs for the central group sometimes overlap and obscure those for the outer
groups.
the predicted survival probabilities are not a perfect fit in the poor prognosis group in the
secondary dataset, the number of events beyond 4 years is very small. The results for Case 6 (4
additional parameters fitted) are similar to those for Case 1 (no additional parameters fitted).
The R²_D values are similar between the two datasets. The value of b is near 1.
The results from the Cox model are similar to those from the Royston-Parmar probit model
(see Table 2). However, the PH assumption for the prognostic index is violated in the primary
dataset (P = 0.004, Grambsch-Therneau test). Figure 4 shows the observed and predicted
survival probabilities. [FIGURE 4 NEAR HERE] Slight lack of fit is visible in the extreme
groups, the pattern being similar between the primary and secondary datasets.
Finally, as described in section 3.5, we validated a Cox model with the same covariates as the
probit model through the 'equivalent' Royston-Parmar PH model with 2 df (one interior knot).
Figure 4: Breast cancer data. Plots of observed and predicted recurrence-free survival probabilities from a Cox model in 3 prognostic groups for the primary dataset and for Cases 3 and 6
in the secondary dataset (see text for details). Dashed lines, predicted probabilities; solid lines
and shaded areas, Kaplan-Meier estimates with their 95% pointwise confidence intervals.
The comparison of Cases 1-6 was essentially as given in Table 2, except that the deviances of
all the models were around 660, compared with around 650 for the probit models in Table 2.
This shows that the probit models are a rather better fit.
Bladder cancer example
Column (b) of Table 2 shows the results for Cases 1-6 for the bladder cancer datasets and
Cases 3 and 6 for the Cox model. The pattern of deviance differences suggests that estimating
both the baseline intercept and the baseline spline parameters is required in the Royston-Parmar
model, pointing to a difference in shape and position of the baseline distributions between the
datasets. The fit of the Case 3 model is not significantly worse than that of Case 6, although
Case 6 is marginally better (P = 0.054). The calibration slope in Case 6 is 1.26 (SE 0.13).
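The calibration slope b is the coefficient obtained when the prognostic index, computed from the primary-data model, is entered as the sole covariate in a survival model refitted on the secondary data. A minimal sketch of that refit for the Cox case (Breslow partial likelihood, untied event times assumed, damped Newton-Raphson; function name ours, not a published implementation):

```python
import math

def calibration_slope(times, events, pi, iters=25):
    """Estimate the calibration slope b by fitting a one-covariate Cox model
    with the prognostic index pi as the only predictor.

    b near 1 indicates a well-calibrated prognostic index in the new data.
    """
    order = sorted(range(len(times)), key=lambda i: -times[i])  # descending time
    t = [times[i] for i in order]
    d = [events[i] for i in order]
    x = [pi[i] for i in order]
    b = 0.0
    for _ in range(iters):
        score = info = 0.0
        s0 = s1 = s2 = 0.0  # running risk-set sums of w, w*x, w*x^2
        for i in range(len(t)):
            w = math.exp(b * x[i])
            s0 += w
            s1 += w * x[i]
            s2 += w * x[i] * x[i]
            if d[i]:  # contribution of each event to score and information
                mean = s1 / s0
                score += x[i] - mean
                info += s2 / s0 - mean * mean
        step = score / info
        step = max(-0.5, min(0.5, step))  # damp the Newton step for stability
        b += step
    return b
```

In practice the refit would use the same model family as the original (e.g. a Royston-Parmar probit model for the breast cancer example); the Cox version above conveys the idea with the least machinery.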
Figure 5 shows the observed and predicted survival probabilities in 3 prognostic groups
for the primary data and Cases 1-6 for the secondary data. [FIGURE 5 NEAR HERE] The
Figure 5: Bladder cancer data. Plots of observed and predicted progression-free survival probabilities from a Royston-Parmar proportional hazards model in 3 prognostic groups for the
primary data and Cases 1-6 for the secondary data (see text for details). Dashed lines, predicted probabilities; solid lines and shaded areas, Kaplan-Meier estimates with 95% pointwise
confidence intervals. Note that the vertical axes begin at 0.4.
observed (Kaplan-Meier) and predicted probabilities agree well in the primary data but rather
poorly in all cases in the secondary data. The goodness of fit tests align with this observation,
the P-values being 0.4 and 0.0009 in the primary and secondary data, respectively (Table 2).
Explained variation (R²_D) is actually higher in the secondary data (0.31 vs. 0.23).
The prognostic index was split at its 25th and 75th centiles in the primary dataset. In
the secondary dataset, the risk groups defined using the same cutpoints contain approximately
39/48/13 percent of patients. The case mix is different, with about half as many poor-prognosis
patients (see also Figure 2).
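The grouping just described is a simple percentile cut: the cutpoints are estimated once in the primary data and then applied unchanged to the secondary data. A sketch (helper names ours):

```python
def centile_cuts(values, centiles=(25, 75)):
    """Empirical centiles by linear interpolation between order statistics."""
    v = sorted(values)
    cuts = []
    for c in centiles:
        pos = (len(v) - 1) * c / 100
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(v) - 1)
        cuts.append(v[lo] * (1 - frac) + v[hi] * frac)
    return cuts

def risk_groups(pi, cuts):
    """Assign each prognostic index value to a group:
    0 = good, 1 = intermediate, 2 = poor prognosis."""
    return [sum(p > c for c in cuts) for p in pi]

# illustrative values only: cutpoints from the primary data are
# applied unchanged to the secondary data, so a shifted case mix
# shows up as unequal group proportions there
primary_pi = [0.1, 0.4, 0.5, 0.6, 0.9, 1.2, 1.5, 2.0]
cuts = centile_cuts(primary_pi)
secondary_pi = [0.2, 0.55, 1.0, 1.8]
print(risk_groups(secondary_pi, cuts))
```

By construction the primary-data groups contain roughly 25/50/25 percent of patients; departures from these proportions in the secondary data (here, 39/48/13) directly measure the difference in case mix.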
Results for the Cox model are also given in Table 2. Where available, they mirror closely
the results for the Royston-Parmar model. Figure 6 shows the observed and predicted survival
probabilities. [FIGURE 6 NEAR HERE] Apart from the absence of smoothness of the predicted
Figure 6: Bladder cancer data. Plots of observed and predicted progression-free survival probabilities from a Cox model. For details, see Figure 4.
survival functions, the results are similar to those in Figure 5.
The main conclusion from this example is that the prognostic index of the prognostic model
validates well, but that the baseline survival distributions are substantially different between the
primary and secondary datasets. A possible explanation could be that the patient populations
in the two trial datasets were somewhat dissimilar. The MRC patients had not yet had any
recurrence of their cancer, whereas a substantial proportion of the EORTC patients had had
recurrences. This may explain the higher hazard of early disease progression seen among the
EORTC patients. However, the shape of the survival distribution in the EORTC (primary)
patients is similar between patients who had and had not experienced a recurrence (data not shown).
In our view, validation is not about refitting the predictors in a postulated model on the
secondary dataset and comparing the estimated β's, as some have done. It is about predicting
relevant quantities (such as survival distributions) from a model fully specified on the primary
dataset to a secondary dataset (our Case 1), and examining its accuracy. Only parametric
models support this approach for time-to-event data. Since the Cox model does not estimate
the baseline distribution, it cannot be fully validated. To throw further light on possible lack
of fit, we suggest modifying the model in minor ways corresponding to our Cases 2a-6. Tests
related to these models may point to differences in model fit between the primary and secondary
datasets. We describe the differences in terms of the parameters of the baseline distribution
and of the prognostic index. It is also
informative to apply a simple goodness of fit test; we suggest a proportionality test based on
the prognostic index. A model for which Case 1 fitted as well as Cases 2a-6 and which passed
the goodness of fit test would validate as well as it possibly could, within the limitations of our
method. That is so with the breast cancer example, but not with the bladder cancer example.
Most prognostic survival models in the (medical) literature are reported as estimated β's with SEs or CIs
from a Cox model. Kaplan-Meier survival curves in groups obtained from the prognostic index
are sometimes presented, but no attempt is made to report formal estimates of the baseline
survival distribution. We propose that the baseline survival distribution and the prognostic
index are equally important components of a survival model. Without the baseline distribution,
model specification is incomplete, so both should be estimated and reported. Many parametric
survival models are available (e.g. exponential, Weibull, etc.), but they are little used in clinical
research. Presumably the reason is lack of flexibility, and the concern that the fit may be
poor. Due to the incorporation of spline modelling of the baseline cumulative distribution, the
Royston-Parmar family can successfully model arbitrarily complex distributions. We suggest
that they have potential value in prognostic research, particularly in better reporting and
validation of survival models.
Due to the pre-eminence of the Cox model, models with other link functions are not used
very often. The Royston-Parmar family offers three different link functions. In the breast
cancer data, which has moderately long follow-up, we found that a probit model provided a
much better fit than a proportional hazards model (see also [28]), and also validated well.
We have concentrated here on external validation of survival models. Internal validation
is no substitute, of course, but our methods could be used, for example, with internal data-splitting. With the latter, the data are split (usually unequally) at random into a primary
and secondary dataset. Model development is done on the primary data and validation on the
secondary data, once testing and 'tuning' of the model on the primary data has been completed.
Although our approach is formal in the sense that it provides estimates and significance
tests relevant to the validation task, we regard the graphical component as equally important.
The plots of observed and predicted survival probabilities, in both the primary and secondary
datasets, are a useful tool for illustrating the amount of discrimination offered by the model in
each dataset and the nature of the survival distributions, and may point to areas of lack of fit.
Varying prognostic classification schemes could be used, depending on the clinical context. For
example, particular categorisations of the prognostic index in the secondary dataset could be
used to study predictions at the extremes of the risk distribution.
Perhaps the most challenging questions in validation are what do we expect to see, and what
is it reasonable to expect? A suggested benchmark of 'optimal' performance is for a model to
give satisfactory performance in Case 1 (i.e. neither the baseline distribution nor the prognostic
index requiring revision), to have the calibration slope b near 1, and, in both datasets, to have
comparable R²_D values and to pass a goodness of fit test. The
probit model in the breast cancer example conforms to the benchmark. The bladder cancer
model does not; the predicted survival probabilities do not match the observed probabilities
very closely and Case 6 is a significantly better fit than Case 1, both indicating lack of fit.
Nevertheless, the model's discrimination, according to the R²_D statistic, is actually better in
the secondary dataset. The clinical aim(s) are paramount. If the aim is accurate prediction of
survival probabilities, the bladder cancer model does not perform very well. If the aim is to find
prognostic factors leading to a risk score that discriminates well, the opposite is the case. There
is no absolute standard of performance here. Nevertheless, our proposed benchmark may be
helpful in that it supplies a practical framework in which different aspects of model validation
may be assessed systematically.
A situation we have not considered here is whether it is possible to validate (in our sense) a
published model for which the original (primary) data are not available. With the Cox model,
the answer is clearly 'no', since the baseline survival function is not available. With parametric
models, the answer is a provisional 'yes', depending on publication of the full specification of
the model and its estimated parameters. For Royston-Parmar models, for example, one would
need to know the scale (hazards, odds or probit) of the model, the knot position(s) on t, and
the estimates of the spline coefficients γ and the regression coefficients β. Precisely how
parametric survival models should be reported
with possible validation (and other uses) in mind is a topic for further research.
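One possibility is a machine-readable report of the fitted model. The following is a hypothetical sketch only: the field names are ours and the values are illustrative, not taken from either example, but the fields correspond to the items listed above as necessary for later validation:

```python
# Hypothetical machine-readable specification of a fitted Royston-Parmar
# model: everything a later validator would need (scale, knot positions
# on log time, spline coefficients gamma, regression coefficients beta).
model_spec = {
    "family": "royston-parmar",
    "scale": "probit",                   # one of: hazard, odds, probit
    "time_unit": "years",
    "log_time_knots": [-0.5, 1.1, 2.2],  # boundary and interior knot(s), illustrative
    "gamma": [0.12, 1.05, -0.08],        # spline coefficients, illustrative
    "beta": {"x3": 0.014, "x4a": 0.52},  # regression coefficients, illustrative
}

def is_complete(spec):
    """Check that a published specification carries the minimum
    information needed to reconstruct predicted survival curves."""
    required = {"family", "scale", "log_time_knots", "gamma", "beta"}
    return required <= spec.keys()

print(is_complete(model_spec))
```

A reporting convention of this kind would make Case 1 validation possible even when the primary data themselves cannot be shared.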
Do we recommend the routine use of (flexible) parametric survival models, both for modelling and for validation? The answer is 'yes', provided the chosen model fits the data. In many
datasets (but by no means all) this rules out too-simple models, such as the exponential and
Weibull. Although the Cox model is convenient, its strength lies mainly in estimating relative
hazards for covariate effects. It is 'semi-parametric' in nature and was never intended for the
task of comprehensive prediction of survival probabilities and other important quantities. Although we favour Royston-Parmar models, other flexible parametric models exist, for example
those based on the Poisson distribution (e.g. reference [29]). The original implementation of
Royston-Parmar models is through the stpm user-written command for Stata [30]; an improved
version, stpm2, is available [31]. An undocumented feature of the mfp command in version 11
of Stata [32] allows users to develop MFP models [26] using stpm or stpm2. Features of mfp include elimination of weakly influential variables by backward elimination and determination of
functional form for continuous predictors using fractional polynomial functions. Further Stata
software is available on request from the first author to carry out the computations described
in the paper.
[1] F. E. Harrell. Regression modeling strategies, with applications to linear models, logistic
regression, and survival analysis. Springer, New York, 2001.
[2] J.C. Wyatt and D. G. Altman. Prognostic models: clinically useful or quickly forgotten?
British Medical Journal, 311:1539–1541, 1995.
[3] K. G.M. Moons, D. G. Altman, P. Royston, and Y. T. Vergouwe. Generalisability, impact
and application of prediction models in clinical practice. 2007. In preparation.
[4] Y. Vergouwe, E. W. Steyerberg, R. S. Foster, J. D. F. Habbema, and J. P. Donohue.
Validation of a prediction model and its predictors for the histology of residual masses in
nonseminomatous testicular cancer. Journal of Urology, 165:84–88, 2001.
[5] J. Hermans, A. D. G. Krol, K. van Groningen, P. M. Kluin, J. C. Kluin-Nelemans, M. H. H.
Kramer, E. M. Noordijk, F. Ong, and P. W. Wijermans. International Prognostic Index
for aggressive non-Hodgkin's lymphoma is valid for all malignancy grades. Blood, 86:1460–
1463, 1995.
[6] V. Ficarra, G. Martignoni, C. Lohse, G. Novara, M. Pea, S. Cavalleri, and W. Artibani.
External validation of the Mayo Clinic stage, size, grade and necrosis (SSIGN) score to
predict cancer specific survival using a European series of conventional renal cell carcinoma.
Journal of Urology, 175:1235–1239, 2006.
[7] F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: issues in
developing models, evaluating assumptions and accuracy, and measuring and reducing
errors. Statistics in Medicine, 15:361–387, 1996.
[8] H. C. van Houwelingen. Validation, calibration, revision and combination of prognostic
survival models. Statistics in Medicine, 19:3401–3415, 2000.
[9] D. G. Altman and P. Royston. What do we mean by validating a prognostic model?
Statistics in Medicine, 19:453–473, 2000.
[10] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, F. L. Baehner, M. G. Walker,
D. Watson, T. Park, W. Hiller, E. R. Fisher, D. L. Wickerham, J. Bryant, and N. Wolmark.
A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer.
New England Journal of Medicine, 351:2817–2826, 2004.
[11] M. Buyse, S. Loi, L. van’t Veer, G. Viale, M. Delorenzi, A. M. Glas, M. Saghatchian
d’Assignies, J. Bergh, R. Lidereau, P. Ellis, A. Harris, J. Bogaerts, P. Therasse, A. Floore,
M. Amakrane, F. Piette, E. Rutgers, C. Sotiriou, F. Cardoso, and M. J. Piccart on behalf
of the TRANSBIG Consortium. Validation and clinical utility of a 70-gene prognostic
signature for women with node-negative breast cancer. Journal of the National Cancer
Institute, 98:1183–1192, 2006.
[12] A. Burton and D. G. Altman. Missing covariate data within cancer prognostic studies: a
review of current reporting and proposed guidelines. British Journal of Cancer, 91:4–8, 2004.
[13] P. Royston and M. K. B. Parmar. Flexible proportional-hazards and proportional-odds
models for censored survival data, with application to prognostic modelling and estimation
of treatment effects. Statistics in Medicine, 21:2175–2197, 2002.
[14] N. Younes and J. Lachin. Link-based models for survival data with interval and continuous
time censoring. Biometrics, 53:1199–1211, 1997.
[15] F. J. Aranda-Ordaz. On two families of transformations to additivity for binary response
data. Biometrika, 68:357–363, 1981.
[16] P. Royston and W. Sauerbrei. A new measure of prognostic separation in survival data.
Statistics in Medicine, 23:723–748, 2004.
[17] M. E. Miller and S. L. Hui. Validation techniques for logistic regression models. Statistics
in Medicine, 10:1213–1226, 1991.
[18] P. M. Grambsch and T. M. Therneau. Proportional hazards tests and diagnostics based
on weighted residuals. Biometrika, 81:515–526, 1994.
[19] W. Sauerbrei, P. Royston, and M. Look. A new proposal for multivariable modelling of
time-varying e¤ects in survival data based on fractional polynomial time-transformation.
Biometrical Journal, 49:453–473, 2007.
[20] J. F. Lawless. Statistical models and methods for lifetime data. Wiley, New York, 1982.
[21] I. Helland. On the interpretation and use of R2 in regression analysis. Biometrics, 43:61–69, 1987.
[22] E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher. Assessment and comparison of
prognostic classification schemes for survival data. Statistics in Medicine, 18:2529–2545, 1999.
[23] M. Schemper. The explained variation in proportional hazards regression. Biometrika,
77:216–218, 1990.
[24] P. Royston. Explained variation for survival models. Stata Journal, 6:83–96, 2006.
[25] W. Sauerbrei, G. Bastert, H. Bojar, C. Beyerle, R. L. A. Neumann, C. Schmoor, and
M. Schumacher for the German Breast Cancer Study Group. Randomized 2×2 trial
evaluating hormonal treatment and the duration of chemotherapy in node-positive breast
cancer patients: An update based on 10 years’ follow-up. Journal of Clinical Oncology,
49:453–473, 2000.
[26] W. Sauerbrei and P. Royston. Building multivariable prognostic and diagnostic models:
transformation of the predictors by using fractional polynomials. Journal of the Royal
Statistical Society (Series A), 162:71–94, 1999.
[27] P. Royston, M. K. B. Parmar, and R. Sylvester. Construction and validation of a prognostic
model across several studies, with an application in super…cial bladder cancer. Statistics
in Medicine, 23:907–926, 2004.
[28] P. Royston. The lognormal distribution as a model for survival time in cancer, with an
emphasis on prognostic factors. Statistica Neerlandica, 55:89–104, 2001.
[29] P. W. Dickman, A. Sloggett, M. Hills, and T. Hakulinen. Regression models for relative
survival. Statistics in Medicine, 23:51–64, 2004.
[30] P. Royston. Flexible parametric alternatives to the Cox model, and more. Stata Journal,
1:1–28, 2001.
[31] P. C. Lambert and P. Royston. Further development of flexible parametric models for
survival analysis. Stata Journal, 9:265–290, 2009.
[32] StataCorp. Stata Reference Manual, version 11. Stata Press, 2009.