with a practical emphasis on fractional polynomials and applications in clinical epidemiology
Professor Patrick Royston,
MRC Clinical Trials Unit, London.
Berlin, April 2005.
8/4/2005 1
The problem …
“Quantifying epidemiologic risk factors using non-parametric regression: model selection remains the greatest challenge”
Rosenberg PS et al, Statistics in Medicine 2003; 22:3369-3381
Trivial nowadays to fit almost any model
To choose a good model is much harder
Overview
• Context and motivation
• Introduction to fractional polynomials for the univariate smoothing problem
• Extension to multivariable models
• More on spline models
• Stability analysis
• Stata aspects
• Conclusions
Motivation
• Often have continuous risk factors in epidemiology and clinical studies – how to model them?
• Linear model may describe a dose-response relationship badly
    ‘Linear’ = straight line = $\beta_0 + \beta_1 X + \dots$ throughout talk
• Using cut-points has several problems
• Splines recommended by some – but are not ideal
    Lack a well-defined approach to model selection
    ‘Black box’
    Robustness issues
Problems of cut-points
• Step-function is a poor approximation to true relationship
    Almost always fits data less well than a suitable continuous function
• ‘Optimal’ cut-points have several difficulties
    Biased effect estimates
    P-values too small (inflated type I error)
    Not reproducible in other studies
Example datasets
1. Epidemiology
• Whitehall 1
    17,370 male Civil Servants aged 40-64 years
    Measurements include: age, cigarette smoking, BP, cholesterol, height, weight, job grade
    Outcomes of interest: coronary heart disease, all-cause mortality
    Logistic regression
    Interested in risk as function of covariates
    Several continuous covariates
    Some may have no influence in multivariable context
Example datasets
2. Clinical studies
• German breast cancer study group (BMFT-2)
    Prognostic factors in primary breast cancer
    Age, menopausal status, tumour size, grade, no. of positive lymph nodes, hormone receptor status
    Recurrence-free survival time
    Cox regression
    686 patients, 299 events
    Several continuous covariates
    Interested in prognostic model and effect of individual variables
Example: Systolic blood pressure vs. age
[Figure: Whitehall 1, systolic BP vs. age. x-axis: Age, years (40 to 65)]
Example: Curve fitting (Systolic BP and age – not linear)
[Figure: Whitehall 1, BP vs. age, showing 95% CI, linear function, FP1 function and running line. x-axis: Age, years (40 to 65)]
Empirical curve fitting: Aims
• Smoothing
• Visualise relationship of Y with X
• Provide and/or suggest functional form
Some approaches
• ‘Non-parametric’ (local-influence) models
    Locally weighted (kernel) fits (e.g. lowess)
    Regression splines
    Smoothing splines (used in generalized additive models)
• Parametric (non-local influence) models
    Polynomials
    Non-linear curves
    Fractional polynomials
    Intermediate between polynomials and non-linear curves
Local regression models
• Advantages
    Flexible – because local!
    May reveal ‘true’ curve shape (?)
• Disadvantages
    Unstable – because local!
    No concise form for models
    Therefore, hard for others to use – publication, comparison of results with those from other models
    Curves not necessarily smooth
    ‘Black box’ approach
    Many approaches – which one(s) to use?
Polynomial models
• Do not have the disadvantages of local regression models, but do have others:
• Lack of flexibility (low order)
• Artefacts in fitted curves (high order)
• Cannot have asymptotes
Fractional polynomial models
• Describe for one covariate, X
    multiple regression later
• Fractional polynomial of degree m for X with powers $p_1, \dots, p_m$ is given by
    $\mathrm{FP}m(X) = \beta_1 X^{p_1} + \dots + \beta_m X^{p_m}$
• Powers $p_1, \dots, p_m$ are taken from a special set
    $\{-2, -1, -0.5, 0, 0.5, 1, 2, 3\}$
• Usually m = 1 or m = 2 is sufficient for a good fit
FP1 and FP2 models
• FP1 models are simple power transformations
• $1/X^2$, $1/X$, $1/\sqrt{X}$, $\log X$, $\sqrt{X}$, $X$, $X^2$, $X^3$
    8 models
• FP2 models are combinations of these
    For example $\beta_1 (1/X) + \beta_2 (X^2)$
    28 models
• Note ‘repeated powers’ models
    For example $\beta_1 (1/X) + \beta_2 (1/X) \log X$
    8 models
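The model counts above can be checked with a short sketch. It assumes the usual FP conventions that power 0 stands for log X and that a repeated power multiplies the previous term by log X; the function names `power_term` and `fp_terms` are illustrative only and are not part of any package mentioned in this talk.

```python
import math
from itertools import combinations_with_replacement

# The special set of powers; power 0 denotes log X.
POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def power_term(x, p):
    """Return x**p, with the FP convention that p = 0 means log(x)."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    return math.log(x) if p == 0 else x ** p

def fp_terms(x, powers):
    """Basis terms of an FP function at a single value x.

    `powers` must be sorted so that repeated powers are adjacent.
    A repeated power multiplies the previous term by log(x),
    e.g. powers (-1, -1) give the terms 1/x and (1/x) * log(x).
    """
    terms = []
    previous = None
    for p in powers:
        if p == previous:
            terms.append(terms[-1] * math.log(x))
        else:
            terms.append(power_term(x, p))
        previous = p
    return terms

# Enumerate the candidate models.
fp1_models = [(p,) for p in POWERS]
fp2_models = list(combinations_with_replacement(POWERS, 2))
distinct = [m for m in fp2_models if m[0] != m[1]]
repeated = [m for m in fp2_models if m[0] == m[1]]

print(len(fp1_models), len(distinct), len(repeated))  # 8 28 8
```

The 28 distinct-power and 8 repeated-power models together give the 36 FP2 candidates.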
FP1 and FP2 models: some properties
• Many useful curves
• A variety of features are available:
    Monotonic
    Can have asymptote
    Non-monotonic (single maximum or minimum)
    Single turning-point
• Get better fit than with conventional polynomials, even of higher degree
Examples of FP2 curves - varying powers
[Figure: four panels of FP2 curves with powers (-2, 1), (-2, 2), (-2, -1) and (-2, -2)]
Examples of FP2 curves - single power, different coefficients
[Figure: FP2 curves with powers (-2, 2) and different coefficients. x-axis: x (10 to 50); y-axis from -4 to 4]
A philosophy of function selection
• Prefer simple (linear) model
• Use more complex (non-linear) FP1 or FP2 model if indicated by the data
• Contrast to local regression modelling
    Already starts with a complex model
Estimation and significance testing for FP models
• Fit model with each combination of powers
    FP1: 8 single powers
    FP2: 36 combinations of powers
• Choose model with lowest deviance (MLE)
• Comparing FPm with FP(m - 1):
    compare deviance difference with $\chi^2$ on 2 d.f.
    one d.f. for power, 1 d.f. for regression coefficient
    supported by simulations; slightly conservative
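A minimal sketch of the FP1 search follows, assuming a normal-errors model fitted by least squares, so that the lowest residual sum of squares is also the lowest deviance. The toy data and the helper names `transform`, `rss_simple_regression` and `best_fp1` are made up for illustration.

```python
import math

POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def transform(x, p):
    """FP1 transformation: log(x) for p = 0, otherwise x**p (x must be > 0)."""
    return math.log(x) if p == 0 else x ** p

def rss_simple_regression(z, y):
    """Residual sum of squares of the least-squares line y = b0 + b1 * z."""
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    szz = sum((zi - z_mean) ** 2 for zi in z)
    szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    b1 = szy / szz
    b0 = y_mean - b1 * z_mean
    return sum((yi - b0 - b1 * zi) ** 2 for zi, yi in zip(z, y))

def best_fp1(x, y):
    """Try all 8 powers and return (power, rss) with the smallest RSS.

    For a normal-errors model the deviance is n * log(RSS / n) plus a
    constant, so the smallest RSS is also the lowest deviance.
    """
    fits = [(rss_simple_regression([transform(xi, p) for xi in x], y), p)
            for p in POWERS]
    rss, power = min(fits)
    return power, rss

# Toy data generated exactly from y = 2 + 3 * log(x), so power 0 should win.
x = [float(i) for i in range(1, 51)]
y = [2.0 + 3.0 * math.log(xi) for xi in x]
power, rss = best_fp1(x, y)
print(power)
```

For logistic or Cox models the same search applies, with the model deviance taking the place of the RSS.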
Selection of FP function
• Has flavour of a closed test procedure
• Use $\chi^2$ approximations to get P-values
• Define nominal P-value for all tests (often 5%)
• Fit linear and best FP1 and FP2 models
• Test FP2 vs. null – test of any effect of X ($\chi^2$ on 4 df)
• Test FP2 vs linear – test of non-linearity ($\chi^2$ on 3 df)
• Test FP2 vs FP1 – test of more complex function against simpler one ($\chi^2$ on 2 df)
Example: Systolic BP and age
Model           d.f.   Deviance difference   P-value
FP2 v Null      4      944.57                0.000
FP2 v Linear    3      29.95                 0.000
FP2 v FP1       2      3.29                  0.2
Reminder:
    FP1 had power 3: $\beta_1 X^3$
    FP2 had powers (1,1): $\beta_1 X + \beta_2 X \log X$
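The sequence of tests can be sketched as below, using the deviance differences from the systolic BP and age example. This is an illustration under stated assumptions, not the code behind the talk: `chi2_sf` is a hand-written upper-tail probability for integer degrees of freedom, `select_fp_function` is a hypothetical name, and the FP2 deviance is set to 0 as an arbitrary baseline so that the other arguments equal the reported differences.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(chi-square with integer df > x)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    total = 0.0
    for k in range(1, df, 2):
        if k > 1:
            term *= x / k
        total += term
    return math.erfc(math.sqrt(half)) + total

def select_fp_function(dev_null, dev_linear, dev_fp1, dev_fp2, alpha=0.05):
    """Closed-test-style choice between 'null', 'linear', 'FP1' and 'FP2'."""
    if chi2_sf(dev_null - dev_fp2, 4) >= alpha:
        return "null"      # no evidence of any effect of X
    if chi2_sf(dev_linear - dev_fp2, 3) >= alpha:
        return "linear"    # no evidence of non-linearity
    if chi2_sf(dev_fp1 - dev_fp2, 2) >= alpha:
        return "FP1"       # FP2 not significantly better than FP1
    return "FP2"

# Deviance differences relative to FP2 (baseline 0 is an arbitrary choice).
choice = select_fp_function(944.57, 29.95, 3.29, 0.0)
print(choice, round(chi2_sf(3.29, 2), 2))
```

With these numbers the first two tests reject and the third does not, so the procedure stops at the FP1 function.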
Aside: FP versus spline
• Why care about FPs when splines are more flexible?
• More flexible means more unstable
    More chance of ‘over-fitting’
• In epidemiology, dose-response relationships are often simple
• Illustrate by small simulation example
FP versus spline (continued)
• Logarithmic relationships are common in practice
• Simulate regression model $y = \beta_0 + \beta_1 \log(X) + \text{error}$
• Error is normally distributed $N(0, \sigma^2)$
• Take $\beta_0 = 0$, $\beta_1 = 1$; X has lognormal distribution
• Vary $\sigma = \{1, 0.5, 0.25, 0.125\}$
• Fit FP1, FP2 and spline with 2, 4, 6 d.f.
• Compute mean square error
• Compare with mean square error for true model
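A reduced version of this simulation can be sketched as follows. It covers only the true model and the best FP1 fit; FP2 and the spline fits are left out to keep the example short. The sample size `n = 200`, the seed and the lognormal parameters (0, 1) are assumptions, since the slides do not state them.

```python
import math
import random

POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def transform(x, p):
    """FP1 transformation: log(x) for p = 0, otherwise x**p."""
    return math.log(x) if p == 0 else x ** p

def fit_line(z, y):
    """Least-squares intercept, slope and RSS for y = b0 + b1 * z."""
    n = len(z)
    zm, ym = sum(z) / n, sum(y) / n
    b1 = (sum((a - zm) * (b - ym) for a, b in zip(z, y))
          / sum((a - zm) ** 2 for a in z))
    b0 = ym - b1 * zm
    rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(z, y))
    return b0, b1, rss

def simulate_once(sigma, n=200, seed=1):
    """One simulated dataset: y = log(x) + N(0, sigma^2), x lognormal."""
    rng = random.Random(seed)
    x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
    truth = [math.log(xi) for xi in x]            # beta0 = 0, beta1 = 1
    y = [t + rng.gauss(0.0, sigma) for t in truth]

    def rss(p):
        return fit_line([transform(xi, p) for xi in x], y)[2]

    def mse(p):
        # Mean squared distance between fitted curve and true curve.
        z = [transform(xi, p) for xi in x]
        b0, b1, _ = fit_line(z, y)
        return sum((b0 + b1 * zi - t) ** 2 for zi, t in zip(z, truth)) / n

    best_p = min(POWERS, key=rss)                 # best-fitting FP1 power
    return {"sigma": sigma, "fp1_power": best_p,
            "mse_true_model": mse(0), "mse_fp1": mse(best_p)}

results = [simulate_once(s) for s in (1, 0.5, 0.25, 0.125)]
for r in results:
    print(r)
```

A full comparison would repeat this over many simulated datasets and average the mean square errors.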
FP vs. spline (continued)
[Figure: four panels of simulated data: Sigma = 1, Sigma = 0.5, Sigma = 0.25, Sigma = 0.125. x-axis: x (0 to 6)]
FP vs. spline (continued)
FP1 and spline with 2 df
[Figure: four panels of fitted curves. Solid: FP1; dashed: spline 2 df. x-axis from 0 to 6]
FP vs. spline (continued)
FP2 and spline with 4 df
[Figure: four panels of fitted curves. x-axis from 0 to 5]
FP vs. spline (continued)
FP vs. spline: prediction error
[Figure: prediction error against sigma (0.125, 0.25, 0.5, 1) for True, FP1, FP2, Spline 2df, Spline 4df and Spline 6df]
FP vs. spline (continued)
• In this example, spline usually less accurate than FP
• FP2 less accurate than FP1 (over-fitting)
• FP1 and FP2 more accurate than splines
• Splines often had non-monotonic fitted curves
    Could be medically implausible
• Of course, this is a special example
Multivariable FP (MFP) models
• Assume have k > 1 continuous covariates and perhaps some categoric or binary covariates
• Allow dropping of non-significant variables
• Wish to find best multivariable FP model for all X’s
• Impractical to try all combinations of powers
• Require iterative fitting procedure
Fitting multivariable FP models (MFP algorithm)
• Combine backward elimination of weak variables with search for best FP functions
• Determine fitting order from linear model
• Apply FP model selection procedure to each X in turn
    fixing functions (but not $\beta$’s) for other X’s
• Cycle until FP functions (i.e. powers) and variables selected do not change
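The cycling step can be sketched as a small skeleton. It is not the `mfp` implementation: the per-variable selection is passed in as a callable, and `toy_select` below is a hard-coded stand-in with made-up return values, used only so that the loop can be run.

```python
def mfp_backfit(variables, select_function, max_cycles=10):
    """Cycle over variables until the selected functions stop changing.

    variables       : names, already in the fitting order
    select_function : callable(name, current) -> None (variable dropped),
                      "linear", or a tuple of FP powers. It is expected to
                      re-estimate all coefficients while keeping the other
                      variables' functions in `current` fixed.
    Returns the final functions and the number of cycles used.
    """
    current = {name: "linear" for name in variables}
    for cycle in range(1, max_cycles + 1):
        previous = dict(current)
        for name in variables:
            current[name] = select_function(name, current)
        if current == previous:          # no change in a full cycle
            return current, cycle
    return current, max_cycles

# Hypothetical stand-in for the real FP selection procedure.
def toy_select(name, current):
    lookup = {"x1": (-2, -0.5), "x2": None, "x5": (-2, -1)}
    return lookup[name]

model, cycles = mfp_backfit(["x5", "x1", "x2"], toy_select)
print(model, cycles)
```

In a real analysis `select_function` would run the FP selection tests for one variable with the current functions of the others held fixed.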
Example: Prognostic factors in breast cancer
• Aim to develop a prognostic index for risk of tumour recurrence or death
• Have 7 prognostic factors
    4 continuous, 3 categorical
• Select variables and functions using 5% significance level
Univariate linear analysis
Variable   Name                            $\chi^2$
X1         Age                             0.58
X2         Menopausal status               0.28
X3         Tumour size                     15.68
X4a        Grade 2 or 3                    19.92
X4b        Grade 3                         8.19
X5         No. of positive lymph nodes     50.02
X6         Progesterone receptor status    34.04
X7         Oestrogen receptor status       4.70
Univariate FP2 analysis
Variable     Powers        $\chi^2$   d.f.   P         Gain
X1  age      (-2, -0.5)    17.61      4      0.001     17.03
X3  size     (-1, 3)       19.81      4      0.001     4.13
X5  nodes    (1, 2)        81.36      4      < 0.001   31.34
X6  PgR      (-0.5, 0)     52.73      4      < 0.001   18.69
X7  ER       (-2, -1)      23.07      4      < 0.001   18.37
Gain compares FP2 with linear on 3 d.f.
All factors except for X3 have a non-linear effect
Multivariable FP analysis
Variable         FP etc.       $\chi^2$   d.f.   P
X1   age         (-2, -0.5)    19.33      4      0.001
X3   size        Out           5.31       4      0.3
X5   nodes       (-2, -1)      74.14      4      <0.001
X6   PgR         0.5           32.70      4      <0.001
X7   ER          Out           2.15       4      0.7
X2   mens.       Out           0.21       1      0.6
X4a  grad 2/3    In            4.59       1      0.03
X4b  grad 3      Out           0.15       1      0.7
Comments on analysis
• Conventional backwards elimination at 5% level selects X4a, X5, X6, and X1 is excluded
• FP analysis picks up same variables as backward elimination, and additionally X1
• Note considerable non-linearity of X1 and X5
• X1 has no linear influence on risk of recurrence
• FP model detects more structure in the data than the linear model
Plots of fitted FP functions
[Figure: Breast cancer, fitted FP functions. Panels: Age (x-axis: Age, years); Nodes (x-axis: No. of positive lymph nodes); Progesterone receptor (x-axis: Progesterone receptor status)]
Survival by risk groups
[Figure: Prognostic classification scheme. Survival curves for the low, medium and high risk groups. x-axis: Recurrence-free survival, yr (0 to 8)]
Robustness of FP functions
• Breast cancer example showed non-robust functions for nodes – not medically sensible
• Situation can be improved by performing covariate transformation before FP analysis
• Can be done systematically (work in progress)
• Sauerbrei & Royston (1999) used negative exponential transformation of nodes
    exp(–0.12 * number of nodes)
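The transformation itself is a one-liner. The sketch below tabulates it for a few node counts; the function name `transform_nodes` is illustrative.

```python
import math

def transform_nodes(n_nodes, rate=0.12):
    """Negative exponential transformation exp(-rate * number of nodes)."""
    return math.exp(-rate * n_nodes)

for n in (0, 1, 5, 10, 20, 50):
    print(n, round(transform_nodes(n), 3))
```

The transformed value falls from 1 at zero nodes towards 0, so very large node counts are compressed into a narrow range and have less leverage on the fitted function.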
Making the function for lymph nodes more robust
[Figure: fitted function for lymph nodes, original vs. exponential transformation. x-axis: No. of positive lymph nodes (0 to 50)]
2nd example: Whitehall 1
MFP analysis
Covariate            FP etc.
Age                  Linear
Cigarettes           0.5
Systolic BP          -1, -0.5
Total cholesterol    Linear
Height               Linear
Weight               -2, 3
Job grade            In
No variables were eliminated by the MFP algorithm
Weight is eliminated by linear backward elimination
Plots of FP functions
[Figure: Whitehall 1, multivariable FP analysis. Panels and x-axes: Age (Age at entry); Cigarettes (Cigarettes/day); Systolic BP; Total cholesterol (mmol/l); Weight (kg); Height (cm)]
A new multivariable regression algorithm with spline functions
• Inspired by closed test procedure for selecting an FP function
• Start with predefined number of knots
    Determines maximum complexity of function
• Use predetermined knot positions
    E.g. at fixed percentile positions of distribution of x
• Simplest function (default) is linear
• Closed test procedure to reduce the knot set if some knots are not significant
• Apply backfitting procedure as in mfp
• Implemented in Stata as new command mrsnb
Splines: Breast cancer example
• Selects variables similar to mfp
    Grade 2/3 omitted, otherwise selected variables are identical
• Knots: age(46, 53); transformed nodes(linear); PgR(7, 132)
• Deviance of selected model almost identical to mfp model
Plots of fitted FP functions
[Figure: fitted functions for Age, years; No. of positive lymph nodes; Progesterone receptor status. Solid lines, FP; dashed lines, spline]
Improving the robustness of spline models
• Often have covariates with positively skew distributions – can produce curve artefacts
• Simple approach is to log-transform covariates with a skew distribution – e.g. skewness $\gamma_1 > 0.5$
• Then fit the spline model
• In the breast cancer example, this approach gives a more satisfactory log function for PgR
Stability of FP models
• Models (variables, FP functions) selected by statistical criteria – cut-off on P-value
• Approach has several advantages …
• … and also is known to have problems
    Omission bias
    Selection bias
    Unstable – many models may fit equally well
Stability investigation
• Instability may be studied by bootstrap resampling (sampling with replacement)
    Take bootstrap sample B times
    Select model by chosen procedure
    Count how many times each variable is selected
    Summarise inclusion frequencies & their dependencies
    Study fitted functions for each covariate
• May lead to choosing several possible models, or a model different from the original one
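The counting part of such an investigation can be sketched as follows. The model selection procedure is passed in as a callable; `toy_select` is a made-up stand-in for running MFP on each bootstrap sample, and the number of replications and the seed are arbitrary choices.

```python
import random
from collections import Counter

def inclusion_frequencies(rows, select_model, n_boot=200, seed=2005):
    """Bootstrap inclusion frequencies for a model selection procedure.

    rows         : list of observations
    select_model : callable(sample) -> iterable of selected variable names
    Returns (frequency per variable, counts of each distinct model).
    """
    rng = random.Random(seed)
    counts = Counter()
    models = Counter()
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]   # sampling with replacement
        selected = frozenset(select_model(sample))
        counts.update(selected)
        models[selected] += 1
    freqs = {name: counts[name] / n_boot for name in counts}
    return freqs, models

# Hypothetical selector: always keeps x5, keeps x1 only in some samples.
def toy_select(sample):
    chosen = ["x5"]
    mean_x1 = sum(r["x1"] for r in sample) / len(sample)
    if mean_x1 > 0.5:
        chosen.append("x1")
    return chosen

rows = [{"x1": i % 2} for i in range(100)]
freqs, models = inclusion_frequencies(rows, toy_select)
print(freqs)
print(len(models), "different models")
```

Storing each selected model as a `frozenset` also gives the number of distinct models over the replications.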
Bootstrap stability analysis of the breast cancer dataset
• 5000 bootstrap samples taken (!)
• MFP algorithm with Cox model applied to each sample
• Resulted in 1222 different models (!!)
• Nevertheless, could identify stable subset consisting of 60% of replications
    Judged by similarity of functions selected
Bootstrap stability analysis of the breast cancer dataset
Variable                  Model selected   % bootstraps model selected
Age                       FP1              16
                          FP2              76
Menopausal status         --               20
Tumour size               FP1              34
                          FP2              6
Grade 2/3                 --               58
Grade 3                   --               9
Lymph nodes               FP1              100
Progesterone receptors    FP1              95
                          FP2              4
Oestrogen receptors       FP1              13
                          FP2              6
Bootstrap analysis: summaries of fitted curves from stable subset
[Figure: four panels of fitted curves from the stable subset. x-axes: Age, years; Number of positive lymph nodes; Tumour size, mm; PgR, fmol/L]
Presentation of models for continuous covariates
• The function + 95% CI gives the whole story
• Functions for important covariates should always be plotted
• In epidemiology, sometimes useful to give a more conventional table of results in categories
• This can be done from the fitted function
Example: Cigarette smoking and all-cause mortality (Whitehall 1)
Cigarettes per day           Number               OR (model based)
Range          Ref. point    At risk    Dying     Estimate   95% CI
0 (referent)   0             10103      690       1.00       --
1-10           5             2254       243       1.69       1.59, 1.80
11-20          15            3448       494       2.25       2.04, 2.49
21-30          25            1117       185       2.60       2.31, 2.91
31-40          35            283        48        2.86       2.52, 3.24
41-50          45            43         8         3.07       2.68, 3.52
51-60          55            12         2         3.25       2.82, 3.75
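How such a table is read off a fitted function can be sketched as below. The fitted function here is hypothetical: the coefficient 0.2 and the square-root form are invented for illustration and do not reproduce the Whitehall 1 estimates, and confidence intervals are left out because they need the covariance matrix of the fitted coefficients.

```python
import math

def odds_ratio(f, x, x_ref=0.0):
    """OR at x relative to x_ref for a fitted log-odds function f."""
    return math.exp(f(x) - f(x_ref))

# Hypothetical fitted function on the log-odds scale (coefficient invented).
def fitted(x):
    return 0.2 * math.sqrt(x)

# Reference points for each category, as in the table.
reference_points = {"0 (referent)": 0, "1-10": 5, "11-20": 15, "21-30": 25,
                    "31-40": 35, "41-50": 45, "51-60": 55}

table = {label: odds_ratio(fitted, ref) for label, ref in reference_points.items()}
for label, value in table.items():
    print(label, round(value, 2))
```

Each row evaluates the continuous function at the category's reference point, so no cut-point model has to be fitted.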
Other issues (1)
• Handling continuous confounders
    May use a larger P-value for selection e.g. 0.2
    Not so concerned about functional form here
• Binary/continuous covariate interactions
    Can be modelled using FPs (Royston & Sauerbrei 2004)
    Adjust for other factors using MFP
Other issues (2)
• Time-varying effects in survival analysis
    Can be modelled using FP functions of time (Berger; also Sauerbrei & Royston, in progress)
• Checking adequacy of FP functions
    May be done by using splines
    Fit FP function and see if spline function adds anything, adjusting for the fitted FP function
Stata aspects
• Command mfp is part of Stata 8
• Example of use:
    mfp stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, select(0.05, hormon:1)
• Command mrsnb is available from PR
• Example of use:
    mrsnb stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, select(0.05, hormon:1)
• Command mfpboot is available from PR
    Does bootstrap stability analysis of MFP models
Concluding remarks (1)
• FP method in general
    No reason (other than convention) why regression models should include only positive integer powers of covariates
    FP is a simple extension of an existing method
    Simple to program and simple to explain
    Parametric, so can easily get predicted values
    FP usually gives better fit than standard polynomials
    Cannot do worse, since standard polynomials are included
Concluding remarks (2)
• Multivariable FP modelling
    Many applications in general context of multiple regression modelling
    Well-defined procedure based on standard principles for selecting variables and functions
    Aspects of robustness and stability have been investigated (and methods are available)
    Much experience gained so far suggests that method is very useful in clinical epidemiology
Some references
• Royston P, Altman DG (1994) Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics 43: 429-467.
• Royston P, Altman DG (1997) Approximating statistical functions by using fractional polynomial regression. The Statistician 46: 1-12.
• Sauerbrei W, Royston P (1999) Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. JRSS(A) 162: 71-94. Corrigendum JRSS(A) 165: 399-400, 2002.
• Royston P, Ambler G, Sauerbrei W (1999) The use of fractional polynomials to model continuous risk variables in epidemiology. International Journal of Epidemiology 28: 964-974.
• Royston P, Sauerbrei W (2004) A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Statistics in Medicine 23: 2509-2525.
• Royston P, Sauerbrei W (2003) Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation. Statistics in Medicine 22: 639-659.
• Armitage P, Berry G, Matthews JNS (2002) Statistical Methods in Medical Research. Oxford, Blackwell.