4 The Proposed Method

Age-based Multilevel Regression Modelling of Melanoma
Incidence in the USA
Antony Brown Carsten Maple Malcolm Keech
Computing and Information Systems
University of Luton
Park Square, Luton, Bedfordshire, LU1 3JU
ENGLAND
Abstract
The changing incidence of cancer is an increasing
problem, and it would be of great benefit to be able to
predict future changes. Brown and Maple have previously
proposed a method for modelling such data [2]. The
model appeared to perform better than existing methods
when applied to data available at that time. In this work
we apply the techniques to newly available data that
confirms the accuracy of the estimates that were predicted
in [2]. We also present suggestions for improvements to
the method.
Key-Words: Epidemiology, Melanoma, Modelling,
Regression, Prediction.
1
Introduction
Prediction of future incidence of cancer rates allows better
planning and resource allocation for prevention and
treatment. This becomes increasingly important for
cancers whose incidence is on the increase, as without
proper prediction the burden could far outstrip the
provisions made. In the past, a range of modelling
techniques have been applied to various factors of this
problem.
Accurate modelling and prediction is useful since any
trends that are identified can be compared to underlying
trends in other phenomena. This comparison can then be
used to help confirm or deny causative factors for the
disease. Likewise, known (or suspected) causative factors
may be included in a model to improve the results of
predictions.
The type of data required for a study depends on the
model used and the intentions of the research. There is a
balance to be struck between a depth and breadth of data.
Generally, the more variables obtained for each
individual, the fewer individuals used in the data. For
example, a clinical trial would be able to obtain a large
amount of information about its participants, including
such things as familial history, eye colour and even
genetic samples. However, the number of participants
would also be relatively small since collection of data this
detailed is very time consuming and costly. Conversely, a
cancer registry would hold records for a large number
of people but, as the data would be drawn from a variety
of sources such as hospitals and clinics, only information
that is routinely collected would be available. Hence, it
may be considered that the volume of data is of the same
order for any reasonable study.
Models considering associations between genetic
makeup and melanoma would obviously need access to
the in-depth data that would come from a special clinical
study and would therefore be limited in the breadth of
data. A study looking at the links between age and
melanoma would benefit from having access to a much
larger pool of data but any predictions would be more
generalised than a model with more predictive variables.
This illustrates how the aim of a study determines the
breadth and hence depth of the data required.
This article uses data from the National Cancer
Institute's SEER program [1]. This data originally
consisted of incidence rates for the U.S. 1973-1997
categorised by sex and split into 5-year age groups. Since
the publication of [2], data from three further years, 1998-2000,
have been publicly released and the models have
been used to predict these new data points. The data
comes from a large population, and so can be used to
predict the general trends in melanoma incidence for the
U.S. It is important to note that the number of melanoma
cases for ages less than 20 is too low to produce reliable
incidence rates, and so these ages will be excluded from this study,
as was the case in our previous study.
This article reintroduces the previously presented
novel prediction method [2], evaluates its performance
with some newly available data and presents a proposed
extension of the method.
2
Epidemiology of Melanoma
Whilst the positive relationship between UV exposure and
other forms of skin cancer has long been shown to exist,
the relationship with melanoma is much more
complicated, as can be seen in [1], [3] and [4].
For example, from [3] it can be seen that the body sites
most commonly affected by melanoma are not those that
receive the most exposure to sunlight. In addition,
occupations that usually receive large amounts of sun
exposure (such as unskilled labour, farming, etc.) have a
lower risk of melanoma than other occupations, contrary
to what might first be thought.
One proposed explanation for the relationship between
melanoma and sunlight is that it is not a cumulative effect,
but depends on sporadic exposure to larger amounts of
UV radiation than is usual. Positive links have been
suggested between melanoma incidence and economic
factors, as well as with managerial occupations [3]. This
might possibly be explained by the link between increased
salary and number of exotic foreign holidays, which
would expose the individual to unaccustomedly large
amounts of UV radiation for brief periods of time.
However, this type of association is very hard to confirm
as there are numerous other factors that could also play a
role.
As with the vast majority of cancers, gender has also
been shown to have a significant role in the epidemiology
of melanoma, see [3]. The overall incidence rates for
women are lower than for men, and the age specific
distributions for the sexes are different as well, which will
be shown later. There are also differences between the
sites of the body most commonly affected in men and
women.
Race also has an effect on melanoma incidence, with
darker skinned races having much lower rates than lighter
skinned races. The link between fair complexion and
increased risk of melanoma has been shown, with features
such as fair hair, light-coloured eyes and a tendency to
freckle all showing an increased risk [4].
The presented model, however, only looks at the effects
of age, sex and time on melanoma incidence, as we are
interested in a general population model rather than one
that concerns the risks of individuals. The data for these
variables is more readily available on the larger scale
needed to make a population specific prediction of use.
The data used will only come from those races classified
as ‘white’ as their incidence is much higher than other
races, and it is important to keep as many possible
causative variables constant.
3
Existing Methods
A variety of modelling techniques have been applied to
cancer incidence/mortality. In general, non-linear models
give better representations of the actual patterns present in
the data, and so are useful in etiological studies.
However, non-linear models can prove unreliable outside
the range of data to which they were fitted. This is
because they respond to local rather than global features
of the data, and as such they are better suited to
interpolation than to
extrapolation. Without some way to govern this effect,
they are often unsuitable for predictive purposes, for
which linear techniques prove more useful, since their
behaviour is more stable. This section will review some of
the techniques that have been most influential in the
development of the proposed novel method.
3.1 Linear, Log-linear and Non-Linear models
Linear and log-linear models for the prediction of
cancer incidence were proposed by Dyba et al. [5].
These models are non-linear in parameters, but linear in
form. This combines the flexibility of a non-linear model
with the stability of a linear one.
The models produce separate predictions for various age
groups within the population, but the parameters for all
age groups are determined at the same time. The
advantage of this is that it allows individual age groups to
have separate rates of increase, whilst still allowing each
rate to be influenced by all of the data.
These models are specifically designed for prediction
purposes, and as such they are not bound by any
constraints that closely resemble the physical processes
that actually take place. However, the predictions given
have wide prediction intervals, which decreases
their usefulness.
The ability to allow the future behaviour of each age
group to be influenced by the overall trend in the rest of
the data is a useful one, and efforts have been made to
incorporate this into the novel method proposed.
3.2 Spline Regression
The spline regression technique fits a series of polynomial
equations (usually quadratic or cubic) to the historical
data of the disease [6]. The flexibility of the spline allows
an extremely accurate fit to rapidly changing data. This
means that the spline model will usually give very
accurate results within the interval for which it was fitted,
but is much less accurate for prediction of values outside
that region.
Therefore, splines are well suited to studies trying to
find/prove associations between cancer incidence and
some causative factor, for example annual sunlight
exposure. The spline model can provide a very useful
guide to the relationship between exposure and disease,
see [6].
It is possible to apply cubic splines to the data, but
careful consideration has to be given to ensure that their
behaviour outside the data they are based on is consistent
with historic data.
3.3 Multilevel Modelling
A multilevel model simultaneously models a situation at
several levels of detail, with the purpose of seeing these
levels as a whole, rather than as independent pieces. Such
models can allow trends on one scale of the model to
affect those on another, giving a more complete picture of
the real situation.
In the case of melanoma, this technique has been
applied to mortality by Langford, Bentham and
McDonald [7]. The model was used to investigate the
relationship between Ultra Violet B (UVB) exposure and
melanoma mortality in Europe. Thus, geographical
groupings were used as the different levels of the model.
The data is grouped into countries, which in turn are split
into geographical regions, each containing counties.
Several models were produced by iteratively fitting a
generalized least squares estimation to the mortality data
from nine European countries.
Previously, a quadratic relationship between melanoma
and latitude, in regards to Ultra Violet B (UVB) exposure,
had been identified. However, the results showed that
there was no clear relationship between melanoma
mortality and UVB. Some countries showed a positive
relationship, others showed no significant relationship and
a few even displayed a negative relationship. This
discrepancy is explained by the fact that whilst the whole
data set may display a positive correlation with UVB, the
multilevel model shows the underlying trends that
contribute to this overall effect. Thus, it becomes apparent
that the relationship with UVB varies with different
populations, and so is more complicated than was
previously supposed.
This model was applied to the evaluation of past trends
and their relationship to various causative factors and not
prediction. However, the ability to see trends in the data
from a variety of perspectives, and to determine how one
trend is affected by another, would be very useful in a
prediction model. Therefore the proposed method
includes a limited multilevel aspect and can be extended
to include further levels of detail.
4
The Proposed Method
This method incorporates the idea of a multilevel
approach, combined with a proportional view of the data.
All of the data is expressed as a proportion of the larger
data set, which produces two levels to the model. This
novel application of proportions to this type of data helps
produce more consistent predictions.
The raw data from the cancer registry is converted into
age-adjusted (to the standard world population) incidence
rates, yij where i is the age group and j is the gender, to
allow comparisons between age groups. In order to help
distinguish the incidence trend from random fluctuations,
data smoothing techniques are applied. In the case of this
method, a 3-year moving mean will be used. This is
applied a limited number of times to ensure the loss of
detail does not interfere with the regression process.
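The smoothing step can be sketched as follows. This is a minimal illustration only: the centred window, the treatment of the two end points (left unchanged), the default of two passes and the sample rates are all assumptions, none of which is specified in the text.

```python
def moving_mean_3(values):
    """One pass of a centred 3-year moving mean.

    Assumption: the first and last points have no full window and are
    kept unchanged.
    """
    if len(values) < 3:
        return list(values)
    smoothed = [values[0]]
    for k in range(1, len(values) - 1):
        smoothed.append((values[k - 1] + values[k] + values[k + 1]) / 3.0)
    smoothed.append(values[-1])
    return smoothed


def smooth(values, passes=2):
    """Apply the moving mean a limited number of times (passes is hypothetical)."""
    result = list(values)
    for _ in range(passes):
        result = moving_mean_3(result)
    return result


# Made-up yearly incidence rates for one age group (not SEER data).
rates = [10.0, 13.0, 10.0, 16.0, 13.0, 19.0]
once = smooth(rates, passes=1)
twice = smooth(rates, passes=2)
```

Each further pass removes more of the year-to-year fluctuation but also more of the detail, which is why the number of passes is kept small.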
The model is created using two different sets of data,
the first being the yearly sum of the incidence rates from
all age groups,
$$S = \sum_{j=1}^{2} \sum_{i=1}^{15} y_{ij}.$$
The second is the age specific proportion of the sum of
the incidence that the incidence makes up,
$$P_{ij} = \frac{y_{ij}}{S}.$$
The product of these two
values gives the age specific incidence rate. Therefore,
the product of estimations to these values will be
estimations to the age specific incidence rates,
$$P'_{ij} S' = y'_{ij}.$$
The proportion value, Pij , contains information about
the age specific incidence, but it is in relation to S. This
helps reduce the effects of random variations in the data,
by expressing them as a proportion of the total as well as
in absolute terms. This allows regressions to be fitted to
the proportion and sum, and the two combined, to produce
a prediction for incidence.
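The decomposition into a sum and a set of proportions can be illustrated with a short sketch. It assumes that S is taken over every age group and both sexes for a single year; the function names and the sample rates are hypothetical.

```python
def decompose(rates):
    """Split one year's rates into the total S and the proportions P_ij.

    rates[i][j] is the incidence rate for age group i and sex j.
    """
    total = sum(sum(row) for row in rates)
    proportions = [[y / total for y in row] for row in rates]
    return total, proportions


def recombine(total_estimate, proportion_estimates):
    """Multiply estimated proportions by the estimated total: y'_ij = P'_ij * S'."""
    return [[p * total_estimate for p in row] for row in proportion_estimates]


# Made-up rates for three age groups (rows) and two sexes (columns).
rates = [[2.0, 3.0], [5.0, 4.0], [6.0, 5.0]]
total, proportions = decompose(rates)
rebuilt = recombine(total, proportions)
```

Because the proportions sum to one, a regression fitted to a single proportion is tied to the behaviour of all the other groups through S.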
The advantage of this is that the trend for each age
group is kept intact, but it is present in the context of the
other data groups. This allows more data points to
influence each model, which in turn increases the
accuracy of the predictions by allowing them to be guided
by the overall trend as well as the age specific trends.
Exponential regressions are used for the total
incidence, as it can be shown that these reflect the general
trend. A variety of regression types are used for the
proportion model, to help determine the ideal model type.
The models are fitted to a ten-year range of data, and
forecast over the five years following that range. The use
of just ten years' worth of data for each model is intended
to ensure the fit represents only the latest trend of the
data, as the purpose of the models is to predict future incidence.
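A minimal sketch of the fit-and-forecast step for the total incidence is given below. It assumes that the exponential regression is obtained by ordinary least squares on the logarithm of the total, which the text does not specify, and it measures time in years from the start of the fitting window. The data are synthetic.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_exponential(ts, totals):
    """Fit S = scale * exp(rate * t) via a straight line through ln(S)."""
    rate, ln_scale = fit_line(ts, [math.log(s) for s in totals])
    return math.exp(ln_scale), rate


def forecast(scale, rate, ts):
    """Evaluate the fitted exponential at the given times."""
    return [scale * math.exp(rate * t) for t in ts]


# Synthetic ten-year window (t = 0..9) followed by a five-year forecast.
window = list(range(10))
totals = [100.0 * math.exp(0.03 * t) for t in window]
scale, rate = fit_exponential(window, totals)
future = forecast(scale, rate, range(10, 15))
```

Restricting the fit to the most recent ten points means that older behaviour of the series has no influence on the forecast.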
The reason for approaching the problem in this manner
is that the actual incidence rates themselves contain too
many fluctuations, even after smoothing, to allow exact
fitting with regression techniques. In addition, the
variations in behaviour between age groups mean that no
single regression model would be able to fit all age groups
well.
The motivation behind the desire to apply a single type
of model to all of the age groups is that each set of data is
not independent from the others. The people from one age
group are exposed to a similar environment to all of the
other groups, and they also share some cultural effects with
the other groups. This means that the different incidence
rates are a reflection of each group's different biological
and behavioural responses to their common environment
and culture. Therefore, we make the assumption that the
trend for each age group is based, in part, on this
underlying function.
It is preferable that the model has some link to the
theorised real world situation, so using one type of model
to represent this underlying function is a justifiable goal.
5
Results
5.1 Data Analysis
The trend in incidence rates between 1973-2000 has many
interesting features, with the overall incidence rate
increasing by 320%. The differences between male and
female incidence have also increased over this time span.
Female incidence increased by 282.8%, compared to a
366.7% increase in males. This difference has increased
the disparity between male and female from 9.41% to
32.4% since 1973.
Categorising the data into 5-year age groups, further
differences can be seen. The majority of the age groups
display a general tendency to increase over the whole
interval, with a variable rise and fall pattern.
Certain age groups show particularly interesting
behaviour. For example, in the 20-24 age range for males,
the trend increases until the start of the nineties, at which
point it begins to decrease. The result is only a 10%
increase over the time interval, whereas there is a 95%
increase for females in the same age group.
A global view of the data shows male incidence is
greater than female incidence. There exist differences
between male and female incidence within age groups, as
Fig. 1 illustrates. In younger age groups women have a
higher incidence than men; in older age groups this is
reversed. This may be caused by, amongst other factors,
differences in biological changes as men and women age.
Fig. 1. Percentage difference between Female and Male
(Female - Male) incidence rates, by age group.
[Figure: % Difference (vertical axis) against Age Group (horizontal axis).]
5.2 The Models
The following terminology will be used in this section:
t is the year of diagnosis;
$\alpha_i, \beta_i, \gamma_i, \delta_i$ are age specific constants determined by
the regression procedure;
$\theta_i, \kappa_i, \lambda_i, \mu_i$ are the combined constants formed from the
regression constants, used in the final model.
In this example, all models use an exponential model to fit
the overall incidence, with separate models being
produced for males and females.
5.2.1 Model 1
This model uses a linear regression to fit $P_{ij}$, giving
models of the form:
$$y'_i = (\gamma_i e^{\delta_i t})(\alpha_i t + \beta_i)$$
$$\Rightarrow y'_i = \theta_i t e^{\delta_i t} + \kappa_i e^{\delta_i t} \tag{1}$$
where $\theta_i = \gamma_i \alpha_i$ and $\kappa_i = \gamma_i \beta_i$.
5.2.2 Model 2
A Log-linear model is used to fit $P_{ij}$, giving models of
the form:
$$y'_i = (\gamma_i e^{\delta_i t})(\alpha_i \ln(t) + \beta_i)$$
$$\Rightarrow y'_i = \theta_i e^{\delta_i t} \ln(t) + \kappa_i e^{\delta_i t} \tag{2}$$
5.2.3 Model 3
An exponential model is used to fit $P_{ij}$, giving models of
the form:
$$y'_i = (\gamma_i e^{\delta_i t})(\alpha_i e^{\beta_i t})$$
$$\Rightarrow y'_i = \lambda_i e^{\mu_i t} \tag{3}$$
where $\lambda_i = \gamma_i \alpha_i$ and $\mu_i = \delta_i + \beta_i$.
5.3 Results Using the Models
Each column of Tables 1 & 2 shows the standard error,
$$\mathrm{S.E.} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{n}},$$
where n is the number of
years fitted, expressed as a percentage of the value
predicted by the model. The models were all fitted to the
10 years of data from 1986-1995 and used to predict the
incidence of the next five years. The data for 1996-2000
had no input at all to the model, so all predictions are
based solely on historical data. This measure of error is
equivalent to the actual percentage difference between the
predicted and actual value.
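The three model forms and the error measure can be sketched together as below. Several points are assumptions rather than the procedure used for Tables 1 & 2: the log-linear and exponential fits are obtained by least squares on transformed variables, t is counted from the start of the fitting window, the standard error is expressed relative to the mean predicted value, and the data are synthetic with an exactly linear proportion.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_proportion(kind, ts, ps):
    """Return a function t -> estimated proportion for one age group."""
    if kind == "linear":        # Model 1: P = a*t + b
        a, b = fit_line(ts, ps)
        return lambda t: a * t + b
    if kind == "log-linear":    # Model 2: P = a*ln(t) + b
        a, b = fit_line([math.log(t) for t in ts], ps)
        return lambda t: a * math.log(t) + b
    if kind == "exponential":   # Model 3: P = a*exp(b*t)
        b, ln_a = fit_line(ts, [math.log(p) for p in ps])
        return lambda t: math.exp(ln_a + b * t)
    raise ValueError(kind)


def fit_total(ts, totals):
    """Exponential fit to the total incidence S."""
    d, ln_g = fit_line(ts, [math.log(s) for s in totals])
    return lambda t: math.exp(ln_g + d * t)


def percent_standard_error(actual, predicted):
    """Root mean squared error as a percentage of the mean predicted value."""
    n = len(actual)
    se = math.sqrt(sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / n)
    return 100.0 * se / (sum(predicted) / n)


# Synthetic series: 10 years to fit (t = 1..10), 5 years to predict (t = 11..15).
years = list(range(1, 16))
totals = [50.0 * math.exp(0.04 * t) for t in years]
props = [0.05 + 0.002 * t for t in years]
rates = [p * s for p, s in zip(props, totals)]

fit_ts, test_ts = years[:10], years[10:]
total_model = fit_total(fit_ts, totals[:10])
errors = {}
for kind in ("linear", "log-linear", "exponential"):
    prop_model = fit_proportion(kind, fit_ts, props[:10])
    predicted = [prop_model(t) * total_model(t) for t in test_ts]
    errors[kind] = percent_standard_error(rates[10:], predicted)
```

With this synthetic input the linear form reproduces the held-out years almost exactly, while the other two forms show a non-zero error, which mirrors the way the choice of proportion model drives the differences between the columns of the tables.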
The model presented in [2], applied to the data from 1979-1989
and used to predict incidence in 1990-1994, produced
errors of the order of 10%. This was an improvement on
the 15% errors reported by Dyba [5] for their method.
Experiments actually showed that the average of 15%
only applied to females in certain age ranges and was
worse outside this data set.
Age            Model 1    Model 2    Model 3
20-24 years    38.746%     5.546%    22.581%
25-29 years    31.053%     2.068%    18.404%
30-34 years    34.552%     5.910%    12.743%
35-39 years    25.694%     4.016%    10.962%
40-44 years     2.303%    12.037%     4.552%
45-49 years     2.643%     8.951%     3.386%
50-54 years     6.874%     4.569%     5.762%
55-59 years     1.706%     3.280%     1.466%
60-64 years     3.783%     4.324%     3.801%
65-69 years     3.052%     2.547%     3.117%
70-74 years     6.471%     4.958%     6.476%
75-79 years     2.006%     1.622%     2.093%
80-84 years     5.322%     7.227%     5.295%
85+ years       7.108%     5.070%     7.188%
Table 1. Standard errors represented as percentages of the
predicted values of male melanoma incidence for the
period covering 1995-2000.
Age            Model 1    Model 2    Model 3
20-24 years    21.955%     5.482%    18.316%
25-29 years    13.792%     4.552%    10.960%
30-34 years    15.310%     3.321%    12.760%
35-39 years     5.718%     0.035%     4.815%
40-44 years     6.794%     0.642%     5.780%
45-49 years     2.219%     0.792%     2.278%
50-54 years    17.483%    10.989%    16.474%
55-59 years     8.925%     6.846%     8.807%
60-64 years     6.406%     4.657%     6.527%
65-69 years     7.151%     9.496%     7.141%
70-74 years    14.138%    16.648%    14.128%
75-79 years    11.412%    10.289%    11.482%
80-84 years    10.110%     8.916%    10.338%
85+ years       3.161%     2.799%     3.292%
Table 2. Standard errors represented as percentages of the
predicted values of female melanoma incidence for the
period covering 1995-2000.
6
Conclusions
The models behaved approximately as predicted, with
the errors being comparable with those for a more
historical data set. The previous paper [2] fitted the 1979-1989
data set to predict 1990-1994. For that data,
maximum errors were almost all of the order 20%, with
the majority of predictions well within 10%. The worst
errors occurred for the age groups 20-24 and 50-54, which
can also be seen with our latest prediction.
The inflexibility of the linear Model 1 is still very much
in evidence, as it is unable to predict the wide variety of
data necessary. Model 2 fares much better, with the
logarithmic function well suited to most of the age
groups, with some fine tuning needed to get truly constant
results. Model 3 shows similar errors to Model 2. The
only significant deviations from this are with the 20-34
age groups, where the exponential function was unable to
cope with the sudden upswing in incidence that happened
in the final 2 years of the 20th Century.
The behaviour of the predictions of Model 2 reveals
something interesting. For the males, the errors of
prediction are fairly evenly distributed over the age
groups. Whilst there is a peak at the 40-44 age range, the
surrounding errors indicate this is a peculiarity of that
particular data set as opposed to a trend in the errors.
There is, however, a noticeable trend in the errors for the
female prediction. The errors for those younger than 50
are much lower than those for the older age groups.
Similar differences can be seen in the other two model
types, with a difference in the average prediction error
between young (<50) and old (>50). This, combined with
the changing difference between male and female
incidence with regard to age (Fig. 1), suggests that the
behaviour of incidence trends is significantly different
between young and old age groups. This is not
unexpected as, looking at it from a multilevel perspective,
this merely states that people from the same population
have more in common with each other if they are of a
similar age. This is generally true both biologically and
also behaviourally, see [8].
These differences in the trends of incidence between the
sexes are very interesting. The exact reason for these
differences is unknown at this time, but it is most likely
caused by a combination of factors. There may be
behavioural differences between the sexes that cause men
to be more susceptible later in life. It is possible that
women tend to have a more constant exposure to
causative factors over their life, whereas a male’s
exposure increases with age. Another possibility is that
there may be some link between the menopause and a
reduced risk increase over time. This is suggested by the
fact that women have higher incidence than men earlier in
life, but after 40 the rate of change of incidence, with
respect to age, reduces. The possibility of these being
influencing factors requires further investigation.
7
Suggestions for Further Work
The method presented in this paper is a basic multilevel
model, where the levels are defined on the age parameter.
The two levels used are ‘all age groups’ and ‘individual
age groups’, with the proportion between the two being
the variable of interest. Having two levels in the data
allows the prediction of each age group to be influenced
by all of the data, rather than just a subset. The most
relevant data, that data pertaining to the specific age
group, has more weight when fitting the model. The extra
data points are used to guide the model to produce more
accurate and reliable predictions.
The results from the model on the new data are
encouraging, with the errors being consistent with those
for other time periods. Of course, improvement is needed
as unacceptable errors are still evident. Some of the poor
predictions can be filtered out by comparing the fit of the
model to the past data before future predictions are made.
Also, a wider variety of model types can be applied, to
find a better match to trends.
In addition, the changing relationship between male and
female incidence rates shown in Fig.1 indicates a change
in one of the underlying factors affecting the age groups.
Multilevel modelling is a widely used technique, which is
designed to include several ‘levels’ of inter-dependent
data into one model. As stated before, it has previously
been applied to melanoma data in a geographic context.
Regardless of the causes, it may be possible to include
these differences in the model by employing a multilevel
approach, taking the novel approach of using the age
variable as the basis of the levels. Better results might be
obtained using a model that was able to take into account
both the 5-year age group trends and the larger picture
regarding female incidence being larger than males in the
20-40 range. This will enable different equations to be
applied to different stages of the data, whilst still allowing
for the influence of the full range of the data set.
Each piece of data will be seen in the context of the
behaviour of the population as a whole, as well as the
behaviour of similar age groups. This will allow more
consistently accurate predictions, as the fit will be more
heavily weighted towards the most relevant data.
The question of selection of model type is also of great
importance, as can be seen by the changes in accuracy
between the two types of regression used here. In this
case, the model type is chosen by comparing the fits of
the models. This is done manually, which is of course not
ideal. What is required is an algorithm that would be able
to take two (or more) models and, based on various
criteria, choose the one that will produce the most reliable
prediction. Some standard criteria could be used, such as
Pearson's correlation coefficient and standard errors, as
well as more specific factors. For example, if the
proposed model predicts a turning point for the data in the
prediction range the behaviour after that point would be
uncertain, as there are no further data points to guide that
model with. In this case, the model should be rejected
before it is used to make a prediction, or the behaviour
after this point should in some way be influenced by the
behaviour of the rest of the data.
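One possible shape for such a selection algorithm is sketched below. It is an illustration under stated assumptions, not the algorithm that is to be developed: candidates are represented as plain prediction functions, the only rejection rule is a turning point inside the prediction range, and the remaining candidates are ranked by root mean squared error on the fitted years alone. Other criteria mentioned above, such as Pearson's correlation coefficient, are not included.

```python
import math


def has_turning_point(values):
    """True if the sequence changes direction (rises then falls, or the reverse)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    rising = [d > 0 for d in diffs if d != 0]
    return any(r != rising[0] for r in rising[1:])


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / n)


def select_model(candidates, fit_ts, fit_ys, forecast_ts):
    """Return the name of the best-fitting candidate with no turning point
    in the prediction range, or None if every candidate is rejected."""
    best = None
    for name, predict in candidates.items():
        if has_turning_point([predict(t) for t in forecast_ts]):
            continue  # behaviour after the turning point would be unguided
        error = rmse(fit_ys, [predict(t) for t in fit_ts])
        if best is None or error < best[1]:
            best = (name, error)
    return None if best is None else best[0]


# Hypothetical, already-fitted candidates and made-up data.
fit_ts = list(range(10))
fit_ys = [2.0 * t + 1.0 for t in fit_ts]
forecast_ts = list(range(10, 15))
candidates = {
    "line": lambda t: 2.0 * t + 1.0,
    "shallow line": lambda t: 1.9 * t + 1.5,
    "parabola": lambda t: 21.0 - 0.2 * (t - 11) ** 2,
}
chosen = select_model(candidates, fit_ts, fit_ys, forecast_ts)
```

Here the parabola peaks inside the prediction range and is rejected before its fit is even considered, which is the behaviour described in the example above.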
The fluctuating nature of the data is a great challenge to
accurately predicting the future incidence. Therefore,
inclusion of an appropriate cyclical term in the model
might enable more accurate predictions. Analysis using
periodograms has discovered cyclic patterns in certain
melanoma data, and might also prove useful for this data
set [9].
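As an indication of how a periodogram exposes a cycle, the sketch below computes a raw periodogram with a direct discrete Fourier transform. It is not the analysis of [9]: the series is synthetic, only the mean is removed beforehand, and no smoothing or significance testing is attempted.

```python
import cmath
import math


def periodogram(series):
    """Return (frequency, power) pairs for frequencies k/n, k = 1..n//2."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        result.append((k / n, abs(coeff) ** 2 / n))
    return result


# Synthetic yearly series with a 4-year cycle over 24 years.
series = [math.sin(2 * math.pi * t / 4) for t in range(24)]
frequency, power = max(periodogram(series), key=lambda pair: pair[1])
period = 1.0 / frequency
```

A dominant period found in this way could supply the frequency of a cyclical term added to the regression models.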
Ongoing work includes the development of a software
tool that encompasses the presented methods. This
application will act as a ‘black box’ for the user. A data
set will be read in and the program will analyse and fit a
variety of different models, from a library of such
techniques. Analysis based on the fit and stability of each
model will be made, and the program will select the
model that will produce the most consistently accurate
predictions of the data. This model, its predictions and its
confidence intervals can then be applied as the user sees
fit.
References
[1] Ries L A G, Eisner M P, Kosary C L, Hankey B F,
Miller B A, Clegg L, Edwards BK (eds)., SEER Cancer
Statistics Review, 1973-200 (National Cancer Institute.
Bethesda, MD, 2003).
[2] Brown A, Maple C, Prediction of Malignant Melanoma
Incidence, Modelling and Simulation Conference 2003
proceedings, IASTED, 2003, 234-239.
[3] Graham S, Marshall J, Haughey B, Stoll H, Zielezny
M, Brasure J & West D, An inquiry into the epidemiology
of melanoma, American Journal of Epidemiology, 122(4),
1985, 606-619.
[4] Gellin G A, Kopf A W & Garfinkel L, Malignant
Melanoma: A Controlled Study of Possibly Associated
Factors, Archive of Dermatology, 99, 1969, 43-48.
[5] Dyba T, Hakulinen T & Paivarinta L, A Simple Non-linear
model in incidence prediction, Statistics in
Medicine, 16, 1997, 2297-2309.
[6] Boucher K M, Slattery M L, Berry T D, Quesenberry
C & Anderson K, Statistical Methods in Epidemiology: A
Comparison of Statistical Methods to Analyze Dose-response
and Trend Analysis in Epidemiologic Studies,
Journal of Clinical Epidemiology, 51(12), 1998, 1223-1233.
[7] Langford I H, Bentham G et al, ‘Multilevel Modelling
of Geographically Aggregated Health Data: A Case Study
on Malignant Melanoma Mortality and UV Exposure in
the European Community’, Statistics in Medicine,17,
1998, 41-57.
[8] Wheeler S and Selby P, 'Confronting Cancer: Cause
and Prevention', (Penguin Books Ltd, Harmondsworth,
England, 1993).
[9] Dimitriov B D, Similar high-frequency cycles in the
annual levels of solar ultraviolet radiation, stratospheric
ozone concentration and incidence of malignant
melanoma of the skin, Department of Environmental
sciences and policy Journal, 1, 1998, 14-20.