Age-based Multilevel Regression Modelling of Melanoma Incidence in the USA

Antony Brown, Carsten Maple, Malcolm Keech
Computing and Information Systems
University of Luton
Park Square, Luton, Bedfordshire, LU1 3JU
ENGLAND

Abstract

The changing incidence of cancer is an increasing problem, and it would be of great benefit to be able to predict future changes. Brown and Maple have previously proposed a method for modelling such data [2]. The model appeared to perform better than existing methods when applied to data available at that time. In this work we apply the techniques to newly available data, which confirms the accuracy of the estimates predicted in [2]. We also present suggestions for improvements to the method.

Key-Words: Epidemiology, Melanoma, Modelling, Regression, Prediction.

1 Introduction

Prediction of future cancer incidence rates allows better planning and resource allocation for prevention and treatment. This becomes increasingly important for cancers whose incidence is on the increase, as without proper prediction the burden could far outstrip the provisions made. In the past, a range of modelling techniques has been applied to various aspects of this problem.

Accurate modelling and prediction is useful since any trends that are identified can be compared to underlying trends in other phenomena. This comparison can then be used to help confirm or deny causative factors for the disease. Likewise, known (or suspected) causative factors may be included in a model to improve the results of predictions.

The type of data required for a study depends on the model used and the intentions of the research. There is a balance to be struck between the depth and breadth of data. Generally, the more variables obtained for each individual, the fewer individuals are included in the data.
For example, a clinical trial would be able to obtain a large amount of information about its participants, including such things as familial history, eye colour and even genetic samples. However, the number of participants would also be relatively small, since collection of data this detailed is very time consuming and costly. Conversely, a cancer registry would hold records for a large number of people but, as the data would be drawn from a variety of sources such as hospitals and clinics, only information that is routinely collected would be available. Hence, it may be considered that the volume of data is of the same order for any reasonable study. Models considering associations between genetic makeup and melanoma would obviously need access to the in-depth data that would come from a special clinical study and would therefore be limited in the breadth of data. A study looking at the links between age and melanoma would benefit from having access to a much larger pool of data, but any predictions would be more generalised than those of a model with more predictive variables. This illustrates how the aim of a study determines the breadth and hence depth of the data required.

This article uses data from the National Cancer Institute's SEER program [1]. This data originally consisted of incidence rates for the U.S. for 1973-1997, categorised by sex and split into 5-year age groups. Since the publication of [2], data from three further years, 1998-2000, have been publicly released and the models have been used to predict these new data points. The data comes from a large population, and so can be used to predict the general trends in melanoma incidence for the U.S. It is important to note that the number of melanoma cases for ages less than 20 is too low to produce reliable incidence rates, and so these ages will be excluded from this study, as was the case in our previous study.
This article reintroduces the previously presented novel prediction method [2], evaluates its performance with some newly available data and presents a proposed extension of the method.

2 Epidemiology of Melanoma

Whilst the positive relationship between UV exposure and other forms of skin cancer has long been shown to exist, the relationship with melanoma is much more complicated, as can be seen in [1], [3] and [4]. For example, from [3] it can be seen that the body sites most commonly affected by melanoma are not those that receive the most exposure to sunlight. In addition, occupations that usually receive large amounts of sun exposure (such as unskilled labour, farming etc.) have a lower risk of melanoma than other occupations, contrary to what might first be thought. One proposed explanation for the relationship between melanoma and sunlight is that it is not a cumulative effect, but depends on sporadic exposure to larger amounts of UV radiation than is usual.

Positive links have been suggested between melanoma incidence and economic factors, as well as with managerial occupations [3]. This might possibly be explained by the link between increased salary and number of exotic foreign holidays, which would expose the individual to unaccustomedly large amounts of UV radiation for brief periods of time. However, this type of association is very hard to confirm as there are numerous other factors that could also play a role.

As with the vast majority of cancers, gender has also been shown to have a significant role in the epidemiology of melanoma, see [3]. The overall incidence rates for women are lower than for men, and the age specific distributions for the sexes are different as well, which will be shown later. There are also differences between the sites of the body most commonly affected in men and women.

Race also has an effect on melanoma incidence, with darker skinned races having much lower rates than lighter skinned races.
The link between fair complexion and increased risk of melanoma has been shown, with features such as fair hair, light-coloured eyes and a tendency to freckle all showing an increased risk [4].

The presented model, however, only looks at the effects of age, sex and time on melanoma incidence, as we are interested in a general population model rather than one that concerns the risks of individuals. The data for these variables is more readily available on the larger scale needed to make a useful population specific prediction. The data used will only come from those races classified as ‘white’ as their incidence is much higher than other races, and it is important to keep as many possible causative variables constant.

3 Existing Methods

A variety of modelling techniques have been applied to cancer incidence/mortality. In general, non-linear models give better representations of the actual patterns present in the data, and so are useful in etiological studies. However, non-linear models can prove unreliable outside of the range of data they were fitted to. This is due to the fact that they react best to local data rather than global, and as such they are better suited to interpolation than extrapolation. Without some way to govern this effect, they are often unsuitable for predictive purposes, for which linear techniques prove more useful, since their behaviour is more stable. This section will review some of the techniques that have been most influential in the development of the proposed novel method.

3.1 Linear, Log-linear and Non-Linear models

Linear and Log-linear models with the purpose of prediction of cancer were proposed by Dyba et al [5]. These models are non-linear in parameters, but linear in form. This combines the flexibility of a non-linear model with the stability of a linear one.
The models produce separate predictions for various age groups within the population, but the parameters for all age groups are determined at the same time. The advantage of this is that it allows individual age groups to have separate rates of increase, whilst still allowing each rate to be influenced by all of the data. These models are specifically designed for prediction purposes, and as such they are not bound by any constraints that closely resemble the physical processes that actually take place. However, the predictions given have significant prediction intervals, which decreases their usefulness. The ability to allow the future behaviour of each age group to be influenced by the overall trend in the rest of the data is a useful one, and efforts have been made to incorporate this into the novel method proposed.

3.2 Spline Regression

The spline regression technique fits a series of polynomial equations (usually quadratic or cubic) to the historical data of the disease [6]. The flexibility of the spline allows an extremely accurate fit to rapidly changing data. This means that the spline model will usually give very accurate results within the interval for which it was fitted, but is much less accurate for prediction of values outside that region. Therefore, splines are well suited to studies trying to find/prove associations between cancer incidence and some causative factor, for example annual sunlight exposure. The spline model can provide a very useful guide to the relationship between exposure and disease, see [6]. It is possible to apply cubic splines to the data, but careful consideration has to be given to ensure that their behaviour outside the data they are based on is consistent with historic data.

3.3 Multilevel Modelling

A multilevel model simultaneously models a situation at several levels of detail, with the purpose of seeing these levels as a whole, rather than as independent pieces.
Such models can allow trends on one scale of the model to affect those on another, giving a more complete picture of the real situation. In the case of melanoma, this technique has been applied to mortality by Langford, Bentham and McDonald [7]. The model was used to investigate the relationship between Ultra Violet B (UVB) exposure and melanoma mortality in Europe. Thus, geographical groupings were used as the different levels of the model. The data is grouped into countries, which in turn are split into geographical regions, each containing counties. Several models were produced by iteratively fitting a generalized least squares estimation to the mortality data from nine European countries.

Previously, a quadratic relationship between melanoma and latitude, with regard to UVB exposure, had been identified. However, the results showed that there was no clear relationship between melanoma mortality and UVB. Some countries showed a positive relationship, others showed no significant relationship and a few even displayed a negative relationship. This discrepancy is explained by the fact that whilst the whole data set may display a positive correlation with UVB, the multilevel model shows the underlying trends that contribute to this overall effect. Thus, it becomes apparent that the relationship with UVB varies with different populations, and so is more complicated than was previously supposed.

This model was applied to the evaluation of past trends and their relationship to various causative factors, and not to prediction. However, the ability to see trends in the data from a variety of perspectives, and to determine how one trend is affected by another, would be very useful in a prediction model. Therefore the proposed method includes a limited multilevel aspect and can be extended to include further levels of detail.

4 The Proposed Method

This method incorporates the idea of a multilevel approach, combined with a proportional view of the data.
All of the data is expressed as a proportion of the larger data set, which produces two levels to the model. This novel application of proportions to this type of data helps produce more consistent predictions.

The raw data from the cancer registry is converted into age-adjusted (to the standard world population) incidence rates, $y_{ij}$, where $i$ is the age group and $j$ is the gender, to allow comparisons between age groups. In order to help distinguish the incidence trend from random fluctuations, data smoothing techniques are applied. In the case of this method, a 3-year moving mean is used. This is applied a limited number of times to ensure the loss of detail does not interfere with the regression process.

The model is created using two different sets of data, the first being the yearly sum of the incidence rates from all age groups, $S = \sum_{i=1}^{15} \sum_{j=1}^{2} y_{ij}$. The second is the age specific proportion of the sum of the incidence that the incidence makes up, $P_{ij} = y_{ij} / S$. The product of these two values gives the age specific incidence rate. Therefore, the product of estimations to these values will be estimations to the age specific incidence rates, $P'_{ij} S' = y'_{ij}$.

The proportion value, $P_{ij}$, contains information about the age specific incidence, but it is in relation to $S$. This helps reduce the effects of random variations in the data, by expressing them as a proportion of the total as well as in absolute terms. This allows regressions to be fitted to the proportion and the sum, and the two combined, to produce a prediction for incidence. The advantage of this is that the trend for each age group is kept intact, but it is present in the context of the other data groups. This allows more data points to influence each model, which in turn increases the accuracy of the predictions by allowing them to be guided by the overall trend as well as the age specific trends. Exponential regressions are used for the total incidence, as it can be shown that these reflect the general trend.
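The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the results in this paper: the function names, the handling of the end points in the moving mean, the use of a year index starting at 1, the least-squares fit in log space and the synthetic input rates are all hypothetical choices, and a straight-line fit to each proportion is used purely as an example of a proportion model.

```python
import numpy as np

def moving_mean3(x):
    """3-year centred moving mean along the year axis.
    End points are left unsmoothed (an assumption, not from the paper)."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    out[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return out

def decompose(rates):
    """Split rates (years x groups) into the yearly sum S and proportions P = y / S."""
    rates = np.asarray(rates, dtype=float)
    S = rates.sum(axis=1)
    P = rates / S[:, None]
    return S, P

def fit_exponential(t, s):
    """Fit s ~ a * exp(b * t) by a straight-line least-squares fit to log(s)."""
    b, log_a = np.polyfit(t, np.log(s), 1)
    return np.exp(log_a), b

def forecast(t_fit, rates, t_new):
    """Forecast y' = P' * S' using an exponential S' and a linear P' per group."""
    S, P = decompose(rates)
    a, b = fit_exponential(t_fit, S)
    S_new = a * np.exp(b * np.asarray(t_new, dtype=float))      # S'
    # polyfit with a 2-D y fits every column; result has shape (2, groups)
    coeffs = np.polyfit(t_fit, P, 1)
    P_new = np.outer(t_new, coeffs[0]) + coeffs[1]              # P'
    return S_new[:, None] * P_new                               # y'

# Synthetic rates for three hypothetical groups over a ten-year window
t_fit = np.arange(1, 11)
shares = np.array([0.2, 0.3, 0.5]) + np.outer(t_fit, [0.002, 0.0, -0.002])
rates = (10.0 * np.exp(0.03 * t_fit))[:, None] * shares

smoothed = moving_mean3(rates)                 # optional smoothing step
print(forecast(t_fit, rates, np.arange(11, 16)))
```

Because each proportion is fitted separately, the forecast proportions are not forced to sum to one; whether to renormalise them is a design choice this sketch leaves open.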
A variety of regression types are used for the proportion model, to help determine the ideal model type. The models are fitted to a ten-year range of data, and forecast over the five years following that range. The use of just ten years' worth of data for each model is intended to ensure the fit represents only the latest trend of the data, as their purpose is to predict future incidence.

The reason for approaching the problem in this manner is that the actual incidence rates themselves contain too many fluctuations, even after smoothing, to allow exact fitting with regression techniques. In addition, the variations in behaviour between age groups mean that no single regression model would be able to fit all age groups well. The motivation behind the desire to apply a single type of model to all of the age groups is that each set of data is not independent from the others. The people from one age group are exposed to a similar environment to all of the other groups, and they also share some cultural effects with the other groups. This means that the different incidence rates are a reflection of each group's different biological and behavioural responses to their common environment and culture. Therefore, we make the assumption that the trend for each age group is based, in part, on this underlying function. It is preferable that the model has some link to the theorised real world situation, so using one type of model to represent this underlying function is a justifiable goal.

5 Results

5.1 Data Analysis

The trend in incidence rates between 1973-2000 has many interesting features, with the overall incidence rate increasing by 320%. The differences between male and female incidence have also increased over this time span. Female incidence increased by 282.8%, compared to a 366.7% increase in males. This difference has increased the disparity between male and female from 9.41% to 32.4% since 1973.
Categorising the data into 5-year age groups, further differences can be seen. The majority of the age groups display a general tendency to increase over the whole interval, with a variable rise and fall pattern. Certain age groups show particularly interesting behaviour. For example, in the 20-24 age range for males, the trend increases until the start of the nineties, at which point it begins to decrease. The result is only a 10% increase over the time interval, whereas there is a 95% increase for females in the same age group.

A global view of the data shows male incidence is greater than female incidence. There exist differences between male and female incidence within age groups, as Fig. 1 illustrates. In younger age groups women have a higher incidence than men; in older age groups this is reversed. This may be caused by, amongst other factors, differences in biological changes as men and women age.

[Figure: % Difference (vertical axis) plotted against Age Group (horizontal axis)]
Fig. 1. Percentage difference between Female and Male (Female - Male) incidence rates, by age group.

5.2 The Models

The following terminology will be used in this section: $t$ is the year of diagnosis; $\alpha_i$, $\beta_i$, $\gamma_i$, $\delta_i$ are age specific constants determined by the regression procedure; $A_i$, $B_i$, $C_i$, $D_i$ are combinations of the regression constants, used in the final model. In this example, all models use an exponential model to fit the overall incidence, with separate models being produced for males and females.

5.2.1 Model 1

This model uses a linear regression to fit $P_{ij}$, giving models of the form:

$$y'_i = (\alpha_i e^{\beta_i t})(\gamma_i t + \delta_i) = A_i e^{\beta_i t}\, t + B_i e^{\beta_i t} \qquad (1)$$

with $A_i = \alpha_i \gamma_i$ and $B_i = \alpha_i \delta_i$.

5.2.2 Model 2

A Log-linear model is used to fit $P_{ij}$, giving models of the form:

$$y'_i = (\alpha_i e^{\beta_i t})(\gamma_i \ln(t) + \delta_i) = A_i e^{\beta_i t} \ln(t) + B_i e^{\beta_i t} \qquad (2)$$

with $A_i = \alpha_i \gamma_i$ and $B_i = \alpha_i \delta_i$.

5.2.3 Model 3

An exponential model is used to fit $P_{ij}$, giving models of the form:

$$y'_i = (\alpha_i e^{\beta_i t})(\gamma_i e^{\delta_i t}) = C_i e^{D_i t} \qquad (3)$$

with $C_i = \alpha_i \gamma_i$ and $D_i = \beta_i + \delta_i$.

5.3 Results Using the Models

Each column of Tables 1 & 2 shows the standard error,

$$S.E. = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{n}},$$

where $n$ is the number of years fitted, expressed as a percentage of the value predicted by the model. The models were all fitted to the 10 years of data from 1986-1995 and used to predict the incidence of the next five years. The data for 1996-2000 had no input at all to the model, so all predictions are based solely on historical data. This measure of error is equivalent to the actual percentage difference between the predicted and actual value.

The model presented in [2], applied to the data for 1979-1989 and used to predict incidence in 1990-1994, produced errors of the order of 10%. This was an improvement on the 15% errors reported by Dyba [5] for their method. Experiments actually showed that the average of 15% only applied to females in certain age ranges and was worse outside this data set.

Age            Model 1    Model 2    Model 3
20-24 years    38.746%     5.546%    22.581%
25-29 years    31.053%     2.068%    18.404%
30-34 years    34.552%     5.910%    12.743%
35-39 years    25.694%     4.016%    10.962%
40-44 years     2.303%    12.037%     4.552%
45-49 years     2.643%     8.951%     3.386%
50-54 years     6.874%     4.569%     5.762%
55-59 years     1.706%     3.280%     1.466%
60-64 years     3.783%     4.324%     3.801%
65-69 years     3.052%     2.547%     3.117%
70-74 years     6.471%     4.958%     6.476%
75-79 years     2.006%     1.622%     2.093%
80-84 years     5.322%     7.227%     5.295%
85+ years       7.108%     5.070%     7.188%

Table 1. Standard errors represented as percentages of the predicted values of male melanoma incidence for the period covering 1995-2000.

Age            Model 1    Model 2    Model 3
20-24 years    21.955%     5.482%    18.316%
25-29 years    13.792%     4.552%    10.960%
30-34 years    15.310%     3.321%    12.760%
35-39 years     5.718%     0.035%     4.815%
40-44 years     6.794%     0.642%     5.780%
45-49 years     2.219%     0.792%     2.278%
50-54 years    17.483%    10.989%    16.474%
55-59 years     8.925%     6.846%     8.807%
60-64 years     6.406%     4.657%     6.527%
65-69 years     7.151%     9.496%     7.141%
70-74 years    14.138%    16.648%    14.128%
75-79 years    11.412%    10.289%    11.482%
80-84 years    10.110%     8.916%    10.338%
85+ years       3.161%     2.799%     3.292%

Table 2. Standard errors represented as percentages of the predicted values of female melanoma incidence for the period covering 1995-2000.

6 Conclusions

The models behaved approximately as predicted, with the errors being comparable with those for a more historical data set. The previous paper [2] fitted the 1979-1989 data set to predict 1990-1994. For that data, maximum errors were almost all of the order of 20%, with the majority of predictions well within 10%. The worst errors occurred for the age groups 20-24 and 50-54, which can also be seen with our latest prediction.

The inflexibility of the linear Model 1 is still very much in evidence, as it is unable to predict the wide variety of data necessary. Model 2 fares much better, with the logarithmic function well suited to most of the age groups, with some fine tuning needed to get truly consistent results. Model 3 shows similar errors to Model 2. The only significant deviations from this are with the 20-34 age groups, where the exponential function was unable to cope with the sudden upswing in incidence that happened in the final 2 years of the 20th Century.

The behaviour of the predictions of Model 2 reveals something interesting. For the males, the errors of prediction are fairly evenly distributed over the age groups. Whilst there is a peak at the 40-44 age range, the surrounding errors indicate this is a peculiarity of that particular data set as opposed to a trend in the errors. There is, however, a noticeable trend in the errors for the female prediction. The errors for those younger than 50 are much lower than those for the older age groups. Similar differences can be seen in the other two model types, with a difference in the average prediction error between young (<50) and old (>50). This, combined with the changing difference between male and female incidence with regard to age (Fig. 1), suggests that the behaviour of incidence trends is significantly different between young and old age groups.
This is not unexpected as, looking at it from a multilevel perspective, this merely states that people from the same population have more in common with each other if they are of a similar age. This is generally true both biologically and behaviourally, see [8].

These differences in the trends of incidence between the sexes are very interesting. The exact reason for these differences is unknown at this time, but they are most likely caused by a combination of factors. There may be behavioural differences between the sexes that cause men to be more susceptible later in life. It is possible that women tend to have a more constant exposure to causative factors over their life, whereas a male’s exposure increases with age. Another possibility is that there may be some link between the menopause and a reduced risk increase over time. This is suggested by the fact that women have higher incidence than men earlier in life, but after 40 the rate of change of incidence, with respect to age, reduces. The possibility of these being influencing factors requires further investigation.

7 Suggestions for Further Work

The method presented in this paper is a basic multilevel model, where the levels are defined on the age parameter. The two levels used are ‘all age groups’ and ‘individual age groups’, with the proportion between the two being the variable of interest. Having two levels in the data allows the prediction of each age group to be influenced by all of the data, rather than just a subset. The most relevant data, that pertaining to the specific age group, has more weight when fitting the model. The extra data points are used to guide the model to produce more accurate and reliable predictions.

The results from the model on the new data are encouraging, with the errors being consistent with those for other time periods. Of course, improvement is needed as unacceptable errors are still evident.
Some of the poor predictions can be filtered out by comparing the fit of the model to the past data before future predictions are made. Also, a wider variety of model types can be applied, to find a better match to trends.

In addition, the changing relationship between male and female incidence rates shown in Fig. 1 indicates a change in one of the underlying factors affecting the age groups. Multilevel modelling is a widely used technique, which is designed to include several ‘levels’ of inter-dependent data in one model. As stated before, it has previously been applied to melanoma data in a geographic context. Regardless of the causes, it may be possible to include these differences in the model by employing a multilevel approach, taking the novel approach of using the age variable as the basis of the levels. Better results might be obtained using a model that was able to take into account both the 5-year age group trends and the larger picture regarding female incidence being higher than male incidence in the 20-40 range. This will enable different equations to be applied to different stages of the data, whilst still allowing for the influence of the full range of the data set. Each piece of data will be seen in the context of the behaviour of the population as a whole, as well as the behaviour of similar age groups. This will allow more consistently accurate predictions, as the fit will be more heavily weighted towards the most relevant data.

The question of selection of model type is also of great importance, as can be seen by the changes in accuracy between the two types of regression used here. In this case, the model type is chosen by comparing the fits of the models. This is done manually, which is of course not ideal. What is required is an algorithm that would be able to take two (or more) models and, based on various criteria, choose the one that will produce the most reliable prediction.
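One possible shape for such a selection step is sketched below in Python. It is a hypothetical illustration rather than the planned algorithm: it assumes a single criterion (the standard error as a percentage of the mean predicted value, measured on the last few years of the fitting window, which are held back from the fit), and the function names, the hold-back length and the candidate set are all assumptions. Further criteria could be added inside the same loop.

```python
import numpy as np

def percent_standard_error(actual, predicted):
    """Root-mean-square error as a percentage of the mean predicted value
    (the normalisation is an assumption made for this sketch)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return 100.0 * rmse / np.mean(predicted)

def fit_linear(t, p):
    c, d = np.polyfit(t, p, 1)                  # p ~ c * t + d
    return lambda x: c * np.asarray(x, dtype=float) + d

def fit_log_linear(t, p):
    c, d = np.polyfit(np.log(t), p, 1)          # p ~ c * ln(t) + d
    return lambda x: c * np.log(x) + d

def fit_exponential(t, p):
    d, log_c = np.polyfit(t, np.log(p), 1)      # p ~ c * exp(d * t)
    return lambda x: np.exp(log_c + d * np.asarray(x, dtype=float))

# Hypothetical library of candidate regression types
CANDIDATES = {
    "linear": fit_linear,
    "log-linear": fit_log_linear,
    "exponential": fit_exponential,
}

def select_model(t, p, holdout=3):
    """Fit every candidate on all but the last `holdout` points and
    return the name with the lowest percentage error on the held-back points."""
    t = np.asarray(t, dtype=float)
    p = np.asarray(p, dtype=float)
    t_train, p_train = t[:-holdout], p[:-holdout]
    t_test, p_test = t[-holdout:], p[-holdout:]
    scores = {}
    for name, fit in CANDIDATES.items():
        model = fit(t_train, p_train)
        scores[name] = percent_standard_error(p_test, model(t_test))
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic proportion series that follows a logarithmic trend
years = np.arange(1, 11)
best, scores = select_model(years, 0.05 * np.log(years) + 0.1)
print(best, scores)
```

Holding back the most recent years costs data that could otherwise inform the fit, so with a ten-year window the hold-back length would need to be chosen with care.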
Some standard criteria could be used, such as Pearson's correlation coefficient and standard errors, as well as more specific factors. For example, if the proposed model predicts a turning point for the data in the prediction range, the behaviour after that point would be uncertain, as there are no further data points to guide the model. In this case, the model should be rejected before it is used to make a prediction, or the behaviour after this point should in some way be influenced by the behaviour of the rest of the data.

The fluctuating nature of the data is a great challenge to accurately predicting the future incidence. Therefore, inclusion of an appropriate cyclical term in the model might enable more accurate predictions. Analysis using periodograms has discovered cyclic patterns in certain melanoma data, and might also prove useful for this data set [9].

Ongoing work includes the development of a software tool that encompasses the presented methods. This application will act as a ‘black box’ for the user. A data set will be read in and the program will analyse and fit a variety of different models, from a library of such techniques. Analysis based on the fit and stability of each model will be made, and the program will select the model that produces the most consistently accurate predictions of the data. This model, its predictions and its confidence intervals can then be applied as the user sees fit.

References

[1] Ries L A G, Eisner M P, Kosary C L, Hankey B F, Miller B A, Clegg L & Edwards B K (eds), SEER Cancer Statistics Review, 1973-200 (National Cancer Institute, Bethesda, MD, 2003).
[2] Brown A & Maple C, Prediction of Malignant Melanoma Incidence, Modelling and Simulation Conference 2003 proceedings, IASTED, 2003, 234-239.
[3] Graham S, Marshall J, Haughey B, Stoll H, Zielezny M, Brasure J & West D, An inquiry into the epidemiology of melanoma, American Journal of Epidemiology, 122(4), 1985, 606-619.
[4] Gellin G A, Kopf A W & Garfinkel L, Malignant Melanoma: A Controlled Study of Possibly Associated Factors, Archives of Dermatology, 99, 1969, 43-48.
[5] Dyba T, Hakulinen T & Paivarinta L, A Simple Nonlinear model in incidence prediction, Statistics in Medicine, 16, 1997, 2297-2309.
[6] Boucher K M, Slattery M L, Berry T D, Quesenberry C & Anderson K, Statistical Methods in Epidemiology: A Comparison of Statistical Methods to Analyze Dose response and Trend Analysis in Epidemiologic Studies, Journal of Clinical Epidemiology, 51(12), 1998, 1223-1233.
[7] Langford I H, Bentham G et al, Multilevel Modelling of Geographically Aggregated Health Data: A Case Study on Malignant Melanoma Mortality and UV Exposure in the European Community, Statistics in Medicine, 17, 1998, 41-57.
[8] Wheeler S & Selby P, Confronting Cancer: Cause and Prevention (Penguin Books Ltd, Harmondsworth, England, 1993).
[9] Dimitriov B D, Similar high-frequency cycles in the annual levels of solar ultraviolet radiation, stratospheric ozone concentration and incidence of malignant melanoma of the skin, Department of Environmental sciences and policy Journal, 1, 1998, 14-20.