THE IMPACT OF MULTIFACTORIAL GENETIC DISORDERS ON LONG-TERM INSURANCE

By Pradip Tapadar

Submitted for the Degree of Doctor of Philosophy at Heriot-Watt University on Completion of Research in the School of Mathematical and Computer Sciences, January 2007.

This copy of the thesis has been supplied on the condition that anyone who consults it is understood to recognise that the copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without the prior written consent of the author or the university (as may be appropriate).

I hereby declare that the work presented in this thesis was carried out by myself at Heriot-Watt University, Edinburgh, except where due acknowledgement is made, and has not been submitted for any other degree.

Pradip Tapadar (Candidate)
Professor Angus S. Macdonald (Supervisor)
Date

Contents

Acknowledgements
Abstract
Introduction

1 Genetics and Insurance
    1.1 Introduction
    1.2 Genes
    1.3 Genetic Disorders and Insurance
        1.3.1 Huntington’s Disease
        1.3.2 Alzheimer’s Disease
        1.3.3 Cancer
        1.3.4 Cardiovascular disease
    1.4 Genetics and Insurance Regulations
    1.5 The UK Biobank Project
    1.6 A UK Biobank Simulation Model

2 A Model for Heart Attack
    2.1 Specification of the Model
    2.2 The Heart Attack Transition Intensity
    2.3 Mortality After First Heart Attacks
        2.3.1 Literature Review
        2.3.2 Data
        2.3.3 Fitting a Parametric Function
        2.3.4 Discussion of the Fitted Model
    2.4 Mortality Before First Heart Attacks

3 Gene-Environment Interaction
    3.1 Definition of Strata: A Simple Example
    3.2 A Sample Realisation of UK Biobank
    3.3 Epidemiological Analysis
    3.4 An Actuarial Investigation
    3.5 Premium Rating for Critical Illness Insurance
        3.5.1 A Critical Illness Model
        3.5.2 Premium Rating for Critical Illness Insurance

4 UK Biobank Simulation Results
    4.1 Varying the Genetic and Environment Model
    4.2 Outcomes of 1,000 Simulations: The Base Scenario
    4.3 A Measure of Confidence
    4.4 Results
    4.5 Conclusions

5 Adverse Selection and Utility Theory
    5.1 Risk and Insurance
    5.2 Underwriting Risk
    5.3 Multifactorial Disorders
    5.4 Literature Review
    5.5 Adverse Selection
    5.6 Utility of Wealth
    5.7 Coefficients of Risk-aversion
    5.8 Families of Utility Functions
    5.9 Estimates of Absolute and Relative Risk-aversion

6 Adverse Selection in a 2-state Insurance Model
    6.1 A Simple Gene-environment Interaction Model
    6.2 Single Premiums
    6.3 Threshold Premium
    6.4 The Additive Epidemiological Model
    6.5 Immunity From Adverse Selection
    6.6 The Multiplicative Epidemiological Model

7 Adverse Selection in a Critical Illness Insurance Model
    7.1 A Heart Attack Model
    7.2 Threshold Premium for Critical Illness Insurance
    7.3 Premium Rates for Critical Illness Insurance
    7.4 High Relative Risks
    7.5 Conclusions

8 Conclusions
    8.1 UK Biobank Simulation Study
    8.2 Adverse Selection Issues

A Epidemiology
    A.1 Introduction
    A.2 Measuring risks
    A.3 Models of Disease Association
    A.4 Relative Risk and Odds Ratio
    A.5 Analysis of Grouped Data
    A.6 Analysis of Matched Studies
    A.7 Effects of Combined Exposures

B Numerical Methods
    B.1 Differential Equations
        B.1.1 Introduction
        B.1.2 Euler Method
        B.1.3 Runge-Kutta Method
    B.2 Random Numbers
        B.2.1 Introduction
        B.2.2 Uniform Deviates
        B.2.3 The Transformation Method
        B.2.4 The Rejection Method

References

List of Tables

2.1 Survival probabilities after first heart attack.
2.2 Parameter estimates.
2.3 Odds of dying within first 30 days, one year and two years following a first heart attack.
2.4 Adjusted odds ratios and the corresponding 95% confidence intervals of dying within first 30 days, one year and two years following a first heart attack according to Goldberg et al. (1998).
3.5 The factor ρs, in Equation (3.14), for each gene-environment combination.
3.6 The multipliers k s × ρuv for each stratum.
3.7 The true relative risks for each stratum, relative to the baseline ge stratum.
3.8 The simulated life histories of the first 20 (of 500,000) individuals showing their genders, exposure to environmental factors, genotypes and the times and types of all transitions made within 10 years.
3.9 Number of individuals in each state at the end of the 10-year follow-up period.
3.10 Odds ratios with respect to the ge stratum as baseline, based on a 1:5 matching strategy using all cases and 5-year age groups. Approximate 95% confidence intervals are shown in brackets. There were no cases among females age 45–49 in stratum GE.
3.11 The age-adjusted odds ratios calculated for both males and females.
3.12 The estimated multipliers cs for each stratum.
3.13 28-Day mortality rates, $q^m_{01}(x) = 1 - p^m_{01}(x)$, for males following heart attacks.
3.14 The true critical illness insurance premiums for different strata as a percentage of those for stratum ge.
3.15 The actuary’s estimated critical illness insurance premiums for different strata as a percentage of those for stratum ge.
4.16 The model parameters for different scenarios. Odds ratios are also shown.
4.17 The correlation matrix of the strata-specific premium rates for males aged 45 and policy term 15 years under the Base scenario, all cases included.
4.18 The correlation matrix of the premium ratings for males aged 45 and policy term 15 years under the Base scenario, all cases included.
4.19 The measure of overlap O for CI insurance premium ratings for males aged 45, with policy term 15 years, for different scenarios.
4.20 The measure of overlap O for CI insurance premium ratings for females aged 45 with policy term 15 years, for different scenarios.
4.21 The measure of overlap O for CI insurance premium ratings for males aged 45, with policy term 15 years, for different scenarios and a 1:1 matching strategy.
4.22 The number of simulations rejected due to the inability to calculate the odds ratios for a 1:1 matching strategy.
6.23 The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.5 and an additive model.
6.24 The proportions ω exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk k, using different utility functions.
6.25 The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.9 and an additive model.
6.26 The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.9 and a multiplicative model.
7.27 The premium rates of critical illness contracts of duration 15 years.
7.28 P† for males, which solves Z(P) = 0, for different combinations of utility functions and losses, using initial wealth W = £100,000.
7.29 P† for females, which solves Z(P) = 0, for different combinations of utility functions and losses, using initial wealth W = £100,000.
7.30 The population average premium rate for CI insurance, $P_0$, as if heart attack risk were absent ($\lambda_{12} = 0$).
7.31 The relative risk k above which males of different ages in stratum ge with initial wealth W = £100,000 will not buy critical illness insurance policies of term 15 years, where ω = 0.9.
7.32 The relative risk k above which females of different ages in stratum ge with initial wealth W = £100,000 will not buy critical illness insurance policies of term 15 years, where ω = 0.9.
7.33 The loss $L_0$ in £,000 above which adverse selection cannot occur. Initial wealth W = £100,000.
7.34 q̄, the probability that a healthy person aged x has a heart attack before age x + t, for policy duration t = 15 years.
7.35 The proportions ω exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk k, using different utility functions, for males purchasing CI insurance.
7.36 The proportions ω exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk k, using different utility functions, for females purchasing CI insurance.
A.37 List of odds ratios obtained from the 2 × 4 table in Figure A.33.
A.38 Other measures based on the 2 × 4 table in Figure A.33.

List of Figures

2.1 A 4-state heart attack model.
2.2 The transition intensity of all first heart attacks, by gender.
2.3 Subset of the model in Figure 2.1 to study survival after heart attacks.
2.4 The plots of the data from Table 2.1.
2.5 The plots of $f(t) = 1/(1 + t^a)$ against t for values of a = 0.25, 0.50, 1.00, 2.00, 4.00.
2.6 The plots of survival probabilities, $P_{22}(x, x+t)$, against duration after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years.
2.7 The plots of transition intensities, $\lambda_{24}(x, t)$, against duration after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years.
2.8 Graphs of $\lambda_{24}(x, t)$, assigned to representative ages for each age group, and the force of mortality of the ELT15 life tables.
2.9 The plots of survival probabilities of men aged 50, 60, 70, 80 and 90 following ELT15.
2.10 The plots of survival probabilities of women aged 50, 60, 70, 80 and 90 following ELT15.
2.11 The plots of survival probabilities, of individuals aged 50, 60, 70, 80 and 90, over the first 30 days after a first heart attack.
2.12 The plots of survival probabilities of individuals aged 50, 60, 70, 80 and 90, who survived the first 30 days after a first heart attack.
2.13 4-state heart attack model - Grouping of states.
2.14 A 2-state mortality model.
2.15 The graph of the integrand in Equation 2.11.
2.16 The graph of the integrand in Equation 2.13.
2.17 Transition intensities of non-heart-attack deaths plotted along with ELT15 for both males and females.
3.18 A full critical illness model for gender s.
4.19 Scatter plots of CI insurance premium rates for strata gE, Ge and GE versus that of ge under the Base scenario for males aged 45 and policy term 15 years.
4.20 The scatter plots of the premium ratings Ge/ge and GE/ge versus gE/ge and the corresponding density plots for males aged 45 and policy term 15 years under the Base scenario, all cases included.
4.21 Marginal densities of premium ratings in the Base scenario (males) with different numbers of cases in the case-control study.
4.22 The empirical cumulative distribution function of the premium ratings gE/ge, Ge/ge and GE/ge for males aged 45 and policy term 15 years under the Base scenario.
4.23 Marginal densities of premium ratings in different scenarios (males), with 5,000 cases in the case-control study.
5.24 Utility of wealth for a risk averse individual.
6.25 A two state model
7.26 A full critical illness model.
7.27 The ratio of heart attack transition intensity to total critical illness transition intensity, by gender.
A.28 A schematic diagram of a case-control study.
A.29 A 2-state model.
A.30 A 2 × 2 table for stratum k with corresponding probabilities.
A.31 A 2 × 2 table with data for stratum k.
A.32 The types of table for each case-control pair in a 1:1 matching.
A.33 A 2 × 4 table with data for stratum k.
B.34 The Exp(1) density and the majorising function with δ = 0.10.
B.35 The Exp(1) density and the majorising function with δ = 0.01.
B.36 The N(0,1) density and the majorising function with δ = 0.10.
B.37 The N(0,1) density and the majorising function with δ = 0.01.
B.38 Density estimates based on the simulated 50,000 random deviates from Exp(1).
B.39 Density estimates based on the simulated 50,000 random deviates from N(0,1).

Acknowledgements

First of all, I would like to thank my supervisor, Professor Angus Macdonald, for his continuous support, guidance and encouragement for the entire duration of this project. However busy, he always found time to meet regularly and discuss my work. I found his constructive criticisms and eye for technical detail absolutely invaluable throughout the course of the study.

I would also like to thank Dr Delme Pritchard for his suggestions and advice for the first half of the thesis.

This work was carried out at the Genetics and Insurance Research Centre at Heriot-Watt University. I would like to thank the sponsors for funding, and members of the Steering Committee for helpful comments. It has also been a pleasure to work with my colleagues: Lu Li, Tunde Akodu, Laura MacCalman and Tushar Chatterjee.

My parents and my sister have encouraged and inspired me to pursue knowledge to the best of my abilities all through my life. Without their love, support and guidance, I would not have come this far.

A special thanks to Bruce Porteous, a guide and a friend, for his enthusiastic support.

Finally, I dedicate this thesis to my wife, Vaishnavi, for her unfailing love, patience and support. At my insistence, she has had to endure reading numerous versions of the thesis at various stages of development. She provided me unconditional support, without which this thesis would not have been a reality. Thank you.

Abstract

Rapid advances in genetic epidemiology and the setting up of large-scale cohort studies, like the UK Biobank project, have shifted the focus from severe, but rare, single gene disorders to less severe, but common, multifactorial disorders. This will lead to the discovery of genetic risk factors for common diseases of major importance in insurance underwriting. Given this backdrop, we have two specific aims for this thesis. In the first half of the thesis (also the subject matter of Macdonald et al.
(2006)), we analyse the impact of results emerging out of UK Biobank on the insurance industry. In the second half (subject matter of Macdonald and Tapadar (2006)), we consider the related adverse selection issues.

The UK Biobank project is a large-scale investigation of the combined effects of genotype and environmental exposures on the risk of common diseases. It is intended to recruit 500,000 subjects aged 40–69, to obtain medical histories and blood samples from them at outset, and to follow them up for at least 10 years. This will have a major impact on our knowledge of multifactorial genetic disorders, rather than the rare but severe single-gene disorders that have been studied to date. The question arises, what use may insurance companies make of this knowledge, particularly if genetic tests can identify persons at different risk?

We describe here a simulation study of the UK Biobank project. We specify a simple hypothetical model of genetic and environmental influences on the risk of heart attack. A single simulation of UK Biobank consists of 500,000 life histories over 10 years; we suppose that case-control studies are carried out to estimate age-specific odds ratios, and that an actuary uses these odds ratios to parameterise a model of critical illness insurance. From a large number of such simulations we obtain sampling distributions of premium rates in different strata defined by genotype and environmental exposure. We conclude that the ability of such a study reliably to discriminate between different underwriting classes is limited, and depends on large numbers of cases being analysed.

If genetic information continues to be treated as private, as is now the situation in many countries, adverse selection becomes possible. But it should occur only if the individuals at lowest risk obtain lower expected utility by purchasing insurance at the average price than by not insuring.
We explore where this boundary may lie, using a simple 2 × 2 gene-environment interaction model of epidemiological risk, in a simplified 2-state insurance model and in a more realistic model of heart-attack risk and critical illness insurance. Adverse selection does not appear unless purchasers are relatively risk-seeking (compared with a plausible parameterisation) and insure a small proportion of their wealth; or unless the elevated risks implied by genetic information are implausibly high. In many cases adverse selection is impossible if the low-risk stratum of the population is large enough. These observations are strongly accentuated in the critical illness model by the presence of risks other than heart attack, and the constraint that differential heart-attack risks must agree with the overall population risk. We find no convincing evidence that adverse selection is a serious insurance risk, even if information about multifactorial genetic disorders remains private.

Introduction

Much of human genetics is concerned with studying the genetic contribution to diseases, and this leads to a profound distinction between the single-gene disorders and the multifactorial disorders.

(a) Single-gene disorders are caused, as their name suggests, by a defect in a single gene. Because most genes are inherited in a simple way according to Mendel’s laws, these diseases show characteristic patterns of inheritance from one generation to the next, known to geneticists and underwriters alike as a ‘family history’. Single-gene disorders are quite rare but often severe.

(b) Multifactorial disorders are (mostly) common diseases, such as coronary heart disease and cancers, whose onset or progression may be influenced by variations in several genes, acting in concert with environmental differences. The effect is likely to be quite slight, conferring an altered predisposition to the disease rather than a radically different risk.
Most genetic epidemiology has, until now, concentrated on single-gene disorders. One reason is that the clear patterns of Mendelian inheritance identified affected families long before the direct examination of DNA, and location of the relevant genes, became possible. So when these tools did emerge in the 1990s, geneticists knew where to look; affected families were studied, genes were identified, and the key epidemiological parameters were estimated. The parameter of most interest to actuaries is the age-related penetrance, which is the probability that a person who carries a risky version of the gene will have suffered onset of the disease by age x. It is entirely analogous to the life table probability ${}_x q_0$. (Often, the risky versions of the gene are called ‘mutations’, and a person carrying one is called a ‘mutation carrier’ or just ‘carrier’.)

Studies of affected families are by definition retrospective in nature; families are studied because they are known to be affected. Retrospective studies are subject to uncontrolled sources of bias, precisely because they are based on a non-randomly selected sample of the population; so they are, if possible, avoided in favour of prospective studies, in which a properly randomised sample of healthy subjects is followed forwards in time. Despite this health warning, retrospective studies of single-gene disorders have been carried out for reasons of convenience, cost and necessity: the ready availability of known affected families was convenient and made data collection relatively cheap; and the rarity of single-gene disorders made prospective studies impractical. Moreover, a prospective study would take many years to yield results. Another consequence of the rarity of most single-gene disorders is that most studies have had quite small sample sizes, but if the penetrance is high enough this is tolerable.
These studies have successfully led to many gene discoveries and a lot of progress has been made in understanding a number of single-gene disorders.

Multifactorial disorders, which are influenced by more than one gene or by interactions between genes and the environment, are not so well-studied. Many disorders, including cancer, heart disease, diabetes and Alzheimer’s disease, are believed to be caused or influenced by complex interactions between multiple genes, environment and lifestyle. The clear patterns of Mendelian inheritance are lost, and any familial clustering of disease that may be observed could just as easily be caused by shared environment as by shared genes. Therefore, there is no existing pool of known affected families that can be studied straightaway. And, because the influence of genetic variation may be slight (low penetrance), large samples will be needed to detect such influence with any reliability.

At the risk of oversimplifying a little, single-gene disorders represent the genetical research of the past, and multifactorial disorders represent the genetical research of the future. Progress will need studies that are large-scale, prospective, and long-term (and therefore very expensive). These studies must capture both genetic and environmental variation (and interactions) and relate them to the risks of common diseases. This is extremely ambitious.

The proposed UK Biobank project aims to achieve this. This project will recruit 500,000 individuals aged 40 to 69, chosen as randomly as possible from the general UK population, and collect data on them over a period of 10 years. We will discuss the main features of UK Biobank in Section 1.5. A key point is that UK Biobank aims only to collect data, not to analyse it. Its data will, in due course, be made available to researchers interested in particular genes and particular diseases, who will have to obtain separate funding for their studies in the usual way.
This is sensible because it is impossible to predict at outset just what combinations of genes, environment and disease it will be most fruitful to study. Nevertheless, it is necessary to have in mind the kinds of statistical studies that may, in future, be carried out, so that UK Biobank can be set up to capture data of the correct form. The presumption is that most studies will be case-control studies. We outline the basics of case-control studies in Appendix A.

Given its size and significance, it is important to study the kind of results we might expect to emerge out of UK Biobank. Our particular interest is in the implications of UK Biobank for insurance. There has been a lot of debate, often heated, concerning genetics and insurance in the past 10 years, mainly focussed on single-gene disorders. We refer to Daykin et al. (2003) or Macdonald (2004) as sources. It seems plausible that awareness of genetic issues will be heightened by enrolling 500,000 people into a high-profile genetic study. Insofar as insurance questions arise, answers obtained from past actuarial research, based on single-gene disorders, may be wholly inapplicable. But, since the single-gene disorders provide all the easily grasped examples and paradigms, there is a risk that these examples and paradigms will be grafted onto UK Biobank, however inappropriately, by the media if not by the genetics community. It will then be unfortunate that, by its nature, UK Biobank will not provide the evidence to refute such errors for 5–10 years.

We devote the first half of the thesis to modelling UK Biobank itself, so that before a single person has been recruited, or gene sequenced, we may quantify the implications of its outcomes for insurance.
We choose critical illness (CI) insurance as the simplest type of coverage, because the insured event is generally onset of disease, and we need not model post-onset events (although as we shall see in Section 2, this is not entirely true in parameterising the model). We choose heart attack (myocardial infarction) as the disease of interest, because this will certainly be a major target of studies using UK Biobank data.

Our approach is simple: simulate 500,000 random life histories, given an assumed model of genetic and environmental influences on the hazard rate of heart attack. Then we may analyse these simulated data just as an epidemiologist or an actuary may be expected to.

At this stage a further complication appears, one that is all too familiar to actuarial researchers who have modelled single-gene disorders. Actuaries almost never have access to the original data upon which genetic studies are based. In the case of UK Biobank, Section 5.2 of the draft protocol (www.ukbiobank.ac.uk/docs/draft protocol.pdf) says: “Data from the project will not be accessible to the insurance industry or any other similar body.” This means that actuarial researchers will have to rely on the published outcomes of medical or epidemiological research projects that use the UK Biobank data. The ideal, given the models actuaries typically use for pricing and reserving, would be age-dependent onset rates or penetrances, corresponding to $\mu_x$ or $q_x$ in a life-table study. Unfortunately, this is far in excess of what is usually published in a medical study, because the questions addressed by such studies can often be answered by much simpler statistics. And, it must be said, the estimation of $\mu_x$ or $q_x$ is very demanding of the data.

Since we expect case-control studies to be the most common approach to UK Biobank, we must take account of this in analysing our simulated data. We may not, realistically, assume that the actuary can analyse directly the 500,000 simulated life histories.
Instead, an epidemiologist must first carry out a case-control study and publish the results, which most often will be expressed as odds ratios (see Appendix A). Then the actuary must take these odds ratios and, using whatever approximate methods come to hand, estimate onset rates or penetrances suitable for use in an actuarial model. We will model this process, with two results:

(a) We will be able to estimate the impact on CI insurance premiums of representative multifactorial modifiers of heart attack risk.

(b) Having simulated the data from a known model of our own choosing, we can assess the seriousness of the errors that must be made, in parameterising an actuarial model from published odds ratios rather than from the raw data. As mentioned before, previous actuarial studies have done exactly that (see Macdonald and Pritchard (2000) for an example), but only in the context of relatively high penetrances. We will be interested to see if robust actuarial modelling of relatively low-penetrance disorders is possible using published case-control studies.

The plan of the first half of the thesis is as follows. After a general introduction in Section 1.1, we provide a basic overview of genetics in Section 1.2. Section 1.3 gives examples of a few well-known genetic disorders along with reviews of relevant actuarial literature. The regulatory developments in the UK concerning genetics and insurance are covered in Section 1.4. In Section 1.5 we describe the main features of UK Biobank. In Section 1.6, we will introduce our general approach to simulating UK Biobank. A specific multiple-state model representing heart attack will be introduced in Section 2.1. The transition intensities underlying the model will be developed in Sections 2.2–2.4. In Section 3.1, we will hypothesise a simple 2 × 2 gene-environment interaction model affecting the risk of heart attack.
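The simulate-then-analyse idea described above can be sketched in a few lines of Python. This is only a rough illustration under strong simplifying assumptions, not the model developed in this thesis: hazards are constant rather than age-dependent, deaths and other competing events are ignored, the comparison is unmatched rather than a matched case-control design, and the function names, stratum labels as dictionary keys, and all parameter values are hypothetical.

```python
import random


def simulate_cohort(n, base_hazard, rel_risk, p_gene, p_env, years, seed=0):
    """Simulate n lives over `years` under a constant-hazard 2 x 2 model.

    rel_risk maps the stratum labels "ge", "gE", "Ge", "GE" to hypothetical
    multipliers of base_hazard (an assumption, not the thesis's parameters).
    Returns {stratum: [cases, non_cases]}.
    """
    rng = random.Random(seed)
    counts = {s: [0, 0] for s in ("ge", "gE", "Ge", "GE")}
    for _ in range(n):
        # Assign genotype and environmental exposure independently.
        stratum = ("G" if rng.random() < p_gene else "g") + \
                  ("E" if rng.random() < p_env else "e")
        # Exponential waiting time to a first heart attack.
        onset = rng.expovariate(base_hazard * rel_risk[stratum])
        counts[stratum][0 if onset < years else 1] += 1
    return counts


def odds_ratio(counts, stratum, baseline="ge"):
    """Crude (unmatched) odds ratio of `stratum` relative to `baseline`."""
    a, b = counts[stratum]    # cases, non-cases in the stratum of interest
    c, d = counts[baseline]   # cases, non-cases in the baseline stratum
    return (a * d) / (b * c)


if __name__ == "__main__":
    rr = {"ge": 1.0, "gE": 1.5, "Ge": 1.5, "GE": 2.0}  # hypothetical values
    cohort = simulate_cohort(500_000, 0.005, rr, 0.5, 0.5, 10.0, seed=42)
    for s in ("gE", "Ge", "GE"):
        print(s, round(odds_ratio(cohort, s), 3))
```

Because the sketch is driven by a known set of multipliers, the estimated odds ratios can be compared with the relative risks that generated the data, which is the kind of comparison that only a simulation study permits.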
In Section 3.2, we present (in summary form) a set of simulated UK Biobank data, namely 500,000 life histories. Then, we analyse these simulated data, in two stages as described above. First, a model epidemiologist will carry out a case-control study (actually, we will look at several different case-control studies that may be carried out). This is presented in Section 3.3. Then, our model actuary will use these ‘published’ figures to construct CI insurance models allowing for genetic variability and environmental exposures. The actuarial investigation is discussed in Section 3.4. In Section 3.5, premium rates based on these critical illness models will be calculated and compared for the different subgroups.

Despite its great size, UK Biobank is essentially an unrepeatable single sample. Any estimated quantity based upon its data is subject to the usual statistical sampling error, and a premium rate is just such an estimated quantity. It is to be hoped that the large samples available from UK Biobank will reduce sampling error to a low level. In reality, the designers of UK Biobank can only estimate the statistical power of representative case-control studies, which was certainly done before choosing 500,000 lives as the sample size. We, however, with control over our simulated data, can assess directly the sampling properties of estimates based on UK Biobank data. In particular, and of direct relevance to the criteria established in the UK by the Genetics and Insurance Committee (GAIC), we can assess in statistical terms the reliability of CI premium rates based on UK Biobank data. We do this simply by repeating the simulation of 500,000 life histories as many times as necessary, and constructing the empirical distributions of derived quantities such as odds ratios and premiums. This is in Section 4.

The second half of the thesis addresses the issues relating to adverse selection in the context of multifactorial disorders.
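The repeated-sampling idea described above, for the odds ratio stage only, can be illustrated with a minimal Python sketch. It is not the model developed in this thesis: it assumes a single binary exposure, a far smaller cohort, no age structure, and purely hypothetical values for the exposure prevalence (0.3) and the disease probabilities (0.02 and 0.04). All function names are illustrative.

```python
import random

def simulate_cohort(n, p_exposed, p_case_unexposed, p_case_exposed, rng):
    """Return a list of (exposed, case) flags for n simulated lives."""
    cohort = []
    for _ in range(n):
        exposed = rng.random() < p_exposed
        p_case = p_case_exposed if exposed else p_case_unexposed
        cohort.append((exposed, rng.random() < p_case))
    return cohort

def case_control_odds_ratio(cohort, rng):
    """Odds ratio from all cases and an equal number of random controls."""
    cases = [exposed for exposed, case in cohort if case]
    non_cases = [exposed for exposed, case in cohort if not case]
    controls = rng.sample(non_cases, min(len(cases), len(non_cases)))
    a = sum(cases)             # exposed cases
    b = len(cases) - a         # unexposed cases
    c = sum(controls)          # exposed controls
    d = len(controls) - c      # unexposed controls
    # Add 0.5 to each cell (Haldane-Anscombe correction) to avoid division by zero.
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

def odds_ratio_distribution(n_repeats, n_lives, seed=0):
    """Sorted odds ratios from repeated, independent simulated cohorts."""
    rng = random.Random(seed)
    return sorted(
        case_control_odds_ratio(
            # Hypothetical parameters, chosen for illustration only.
            simulate_cohort(n_lives, 0.3, 0.02, 0.04, rng), rng)
        for _ in range(n_repeats)
    )

if __name__ == "__main__":
    ors = odds_ratio_distribution(n_repeats=20, n_lives=10_000)
    print("median odds ratio:", ors[len(ors) // 2])
    print("smallest and largest:", ors[0], ors[-1])
```

Under these assumed parameters the underlying odds ratio is (0.04/0.96)/(0.02/0.98), a little over 2, and the spread of the simulated values around it gives a direct picture of sampling error at the chosen cohort size.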
Insurance companies have developed sophisticated underwriting techniques to cope with the problems of adverse selection. The principle behind underwriting is to identify key risk factors that stratify applicants into reasonably homogeneous groups, for each of which the appropriate premium rate can be charged. The risk of death or ill health is affected by, among other things, age, gender, lifestyle and genotype. However, the use of certain risk factors is sometimes controversial. In particular, this is true of factors over which individuals have no control, such as genotype. As a result, in many countries a ban has been imposed, or moratorium agreed, limiting the use of genetic information. In the UK, GAIC is providing guidance to insurers on the acceptable use of genetic test results.

As discussed earlier, disorders caused by mutations in single genes, which may be severe and of late onset, but are rare, have been quite extensively studied in the insurance literature. One reason is that the epidemiology of these disorders is relatively advanced, because biological cause and effect could be traced relatively easily. The conclusion has been that single-gene disorders, because of their rarity, do not expose insurers to serious adverse selection in large enough markets. However, this conclusion need not be valid for multifactorial disorders. The vast majority of the genetic contribution to human disease will arise from combinations of gene varieties (called ‘alleles’) and environmental factors, each of which might be quite common, and each alone of small influence but together exerting a measurable effect on the molecular mechanism of a disease. Although the epidemiology of multifactorial disorders is not very advanced, this should make progress in the next 5–10 years through the very large prospective studies now beginning in several countries, like the UK Biobank project.
If these studies are successful in capturing both genetic and environmental variations and interactions, and relate them to the risks of common diseases, the genetics and insurance debate will, in the fairly near future, shift from single-gene to multifactorial disorders.

Any model used to study adverse selection risk must incorporate the behaviour of the market participants. Most of those applied to single-gene disorders in the past did so in a very simple and exaggerated way, assuming that the risk implied by an adverse genetic test result was so great that its recipient would quickly buy life or health insurance with very high probability. These assumptions were not based on any quantified economic rationale, but since they led to minimal changes in the price of insurance this probably did not matter. The same is not true if we try to model multifactorial disorders. Then ‘adverse’ genotypes may imply relatively modest excess risk but may be reasonably common, so the decision to buy insurance is more central to the outcome.

Most research on adverse selection concentrates primarily on providing a proper economic rationale for the impact, on the insurance market, of genetic tests for, mainly, rare diseases. In this thesis, we try to bring together plausible quantitative models for the epidemiology and the economic issues, in respect of more common disorders, therefore affecting a much larger proportion of the insurer’s customer base. We wish to find out under what circumstances adverse selection is likely to occur.

The plan of the second half of the thesis is as follows. In Sections 5.1–5.3, we provide background information on risk, insurance and underwriting. In Section 5.4, we review existing literature. Adverse selection in the context of multifactorial disorders is defined in Section 5.5. A basic introduction to utility theory and estimates of risk-aversion are given in Sections 5.6–5.9.
In Chapter 6, we develop techniques to determine the conditions leading to adverse selection for a 2×2 gene-environment interaction in a simple 2-state insurance model. We study the effects of additive and multiplicative gene-environment interactions in Sections 6.4 and 6.6 respectively. The rôle played by the population proportion in each risk category is studied in Section 6.5. In Chapter 7, we extend the results from the 2-state model to a CI insurance model. We propose a simple model of a multifactorial disorder, with two genotypes and two levels of environmental exposure, and either additive or multiplicative interactions between them. These factors affect the risk of myocardial infarction (heart attack), and therefore the theoretical price of CI insurance. The situation here is slightly different from the 2-state insurance model, in that there are risks, other than heart attack, which affect CI insurance. Conclusions and suggestions for further work are in Chapter 8. We have also provided two appendices at the end for background information. Appendix A gives a brief overview of epidemiology and Appendix B provides introductions to some relevant numerical methods.

Chapter 1 Genetics and Insurance

1.1 Introduction

With the discovery of genes, we are closer than ever before to a clearer understanding of our biological roots; our place in the history of evolution. As it turns out, the essence of life is embedded in the genes. Genes contain all the information necessary to create a life form out of a single cell. They are the units of heredity passed down from one generation to the next. They shape our physical characteristics and behavioural patterns. In short, genes are key to life, the reasons for our existence. But they cannot work alone; the environment plays an active role too. It is becoming increasingly obvious that it is the interplay between genes and the environment that shapes what we are. Genes thrive on diversity.
We, human beings, are all distinct from one another and not just mere clones, thus proving the existence of wide variations even within a single species. But diversity also brings with it its own complications. For example, although all variations of the same gene are supposed to perform the same function, they all do it in slightly different ways. Inevitably, this leads to differences in their performance. In particular, underperformance can produce unwanted side-effects in the form of genetic disorders. This has practical implications in all spheres of human life. Here we are interested in the impact of genetic disorders on the insurance industry.

Insurance in its basic form is a simple principle of cooperation, where each individual in a group contributes a small amount towards a common fund, which can be used to support the few who suffer losses; a small price to pay for guaranteed support in times of misfortune. However, even in this basic set-up, it is clear that insurance cannot be provided to all at the same price. There will always be heterogeneity in risk profiles, where a few individuals will be exposed to a greater risk of loss compared to the rest. Then it will be unfair on the low-risk individuals to ask them to subsidise the high-risk group. So, charging risk-based insurance premiums seems a sound alternative.

The reasoning appears perfectly logical when smokers are charged a higher life insurance premium. It can be argued that individuals who smoke choose to do so of their own free will, fully understanding the related health hazards. However, the same logic becomes untenable when applied to an individual who has inherited a “faulty” gene from his parents. It might be obvious that he faces a greater risk, but is it fair to penalise him for his own misfortune? The answer is far from being straightforward. Apart from ethical and moral issues, there are economic and political angles to it as well.
Governments and insurance regulators might find it difficult to let market economics take its own course. But if they intervened, the outcome may not be entirely beneficial, as it will ultimately be the general public who will pay for any market inefficiency.

Given this backdrop, our aim in this thesis will be to analyse the impact of gene-environment interactions on insurance from the perspective of both insurers and consumers. We ask, for what types of gene-environment interactions:

(a) Can an insurer justify charging different premiums for different groups?

(b) Does an insurer face the risk of adverse selection?

But, before tackling these questions, we will provide some background information in the remainder of this chapter. In Section 1.2, we provide an overview of genetics. In Section 1.3, we provide examples of a few well-known genetic disorders. In Section 1.4, we will give a brief history of how regulations on genetics and insurance have been shaped in the UK. A brief description of the UK Biobank project is in Section 1.5. An outline of our proposed model of UK Biobank, to analyse the results that might come out of the project, is given in Section 1.6.

1.2 Genes

“..., as the earth and ocean were probably peopled with vegetable productions long before the existence of animals; and many families of these animals long before other families of them, shall we conjecture that one and the same kind of living filaments is and has been the cause of all organic life?”

This is a bold conjecture by Erasmus Darwin, in his book Zoönomia (Darwin (1794)), more than 60 years before his famous grandson Charles Darwin produced his epic, On the Origin of Species (Darwin (1859)). The conjecture by Erasmus Darwin has turned out to be fantastically close to reality. But to arrive there we have to start with the theory proposed by his grandson Charles.
Charles Darwin coined the term “natural selection” which he used to mean that each individual has to struggle to survive where resources are limited. Individuals with the “best” characteristics will be more likely to survive and those desirable traits will be passed down through generations and will eventually be dominant in the population over time. When Charles Darwin proposed his theory of natural selection, it was at odds with the existing model of blending inheritance, which predicted that an offspring is an average of its parents. This would mean that an offspring of a tall parent and a short parent will be of medium height, who will then pass on the trait of medium height to the next generation and so on. So the tall and short traits will be lost in future generations, and this contradicted the theory of natural selection which required accumulation of desirable traits.

At around the same time, Gregor Johann Mendel was conducting his revolutionary experiments on pea plants. He noticed that if he crossed two pure contrasting traits, the next generation hybrids showed only one trait, the dominant one. And if he crossed only hybrids, the recessive trait re-appeared in 25% of cases. Mendel realised that offspring inherit a pair of traits, one from each parent, of which the dominant trait is expressed. This was a profound observation, which Mendel published in Mendel (1866). Unfortunately, his work remained largely unnoticed for more than three decades before it was re-discovered in 1900.

Following re-discovery, Mendel’s laws of inheritance and Darwin’s natural selection were hotly debated among the scientific community. While Darwinism demanded variety, Mendelism offered stability instead. The marriage between the two theories happened only when Joseph Muller discovered mutation by subjecting fruit-flies to X-rays. Once the conflict was resolved, scientists started wondering how inherited traits are passed between generations.
The breakthrough was finally made by Watson and Crick (1953), who discovered the molecular structure of nucleic acids and unravelled the rôle of deoxyribonucleic acid (DNA) in heredity.

In the rest of this section, we will provide a very brief introduction to molecular genetics. This is not meant to be a comprehensive review of the subject but only an overview of the fundamental concepts. For detailed discussions, Lewin (2000), Pasternak (1999), Strachan and Read (1999) and Sudbery (1998) are standard textbooks on human genetics. For a popular exposition, please refer to Ridley (1999). Unless specific references are provided, all material in this section, and the next, is obtained from the above-mentioned sources.

All living creatures are made up of cells. The cell is the structural and functional unit of all living beings and is sometimes called the building block of life. Some organisms, like bacteria, are unicellular, while other complex life-forms, such as human beings, are multicellular. A human body has an estimated 100 trillion cells. Cells are made up of a number of subcellular components. Except for red blood cells, all cells in a human body contain a membrane-enclosed organelle called the nucleus. Other subcellular components, like ribosomes, remain suspended outside the nucleus in a jelly-like material called cytoplasm.

Leaving out the red blood cells along with the egg and sperm cells, each human cell nucleus contains 23 pairs of filaments called chromosomes. As mentioned above, red blood cells do not have nuclei. The egg cells and the sperm cells contain only one of each pair of chromosomes, i.e., they have 23 chromosomes instead of 23 pairs. An offspring is produced by fertilisation of an egg cell by a sperm cell, whereby all chromosomes become paired again. Inside a chromosome there exists a paired molecule called DNA, with two long strands of sugar and phosphate running parallel to each other.
Embedded on each strand is a sequence of nucleotides or bases, which come in four varieties: adenine (A), cytosine (C), guanine (G) and thymine (T). The two strands of a DNA molecule are structured in such a way that if nucleotide A is positioned in a particular location of a strand, the opposite strand will have nucleotide T at the same location. Similarly for C and G. Now, because A has great affinity for T while C likes to pair with G, the nucleotides on opposite strands form bonds between them and are called base-pairs. This produces the well-known structure of a double helix, where the two strands of DNA stay intertwined with each other. Note that, as the sequence of nucleotides in one strand is a complementary copy of the other, the whole double-stranded sequence is described by the sequence of only one, chosen by convention.

The sequence of nucleotides in DNA contains vital information on how to synthesise different types of proteins necessary for the existence of living creatures. Almost everything in a human body is made of proteins or made by them. So an efficient mechanism for protein synthesis is critical for survival. On one of the two strands of a DNA molecule, each sequence of three consecutive nucleotides, e.g. ACT, CAG, TTT, is called a codon. Except for a few codons (which are used as stop signals), all codons correspond to particular amino acids, which are the building blocks of any protein. There are 64 possible codons, whereas there are only 20 amino acids. So there are multiple codons which refer to the same amino acid.

There are large stretches of DNA which do not contain any useful information; only a small fraction of the complete DNA sequence appears to encode proteins. A gene is a region of DNA that contains the code for synthesising a particular protein. Even within a gene there are sections of meaningless information called introns, in between sections of actual code, called exons.
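The base-pairing rule and the codon count described above can be checked with a short Python sketch. The function names are illustrative only, and the sketch pairs bases position by position exactly as described in the text, ignoring biological details such as strand direction.

```python
from itertools import product

# Base-pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary_strand(strand):
    """Return the opposite strand, base by base, at the same positions."""
    return "".join(COMPLEMENT[base] for base in strand)

def codons(strand):
    """Split a strand into consecutive, non-overlapping triplets."""
    usable = len(strand) - len(strand) % 3  # drop any incomplete trailing triplet
    return [strand[i:i + 3] for i in range(0, usable, 3)]

# Every possible triplet over the four bases: 4 * 4 * 4 = 64 codons.
ALL_CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]

print(complementary_strand("ACTCAGTTT"))  # TGAGTCAAA
print(codons("ACTCAGTTT"))                # ['ACT', 'CAG', 'TTT']
print(len(ALL_CODONS))                    # 64
```

Because 64 codons map to only 20 amino acids plus the stop signals, several codons must share an amino acid, which is why some single-nucleotide changes leave the resulting protein unchanged.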
When a cell needs to manufacture a particular protein, appropriate signals are generated to identify the gene containing the recipe for the protein in question. Then a complementary copy of that section of DNA is made to form a new single-stranded molecule called messenger ribonucleic acid (mRNA). This process is called transcription. mRNA is very similar to a single strand of DNA, except that the nucleotide T in DNA is replaced by the nucleotide uracil (U) in mRNA. After transcription, mRNA is stripped of its introns and the exons are spliced together to form a seamless code. The edited mRNA then moves out of the nucleus and approaches a ribosome. Ribosomes translate the information contained in the mRNA into a sequence of amino acids, which then folds up into a distinctive shape (depending on the sequence) to form a protein. This is how a cell uses the code in DNA to manufacture a protein it needs.

DNA can also replicate to produce two identical copies of itself. The technique is similar to the one used for mRNA transcription. However, instead of working only on a section of a particular strand, replication works on both strands of DNA simultaneously. At first, the bonds between the base-pairs are broken to separate the complementary strands. Simultaneously, two new strands are constructed with appropriate nucleotides to form two identical double-stranded DNA molecules. This technique is used to pass on genetic information from cell to cell (mitosis) and from generation to generation (meiosis).

The discussion above depicts an idealised scenario. In reality, there are a number of places where things can go wrong. For example, in the replication stage, one nucleotide might get replaced by another by mistake. This can be critical if this happens in the coding region of DNA. Unless the changed codon corresponds to the same amino acid, the gene will not be able to synthesise the correct protein. This can be disastrous depending on the function of the protein.
Similar problems will arise if one or more nucleotides are deleted from or inserted in the DNA sequence. Any change to a DNA sequence is termed a mutation.

Although the consequences can be catastrophic, not all mutations are deleterious. In fact, multiple variations of the same gene are quite common. These are called alleles. The variations between alleles explain simple differences, like hair colours. However, for a particular gene, one allele might produce a slightly different version of a protein from the other alleles. This might turn out to be slightly better or worse at performing a specific function. One might ask: why aren’t inefficient alleles getting purged by natural selection? One answer might be that these alleles might be better at doing other things. We have to wait until we fully understand the implications of all interactions between different genes and the environment to truly appreciate all the nuances of human genetics.

Let us now look at a few well-known genetic disorders.

1.3 Genetic Disorders and Insurance

1.3.1 Huntington’s Disease

Huntington’s disease (HD) or Huntington’s chorea is a rare neurological disorder. It is named after the physician George Huntington, who described the disorder in detail in a paper in 1872. HD can strike at an age less than twenty and the early symptoms include a slight deterioration of the intellectual faculties. Gradually, physical symptoms appear in the form of jerky, uncontrollable, random movements, collectively known as chorea. Patients also exhibit slowing of thought process, speech impairment and inability to learn new skills. They descend into deep depression, with occasional hallucinations and delusions.

The disorder has been traced to a particular gene in chromosome 4. As is the case for many genes, this gene also has a large number of alleles. The alleles differ from each other in the number of occurrences of a single codon CAG in the middle of the gene.
The number of CAG repeats can vary from six to over a hundred depending on the allele. Individuals with 35 or fewer CAG repeats are safe from HD. For genes with more than 35 copies of CAG, the DNA replication process becomes unstable and the number of repeats can increase in successive generations. Because of the progressive increase in repeat lengths, the disorder tends to increase in severity as it passes from one generation to the next, and to trigger earlier onsets. Also, the disorder is a dominant trait, so even a single affected allele from a parent is enough to trigger HD. For individuals with 39 CAG repeats, there is a 90% probability of first symptoms appearing before age 75. However, with 50 CAG repeats, onset of HD, on average, is at age 27. The disorder is incurable and takes 15–25 years to run its full course.

The codon CAG corresponds to the amino acid glutamine. It is a necessary ingredient for the production of a protein called huntingtin. However, more than 39 CAG repeats produce a mutated form of the protein, which gradually accumulates in neurone cells. This continuous aggregation causes the cells to die off in selected regions of the brain and triggers HD.

Even before the actual discovery of the gene responsible for HD, it was obvious that the disorder was hereditary in nature. Insurance companies offering health insurance, like CI insurance, used family histories as an underwriting tool to protect themselves from adverse selection. With the better understanding of the genetics behind HD, insurance companies will be interested to find out if their underwriting techniques could be improved further. This has been studied in detail in Gutiérrez and Macdonald (2004). The authors first estimated the age-dependent rates of onset of HD for males and females with different CAG repeats. They had to take into account the severity of the symptoms that would lead to a successful CI insurance claim.
Then the authors calculated the net level CI premium rates for both sexes with 36–50 CAG repeats. They found that insurance companies, following standard underwriting guidelines, will be unable to insure individuals with very long CAG repeats. This is particularly true for younger individuals and longer policy durations. For comparison purposes, the authors also calculated premiums based on family history alone. The authors then investigated the cost of adverse selection in case of a moratorium on the use of genetic test results and also possibly family history. They found that moratoria on genetic test results can lead to an increase of premiums of about 0.1%, while including family history in the moratoria will increase premiums by 0.35%. The whole exercise was repeated for a life insurance model. Although the results show a discernible increase in the risk of mortality with increase in CAG repeats, the impact is less severe than that in the context of CI insurance. The cost of adverse selection arising from a moratorium on the use of genetic tests for HD was found to be negligible for life insurance.

1.3.2 Alzheimer’s Disease

Alzheimer’s Disease (AD) got its name from the German psychiatrist Dr Alois Alzheimer. In 1901, he interviewed a patient, Mrs Auguste D, who showed signs of dementia, a medical term for progressive decline in cognitive functions affecting memory, language and problem solving. The patient died in 1906, and Dr Alzheimer along with his colleagues examined her anatomy and neuropathology. He found deposits of plaques on the outside of the neurones and severance of the connections between the neurones. These have been identified as classical pathological signs of AD.

AD is a disorder of old age, rarely affecting people less than 60 years old. The early symptoms include short-term memory loss with a tendency to become less energetic or spontaneous.
With the progression of the disease, patients start forgetting well-known skills or objects or persons. At a later stage, the patients find it difficult to perform the simplest of tasks and require constant supervision.

The Apolipoprotein E (ApoE) gene on chromosome 19 has been identified as a risk factor for development of AD. The ApoE gene has three alleles ε2, ε3 and ε4, found in the general population in the proportions 0.09, 0.77 and 0.14 respectively. Individuals with the ε4 allele in their gene have a greater chance of developing AD; more so if they have two ε4 alleles. In contrast, the ε2 allele appears to have a protective effect against AD. The difference between the alleles is that at two locations, two A nucleotides in ε4 are replaced by two Cs in ε2. ε3 is intermediate. As these alleles produce slightly different proteins, the protein derived from the ε4 allele appears to aid in the formation of plaques in the neurones. Although the actual biochemical process is not well understood, there is significant statistical evidence of a correlation.

Patients with AD can survive up to 15 years after the first symptoms are noticed. This is of significant importance to the long-term care insurance market. Macdonald and Pritchard (2000) and Macdonald and Pritchard (2001) are studies on the impact of AD on long-term care insurance. Macdonald and Pritchard (2000) proposed a multiple-state model for AD and went on to estimate the transition intensities for different possible genotypes of ApoE. Macdonald and Pritchard (2001) applied the model to calculate long-term care insurance premiums. The authors found that insurers, if allowed to use ApoE test results, would probably charge ratings of +25% and +50% for individuals with one and two ε4 alleles in their genes, respectively. The authors also estimated the cost of adverse selection if a moratorium is in place on using genetic test results. They found that the cost will not exceed 5% of premiums and can probably be ignored.
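To give a feel for how many people might fall into each ApoE genotype, the allele proportions quoted above can be converted into genotype proportions. The Python sketch below assumes Hardy-Weinberg equilibrium (random mating, with the two alleles inherited independently). That assumption is not made in the text and is introduced here purely for illustration; the labels e2, e3 and e4 stand for ε2, ε3 and ε4.

```python
from itertools import combinations_with_replacement

# Allele proportions as quoted in the text (e2, e3, e4 stand for the three alleles).
ALLELE_FREQ = {"e2": 0.09, "e3": 0.77, "e4": 0.14}

def genotype_frequencies(allele_freq):
    """Genotype proportions under Hardy-Weinberg equilibrium (an assumption).

    A genotype with two copies of the same allele has proportion p * p;
    a genotype with two different alleles has proportion 2 * p * q.
    """
    freqs = {}
    for a, b in combinations_with_replacement(sorted(allele_freq), 2):
        p, q = allele_freq[a], allele_freq[b]
        freqs[(a, b)] = p * p if a == b else 2 * p * q
    return freqs

for genotype, freq in genotype_frequencies(ALLELE_FREQ).items():
    print("/".join(genotype), round(freq, 4))
```

Under this assumption, roughly 24% of the population would carry exactly one ε4 allele and about 2% would carry two, which indicates the approximate size of the groups to which the +25% and +50% ratings would apply.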
1.3.3 Cancer

The two genetic disorders discussed so far, HD and AD, are commonly known as single-gene disorders. For each, there is a strong link between the disorder and mutations in a particular gene. However, it is important to remember that with advances in genetic research, it is quite possible that links with other genes and environmental factors will come to light in future. HD and AD are unusual in the sense that most common disorders are much more complex in nature and arise out of interactions between a number of genes along with environmental factors. Cancer is one such common multifactorial genetic disorder.

As we saw in the discussion of genes, all cells contain the necessary information to replicate themselves. However, disorderly cell replication can lead to cell proliferation and ultimately to the production of malignant tumours. There is a complex mechanism in place to protect against such an eventuality. Most notably, the tumour suppressor genes or anti-oncogenes identify any irregularities and produce a dampening or repressive effect on the cell division cycle. If such repairs prove futile, the genes promote apoptosis, a kind of programmed cell death. Most tumour suppressor genes can function even with one functional allele, i.e. both alleles of these genes must be mutated before tumour suppression fails. In this section, we will consider two such tumour suppressor genes: BRCA1 and BRCA2.

The BRCA1 gene is located on chromosome 17 and codes for a protein which regulates the cycle of cell division and inhibits uncontrolled growth of cells, in particular, those that line the milk ducts in the breast. A large number of alleles of the BRCA1 gene have been identified, many of which are associated with an increased risk of breast cancer. The BRCA2 gene, located on chromosome 13, has a function similar to the BRCA1 gene. Again a number of alleles of the BRCA2 gene have been linked to increased risk of breast cancer.
There are also studies which have linked BRCA1 and BRCA2 genes with ovarian cancer. It is important to note here that only about 5 to 10% of breast cancers are due to mutations in BRCA1 and BRCA2 genes, suggesting that most cases are sporadic in nature.

Macdonald et al. (2003a) study the genetics of breast and ovarian cancer from the perspective of a life and health insurance underwriter, who can only have access to family histories (often incomplete) of prospective consumers. The authors developed a multiple-state model and estimated the transition intensities from UK population data. Using the model, they computed conditional probabilities of women being BRCA1 and BRCA2 mutation carriers (individuals with alleles which confer a greater risk of breast and ovarian cancer) given the family history. The authors found that these probabilities are very sensitive to the estimates of mutation frequencies and penetrances. They concluded that it may not be appropriate to apply risk estimates based on studies of high risk families to other groups.

Macdonald et al. (2003b) applied the model to CI insurance. The authors found that if insurance underwriters had access to genetic test results, most BRCA1 and BRCA2 mutation carriers will be uninsurable. On the other hand, if underwriting is based on family history alone, only a few cases will exceed the usual underwriting limits. If insurers were unable to use genetic test results or family history information for underwriting, adverse selection was found to be significant in a small CI insurance market, in case of high penetrances or if higher sums assured could be obtained.

1.3.4 Cardiovascular disease

In breast and ovarian cancer, the two genes involved accounted for a small number of cases. Cancer can also be caused by mutations due to environmental factors, like exposure to harmful radiation.
So, we have gradually shifted our focus from simple, but rare, single-gene disorders to complex, but relatively common, multifactorial disorders. In this section, we will discuss one more common disorder: cardiovascular disease.

Cardiovascular disease is a class of disease that involves the heart and blood vessels. In one common form, fatty deposits (plaques) in the blood vessels make them narrow and restrict blood flow. The plaques can sometimes rupture, forming blood clots that obstruct the artery and stop blood flow to the heart muscles. This is commonly known as myocardial infarction or heart attack. A number of risk factors have been identified for cardiovascular disease. Large-scale studies have found evidence that tobacco smoking can significantly increase an individual’s risk of heart attack. For an example of such a study, see Woodward (1999). Among other risk factors, hypercholesterolemia, or elevated cholesterol levels in the blood stream, has been directly linked to heart attacks.

To understand hypercholesterolemia, we have to return to the ApoE gene on chromosome 19. The function of the protein coded by the gene is to facilitate transfer of fat and cholesterol from very low density lipoprotein (VLDL), which carries fat and cholesterol from the liver to the cells that need them. If there is a malfunction, much fat and cholesterol remain in the blood stream and form plaques on the walls of arteries, which can ultimately lead to heart attacks. The efficiency with which the ApoE gene carries out its function depends on its alleles. It has been found that individuals with two ε4 alleles or two ε2 alleles are at a heightened risk of cardiovascular disease compared to those who have at least one copy of the ε3 allele. Of course, a low cholesterol diet can reduce the risk considerably. So again we can see that external intervention plays an important rôle in the efficient functioning of genes. Clearly, cardiovascular disease is a multifactorial disorder.
The risk depends not only on genetic factors (alleles of the ApoE gene), but also on environmental interactions (smoking habits, dietary control, etc.). We will study heart attack in much greater detail in later chapters. In Chapter 2, we will develop a multiple-state model for heart attack and estimate the transition intensities. In Chapter 3, we will show how we can hypothesise a 2×2 gene-environment interaction based on this model. All our subsequent analysis will be based on that model.

1.4 Genetics and Insurance Regulations

Insurance companies set premium rates based on the assumption that they have access to all information relevant to the risk involved. If consumers can withhold any information from an insurance company, there is a risk that the company will face adverse selection. This is the basic principle behind underwriting insurance risks. However, this is not the only consideration behind underwriting classifications. There might be competitive pressure to charge different premium levels to different groups of consumers. One such example is charging higher life insurance premiums to smokers. It is unlikely that smokers, while purchasing insurance, will take into account the adverse health effects of smoking and over-insure themselves to select against an insurer. But once one insurer decides to charge differential premiums, other insurers will have to follow suit, as charging the average premium would then expose them to attracting only high-risk consumers. For an in-depth discussion of this topic, please refer to Macdonald (2004). However, underwriting based on genetic test results is very different from the smoker/non-smoker example. One’s own genes are a very private matter and discrimination based on such information has both moral and social implications. At the same time, the possibility of adverse selection cannot be ruled out altogether.
Given this dilemma, governments and insurance regulators in different countries have adopted different approaches to deal with the issue. Sweden, for example, does not allow the use of genetic test results or family histories for underwriting. Developments in the UK have been particularly interesting, as the scientific basis for underwriting has come under fierce scrutiny. We will briefly recount the main milestones in this section. For a more detailed discussion, please refer to Macdonald (2003). In 1997, the Human Genetics Advisory Committee (HGAC) asked the UK Government to impose a moratorium on the use of all genetic test results for insurance underwriting purposes. The Government, instead, set up the Genetics and Insurance Committee (GAIC), in 1999, to scrutinise the use of genetic tests in underwriting on a case-by-case basis. In 2000, GAIC approved the use of genetic test results for HD for life insurance contracts over £500,000. GAIC made it clear that insurance companies could not ask individuals to undergo genetic tests for HD. Only if individuals have already been tested can insurance companies ask for access to that information. GAIC noted that this would actually enhance access to insurance for individuals with normal test results but a family history of HD. In the meantime, HGAC and other advisory bodies were merged to form the Human Genetics Commission (HGC), which was particularly critical of the rôle of GAIC. The Association of British Insurers (ABI), which represented the majority of UK insurers, also came in for some criticism. The ABI advised its member insurers that they could continue to use genetic test results unless their use had been rejected by GAIC. Few agreed with this interpretation. The ABI then agreed to restrict the use of test results to those that GAIC had approved.
In 2001, after more intense debate on the topic, the ABI withdrew all the applications it had made to GAIC (in respect of HD, breast and ovarian cancer and AD) and agreed on a five-year moratorium on the use of genetic test results. Under the terms of the moratorium, customers will not be required to disclose the results of predictive genetic tests for policies up to £500,000 for life insurance, £300,000 for health insurance and annual benefits of £30,000 for income protection insurance. In 2005, the original moratorium was extended for five more years and will be valid until 1 November 2011. The current Concordat and Moratorium on Genetics and Insurance, which came into effect on 14 March 2005, mentions that GAIC will continue to liaise with the clinical genetics community, patient groups and experts in insurance and actuarial science, and monitor new developments relevant to genetics and insurance. In the meantime, the UK Biobank project has been launched to analyse the impact of gene-environment interaction on common multifactorial disorders. With rapid advances in genetics, aided by such large-scale population studies, it is likely that new facts and evidence will come to light with regularity. In particular, it is important to analyse the kind of results that might come out of UK Biobank and their implications for the insurance industry.

1.5 The UK Biobank Project

The website http://www.ukbiobank.ac.uk/ is the main source of information on UK Biobank. In particular, it provides a draft protocol. There (Section 1.2) it is stated that: “The main aim of the study is to collect data to enable the investigation of the separate and combined effects of genetic and environmental factors (including lifestyle, physiological and environmental exposures) on the risk of common multifactorial disorders of adult life.” UK Biobank is a cohort study, meaning that a large number of people will be recruited, as randomly as possible, and then followed over time.
The main features of the study design are as follows:
(a) The cohort will consist of at least 500,000 men and women recruited from the UK general population.
(b) The chosen age range is 40 to 69 (note that earlier versions, including the draft protocol referred to above, proposed an age range of 45 to 69).
(c) The initial follow-up period is 10 years.
(d) Participants will be recruited through their local general practitioners. Participants are expected to come from a broad range of socio-economic backgrounds and regions throughout the UK, with a wide range of exposures to factors of interest.
(e) The project will be conducted through the UK National Health Service.
(f) UK Biobank is funded by the Department of Health, the Medical Research Council, the Scottish Executive and The Wellcome Trust. The cost is approximately £40 million.

People registered with participating general practices will be requested to join the study by completing a self-administered questionnaire, attending an interview, undergoing examination by a research nurse and giving a blood sample, to enable DNA extraction at a later date. The protocol assumes that DNA extraction would be deferred and done as and when genotyping is required. The Office for National Statistics will provide routine follow-up data regarding cause-specific mortality and cancer incidence. Hospitalisation and general practice records will provide data regarding incident morbidity. Every two years a subset of 2,000 participants, and every five years the entire cohort, will be re-surveyed by postal questionnaire to update exposure data and to ascertain self-reported incident morbidity. It is envisaged that the main study design for assessing the combined effect of environment and genotype will consist of a series of case-control studies (see Appendix A) nested within the cohort.
Options for the selection of controls include an individually matched design or a panel of controls selected at random from the cohort, probably weighted by age and sex. An important principle underlying the design of the study and the statistical methods that will be applied is to minimise the assumptions made about the underlying nature of the relationship between genetic and environmental factors and the risk of disease. As a comprehensive prospective study with biological samples, UK Biobank is expected to contribute substantially to international knowledge regarding the combined effects of genotype and exposure on the risk of disease. Its design means that the study will provide a structure and resources for future research, and will enable researchers to address current and unforeseen scientific questions. While UK Biobank will collect and store the data, any analysis of the data in the future will require further funding.

1.6 A UK Biobank Simulation Model

In this section we will outline how we plan to simulate the UK Biobank project. We suppose that the study population is subdivided (or stratified) into subgroups with respect to: (a) different genotypes; (b) different levels of environmental exposures; and (c) other relevant factors such as sex. Genotype defines discrete categories, and we suppose that any environmental exposures or other factors defined on a continuous scale are grouped into discrete categories. Thus, we always have a small number of discrete subgroups (or strata). The life history of each participant will be represented by a multiple-state model, with states and transitions defining onset and possibly progression of the disease of interest. Some of the model parameters, namely the transition intensities, will be different in different strata, most obviously those associated with the disease of interest. These intensities are the key to the whole UK Biobank project, as well as our study.
(a) The real-life epidemiologist wants to estimate them (or in practice, odds ratios) from UK Biobank data, given a hypothesis about the effect of measured exposures on the disease.
(b) The real-life actuary wants to take the estimated intensities (or in practice, approximate them from published odds ratios) and use them in pricing and reserving.
(c) We wish to specify hypothetical but plausible dependencies of these intensities on genotype and other exposures, so that we can observe our model epidemiologist and model actuary at work.

The steps in simulating UK Biobank are then as follows.
(a) We choose the number of genotypes and the number of levels of environmental exposure, and also the frequencies with which each appears in the population. Thus we can model simple or complex genotypes and exposures, and allow them to be more or less common or rare. These define the subgroups or strata. The simplest example (used in the UK Biobank draft protocol) is to have two genotypes and two levels of environmental exposure. We also choose the intensities of onset of heart attack in each stratum to reflect the strength of the association between stratum and the risk of heart attack.
(b) We randomly ‘create’ 500,000 individuals, each equally likely to be male or female, with ages uniformly distributed in the range 40 to 69, and allocated to strata at random according to the chosen frequencies.
(c) The life history of each individual is modelled by simulating the times of any transitions between states in the model, as governed by the intensities. We record the times of any transitions taking place within the 10-year follow-up period of UK Biobank.

We implicitly assume that the 500,000 participants are independent in the statistical sense, which is unlikely to be true.
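As a rough illustration of simulation steps (a) to (c), the sketch below creates a small cohort in a 2×2 genotype-by-exposure layout and simulates the first transition out of the Healthy state over a 10-year follow-up. It is a minimal sketch under strong simplifying assumptions, not the simulation used in this thesis: the stratum frequencies, the constant (age-independent) intensities, and the names `STRATA`, `DEATH_INTENSITY` and `simulate_cohort` are all hypothetical, and only the first transition is recorded.

```python
import random
from collections import Counter

# Step (a): hypothetical 2x2 layout (two genotypes x two exposure levels).
# Frequencies and constant annual intensities are illustrative values only;
# they are NOT taken from the thesis or from UK Biobank.
STRATA = {
    ("G0", "E0"): {"freq": 0.60, "attack": 0.004},
    ("G0", "E1"): {"freq": 0.20, "attack": 0.006},
    ("G1", "E0"): {"freq": 0.15, "attack": 0.006},
    ("G1", "E1"): {"freq": 0.05, "attack": 0.012},
}
DEATH_INTENSITY = 0.010  # hypothetical, the same in every stratum
FOLLOW_UP = 10.0         # years of follow-up, as in the study design


def simulate_cohort(n, seed=1):
    """Simulate the first transition out of Healthy for n individuals."""
    rng = random.Random(seed)
    keys = list(STRATA)
    weights = [STRATA[k]["freq"] for k in keys]
    cohort = []
    for _ in range(n):
        # Step (b): sex, age at entry and stratum drawn at random.
        sex = rng.choice(("male", "female"))
        age = rng.uniform(40.0, 70.0)  # i.e. ages 40 to 69 last birthday
        stratum = rng.choices(keys, weights=weights)[0]
        # Step (c): with constant intensities, the waiting times to heart
        # attack and to death are competing exponential random variables.
        t_attack = rng.expovariate(STRATA[stratum]["attack"])
        t_death = rng.expovariate(DEATH_INTENSITY)
        time = min(t_attack, t_death, FOLLOW_UP)
        if time == FOLLOW_UP:
            event = "censored"
        elif t_attack < t_death:
            event = "heart_attack"
        else:
            event = "death"
        cohort.append({"sex": sex, "age": age, "stratum": stratum,
                       "event": event, "time": time})
    return cohort


cohort = simulate_cohort(2000)
print(Counter(person["event"] for person in cohort))
```

With age-dependent intensities, the exponential draws would have to be replaced by a method that respects the changing hazard (for example, inverting the integrated intensity numerically).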
The sample is so large that some related individuals are likely to be recruited by chance, and the method of recruitment (through general practices) also guarantees some level of familial and geographical clustering.

Chapter 2 A Model for Heart Attack

2.1 Specification of the Model

Heart attack, cancer and stroke are the three major illnesses generally covered under a critical illness (CI) insurance contract. Other minor CIs, sometimes included in the list, are: (a) coronary artery bypass, (b) major organ transplant, (c) chronic kidney failure, (d) multiple sclerosis, and (e) total permanent disability. Our main focus in this thesis will be on heart attacks. The objective is to build a simple but comprehensive model for heart attacks, which can then be used to represent hypothetical, but plausible, multifactorial gene-environment interactions. We can then analyse the impact of multifactorial disorders on CI insurance. The hazards of heart attacks have been widely studied by a number of research programmes. The interested parties include clinical researchers, the pharmaceutical industry, epidemiologists and also actuaries. However, as the remits of these studies are very different, it is difficult to develop a complete model of heart attacks from any one of them. For example, Gutiérrez and Macdonald (2003) give transition intensities, or hazard rates, of an individual suffering a heart attack. The authors were investigating CI insurance and the subject of interest was the incidence of different CIs. So, naturally, their analysis did not include survival after a heart attack. On the other hand, Capewell et al. (2000) investigate only survival after a heart attack.

Figure 2.1: A 4-state heart attack model. (States: 1 = Healthy, 2 = Heart Attack, 3 = Dead, 4 = Dead. Transitions: 1 to 2 with intensity $\lambda_{12}(x)$, 1 to 3 with intensity $\lambda_{13}(x)$, 2 to 4 with intensity $\lambda_{24}(x, t)$.)
In this chapter, our aim is to bring together all these results and develop a multiple-state model, which will enable us to track individuals from birth to any incidence of heart attack and follow them up until they die. We propose a simple 4-state heart attack model, given in Figure 2.1. All individuals are assumed to start in State 1, the Healthy state. From there, they may have a heart attack and move to State 2, or die and move to State 3. As our ultimate goal is to apply the model to CI insurance, we are interested in first heart attacks only, because these trigger a claim under a CI policy, so any subsequent heart attacks are ignored. The only possible transition from the Heart Attack state is death. It is convenient to distinguish deaths occurring after a heart attack, so States 3 and 4 are separate. A basic introduction to multiple-state models and transition intensities is given in Appendix A. Please refer to Woodward (1999) and Breslow and Day (1980) for a detailed discussion.

2.2 The Heart Attack Transition Intensity

Having specified the structure of the model, we now parameterise the transition intensities. First we specify the heart attack transition intensity in the general population, denoted $\lambda_{12}(x)$, separately for males and females. Gutiérrez and Macdonald (2003) fitted parametric functions to the transition intensities of all major critical illnesses, including heart attacks. The authors used numbers of first-ever cases of heart attacks between September 1991 and August 1992, taken from McCormick et al. (1995). The exact exposed to risk is calculated and a parametric function is fitted to it.

Figure 2.2: The transition intensity of all first heart attacks, by gender. (Axes: age in years against transition intensity; series: Male, Female.)
For males, it is given by:

$$\lambda_{12}(x) = \begin{cases} \exp(-13.2238 + 0.152568x) & \text{if } x \le 44 \\[4pt] \dfrac{x - 44}{49 - 44} \times \lambda_{12}(49) + \dfrac{49 - x}{49 - 44} \times \lambda_{12}(44) & \text{if } 44 < x < 49 \\[4pt] -0.01245109 + 0.000315605x & \text{if } x \ge 49 \end{cases} \tag{2.1}$$

and for females, it is given by:

$$\lambda_{12}(x) = 0.598694 \times \frac{0.15317^{15.6412} \exp(-0.15317x)\, x^{14.6412}}{\Gamma(15.6412)}. \tag{2.2}$$

These intensities are shown in Figure 2.2.

2.3 Mortality After First Heart Attacks

We will now focus on what happens after an individual has experienced his or her first heart attack. Figure 2.3 shows the part of the full 4-state model (Figure 2.1) that we are interested in. Here, the individuals who have had their first heart attack start off from State 2. We then observe these individuals until their death, at which point they move on to State 4. We are interested in the transition intensity from State 2 to State 4.

Figure 2.3: Subset of the model in Figure 2.1 to study survival after heart attacks. (States: 2 = Heart Attack, 4 = Dead. Transition: 2 to 4 with intensity $\lambda_{24}(x, t)$.)

In Section 2.3.1, we will review a number of published articles on survival after first heart attacks. In Section 2.3.2, we will identify a study which we believe to be the most appropriate for our model. In Section 2.3.3, we will propose a parametric form for $\lambda_{24}(x, t)$. Finally, in Section 2.3.4, we will discuss the fitted model and validate it against other relevant data available in the scientific literature.

2.3.1 Literature Review

There are a number of articles in published journals which study prognosis following heart attacks. The articles vary widely in their scope and focus. There are articles like Tunstall-Pedoe et al. (1999), which is an outstanding example of a population-based study, but concentrates only on short-term survival after heart attacks. As our interest lies in both short and long-term prognosis, we will review articles which observe the study subjects over longer periods of time. Capewell et al.
(2000) describe a retrospective cohort study involving 117,718 patients admitted to hospital with heart attacks in Scotland between 1986 and 1995. This is one of the largest population-based investigations which deals with both short and long-term prognosis following a first heart attack. The study classifies individuals according to: (a) age groups <55, 55–64, 65–74, 75–84, ≥85, (b) gender, (c) deprivation categories and (d) co-morbidity. Case-fatality rates are aggregated for each of these groups. So it is not possible to model the transition intensity in terms of all these variables. However, we are only interested in modelling post heart attack mortality in terms of age and gender. The case-fatality rates appear to be higher for women. The authors have confirmed that this apparently high case-fatality rate is due to the fact that the average age of women in the study was significantly higher than that of men. From the published case-fatality rates based on age-groups, it is clear that the rates depend on: (a) the age at first heart attack; and (b) the duration of survival after suffering the first heart attack. So we will model $\lambda_{24}(x, t)$ as a function of $x$ and $t$, where $x$ is the age at first heart attack and $t$ is the survival duration after the first heart attack. Goldberg et al. (1998) conducted a similar population-based investigation on patients admitted to all acute care hospitals in the Worcester, Massachusetts metropolitan area (1990 census population of 437,000) between 1975 and 1995. A total of 8,070 patients were studied in the investigation. The study classified individuals according to the study periods 1975–78, 1981–84, 1986–88, 1990–91 and 1993–95, and uses the same age-groups as Capewell et al. (2000).
The results published include:
(a) odds of dying during hospitalisation, and after 1 year and 2 years following hospital discharge, for all age-groups as compared to patients < 55 years;
(b) trends in the odds of dying during hospitalisation, and after 1 year and 2 years following hospital discharge, for each age-group;
(c) 1-year and 2-year death rates of hospital survivors by age-group; and
(d) a graph of long-term survival rates among hospital survivors by age-group.

Brønnum-Hansen et al. (2001) studied patients registered during 1982–91 in 11 municipalities in the western part of Copenhagen County, Denmark. During the study period, the average size of the population was 202,000 and a total of 3,926 first heart attacks were registered. The patients were classified according to gender, two age-groups (30–59 and 60–74) and three study periods (1982–84, 1985–87 and 1988–91). The published figures include:
(a) a table of fatal and non-fatal heart attack cases for each age-group and gender covering the full duration;
(b) a table of standardised mortality ratios (quotient of observed to expected number of deaths) and excess death rates (observed minus expected number of deaths per 1,000 person-years) by age-group and gender; and
(c) separate graphs of short-term (≤28 days) and long-term (28 days to 15 years) survival probabilities for men and women.

The authors point out that according to their findings the age-adjusted case-fatality rates after a first heart attack do not differ between the sexes. This agrees with the findings of Capewell et al. (2000). Among these articles, Capewell et al. (2000) appears most relevant on three counts. Firstly, the study population is Scottish, which makes the data the most appropriate for modelling heart attacks in the UK. Secondly, it has the largest study population, providing substantial credibility to the figures published.
Thirdly, the figures published in this article are presented in a suitable format and can be readily used for parameterising $\lambda_{24}(x, t)$. Most of the data in Goldberg et al. (1998) and Brønnum-Hansen et al. (2001) are presented in the form of graphs, odds ratios, standardised mortality ratios and excess death rates. Although results in these formats are not suitable for directly parameterising transition intensities, they can still be used as an independent check of the $\lambda_{24}(x, t)$ which we will finally propose.

2.3.2 Data

As mentioned in the previous section, Capewell et al. (2000) provide case-fatality rates for different age-groups for durations of 30 days, 1 year, 5 years and 10 years following first heart attacks. We will represent the five age-groups <55, 55–64, 65–74, 75–84, ≥85 by single representative ages, namely 50, 60, 70, 80 and 90 respectively. For our calculations, we will transform the case-fatality rates into survival probabilities, by subtracting the case-fatality rates from 1. The survival probabilities thus calculated are given in Table 2.1.

Table 2.1: Survival probabilities after first heart attack, by duration after first heart attack.

Age Range   Representative Age   0 days   30 days   1 year   5 years   10 years
<55         50                   1.000    0.949     0.921    0.834     0.737
55–64       60                   1.000    0.880     0.827    0.672     0.528
65–74       70                   1.000    0.771     0.677    0.465     0.312
75–84       80                   1.000    0.641     0.499    0.255     0.133
≥85         90                   1.000    0.545     0.351    0.123     0.052

Figure 2.4: The plots of the data from Table 2.1. (Axes: duration in years after first heart attack against survival probability; series: $P_{22}(x, x + t)$ for $x$ = 50, 60, 70, 80 and 90.)

2.3.3 Fitting a Parametric Function

Based on the data, we will first parameterise the survival function following a first heart attack. Let $P_{ij}(y, z)$ denote the conditional probability that a person is in State $j$ at age $z$ given that he or she was in State $i$ at age $y$.
Table 2.1 gives the survival probabilities $P_{22}(x, x + t)$ for specific values of $x$ and $t$, where $x$ denotes the age at first heart attack and $t$ denotes the survival duration after the first heart attack. To get an initial idea of the shape of the functions we are dealing with, we plot the data of Table 2.1 in Figure 2.4. For ease of comparison, the data-points for each age-group are connected by straight lines. As an initial guess of a suitable functional form, consider functions of the form $f_a(t) = 1/(1 + t^a)$. Figure 2.5 shows $f_a(t)$ for $a$ = 0.25, 0.50, 1.00, 2.00 and 4.00. Note that for all values of $a$, $f_a(0) = 1$, $f_a(1) = 0.5$ and $f_a(+\infty) = 0$. The smaller the value of $a$, the steeper is the initial descent of $f_a(t)$, but the flatter is the descent later.

Figure 2.5: The plots of $f(t) = 1/(1 + t^a)$ against $t$ for values of $a$ = 0.25, 0.50, 1.00, 2.00, 4.00.

A quick glance at Figure 2.4 reveals that we require $P_{22}(x, x + t)$ for the older ages to have both the initial and later descents steeper than those of the younger ages. It is apparent that a better fit can be achieved by combining the properties of $f_a(t)$ for both high and low values of $a$. So we propose an enhanced version of $f_a(t)$ for parameterising $P_{22}(x, x + t)$ as follows:

$$P_{22}(x, x + t) = \frac{1}{1 + a_x \times t^{b_x} + c_x \times t^{d_x}}, \tag{2.3}$$

where, without loss of generality, we assume $0 < b_x < 1$ and $d_x > 1$. Note that $a_x$ and $c_x$ are scaling parameters. Clearly, by definition, $P_{22}(x, x) = 1$. For each representative age, we have four data-points (Table 2.1) and four parameters ($a_x$, $b_x$, $c_x$ and $d_x$) to estimate.

Table 2.2: Parameter estimates.

Age Range   Representative Age   a        b        c        d
<55         50                   0.0684   0.1040   0.0174   1.1919
55–64       60                   0.1686   0.0911   0.0406   1.2280
65–74       70                   0.4001   0.1237   0.0770   1.3370
75–84       80                   0.8564   0.1732   0.1476   1.5504
≥85         90                   1.5181   0.2431   0.3309   1.6727
Solving these equations, we obtain the values of $a_x$, $b_x$, $c_x$ and $d_x$ given in Table 2.2. Given the parametric form of $P_{22}(x, x + t)$, the transition intensities $\lambda_{24}(x, t)$ can be derived using:

$$\lambda_{24}(x, t) = -\frac{d}{dt} \log P_{22}(x, x + t). \tag{2.4}$$

For the derivation of the above expression and the underlying assumptions, please refer to Appendix A. Hence:

$$\lambda_{24}(x, t) = \frac{a_x \times b_x \times t^{b_x - 1} + c_x \times d_x \times t^{d_x - 1}}{1 + a_x \times t^{b_x} + c_x \times t^{d_x}}. \tag{2.5}$$

Using the parameters from Table 2.2, the graphs of $P_{22}(x, x + t)$ and $\lambda_{24}(x, t)$ are provided in Figures 2.6 and 2.7, respectively. The graphs of $\lambda_{24}(x, t)$ for $x$ = 50, 60, 70, 80 and 90 and the transition intensity for both genders from ELT15 are given in Figure 2.8. From the graphs, we observe that both $P_{22}(x, x + t)$ and $\lambda_{24}(x, t)$ differ significantly between age-groups. To extend the definition of the transition intensity to all ages $x$ and durations $0 \le t \le 10$, we first assign $\lambda_{24}(x, t)$ for each age-group to its representative age. Then we define $\lambda_{24}(x, t) = \lambda_{24}(50, t)$ for $x < 50$, $\lambda_{24}(x, t) = \lambda_{24}(90, t)$ for $x > 90$, and interpolate linearly in $x$ between the given values for $50 < x < 90$. Capewell et al. (2000) do not give survival rates more than 10 years after the first heart attack. For durations of more than 10 years, to ensure that the force of mortality does not drop below general population mortality, we take the maximum of $\lambda_{24}(x, t)$ defined above and the general population mortality given by ELT15.

Figure 2.6: The plots of survival probabilities, $P_{22}(x, x + t)$, against duration after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years. (Axes: duration in years after first heart attack against survival probability.)
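The fitted functions can be checked numerically. The sketch below evaluates Equations 2.3 and 2.5 with the Table 2.2 parameters and applies the linear interpolation in age described above, for durations $0 < t \le 10$. It is a minimal sketch rather than the code used for this thesis: the function names are hypothetical, and the comparison with ELT15 for durations beyond 10 years is omitted because the ELT15 rates are not reproduced here.

```python
import math

# Table 2.2 parameters (a, b, c, d), keyed by representative age.
PARAMS = {
    50: (0.0684, 0.1040, 0.0174, 1.1919),
    60: (0.1686, 0.0911, 0.0406, 1.2280),
    70: (0.4001, 0.1237, 0.0770, 1.3370),
    80: (0.8564, 0.1732, 0.1476, 1.5504),
    90: (1.5181, 0.2431, 0.3309, 1.6727),
}


def survival(age, t):
    """P22(x, x + t) of Equation 2.3 at a representative age, t >= 0."""
    a, b, c, d = PARAMS[age]
    return 1.0 / (1.0 + a * t ** b + c * t ** d)


def intensity_at_representative_age(age, t):
    """lambda24(x, t) of Equation 2.5 at a representative age, t > 0."""
    a, b, c, d = PARAMS[age]
    numerator = a * b * t ** (b - 1.0) + c * d * t ** (d - 1.0)
    return numerator / (1.0 + a * t ** b + c * t ** d)


def intensity(x, t):
    """lambda24(x, t) for any age x: constant below 50 and above 90,
    linear interpolation in x between representative ages otherwise."""
    if x <= 50:
        return intensity_at_representative_age(50, t)
    if x >= 90:
        return intensity_at_representative_age(90, t)
    lower = 10 * int(x // 10)  # representative age at or below x
    weight = (x - lower) / 10.0
    return ((1.0 - weight) * intensity_at_representative_age(lower, t)
            + weight * intensity_at_representative_age(lower + 10, t))


# Fitted one-year survival probabilities, for comparison with Table 2.1.
for age in sorted(PARAMS):
    print(age, round(survival(age, 1.0), 3))
```

Because $b_x < 1$, the intensity is unbounded as $t \to 0^+$, so `intensity` should only be called with strictly positive durations.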
Figure 2.7: The plots of transition intensities, $\lambda_{24}(x, t)$, against duration after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years. (Axes: duration in years after first heart attack against transition intensity, on a logarithmic scale.)

Figure 2.8: Graphs of $\lambda_{24}(x, t)$, assigned to representative ages for each age group, and the force of mortality of the ELT15 life tables. (Axes: age in years against transition intensity, on a logarithmic scale; series: ELT15 Male, ELT15 Female, $\lambda_{24}(x, t)$ for $x$ = 50, 60, 70, 80 and 90.)

2.3.4 Discussion of the Fitted Model

First, let us compare the survival probabilities of the fitted model with those of the general population. The survival probabilities of men and women aged 50, 60, 70, 80 and 90 following ELT15 are shown in Figures 2.9 and 2.10. These can now be compared with the $P_{22}(x, x + t)$ given in Figure 2.6. For all ages, $P_{22}(x, x + t)$ is lower for all durations than the corresponding probability derived from ELT15. However, the slope of $P_{22}(x, x + t)$ is significantly lower than that of ELT15 for longer durations. This seems to suggest that survival for a long duration after a first heart attack implies better overall health as compared to the general population. We have also plotted $P_{22}(x, x + t)$ over the first 30 days following a first heart attack in Figure 2.11. This can be compared with Fig 1 of Brønnum-Hansen et al. (2001), which gives the graphs of survival probabilities for men and women combined for all ages over three different time periods. Although not directly comparable, the graphs show similar features. Figure 2.12 shows the survival probabilities for hospital survivors calculated from the $P_{22}(x, x + t)$.
Again we find that these graphs show similar features when compared with Fig 3 of Brønnum-Hansen et al. (2001) and Figure 2 of Goldberg et al. (1998).

Figure 2.9: The plots of survival probabilities of men aged 50, 60, 70, 80 and 90 following ELT15. (Axes: duration in years against survival probability.)

Figure 2.10: The plots of survival probabilities of women aged 50, 60, 70, 80 and 90 following ELT15. (Axes: duration in years against survival probability.)

Figure 2.11: The plots of survival probabilities of individuals aged 50, 60, 70, 80 and 90, over the first 30 days after a first heart attack. (Axes: duration in years against survival probability.)

Figure 2.12: The plots of survival probabilities of individuals aged 50, 60, 70, 80 and 90, who survived the first 30 days after a first heart attack. (Axes: duration in years, from 1 month to 10 years, against survival probability.)

Finally, we calculate the odds of dying within the first 30 days, 1 year and 2 years following a first heart attack. The numbers are given in Table 2.3.

Table 2.3: Odds of dying within first 30 days, one year and two years following a first heart attack.

Age (years)   30 days   1 year   2 years
<55           1.00      1.00     1.00
55–64         2.35      2.19     2.12
65–74         4.49      4.09     3.80
75–84         7.04      6.34     5.73
≥85           8.92      8.21     7.28

Table 2.4: Adjusted odds ratios and the corresponding 95% confidence intervals of dying within first 30 days, one year and two years following a first heart attack according to Goldberg et al. (1998).

Age (years)   30 days               1 year               2 years
<55           1.00 (–)              1.00 (–)             1.00 (–)
55–64         1.87 (1.30, 2.68)     1.78 (1.27, 2.51)    1.65 (1.25, 2.18)
65–74         4.00 (2.86, 5.60)     3.00 (2.18, 4.13)    2.83 (2.18, 3.68)
75–84         7.77 (5.55, 10.88)    4.55 (3.28, 6.30)    5.30 (4.05, 6.93)
≥85           11.67 (8.10, 16.81)   8.76 (6.12, 12.54)   10.57 (7.75, 14.42)
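The entries of Table 2.3 can be reproduced from the fitted model. Working through the arithmetic, they appear to correspond to the fitted probability of dying in each age-group divided by that of the <55 group (for example, at 1 year, (1 − 0.827)/(1 − 0.921) ≈ 2.19); this reading is inferred from the numbers and is not stated explicitly in the text. The sketch below is a minimal check under that assumption, with 30 days taken as 30/365 years; the function names are hypothetical.

```python
# Table 2.2 parameters (a, b, c, d), keyed by representative age.
PARAMS = {
    50: (0.0684, 0.1040, 0.0174, 1.1919),
    60: (0.1686, 0.0911, 0.0406, 1.2280),
    70: (0.4001, 0.1237, 0.0770, 1.3370),
    80: (0.8564, 0.1732, 0.1476, 1.5504),
    90: (1.5181, 0.2431, 0.3309, 1.6727),
}
# Durations in years; expressing 30 days as 30/365 is an assumption.
DURATIONS = {"30 days": 30.0 / 365.0, "1 year": 1.0, "2 years": 2.0}


def death_probability(age, t):
    """1 - P22(x, x + t), with P22 given by Equation 2.3."""
    a, b, c, d = PARAMS[age]
    return 1.0 - 1.0 / (1.0 + a * t ** b + c * t ** d)


def ratio_to_youngest(age, t):
    """Death probability relative to the <55 group (representative age 50)."""
    return death_probability(age, t) / death_probability(50, t)


for age in sorted(PARAMS):
    row = [round(ratio_to_youngest(age, t), 2) for t in DURATIONS.values()]
    print(age, row)
```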
Most of these fall within the 95% confidence intervals given in Tables II and IV of Goldberg et al. (1998). For reference, the relevant numbers are reproduced in Table 2.4. Based on the discussion above, the proposed model appears to be consistent with other relevant data relating to survival after first heart attack.

2.4 Mortality Before First Heart Attacks

Going back to the heart attack model proposed in Section 2.1, we have already parameterised $\lambda_{12}(x)$ and $\lambda_{24}(x, t)$. In this section, we will complete the model by parameterising $\lambda_{13}(x)$. This is the force of mortality affecting individuals who have not had a heart attack. To parameterise $\lambda_{13}(x)$, we make use of the mortality transition intensity affecting all individuals in the general UK population. Mortality of the general UK population is well studied and is analysed separately for males and females, and the latest intensities are given in ELT15. To make use of ELT15 for our investigation, we need to make the following observations. The 4-state heart attack model introduced in Section 2.1 is reproduced in Figure 2.13. Note that individuals are alive in States 1 and 2, and they move to States 3 and 4 when they die.

Figure 2.13: 4-state heart attack model - Grouping of states. (States: 1 = Healthy, 2 = Heart Attack, 3 = Dead, 4 = Dead, with transition intensities $\lambda_{12}(x)$, $\lambda_{13}(x)$ and $\lambda_{24}(x, t)$.)

The grouping shown in Figure 2.13 using dashed lines produces a simple 2-state mortality model, given in Figure 2.14. Here, States 1 and 2 are combined to produce State 5, the Alive state, while States 3 and 4 are grouped to form State 6, the Dead state. The resulting transition intensity from State 5 to State 6, $\lambda_{56}(x)$, is the force of mortality for the general population as given by ELT15 for the respective genders.
Recalling that the notation $P_{ij}(y, z)$ denotes the conditional probability that a person is in State $j$ at age $z$ given that he or she was in State $i$ at age $y$, the probability of an individual dying before attaining age $x$ in the 2-state mortality model can be expressed as:

$$P_{56}(0, x) = 1 - P_{55}(0, x) = 1 - \exp\left(-\int_0^x \lambda_{56}(s)\,ds\right). \tag{2.6}$$

Note that we can numerically compute the probability $P_{56}(0, x)$ for all ages $x$, as the transition intensity $\lambda_{56}(x)$ is known and given by ELT15.

Figure 2.14: A 2-state mortality model. (States: 5 = Alive, 6 = Dead. Transition: 5 to 6 with intensity $\lambda_{56}(x)$.)

Going back to our original 4-state heart attack model, we can express the same probability of dying in terms of the transition intensities pertinent to that model. We will assume that all individuals belong to State 1 when they are born. Note that, according to the definitions of the states in the heart attack model, all individuals are born healthy, as individuals who are alive and have not suffered a heart attack are classified as healthy. So in the 4-state heart attack model, the probability of a person dying before attaining age $x$ is given by:

$$P_{56}(0, x) = P_{13}(0, x) + P_{14}(0, x) = \int_0^x \left[ P_{11}(0, z)\lambda_{13}(z) + \int_0^z P_{11}(0, y)\lambda_{12}(y)P_{22}(y, z)\lambda_{24}(y, z - y)\,dy \right] dz, \tag{2.7}$$

where

$$P_{11}(0, z) = \exp\left[-\int_0^z \left(\lambda_{12}(y) + \lambda_{13}(y)\right) dy\right], \text{ and} \tag{2.8}$$

$$P_{22}(y, z) = \exp\left[-\int_0^{z - y} \lambda_{24}(y, s)\,ds\right]. \tag{2.9}$$

We see that $\lambda_{13}(x)$ is the only unknown above. So now we can solve for $\lambda_{13}(x)$ numerically using Equation 2.7. The iterative algorithm to solve for $\lambda_{13}(x)$ is outlined below.
(a) For a given age $x$, let us assume that $\lambda_{13}(y)$ is known for all $y < x$. Based on this information, we will now solve for $\lambda_{13}(x)$.
(b) Set an initial guess for the value of $\lambda_{13}(x)$. The better the initial guess, the faster will be the convergence to the solution. We have used simple linear extrapolation based on the values of $\lambda_{13}(x - \delta)$ and $\lambda_{13}(x - 2\delta)$ for a small value of $\delta > 0$.
(c) The approximate value of $\lambda_{13}(x)$ can then be used to calculate $P_{11}(0,x)$. We can now calculate $P_{13}(0,x) + P_{14}(0,x)$ using Equation 2.7, assuming that $\lambda_{13}(x)$ and $P_{11}(0,x)$ are known quantities. $P_{56}(0,x)$ can be computed independently and compared with the value of $P_{13}(0,x) + P_{14}(0,x)$ thus obtained. Depending on the magnitude and sign of the difference between these quantities, we can refine our initial estimate of $\lambda_{13}(x)$. Repeat this step with improved estimates of $\lambda_{13}(x)$ until convergence is achieved.

(d) The above process can be used to calculate $\lambda_{13}(x)$ for different ages progressively, starting from age 0. As a starting value, we have assumed $\lambda_{13}(0) = \lambda_{56}(0)$.

In the above steps, we have to compute a number of integrals numerically, for which we have used Romberg Integration. For a detailed discussion of Romberg Integration see Press et al. (2002). The integration involving $\lambda_{24}(x,t)$ in Equation 2.7 requires special treatment. The section of the integral we are interested in is given below:

\[ I = \int_0^z P_{11}(0,y)\,\lambda_{12}(y)\,P_{22}(y,z)\,\lambda_{24}(y,z-y)\,dy. \tag{2.10} \]

For convenience, we make a transformation $u = z - y$, which gives us the following integral:

\[ I = \int_0^z P_{11}(0,z-u)\,\lambda_{12}(z-u)\,P_{22}(z-u,z)\,\lambda_{24}(z-u,u)\,du. \tag{2.11} \]

Recall from Section 2.3.3 that, for all values of $x$ and $t \le 10$, $\lambda_{24}(x,t)$ is of the form

\[ \lambda_{24}(x,t) = \frac{a_x \times b_x \times t^{b_x - 1} + c_x \times d_x \times t^{d_x - 1}}{1 + a_x \times t^{b_x} + c_x \times t^{d_x}}, \tag{2.12} \]

where $0 < b_x < 1$ and $d_x > 1$. This implies that $\lim_{t \to 0^+} \lambda_{24}(x,t) = \infty$. Also, the smaller the value of $b_x$, the steeper the initial descent of $\lambda_{24}(x,t)$. Convergence is difficult to achieve for numerical integration of an unbounded function. If the integral exists, it is easier to deal with a transformed integrand which is bounded within the required range.

[Figure 2.15: The graph of the integrand in Equation 2.11. Horizontal axis: $u$ (0 to 80); vertical axis: Integrand (0 to 0.001).]
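The difficulty with an unbounded integrand can be illustrated on a toy problem. The sketch below is not the thesis computation: it assumes a pure power-law integrand $b\,u^{b-1}$ with an illustrative exponent $b = 0.3$, an upper limit $z = 80$ and a power substitution $w = u^{\alpha}$ with $\alpha = 0.05 < b$. The helper `romberg` is a plain textbook implementation written for this example. After the substitution the integrand becomes $(b/\alpha)\,w^{b/\alpha - 1}$, which is bounded at $w = 0$, so Romberg integration can be applied directly.

```python
def romberg(f, a, b, levels=10):
    """Romberg integration of f on [a, b] using `levels` rows of the tableau."""
    h = b - a
    prev = [0.5 * h * (f(a) + f(b))]  # trapezoid rule with a single panel
    for i in range(1, levels):
        h /= 2.0
        # Evaluate f only at the midpoints introduced by halving the step.
        mid_sum = sum(f(a + (2 * k + 1) * h) for k in range(2 ** (i - 1)))
        row = [0.5 * prev[0] + h * mid_sum]
        # Richardson extrapolation along the row.
        for j in range(1, i + 1):
            row.append(row[j - 1] + (row[j - 1] - prev[j - 1]) / (4 ** j - 1))
        prev = row
    return prev[-1]


# Illustrative (assumed) values, not taken from the fitted model.
b_exp = 0.3    # singular exponent: the toy integrand b*u**(b-1) blows up at u = 0
alpha = 0.05   # power used in the substitution w = u**alpha, with alpha < b_exp
z = 80.0       # upper limit of the original integral


def transformed(w):
    # b*u**(b-1) du with u = w**(1/alpha) simplifies to (b/alpha)*w**(b/alpha - 1) dw,
    # which is finite at w = 0 because b/alpha - 1 > 0.
    return (b_exp / alpha) * w ** (b_exp / alpha - 1.0)


value = romberg(transformed, 0.0, z ** alpha)
exact = z ** b_exp  # closed form of the toy integral, used as a check
```

Evaluating the untransformed integrand at `u = 0` would fail outright, which is why the endpoint evaluation in Romberg integration needs the bounded form.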
For the type of function given in Equation 2.12 we can use a transformation of the form $w = u^{\alpha}$, where $\alpha < b_x$ for all $x$. For our computations, we have chosen $\alpha = 0.05$. Using this transformation, Equation 2.11 becomes

\[ I = \int_0^{z^{\alpha}} P_{11}(0, z - w^{1/\alpha})\,\lambda_{12}(z - w^{1/\alpha})\,P_{22}(z - w^{1/\alpha}, z)\,\lambda_{24}(z - w^{1/\alpha}, w^{1/\alpha})\,\frac{1}{\alpha}\,w^{\frac{1}{\alpha} - 1}\,dw. \tag{2.13} \]

We show the effect of this transformation in Figures 2.15 and 2.16. Figure 2.15 gives the graph of the integrand in Equation 2.11 before the transformation and Figure 2.16 shows the graph of the integrand in Equation 2.13 after the transformation. For both graphs, $z$ has been set to 80. From the figures we can see that the transformation has successfully converted the unbounded function in Figure 2.15 to the bounded function in Figure 2.16. Now we can successfully apply Romberg Integration to evaluate the transformed integral in Equation 2.13.

[Figure 2.16: The graph of the integrand in Equation 2.13. Horizontal axis: $w$ (0 to 1.2); vertical axis: Transformed Integrand (0 to 0.1).]

Using the techniques outlined above, we have obtained estimates of $\lambda_{13}(x)$ for both males and females. They are shown in Figure 2.17. For comparison, we have also included the gender-specific forces of mortality given in ELT15.

[Figure 2.17: Transition intensities of non-heart-attack deaths plotted along with ELT15 for both males and females. Two panels (Male, Female); horizontal axis: Age (years), 0 to 80; vertical axis: Transition Intensity, logarithmic scale from 0.0001 to 1; curves: ELT15 and Non-heart-attack deaths.]

Chapter 3
Gene-Environment Interaction

3.1 Definition of Strata: A Simple Example

The parameters of the heart attack model estimated above are supposed to apply to the general population.
However, the general population is divided into strata according to genotype, environmental exposures and other factors, and we now suppose that the intensity of heart attack $\lambda^s_{12}(x)$ depends on the stratum $s$.

In this chapter, we will introduce the simplest possible gene-environment interactions into our model. We suppose that there is a single genetic locus with two alleles, denoted G and g, therefore just two genotypes. Also, there are just two levels of environmental exposures, denoted E and e (a simple example might be E = ‘smoker’ and e = ‘non-smoker’). This simple model can be used as a stepping stone to study higher dimensional multifactorial models. Note that the UK Biobank draft protocol used the same assumptions in its examples, despite the fact that the project aims to study complex multifactorial disorders.

We will suppose that G and E are adverse exposures, while g and e are beneficial. Therefore, we have four strata for each sex (ge, gE, Ge and GE) and eight in total. We must choose plausible values for the frequencies with which each stratum is present in the population, and the stratum-specific heart attack intensities.

Since, unlike the study of single-gene disorders, we are considering common risk factors for common diseases, let us assume that the probability that a person possesses genotype G is 0.1, and the probability that a person has environmental exposure E is also 0.1. Assuming independence, the four strata (for each sex) ge, gE, Ge and GE occur with frequencies 0.81, 0.09, 0.09 and 0.01 respectively.

Table 3.5: The factor $\rho_s$, in Equation (3.14), for each gene-environment combination.

       E      e
G     1.3    1.1
g     0.9    0.7

We will suppose that the heart attack intensity in each stratum is proportional to the population average intensity. For stratum $s$, set:

\[ \lambda^s_{12}(x) = k \times \rho_s \times \lambda_{12}(x), \tag{3.14} \]

where $\lambda_{12}(x)$ is the population intensity given in Section 2.2 and $k \times \rho_s$ is the constant of proportionality for each stratum.
We suppose, for clarity, that $\rho_s$ does not depend on sex, but the constant $k$ does. Again, noting that our interest is in genotypes of modest penetrance, we choose the values of $\rho_s$ given in Table 3.5. Then, we choose $k$ so that the strata-specific heart attack intensities are consistent, in aggregate, with the population heart attack intensities, for males and females separately. Let the proportion of the healthy population in stratum $s$ at age $x$ be $w_s(x)$. Then:

\[ \lambda_{12}(x+t) = \frac{\sum_s w_s(x) \times \exp\left(-\int_0^t \lambda^s_{12}(x+y)\,dy\right) \times \lambda^s_{12}(x+t)}{\sum_s w_s(x) \times \exp\left(-\int_0^t \lambda^s_{12}(x+y)\,dy\right)}. \tag{3.15} \]

Substituting Equation (3.14) in Equation (3.15), we get:

\[ \lambda_{12}(x+t) = \frac{\sum_s w_s(x) \times \left[\exp\left(-\int_0^t \lambda_{12}(x+y)\,dy\right)\right]^{k\rho_s} k \times \rho_s \times \lambda_{12}(x+t)}{\sum_s w_s(x) \times \left[\exp\left(-\int_0^t \lambda_{12}(x+y)\,dy\right)\right]^{k\rho_s}}. \tag{3.16} \]

From Equation (3.16) we see that $k$ ought to depend on a specific choice of age $x$ and duration $t$. However, to keep the model simple we will assume that $k$ is constant and calculate it from Equation (3.16) for a representative choice of age and duration. Given that the UK Biobank protocol proposes an age range of 40 to 69 and a follow-up period of 10 years, we have chosen $x = 60$ and $t = 5$. If we assume that the weights $w_s(x)$ are equal to the population frequencies of each stratum, then for males $k = 1.317274$ and for females $k = 1.316406$. The constants of proportionality ($k \times \rho_s$) in Equation (3.14) are given in Table 3.6 for future reference.

Table 3.6: The multipliers $k \times \rho_s$ for each stratum.

Stratum   Male    Female
ge        0.922   0.921
gE        1.186   1.185
Ge        1.449   1.448
GE        1.712   1.711

Table 3.7: The true relative risks for each stratum, relative to the baseline ge stratum.

Stratum   Male    Female
ge        1.000   1.000
gE        1.286   1.286
Ge        1.571   1.571
GE        1.857   1.857

Having formulated a relationship between strata and the risk of heart attack, we now consider the quantities likely to be estimated by epidemiologists.
We have the advantage of being able to compute their true values, because we know the true intensities. From now on, we define the baseline population as the most common stratum, namely the gene-environment combination ge.

(a) The relative risk in stratum $s$, with respect to the baseline stratum ge, is denoted $r_s$ and is:

\[ r_s = \frac{k \times \rho_s}{k \times \rho_{ge}} = \frac{\rho_s}{\rho_{ge}}. \tag{3.17} \]

The values of $r_s$ are given in Table 3.7.

(b) The odds ratio at age $x$ in stratum $s$, with respect to the baseline stratum ge, based on 1-year probabilities, is denoted $\psi_s(x)$ and is given by:

\[ \psi_s(x) = \left( \frac{P^s_{12}(x,x+1)}{1 - P^s_{12}(x,x+1)} \right) \Bigg/ \left( \frac{P^{ge}_{12}(x,x+1)}{1 - P^{ge}_{12}(x,x+1)} \right) \tag{3.18} \]

where $P^s_{12}(x,x+1)$ is the conditional probability that a person in stratum $s$ who was healthy at age $x$ will suffer a heart attack before age $x+1$.

We have verified (not shown here) that the odds ratios computed using Equation (3.18) do not vary significantly with age and are approximately equal to the corresponding relative risks. The latter is not surprising, as we have used 1-year probabilities to calculate the odds ratios. For details on relative risks and odds ratios, see Appendix A, or Woodward (1999) or Breslow and Day (1980).

3.2 A Sample Realisation of UK Biobank

With the parameterised model, we simulated the life histories of 500,000 people recruited to UK Biobank and followed up for 10 years. The life histories of the first 20 people are shown in Table 3.8.

Consider person No. 2 in Table 3.8. He is a male with the adverse allele G and is exposed to the beneficial environment e. He entered the study in State 1 as a healthy individual at age 58.74. During the follow-up period he had a heart attack at age 63.89 and moved to State 2. Finally, he died at age 63.94 and moved to State 4. The numbers of people in each state at the end of the 10-year follow-up period are given in Table 3.9.
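To make the idea of a simulated life history concrete, here is a minimal sketch of one way to generate a history in the 4-state model. It is a simplification rather than the simulation used in this work: the intensities `lam12`, `lam13` and `lam24` are hypothetical constants, whereas the model above uses age- and duration-dependent intensities, and the function name `simulate_life` is invented for the illustration. With constant intensities, the time to leave State 1 is exponential with rate `lam12 + lam13`, and the destination is chosen in proportion to the two intensities.

```python
import random


def simulate_life(entry_age, lam12, lam13, lam24, follow_up=10.0, rng=random):
    """Return [(age, state), ...] for one person over the follow-up period.

    lam12, lam13, lam24 are HYPOTHETICAL constant intensities (per year).
    States: 1 = Healthy, 2 = Heart Attack, 3 = Dead, 4 = Dead after heart attack.
    """
    history = [(entry_age, 1)]
    end_age = entry_age + follow_up

    # Competing risks out of State 1: exponential waiting time with the total rate.
    age = entry_age + rng.expovariate(lam12 + lam13)
    if age >= end_age:
        return history  # still healthy at the end of follow-up

    if rng.random() < lam12 / (lam12 + lam13):
        history.append((age, 2))  # first heart attack
        death_age = age + rng.expovariate(lam24)
        if death_age < end_age:
            history.append((death_age, 4))  # death after heart attack
    else:
        history.append((age, 3))  # death without a heart attack
    return history


# Usage with a fixed seed and made-up intensities, for a person entering at age 58.74.
rng = random.Random(2007)
histories = [simulate_life(58.74, 0.01, 0.015, 0.10, rng=rng) for _ in range(1000)]
```

Each returned history has the same shape as a row of Table 3.8: an entry age in State 1 followed by zero, one or two dated transitions.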
3.3 Epidemiological Analysis

With 500,000 simulated life histories, we can now carry out one or more typical epidemiological analyses. Apart from the life histories, the following information is available to the epidemiologist:

(a) the framework of the UK Biobank project;

(b) the structure of the 4-state Heart Attack model given in Section 2.1;

(c) the transition intensities given in Sections 2.2–2.4;

(d) the stratum to which each person is allocated; and

(e) the proportion $w_s(x)$ of individuals in each stratum at a particular age $x$, say 60.

The UK Biobank protocol suggests that the combined effect of environment and genotype be analysed using matched case-control studies nested within the cohort.

Table 3.8: The simulated life histories of the first 20 (of 500,000) individuals showing their genders, exposure to environmental factors, genotypes and the times and types of all transitions made within 10 years.

ID   Sex   E/e   G/g   Age     State
1    M     e     g     41.10   1
2    M     e     G     58.74   1
3    M     e     g     52.27   1
4    M     e     g     68.39   1
5    F     e     G     60.94   1
6    M     e     g     62.49   1
7    M     e     g     55.50   1
8    F     e     G     58.95   1
9    M     e     g     65.67   1
10   M     e     g     49.79   1
11   F     E     g     45.43   1
12   F     e     g     57.58   1
13   F     e     g     59.68   1
14   F     E     g     55.14   1
15   F     e     g     42.93   1
16   M     e     g     56.23   1
17   F     e     g     62.84   1
18   M     e     g     62.29   1
19   F     e     g     43.69   1
20   M     e     g     45.16   1

Later transitions (Age, State): ID 2: (63.89, 2) then (63.94, 4). The other recorded transitions are (63.81, 2), (68.18, 3), (61.57, 3) and (69.58, 3); the rows to which they belong are not recoverable here.

Table 3.9: Number of individuals in each state at the end of the 10-year follow-up period.

Sex      G/g   E/e   State 1   State 2   State 3   State 4   Total
Male     G     E       1,871       126       356       115     2,468
Male     G     e      17,579       928     3,219       934    22,660
Male     g     E      17,588       775     3,236       702    22,301
Male     g     e     162,474     5,426    29,610     5,002   202,512
Female   G     E       2,178        70       214        52     2,514
Female   G     e      19,746       397     2,021       408    22,572
Female   g     E      19,811       367     2,095       330    22,603
Female   g     e     178,718     2,320    18,891     2,441   202,370
Total                419,965    10,409    59,642     9,984   500,000

In a case-control study, the first step is to define the cases and controls. Here, clearly, the cases are persons who had first heart attacks during the study period.
In real studies, epidemiologists will face problems such as missing data and cost constraints, and in most circumstances they will use only a subset of all cases for their analysis. Here, we have no such problems, unless we choose to model them. So, in the first instance, we will include all cases in the analysis. Later, we will consider the more realistic possibility that a subset of all cases is used.

An appropriate matching strategy is particularly important for a matched case-control study. Firstly, we match controls with cases by age. Suppose, for example, that we are comparing stratum $s$ with the baseline stratum ge, and that a case entered the study at age $x$ last birthday and had a heart attack at age $x+t$ last birthday. A matched control is a person chosen randomly from persons in these two strata who also entered the study at age $x$ last birthday and remained healthy at least until age $x+t+1$ last birthday. Once chosen as a control, that person cannot be chosen as a control again. As controls are plentiful compared with cases, we will match 5 controls to each case, called a 1:5 matching strategy.

In Section 1.5, we mentioned that the genotyping of individuals will be done as and when it is required. So, it might be necessary to genotype a large number of people to ensure that enough controls are available for a 1:5 case-control study. Other matching strategies with fewer controls per case will obviously be cheaper to implement.

To calculate odds ratios, we need to group ages sensibly. Note that epidemiological studies often use quite wide age groups, much wider than actuaries are accustomed to using. We will use 5-year age bands as a reasonable compromise between accuracy and sample size. Note that the definition of the ages of the controls needs to be adjusted appropriately to maintain consistency. The results are given in Table 3.10.
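The 1:5 matching rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: the record layout (dictionary keys `id`, `entry_age`, `exit_age`) and the function name `match_controls` are hypothetical, `pool` is assumed to contain only persons from the two strata being compared, and a person still healthy at the end of follow-up (`exit_age` of `None`) is assumed to be eligible.

```python
import random


def match_controls(cases, pool, n_controls=5, rng=random):
    """Pick up to n_controls unused controls per case, matched on entry age last birthday.

    Each person is a dict with HYPOTHETICAL keys:
      'id'        : unique identifier
      'entry_age' : exact age at entry to the study
      'exit_age'  : exact age on leaving State 1 (heart attack or death),
                    or None if still healthy at the end of follow-up
    """
    used = set()
    matched = {}
    for case in cases:
        x = int(case["entry_age"])          # age last birthday at entry
        attack_age = int(case["exit_age"])  # age last birthday at heart attack (x + t)
        eligible = [
            p for p in pool
            if p["id"] not in used                       # a control is never reused
            and int(p["entry_age"]) == x                 # same entry age last birthday
            and (p["exit_age"] is None                   # healthy until at least x + t + 1
                 or int(p["exit_age"]) >= attack_age + 1)
        ]
        chosen = rng.sample(eligible, min(n_controls, len(eligible)))
        used.update(p["id"] for p in chosen)
        matched[case["id"]] = [p["id"] for p in chosen]
    return matched


# Small made-up example: two cases and a pool of ten potential controls.
cases = [
    {"id": 0, "entry_age": 58.74, "exit_age": 63.89},
    {"id": 1, "entry_age": 58.20, "exit_age": 60.10},
]
pool = [{"id": 100 + i, "entry_age": 58.0 + 0.1 * i, "exit_age": None} for i in range(8)]
pool.append({"id": 200, "entry_age": 58.10, "exit_age": 63.50})  # left State 1 at age 63
pool.append({"id": 201, "entry_age": 57.90, "exit_age": None})   # wrong entry age
matched = match_controls(cases, pool, rng=random.Random(1))
```

In this example the second case receives fewer than five controls because the pool is exhausted, which mirrors the remark that a large number of people may need to be genotyped to support a 1:5 design.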
We can see no particular trend with respect to age, so we calculate the age-adjusted odds ratio for each stratum (a weighted average of the age-specific odds ratios, see the Mantel-Haenszel method described in Appendix A or Woodward (1999)), shown in Table 3.11.

Table 3.10: Odds ratios with respect to the ge stratum as baseline, based on a 1:5 matching strategy using all cases and 5-year age groups. Approximate 95% confidence intervals are shown in brackets. There were no cases among females age 45–49 in stratum GE.

Males
Age     gE                    Ge                    GE
40–44   1.043 (0.527,2.065)   2.628 (1.561,4.423)   2.375 (0.712,7.917)
45–49   1.069 (0.816,1.400)   1.670 (1.317,2.118)   1.929 (0.940,3.959)
50–54   1.330 (1.117,1.583)   1.578 (1.336,1.865)   1.725 (1.121,2.654)
55–59   1.358 (1.168,1.579)   1.665 (1.448,1.914)   2.133 (1.486,3.062)
60–64   1.175 (1.020,1.352)   1.708 (1.507,1.935)   1.976 (1.417,2.753)
65–69   1.267 (1.116,1.438)   1.592 (1.416,1.789)   1.721 (1.251,2.368)
70–74   1.362 (1.179,1.574)   1.542 (1.348,1.764)   1.907 (1.334,2.726)
75–79   1.487 (1.160,1.907)   1.534 (1.187,1.983)   1.667 (0.910,3.052)

Females
Age     gE                    Ge                    GE
40–44   1.167 (0.301,4.520)   1.333 (0.463,3.836)   5.000 (0.313,79.942)
45–49   0.944 (0.523,1.702)   1.869 (1.139,3.067)   –
50–54   0.947 (0.659,1.361)   1.298 (0.929,1.814)   4.167 (1.800,9.644)
55–59   1.243 (0.967,1.597)   1.280 (0.999,1.641)   2.324 (1.282,4.211)
60–64   1.634 (1.343,1.988)   1.867 (1.538,2.267)   1.842 (1.112,3.053)
65–69   1.321 (1.111,1.571)   1.601 (1.359,1.887)   2.457 (1.637,3.689)
70–74   1.257 (1.045,1.511)   1.538 (1.296,1.825)   2.354 (1.528,3.626)
75–79   1.203 (0.893,1.620)   1.220 (0.896,1.659)   1.773 (0.788,3.986)

Table 3.11: The age-adjusted odds ratios calculated for both males and females.

Strata   Male                  Female
gE       1.285 (1.209,1.365)   1.298 (1.188,1.418)
Ge       1.625 (1.536,1.719)   1.538 (1.413,1.674)
GE       1.880 (1.620,2.182)   2.250 (1.814,2.790)

We can compare the estimated age-adjusted odds ratios with the true odds ratios given in Table 3.7. The estimates are better for strata gE and Ge, where the numbers of cases are higher than in stratum GE. However, all the true odds ratios lie within the 95% confidence intervals given in Table 3.11.

3.4 An Actuarial Investigation

The actuary starts with the model of Figure 2.1 in mind, and wishes to estimate the intensity $\lambda^s_{12}(x)$ for each stratum. We assume, realistically, that the best available data are the published odds ratios. The ‘estimation’ procedure, therefore, consists of finding a reasonably robust way to estimate transition intensities from odds ratios. There is no simple mathematical relationship, so approximations must be made.

Supposing that the actuary knows the rates of heart attack in the general population $\lambda_{12}(x)$ (separately for males and females), a simple assumption is that the heart attack intensity for each stratum is proportional to $\lambda_{12}(x)$. In stratum $s$, define:

\[ \gamma^s_{12}(x) = c_s(x) \times \lambda_{12}(x) \tag{3.19} \]

where $\gamma^s_{12}(x)$ is the actuary’s ‘estimate’ of $\lambda^s_{12}(x)$. Assuming that the odds ratios (denoted $\psi_s(x)$) are good approximations of the relative risks, which is reasonable as long as the age groups are not too broad, we have:

\[ \psi_s(x) = \frac{\gamma^s_{12}(x)}{\gamma^{ge}_{12}(x)} = \frac{c_s(x)}{c_{ge}(x)} \tag{3.20} \]

which leads to:

\[ c_s(x) = \psi_s(x) \times c_{ge}(x). \tag{3.21} \]

As observed from Table 3.10, the odds ratios do not appear to depend strongly on age. So we further assume that $c_s(x)$ is a constant $c_s$ for all ages (hence also $\psi_s(x)$ is a constant $\psi_s$), and therefore:

\[ c_s = \psi_s \times c_{ge} \tag{3.22} \]

where $\psi_s$ is the age-adjusted odds ratio. Thus Equation (3.19) becomes:

\[ \gamma^s_{12}(x) = c_{ge} \times \psi_s \times \lambda_{12}(x). \tag{3.23} \]

Now Equation (3.16) can be written:

\[ \lambda_{12}(x+t) = \frac{\sum_s w_s(x) \exp\left(-\int_0^t c_{ge}\,\psi_s\,\lambda_{12}(x+y)\,dy\right) c_{ge}\,\psi_s\,\lambda_{12}(x+t)}{\sum_s w_s(x) \exp\left(-\int_0^t c_{ge}\,\psi_s\,\lambda_{12}(x+y)\,dy\right)}. \tag{3.24} \]

Let us assume that at age $x = 60$, the $w_s(x)$ are given by the population frequencies of the respective strata. Now we can solve Equation (3.24) for the multiplier $c_{ge}$ for a particular choice of age $x$ and any duration $t$. Then we can use Equation (3.22) to obtain $c_s$ for $s$ = gE, Ge and GE. We find (not shown here) that the results are very similar for different values of $t$. In Table 3.12, we show the ‘estimated’ $c_s$ for representative age $x = 60$ and duration $t = 5$, based on the age-adjusted odds ratios in Table 3.11.

Table 3.12: The estimated multipliers $c_s$ for each stratum.

Stratum   Male    Female
ge        0.918   0.920
gE        1.179   1.194
Ge        1.492   1.415
GE        1.726   2.070

These values can be compared with the true constants of proportionality of the underlying model given in Table 3.6. They are in good agreement for strata $s$ = ge, gE and Ge. The agreement for stratum $s$ = GE is not so good, but it was based on a small number of cases, 241 males and 122 females.

3.5 Premium Rating for Critical Illness Insurance

3.5.1 A Critical Illness Model

The actuary will use the intensities $\gamma^s_{12}(x)$ ‘estimated’ in Section 3.4 to calculate CI insurance premiums. Gutiérrez & Macdonald (2003) obtained the following model for critical illness insurance based on medical studies and population data. Full references can be found in that paper. The structure of the model, as outlined in the paper, is given in Figure 3.18. The relevant transition intensities are listed below.

[Figure 3.18: A full critical illness model for gender $s$. State 0 = Healthy, with transitions to State 1 = Heart Attack ($\mu^s_{01}(x)$), State 2 = Cancer ($\mu^s_{02}(x)$), State 3 = Stroke ($\mu^s_{03}(x)$), State 4 = Other CI ($\mu^s_{04}(x)$) and State 5 = Dead ($\mu^s_{05}(x)$).]

(a) For males, the age-dependent transition intensities governing the incidence of heart attack are given below:

\[ \mu^m_{01}(x) = \begin{cases} \exp(-13.2238 + 0.152568x) & \text{if } x \le 44 \\ \dfrac{x - 44}{49 - 44} \times \mu^m_{01}(49) + \dfrac{49 - x}{49 - 44} \times \mu^m_{01}(44) & \text{if } 44 < x < 49 \\ -0.01245109 + 0.000315605x & \text{if } x \ge 49 \end{cases} \tag{3.25} \]

For females, the age-dependent transition intensities are:

\[ \mu^f_{01}(x) = 0.598694 \times \frac{0.15317^{15.6412} \exp(-0.15317x)\,x^{14.6412}}{\Gamma(15.6412)} \tag{3.26} \]

We also need the 28-day survival factors following heart attacks. This relates to the common contractual condition that payment depends on surviving for 28 days. Let $p^s_{01}(x)$ be the 28-day survival probabilities for gender $s$, and $q^s_{01}(x) = 1 - p^s_{01}(x)$. For females, at ages 20–80, $q^f_{01}(x) = 0.21$, and for males the values are given in Table 3.13.

Table 3.13: 28-Day mortality rates, $q^m_{01}(x) = 1 - p^m_{01}(x)$, for males following heart attacks.

age x    q^m_01(x)
20–39    0.15
40–42    0.16
43–46    0.17
47–52    0.18
53–56    0.19
57       0.20
58–59    0.21
60–61    0.22
62–64    0.23
65–74    0.24
75–79    0.25
80+      0.26

The 28-day mortality rates given in Table 3.13 can be compared against the survival probabilities obtained from Capewell et al. (2000) and given in Table 2.1. (Note that the odds ratios given in Table 2.3 are derived from the survival probabilities in Table 2.1.) As compared with the Capewell et al. (2000) data, the 28-day mortality rates in Table 3.13 appear slightly higher at younger ages and lower at older ages. However, to maintain consistency with the CI insurance model we will use the rates in Table 3.13 to calculate the CI insurance premium rates.
(b) For males, the age-dependent transition intensities governing the incidence of cancer are given below:

\[ \mu^m_{02}(x) = \begin{cases} \exp(-11.25 + 0.105x) & \text{if } x \le 51 \\ \dfrac{x - 51}{60 - 51} \times \mu^m_{02}(60) + \dfrac{60 - x}{60 - 51} \times \mu^m_{02}(51) & \text{if } 51 < x < 60 \\ -0.2591585 - 0.01247354x + 0.0001916916x^2 - 8.952933 \times 10^{-7} x^3 & \text{if } x \ge 60 \end{cases} \tag{3.27} \]

For females, the age-dependent transition intensities are:

\[ \mu^f_{02}(x) = \begin{cases} \exp(-10.78 + 0.123x - 0.00033x^2) & \text{if } x < 53 \\ -0.01545632 + 0.0003805097x & \text{if } x \ge 53 \end{cases} \tag{3.28} \]

(c) For males, the age-dependent transition intensities governing the incidence of stroke are given below:

\[ \mu^m_{03}(x) = \exp(-16.9524 + 0.294973x - 0.001904x^2 + 0.00000159449x^3) \tag{3.29} \]

For females, the age-dependent transition intensities are:

\[ \mu^f_{03}(x) = \exp(-11.1477 + 0.081076x) \tag{3.30} \]

We need the 28-day survival factors following stroke. Let $p^s_{03}(x)$ be the 28-day survival probabilities for gender $s$, and $q^s_{03}(x) = 1 - p^s_{03}(x)$. For both males and females, $q^s_{03}(x) = 0.002x/0.9$.

(d) The transition intensities for other minor causes of critical illnesses amount to 15% of those arising from cancer, heart attack and stroke. So the aggregate rate of critical illness claims, for gender $s$, is:

\[ \mu^s(x) = 1.15\left(\mu^s_{01}(x) \times p^s_{01}(x) + \mu^s_{02}(x) + \mu^s_{03}(x) \times p^s_{03}(x)\right) \tag{3.31} \]

(e) Population mortality rates, given by English Life Tables No. 15 (ELT15), were adjusted to exclude deaths which would have followed a critical illness insurance claim.

3.5.2 Premium Rating for Critical Illness Insurance

We will assume that all intensities except those for heart attack are as given here. For heart attack, we use the intensities $\gamma^s_{12}(x)$. We compute expected present values by solving Thiele’s differential equations numerically, with a force of interest of $\delta = 0.044017$ (see Norberg (1995)).

Table 3.14 shows the true premiums for the strata $s$ = gE, Ge and GE, as a percentage of the premiums for stratum ge, for males and females and for different ages and terms. Here, ‘true’ means that they have been computed using the intensities of Chapter 2, not the actuary’s estimates.

Table 3.14: The true critical illness insurance premiums for different strata as a percentage of those for stratum ge.

Males
Stratum   Age   Term 5   Term 15   Term 25   Term 35
gE        45    112%     111%      109%      107%
gE        55    110%     108%      107%
gE        65    107%     106%
gE        75    106%
Ge        45    124%     121%      117%      115%
Ge        55    119%     116%      114%
Ge        65    114%     112%
Ge        75    111%
GE        45    136%     131%      126%      122%
GE        55    129%     124%      121%
GE        65    120%     118%
GE        75    117%

Females
Stratum   Age   Term 5   Term 15   Term 25   Term 35
gE        45    103%     103%      104%      104%
gE        55    104%     105%      105%
gE        65    105%     106%
gE        75    106%
Ge        45    105%     107%      108%      108%
Ge        55    109%     110%      110%
Ge        65    111%     111%
Ge        75    111%
GE        45    108%     110%      112%      112%
GE        55    113%     115%      115%
GE        65    116%     117%
GE        75    117%

Table 3.15 then shows the corresponding premiums, again as a percentage of those charged for stratum ge, using the actuary’s estimates from Section 3.4. The results are similar to those in Table 3.14. Comparing the actuary’s estimates with the true CI insurance premiums, we can see that the estimates are very good for stratum gE. For stratum Ge, the estimates are also within ±2% of the true values. However, the estimates are not as accurate for females in stratum GE. As mentioned before, stratum GE had relatively few cases, resulting in high volatility of the estimated values.

Table 3.15: The actuary’s estimated critical illness insurance premiums for different strata as a percentage of those for stratum ge.

Males
Stratum   Age   Term 5   Term 15   Term 25   Term 35
gE        45    112%     110%      109%      107%
gE        55    110%     108%      107%
gE        65    107%     106%
gE        75    106%
Ge        45    126%     123%      119%      116%
Ge        55    121%     117%      115%
Ge        65    115%     113%
Ge        75    112%
GE        45    137%     132%      126%      123%
GE        55    129%     124%      121%
GE        65    121%     119%
GE        75    117%

Females
Stratum   Age   Term 5   Term 15   Term 25   Term 35
gE        45    103%     104%      104%      104%
gE        55    105%     105%      105%
gE        65    106%     106%
gE        75    106%
Ge        45    105%     106%      108%      108%
Ge        55    108%     109%      109%
Ge        65    110%     110%
Ge        75    111%
GE        45    111%     115%      118%      118%
GE        55    119%     121%      121%
GE        65    124%     124%
GE        75    125%

Chapter 4
UK Biobank Simulation Results

4.1 Varying the Genetic and Environment Model

In the last chapter, we estimated parameters of a heart attack model and the resulting CI insurance premiums, based on a simulated realisation of UK Biobank. The underlying ‘true’ model (chosen by us) was particularly simple (two genotypes, two environmental exposures and proportional hazards of heart attack) and, by great good luck, our model epidemiologist hit upon exactly the correct hypotheses in fitting his/her model. So it is not surprising that he/she obtained good parameter estimates, with the possible exception of those in respect of the smallest stratum, GE. In reality, the epidemiologist faces more difficult problems:

(a) There is likely to be more than one gene, many with more than two variants, as candidates for influencing the disease.

(b) Similarly, there are likely to be several environmental exposures of interest.

(c) Model mis-specification is always possible (indeed, it may be the norm).

(d) On grounds of cost, the number of cases and the number of controls per case may be limited.

(e) As mentioned earlier, UK Biobank is a single unrepeatable sample, hence sampling error is present. Although 500,000 seems like a huge sample, it may not be when smaller numbers of cases are sampled from within it.

In a simulation study, we are in a position to explore these problems. In particular, we can address (d) and (e) above, because we can replicate the entire UK Biobank dataset many times, and repeat the epidemiological and actuarial analyses using each realisation. Thus we can estimate the sampling distributions of parameter estimates and premium rates, while the analysis of the single realisation in Chapter 3 only gave us point estimates of the latter. (We did give approximate confidence intervals of the estimated odds ratios, because they can be derived on theoretical grounds. This is not possible for such a complicated function of the model parameters as a premium rate, and simulation is one of the few practical approaches.)

We concentrate on this question in the rest of this thesis, because it is directly relevant to the approach adopted by GAIC in the UK, and likely to be adopted by similar bodies elsewhere, which demands that the reliability of prognoses based on genetic information must be demonstrated if it is to be used in any way. In the case of multifactorial disorders, we assume that this requirement is to be interpreted in the statistical sense rather than as applying to individual applicants. Our exploration of (a), (b) and (c) above will be the subject of a future paper.

In addition to simulating many replications of UK Biobank, we will consider the effect of stronger or weaker genetic and environmental effects, and of more common and less common adverse genotypes. We call each such variant of the underlying model a ‘scenario’, which should not be confused with the simulation procedure discussed above. We will hold each scenario fixed, and then simulate outcomes of UK Biobank under those assumptions. We have already introduced one set of assumptions in Section 3.1, which we will refer to as our Base scenario. The details of all the scenarios are given in Table 4.16. The parameters that must be specified are:

(a) The population frequency of each stratum (the same for males and females).
(b) The parameters $k$ for each sex and $\rho_s$ for each stratum. Although $\rho_s$ does not depend on sex, for convenience Table 4.16 shows the combined constants of proportionality $k \times \rho_s$ for each sex.

Although the odds ratios are derived quantities rather than parameters, they are also shown in Table 4.16 for convenience.

Table 4.16: The model parameters for different scenarios. Odds ratios are also shown.

                                            Penetrance             Frequency
Parameters             Stratum   Base       Low        High        Low        High
Population Frequency   ge        0.81       0.81       0.81        0.9025     0.64
                       gE        0.09       0.09       0.09        0.0475     0.16
                       Ge        0.09       0.09       0.09        0.0475     0.16
                       GE        0.01       0.01       0.01        0.0025     0.04
ρs                     ge        0.70       0.85       0.55        0.70       0.70
                       gE        0.90       0.95       0.85        0.90       0.90
                       Ge        1.10       1.05       1.15        1.10       1.10
                       GE        1.30       1.15       1.45        1.30       1.30
k (Male)               All       1.317274   1.136603   1.568090    1.370745   1.221620
k (Female)             All       1.316406   1.136463   1.564821    1.370230   1.220385
k × ρs (Male)          ge        0.922      0.966      0.862       0.960      0.855
                       gE        1.186      1.080      1.333       1.234      1.099
                       Ge        1.449      1.193      1.803       1.508      1.344
                       GE        1.712      1.307      2.274       1.782      1.588
k × ρs (Female)        ge        0.921      0.966      0.861       0.959      0.854
                       gE        1.185      1.080      1.330       1.233      1.098
                       Ge        1.448      1.193      1.800       1.507      1.342
                       GE        1.711      1.307      2.269       1.781      1.587
Odds Ratio             ge        1.000      1.000      1.000       1.000      1.000
                       gE        1.286      1.118      1.545       1.286      1.286
                       Ge        1.571      1.235      2.091       1.571      1.571
                       GE        1.857      1.353      2.636       1.857      1.857

The Low and High Penetrance scenarios assume smaller and larger differences, respectively, between the effects of the different strata, governed by $\rho_s$. The Low and High Frequency scenarios assume that the disadvantageous G genotype and E environment have population frequencies half (0.05) or double (0.2) those in the Base scenario (0.1), respectively.

In Section 3.3, we noted that problems like missing values and cost constraints might limit the number of cases that can be used for analysis. So we will also examine the effect of limiting the number of cases used in the analysis. From Table 3.9, around 20,000 individuals were eligible to be considered as cases (in that particular realisation).
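As a cross-check on how the scenario parameters fit together, the sketch below derives the stratum frequencies from the marginal frequencies under independence, solves Equation (3.16) for $k$ by bisection, and reproduces the Base scenario multipliers and odds ratios. It rests on assumptions that should be kept in mind: the function names are invented for this illustration, and the quantity `survival`, standing for $\exp(-\int_0^t \lambda_{12}(x+y)\,dy)$, is supplied as a hypothetical input because the population intensity of Section 2.2 is not reproduced here. The value of $k$ for `survival = 0.97` is therefore illustrative only and is not expected to match the published constants exactly.

```python
def strata_frequencies(p_G, p_E):
    """Frequencies of ge, gE, Ge, GE assuming genotype and exposure are independent."""
    return {
        "ge": (1 - p_G) * (1 - p_E),
        "gE": (1 - p_G) * p_E,
        "Ge": p_G * (1 - p_E),
        "GE": p_G * p_E,
    }


def solve_k(weights, rho, survival, lo=0.1, hi=10.0, tol=1e-12):
    """Solve Equation (3.16) for k by bisection.

    Dividing (3.16) by lambda_12(x + t) gives
        1 = sum_s w_s * S**(k*rho_s) * k*rho_s / sum_s w_s * S**(k*rho_s),
    where S = `survival` is an assumed input value.
    """
    def excess(k):
        num = sum(weights[s] * survival ** (k * rho[s]) * k * rho[s] for s in weights)
        den = sum(weights[s] * survival ** (k * rho[s]) for s in weights)
        return num / den - 1.0

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Base scenario inputs taken from Table 4.16.
rho = {"ge": 0.7, "gE": 0.9, "Ge": 1.1, "GE": 1.3}
weights = strata_frequencies(0.1, 0.1)

k_flat = solve_k(weights, rho, survival=1.0)   # no depletion: k = 1 / sum(w * rho)
k_depl = solve_k(weights, rho, survival=0.97)  # hypothetical survival value

# Derived rows, using the published male k for the Base scenario.
multipliers_male = {s: round(1.317274 * rho[s], 3) for s in rho}
odds_ratio = {s: round(rho[s] / rho["ge"], 3) for s in rho}
```

With no depletion of the healthy population the solution reduces to $k = 1/\sum_s w_s \rho_s$, and allowing some depletion pushes $k$ slightly higher, because the high-risk strata leave the healthy state faster.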
For each scenario, we will show results based on 1,000, 2,500, 5,000 and 10,000 cases as well as those based on all cases. 4.2 Outcomes of 1,000 Simulations: The Base Scenario We will make 1,000 simulations of UK Biobank. The outcomes will be the empirical distributions of the parameters of the epidemiologist’s model, and of CI insurance premium rates. Let us first consider the Base scenario, all cases included, for males aged 45 taking out a CI insurance policy with term 15 years. Figure 4.19 shows scatter plots of the CI insurance premium rates per unit sum assured for strata gE, Ge and GE versus those of ge. More precisely, the outcome of the ith simulation is a drawing pi = (pige , pigE , piGe , piGE ) from the sampling distribution of the 4-dimensional random variable P = (Pge , PgE , PGe , PGE ), where Ps is the premium rate in stratum s. The scatter plots show clearly that the premium rate pairs (Pge ,PgE ) and (Pge ,PGe ) are more strongly correlated than the pair (Pge ,PGE ). This is true, as the correlation matrix given in Table 4.17 shows, but note that the scale of the x-axis is greatly compressed compared with that of the y-axis. The reason they are correlated is that, as outlined in Section 3.4, the actuary uses the three odds ratios published by the epidemiologist, plus the overall population intensity of heart attack, to obtain the heart attack intensities for the four strata, so the four premium estimates are not independent. 
The reason that the correlations are negative is that the overall level of the four intensities is adjusted so that their aggregate effect is consistent with the general population. So, if the intensities in any of the strata are high, the intensities in the others will tend to fall to restore consistency with the aggregate intensity.

Figure 4.19: Scatter plots of CI insurance premium rates for strata gE, Ge and GE versus that of ge under the Base scenario for males aged 45 and policy term 15 years. (Plot not reproduced; horizontal axis P_ge, vertical axis P_gE, P_Ge and P_GE.)

Table 4.17: The correlation matrix of the strata-specific premium rates for males aged 45 and policy term 15 years under the Base scenario, all cases included.

Stratum   ge       gE       Ge       GE
ge        1.000   −0.604   −0.656   −0.194
gE                 1.000   −0.123   −0.057
Ge                          1.000   −0.095
GE                                   1.000

We also consider the premium rates for strata gE, Ge and GE as a proportion of those for stratum ge, namely P_gE/P_ge, P_Ge/P_ge and P_GE/P_ge. These correspond to premium ratings, if we take the standard premium rate to be that of stratum ge, and we will refer to them as such. For brevity, define R_s = P_s/P_ge to be the premium rating for stratum s with respect to stratum ge. The correlation matrix of these premium ratings is given in Table 4.18 and the corresponding scatter plots are given in Figure 4.20. Both suggest correlations are small enough to neglect, which means that instead of always considering the full joint distribution of the premiums P, we can obtain all the information of interest by separate examination of the marginal distributions of the premium ratings. The densities of these marginal distributions are given in Figure 4.20.

Table 4.18: The correlation matrix of the premium ratings for males aged 45 and policy term 15 years under the Base scenario, all cases included.

Stratum   R_gE     R_Ge     R_GE
R_gE      1.000    0.095    0.013
R_Ge               1.000   −0.018
R_GE                        1.000

Figure 4.20: The scatter plots of the premium ratings Ge/ge and GE/ge versus gE/ge and the corresponding density plots for males aged 45 and policy term 15 years under the Base scenario, all cases included. (Plots not reproduced; scatter panel: horizontal axis R_gE, vertical axis R_Ge and R_GE; density panel: horizontal axis Premium Ratings, vertical axis Density.)

This immediately suggests a simple approach to the questions that GAIC must ask, because the reliability of the premium rating in each stratum (in terms of its distinguishability from the premium ratings in the other strata) is revealed by the degree to which its marginal density overlaps the marginal densities of the others.
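The premium ratings and their correlation matrix (as in Table 4.18) could be computed from simulation output along the following lines. This is only a minimal sketch: it assumes the simulated premiums are held in a NumPy array with one row per simulation and columns ordered ge, gE, Ge, GE. The function names and the array layout are hypothetical and are not taken from the thesis.

```python
import numpy as np

def premium_ratings(premiums):
    """Return ratings R_s = P_s / P_ge for the strata gE, Ge and GE.

    `premiums` is assumed to have shape (n_simulations, 4) with columns
    ordered ge, gE, Ge, GE (a hypothetical layout chosen for this sketch).
    """
    premiums = np.asarray(premiums, dtype=float)
    # Divide each non-standard stratum premium by the stratum ge premium.
    return premiums[:, 1:] / premiums[:, [0]]

def rating_correlations(premiums):
    """Correlation matrix of the three premium ratings across simulations."""
    return np.corrcoef(premium_ratings(premiums), rowvar=False)
```

Each row of the result of `premium_ratings` is one simulated triple (R_gE, R_Ge, R_GE), so the marginal densities discussed above are simply the column-wise distributions of that array.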
Presented with Figure 4.20, we might expect GAIC to agree that strata Ge and GE had premium ratings distinct from that of stratum gE, but to ask whether or not they had premium ratings reliably distinct from each other.

4.3 A Measure of Confidence

Our precise formulation of the question that GAIC might now ask is: are the marginal empirical distributions of premium ratings in different strata sufficiently different to support charging different premiums (when doing so is allowed)? Hence, we need some kind of measure of confidence in distinguishing one stratum from another in terms of CI insurance premium ratings.

Statisticians normally use non-parametric tests, like the Kolmogorov-Smirnov test, to check whether two underlying one-dimensional probability distributions differ from one another by comparing their empirical distribution functions. However, these types of test cannot be applied in a simulation exercise, as the power of Kolmogorov-Smirnov type tests increases as the number of observations available for each distribution increases. In a simulation exercise, it is possible to generate a large number of estimates by repeating the experiment any number of times, thus superficially increasing the power of the test. As a consequence, the Kolmogorov-Smirnov test could not be used to distinguish one risk stratum from another. In the remainder of this section, we will suggest a simple alternative measure to achieve this.

Let X and Y be two continuous random variables with cumulative distribution functions F_X and F_Y respectively. We can find u such that F_X(u) + F_Y(u) = 1. If the ranges of X and Y overlap, u lies in both and is unique; otherwise any u that lies between their ranges will do. This can be rewritten as F_X(u) = 1 − F_Y(u), or P[X ≤ u] = P[Y > u]. Without loss of generality, let us also assume that F_X(u) ≥ F_Y(u). Let us define our measure of confidence to be 2 × F_X(u) − 1, which gives a measure of the overlap of F_X and F_Y.
Denote this O(X, Y), or just O if the context is clear. If F_X(u) = F_Y(u) = 0.5, then we are as unsure as we can be that F_X and F_Y are distinct, and O = 0. As F_X(u) increases to 1, the area of overlap decreases. If the ranges of X and Y do not overlap at all, F_X(u) = 1 and we have high confidence in deciding that F_X and F_Y are distinct; in this case O = 1. In this sense, O measures how confident the underwriter can be that the two distributions are different.

4.4 Results

In this section, we simulate 1,000 realisations of UK Biobank under each scenario outlined in Table 4.16. Our aim is to examine how reliably UK Biobank might identify differences in premium ratings, as a body like GAIC might require. This is measured by the three quantities O(R_gE, R_Ge), O(R_Ge, R_GE) and O(R_gE, R_GE). We have verified (not shown here) that these do not vary significantly by age or policy term, so in Table 4.19, we present results for a representative policy for males aged 45 and policy term 15 years.

Table 4.19: The measure of overlap O for CI insurance premium ratings for males aged 45, with policy term 15 years, for different scenarios.

Scenario          Cases    O(R_gE, R_Ge)  O(R_gE, R_GE)  O(R_Ge, R_GE)
Base              All      1.000          1.000          0.924
                  10,000   0.968          0.962          0.632
                  5,000    0.872          0.850          0.484
                  2,500    0.718          0.698          0.356
                  1,000    0.490          0.416          0.176
Low Penetrance    All      0.918          0.904          0.572
                  10,000   0.662          0.658          0.346
                  5,000    0.528          0.472          0.216
                  2,500    0.412          0.360          0.148
                  1,000    0.250          0.222          0.076
High Penetrance   All      1.000          1.000          0.992
                  10,000   1.000          0.998          0.844
                  5,000    0.984          0.970          0.692
                  2,500    0.906          0.886          0.540
                  1,000    0.688          0.658          0.354
Low Frequency     All      0.996          0.948          0.658
                  10,000   0.892          0.706          0.352
                  5,000    0.712          0.516          0.208
                  2,500    0.566          0.322          0.060
                  1,000    0.386          0.394          0.226
High Frequency    All      1.000          1.000          0.994
                  10,000   0.988          1.000          0.896
                  5,000    0.932          0.986          0.744
                  2,500    0.806          0.902          0.546
                  1,000    0.594          0.716          0.358
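The overlap measure O defined in Section 4.3 can be estimated from two samples of simulated premium ratings roughly as follows. This is a sketch under stated assumptions rather than the procedure actually used for the tables: it replaces F_X and F_Y by empirical distribution functions, takes u to be the first pooled sample point at which their sum reaches 1, and so only approximates the continuous definition. The function name is hypothetical.

```python
import numpy as np

def overlap_measure(x, y):
    """Approximate O(X, Y) = 2 * F_X(u) - 1, where F_X(u) + F_Y(u) = 1."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.union1d(x, y)  # sorted pooled sample values
    # Empirical distribution functions evaluated at every pooled point.
    fx = np.searchsorted(x, grid, side="right") / x.size
    fy = np.searchsorted(y, grid, side="right") / y.size
    # First pooled point at which F_X + F_Y reaches 1 (small tolerance for rounding).
    i = int(np.argmax(fx + fy >= 1.0 - 1e-12))
    # Without loss of generality, the larger of the two values plays the role of F_X(u).
    return float(2.0 * max(fx[i], fy[i]) - 1.0)
```

With samples whose ranges do not overlap the function returns 1, and with two identical samples it returns 0, matching the two limiting cases described in Section 4.3.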
Note that it is impossible to calculate an odds ratio for a given age group unless there is at least one case in that age group in each stratum. In some circumstances some of the 1,000 simulations failed this criterion, and these were omitted from the results in Table 4.19. Those affected were the Base and the Low Penetrance scenarios with 1,000 cases (1 simulation omitted in each case) and the Low Frequency scenarios with 2,500 and 1,000 cases (10 and 238 simulations omitted, respectively).

We make the following comments on Table 4.19:

(a) We saw in Section 4.3 that under the Base Scenario, all cases included, the densities of R_Ge and R_GE overlap over a small region. This qualitative observation is made more concrete by Table 4.19, which shows that O(R_Ge, R_GE) = 0.924 in this case. By definition, O = 2 × F_X(u) − 1, so this means that there exists x such that P[R_Ge ≤ x] = P[R_GE > x] = (1 + 0.924)/2 = 0.962, and we (or GAIC) may have high confidence in assigning these strata to different underwriting groups.

(b) Stratum GE is always the smallest, so the distribution of R_GE is always the most spread out. This is also evident from the scatter plots in Figure 4.20.

(c) We expect real case-control studies to use only a subset of cases, and Table 4.19 shows that the effect of this is very great. For example, in the Base scenario, O(R_Ge, R_GE) falls from 0.924 to 0.176 as the number of cases used falls from ‘All’ to 1,000. Figure 4.21 shows, for the Base scenario, the marginal densities with different numbers of cases. The densities overlap considerably if the number of cases is small (and bear in mind that 1,000 cases is not a very small investigation by normal standards).

(d) Figure 4.22 shows the empirical distribution functions of the premium ratings for males under the Base scenario. For each premium rating, we show the effect of using different numbers of cases.
For example, if only 1,000 cases were used, there is about a 30% chance that underwriters would incorrectly assume R_GE to be 150% or higher. If instead 10,000 cases were used, the chance of making this error is very small.

(e) Figure 4.23 shows, for 5,000 cases, the effect of the different scenarios. Reduced frequency of the adverse genetic and environmental exposures, or reduced penetrance of the adverse genotype, both reduce the ability to discriminate between different underwriting classes. Changes in the opposite direction improve the discrimination. This qualitative observation is backed up in a more quantitative way by Table 4.19.

Table 4.20 gives the corresponding results for females. When a fixed number of cases is used the results are very similar to those for males. This is as expected, as we assumed that the effects of genotype and environmental exposures were the same for males and females, albeit acting on different baseline risks of heart attack.

Figure 4.21: Marginal densities of premium ratings in the Base scenario (males) with different numbers of cases in the case-control study. (Plots not reproduced; panels: Base, All Cases; Base, 10,000 Cases; Base, 5,000 Cases; Base, 2,500 Cases; Base, 1,000 Cases; each panel shows the densities of R_gE, R_Ge and R_GE, with horizontal axis Premium Ratings and vertical axis Density.)
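The omission rule mentioned before the comments on Table 4.19 (a simulation is dropped when an odds ratio cannot be calculated) can be illustrated with a small helper. This is a hedged sketch only: it assumes the standard cross-product odds ratio from a 2 × 2 table of exposed and unexposed cases and controls, and it treats any empty cell as making the odds ratio unusable. The function name and argument layout are hypothetical; the thesis's own table layout is not shown in this section.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio from a 2 x 2 table, or None if any cell is empty."""
    cells = (exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    # Assumption for this sketch: an empty cell means the simulation is rejected.
    if min(cells) <= 0:
        return None
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
```

For example, `odds_ratio(30, 70, 10, 90)` gives 27/7, while a table with no exposed controls, such as `odds_ratio(5, 10, 0, 20)`, gives `None` and would be counted among the omitted simulations.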
Figure 4.22: The empirical cumulative distribution function of the premium ratings gE/ge, Ge/ge and GE/ge for males aged 45 and policy term 15 years under the Base scenario. (Plots not reproduced; one panel each for R_gE, R_Ge and R_GE, with curves for All, 10,000, 5,000, 2,500 and 1,000 cases; vertical axis Cumulative Distribution Function.)

Figure 4.23: Marginal densities of premium ratings in different scenarios (males), with 5,000 cases in the case-control study. (Plots not reproduced; panels: Base, Low Penetrance, High Penetrance, Low Frequency and High Frequency, each with 5,000 cases; each panel shows the densities of R_gE, R_Ge and R_GE, with horizontal axis Premium Ratings and vertical axis Density.)

Table 4.20: The measure of overlap O for CI insurance premium ratings for females aged 45 with policy term 15 years, for different scenarios.

Scenario          Cases    O(R_gE, R_Ge)  O(R_gE, R_GE)  O(R_Ge, R_GE)
Base              All      0.990          0.990          0.734
                  10,000   0.958          0.948          0.626
                  5,000    0.850          0.844          0.494
                  2,500    0.728          0.706          0.378
                  1,000    0.466          0.488          0.244
Low Penetrance    All      0.778          0.762          0.402
                  10,000   0.680          0.646          0.302
                  5,000    0.528          0.506          0.222
                  2,500    0.392          0.326          0.122
                  1,000    0.238          0.198          0.078
High Penetrance   All      1.000          1.000          0.906
                  10,000   1.000          0.998          0.836
                  5,000    0.992          0.984          0.696
                  2,500    0.914          0.884          0.484
                  1,000    0.716          0.656          0.320
Low Frequency     All      0.932          0.800          0.436
                  10,000   0.896          0.676          0.298
                  5,000    0.748          0.486          0.192
                  2,500    0.552          0.340          0.134
                  1,000    0.406          0.374          0.218
High Frequency    All      0.998          1.000          0.922
                  10,000   0.994          1.000          0.884
                  5,000    0.922          0.986          0.756
                  2,500    0.814          0.914          0.576
                  1,000    0.598          0.678          0.348

However, when all cases are included, the values of O are smaller than those for males. This is because the lower incidence of heart attack among females results in fewer cases, and therefore in estimates with higher variances.

Until now, we have used a 1:5 matching strategy for all case-control studies; that is, five controls per case. However, cost constraints might dictate the use of fewer controls. In Table 4.21, we show the values of O for males if a 1:1 matching strategy is used. As expected these are decreased significantly under all scenarios.

Table 4.21: The measure of overlap O for CI insurance premium ratings for males aged 45, with policy term 15 years, for different scenarios and a 1:1 matching strategy.

Scenario          Cases    O(R_gE, R_Ge)  O(R_gE, R_GE)  O(R_Ge, R_GE)
Base              All      0.990          0.990          0.774
                  10,000   0.886          0.872          0.454
                  5,000    0.740          0.720          0.374
                  2,500    0.554          0.544          0.248
                  1,000    0.378          0.400          0.222
Low Penetrance    All      0.808          0.820          0.456
                  10,000   0.558          0.526          0.220
                  5,000    0.372          0.378          0.188
                  2,500    0.288          0.308          0.184
                  1,000    0.232          0.204          0.048
High Penetrance   All      1.000          1.000          0.908
                  10,000   0.988          0.978          0.680
                  5,000    0.898          0.902          0.494
                  2,500    0.762          0.742          0.366
                  1,000    0.548          0.480          0.222
Low Frequency     All      0.954          0.856          0.474
                  10,000   0.738          0.558          0.284
                  5,000    0.574          0.464          0.228
High Frequency    All      1.000          1.000          0.950
                  10,000   0.944          0.986          0.746
                  5,000    0.826          0.932          0.592
                  2,500    0.668          0.802          0.456
                  1,000    0.474          0.594          0.306

As we mentioned when discussing Table 4.19, we may find simulations under which the odds ratios cannot be calculated because of a lack of cases. Also, note that the calculation of odds ratios requires the existence of enough exposed controls. This is more demanding under a 1:1 matching strategy, as fewer controls are available than in a 1:5 matching strategy. At first sight this is surprising; it ought to be easier to find a smaller number of controls. This is true, but there is also a higher chance that one of the cells in the 2 × 2 table used to calculate the odds ratio will be empty (see Table A.31 in Appendix A). Table 4.22 shows the numbers of simulations rejected for this reason. The numbers are very high for the Low Frequency scenarios where 1,000 and 2,500 cases were used. The values of O based on the remaining simulations are not reliable and so these are not given in Table 4.21.

Table 4.22: The number of simulations rejected due to the inability to calculate the odds ratios for a 1:1 matching strategy.

                  Number of Cases
Scenario          All    10,000   5,000   2,500   1,000
Base              0      0        0       0       13
Low Penetrance    0      0        0       0       16
High Penetrance   0      0        0       0       36
Low Frequency     0      0        6       123     630
High Frequency    0      0        0       0       0

4.5 Conclusions

Earlier in this chapter, we asked the question: how well may UK Biobank distinguish between different levels of risk associated with the influence of genes, environment and their interactions on a given multifactorial disorder?
Using a simple model of heart attack as an example, we simulated the outcome of UK Biobank, each simulation consisting of 500,000 life histories. Then we supposed that a model epidemiologist carried out case-control studies using the UK Biobank data, and a model actuary used the published odds ratios from these studies to parameterise a pricing model for CI insurance. We supposed that GAIC (in the UK) would approach the question of the reliability of any genetic test capable of detecting the genetic variation in terms of its ability to allocate tested individuals to distinct underwriting classes.

From each simulation of UK Biobank we could estimate the premium rates of a representative CI insurance policy for each stratum defined by genotype and the environment, and for each sex. From a large number of such simulations, we could estimate the sampling distributions of premium ratings with respect to a chosen ‘standard’ underwriting class.

For simplicity, we used only two genotypes and two levels of environmental exposure (as in the examples in the UK Biobank protocol). We used proportional hazards of heart attack in different strata, and assumed that the model epidemiologist, in his/her analyses, hit upon the same model. Thus our results correspond to the simplest possible hypothesis that might be investigated using UK Biobank, and are free of model mis-specification on the part of the analyst, and of any noise, nuisance parameters, or missing or contaminated data.

The parameters we chose as our baseline represented genetic and environmental exposures that were fairly common (10% of the population with each adverse exposure) and had modest penetrance: the most adverse stratum (GE) and least adverse stratum (ge) had intensities of heart attack 30% higher and 30% lower than average, respectively. (For comparison, CI insurance underwriters typically might consider an extra premium to be appropriate once the assessed premium exceeds about 25% of the standard.)
We also considered the effect of varying key parameters, as follows:

(a) The relative incidence rates of heart attack for each stratum.

(b) The population frequencies of each stratum.

(c) The number of cases used in the case-control study.

We defined a very simple measure of the extent to which two distributions overlap. We did not attempt to define a cut-off point, beyond which GAIC might deem a genetic test to be insufficiently reliable to be used in underwriting, but the results we obtained ranged across all values of this measure, showing that in some circumstances a genetic test would almost certainly be deemed reliable, and in other circumstances it would almost certainly be deemed unreliable.

On the basis of this simple model, we conclude that the ability of case-control studies based on UK Biobank to identify distinct CI underwriting classes was marginal. If a very large number of cases was used, quite reliable discrimination was achieved, but this is a very expensive option. If a more realistic number of cases was used (a few thousand) the power to discriminate quickly diminished. In particular, it was clear that if the effects of the adverse genotype and adverse environment were any less than we had assumed, the power to discriminate would be rather poor.

This conclusion ought to bring comfort to those who are worried about insurers’ use of genetic information, and to insurers themselves. This is particularly important during the 5 to 10 years that must pass before UK Biobank itself starts to yield results. We have found no support for the idea that very large-scale genetic studies like UK Biobank will lead to significant changes in underwriting practice.

Our study has been very simple and idealised in several respects mentioned above. Most obviously, our genetic model is not truly multifactorial, although it does allow for a basic environmental interaction.
Further research is in hand to extend the model to a more realistic, though still hypothetical, representation of a multifactorial genetic contribution to heart attack. Our aim will be to find out whether this will strengthen or weaken the discriminatory power of genetic tests, along the lines that GAIC has pioneered for single-gene disorders. Another point that will repay further study is the possibility of model mis-specification.

Chapter 5

Adverse Selection and Utility Theory

5.1 Risk and Insurance

An individual faces financial risk all the time. Be it the risk of losing one’s home to fire, flood or earthquake, or the loss of a steady source of income due to failing health, an individual is constantly exposed to huge financial risks. Although the probability of such an event is small, the resulting loss could be enormous and potentially catastrophic for an individual.

Facing an uncertain future, an individual might do nothing and gamble on the risk event not happening. Or, the individual can purchase insurance and pass the risk on to an insurer at an appropriate price. So, which of the two options should an individual choose? Economic studies, like Pratt (1964), show that individuals are generally risk averse. If affordable, an individual would not gamble and would opt for insurance protection.

Of course, the price of insurance plays an important role. If the insurance premium is set at the actuarially fair price for the risk, it can be shown that a risk-averse individual would always put a higher value on insurance than on gambling with the risk. Pursuing this further, it can also be proved that risk-averse individuals are actually willing to pay more than the fair price for the risk cover, up to a certain maximum. For more details on rational behaviour in purchasing insurance coverage against a given risk please refer to Mossin (1968).

This risk-averse nature of individuals provides the business incentive for insurers to operate in the market.
While a solitary individual prefers to insure against risks, insurers are in the business of accepting risks. By pooling risks, an insurer can become virtually risk-neutral. Coupled with the fact that a risk-averse individual is willing to pay more than necessary, the insurer can charge a premium which will not only cover the expected cost of claims, but also contribute to their profit margin.

However, an insurer cannot charge an arbitrarily high price, for several reasons. Firstly, beyond a certain maximum, even the risk-averse individuals will find the price unattractive, which sets an upper limit for the premium that can be charged. Moreover, in a competitive market, where individuals can choose between competing products, they will always buy the cheapest one available, all else being equal. So, competition ensures that insurance is sold at prices much lower than the upper limit that the risk-averse individuals could have paid. In fact, in a competitive market, the equilibrium position for all insurance companies is to charge the fair actuarial price for the risk involved. Rothschild and Stiglitz (1976) provide a model for risk-neutral insurance firms in a competitive market.

What can we infer from all this? If the insurance companies can only charge an actuarially fair premium, could the knowledge that the consumers would have actually paid more be used under some other circumstances? As we will show here, the answer to that question is yes. In the remainder of the chapter, we will see that in certain situations where insurers do not have access to consumers’ private information, the upper limits for insurance premiums become relevant. We will illustrate our results using CI insurance. But before proceeding further, we will discuss the circumstances that lead to information asymmetry.

5.2 Underwriting Risk

Each individual is unique, their circumstances are different and so are their risk profiles.
So, even if two individuals wish to purchase the same cover from the same insurer, they might still find that they have to pay very different prices. Insurers would always want to charge a premium which is at least equal to the actuarially fair price which is commensurate with the risk they are accepting. Although competition ensures that they cannot charge more than the fair price, they do not want to quote a lower price, as this would result in losses. So consumers with a higher risk profile would be expected to pay a higher premium than those with lower risks.

Charging an appropriate premium for a risk involves a good understanding of how different factors contribute to the risk in question. The factors which have a quantitative impact on the risk are identified and are commonly referred to as risk factors. Different levels of exposure to these risk factors would indicate different levels of risk. In other words, exposure levels of risk factors stratify an insurer’s consumer base into a number of homogeneous groups of individuals. Appropriate premiums can then be set for these groups of individuals based on their exposures to these risk factors. As and when a potential consumer approaches the insurer for cover, the individual’s exposure to the risk factors would dictate the premium to be charged. This is, broadly, how underwriting strategies work for insurance companies.

However, acquiring information on risk factors has its disadvantages. Firstly, there are costs associated with it. A piece of information is only useful for underwriting purposes if the advantages outweigh the cost of acquiring it. The risk factors which satisfy this economic criterion can then be used for underwriting purposes. As more and more risk factors come to light through medical research, it is also an evolving process. This is very relevant for recent developments in genetics, as the rôle of genes in an individual’s health becomes clearer.
However, as of now, genetic tests are expensive and it needs further research to establish the relative efficiency of these tests as underwriting tools.

More importantly, though, there are ethical considerations in accessing private information. Let us discuss this in the context of CI insurance. The risk of CI is affected by, among other things, age, gender, lifestyle and genotype. Clearly, CIs are more common at advanced ages. Medical research has also established differences between CI incidences for males and females. The same is true for some lifestyle factors like smoking habits. Use of these items of information for underwriting has become standard and is widely practised in the industry.

However, the use of certain information on environmental exposures and genetic test results has proved more controversial. Unlike smoking habits, there are some environmental exposures which are beyond an individual’s control. And genes are even more intrinsic to human beings, as individuals are born with them. Given this backdrop, should insurers be allowed to use such information to discriminate between individuals? In many countries, a ban has been imposed, or moratorium agreed, limiting the use of genetic information. In the UK, GAIC is providing guidance to insurers on the acceptable use of genetic information. As it stands now, insurers are only allowed access to genetic test results for covers exceeding a certain prescribed level.

Clearly, the regulators are now responsible for formulating policies on ethical issues, while the insurers do not access genetic information for the majority of cases. It is imperative here to understand the role of different types of genetic disorders that might affect an individual’s health.

5.3 Multifactorial Disorders

We have discussed genetic disorders in detail in Chapter 1. In this section, we will recount briefly the main issues.
Disorders caused by mutations in single genes, which may be severe and of late onset, but are rare, have been quite extensively studied in the insurance literature; see Macdonald and Pritchard (2000) for an example. One reason is that the epidemiology of these disorders is relatively advanced, because biological cause and effect could be traced relatively easily. The conclusion has been that single-gene disorders do not expose insurers to serious adverse selection, in large enough markets, because of their rarity.

The vast majority of the genetic contribution to human disease, however, will arise from combinations of gene varieties (called ‘alleles’) and environmental factors, each of which might be quite common, and each alone of small influence but together exerting a measurable effect on the molecular mechanism of a disease. Some combinations may be protective, others deleterious. These are the multifactorial disorders, and they are the future of genetics research. Their epidemiology is not very advanced, but should make progress in the next 5–10 years through the very large prospective studies now beginning in several countries. As discussed in earlier chapters, one of the largest is the Biobank project in the UK, with 500,000 subjects.

UK Biobank will recruit 500,000 people aged 40 to 69 from the general population of the UK, and follow them up for 10 years. The aim is to capture both genetic and environmental variations and interactions, and relate them to the risks of common diseases. If successful, the outcome will be much better knowledge of the risks associated with complex genotypes. Thus the genetics and insurance debate will, in the fairly near future, shift from single-gene to multifactorial disorders.

5.4 Literature Review

Any model used to study adverse selection risk must incorporate the behaviour of the market participants.
Most of those applied to single-gene disorders in the past did so in a very simple and exaggerated way, assuming that the risk implied by an adverse genetic test result was so great that its recipient would quickly buy life or health insurance with very high probability. These assumptions were not based on any quantified economic rationale, but since they led to minimal changes in the price of insurance this probably did not matter. The same is not true if we try to model multifactorial disorders. Then ‘adverse’ genotypes may imply relatively modest excess risk but may be reasonably common, so the decision to buy insurance is more central to the outcome.

Information asymmetry and adverse selection have been considered before. Doherty & Thistle (1996) pointed out that, under symmetric information, the private value of information is negative and insurance deters people from taking diagnostic tests. This is because, from an individual’s perspective, before undertaking the test, the premium is a random variable and, being risk-averse, the individual will decide against testing and opt for an average premium instead. On the other hand, if the insurer cannot observe test results, acquiring information has a positive private value as it enables an informed choice to be made. However, as insurers adjust their premiums to guard against adverse selection, there is a loss of market efficiency. The authors used a general insurance model to show that insurers can only allow partial cover for the lowest risk group if positive (beneficial) test results cannot be reported. If reporting verifiable positive test results was allowed, the lowest risk group could buy full coverage at the lower price. However, uninformed individuals would pay the same higher premium charged to the high risk group. Assuming costless information, this provides an incentive for taking diagnostic tests.
The authors concluded that the loss of efficiency in the insurance market should be weighed against the increased value of private information.

Hoy & Polborn (2000) analysed the same problem in a life insurance model. As life insurance companies do not share information, restricting insurance cover is not a viable option against adverse selection. Instead the authors propose an income protection model, which they then use to compute an optimal insurance coverage. Under specific assumptions, they showed that for a fixed insurance premium, appetite for insurance cover increases with risk. The authors constructed scenarios where the effect, on welfare, of a new test could go either way. A Pareto worsening happens when very high risk individuals opt for insurance only when the test produces very bad news. This increases the average insurance premium for life insurance buyers and worsens everybody’s situation. On the other hand, if the individuals with positive (beneficial) test results have lower risk than the average life insurance buyer, then there is a Pareto improvement. The authors also investigated a third scenario under which individuals who go for tests gain and those who do not lose. As, currently, very few people have diagnostic genetic tests, individuals with bad news can only move insurance premiums by very small amounts in practice. However, the authors conclude that if tests become cheaper and widely available, testing could lead to either Pareto improvements or worsening.

Hoy & Witt (2005) applied the results from Hoy & Polborn (2000) to the specific case of the BRCA1/2 breast cancer genes. They simulated the market for 10-year term life insurance policies targeted at women aged 35 to 39. They stratified the consumer base into 13 risk categories based on family background information. This information is also available to insurers.
Then within each risk group, they checked the impact of test results for BRCA1/2 genes on welfare effects, using iso-elastic utility functions. The authors showed that in the presence of a high risk group, and in the presence of information asymmetry, the equilibrium insurance premium can be as high as 297% of the population weighted probability of death.

All these papers assume that the genetic epidemiology implies that genetic tests carry very strong information about risk; true of some single-gene disorders but unlikely to be so true of multifactorial disorders. They concentrate primarily on providing a proper economic rationale for the impact, on the insurance market, of genetic tests for, mainly, rare diseases. Here, we try to bring together plausible quantitative models for the epidemiology and the economic issues, in respect of more common disorders, therefore affecting a much larger proportion of the insurer’s customer base. We wish to find out under what circumstances adverse selection is likely to occur.

5.5 Adverse Selection

We suppose that individuals are risk-averse, have wealth W and aim to buy CI insurance with sum assured L ≤ W. Their decision is governed by expected utility, conditioned on the information available to them. Insurers, in a competitive market, charge an actuarially fair premium P, equal to the expected present value of the insured loss, conditioned on the information available to them. See for example Hoy and Polborn (2000) for a similar market model. Because they are risk-averse, individuals will be willing to pay a premium up to a maximum of $P^* > P$, provided that they and the insurer have the same information. We can then consider the effect of genetic information that is only available to applicants.

We propose a simple model of a multifactorial disorder, with two genotypes and two levels of environmental exposure, and either additive or multiplicative interactions between them.
These factors affect the risk of myocardial infarction (heart attack), and therefore the theoretical price of CI insurance. However, these price differences are not very large. To begin with, the risk factors are not observable, because the epidemiology is unknown, or the necessary genetic tests have not yet been developed. Insurers therefore charge everyone the same premium, which is the appropriate weighted average of the genotype and environment-specific premiums. Subsequently, genetic tests that accurately predict the risk become available, but only to individuals; insurers are barred from asking about genotype. Adverse selection therefore becomes a possibility.

5.6 Utility of Wealth

Utility theory has its roots in the early works of the utilitarian philosophers, including Bentham (1789) and Mill (1879). They proposed that people ought to desire those things that will maximise their utility, where utility is measured in terms of happiness or satisfaction gained from consumption of commodities. Among the early applications, Daniel Bernoulli suggested the use of expected utility theory to solve the St. Petersburg paradox. However, the first important breakthrough came from Von Neumann and Morgenstern (1944), who used the assumptions of expected utility maximisation in their formulation of game theory. The work of Nash (1950) on optimum strategies for multiplayer games ushered in a new era and since then utility theory has been at the forefront of economic research activity. For a full exposition of utility theory, see Binmore (1991).

In this chapter, we will define utility functions as a measure of an individual’s preference for wealth. In other words, an individual, hypothetically, assigns a value U(w) to every amount of wealth w that she can possess. Figure 5.24 shows a specimen utility function plotted against a person’s wealth. For this individual, the utility of wealth, measured in terms of U(w), increases with wealth, w.
This is known as the non-satiation property, which states that more wealth is preferred to less wealth. The other feature of the utility function is that it is concave, i.e., the rate of increase of U(w) slows down as wealth goes up. In other words, the marginal utility of wealth decreases as wealth increases. For this individual the value of an extra pound is greater when her existing wealth is £1 than when it is £1,000,000. This is known as the risk-aversion property.

[Figure 5.24: Utility of wealth for a risk averse individual. Horizontal axis: wealth, marking W − L, W*, W − qL and W. Vertical axis: utility, marking u(W − L), u(W*) = (1 − q)u(W) + qu(W − L), u(W − qL) and u(W).]

Let us now formalise our definition of utility function in terms of the non-satiation and risk-aversion properties. In mathematical terminology, the utility function for a risk-averse individual is increasing and concave. Now, U(w) is concave on an interval [a,b] if, for any distinct points $w_1$ and $w_2$ in [a,b] and for any α in (0,1), we have

$$U[\alpha w_1 + (1-\alpha) w_2] > \alpha U(w_1) + (1-\alpha) U(w_2). \tag{5.32}$$

If U(w) is twice-differentiable in [a,b], then a necessary and sufficient condition for it to be concave on that interval is that the second derivative U''(w) < 0 for all w in [a,b]. So a twice-differentiable utility function for a risk-averse individual has the properties U'(w) > 0 (non-satiation property) and U''(w) < 0 (risk-aversion property).

From the above formulation, it is not readily obvious how concavity of utility functions relates to risk-aversion. To understand the relationship, let us assume that an individual with a concave increasing utility function U(w) has initial wealth W, from which he might lose L with probability q. The ultimate wealth is the random variable X, where X = W − L with probability q and X = W with probability 1 − q. The expected utility of this gamble from the individual’s perspective is:

$$E[U(X)] = qU(W-L) + (1-q)U(W). \tag{5.33}$$

If he chooses, he can insure the risk for premium P, and accept W − P with certainty. He should do so if:

$$U(W-P) > E[U(X)] = qU(W-L) + (1-q)U(W). \tag{5.34}$$

In particular he should insure if the premium is equal to the expected loss qL, since by concavity:

$$U(W-qL) = U(q(W-L) + (1-q)W) > qU(W-L) + (1-q)U(W). \tag{5.35}$$

The inequality of Equation 5.35 can also be verified from Figure 5.24, which shows that an individual values certainty more than a gamble. He prefers to accept a fixed loss of amount qL rather than participate in the gamble. This is why the individual is risk-averse.

In fact, we can see from Figure 5.24 that a risk-averse individual is willing to run her wealth down further than is required for a fair actuarial premium. If W* is the amount of wealth for which:

$$U(W^*) = (1-q)U(W) + qU(W-L) \tag{5.36}$$

then the individual will be ready to forgo a maximum of W − W* in order to avoid the gamble. This is greater than the fair actuarial premium of qL, so:

$$P^* = W - W^* = W - U^{-1}[(1-q)U(W) + qU(W-L)] \tag{5.37}$$

is the maximum premium that this person will pay for insurance. So in a market where competition drives insurers to charge the actuarially ‘fair’ premium qL, insurance will be bought, but this is not the limiting case; insurance will be bought as long as the premium is less than P*.

5.7 Coefficients of Risk-aversion

Risk-aversion is the prerequisite for insurance. However, not all individuals have the same attitude towards risk. Some individuals are more risk-averse than others. In this section, we will define properties of utility functions which characterise an individual’s attitude towards risk. For a comprehensive discussion on properties of risk-averse utility functions, please refer to Pratt (1964).

Let us consider two utility functions U(w) and V(w), where for a > 0, V(w) = aU(w) + b. In mathematical terminology, U(w) and V(w) are said to be related by a positive affine transformation.
How different are these two utility functions from each other? If U(w) represents the utility function for a risk-averse individual, i.e., U'(w) > 0 and U''(w) < 0, then so does the function V(w), i.e., V'(w) > 0 and V''(w) < 0. Now, assuming an initial wealth of W, if there is a risk of losing L with probability q, how will decisions based on utility function U(w) be different from those based on V(w)? Note that:

$$V^{-1}[qV(W-L) + (1-q)V(W)] = V^{-1}[a\{qU(W-L) + (1-q)U(W)\} + b] = U^{-1}[qU(W-L) + (1-q)U(W)]. \tag{5.38}$$

From Equations 5.37 and 5.38, we can see that the maximum premium payable under both these utility functions is the same. So in a way, a positive affine transformation has preserved the inherent characteristics of these utility functions. To understand the underlying mechanics, let us define the absolute risk-aversion function for a utility function U(w), as follows:

$$A_U(w) = -\frac{U''(w)}{U'(w)}. \tag{5.39}$$

Clearly, for a positive affine transformation V(w) = aU(w) + b, we have:

$$A_V(w) = -\frac{V''(w)}{V'(w)} = -\frac{aU''(w)}{aU'(w)} = -\frac{U''(w)}{U'(w)} = A_U(w). \tag{5.40}$$

So a positive affine transformation leaves the absolute risk-aversion functions unaltered. Conversely, if we assume $A_U(w) = A_V(w)$ for two risk-averse utility functions U(w) and V(w), then we have:

$$\frac{V'(w)}{U'(w)} = \frac{V''(w)}{U''(w)}. \tag{5.41}$$

Let us now define:

$$f(w) = \frac{V'(w)}{U'(w)}. \tag{5.43}$$

Taking derivatives of both sides:

$$f' = \frac{V''}{U'} - \frac{V'U''}{(U')^2} = \frac{V'}{U'}\left[\frac{V''}{V'} - \frac{U''}{U'}\right]. \tag{5.44}$$

From Equation 5.41, f' = 0, so V'(w) = aU'(w) for some constant a, which is positive because both utilities are increasing. Integrating gives V(w) = aU(w) + b where a > 0. So, we can see that two utility functions have the same absolute risk-aversion function if and only if they are related by a positive affine transformation. In other words, the absolute risk-aversion function fully characterises a utility function, up to a positive affine transformation. We will also introduce here a related quantity called the relative risk-aversion function, defined as follows:

$$R(w) = A_U(w)\,w = -\frac{U''(w)\,w}{U'(w)}. \tag{5.45}$$

5.8 Families of Utility Functions

We introduce two families of utility functions which we will use in examples throughout the rest of the document.

(a) The Iso-Elastic utility functions are defined by:

$$U_{I(\lambda)}(w) = \begin{cases} (w^{\lambda} - 1)/\lambda & \lambda < 1 \text{ and } \lambda \neq 0 \\ \log(w) & \lambda = 0. \end{cases} \tag{5.46}$$

The condition λ < 1 ensures concavity. Log-utility is the limiting case as λ → 0. The family gets its name, iso-elastic, from the property that scaling wealth by a factor k produces a utility function which is just a positive affine transformation of the original utility function. In mathematical notation, for all k > 0, there exist some functions f(k) > 0 and g(k), which are independent of wealth w, such that:

$$U(kw) = f(k)U(w) + g(k). \tag{5.47}$$

It is easy to verify that this family of utility functions satisfies iso-elasticity:

$$U_{I(\lambda)}(kw) = \begin{cases} k^{\lambda} U_{I(\lambda)}(w) + (k^{\lambda} - 1)/\lambda & \lambda < 1 \text{ and } \lambda \neq 0 \\ U_{I(\lambda)}(w) + \log(k) & \lambda = 0. \end{cases} \tag{5.48}$$

This property plays an important role, as we will see later that individuals with iso-elastic utility functions put more emphasis on the proportion of loss under risk than the actual amount of loss itself. The absolute risk-aversion function of $U_{I(\lambda)}(w)$ is:

$$A(w) = \frac{1-\lambda}{w} \tag{5.49}$$

and the relative risk-aversion function is constant, R(w) = R = 1 − λ. Hence higher λ means less risk aversion.

(b) The Negative Exponential family of utility functions is parameterised by a constant absolute risk-aversion function A(w) = A, as follows:

$$U_{N(A)}(w) = -\exp(-Aw), \quad \text{where } A > 0. \tag{5.50}$$

Clearly, a higher value of A implies more risk aversion. The Negative Exponential utility functions possess the interesting property that they are invariant under any translation of wealth. In other words, for all k, there exist some functions f(k) > 0 and g(k), which are independent of wealth w, such that:

$$U(k+w) = f(k)U(w) + g(k). \tag{5.51}$$

It is easy to verify that for Negative Exponential utilities,

$$U_{N(A)}(k+w) = \exp(-kA)\,U_{N(A)}(w). \tag{5.52}$$

We will see later that this property ensures that individuals with Negative Exponential utility functions put all emphasis on the actual amount of loss, ignoring completely their initial wealth.

The basic properties of these families of utility functions along with some simple applications to portfolio optimisation are given in Norstad (1999).

5.9 Estimates of Absolute and Relative Risk-aversion

To parameterise these utility functions, we need estimates of absolute or relative risk-aversion coefficients. Eisenhauer and Ventura (2003) pointed out that past research was inconclusive; estimates of average relative risk-aversion coefficients ranged from less than 1 to well over 40. Hoy and Witt (2005) illustrated their model using iso-elastic utilities with R = 0.5, 1 and 3. We will adopt a similar strategy, as follows.

Eisenhauer and Ventura (2003) estimated the risk-aversion function based on a thought experiment conducted by the Bank of Italy for its 1995 Survey of Italian Households’ Income and Wealth. Under certain assumptions, they estimated that a person with an average annual income of 46.7777 million lira had absolute risk-aversion coefficient 0.1837, and relative risk-aversion coefficient 8.59. Allowing for the sterling/lira exchange rate in 1995 (average £1 = 2570.60 lira; see http://fx.sauder.ubc.ca/) and price inflation in the UK between July 1995 and June 2006 (Retail Price Index 149.1 and 198.5, respectively), an average income of 46.7777 million lira in 1995 equates to about £24,226 in 2006, not very different from the actual average of £25,810 (Jones (2005)).

We need utility functions of wealth, so an estimate of the wealth-income ratio is required. Estimates of this ratio in the literature are quite varied. According to Treasury (2005) in the U.K., it varies between 5 and 7 for total wealth, and between 2 and 4 for net financial wealth. The Inland Revenue in the U.K.
also publishes figures on personal wealth distribution (http://www.hmrc.gov.uk/stats/personal wealth/menu.htm). Their latest figure (for 2003) shows that 53% of the population has less than £50,000 and 83% has less than £100,000. As the distribution of wealth is positively skewed, we will assume a total wealth of W = £100,000. This gives a wealth-income ratio of 4, which is consistent with the figures published by Treasury (2005).

(a) The absolute risk-aversion function depends on the unit of wealth. Given utility functions U(w) and V(w) related by U(cw) = V(w) for some constant c, their absolute risk-aversion functions are related by $A_U(cw) = A_V(w)/c$. Using the exchange and inflation rates above, we suppose that a Briton in 2006 has absolute risk-aversion coefficient $8.967 \times 10^{-5} \approx 9 \times 10^{-5}$, denominated in 2006 pounds.

(b) The relative risk-aversion function does not depend on the unit of wealth and so the estimate of 8.59 can be used without any adjustment. We will use a rounded-off value of 9 henceforth.

The formulation of utility functions with non-constant relative risk-aversion is an active area of research. Meyer and Meyer (2005) specified a form of marginal utility function which gives decreasing relative risk-aversion. Xie (2000) proposed a power risk-aversion utility function which can produce increasing, constant or decreasing risk-aversion depending on its parameterisation. These specialised utility functions are not yet in widespread use and we will not consider them further.

We will use the following utility functions for the purposes of illustration:

(a) Iso-elastic utilities with parameter λ = 0.5, 0 and −8, which correspond to constant relative risk-aversion of 0.5, 1 and 9 respectively.

(b) Negative exponential utility with absolute risk-aversion coefficient $A = 9 \times 10^{-5}$.
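As a rough illustration of how these four utility functions behave, the sketch below codes both families and the maximum premium of Equation 5.37. It is written in Python, which is not used in the thesis, and all function names are hypothetical. One assumption is labelled in the code: for λ ≠ 0 the iso-elastic utility is coded as $w^{\lambda}/\lambda$, which differs from Equation 5.46 only by the constant 1/λ. That is a positive affine transformation, so decisions are unchanged, and it avoids rounding error when λ = −8 and wealth is large.

```python
import math


def iso_elastic(lam):
    """Return (utility, inverse) for the iso-elastic family.

    Assumption: for lam != 0 we use w**lam / lam, i.e. Equation 5.46 plus
    the constant 1/lam, a positive affine transformation of the original.
    """
    if lam >= 1:
        raise ValueError("lam must be below 1 for concavity")
    if lam == 0:
        return math.log, math.exp
    return (lambda w: w ** lam / lam), (lambda u: (lam * u) ** (1.0 / lam))


def negative_exponential(a):
    """Return (utility, inverse) for Equation 5.50 with absolute risk-aversion a."""
    return (lambda w: -math.exp(-a * w)), (lambda u: -math.log(-u) / a)


def max_premium(utility, inverse, wealth, loss, q):
    """Maximum premium P* = W - U^{-1}[(1-q)U(W) + qU(W-L)] (Equation 5.37)."""
    expected_utility = (1 - q) * utility(wealth) + q * utility(wealth - loss)
    return wealth - inverse(expected_utility)


def iso_elastic_absolute_risk_aversion(lam, wealth):
    """A(w) = (1 - lam) / w (Equation 5.49)."""
    return (1 - lam) / wealth


if __name__ == "__main__":
    W, L, q = 100_000.0, 10_000.0, 0.1  # illustrative values only
    families = {
        "I(0.5)": iso_elastic(0.5),
        "Log": iso_elastic(0),
        "I(-8)": iso_elastic(-8),
        "N(9e-5)": negative_exponential(9e-5),
    }
    for name, (u, u_inv) in families.items():
        p_star = max_premium(u, u_inv, W, L, q)
        print(f"{name}: P* = {p_star:.2f}, fair premium = {q * L:.2f}")
```

In each case the maximum premium exceeds the fair premium qL, and the gap widens as risk-aversion increases.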
Since iso-elastic utility with λ = −8 has absolute risk-aversion coefficient equal to $9 \times 10^{-5}$ when wealth is £100,000, our assumption of W = £100,000 allows us to compare the two utility functions.

Chapter 6

Adverse Selection in a 2-state Insurance Model

6.1 A Simple Gene-environment Interaction Model

We will illustrate the principles of underwriting long-term insurance in the presence of a multifactorial disorder in the simple setting of the two-state continuous-time model in Figure 6.25. We will also assume that all individuals have the same initial wealth W and follow the same utility function of wealth U(w).

[Figure 6.25: A two state model, with a single transition from state A to state B at intensity $\lambda_s(x)$.]

The insured event could be death or illness, and it is represented by transition from state A to state B. The probability of transition is governed by the transition intensity $\lambda_s(x)$, which depends on age x, and the values of various risk factors which are labelled s (for ‘stratum’).

The risk factors arise from a 2 × 2 gene-environment interaction model. That is, there are two genotypes, denoted G and g, and two levels of environmental exposure, denoted E and e. We assume that G and E are adverse exposures while g and e are beneficial. Therefore, there are four risk groups or strata, that we label ge, gE, Ge and GE. Let the proportion of the population at a particular age (at which an insurance contract is sold) in stratum s be $w_s$. The epidemiology is defined as follows.

(a) We assume proportional hazards, so for each stratum s there is a constant $k_s$ such that $\lambda_s(x)/\lambda_{ge}(x) = k_s$ for all ages x. Clearly $k_{ge} = 1$.

(b) We assume symmetry between genetic and environmental risks, as follows:

(1) The probability of possessing the beneficial gene g is the same as the probability of exposure to the beneficial environment e, each denoted ω. Assuming independence, $w_{ge} = \omega^2$, $w_{gE} = w_{Ge} = \omega(1-\omega)$ and $w_{GE} = (1-\omega)^2$.

(2) We assume that $k_{gE} = k_{Ge} = k$.
(c) The gene-environment interaction is represented by either an additive or a multiplicative model, as follows:

(1) Additive Model: $k_{GE} = k_{Ge} + k_{gE} - k_{ge} = 2k - 1$.

(2) Multiplicative Model: $k_{GE} = k_{Ge} k_{gE} / k_{ge} = k^2$.

See Woodward (1999) for a discussion of additive and multiplicative models. Therefore, the epidemiology is fully defined by the parameters $\lambda_{ge}(x)$, ω and k along with the choice of interaction.

6.2 Single Premiums

For simplicity, let the force of interest be δ = 0. (This is consistent with the assumptions of Doherty and Thistle (1996), Hoy and Polborn (2000) and Hoy and Witt (2005).) Then the single premium for an insurance contract of term n years, with sum assured £1, sold to a person aged x who belongs to stratum s is:

$$q_s = 1 - \exp\left[-\int_0^n \lambda_s(x+y)\,dy\right] = 1 - (1-q_{ge})^{k_s}. \tag{6.53}$$

If the proportion of insurance purchasers aged x in each stratum is the same as the proportion in the population, $w_s$ (for example if the stratum is not known to applicants or to insurers), observation of claim statistics will lead the insurer to charge a weighted average premium rate

$$\bar{q} = \sum_s w_s q_s = \sum_s w_s [1 - (1-q_{ge})^{k_s}] = 1 - \sum_s w_s (1-q_{ge})^{k_s} \tag{6.54}$$

per unit sum assured. Given our assumption that the $k_s$ can all be expressed as simple functions of k, the stratum-specific and average premium rates can also be expressed as $q_s(k)$ and $\bar{q}(k)$.

In particular, a neat expression can be derived using the assumption of an additive model along with symmetry between genetic and environmental risks, set out in Section 6.1. Starting from Equation 6.54 and incorporating these assumptions, we get:

$$\begin{aligned} 1 - \bar{q}(k) &= \sum_s w_s (1-q_{ge})^{k_s} \\ &= \omega^2 (1-q_{ge}) + 2\omega(1-\omega)(1-q_{ge})^{k} + (1-\omega)^2 (1-q_{ge})^{2k-1} \\ &= (1-q_{ge})\left[\omega^2 + 2\omega(1-\omega)(1-q_{ge})^{k-1} + (1-\omega)^2 (1-q_{ge})^{2(k-1)}\right] \\ &= (1-q_{ge})\left[\omega + (1-\omega)(1-q_{ge})^{k-1}\right]^2. \end{aligned} \tag{6.55}$$

Alternatively, given values of $\bar{q}$, $q_{ge}$ and ω, one can solve Equation 6.55 for k, using:

$$k = 1 + \frac{\log\left[\left(\sqrt{\dfrac{1-\bar{q}}{1-q_{ge}}} - \omega\right) \Big/ (1-\omega)\right]}{\log(1-q_{ge})}. \tag{6.56}$$

6.3 Threshold Premium

Suppose all individuals have initial wealth W and that the net effect of suffering the insured event in the next n years is a loss of L. We assume partial insurance is not possible, so that the individual insures against the full loss L or does not insure at all. Define the loss ratio f = L/W. If no-one knows to which stratum they belong, everyone will be willing to pay a single premium of up to:

$$P^* = W - U^{-1}[\bar{q}(k)U(W-L) + (1-\bar{q}(k))U(W)]. \tag{6.57}$$

However, someone who knows they are in stratum s will be willing to pay a single premium of up to:

$$P_s^* = W - U^{-1}[q_s(k)U(W-L) + (1-q_s(k))U(W)]. \tag{6.58}$$

$P_s^*$ is smallest for stratum ge. So if the insurer, ignorant of the stratum, continues to charge premium $\bar{q}(k)L$, adverse selection will first appear if $\bar{q}(k)L > P_{ge}^*$. That is, if:

$$U(W - \bar{q}(k)L) < q_{ge}(k)U(W-L) + (1-q_{ge}(k))U(W). \tag{6.59}$$

6.4 The Additive Epidemiological Model

Replace the inequality in Equation 6.59 with an equality and solve for k; this represents the relative risk (of each risk factor) with respect to stratum ge, above which persons who know they are in stratum ge will cease to buy insurance. Doing this with iso-elastic utility with λ ≠ 0 we obtain:

$$(1 - \bar{q}(k)f)^{\lambda} = q_{ge}(1-f)^{\lambda} + (1-q_{ge}). \tag{6.60}$$

In the special case of logarithmic utility (iso-elastic utility with λ = 0) we obtain:

$$1 - \bar{q}(k)f = (1-f)^{q_{ge}} \tag{6.61}$$

and under negative exponential utility:

$$e^{\bar{q}(k)AL} = q_{ge}e^{AL} + (1-q_{ge}) \tag{6.62}$$

in which wealth W does not appear.

As expected, risk preferences characterised by different utility functions produce different values of $\bar{q}(k)$. Once $\bar{q}(k)$ is obtained for a particular utility function, the value of k can be derived from Equation 6.56. Specifically, we have solved Equations 6.60, 6.61 and 6.62 for certain values of baseline risk $q_{ge}$ and loss L, assuming an initial wealth of W = £100,000. Then using ω = 0.5 (a uniform distribution across strata) and an additive model, we solve for k.
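The two-step calculation just described can be sketched in a few lines of Python. This is only an illustration under the chapter's assumptions (additive model, symmetric risks, δ = 0); the function names and the string labels for the utility families are hypothetical, not taken from the thesis. The first function solves Equations 6.60 to 6.62 for the threshold average premium rate, and the second applies Equation 6.56, returning `None` when no finite k exists.

```python
import math


def threshold_q_bar(family, q_ge, loss, wealth=100_000.0, lam=None, a=None):
    """Average premium rate at which stratum ge is indifferent to insuring.

    family is "iso" (Equations 6.60 and 6.61, parameter lam) or
    "negexp" (Equation 6.62, parameter a). Labels are illustrative.
    """
    f = loss / wealth  # loss ratio f = L / W
    if family == "iso":
        if lam == 0:  # logarithmic utility, Equation 6.61
            return (1 - (1 - f) ** q_ge) / f
        rhs = q_ge * (1 - f) ** lam + (1 - q_ge)  # Equation 6.60
        return (1 - rhs ** (1.0 / lam)) / f
    if family == "negexp":  # Equation 6.62; wealth does not appear
        return math.log(q_ge * math.exp(a * loss) + (1 - q_ge)) / (a * loss)
    raise ValueError("unknown utility family")


def additive_k(q_bar, q_ge, omega):
    """Relative risk k from Equation 6.56, or None if no finite k exists."""
    root = math.sqrt((1 - q_bar) / (1 - q_ge))
    ratio = (root - omega) / (1 - omega)
    if ratio <= 0:
        return None  # the average premium can never reach the threshold
    return 1 + math.log(ratio) / math.log(1 - q_ge)


if __name__ == "__main__":
    for label, kwargs in [("I(0.5)", dict(family="iso", lam=0.5)),
                          ("Log", dict(family="iso", lam=0)),
                          ("I(-8)", dict(family="iso", lam=-8)),
                          ("N(9e-5)", dict(family="negexp", a=9e-5))]:
        q_bar = threshold_q_bar(q_ge=0.1, loss=10_000.0, **kwargs)
        print(label, additive_k(q_bar, q_ge=0.1, omega=0.5))
```

The `None` branch corresponds to the case where the square-root term in Equation 6.56 does not exceed ω, so the logarithm is undefined.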
The results are in Table 6.23.

Table 6.23: The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.5 and an additive model. Columns give the loss L in £’000; “–” marks cases where no such k exists.

Utility   q_ge        10      20      30      40      50      60      70      80      90
I(0.5)    0.1      1.025   1.053   1.085   1.122   1.165   1.217   1.284   1.373   1.513
          0.2      1.024   1.050   1.081   1.116   1.158   1.209   1.274   1.364   1.506
          0.3      1.022   1.047   1.076   1.110   1.150   1.200   1.263   1.352   1.497
          0.4      1.021   1.044   1.072   1.103   1.142   1.189   1.251   1.339   1.486
          0.5      1.019   1.041   1.066   1.096   1.132   1.178   1.238   1.324   1.472
Log       0.1      1.051   1.110   1.180   1.264   1.368   1.504   1.691   1.976   2.524
          0.2      1.048   1.104   1.170   1.250   1.350   1.479   1.659   1.939   2.488
          0.3      1.045   1.098   1.160   1.235   1.330   1.453   1.626   1.898   2.451
          0.4      1.042   1.091   1.149   1.220   1.308   1.425   1.590   1.854   2.413
          0.5      1.039   1.084   1.138   1.203   1.286   1.395   1.551   1.805   2.372
I(−8)     0.1      1.598   2.755   4.947   8.831  15.950       –       –       –       –
          0.2      1.546   2.512   4.153   6.972  14.430       –       –       –       –
          0.3      1.498   2.322   3.664   6.148       –       –       –       –       –
          0.4      1.451   2.163   3.313   5.810       –       –       –       –       –
          0.5      1.405   2.023   3.035   6.107       –       –       –       –       –
N(9e-5)   0.1      1.566   2.504   3.917   5.793   8.036  10.574  13.428  16.739  20.862
          0.2      1.516   2.292   3.337   4.617   6.119   7.911  10.226  13.900       –
          0.3      1.468   2.126   2.963   3.972   5.204   6.857   9.812       –       –
          0.4      1.423   1.987   2.684   3.536   4.655   6.519       –       –       –
          0.5      1.379   1.864   2.457   3.206   4.305   7.636       –       –       –

We observe the following:

(a) For low loss ratios, even small relative risks k will cause people in the baseline stratum to opt against insurance. This is as expected as small losses are relatively tolerable.

(b) As the loss ratio f increases, so does the relative risk at which adverse selection appears. This is simply risk aversion at work.

(c) The higher the baseline risk $q_{ge}$ for a given loss ratio f, the lower the relative risk at which adverse selection appears. This is the result of a concave utility function, as the fair actuarial price increases and depletes wealth.
(d) Lower risk-aversion under iso-elastic utility (λ = 0.5) means that smaller relative risks would discourage members of the baseline stratum from buying insurance at the average premium, and for higher risk-aversion (λ = −8) the reverse is true.

(e) We have assumed here that everyone has the same utility function and that partial insurance is not possible. This means that in our model, individuals either insure fully or do not insure at all. In reality, it is possible that individuals would opt for partial insurance, which we ignore here to keep the model simple.

Comparing iso-elastic and negative exponential utilities, we see that the limiting relative risks are broadly similar for smaller losses. For larger losses, however, iso-elastic utility functions have much greater limiting relative risks. This is because risk-aversion increases as wealth falls under iso-elastic utility, while for negative exponential utility it is constant. As the fair actuarial premium for bigger losses increases and depletes wealth, risk-aversion under iso-elastic utility climbs above that under negative exponential utility, with the result shown.

6.5 Immunity From Adverse Selection

The missing entries in Table 6.23 mean that adverse selection never appears, whatever the relative risk k. Clearly, this must be related to the size of the high-risk strata, and their ability, or otherwise, to move the average premium enough to affect the baseline stratum. We may ask: given $q_{ge}$ and f, is there some proportion $w_{ge}$ in the lowest risk stratum above which members of that stratum will always buy insurance at the average premium rate? Begin by noting that:

$$\lim_{k\to\infty} \bar{q}(k) = \lim_{k\to\infty} \sum_s w_s [1 - (1-q_{ge})^{k_s}] = w_{ge}q_{ge} + \sum_{s \neq ge} w_s = 1 - w_{ge}(1-q_{ge}) \tag{6.63}$$

and that this limit is not a function of the $k_s$ and thus holds for additive and multiplicative models. As a check, it can be easily verified from Equation 6.55 that the limit is valid for additive models.
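The check for the additive model can be written out as follows. This is a short sketch rather than part of the original derivation; it assumes $0 < q_{ge} < 1$, so that $(1-q_{ge})^{k-1} \to 0$ as $k \to \infty$, and uses $w_{ge} = \omega^2$ from Section 6.1.

```latex
\begin{aligned}
\lim_{k\to\infty} \left[1 - \bar{q}(k)\right]
  &= (1-q_{ge}) \left[\omega + (1-\omega) \lim_{k\to\infty} (1-q_{ge})^{k-1}\right]^2 \\
  &= (1-q_{ge})\,\omega^2 \\
  &= w_{ge}(1-q_{ge}),
\end{aligned}
\qquad\text{so}\qquad
\lim_{k\to\infty} \bar{q}(k) = 1 - w_{ge}(1-q_{ge}).
```

This agrees with Equation 6.63.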
Now, substituting this limiting value in Equations 6.60 to 6.62, we can solve for $w_{ge}$ as follows, for iso-elastic utility with λ ≠ 0:

$$w_{ge} = \frac{1}{1-q_{ge}}\left[1 - \frac{1 - \left(q_{ge}(1-f)^{\lambda} + (1-q_{ge})\right)^{1/\lambda}}{f}\right], \tag{6.64}$$

for logarithmic utility:

$$w_{ge} = \frac{1}{1-q_{ge}}\left[1 - \frac{1 - (1-f)^{q_{ge}}}{f}\right] \tag{6.65}$$

and for negative exponential utility:

$$w_{ge} = \frac{1}{1-q_{ge}}\left[1 - \frac{\log[q_{ge}e^{AL} + (1-q_{ge})]}{AL}\right]. \tag{6.66}$$

Values of $\omega = w_{ge}^{1/2}$ are given in Table 6.24. Values of ω < 0.5 in Table 6.24 correspond to missing entries in Table 6.23.

Table 6.24: The proportions ω exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk k, using different utility functions. Columns give the loss L in £’000.

Utility   q_ge        10      20      30      40      50      60      70      80      90
I(0.5)    0.1      0.999   0.997   0.996   0.994   0.991   0.989   0.985   0.981   0.974
          0.2      0.997   0.994   0.991   0.987   0.983   0.977   0.970   0.961   0.947
          0.3      0.996   0.992   0.987   0.981   0.974   0.966   0.955   0.941   0.919
          0.4      0.995   0.989   0.982   0.974   0.965   0.954   0.940   0.920   0.890
          0.5      0.993   0.986   0.978   0.968   0.956   0.942   0.924   0.899   0.860
Log       0.1      0.997   0.994   0.991   0.986   0.981   0.974   0.965   0.951   0.926
          0.2      0.995   0.989   0.981   0.973   0.962   0.949   0.932   0.906   0.859
          0.3      0.992   0.983   0.972   0.960   0.945   0.925   0.900   0.863   0.798
          0.4      0.989   0.977   0.963   0.947   0.927   0.902   0.870   0.823   0.743
          0.5      0.987   0.972   0.954   0.934   0.910   0.880   0.841   0.786   0.693
I(−8)     0.1      0.969   0.916   0.830   0.719   0.603   0.496   0.398   0.304   0.203
          0.2      0.943   0.857   0.747   0.632   0.525   0.431   0.345   0.264   0.176
          0.3      0.919   0.812   0.693   0.580   0.480   0.393   0.315   0.241   0.161
          0.4      0.897   0.776   0.653   0.543   0.448   0.367   0.294   0.225   0.150
          0.5      0.878   0.746   0.622   0.515   0.424   0.347   0.279   0.213   0.142
N(9e-5)   0.1      0.971   0.927   0.868   0.802   0.738   0.682   0.635   0.595   0.562
          0.2      0.946   0.875   0.797   0.723   0.660   0.607   0.564   0.528   0.498
          0.3      0.923   0.835   0.748   0.673   0.612   0.562   0.522   0.488   0.461
          0.4      0.903   0.802   0.712   0.637   0.577   0.530   0.492   0.460   0.434
          0.5      0.884   0.775   0.682   0.608   0.551   0.505   0.468   0.439   0.414

Table 6.24 shows just how uncommon an adverse exposure has to be to avoid adverse selection. Assuming ω = 0.5 is perhaps extreme; it means that half the population possess a significant genetic risk factor (modulated by environment) yet to be discovered. This is by no means impossible, but we might expect most as-yet unknown risk factors to affect a smaller proportion of the population, simply because they are as-yet unknown. So, we increase ω to 0.9, so that only 10% of individuals are exposed to the adverse environment or possess the adverse gene. The relative risks k at which adverse selection appears are given in Table 6.25. They are larger than in Table 6.23 because the relative risk experienced by the smaller number of high-risk individuals has to be much higher to have the same impact on the average premium.

Table 6.25: The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.9 and an additive model. Columns give the loss L in £’000; “–” marks cases where no such k exists.

Utility   q_ge        10      20      30      40      50      60      70      80      90
I(0.5)    0.1      1.126   1.269   1.433   1.625   1.855   2.140   2.511   3.033   3.899
          0.2      1.120   1.258   1.419   1.613   1.852   2.158   2.577   3.212   4.419
          0.3      1.113   1.246   1.404   1.599   1.847   2.180   2.668   3.502   5.689
          0.4      1.106   1.233   1.387   1.582   1.841   2.210   2.807   4.108       –
          0.5      1.099   1.218   1.367   1.562   1.833   2.250   3.055       –       –
Log       0.1      1.257   1.563   1.934   2.399   3.004   3.839   5.101   7.368  13.841
          0.2      1.246   1.546   1.923   2.418   3.107   4.170   6.164  13.981       –
          0.3      1.233   1.526   1.910   2.444   3.268   4.844       –       –       –
          0.4      1.220   1.504   1.894   2.482   3.555   8.317       –       –       –
          0.5      1.205   1.479   1.876   2.542   4.296       –       –       –       –
I(−8)     0.1      4.458  18.642       –       –       –       –       –       –       –
          0.2      4.823       –       –       –       –       –       –       –       –
          0.3      5.705       –       –       –       –       –       –       –       –
          0.4          –       –       –       –       –       –       –       –       –
          0.5          –       –       –       –       –       –       –       –       –
N(9e-5)   0.1      4.246  13.531       –       –       –       –       –       –       –
          0.2      4.514       –       –       –       –       –       –       –       –
          0.3      5.109       –       –       –       –       –       –       –       –
          0.4      7.984       –       –       –       –       –       –       –       –
          0.5          –       –       –       –       –       –       –       –       –

6.6 The Multiplicative Epidemiological Model

Unlike Equation 6.55 for additive models, we cannot derive a neat expression for $\bar{q}(k)$ in multiplicative models. However, the equations can easily be solved numerically. Table 6.26 shows relative risks above which adverse selection appears, assuming ω = 0.9 and a multiplicative model. They can be compared with the values in Table 6.25.

Table 6.26: The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.9 and a multiplicative model. Columns give the loss L in £’000; “–” marks cases where no such k exists.

Utility   q_ge        10      20      30      40      50      60      70      80      90
I(0.5)    0.1      1.125   1.265   1.424   1.608   1.825   2.090   2.431   2.907   3.701
          0.2      1.119   1.255   1.412   1.598   1.825   2.115   2.511   3.119   4.315
          0.3      1.113   1.243   1.398   1.586   1.824   2.144   2.617   3.447   5.660
          0.4      1.106   1.230   1.381   1.571   1.822   2.181   2.773   4.086       –
          0.5      1.098   1.216   1.362   1.553   1.817   2.229   3.037       –       –
Log       0.1      1.254   1.549   1.899   2.328   2.880   3.645   4.839   7.107  13.706
          0.2      1.243   1.533   1.892   2.360   3.018   4.065   6.086  13.967       –
          0.3      1.231   1.516   1.884   2.399   3.212   4.805       –       –       –
          0.4      1.218   1.495   1.873   2.449   3.527   8.314       –       –       –
          0.5      1.203   1.472   1.859   2.521   4.288       –       –       –       –
I(−8)     0.1      4.223  18.561       –       –       –       –       –       –       –
          0.2      4.723       –       –       –       –       –       –       –       –
          0.3      5.676       –       –       –       –       –       –       –       –
          0.4          –       –       –       –       –       –       –       –       –
          0.5          –       –       –       –       –       –       –       –       –
N(9e-5)   0.1      4.024  13.391       –       –       –       –       –       –       –
          0.2      4.410       –       –       –       –       –       –       –       –
          0.3      5.073       –       –       –       –       –       –       –       –
          0.4      7.981       –       –       –       –       –       –       –       –
          0.5          –       –       –       –       –       –       –       –       –

We observe the following:

(a) The missing entries are the same as in the additive model. This is because the limiting values of $\bar{q}(k)$ and ω do not depend on the model structure.

(b) The relative risk in stratum GE is higher in the multiplicative model ($k^2 > 2k - 1$), so persons in the baseline stratum will be less tolerant towards any given value of k. This is why the values in Table 6.26 are smaller than those in Table 6.25.

(c) However, the differences between the additive and multiplicative models are not very large. If k ≈ 1, then $k^2 \approx 2k - 1$, and for large values of ω (which arguably is most realistic) the impact of stratum GE is relatively small. In view of this, we will use only the additive model from now on.
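One way the numerical solution for the multiplicative model might be carried out is sketched below in Python. The thesis does not say which root-finding method was used; bisection is an assumption made here because $\bar{q}(k)$ in Equation 6.54 is increasing in k, and the function names are hypothetical. The solver returns `None` when the target premium rate is at or above the limit in Equation 6.63, which is the "immune" case.

```python
def q_bar(k, q_ge, omega, multiplicative=True):
    """Average premium rate of Equation 6.54 under the symmetry assumptions."""
    k_GE = k * k if multiplicative else 2 * k - 1
    survival = 1 - q_ge
    strata = [
        (omega ** 2, 1.0),              # stratum ge
        (2 * omega * (1 - omega), k),   # strata gE and Ge combined
        ((1 - omega) ** 2, k_GE),       # stratum GE
    ]
    return 1 - sum(w * survival ** ks for w, ks in strata)


def solve_k(target, q_ge, omega, multiplicative=True, k_max=1e6, tol=1e-10):
    """Find k with q_bar(k) = target by bisection, or None if no such k exists.

    Bisection is an illustrative choice, not necessarily the author's method.
    """
    limit = 1 - omega ** 2 * (1 - q_ge)  # Equation 6.63
    if target >= limit:
        return None  # average premium can never reach the threshold
    lo, hi = 1.0, 2.0
    while q_bar(hi, q_ge, omega, multiplicative) < target:
        hi *= 2  # expand the bracket until the target is enclosed
        if hi > k_max:
            return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q_bar(mid, q_ge, omega, multiplicative) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Threshold rate for log utility (Equation 6.61) with q_ge = 0.1, f = 0.1.
    target = (1 - 0.9 ** 0.1) / 0.1
    print("multiplicative:", solve_k(target, 0.1, 0.9, multiplicative=True))
    print("additive:      ", solve_k(target, 0.1, 0.9, multiplicative=False))
```

Running the same solver with `multiplicative=False` gives a cross-check against the closed form of Equation 6.56.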
Chapter 7

Adverse Selection in a Critical Illness Insurance Model

7.1 A Heart Attack Model

We now model the specific example of CI insurance. We will focus on heart attack risk, building upon the material developed in earlier chapters.

(a) We will use the CI insurance model developed by Gutiérrez and Macdonald (2003), which we have already seen in Section 3.5.1. To recap, the authors parameterised the CI model shown in Figure 7.26, using medical studies and population data. Therefore, in particular, $\lambda_{12}(x)$ denotes the rate of onset of heart attacks in the general population (different for males and females).

(b) In Chapter 3, we assumed that a 2 × 2 gene-environment interaction affected heart attack risk, with genotypes G and g, and environmental exposures E and e, upper case representing higher risk. So there were four strata for each sex: ge, gE, Ge and GE. We showed that it is possible to postulate assumptions on strata-specific relative risks, in a way which is consistent with the rate of onset in the general population. We will use a similar technique here.

[Figure 7.26: A full critical illness model. Transitions from State 1 (Healthy) to State 2 (Heart Attack), State 3 (Cancer), State 4 (Stroke), State 5 (Other CI) and State 6 (Dead), with intensities $\lambda_{12}(x)$, $\lambda_{13}(x)$, $\lambda_{14}(x)$, $\lambda_{15}(x)$ and $\lambda_{16}(x)$ respectively.]

Consider all healthy individuals aged x. If $\bar{q}$ denotes the probability that a healthy person aged x has a heart attack before age x + t, it can be calculated from the heart attack transition intensity of the general population as follows:

$$\bar{q} = 1 - \exp\left[-\int_0^t \lambda_{12}(x+y)\,dy\right]. \tag{7.67}$$

Now, for males and females separately, let c denote the relative risk in the baseline stratum ge with respect to the general population, and let $k_s$ denote the relative risk in stratum s with respect to stratum ge, in both cases assumed to be constant at all ages (in other words, we assume a proportional hazards model).
If we denote the rate of onset of heart attack in stratum $s$ by $\lambda^s_{12}(x)$, it is given by:

\[ \lambda^s_{12}(x) = c \times k_s \times \lambda_{12}(x). \tag{7.68} \]

Suppose that, at age $x$, the proportion of healthy individuals who are in stratum $s$ is $w_s$. In stratum $s$, let $q_s$ be the probability that a healthy person aged $x$ has a first heart attack before reaching age $x + t$. Then using Equations 7.67 and 7.68, we can show that:

\[ q_s = 1 - \exp\left[ -\int_0^t \lambda^s_{12}(x+y)\,dy \right] = 1 - (1 - \bar{q})^{c k_s}. \tag{7.69} \]

Equating the weighted average probability over all strata with the population probability, that is, $\bar{q} = \sum_s w_s q_s$, we have:

\[ \bar{q} = \sum_s w_s \left[ 1 - (1 - \bar{q})^{c k_s} \right]. \tag{7.70} \]

Given the relative risks, the population proportions and the estimated $\lambda_{12}(x)$, we can solve this for $c$, which fully specifies the stratum-specific intensities $\lambda^s_{12}(x)$.

7.2 Threshold Premium for Critical Illness Insurance

To extend the two-state insurance model of Section 6.1 to the CI model with six states, we make some simplifying assumptions.

(a) We will model gene-environment interactions affecting heart attack risk alone, leaving other intensities unaffected. This is not completely realistic, since many known risk factors for heart disease are also risk factors for other disorders.

(b) The heart attack transition intensity is different for males and females. Figure 7.27 shows the ratio $\lambda_{12}(x) / \sum_{j=2}^{5} \lambda_{1j}(x)$ for both sexes. Heart attack is the predominant CI among middle-aged men, while among women, heart attack is increasingly prominent from age 30 onwards, but cancer is the dominant CI at all ages. The ratio for males stays significantly higher than the ratio for females, except at very high ages. Hence we might expect adverse selection to appear at different relative risk thresholds for the two sexes.

7.3 Premium Rates for Critical Illness Insurance

As examples, we model single-premium CI insurance contracts of duration 15 years sold to males and females aged 25, 35 and 45.
First, assuming all transition intensities are as given by Gutiérrez and Macdonald (2003), we compute the single premiums as expected present values (EPVs) of the benefit payments by solving Thiele's differential equations (see Norberg (1995)) numerically. Again for simplicity, we assume the force of interest $\delta = 0$. Table 7.27 gives the CI premium rates per unit sum assured for these contracts.

[Figure 7.27: The ratio of heart attack transition intensity to total critical illness transition intensity, by gender. Horizontal axis: age (years), 0 to 80; vertical axis: ratio, 0 to 1; curves: male and female.]

Table 7.27: The premium rates of critical illness contracts of duration 15 years.

Age    Male        Female
25     0.013787    0.018746
35     0.048413    0.049715
45     0.136363    0.110434

We make the same epidemiological assumptions as before, namely that $k_{gE} = k_{Ge} = k$; that an additive model ($k_{GE} = 2k - 1$) applies, and that $w_{ge} = \omega^2$, $w_{gE} = w_{Ge} = \omega(1 - \omega)$, and $w_{GE} = (1 - \omega)^2$, where $\omega = 0.9$ (the more realistic assumption); and also that initial wealth is $W$ = £100,000. Given the relative risks, we obtain $c$ and hence the heart attack intensity for each sex and stratum as in Section 7.1. This allows us to calculate stratum-specific premium rates.

Let $P_s$ denote the single premium rate for unit CI insurance in stratum $s$. Note that apart from the stratum-specific heart attack risk, $P_s$ also covers the risk of all other CIs, which are assumed to be the same for all strata. Let $\bar{P}$ denote the population average premium rate for unit CI insurance (the averaging being over all strata for a given gender). As before, since we are ignoring interest rates and profit margins, the various premium rates defined above are the same as the probabilities of the event insured against. Then define a function $Z(P)$ of a premium $P$ as follows:

\[ Z(P) = U(W - \bar{P} L) - \left[ P\,U(W - L) + (1 - P)\,U(W) \right]. \tag{7.71} \]

Note that $Z(P_{ge}) < 0$ is the condition under which adverse selection will appear, equivalent to Equation 6.59 of Section 6.3. Equivalently, let $P^\dagger$ be the solution of $Z(P) = 0$; since $Z(P)$ is linear in $P$, this is $P^\dagger = [U(W) - U(W - \bar{P}L)] / [U(W) - U(W - L)]$. Then $P_{ge} < P^\dagger$ is the condition for adverse selection to appear. Tables 7.28 and 7.29 show $P^\dagger$ for males and females respectively. It depends on the utility function but not on the epidemiological model.

For the 2-state model, Equation 6.59 was central in our analysis. Given (a) a model structure (additive or multiplicative), the baseline risk $q_{ge}$, and the proportion $\omega$ with low values of each risk factor; and (b) the fact that the average risk $\bar{q}$ was an increasing function of the relative risk parameter $k$, we obtained a minimum value of $k$ for which adverse selection first appears.

Table 7.28: $P^\dagger$ for males, which solves $Z(P) = 0$, for different combinations of utility functions and losses, using initial wealth $W$ = £100,000.

Utility          Loss L in £'000
function   Age   10         20         30         40         50         60         70         80         90
I(0.5)     25    0.013438   0.013068   0.012674   0.012250   0.011788   0.011277   0.010695   0.010004   0.009102
           35    0.047229   0.045969   0.044622   0.043167   0.041577   0.039808   0.037788   0.035378   0.032216
           45    0.133321   0.130058   0.126534   0.122691   0.118448   0.113678   0.108172   0.101522   0.092679
Log        25    0.013095   0.012374   0.011620   0.010826   0.009980   0.009065   0.008055   0.006891   0.005423
           35    0.046062   0.043604   0.041019   0.038282   0.035353   0.032171   0.028636   0.024543   0.019348
           45    0.130316   0.123918   0.117108   0.109801   0.101879   0.093158   0.083326   0.071772   0.056865
I(−8)      25    0.008388   0.004503   0.002062   0.000773   0.000223   0.000045   0.000005   0.000000   0.000000
           35    0.029922   0.016319   0.007596   0.002893   0.000849   0.000174   0.000021   0.000001   0.000000
           45    0.087752   0.049912   0.024272   0.009674   0.002978   0.000642   0.000081   0.000004   0.000000
N(9e-5)    25    0.008554   0.004976   0.002733   0.001429   0.000719   0.000351   0.000167   0.000078   0.000036
           35    0.030512   0.018032   0.010061   0.005349   0.002734   0.001356   0.000656   0.000312   0.000146
           45    0.089459   0.055093   0.032069   0.017804   0.009517   0.004938   0.002504   0.001247   0.000613

Table 7.29: $P^\dagger$ for females, which solves $Z(P) = 0$, for different combinations of utility functions and losses, using initial wealth $W$ = £100,000.

Utility          Loss L in £'000
function   Age   10         20         30         40         50         60         70         80         90
I(0.5)     25    0.018273   0.017773   0.017239   0.016664   0.016038   0.015344   0.014554   0.013616   0.012389
           35    0.048500   0.047209   0.045827   0.044334   0.042702   0.040886   0.038813   0.036339   0.033093
           45    0.107899   0.105188   0.102269   0.099094   0.095600   0.091684   0.087179   0.081758   0.074580
Log        25    0.017809   0.016833   0.015811   0.014734   0.013586   0.012344   0.010971   0.009388   0.007390
           35    0.047304   0.044782   0.042131   0.039322   0.036315   0.033050   0.029420   0.025217   0.019880
           45    0.105398   0.100089   0.094460   0.088443   0.081945   0.074821   0.066825   0.057471   0.045463
I(−8)      25    0.011431   0.006150   0.002823   0.001060   0.000307   0.000062   0.000007   0.000000   0.000000
           35    0.030745   0.016778   0.007814   0.002978   0.000875   0.000180   0.000021   0.000001   0.000000
           45    0.070219   0.039438   0.018924   0.007438   0.002256   0.000479   0.000059   0.000003   0.000000
N(9e-5)    25    0.011657   0.006796   0.003740   0.001961   0.000989   0.000483   0.000231   0.000108   0.000050
           35    0.031351   0.018539   0.010350   0.005506   0.002817   0.001397   0.000677   0.000322   0.000151
           45    0.071593   0.043550   0.025029   0.013714   0.007231   0.003700   0.001849   0.000908   0.000439

Table 7.30: The population average premium rate for CI insurance, $P_0$, as if heart attack risk were absent ($\lambda_{12} = 0$).

Age    Male        Female
25     0.009821    0.018326
35     0.031290    0.046485
45     0.092818    0.097947

We would like to do the same for the CI insurance model. However, there are important differences between the two models.

(a) In the 2-state model we specified the baseline risk and relative risks, and these determined the average risk. In the CI insurance model, we specify the average risk (given by the population heart attack risk) and the relative risks, and these determine the baseline risk, in the form of the relative risk $c$. Clearly increasing the relative risk $k$ will cause $c$ to fall, hence also the premium $P_{ge}$.
To make this dependence clear, we will write $c(k)$ and $P_{ge}(k)$ in this section. It will also be useful to note that the probability $q_{ge}$ of a heart attack similarly depends on $k$, and write $q_{ge}(k)$.

(b) However, unlike in the 2-state model, $P_{ge}(k)$ has a lower bound, denoted $P_0$, given by the population average premium rate for CI insurance as if heart attack risk were absent ($\lambda_{12} = 0$ and $c = 0$). These values are shown in Table 7.30. They do not depend on the epidemiological model or the utility function. Clearly $P_{ge}(k) \geq P_0$, no matter how high $k$ becomes. Thus we have two possibilities: $\lim_{k\to\infty} P_{ge}(k) = P_0$ (equivalently $\lim_{k\to\infty} c(k) = 0$); or $\lim_{k\to\infty} P_{ge}(k) > P_0$ (equivalently $\lim_{k\to\infty} c(k) > 0$). We return to this point in Section 7.4.

(c) If $P_{ge}(k)$ is a strictly decreasing function, which it is for the utility functions we are using, adverse selection is possible if $\lim_{k\to\infty} P_{ge}(k) < P^\dagger$, and in such cases we can solve $P_{ge}(k) = P^\dagger$ for the threshold value of $k$ above which adverse selection will appear. Tables 7.31 and 7.32 show these values for the various utility functions and loss levels, for males and females respectively. The missing values correspond to combinations of parameters such that $\lim_{k\to\infty} P_{ge}(k) > P^\dagger$, for which adverse selection will not appear.

(d) Another consequence of this is that there is a level of insured loss, that we denote $L_0$, above which adverse selection cannot occur, because fixing $L > L_0$ in Equation 7.71 and solving for $P^\dagger$ yields a solution $P^\dagger < P_{ge}(k)$ for all $k$. Table 7.33 gives the values of $L_0$, for the usual utility functions and initial wealth £100,000. The missing values in Tables 7.31 and 7.32 occur for losses $L > L_0$.

Table 7.31: The relative risk $k$ above which males of different ages in stratum ge with initial wealth $W$ = £100,000 will not buy critical illness insurance policies of term 15 years, where $\omega = 0.9$.

Utility          Loss L in £'000
function   Age   10       20       30       40       50        60        70       80        90
I(0.5)     25    1.484    2.111    2.960    4.183    6.117     9.698     18.869   105.569   –
           35    1.376    1.846    2.450    3.262    4.420     6.226     9.509    17.715    93.578
           45    1.389    1.886    2.544    3.456    4.808     7.027     11.388   24.239    –
Log        25    2.062    3.783    7.068    15.883   122.410   –         –        –         –
           35    1.808    2.998    4.917    8.530    17.855    98.596    –        –         –
           45    1.843    3.138    5.339    9.794    23.063    765.192   –        –         –
I(−8)      25    –        –        –        –        –         –         –        –         –
           35    –        –        –        –        –         –         –        –         –
           45    –        –        –        –        –         –         –        –         –
N(9e-5)    25    –        –        –        –        –         –         –        –         –
           35    –        –        –        –        –         –         –        –         –
           45    –        –        –        –        –         –         –        –         –

Table 7.32: The relative risk $k$ above which females of different ages in stratum ge with initial wealth $W$ = £100,000 will not buy critical illness insurance policies of term 15 years, where $\omega = 0.9$.

Utility          Loss L in £'000
function   Age   10       20       30       40       50       60       70       80       90
I(0.5)     25    –        –        –        –        –        –        –        –        –
           35    4.031    18.470   –        –        –        –        –        –        –
           45    2.293    4.710    10.770   52.668   –        –        –        –        –
Log        25    –        –        –        –        –        –        –        –        –
           35    15.856   –        –        –        –        –        –        –        –
           45    4.459    26.155   –        –        –        –        –        –        –
I(−8)      25    –        –        –        –        –        –        –        –        –
           35    –        –        –        –        –        –        –        –        –
           45    –        –        –        –        –        –        –        –        –
N(9e-5)    25    –        –        –        –        –        –        –        –        –
           35    –        –        –        –        –        –        –        –        –
           45    –        –        –        –        –        –        –        –        –

Table 7.33: The loss $L_0$ in £'000 above which adverse selection cannot occur. Initial wealth $W$ = £100,000.

                 Utility function
Gender   Age     I(0.5)   Log     I(−8)   N(9e-5)
Male     25      82.3     51.8    7.1     7.2
         35      92.3     62.6    9.2     9.5
         45      89.9     60.4    8.9     9.2
Female   25      8.9      4.5     0.5     0.5
         35      25.3     13.3    1.5     1.6
         45      43.4     23.9    2.9     2.9

The general pattern of threshold relative risks for males given in Table 7.31 is similar to that in Chapter 6; what is of most interest are their absolute values, since we have tried to suggest plausible models for both the risk model and the utility functions.
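
As a check on how the threshold premium $P^\dagger$ and the limiting loss $L_0$ relate to Equation 7.71, the following minimal sketch computes both. It rests on several assumptions that are not restated in this section: the utility functions are taken to be $U(w) = w^\lambda/\lambda$ for I($\lambda$), $U(w) = \log w$ for Log, and $U(w) = -\exp(-Aw)$ for N($A$), with wealth in pounds; $P^\dagger$ is assumed to decrease as $L$ increases; and $L_0$ is found as the loss at which $P^\dagger$ equals $P_0$, which presumes that $P_{ge}(k)$ tends to $P_0$ as $k$ grows (the case discussed in Section 7.4). The function names are hypothetical.

```python
import math

W = 100_000.0  # initial wealth in pounds, as in the text

# Assumed functional forms of the utility functions (see lead-in).
UTILITIES = {
    "I(0.5)": lambda w: (w ** 0.5) / 0.5,
    "Log": math.log,
    "I(-8)": lambda w: (w ** -8.0) / -8.0,
    "N(9e-5)": lambda w: -math.exp(-9e-5 * w),
}


def threshold_premium(U, P_bar, L, W=W):
    """Solve Z(P) = 0 in Equation 7.71 for P; Z is linear in P."""
    return (U(W) - U(W - P_bar * L)) / (U(W) - U(W - L))


def limiting_loss(U, P_bar, P_floor, W=W, tol=1e-6):
    """Bisection for the loss at which the threshold premium equals P_floor.

    Assumes the threshold premium decreases as L increases and that it
    crosses P_floor somewhere between L = 1 and L = W - 1.
    """
    lo, hi = 1.0, W - 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if threshold_premium(U, P_bar, mid, W) > P_floor:
            lo = mid  # threshold still above the floor: increase the loss
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    P_bar = 0.013787   # Table 7.27, males aged 25
    P_0 = 0.009821     # Table 7.30, males aged 25
    for name in ("I(0.5)", "Log"):
        U = UTILITIES[name]
        p = threshold_premium(U, P_bar, 10_000.0)
        L0 = limiting_loss(U, P_bar, P_0)
        print(f"{name}: P_dagger(L = 10,000) = {p:.6f}, "
              f"L0 = {L0 / 1000.0:.1f} thousand")
```

The printed values can be compared with the entries for males aged 25 in Tables 7.28 and 7.33. Because $P^\dagger$ is a ratio of utility differences, it is unchanged by positive affine transformations of $U$, so the particular scaling assumed for each utility function does not affect the result.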

(a) For iso-elastic utility with $\lambda = -8$ and negative exponential utility with parameter $A = 9 \times 10^{-5}$, we find no evidence at all of adverse selection.

(b) For all utility functions and at all loss levels, if adverse selection can appear, it does so at higher levels of relative risk than under the two-state model. This is because the impact of the gene and environment on heart attack risk is diluted by the presence of the other CIs. Only for the lowest levels of loss are these relative risks in the range that might be typical of relatively common multifactorial disorders; by definition, we do not expect studies like UK Biobank to lead to the discovery of hitherto unknown high risk genotypes.

(c) When adverse selection can appear, the relative risk threshold first decreases and then increases with age. This is because among CIs the importance of heart attack peaks at around age 45 as can be seen from Figure 7.27.

The threshold relative risks for females are given in Table 7.32. We observe the following:

(a) The threshold relative risks are much higher than those for males, in all cases. This is because heart attacks form a smaller proportion of all CIs for females, so a larger increase in heart attack risk is needed to trigger adverse selection.

(b) As for males, at levels of absolute and relative risk-aversion that we regard as most plausible (consistent with the Bank of Italy study) we find no evidence that adverse selection is likely.

(c) In contrast to males, the threshold relative risks decrease with age. The reason is clear from Figure 7.27; for females the relative importance of heart attack increases with age.

(d) Adverse selection appears to be possible only for: (i) smaller losses; and (ii) extremely low levels of risk aversion.

7.4 High Relative Risks

In Section 6.5, we considered relative risks that increased without limit, for the simple 2-state insurance model.
We saw that, even in this extreme case, if stratum ge was large enough, adverse selection would not appear. In this section, we consider high relative risks (of heart attack) in the CI insurance model.

We assume the heart attack rates in the general population $\lambda_{12}(x)$ are fixed at their estimated values (Gutiérrez and Macdonald (2003)). From Equation 7.70 we obtain:

\[ 1 - \bar{q} = 1 - \sum_s w_s \left[ 1 - (1 - \bar{q})^{c(k) k_s} \right] = w_{ge} (1 - \bar{q})^{c(k)} + \sum_{s \neq ge} w_s (1 - \bar{q})^{c(k) k_s}. \tag{7.72} \]

Differentiation shows the right-hand side to be a decreasing function of $c$ and of each $k_s$ ($s \neq ge$), all other quantities held constant in each case. Also, if $c = 1$ the right-hand side is less than $(1 - \bar{q})$ while if $c = 0$ it is greater than $(1 - \bar{q})$. Hence, as we increase the $k_s$ without limit, $c$ must decrease, and being bounded below it must have a limit. The limit could be zero or non-zero. We can easily see that if $c$ has a non-zero limit (necessarily positive) then the last term on the right-hand side of Equation 7.72 vanishes and the limit must be:

\[ \lim_{\substack{k_s \to \infty \\ s \neq ge}} c(k) = 1 - \frac{\log w_{ge}}{\log(1 - \bar{q})} \tag{7.73} \]

which in turn implies $(1 - \bar{q}) < w_{ge}$. On the other hand if $(1 - \bar{q}) > w_{ge}$, then $c$ cannot have a non-zero limit, so the equation:

\[ \lim_{\substack{k_s \to \infty \\ s \neq ge}} \sum_{s \neq ge} w_s (1 - \bar{q})^{c(k) k_s} = (1 - \bar{q}) - w_{ge} \tag{7.74} \]

holds. Since the left-hand side is finite, at least one of the products $c k_s$ tends to a finite limit as the $k_s \to \infty$.

However, we have not specified here how the quantities $k_s$ ($s \neq ge$) jointly approach infinity, so the behaviour of $c$ is not easy to analyse in general. It is greatly simplified if the $k_s$ are simple functions of a single parameter $k$, which is the case in our assumed epidemiological model (in which case we again make explicit the dependence of $c$ by writing $c(k)$).
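
Equation 7.72 can be solved for $c$ numerically for any given set of relative risks, which gives a direct way to observe the limiting behaviour described above. The following is a minimal sketch using bisection; it is an illustration under stated assumptions rather than the method used to produce the tables. It assumes all $k_s \geq 1$, so that the right-hand side of Equation 7.72 minus $(1 - \bar{q})$ is positive at $c = 0$ and non-positive at $c = 1$. The function names are hypothetical, and the value $\bar{q} = 0.059959$ is the one reported in Table 7.34 for males aged 45.

```python
import math


def solve_c(q_bar, weights, rel_risks, tol=1e-12):
    """Solve Equation 7.72 for c by bisection on [0, 1].

    q_bar:     population probability of a heart attack within the term
    weights:   stratum proportions w_s (summing to 1), stratum ge included
    rel_risks: relative risks k_s with respect to stratum ge (k_ge = 1),
               all assumed to be at least 1
    """
    log_surv = math.log1p(-q_bar)  # log(1 - q_bar), a negative number

    def excess(c):
        # Right-hand side of Equation 7.72 minus (1 - q_bar); decreasing in c.
        total = sum(w * math.exp(c * k * log_surv)
                    for w, k in zip(weights, rel_risks))
        return total - (1.0 - q_bar)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def additive_strata(omega, k):
    """Weights and relative risks for strata ge, gE, Ge, GE (additive model)."""
    weights = [omega ** 2, omega * (1 - omega),
               omega * (1 - omega), (1 - omega) ** 2]
    rel_risks = [1.0, k, k, 2.0 * k - 1.0]
    return weights, rel_risks


if __name__ == "__main__":
    q_bar = 0.059959  # Table 7.34, males aged 45, 15-year term
    for omega in (0.9, 0.99):
        for k in (1.0, 2.0, 10.0, 1e3, 1e6):
            w, ks = additive_strata(omega, k)
            print(f"omega={omega}, k={k:g}: c={solve_c(q_bar, w, ks):.6f}")
        if omega ** 2 > 1.0 - q_bar:
            limit = 1.0 - math.log(omega ** 2) / math.log1p(-q_bar)
            print(f"  Equation 7.73 limit for omega={omega}: {limit:.6f}")
```

With $\omega = 0.9$ we have $w_{ge} = 0.81 < 1 - \bar{q}$, so the computed $c$ should drift towards zero as $k$ grows, whereas with $\omega = 0.99$ we have $w_{ge} > 1 - \bar{q}$ and $c$ should settle near the non-zero value given by Equation 7.73.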
For example, under an additive model with symmetry between genetic and environmental risks, Equation 7.72 can be written as:

\[ 1 - \bar{q} = \omega^2 (1 - \bar{q})^{c(k)} + 2\omega(1 - \omega)(1 - \bar{q})^{c(k)k} + (1 - \omega)^2 (1 - \bar{q})^{c(k)(2k-1)} = (1 - \bar{q})^{c(k)} \left[ \omega + (1 - \omega)(1 - \bar{q})^{c(k)(k-1)} \right]^2 \tag{7.75} \]

therefore:

\[ k = 1 + \frac{\log\left[ (1 - \bar{q})^{(1 - c(k))/2} - \omega \right] - \log(1 - \omega)}{c(k) \log(1 - \bar{q})}. \tag{7.76} \]

If $\omega^2 > (1 - \bar{q})$ then as $k \to \infty$, the limiting value of $c(k)$ is non-zero. Otherwise, when $\omega^2 < (1 - \bar{q})$, $c(k) \to 0$, and Equation 7.76 yields the finite limiting value:

\[ \lim_{k \to \infty} c(k)\,k = \frac{\log\left[ (1 - \bar{q})^{1/2} - \omega \right] - \log(1 - \omega)}{\log(1 - \bar{q})}. \tag{7.77} \]

So, in summary:

\[ \lim_{k \to \infty} c(k) = \begin{cases} 0 & \text{if } w_{ge} \leq (1 - \bar{q}) \\[4pt] 1 - \dfrac{\log w_{ge}}{\log(1 - \bar{q})} & \text{if } w_{ge} > (1 - \bar{q}). \end{cases} \tag{7.78} \]

Table 7.34: $\bar{q}$, the probability that a healthy person aged $x$ has a heart attack before age $x + t$, for policy duration $t = 15$ years.

Age    Male        Female
25     0.004743    0.000541
35     0.021454    0.004299
45     0.059959    0.017616

We want to find out if the baseline stratum ge can ever be large enough that adverse selection will never appear, no matter how large $k$ becomes. Hence we want to understand the behaviour of $\lim_{k\to\infty} P_{ge}(k)$ as a function of $w_{ge}$. Equation 7.78 shows that we must treat separately the cases $w_{ge} \leq (1 - \bar{q})$ and $w_{ge} > (1 - \bar{q})$. Values of $\bar{q}$ are given in Table 7.34. (Note that $P_0 + \bar{q} \neq \bar{P}$, because in a competing risks model removing one cause of decrement increases the probabilities of the other decrements occurring.)

(a) If $P_0 > P^\dagger$ the result is trivial, since $\lim_{k\to\infty} P_{ge}(k) \geq P_0$ for any value of $w_{ge}$, and adverse selection can never occur.

(b) If $P_0 < P^\dagger$ adverse selection will occur if $w_{ge} \leq (1 - \bar{q})$, since then $\lim_{k\to\infty} P_{ge}(k) = P_0$.

(c) The non-trivial case is $P_0 < P^\dagger$ and $w_{ge} > (1 - \bar{q})$, since then $\lim_{k\to\infty} P_{ge}(k) > P_0$. We can show that $\lim_{k\to\infty} P_{ge}(k)$ is an increasing function of $w_{ge}$ in this range, because the limit of the heart attack probability $\lim_{k\to\infty} q_{ge}(k)$ is increasing in $w_{ge}$ (use Equation 7.73 to write:

\[ \lim_{k \to \infty} q_{ge}(k) = \lim_{k \to \infty} \left[ 1 - (1 - \bar{q})^{c(k)} \right] = 1 - \frac{(1 - \bar{q})}{w_{ge}} \tag{7.79} \]

and differentiate).
The function $\lim_{k\to\infty} P_{ge}(k)$ is continuous and increases from $P_0$ to $\bar{P}$ as $w_{ge}$ increases from $(1 - \bar{q})$ to 1, the upper limit being attained when all the strata have collapsed into one, and $c = 1$. Since $P^\dagger < \bar{P}$ for any concave utility function, the intermediate value theorem guarantees that there exists a unique value of $w_{ge}$ such that $\lim_{k\to\infty} P_{ge}(k) = P^\dagger$; that is, such that adverse selection can never appear if $w_{ge}$ exceeds this value.

Tables 7.35 and 7.36 give the threshold values of $\omega = w_{ge}^{1/2}$ above which no adverse selection takes place, in the additive model with gene-environment symmetry, for males and females respectively. Missing values indicate that adverse selection will never appear. When it is possible, the threshold value of $\omega$ ranges from 0.970 to 1 for males and 0.992 to 0.999 for females. As the relative risks in Tables 7.31 and 7.32 are based on $\omega = 0.9$, this explains the missing values in those tables.

Table 7.35: The proportions $\omega$ exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk $k$, using different utility functions, for males purchasing CI insurance.

Utility          Loss L in £'000
function   Age   10      20      30      40      50      60      70      80      90
I(0.5)     25    1.000   1.000   0.999   0.999   0.999   0.998   0.998   0.998   –
           35    0.999   0.998   0.998   0.997   0.996   0.995   0.993   0.992   0.990
           45    0.998   0.995   0.993   0.990   0.987   0.984   0.980   0.975   –
Log        25    1.000   0.999   0.999   0.998   0.998   –       –       –       –
           35    0.999   0.997   0.995   0.994   0.992   0.990   –       –       –
           45    0.996   0.991   0.986   0.981   0.976   0.970   –       –       –
I(−8)      25    –       –       –       –       –       –       –       –       –
           35    –       –       –       –       –       –       –       –       –
           45    –       –       –       –       –       –       –       –       –
N(9e-5)    25    –       –       –       –       –       –       –       –       –
           35    –       –       –       –       –       –       –       –       –
           45    –       –       –       –       –       –       –       –       –

This pattern is quite unexpected. If adverse selection can occur, then a large enough baseline stratum does confer immunity from it, but it has to be very large indeed, all but a few percent of the population.
But once the threshold is crossed, adverse selection cannot appear at all, even if very few people are in the baseline stratum. This had no counterpart in the 2-state model, and it is caused by the presence of substantial other risks not affected by the gene-environment variants.

Table 7.36: The proportions $\omega$ exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk $k$, using different utility functions, for females purchasing CI insurance.

Utility          Loss L in £'000
function   Age   10      20      30      40      50      60      70      80      90
I(0.5)     25    –       –       –       –       –       –       –       –       –
           35    0.999   0.998   –       –       –       –       –       –       –
           45    0.998   0.996   0.994   0.992   –       –       –       –       –
Log        25    –       –       –       –       –       –       –       –       –
           35    0.998   –       –       –       –       –       –       –       –
           45    0.996   0.993   –       –       –       –       –       –       –
I(−8)      25    –       –       –       –       –       –       –       –       –
           35    –       –       –       –       –       –       –       –       –
           45    –       –       –       –       –       –       –       –       –
N(9e-5)    25    –       –       –       –       –       –       –       –       –
           35    –       –       –       –       –       –       –       –       –
           45    –       –       –       –       –       –       –       –       –

7.5 Conclusions

Until now, genetical research on information asymmetry and adverse selection has taken one of two routes: models of single-gene disorders, and work on the economic welfare effects of genetic testing. Single-gene disorders, by their very nature, are often severe and it is a reasonable first approximation to assume that private information about risk makes insurance purchase highly likely. This is not so for multifactorial disorders, where adverse gene-environment interactions are expected to be much more common and lead to more modest risk differences. On the other hand, the economic welfare approach concentrates primarily on efficiency losses in the insurance market, and may be less concerned with the epidemiology. In this chapter, we have represented multifactorial disorders using standard epidemiological models and analysed circumstances leading to adverse selection, taking economic factors into account in a simple way through expected utility.

Logarithmic utility, although popular, may not reflect all risk preferences very well.
In particular, Eisenhauer and Ventura (2003) showed that consumers' risk-aversion is normally much greater than implied by logarithmic utility. We therefore used utilities with both realistic and traditional risk-aversion coefficients to illustrate our results.

We used a simple 2 × 2 gene-environment interaction model, assuming that information on status within the model was available only to the consumers and not to the insurer. Competition leads insurers to charge actuarially fair premiums, based on expected losses given the information they have. Adverse selection will not occur as long as members of the least risky stratum (who know their status) can still increase their expected utility by insuring at the average price.

First, we studied a simple 2-state insurance model, with constant relative risks in different risk strata defined by the gene-environment model and sex. We found that adverse selection does not appear unless purchasers have relatively low risk aversion (compared with what we think to be a plausible parameterisation) and insure only a small proportion of their wealth; or unless the elevated risks implied by genetic information are implausibly high, bearing in mind the nature of multifactorial risk. In many cases adverse selection is impossible if the low-risk stratum is large enough, these levels being quite compatible with plausible multifactorial disorders.

We applied the same gene-environment interaction model, assumed to affect the risk of heart attacks, to CI insurance. As heart-attack risk is just part of the risk of all CIs, the impact of the gene-environment risk factor was diluted, compared with the 2-state insurance model where the total risk was influenced. Our results showed complete absence of adverse selection at realistic risk-aversion levels, irrespective of the stratum-specific risks.
Moreover, the existence of risks other than of heart attack, and the constraint of differential heart-attack risk to be consistent with the average population risk, introduced a threshold effect absent from the 2-state model. When adverse selection was possible at all (low risk aversion, low loss ratios) only an unfeasibly high proportion of the population in the low-risk stratum would avoid it, but when the threshold was crossed adverse selection vanished no matter what the size of the low-risk stratum.

The results from both 2-state and CI insurance models suggest that in circumstances that are plausibly realistic, private genetic information, relating to multifactorial risks, that is available only to customers does not lead to adverse selection. This conclusion is strongest in the more realistic CI insurance model.

We have not considered what might happen if insurers were allowed access to this genetic information. The opportunity would then exist to underwrite using that information. If one believed that social policy is best served by solidarity, the important question is whether insurers would find it worthwhile to use the genetic information. Further research would be useful, to investigate the costs of acquiring and interpreting genetic information relating to common diseases, compared with the benefits in terms of possibly more accurate risk classification, in both cases in the context of multifactorial risk.

Chapter 8
Conclusions

In Chapter 1, we set out our broad objectives for the thesis: to analyse how gene-environment interactions in multifactorial disorders might affect current underwriting practices of the insurance industry. Researchers have found that severe single-gene disorders, due to their rarity, do not have a significant impact on insurance premiums. Equivalently, the extent of adverse selection was found to be minimal.
Multifactorial disorders, on the other hand, are much more common and any medical development or breakthrough in this area is likely to have a major impact on the insurance industry. With the setting up of large-scale cohort studies, like the UK Biobank project, specifically to concentrate on multifactorial disorders, this has become a real possibility. Given this backdrop, we tackled two fundamental questions in this thesis:

(a) In the next 5–10 years, as results start emerging from UK Biobank, what will be the impact of these on the insurance industry?

(b) Given the risk-averse nature of insurance purchasers, at what levels of gene-environment interaction might an insurer face a realistic risk of adverse selection?

8.1 UK Biobank Simulation Study

In the first half of the thesis, we examined question (a). We chose heart attack as the disorder of interest and hypothecated a simple 2×2 gene-environment interaction for the risk of heart attack. This segregated the study population into four strata with varying risk-profiles based on the impact of the respective genes and environmental factors. As the rates of onset of heart attack are significantly different for males and females, we analysed the results separately for each sex.

Based on this model, we randomly simulated 500,000 life histories to generate data similar to what is expected to emerge out of the UK Biobank project. An epidemiological analysis was then carried out on the simulated data using case-control studies. The results, thus obtained, were then used in an actuarial model to calculate CI insurance premium rates for all strata.

This led us to the question: how reliable are these estimates of premium rates, on the basis of which insurers might justify discriminating between individuals with different genes and with exposure to different environmental factors?
In particular, GAIC and other interested parties would want insurers to provide factual evidence and rigorously demonstrate the justification of underwriting strategies based on genetic information. So, we looked at the empirical distributions of the estimated premium rates generated by simulating many replications of UK Biobank. We noted that the strata-specific premium rates, as a proportion of the baseline premiums, are uncorrelated and the extent of overlap of the empirical densities provided a measure of reliability. The main conclusions from the analysis are as follows:

(a) Our Base scenario assumptions reflected fairly common adverse genetic and environmental exposures with modest penetrances. This is what we would expect for most common multifactorial disorders. We found that, if epidemiologists opted for an extensive study, which included all heart attack cases in conjunction with a 1:5 matching strategy, reliable discrimination could be achieved. However, this is also an expensive option and case-control studies with such large numbers of cases and controls may not be economically viable. Case-control studies with a few thousand cases coupled with a modest 1:1 matching strategy, although realistic, quickly diminished the reliability of the estimates and thus the power to discriminate.

(b) We also analysed the results by varying our assumptions of the frequencies and penetrances of the adverse exposures. The reliability of the estimates reduced substantially when the proportion of adverse traits in the study population was halved. Case-control studies also became increasingly harder to carry out as the number of cases decreased and suitable matching controls with adverse traits became rarer. Reduced penetrances had a similar impact with reduced ability to discriminate between different risk-categories; the problems being more acute for case-control studies with fewer cases and controls.

(c) The results were similar for both males and females when the number of cases used in the case-control study was fixed in advance. If all cases were to be included in the study, the estimates of premium rates for females were less reliable than those for males. This is because heart attacks are rarer for females and as a result total numbers of cases were fewer.

To summarise, we found that, unless the “adverse” genetic and environmental factors are abundant or have significant penetrances, the inherent variability of estimates obtained from case-control studies would make it difficult for insurers to justify charging different premiums for different risk-groups. This result should bring comfort to the regulators and other groups who are concerned about insurers using genetic information to discriminate against the unfortunate few.

While carrying out our analysis we have made a number of simplifying assumptions to keep the problem tractable. Further research needs to be carried out to analyse the implications of relaxing these assumptions. In particular:

(a) We have assumed a 2×2 gene-environment interaction, which is the simplest of multifactorial models. However, most common disorders are likely to involve higher order gene-environment interaction with complex interplay between multiple genes and environmental factors. Extending the simple 2×2 model to general higher order interactions should produce interesting results.

(b) Caution should also be exercised in interpreting the results because they are based on some idealised assumptions. In particular, we have ignored the problems of model mis-specification altogether. In reality, there are a number of places where this can go wrong. As UK Biobank is essentially an unrepeatable exercise, epidemiologists will have access to a single set of observations, based on which they will propose their models. It is thus highly unlikely that the model will be “accurate”.
A poor choice of epidemiological model would lead to erroneous results and will inevitably have additional knock-on effects for studies based on these results. Mis-specification can also occur when an actuary tries to develop his or her own model based on the results published by the epidemiologists. The true implications of these need to be investigated further.

8.2 Adverse Selection Issues

In the second half of the thesis, we tackled the adverse selection issues in the context of multifactorial disorders. In many countries, due to regulations or agreed moratoria, genetic information is treated as private and insurers do not have access to such information. This asymmetry of information can then lead to adverse selection if individuals in the lowest risk-category find the average premium, charged by the insurer, unacceptably high. Of course, this would depend on a number of factors including the degree of risk-aversion of these individuals. Our objective was to analyse the different factors, and the levels of these, which would lead to adverse selection.

First, we assumed a 2×2 gene-environment interaction in a simple 2-state insurance model. The factors of interest were:

(a) the baseline risk;

(b) the amount of loss insured as a proportion of total wealth;

(c) the proportion of individuals in the lowest risk-category; and

(d) the degree of absolute and relative risk-aversion.

For each of these factors, we analysed the levels of relative risks required to trigger adverse selection. Our observations were:

(a) The higher the baseline risk, the lower is the level of relative risks of higher risk strata at which adverse selection appears.

(b) As the amount of loss insured increased as a proportion of total wealth, higher relative risks were required to trigger adverse selection.

(c) The more risk-averse the individuals, the higher are the relative risks required for adverse selection.
(d) If the proportion of individuals in the lowest risk-category is high, relative risks in other categories need to be large to move average premium rates so as to trigger adverse selection. In fact, we found that if the lowest risk-category is large enough, it is possible to achieve full immunity from adverse selection. Of course, the levels at which this is attained depended on all other factors.

We then extended our results to a realistic example of a CI insurance model. As in the UK Biobank simulation study, we hypothesised a 2×2 gene-environment interaction on heart attack risk. We assumed that other illnesses covered under a CI insurance contract remained unaffected by these genes and environmental factors. The results from this model were along similar lines to those obtained from the 2-state insurance model. In particular:

(a) As the rates of onset of heart attack are different for males and females, we analysed the impact separately. For females, the relative risks required for adverse selection were substantially higher than those for males. This is because heart attacks form only a small proportion of all CIs for females.

(b) The presence of other CIs diluted the impact of gene-environment interactions on heart attack. As a result, the relative risks required for adverse selection were generally much higher than those observed for the 2-state insurance model. In fact, for individuals with empirical estimates of risk-aversion, adverse selection did not appear at all.

(c) The existence of risks other than that of heart attack introduced a floor below which CI insurance premiums could not fall even when risks of heart attack were non-existent. This implied that when adverse selection was possible, immunity from adverse selection was possible only at a very high proportion of the population in the lowest risk-category. Otherwise, adverse selection does not appear at all.
Results from both the 2-state insurance and CI insurance models confirm the key message that under realistic assumptions, private genetic information does not lead to adverse selection. There are further research opportunities in a number of areas:

(a) As pointed out for the UK Biobank simulation model, extending the simple 2×2 gene-environment model to higher order interactions might also produce interesting results on adverse selection issues.

(b) An insurer’s decision to use genetic test results, if permitted, will depend on a number of issues including the actual cost of these tests. If the costs are as high as they are now, it might not make economic sense to use genetic tests for underwriting purposes. However, as tests become cheaper in future, the balance might tilt in the other direction. It might be of interest to find out the levels of cost at which genetic testing becomes an affordable underwriting tool.

(c) In our analysis, we made a simplifying assumption that all individuals wish to insure the same amount of loss, as a proportion of wealth, irrespective of their risk-profiles. This assumption can be relaxed, as Hoy and Polborn (2000) showed that under certain assumptions, the appetite for cover increases with risk. The techniques developed in this thesis can be extended to incorporate these assumptions and analyse the situations where high-risk individuals could opt for increased cover. Further research in this area might produce interesting results.

Appendix A Epidemiology

A.1 Introduction

Epidemiology is the study of diseases which tries to answer two fundamental questions:

(a) What causes a disease?

(b) Who is affected by a disease?

There might be a number of factors whose interplay manifests itself in the form of a disease. With the advent of genetic knowledge, researchers have found that diseases can be caused by a genetic disorder.
In other words, an individual with a particular gene might have a higher or lower probability of contracting a disease. This fact does not, however, diminish the role played by the environment on disease susceptibility. For example, it is well documented that there are more smokers than non-smokers among lung cancer patients. These factors, genetic and environmental, which precipitate a disease are called risk factors and form the primary subject matter of an epidemiological investigation.

The second question tries to ascertain the distribution of a disease. Instead of looking at the population as a whole, it can be stratified into groups, the analysis of which may show variability in disease susceptibility by strata. In an epidemiological study, the usual stratifications are based on age, sex, social class, marital status, racial group, occupation and geographical location. However, it is vital not to overlook any other form of stratification which could explain the variation better.

To answer the questions posed above, epidemiologists collect, analyse and interpret data collected from groups of individuals. The results thus obtained, and the conclusions drawn from them, apply directly to the individuals from whom the data is collected. However, it is natural to ask whether the results and conclusions can be extended to a wider group. Of course, the ultimate goal of an epidemiological study is to obtain results which can then be extended and be held to be valid for the entire human population.

In practice, epidemiological investigations commence with an objective to obtain results for a target population. For example, the UK Biobank protocol clearly states that its objective is to investigate the risk of common multifactorial disorders of adult life. So the target population here is the whole UK general population and, with a wider focus, the entire human population. Collecting data from the whole target population may not always be feasible.
So normally data is collected from a representative subset of the target population, the study population. The UK Biobank project aims to collect data from a large cross-section of individuals, at least 500,000 men and women, from the general population of the United Kingdom.

Once the study population is identified, the focus shifts to the collection of appropriate data for analysis. Ideally, each individual within the study population should be followed up through time. Every instance of disease should be recorded along with data on plausible risk factors. Such a detailed study, sometimes called a cohort study, can then provide direct information on the sequence of events demonstrating causality. Moreover, being so detailed, cohort studies can analyse many diseases simultaneously. However, cohort studies are often very expensive and time consuming. Also, they are not ideal for studying rare diseases as they would require either a very large study population or a very long time span.

For studying rare diseases, resources can be used more efficiently by employing case-control studies. Unlike cohort studies, where we follow up every individual within the study population prospectively for the entire duration of the study period, in case-control studies individuals are chosen at the end of the study period according to their disease status. This is why case-control studies are retrospective studies.

In case-control studies, the first step is to identify a number of cases, subjects with the disease under consideration. The next step is to select a number of controls, subjects who are free from the disease. Controls should be a representative sample of those individuals in the study population who do not have the disease, but who had the same chance as a case of being classified as a case had they become diseased. This is best achieved by matching at the design stage. Matching will be discussed in detail in a later section.
While selecting cases and controls, care needs to be taken that the definitions of both cases and controls are precise and strictly adhered to during the course of the investigation. The other important consideration is the possibility of bias, which may arise if the chance of having a particular risk factor among the chosen cases is different from that among all those with the disease in the study population. The same consideration for bias needs to be given for controls.

The data from the cases and the controls are then analysed to determine the effect of different risk factors on these two groups. As is evident, case-control studies are quicker and cheaper than cohort studies. The resources are also focused on studying the more interesting subjects, the cases, in great detail, which is all the more crucial for rare diseases. In the UK Biobank project, it is envisaged that analysis will take the form of case-control studies nested within the study population. A schematic diagram for a case-control study is given in Figure A.28.

[Figure A.28: A schematic diagram of a case-control study. Labels: Target Population, Study Population, All, Diseased, Healthy, Cases, Controls.]

A.2 Measuring risks

Before we start analysing the data, let us clarify what we are trying to measure. In simple terms, the goal is to measure the risk of a disease. So to start with we need a formal definition of risk. The risk of a disease can be defined as the probability of an individual becoming newly diseased given that the individual has the particular attribute or risk-factor in question.

We will introduce some notation here. Let $S(t)$ be a stochastic process which records an individual’s state at time $t$. Let us also denote by $P_{ij}(s,t)$ the conditional probability that the study subject is in state $j$ at time $t$, given that the subject was in state $i$ at time $s$. In mathematical notation:

\[ P_{ij}(s,t) = \mathrm{Prob}[S(t) = j \mid S(s) = i]. \tag{A.80} \]

The conditional probability defined above is also known as a transition probability.
Using the transition probabilities, we can now define the transition intensity or hazard rate, $\lambda_{ij}(t)$, as the instantaneous rate of change of probability at time $t$ of moving from state $i$ to state $j$, given that the subject is in state $i$ at time $t$, i.e.,

\[ \lambda_{ij}(t) = \lim_{dt \to 0} \frac{P_{ij}(t,t+dt) - P_{ij}(t,t)}{dt}, \tag{A.81} \]

which can also be written as:

\[ P_{ij}(t,t+dt) = P_{ij}(t,t) + \lambda_{ij}(t) \times dt + o(dt). \tag{A.82} \]

The above definition can be simplified further by noting that a subject cannot be in two different states at any one instant of time, i.e.,

\[ P_{ij}(t,t) = \begin{cases} 0 & \text{if } i \neq j, \\ 1 & \text{if } i = j. \end{cases} \]

Using the fact that $\sum_j P_{ij}(s,t) = 1$ for all $t \geq s$, we can derive a useful relationship between the transition intensities. If we sum both sides of Equation A.82 over $j$ we get,

\[ \sum_j P_{ij}(t,t+dt) = \sum_j P_{ij}(t,t) + \sum_j \lambda_{ij}(t) \times dt + o(dt). \tag{A.83} \]

Since both sums of probabilities equal 1, dividing by $dt$ and letting $dt \to 0$ leads to,

\[ \sum_j \lambda_{ij}(t) = 0, \quad \text{or equivalently,} \quad \lambda_{ii}(t) = -\sum_{j \neq i} \lambda_{ij}(t). \tag{A.84} \]

Before proceeding further, let us work our way through a simple model with two states, Healthy and Diseased, where the names of the states refer to a particular disease. Let us assume that an individual always starts off healthy. During the course of the investigation, the individual can either stay healthy or contract the disease and move on to the Diseased state. Once in the Diseased state, we will assume that the individual cannot turn healthy again. Figure A.29 gives a pictorial representation of this 2-state model.

[Figure A.29: A 2-state model. State 1 = Healthy, state 2 = Diseased, with a single transition from state 1 to state 2 at intensity $\lambda_{12}(t)$.]

The transition intensity, $\lambda_{12}(t)$, gives the instantaneous rate of change of probability at time $t$ of being diseased for a subject who is healthy up to time $t$. Let us now derive a direct relationship between $P_{12}(\cdot)$ and $\lambda_{12}(\cdot)$ as follows. Using basic probability theory,

\[ P_{12}(s,t+dt) = P_{11}(s,t) P_{12}(t,t+dt) + P_{12}(s,t) P_{22}(t,t+dt). \tag{A.85} \]

Using Equation A.82 and the fact that a subject cannot return to the Healthy state, i.e.
$P_{22}(t,t+dt) = 1$, we have:

\[ P_{12}(s,t+dt) = P_{11}(s,t) \left( P_{12}(t,t) + \lambda_{12}(t)\,dt + o(dt) \right) + P_{12}(s,t). \tag{A.86} \]

Noting that $P_{11}(s,t) = 1 - P_{12}(s,t)$ and $P_{12}(t,t) = 0$, we can rewrite the above equation as follows:

\[ P_{12}(s,t+dt) - P_{12}(s,t) = \left( 1 - P_{12}(s,t) \right) \times \left( \lambda_{12}(t) \times dt + o(dt) \right). \tag{A.87} \]

Dividing by $dt$ and letting $dt \to 0$, this leads to

\[ \frac{1}{1 - P_{12}(s,t)} \times \frac{d}{dt} P_{12}(s,t) = \lambda_{12}(t), \tag{A.88} \]

which can be solved, noting the boundary condition $P_{12}(s,s) = 0$, to give:

\[ P_{12}(s,t) = 1 - \exp\left( -\int_s^t \lambda_{12}(u)\,du \right). \tag{A.89} \]

If the disease is rare, or the time period $t - s$ is short, we can use a Taylor series expansion to obtain the following approximate relationship between $P_{12}(\cdot)$ and $\lambda_{12}(\cdot)$:

\[ P_{12}(s,t) \approx \int_s^t \lambda_{12}(u)\,du. \tag{A.90} \]

Moving on to a general multiple-state model, we can derive similar relationships between transition probabilities and transition intensities. We will start off from a generalised version of Equation A.85:

\[ P_{ij}(s,t+dt) = \sum_k P_{ik}(s,t) \times P_{kj}(t,t+dt). \tag{A.91} \]

Now using Equation A.82, as before, we have,

\[ P_{ij}(s,t+dt) = \sum_k P_{ik}(s,t) \times \left( P_{kj}(t,t) + \lambda_{kj}(t) \times dt + o(dt) \right), \tag{A.92} \]

which yields,

\[ \frac{d}{dt} P_{ij}(s,t) = \sum_k P_{ik}(s,t) \times \lambda_{kj}(t). \tag{A.93} \]

We will discuss ways to solve these differential equations in Appendix B.1.

A.3 Models of Disease Association

In the previous section, we have formulated the risk of a disease through transition probabilities and transition intensities. Now we will use these concepts to develop models for measuring the effects of risk factors on a particular disease.

A risk-factor can have a number of levels. Suppose we are interested in investigating the effect of smoking on the risk of lung cancer. Smoking habits can be classified according to the average number of cigarettes smoked per day. The higher the number, the higher is the level of exposure to the risk-factor of smoking. Investigations can then be performed to determine how the risk of lung cancer differs from one level of the risk-factor to another.
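Returning briefly to the 2-state model of Section A.2, the quality of the rare-disease approximation in Equation A.90 relative to the exact expression in Equation A.89 can be checked numerically. The following is only a minimal sketch: the Gompertz-type intensity and its parameters in `example_hazard` are hypothetical and chosen purely for illustration, and the integral is approximated with a simple trapezoidal rule.

```python
import math


def integrated_hazard(hazard, s, t, steps=1000):
    """Trapezoidal approximation to the integral of hazard(u) over [s, t]."""
    h = (t - s) / steps
    total = 0.5 * (hazard(s) + hazard(t))
    for i in range(1, steps):
        total += hazard(s + i * h)
    return total * h


def p12_exact(hazard, s, t):
    """Equation A.89: 1 - exp(-integral of the intensity)."""
    return 1.0 - math.exp(-integrated_hazard(hazard, s, t))


def p12_rare(hazard, s, t):
    """Equation A.90: first-order (rare disease / short period) approximation."""
    return integrated_hazard(hazard, s, t)


def example_hazard(age):
    """Hypothetical Gompertz-type intensity; the parameters are made up."""
    return 0.0001 * math.exp(0.08 * age)


if __name__ == "__main__":
    # The gap between the two columns widens as the period lengthens.
    for start, end in [(30, 31), (30, 40), (30, 70)]:
        exact = p12_exact(example_hazard, start, end)
        approx = p12_rare(example_hazard, start, end)
        print(f"ages {start}-{end}: exact={exact:.5f}, approx={approx:.5f}")
```

Because $1 - e^{-x} \leq x$, the approximation always overstates the exact probability, and the discrepancy grows with the integrated intensity.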
In the simplest situation, we can have two levels of a risk-factor where an individual is either exposed to the factor or not. In the lung cancer example, people can be classified as smokers and non-smokers. Analysts will then investigate how smoking increases the risk of lung cancer. Here we will concentrate primarily on this binary set-up.

Initially we will develop models to study effects of one risk-factor at a time. To do this, care needs to be taken that the results are not distorted by the effects of other risk factors. One way to ensure this is to stratify the study population according to the levels of these other possible risk factors, and then analyse the effect of the risk-factor in question within each such stratum. Going back to the example of investigating the effect of smoking on lung cancer, suppose we believe that age is also a risk-factor. Following the strategy outlined above, the study population needs to be stratified according to age-groups. We then examine the effect of smoking within each such age-group.

Extending the notation from the previous section, let $\lambda_{ij}^{uk}$ denote the transition intensity from state $i$ to state $j$ for exposure status $u$ and stratum $k$. We will assume that $u$ can take values 1 or 0 depending on whether the individual is exposed to the risk-factor or not. One simple formulation is to study the excess risk or, more accurately, the excess rate of risk, which is

\[ b_{ij}^{k} = \lambda_{ij}^{1k} - \lambda_{ij}^{0k}. \tag{A.94} \]

In most studies, the risk-factor in question is not the sole contributor to the risk of the disease. Suppose that the total risk of the disease is the combined effect of the risk-factor and some other general factors. In Equation A.94, by subtracting the transition intensity of the unexposed group from that of the exposed group, we are trying to eliminate the effects of those other general factors. If our stratification is precise, then the difference represents the true effect of the risk-factor in question.
It should also remain stable from stratum to stratum. This leads to the following simplification of Equation A.94:

\[ b_{ij} = \lambda_{ij}^{1k} - \lambda_{ij}^{0k}, \quad \text{for all } k. \tag{A.95} \]

The model in Equation A.95 is also known as the additive model.

An alternative model to study disease association is to study the ratios of transition intensities instead of the differences. The formulation is as follows:

\[ r_{ij}^{k} = \frac{\lambda_{ij}^{1k}}{\lambda_{ij}^{0k}}. \tag{A.96} \]

The ratio in Equation A.96 is known as the relative risk. Again under the assumption that the effect of the general factors cancels out and the ratios remain stable from stratum to stratum, we get the multiplicative model:

\[ r_{ij} = \frac{\lambda_{ij}^{1k}}{\lambda_{ij}^{0k}}, \quad \text{for all } k. \tag{A.97} \]

There is an interesting relationship between the additive and the multiplicative model. If we take logarithms of both sides of Equation A.97, we get:

\[ \log r_{ij} = \log \lambda_{ij}^{1k} - \log \lambda_{ij}^{0k}. \tag{A.98} \]

Clearly, Equations A.95 and A.98 have the same structure, except for the scale. This is why multiplicative models are sometimes also called log-linear models.

Another important fact to note here is that although all the models above are specified in terms of the transition intensities, an equivalent formulation can be achieved through transition probabilities. The relationship defined in Equation A.89 can be used for this purpose.

| | Diseased | Healthy | Total |
|---|---|---|---|
| Exposed | $p \times P_{ij}^{1k}$ | $p \times Q_{ij}^{1k}$ | $p$ |
| Unexposed | $q \times P_{ij}^{0k}$ | $q \times Q_{ij}^{0k}$ | $q$ |
| Total | $p \times P_{ij}^{1k} + q \times P_{ij}^{0k}$ | $p \times Q_{ij}^{1k} + q \times Q_{ij}^{0k}$ | 1 |

Figure A.30: A 2 × 2 table for stratum $k$ with corresponding probabilities.

A.4 Relative Risk and Odds Ratio

In the previous section, we introduced the concept of relative risk. In epidemiological research, it has become the most frequently used measure for associating exposure with disease. Here we will develop the concept further by introducing odds ratios.
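As an aside on the additive and multiplicative models of Section A.3 (Equations A.95 and A.97), the sketch below checks which of the two measures is stable across strata for a given set of intensities. The stratum-specific intensities used here are hypothetical numbers invented for illustration, and the helper names are not taken from the thesis.

```python
import math


def excess_rates(exposed, unexposed):
    """Equation A.94: stratum-specific differences of transition intensities."""
    return [l1 - l0 for l1, l0 in zip(exposed, unexposed)]


def relative_risks(exposed, unexposed):
    """Equation A.96: stratum-specific ratios of transition intensities."""
    return [l1 / l0 for l1, l0 in zip(exposed, unexposed)]


def is_stable(values, rel_tol=1e-9):
    """True if all stratum-specific values agree (up to rounding)."""
    return all(math.isclose(v, values[0], rel_tol=rel_tol) for v in values)


# Hypothetical intensities for three age strata (unexposed group).
UNEXPOSED = [0.001, 0.002, 0.004]
# Exposed intensities built to follow a multiplicative model (ratio 3) ...
EXPOSED_MULT = [0.003, 0.006, 0.012]
# ... and an additive model (difference 0.002).
EXPOSED_ADD = [0.003, 0.004, 0.006]

if __name__ == "__main__":
    for label, exposed in [("multiplicative", EXPOSED_MULT), ("additive", EXPOSED_ADD)]:
        b = excess_rates(exposed, UNEXPOSED)
        r = relative_risks(exposed, UNEXPOSED)
        print(label, "| differences stable:", is_stable(b), "| ratios stable:", is_stable(r))
```

In the first data set only the ratios are constant across strata, in the second only the differences, which is the sense in which the two models differ "except for the scale".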
Using notation similar to the one used for transition intensities in the previous section, let us denote by $P_{ij}^{uk}$ the transition probability from state $i$ to state $j$, for an individual from stratum $k$ and exposure status $u$. If we assume that $p$ is the proportion of individuals exposed to the risk-factor in question, we can draw up the 2 × 2 table in Figure A.30 for stratum $k$, where $q = 1 - p$ and $Q_{ij}^{uk} = 1 - P_{ij}^{uk}$.

If the study period is reasonably short or the disease under consideration is relatively rare, we can use the approximation given in Equation A.90 to obtain the following relationship:

\[ \frac{\lambda_{ij}^{1k}}{\lambda_{ij}^{0k}} \approx \frac{P_{ij}^{1k}}{P_{ij}^{0k}}. \tag{A.99} \]

Using this, along with the definition of relative risk, we get:

\[ r_{ij}^{k} = \frac{\lambda_{ij}^{1k}}{\lambda_{ij}^{0k}} \approx \frac{P_{ij}^{1k}}{P_{ij}^{0k}}. \tag{A.100} \]

Let us now define the odds ratio $\psi_{ij}^{k}$, for stratum $k$, as the ratio of the odds of disease in the exposed and non-exposed subgroups, i.e.,

\[ \psi_{ij}^{k} = \left( P_{ij}^{1k} / Q_{ij}^{1k} \right) \div \left( P_{ij}^{0k} / Q_{ij}^{0k} \right) = \frac{P_{ij}^{1k} Q_{ij}^{0k}}{P_{ij}^{0k} Q_{ij}^{1k}}. \tag{A.101} \]

Again based on the assumption that the study period is short or the disease is rare, we get $Q_{ij}^{0k} \approx Q_{ij}^{1k} \approx 1$. This leads to the following approximate relationship between $\psi_{ij}^{k}$ and $r_{ij}^{k}$:

\[ \psi_{ij}^{k} = \frac{P_{ij}^{1k} Q_{ij}^{0k}}{P_{ij}^{0k} Q_{ij}^{1k}} \approx \frac{P_{ij}^{1k}}{P_{ij}^{0k}} \approx r_{ij}^{k}. \tag{A.102} \]

A.5 Analysis of Grouped Data

Using the theory developed above, let us now proceed to draw inference based on actual data. Henceforth we will state most of the results without proof. For details, please refer to Breslow and Day (1980) and Woodward (1999).

Suppose we are investigating the effect of a risk-factor on a particular disease. To avoid distortion of results due to other risk-factors, the study population is stratified into a number of strata. For each stratum of the study population, the data can be summarised in a 2 × 2 table, as given in Figure A.31.

| | Diseased | Healthy | Total |
|---|---|---|---|
| Exposed | $a_{ij}^{k}$ | $b_{ij}^{k}$ | $m_{ij}^{1k}$ |
| Unexposed | $c_{ij}^{k}$ | $d_{ij}^{k}$ | $m_{ij}^{0k}$ |
| Total | $n_{ij}^{1k}$ | $n_{ij}^{0k}$ | $N_{ij}^{k}$ |

Figure A.31: A 2 × 2 table with data for stratum $k$.
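As a brief numerical aside on Equation A.102 of Section A.4, the sketch below compares the ratio of transition probabilities with the odds ratio for a rare and for a common disease. The probabilities are hypothetical values chosen only to illustrate how the approximation degrades as the disease becomes more common.

```python
def probability_ratio(p_exposed, p_unexposed):
    """Ratio of transition probabilities, used to approximate the relative risk (A.100)."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed, p_unexposed):
    """Equation A.101: ratio of the odds of disease, with Q = 1 - P."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed


if __name__ == "__main__":
    # Hypothetical (exposed, unexposed) disease probabilities.
    for label, p1, p0 in [("rare", 0.002, 0.001), ("common", 0.4, 0.2)]:
        print(f"{label}: probability ratio={probability_ratio(p1, p0):.4f}, "
              f"odds ratio={odds_ratio(p1, p0):.4f}")
```

For the rare disease the two measures nearly coincide, whereas for the common disease the odds ratio is noticeably further from 1 than the probability ratio.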
From the data, we can obtain estimates of the transition probabilities as follows:

\[ \hat{P}_{ij}^{1k} = \frac{a_{ij}^{k}}{a_{ij}^{k} + b_{ij}^{k}}, \qquad \hat{P}_{ij}^{0k} = \frac{c_{ij}^{k}}{c_{ij}^{k} + d_{ij}^{k}}. \tag{A.103} \]

These can then be used to derive an estimate of relative risk as follows:

\[ \hat{r}_{ij}^{k} = \frac{\hat{P}_{ij}^{1k}}{\hat{P}_{ij}^{0k}} = \frac{a_{ij}^{k} / (a_{ij}^{k} + b_{ij}^{k})}{c_{ij}^{k} / (c_{ij}^{k} + d_{ij}^{k})} = \frac{a_{ij}^{k} / m_{ij}^{1k}}{c_{ij}^{k} / m_{ij}^{0k}}. \tag{A.104} \]

Using a log transformation and a normality assumption, the standard error can be estimated by

\[ \widehat{\mathrm{se}}(\log_e \hat{r}_{ij}^{k}) = \sqrt{ \frac{1}{a_{ij}^{k}} - \frac{1}{a_{ij}^{k} + b_{ij}^{k}} + \frac{1}{c_{ij}^{k}} - \frac{1}{c_{ij}^{k} + d_{ij}^{k}} }. \tag{A.105} \]

The estimate and the estimated standard error can then be used to obtain approximate confidence intervals for $r_{ij}^{k}$. They can also be used to obtain p-values for testing hypotheses on $r_{ij}^{k}$.

Similarly, estimates can be obtained for the odds ratio:

\[ \hat{\psi}_{ij}^{k} = \frac{\hat{P}_{ij}^{1k} \hat{Q}_{ij}^{0k}}{\hat{P}_{ij}^{0k} \hat{Q}_{ij}^{1k}} = \frac{ \dfrac{a_{ij}^{k}}{a_{ij}^{k} + b_{ij}^{k}} \cdot \dfrac{d_{ij}^{k}}{c_{ij}^{k} + d_{ij}^{k}} }{ \dfrac{c_{ij}^{k}}{c_{ij}^{k} + d_{ij}^{k}} \cdot \dfrac{b_{ij}^{k}}{a_{ij}^{k} + b_{ij}^{k}} } = \frac{a_{ij}^{k} d_{ij}^{k}}{b_{ij}^{k} c_{ij}^{k}}, \tag{A.106} \]

\[ \widehat{\mathrm{se}}(\log_e \hat{\psi}_{ij}^{k}) = \sqrt{ \frac{1}{a_{ij}^{k}} + \frac{1}{b_{ij}^{k}} + \frac{1}{c_{ij}^{k}} + \frac{1}{d_{ij}^{k}} }. \tag{A.107} \]

Again, approximate confidence intervals and p-values can be obtained for $\psi_{ij}^{k}$ using these equations.

Note that the marginal totals, $m_{ij}^{uk}$, are meaningless for case-control studies, as individuals are selected according to their disease status and not their exposure status. As a result, relative risks cannot be estimated for case-control studies. However, no such problem exists for the estimation of odds ratios, as the marginal totals cancel out. Moreover, if the disease is rare or if the study period is short, the odds ratios are good approximations to the relative risks. So for case-control studies we will only concentrate on the estimation of odds ratios.

Until now, we have calculated odds ratios separately for each stratum.
However, if we can assume that there is a common true odds ratio for each stratum and the differences in the observed odds ratios are purely due to chance variation, the estimate of the common odds ratio is given by the Mantel-Haenszel estimate:

\[ \hat{\psi}_{ij} = \left( \sum_k \frac{a_{ij}^{k} d_{ij}^{k}}{N_{ij}^{k}} \right) \Bigg/ \left( \sum_k \frac{b_{ij}^{k} c_{ij}^{k}}{N_{ij}^{k}} \right). \tag{A.108} \]

The estimate of the standard error of $\log_e \hat{\psi}_{ij}$, as proposed by Robins et al. (1986), has the following form:

\[ \widehat{\mathrm{se}}(\log_e \hat{\psi}_{ij}) = \sqrt{ \frac{\sum_k U_{ij}^{k} W_{ij}^{k}}{2 \left( \sum_k W_{ij}^{k} \right)^2} + \frac{\sum_k U_{ij}^{k} X_{ij}^{k} + \sum_k V_{ij}^{k} W_{ij}^{k}}{2 \sum_k W_{ij}^{k} \sum_k X_{ij}^{k}} + \frac{\sum_k V_{ij}^{k} X_{ij}^{k}}{2 \left( \sum_k X_{ij}^{k} \right)^2} }, \tag{A.109} \]

where, for stratum $k$,

\[ U_{ij}^{k} = \frac{a_{ij}^{k} + d_{ij}^{k}}{N_{ij}^{k}}, \quad V_{ij}^{k} = \frac{b_{ij}^{k} + c_{ij}^{k}}{N_{ij}^{k}}, \quad W_{ij}^{k} = \frac{a_{ij}^{k} d_{ij}^{k}}{N_{ij}^{k}}, \quad X_{ij}^{k} = \frac{b_{ij}^{k} c_{ij}^{k}}{N_{ij}^{k}}. \tag{A.110} \]

A.6 Analysis of Matched Studies

In a case-control study, individuals are selected according to their disease status. If cases and controls are chosen independently, there is a chance that the profiles of the individuals in the control group will be different from those of the cases. This difference will then feed into the analysis to distort the results. Matching is a method which tackles this problem by choosing controls based on the profiles of the cases.

Matching uses the concept of stratification, introduced in Section A.3, to subdivide the study population into a number of strata. The cases are first classified according to the strata they come from. Controls are then chosen in such a way that they have a distribution similar to that of the cases across strata. This ensures that analysis can be done within each stratum, eliminating the distortions arising out of the differences between strata.
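The grouped-data estimates of Section A.5 (Equations A.106 to A.110) can be sketched in a few lines of code. This is an illustrative sketch rather than the implementation used in the thesis: each stratum is represented as a tuple `(a, b, c, d)` following the layout of Figure A.31, the counts in the example are hypothetical, and the function names are assumptions made here.

```python
import math


def odds_ratio_with_se(a, b, c, d):
    """Single-stratum odds ratio (A.106) and standard error of its log (A.107)."""
    psi = (a * d) / (b * c)
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return psi, se_log


def mantel_haenszel(tables):
    """Mantel-Haenszel odds ratio (A.108) with the standard error of A.109."""
    sum_w = sum_x = sum_uw = sum_vx = sum_cross = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        u, v = (a + d) / n, (b + c) / n        # U and V of Equation A.110
        w, x = a * d / n, b * c / n            # W and X of Equation A.110
        sum_w += w
        sum_x += x
        sum_uw += u * w
        sum_vx += v * x
        sum_cross += u * x + v * w
    psi = sum_w / sum_x
    variance = (sum_uw / (2.0 * sum_w ** 2)
                + sum_cross / (2.0 * sum_w * sum_x)
                + sum_vx / (2.0 * sum_x ** 2))
    return psi, math.sqrt(variance)


def confidence_interval(psi, se_log, z=1.96):
    """Approximate interval for the odds ratio, built on the log scale."""
    return psi * math.exp(-z * se_log), psi * math.exp(z * se_log)


if __name__ == "__main__":
    strata = [(10, 20, 5, 40), (20, 40, 10, 80)]  # hypothetical counts
    psi, se_log = mantel_haenszel(strata)
    print("MH odds ratio:", psi, "95% CI:", confidence_interval(psi, se_log))
```

A useful sanity check is that, with a single stratum, the Mantel-Haenszel estimate and the standard error of Equation A.109 reduce to the expressions in Equations A.106 and A.107.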
However, care needs to be taken to guard against over-matching. As an extreme example, suppose that the study population is stratified for the risk-factor in question. This will then result in the same distribution of cases and controls for each exposure level. No conclusions can then be drawn from the analysis. So it is important to leave aside the risk-factor in question while stratifying the study population.

The simplest form of matching is 1:1 matching or pair matching. Here, for each case, a control is chosen from the same stratum irrespective of the exposure status. A case-control pair can then be identified with one of the four possibilities shown in Figure A.32.

No exposures

| | Cases | Controls | Total |
|---|---|---|---|
| Exposed | 0 | 0 | 0 |
| Unexposed | 1 | 1 | 2 |
| Total | 1 | 1 | 2 |

Two exposures

| | Cases | Controls | Total |
|---|---|---|---|
| Exposed | 1 | 1 | 2 |
| Unexposed | 0 | 0 | 0 |
| Total | 1 | 1 | 2 |

One exposure (case exposed)

| | Cases | Controls | Total |
|---|---|---|---|
| Exposed | 1 | 0 | 1 |
| Unexposed | 0 | 1 | 1 |
| Total | 1 | 1 | 2 |

One exposure (control exposed)

| | Cases | Controls | Total |
|---|---|---|---|
| Exposed | 0 | 1 | 1 |
| Unexposed | 1 | 0 | 1 |
| Total | 1 | 1 | 2 |

Figure A.32: The types of table for each case-control pair in a 1:1 matching.

If we assume that each case-control pair represents a stratum and that there exists a common odds ratio for all strata, we can derive the Mantel-Haenszel estimate using Equation A.108. Let $t_u$ be the number of sets with $u$ exposures, and $m_u$ be the number of sets with $u$ exposures in which the case is exposed. Using this notation in Equation A.108 we get,

\[ \hat{\psi}_{ij} = \left( \sum_k \frac{a_{ij}^{k} d_{ij}^{k}}{N_{ij}^{k}} \right) \Bigg/ \left( \sum_k \frac{b_{ij}^{k} c_{ij}^{k}}{N_{ij}^{k}} \right) = \frac{ t_0 \times \frac{0}{2} + m_1 \times \frac{1}{2} + (t_1 - m_1) \times \frac{0}{2} + t_2 \times \frac{0}{2} }{ t_0 \times \frac{0}{2} + m_1 \times \frac{0}{2} + (t_1 - m_1) \times \frac{1}{2} + t_2 \times \frac{0}{2} } = \frac{m_1}{t_1 - m_1}. \tag{A.111} \]

In other words, among the pairs in which exactly one of the case and the control is exposed, the estimate is the ratio of the number of exposed cases to the number of exposed controls. Note that the sets where both case and control are exposed, or where both are unexposed, contain no extra information. So these terms are eliminated from Equation A.111.

The standard error of the estimate can be derived using Equation A.109.
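The reduction in Equation A.111 can be verified numerically by treating each matched pair as its own stratum, building the corresponding table from Figure A.32, and applying Equation A.108 directly. The sketch below does this for a hypothetical set of pairs; the data and the helper names are illustrative assumptions only.

```python
def mantel_haenszel_odds_ratio(tables):
    """Equation A.108 applied to a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# (case exposed?, control exposed?) -> (a, b, c, d) as laid out in Figure A.32,
# where a, c count exposed/unexposed cases and b, d exposed/unexposed controls.
PAIR_TABLES = {
    (False, False): (0, 0, 1, 1),
    (True, True): (1, 1, 0, 0),
    (True, False): (1, 0, 0, 1),
    (False, True): (0, 1, 1, 0),
}


def matched_pair_odds_ratio(pairs):
    """Equation A.111: m1 / (t1 - m1), using discordant pairs only."""
    m1 = sum(1 for case, control in pairs if case and not control)
    only_control_exposed = sum(1 for case, control in pairs if control and not case)
    return m1 / only_control_exposed


# Hypothetical study: 5 pairs with no exposure, 12 with only the case exposed,
# 4 with only the control exposed and 3 with both exposed.
EXAMPLE_PAIRS = ([(False, False)] * 5 + [(True, False)] * 12
                 + [(False, True)] * 4 + [(True, True)] * 3)

if __name__ == "__main__":
    stratified = mantel_haenszel_odds_ratio([PAIR_TABLES[p] for p in EXAMPLE_PAIRS])
    shortcut = matched_pair_odds_ratio(EXAMPLE_PAIRS)
    print("A.108 on pair tables:", stratified, "| A.111 shortcut:", shortcut)
```

Both routes use only the discordant pairs; the concordant pairs contribute zero to the numerator and the denominator of Equation A.108.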
However, when $t_i$ is small, Breslow and Day (1980) have provided a formula for an exact $100(1 - \alpha)\%$ confidence interval $(\psi_L, \psi_U)$, where

\[ \psi_L = \frac{m_1}{(t_1 - m_1 + 1)\, F_{\alpha/2}\left( 2(t_1 - m_1 + 1),\, 2 m_1 \right)}, \qquad \psi_U = \frac{(m_1 + 1)\, F_{\alpha/2}\left( 2(m_1 + 1),\, 2(t_1 - m_1) \right)}{t_1 - m_1}. \tag{A.112} \]

Here $F_{\alpha/2}(\nu_1, \nu_2)$ denotes the upper $100(\alpha/2)$ percentile of the F distribution with $\nu_1$ and $\nu_2$ degrees of freedom.

As it is highly likely that, for a rare disease, there are more controls available than there are cases, it is possible to develop a design where each case is matched to a number of controls, say $c$. Increasing $c$ increases the efficiency of the estimates as the standard errors fall. However, for each increase in $c$, the marginal increase in efficiency decreases. So 1:$c$ matching is rarely performed with $c$ greater than 5. For 1:$c$ matching, using techniques similar to those used for 1:1 matching, we can derive the Mantel-Haenszel estimate of the odds ratio as follows:

\[ \hat{\psi}_{ij} = \frac{\sum_{u=1}^{c} (c + 1 - u)\, m_u}{\sum_{u=1}^{c} u\, (t_u - m_u)}. \tag{A.113} \]

Miettinen (1970) gives an approximate formula for the standard error of $\log_e \hat{\psi}_{ij}$:

\[ \widehat{\mathrm{se}}(\log_e \hat{\psi}_{ij}) = \left[ \hat{\psi}_{ij} \sum_{u=1}^{c} \frac{u\,(c + 1 - u)\, t_u}{(u \hat{\psi}_{ij} + c + 1 - u)^2} \right]^{-0.5}. \tag{A.114} \]

Sometimes in a 1:$c$ matching, it is possible that data from a few controls may not be available. In this situation, a case can be matched against a number of controls which is not fixed but can vary between 1 and $c$. This then becomes a 1:variable matching design. Using similar techniques, estimates of the odds ratio can be obtained. Let $v$ denote the number of controls that are matched with any one case, where $v = 1, 2, \cdots, c$, and let $t_u^{(v)}$ and $m_u^{(v)}$ be defined as before but restricted to the sets with $v$ controls. The Mantel-Haenszel estimate is given by

\[ \hat{\psi}_{ij} = \frac{\sum_{v=1}^{c} \sum_{u=1}^{v} T_u^{(v)}}{\sum_{v=1}^{c} \sum_{u=1}^{v} B_u^{(v)}}, \tag{A.115} \]

where,

\[ T_u^{(v)} = \frac{(v + 1 - u)\, m_u^{(v)}}{v + 1}, \qquad B_u^{(v)} = \frac{u\, (t_u^{(v)} - m_u^{(v)})}{v + 1}. \tag{A.116} \]

Also, Equation A.114 can be generalised to obtain the standard error of $\log_e \hat{\psi}_{ij}$ for the estimate in Equation A.115:

\[ \widehat{\mathrm{se}}(\log_e \hat{\psi}_{ij}) = \left[ \hat{\psi}_{ij} \sum_{v=1}^{c} \sum_{u=1}^{v} \frac{u\,(v + 1 - u)\, t_u^{(v)}}{(u \hat{\psi}_{ij} + v + 1 - u)^2} \right]^{-0.5}. \tag{A.117} \]

The most general of all matching strategies is the many:many matching design. Here a variable number of controls are matched against a variable number of cases. Although conceptually more difficult, similar techniques can be used to derive the Mantel-Haenszel estimate of the odds ratio.

Suppose that $m_{uk}^{(rs)}$ is the number of matched sets with $r$ cases and $s$ controls in which there are $u$ exposures to the risk-factor, $k$ of which are exposed cases. Then

\[ \hat{\psi}_{ij} = \frac{\sum T_{uk}^{(rs)}}{\sum B_{uk}^{(rs)}}, \tag{A.118} \]

where the sums are taken over all types of matched set and

\[ T_{uk}^{(rs)} = \frac{k\,(s - u + k)\, m_{uk}^{(rs)}}{r + s}, \qquad B_{uk}^{(rs)} = \frac{(u - k)(r - k)\, m_{uk}^{(rs)}}{r + s}. \tag{A.119} \]

The standard error of the estimate can be derived using Equation A.109.

A.7 Effects of Combined Exposures

Until now, we have looked at models to study the effect of one particular risk-factor at a time. However, in reality, all human diseases are caused by the combined interactions of a number of risk factors. In Section A.1, we have briefly touched upon gene-environment interactions, which study the combined effects of genetic and environmental factors precipitating a disease. In this section, we will develop models to analyse the effects of combined exposures on a disease.

Suppose we are interested in two risk factors A and B. Extending the notation developed in Section A.3, let $\lambda_{ij}^{uvk}$ be the transition intensity from state $i$ to state $j$ with exposure status $u$ of risk-factor A and exposure status $v$ of risk-factor B, for stratum $k$. As before, in the binary set-up, $u$ and $v$ can take values 1 or 0 depending on the exposure status. In a similar way, we can extend the notation of relative risk to $r_{ij}^{uvk}$ and odds ratio to $\psi_{ij}^{uvk}$. Using this notation, in the binary set-up, for stratum $k$, we can define:

\[ r_{ij}^{11k} = \frac{\lambda_{ij}^{11k}}{\lambda_{ij}^{00k}}, \quad r_{ij}^{10k} = \frac{\lambda_{ij}^{10k}}{\lambda_{ij}^{00k}}, \quad r_{ij}^{01k} = \frac{\lambda_{ij}^{01k}}{\lambda_{ij}^{00k}}, \quad \text{and} \quad r_{ij}^{00k} = \frac{\lambda_{ij}^{00k}}{\lambda_{ij}^{00k}} = 1. \tag{A.120} \]

Recall the definition of excess rate of risk in Section A.3.
Based on the same concept, for two risk factors, we can define three types of excess rates of risk, as follows:

\[ \begin{aligned} \lambda_{ij}^{11k} - \lambda_{ij}^{00k} &: \text{When exposed to both A and B.} \\ \lambda_{ij}^{10k} - \lambda_{ij}^{00k} &: \text{When exposed to A but unexposed to B.} \\ \lambda_{ij}^{01k} - \lambda_{ij}^{00k} &: \text{When exposed to B but unexposed to A.} \end{aligned} \tag{A.121} \]

Now let us assume that the effect of risk-factor A is independent of the effect of risk-factor B, or in other words, there is no interaction between the risk factors. Independence or non-interaction between risk factors can be interpreted in a number of ways. One possible formulation is to assume that the joint effect of risk factors A and B is additive, i.e.,

\[ (\lambda_{ij}^{11k} - \lambda_{ij}^{00k}) = (\lambda_{ij}^{10k} - \lambda_{ij}^{00k}) + (\lambda_{ij}^{01k} - \lambda_{ij}^{00k}), \tag{A.122} \]

which simplifies to:

\[ \lambda_{ij}^{11k} = \lambda_{ij}^{10k} + \lambda_{ij}^{01k} - \lambda_{ij}^{00k}. \tag{A.123} \]

Dividing both sides of Equation A.123 by $\lambda_{ij}^{00k}$ gives:

\[ r_{ij}^{11k} = r_{ij}^{10k} + r_{ij}^{01k} - 1. \tag{A.124} \]

An alternative characterisation for the joint association is the multiplicative or log-linear model. Here we assume that the log transformations of the transition intensities are additive. Under this formulation, Equation A.122 transforms into:

\[ \left( \log(\lambda_{ij}^{11k}) - \log(\lambda_{ij}^{00k}) \right) = \left( \log(\lambda_{ij}^{10k}) - \log(\lambda_{ij}^{00k}) \right) + \left( \log(\lambda_{ij}^{01k}) - \log(\lambda_{ij}^{00k}) \right), \tag{A.125} \]

which simplifies to:

\[ \log \frac{\lambda_{ij}^{11k}}{\lambda_{ij}^{00k}} = \log \frac{\lambda_{ij}^{10k}}{\lambda_{ij}^{00k}} + \log \frac{\lambda_{ij}^{01k}}{\lambda_{ij}^{00k}}, \tag{A.126} \]

which, when re-written in terms of relative risks, is:

\[ \log(r_{ij}^{11k}) = \log(r_{ij}^{10k}) + \log(r_{ij}^{01k}), \tag{A.127} \]

or equivalently,

\[ r_{ij}^{11k} = r_{ij}^{10k} \times r_{ij}^{01k}. \tag{A.128} \]

So in the above model, the independence or non-interaction of risk factors implies a multiplicative combination for the joint effect.

| A | B | Cases | Controls |
|---|---|---|---|
| + | + | $a_{ij}^{k}$ | $b_{ij}^{k}$ |
| + | − | $c_{ij}^{k}$ | $d_{ij}^{k}$ |
| − | + | $e_{ij}^{k}$ | $f_{ij}^{k}$ |
| − | − | $g_{ij}^{k}$ | $h_{ij}^{k}$ |

Figure A.33: A 2 × 4 table with data for stratum $k$.

Earlier in Section A.5, we have seen that in case-control studies, although relative risks cannot be estimated directly, odds ratios can be calculated and used as good approximations of relative risks.
So we will use odds ratios, instead of relative risks, to analyse the effects of combined exposures in case-control studies. To study the effects of two risk factors A and B, the data can be summarised in a 2 × 4 table, as given in Figure A.33, where ‘+’ implies exposure and ‘−’ implies non-exposure.

Table A.37 lists all possible odds ratios that can be calculated from the data given in Figure A.33. The first odds ratio, $\psi_{ij}^{11k}$, measures the joint effect of the risk factors A and B. The next two odds ratios, $\psi_{ij}^{10k}$ and $\psi_{ij}^{01k}$, measure the effect of one risk-factor at a time. The remaining four odds ratios, $\psi_{ij}^{1*k}$, $\psi_{ij}^{0*k}$, $\psi_{ij}^{*1k}$ and $\psi_{ij}^{*0k}$, stratify the population based on the exposure level of one risk-factor and then measure the effect of the other risk-factor. The asterisk, in the notation of these last four odds ratios, denotes the risk-factor for which the effect is being measured. For example, $\psi_{ij}^{1*k}$ is the odds ratio measuring the effect of exposure to B, for those who are already exposed to A.

Table A.37: List of odds ratios obtained from the 2 × 4 table in Figure A.33.

| Notation | Formula | Main Information |
|---|---|---|
| $\psi_{ij}^{11k}$ | $\dfrac{a_{ij}^{k} h_{ij}^{k}}{b_{ij}^{k} g_{ij}^{k}}$ | Effect of joint exposures versus none. |
| $\psi_{ij}^{10k}$ | $\dfrac{c_{ij}^{k} h_{ij}^{k}}{d_{ij}^{k} g_{ij}^{k}}$ | Effect of exposure to A alone versus none. |
| $\psi_{ij}^{01k}$ | $\dfrac{e_{ij}^{k} h_{ij}^{k}}{f_{ij}^{k} g_{ij}^{k}}$ | Effect of exposure to B alone versus none. |
| $\psi_{ij}^{1*k}$ | $\dfrac{a_{ij}^{k} d_{ij}^{k}}{b_{ij}^{k} c_{ij}^{k}}$ | Effect of exposure to B, given exposed to A. |
| $\psi_{ij}^{0*k}$ | $\dfrac{e_{ij}^{k} h_{ij}^{k}}{f_{ij}^{k} g_{ij}^{k}}$ | Effect of exposure to B, given unexposed to A. |
| $\psi_{ij}^{*1k}$ | $\dfrac{a_{ij}^{k} f_{ij}^{k}}{b_{ij}^{k} e_{ij}^{k}}$ | Effect of exposure to A, given exposed to B. |
| $\psi_{ij}^{*0k}$ | $\dfrac{c_{ij}^{k} h_{ij}^{k}}{d_{ij}^{k} g_{ij}^{k}}$ | Effect of exposure to A, given unexposed to B. |

The odds ratios, $\psi_{ij}^{11k}$, $\psi_{ij}^{10k}$ and $\psi_{ij}^{01k}$, can also be used to measure the deviation of the data from both additive and multiplicative models. The first two measures, given in Table A.38, provide direct checks on deviation from these models. The case only odds ratio gives an alternative measure to check departure from the multiplicative model.
The control only odds ratio estimates exposure dependencies in the underlying population. A discussion of these last two measures is given in Khoury and Flanders (1996). For a general discussion on the use of 2 × 4 tables for measuring combined exposures, please refer to Botto and Khoury (2001).

Table A.38: Other measures based on the 2 × 4 table in Figure A.33.

Other measures                Formula
Multiplicative interaction    $\psi^{11k}_{ij} / (\psi^{10k}_{ij} \psi^{01k}_{ij})$
Additive interaction          $\psi^{11k}_{ij} - (\psi^{10k}_{ij} + \psi^{01k}_{ij} - 1)$
Case only odds ratio          $a^k_{ij} g^k_{ij} / (c^k_{ij} e^k_{ij})$
Control only odds ratio       $b^k_{ij} h^k_{ij} / (d^k_{ij} f^k_{ij})$

Appendix B

Numerical Methods

B.1 Differential Equations

B.1.1 Introduction

In this section, we will briefly describe how the transition intensities introduced earlier can be used to formulate a set of differential equations which can be solved for the transition and occupation probabilities. For details, please refer to Press et al. (2002).

Here we will consider a general n-state model. Using the definitions and notation of the previous chapter, we have the following set of equations, commonly referred to as the Kolmogorov forward equations:

$$
\frac{d}{dt} P_{ij}(s, t) = \sum_{k} P_{ik}(s, t) \lambda_{kj}(t),
\tag{B.129}
$$

or in matrix notation,

$$
\mathbf{P}'(s, t) = \mathbf{P}(s, t) \times \boldsymbol{\Lambda}(t),
\tag{B.130}
$$

with the boundary condition $\mathbf{P}(s, s) = \mathbf{I}$. With arbitrary $\boldsymbol{\Lambda}(t)$, defined by typical life history events, we can only solve these equations numerically and not explicitly. We now discuss some numerical methods of solving differential equations.

B.1.2 Euler Method

The formula for the Euler method is:

$$
\mathbf{P}(s, t + h) = \mathbf{P}(s, t) + h \times \mathbf{P}'(s, t)
\tag{B.131}
$$

which advances a solution from t to t + h. However, this method advances the solution through an interval of length h using derivative information only at the beginning of that interval. Although the method converges, it is inefficient and asymmetric and is not normally recommended.
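As a rough illustration of the Euler update in Equation B.131, the following Python sketch advances the Kolmogorov forward equations for a small model. It is not the thesis code (which is written in C++): the two-state alive/dead model, the Gompertz-style hazard and the names `intensity_matrix`, `euler_forward` and `exact_survival` are all assumptions made for this example. The hazard is chosen only because its integral has a closed form, which gives something to compare the Euler result against.

```python
import math

def intensity_matrix(t):
    """Hypothetical 2-state (alive, dead) intensity matrix Lambda(t).

    The Gompertz-style hazard is an illustrative assumption, not a
    fitted intensity from the thesis. Each row sums to zero.
    """
    mu = 0.0005 * math.exp(0.09 * t)
    return [[-mu, mu],
            [0.0, 0.0]]

def mat_mul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def euler_forward(s, t_end, steps, lam=intensity_matrix):
    """Advance P(s, t) from t = s to t = t_end using Euler steps of size h."""
    n = len(lam(s))
    h = (t_end - s) / steps
    # Boundary condition P(s, s) = I.
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    t = s
    for _ in range(steps):
        deriv = mat_mul(p, lam(t))  # P'(s, t) = P(s, t) x Lambda(t)
        p = [[p[i][j] + h * deriv[i][j] for j in range(n)] for i in range(n)]
        t += h
    return p

def exact_survival(s, t_end):
    """Closed-form survival probability for the assumed hazard."""
    integral = 0.0005 / 0.09 * (math.exp(0.09 * t_end) - math.exp(0.09 * s))
    return math.exp(-integral)

if __name__ == "__main__":
    for steps in (30, 300, 3000):
        p = euler_forward(30.0, 60.0, steps)
        print(steps, p[0][0], exact_survival(30.0, 60.0))
```

Running the script with a decreasing step size shows the Euler estimate of the survival probability approaching the closed-form value, with the error shrinking roughly in proportion to h, which is the first-order behaviour that motivates the higher-order methods.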
The Euler method can easily be improved upon by making use of an intermediate solution to achieve greater accuracy. A simple approach is to find a solution at the mid-point of the interval and then to obtain the solution at the end of the interval, as illustrated below. Define:

$$
K_1 = h \times \mathbf{P}'(s, t) = h \times \mathbf{P}(s, t) \times \boldsymbol{\Lambda}(t),
\tag{B.132}
$$

$$
K_2 = h \times \left\{ \mathbf{P}(s, t) + \tfrac{1}{2} K_1 \right\} \times \boldsymbol{\Lambda}\left(t + \tfrac{1}{2} h\right),
\tag{B.133}
$$

leading to:

$$
\mathbf{P}(s, t + h) = \mathbf{P}(s, t) + K_2 + O(h^3)
\tag{B.134}
$$

This method is sometimes referred to as the midpoint method and can be further refined to give the fourth-order Runge-Kutta method, which is outlined in the next section.

B.1.3 Runge-Kutta Method

By far the most often used method is the classical fourth-order Runge-Kutta formula. The steps are outlined below. Define:

$$
\begin{aligned}
K_1 &= h \times \mathbf{P}'(s, t) = h \times \mathbf{P}(s, t) \times \boldsymbol{\Lambda}(t) \\
K_2 &= h \times \left\{ \mathbf{P}(s, t) + \tfrac{1}{2} K_1 \right\} \times \boldsymbol{\Lambda}\left(t + \tfrac{1}{2} h\right) \\
K_3 &= h \times \left\{ \mathbf{P}(s, t) + \tfrac{1}{2} K_2 \right\} \times \boldsymbol{\Lambda}\left(t + \tfrac{1}{2} h\right) \\
K_4 &= h \times \left\{ \mathbf{P}(s, t) + K_3 \right\} \times \boldsymbol{\Lambda}(t + h),
\end{aligned}
\tag{B.135}
$$

leading to:

$$
\mathbf{P}(s, t + h) = \mathbf{P}(s, t) + \tfrac{1}{6} K_1 + \tfrac{1}{3} K_2 + \tfrac{1}{3} K_3 + \tfrac{1}{6} K_4 + O(h^5)
\tag{B.136}
$$

For any multiple state model, the transition intensities form the fundamental building blocks. So in almost all circumstances we will be able to define a set of differential equations specifying the problem, and numerical solutions can be computed using the Runge-Kutta method.

B.2 Random Numbers

B.2.1 Introduction

Generating random numbers from a particular distribution is one of the most important tasks in a simulation exercise. This topic is covered in many textbooks on numerical analysis, so this section is not meant to be an exhaustive discussion. Rather, the aim is to document the methods that we are going to use. For a fuller treatment of the topic, please refer to Press et al. (2002).

In the next section, we will give a brief introduction to the generation of random numbers from a uniform distribution.
Then we will move on to other distributions of interest, from which random numbers can be generated using suitable transformations. In the final section, we will outline a method that can be used for any general continuous distribution.

B.2.2 Uniform Deviates

Standard libraries of all major programming languages provide random number generators. We will concentrate primarily on C++, as all our programs are written in that language. C++ has inherited from the ANSI C library a pair of routines, srand() and rand(), for initialising and then generating random numbers. The random number generator is initialised with a seed, and a sequence of random numbers is generated from that seed. Note that the same seed will always produce the same sequence of random numbers.

The rand() function of C++ is typically implemented as a linear congruential generator, which generates a sequence of integers $I_1, I_2, \ldots$, each between 0 and m − 1, by the recurrence relation

$$
I_{j+1} = a I_j + c \pmod{m}.
$$

Here m is called the modulus, and a and c are positive integers called the multiplier and the increment respectively. ANSI C requires only that m be at least 32768, which is too small for any large scale simulation exercise.

In Press et al. (2002), there is a detailed discussion of efficient routines for random number generation, the salient features of which are listed below.

ran0: This routine is a simple linear congruential generator, which is satisfactory for the majority of applications. However, it is not recommended because of the presence of subtle serial correlations.

ran1: The routine uses the same algorithm as ran0. However, it shuffles the output to remove low-order serial correlation. The routine ran1 passes those statistical tests that ran0 is known to fail. However, it is 30% slower than ran0. This routine is recommended for general use.

ran2: The ran2 routine uses a long period random number generator with the shuffle.
It is recommended for generating more than 100,000,000 random numbers in a single calculation, as it has a longer period than ran1. However, this routine is only half as fast as ran0.

Our simulation exercise requires a very large number of random numbers, so we will use ran2. In Press et al. (2002), there is also a discussion of ran4, which generates "extremely" good random deviates. However, it is only half as fast as ran2 and we will not describe it here.

Unlike rand() of the C++ library, which generates integers, ran0, ran1 and ran2 produce uniform random deviates between 0.0 and 1.0 (exclusive of the endpoint values). Like the rand() function, all these random number generators require a seed to initiate the sequence. If a seed is not provided, the seed will automatically be set to the time of the machine clock.

B.2.3 The Transformation Method

In the last section, we saw how to generate uniform deviates using the ran2 routine. Now we will see how to use randomly generated uniform deviates to produce random deviates from a specific distribution.

Let us first look at a simple discrete distribution: the Bernoulli distribution. Let Y ∼ Ber(p), i.e. P[Y = 1] = p and P[Y = 0] = 1 − p. The following steps can be used to generate random deviates from this distribution.

(a) Generate a random deviate x from a U(0, 1) distribution.

(b) If x < p produce 1, else produce 0 as the required random deviate from Ber(p).

Random deviates from Bin(n, p) can be produced by adding n independent Ber(p) random deviates.

Random deviates from the $Multinomial(n, p_1, p_2, \ldots, p_n)$ can be generated as follows:

(a) Generate a random deviate x from a U(0, 1) distribution.

(b) If $x \le p_1$ produce 1, else if $\sum_{j=1}^{k-1} p_j < x \le \sum_{j=1}^{k} p_j$ produce k as the required random variate.

For continuous distributions, let us first consider U(a, b), a simple generalisation of U(0, 1).
We know that if we define Y = a + (b − a)X where X ∼ U(0, 1), then Y ∼ U(a, b). So if we generate x from U(0, 1) and define y = a + (b − a)x, then y is a random deviate from U(a, b). So a simple linear transformation of U(0, 1) produces random deviates from the U(a, b) distribution.

The next distribution of interest is the exponential distribution, Exp(λ). Here we use the following transformation:

$$
Y = -\frac{1}{\lambda} \log(1 - X).
$$

If X ∼ U(0, 1), then Y ∼ Exp(λ).

Note that in both the examples above, we have made use of the fact that for any continuous random variable Y, F(Y) ∼ U(0, 1), where F(·) is the cumulative distribution function of Y. In other words, X = F(Y) ∼ U(0, 1). So $Y = F^{-1}(X)$ has the cumulative distribution function F(·).

A general method for producing random deviates from any random variable with cumulative distribution function F(·) requires the following steps:

(a) Generate a random deviate x from a U(0, 1) distribution.

(b) Find y, such that F(y) = x.

(c) Produce y as a random deviate from F(·).

The result above can be used to generate random deviates from a general distribution if the cumulative distribution function for that distribution can be inverted. However, few of the distributions that we will be interested in have a cumulative distribution function that can be inverted easily. Of course, an iterative method can be used as a substitute. However, the algorithm above will not be efficient if F(y) is not easy to compute. If this is the case, then it is advisable to tabulate the values of F(y) at appropriately closely spaced y's, and use linear interpolation at intermediate points. Note that the shorter the spacing between tabulated y's, the greater the accuracy but the larger the space requirement.

As an example, let us assume that we know the age-dependent transition intensity λ(x) for a particular hazard.
Suppose we are interested in generating the waiting time T for an individual aged a to make the relevant transition. Measuring t on the age scale, so that the waiting time is t − a, the relevant distribution function for t ≥ a is given by

$$
F(t) = 1 - \frac{\exp\left( -\int_0^t \lambda(s)\,ds \right)}{\exp\left( -\int_0^a \lambda(s)\,ds \right)}
\tag{B.137}
$$

Unless we have a very simple form for λ(·), we will have to perform numerical integration each time we need F(t). As this is inefficient and time consuming, we evaluate the values $F(t_1), F(t_2), \ldots$ where $t_{j+1} = t_j + \delta$, δ being a small positive number, say 0.01, and then store these values for ready reference.

Now, following the algorithm outlined above, generate a uniform random variate x and find t such that F(t) = x. One can choose an efficient search algorithm to minimise the search for the correct t. As we are searching within a bounded interval, the bisection method can be used. The t thus obtained gives us the required waiting time.

There is an important point to note here. Many of the transition intensities that we will be working with may not have the property F(∞) = 1. This means that there is a probability 1 − F(∞) that an individual will not make a transition at all. This can be taken into account by generating a Bernoulli random variate Y ∼ Ber(F(∞)), where Y = 0 indicates that the individual will never make the transition and Y = 1 indicates otherwise. So the above algorithm will only be implemented if Y = 1, as searching for a value of t is only required if a transition is made. In that case x should be drawn from U(0, F(∞)), so that a solution of F(t) = x exists.

For a Normal distribution, the cumulative distribution function is not easily invertible. So a different transformation, known as the Box-Muller transformation, is usually used to produce standard normal deviates. Consider the transformation between two random deviates $x_1, x_2$ from U(0, 1) and two quantities $y_1, y_2$,

$$
y_1 = \sqrt{-2 \ln x_1} \, \cos 2\pi x_2
\tag{B.138}
$$

$$
y_2 = \sqrt{-2 \ln x_1} \, \sin 2\pi x_2
\tag{B.139}
$$

It can be shown that $y_1, y_2$ are independent random deviates from the N(0, 1) distribution.
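The two transformations that close this section, the tabulate-and-search procedure for waiting times and the Box-Muller transformation, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the thesis implementation (which is in C++ and uses ran2): the function names, the trapezoidal rule used to build the table, and the finite `horizon` that stands in for F(∞) are all choices made for this example. A single uniform deviate is compared with the last tabulated value of F, which plays the same role as the separate Bernoulli draw described in the text.

```python
import math
import random

def tabulate_cdf(hazard, age, horizon, delta=0.01):
    """Tabulate F on the age grid age, age + delta, ..., age + horizon.

    F(t) = 1 - exp(-integral of hazard from age to t); the integral is
    accumulated with the trapezoidal rule (an assumption of this sketch).
    """
    n = int(round(horizon / delta))
    grid = [age + j * delta for j in range(n + 1)]
    cdf = [0.0]
    cum = 0.0
    for j in range(1, n + 1):
        cum += 0.5 * delta * (hazard(grid[j - 1]) + hazard(grid[j]))
        cdf.append(1.0 - math.exp(-cum))
    return grid, cdf

def sample_waiting_time(grid, cdf, rng):
    """Return a waiting time, or None if no transition occurs within the table."""
    x = rng.random()
    if x >= cdf[-1]:
        return None  # happens with probability 1 - F(end of table)
    lo, hi = 0, len(cdf) - 1  # invariant: cdf[lo] <= x < cdf[hi]
    while hi - lo > 1:        # bisection on the tabulated values
        mid = (lo + hi) // 2
        if cdf[mid] <= x:
            lo = mid
        else:
            hi = mid
    # Linear interpolation between the two bracketing grid points.
    frac = (x - cdf[lo]) / (cdf[hi] - cdf[lo])
    t = grid[lo] + frac * (grid[hi] - grid[lo])
    return t - grid[0]  # convert age at transition to waiting time

def box_muller(rng):
    """Return two independent N(0, 1) deviates (Equations B.138 and B.139)."""
    x1 = 1.0 - rng.random()  # lies in (0, 1], so log(x1) is finite
    x2 = rng.random()
    r = math.sqrt(-2.0 * math.log(x1))
    return r * math.cos(2.0 * math.pi * x2), r * math.sin(2.0 * math.pi * x2)

if __name__ == "__main__":
    rng = random.Random(2007)
    # Hypothetical constant hazard of 0.1 per year for a life aged 40.
    grid, cdf = tabulate_cdf(lambda s: 0.1, age=40.0, horizon=200.0)
    print(sample_waiting_time(grid, cdf, rng), box_muller(rng))
```

The table is built once and reused for every simulated life of the same age, so the cost of the numerical integration is paid only once, and each draw then costs one uniform deviate plus a bisection search over the table.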
B.2.4 The Rejection Method

The rejection method is a powerful, general technique for generating random deviates from a distribution whose density function p(·) is known and computable. The rejection method does not require that the cumulative distribution function be readily computable, much less the inverse of that function, which was required for the transformation method described in the previous section.

The rejection method involves the following steps:

(a) Find a majorising function M(·), for which M(x) > p(x) for all x.

(b) Calculate the area A under the majorising function M(·), i.e. $A = \int_{-\infty}^{\infty} M(s)\,ds$.

(c) Generate a random deviate $x_1$ from U(0, A).

(d) Find y, such that $x_1 = \int_{-\infty}^{y} M(s)\,ds$.

(e) Generate a random variate $x_2$ from U(0, 1).

(f) If $x_2 < p(y)/M(y)$, produce y as the required random deviate from the distribution with density function p(·); otherwise return to step (c).

As we have already seen how to generate uniform random deviates, the main issue here is to obtain an appropriate majorising function. There are many different ways one can define a majorising function, and the suitability of the majorising function will also depend on the shape of p(·). Also, apart from the requirement that M(x) > p(x) for all x, it should also be easy to invert $\int_{-\infty}^{y} M(s)\,ds$. Here, we will propose a general method of producing the majorising function for any density function p(·).

Our aim will be to find a step function M(x) which will provide an upper envelope for p(x). For this, we need to start off from any $x = x_0$, such that $0 < p(x_0) < \infty$. Given $x_0$, we move on to $x_1$, such that M(x) on this interval is a constant and exceeds p(x) for all x in that interval, and the area under M(x) does not exceed a pre-specified positive number. Once p(x) becomes smaller than a set tolerance level, it is assumed that the tail of the distribution is reached, and the tail is approximated by an exponential function.
These steps are followed on both sides of $x_0$, towards +∞ and −∞. The full algorithm is outlined below.

First find $x = x_0$ such that $0 < p(x_0) < \infty$. This will be the starting point for setting our majorising function M(·). We set $M(x_0) = p(x_0)$. Now our algorithm will set the values of M(·), first for $x > x_0$ and then for $x < x_0$. At each step it will be required to calculate $p'(x)$, for which any simple numerical differentiation method can be used.

So for $x > x_0$ do the following:

1. Find $x_{+n}$ from $x_{+(n-1)}$, n = 1, 2, ..., so that the area $|x_{+n} - x_{+(n-1)}| \times p(x_{+(n-1)})$ equals a pre-defined small value δ > 0.

2. Depending on the values of $p'(x_{+(n-1)})$ and $p'(x_{+n})$ do one of the following:

   (a) If $p'(x_{+(n-1)}) < 0$ and $p'(x_{+n}) < 0$, then set $M(x_{+(n-1)}) = p(x_{+(n-1)})$.

   (b) If $p'(x_{+(n-1)}) > 0$ and $p'(x_{+n}) > 0$, then set $M(x_{+(n-1)}) = p(x_{+n})$.

   (c) If $p'(x_{+(n-1)}) > 0$ and $p'(x_{+n}) < 0$, then set $M(x_{+(n-1)})$ as the minimum of the following two terms:

       $p(x_{+(n-1)}) + |x_{+n} - x_{+(n-1)}| \times p'(x_{+(n-1)})$

       $p(x_{+n}) + |x_{+n} - x_{+(n-1)}| \times p'(x_{+n})$.

   (d) Else set $M(x_{+(n-1)})$ as the maximum of $p(x_{+(n-1)})$ and $p(x_{+n})$.

3. If $p(x_{+n}) < \epsilon$ for a pre-specified small ε > 0, then

   (a) If $p'(x_{+(n-1)}) < 0$, set

   $$
   M(x) = p(x_{+n}) \times e^{-(x - x_{+n})} \qquad x > x_{+n}
   \tag{B.140}
   $$

   (b) If $p'(x_{+(n-1)}) > 0$, set

   $$
   M(x) =
   \begin{cases}
   p(x_{+(n-1)}) \times e^{(x - x_{+(n-1)})} & x_{+(n-1)} < x \le x_{+n} \\
   0 & x > x_{+n}
   \end{cases}
   \tag{B.141}
   $$

   and stop. Else continue.

Similarly for $x < x_0$ do the following:

1. Find $x_{-n}$ from $x_{-(n-1)}$, n = 1, 2, ..., so that the area $|x_{-n} - x_{-(n-1)}| \times p(x_{-(n-1)})$ equals a pre-defined small value δ > 0.

2. Depending on the values of $p'(x_{-(n-1)})$ and $p'(x_{-n})$ do one of the following:

   (a) If $p'(x_{-(n-1)}) < 0$ and $p'(x_{-n}) < 0$, then set $M(x_{-n}) = p(x_{-n})$.

   (b) If $p'(x_{-(n-1)}) > 0$ and $p'(x_{-n}) > 0$, then set $M(x_{-n}) = p(x_{-(n-1)})$.

   (c) If $p'(x_{-(n-1)}) < 0$ and $p'(x_{-n}) > 0$, then set $M(x_{-n})$ as the minimum of the following two terms:

       $p(x_{-(n-1)}) + |x_{-n} - x_{-(n-1)}| \times p'(x_{-(n-1)})$

       $p(x_{-n}) + |x_{-n} - x_{-(n-1)}| \times p'(x_{-n})$.

   (d) Else set $M(x_{-n})$ as the maximum of $p(x_{-(n-1)})$ and $p(x_{-n})$.

3. If $p(x_{-n}) < \epsilon$ for a pre-specified small ε > 0, then

   (a) If $p'(x_{-(n-1)}) > 0$, set

   $$
   M(x) = p(x_{-n}) \times e^{(x - x_{-n})} \qquad x < x_{-n}
   \tag{B.142}
   $$

   (b) If $p'(x_{-(n-1)}) < 0$, set

   $$
   M(x) =
   \begin{cases}
   p(x_{-(n-1)}) \times e^{-(x - x_{-(n-1)})} & x_{-n} < x \le x_{-(n-1)} \\
   0 & x < x_{-n}
   \end{cases}
   \tag{B.143}
   $$

   and stop. Else continue.

For values of x where M(·) is not defined above, define M(x) = M(y) where y is the largest value less than x for which M(·) is defined.

It is easy to verify that the integral of M(·) defined above is easily invertible and that M(x) > p(x) wherever x does not belong to the tail region. For the tails, we will assume that the scaled exponential function majorises p(x). The exponential approximation of the tails is satisfactory for most distributions which will be of interest to us. However, this approach is not adequate for dealing with distributions with fat tails.

Now that we have obtained M(·) for a general p(·), we can use the rejection method to generate random deviates from the general distribution with density function p(·).

As an example, let us consider the Exp(1) distribution. If we start from $x_0 = 1$ and set δ = 0.10, Figure B.34 shows how the majorising function M(x) provides an upper envelope for the exponential density p(x). If we change δ to 0.01, the new majorising function M(x) is given in Figure B.35. Clearly, with δ = 0.01, the majorising function is a very close approximation of the Exp(1) density function.

The important point to note here is that the only difference between simulating random deviates with δ = 0.10 and with δ = 0.01 lies in the efficiency of the method. It is quicker to compute the majorising function if δ is large.
However, this might mean generating a significantly large number of uniform deviates to get a single random deviate from the target distribution. On the other hand, a small δ means a significant amount of time spent on computing M(x), but more efficiency is achieved in terms of actual generation of random deviates. But since M(x) need only be computed once, the following rule of thumb can be used: to generate a large number of random deviates, use a small δ.

For the N(0, 1) distribution, a similar exercise leads to the majorising functions given in Figures B.36 and B.37.

The density estimates based on the simulated 50,000 random deviates obtained from the Exp(1) and N(0, 1) distributions using the rejection method with δ = 0.01 are given in Figures B.38 and B.39 respectively.

Figure B.34: The Exp(1) density and the majorising function with δ = 0.10. (Axes: x, Density.)

Figure B.35: The Exp(1) density and the majorising function with δ = 0.01. (Axes: x, Density.)

Figure B.36: The N(0,1) density and the majorising function with δ = 0.10. (Axes: x, Density.)

Figure B.37: The N(0,1) density and the majorising function with δ = 0.01. (Axes: x, Density.)

Figure B.38: Density estimates based on the simulated 50,000 random deviates from Exp(1). (Axes: x, Density.)

Figure B.39: Density estimates based on the simulated 50,000 random deviates from N(0, 1). (Axes: x, Density.)

Bibliography

Arrow, K. (1963). Uncertainty and the welfare economics of medical care. American Economic Review, 53(5), 941–973.

Bentham, J. (1789). An introduction to the principles of morals and legislation. Oxford University Press (1996).
Binmore, K. (1991). Fun and games: A text on game theory. Houghton Mifflin.

Botto, L. and Khoury, M. (2001). Commentary: Facing the challenge of gene-environment interaction: The two-by-four table and beyond. American Journal of Epidemiology, 153, 1016–1020.

Breslow, N. and Day, N. (1980). Statistical Methods in Cancer Research: Volume 1 – The analysis of case-control studies. International Agency for Research on Cancer.

Brønnum-Hansen, H., Jørgensen, T., Davidsen, M., Madsen, M., Osler, M., Gerdes, L. and Schroll, M. (2001). Survival and cause of death after myocardial infarction: The Danish MONICA study. Journal of Clinical Epidemiology, 54, 1244–1250.

Capewell, S., Livingston, B., MacIntyre, K., Chalmers, J., Boyd, J., Finlayson, A., Redpath, A., Pell, J., Evans, C. J. and McMurray, J. (2000). Trends in case-fatality in 117 718 patients admitted with acute myocardial infarction in Scotland. European Heart Journal, 21, 1833–1840.

Darwin, C. (1859). On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. John Murray, Albemarle Street, London.

Darwin, E. (1794). Zoönomia: or the laws of organic life. J. Johnson.

Daykin, C., Akers, D., Macdonald, A., McGleenan, T., Paul, D. and Turvey, P. (2003). Genetics and insurance - some social policy issues (with discussions). British Actuarial Journal, 9, 787–874.

Doherty, N. and Posey, L. (1998). On the value of a checkup: Adverse selection, moral hazard and the value of information. Journal of Risk and Insurance, 65(2), 189–211.

Doherty, N. and Thistle, P. (1996). Adverse selection with endogenous information in insurance markets. Journal of Public Economics, 63, 83–102.

Eisenhauer, J. and Ventura, L. (2003). Survey measures of risk aversion and prudence. Applied Economics, 35, 1477–1484.

Goldberg, R., McCormick, D., Gurwitz, J., Yarzebsky, J., Lessard, D. and Gore, J. (1998). Age-related trends in short- and long-term survival after acute myocardial infarction: A 20-year population-based perspective (1975-1995). American Journal of Cardiology, 82, 1311–1317.

Gutiérrez, C. and Macdonald, A. (2003). Adult polycystic kidney disease and critical illness insurance. North American Actuarial Journal, 7(2), 93–115.

Gutiérrez, C. and Macdonald, A. (2004). Huntington's disease, critical illness insurance and life insurance. Scandinavian Actuarial Journal, pages 279–313.

Hoy, M. and Polborn, M. (2000). The value of genetic information in the life insurance market. Journal of Public Economics, 78, 235–252.

Hoy, M. and Witt, J. (2005). Welfare effects of banning genetic information in the life insurance market: The case of BRCA1/2 genes. Technical report, University of Guelph Discussion Paper 2005-5.

Jones, F. (2005). The effects of taxes and benefits on household income, 2004/05. Technical report, Office for National Statistics.

Khoury, M. and Flanders, W. (1996). Nontraditional epidemiologic approaches in the analysis of gene-environment interaction: case-control studies with no controls. American Journal of Epidemiology, 144, 207–213.

Lewin, B. (2000). Genes VII. Oxford University Press.

Macdonald, A. (2003). Moratoria on the use of genetic tests and family history for mortgage-related life insurance. British Actuarial Journal, 9(1), 217–237.

Macdonald, A. (2004). Genetics and insurance management. In A. Sandström (ed.) The Swedish Society of Actuaries: One Hundred Years. Svenska Aktuarieföreningen, Stockholm.

Macdonald, A. and Pritchard, D. (2000). A mathematical model of Alzheimer's disease and the ApoE gene. ASTIN Bulletin, 30, 69–110.

Macdonald, A. and Pritchard, D. (2001). Genetics, Alzheimer's disease and long-term care insurance. North American Actuarial Journal, 5(2), 54–78.

Macdonald, A., Pritchard, D. and Tapadar, P. (2006). The impact of multifactorial genetic disorders on critical illness insurance: A simulation study based on UK Biobank. To appear in ASTIN Bulletin.

Macdonald, A. and Tapadar, P. (2006). Multifactorial genetic disorders and adverse selection: Epidemiology meets economics. Submitted.

Macdonald, A., Waters, H. and Wekwete, C. (2003a). The genetics of breast and ovarian cancer I: A model of family history. Scandinavian Actuarial Journal, pages 1–27.

Macdonald, A., Waters, H. and Wekwete, C. (2003b). The genetics of breast and ovarian cancer II: A model of critical illness insurance. Scandinavian Actuarial Journal, pages 28–50.

McCormick, A., Fleming, D. and Charlton, J. (1995). Morbidity Statistics from General Practice: Fourth National Study 1991-1992. Series MB5 No. 3. Washington, D.C.: OPCS, Government Statistical Service.

Mendel, G. (1866). Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn (Proceedings of the Natural History Society of Brünn), 4, 3–47.

Meyer, D. and Meyer, J. (2005). Risk preferences in multi-period consumption models, the equity premium puzzle and habit formation utility. Journal of Monetary Economics, 52, 1497–1515.

Miettinen, O. (1970). Estimation of relative risk from individually matched series. Biometrics, 26, 75–86.

Mill, J. (1879). Utilitarianism. Longmans, Green and Co.

Mossin, J. (1968). Aspects of rational insurance purchasing. Journal of Political Economy, 76(4), 553–568.

Nash, J. (1950). The bargaining problem. Econometrica, 18(2), 155–162.

Norberg, R. (1995). Differential equations for moments of present values in life insurance. Insurance: Mathematics and Economics, 17, 171–180.

Norstad, J. (1999). An introduction to utility theory. Unpublished manuscript at http://homepage.mac.com/j.norstad.

Pasternak, J. (1999). An introduction to human molecular genetics: mechanisms of inherited diseases. Fitzgerald Science Press.

Pratt, J. (1964). Risk aversion in the small and in the large. Econometrica, 32, 122–136.

Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. (2002). Numerical Recipes in C++. Cambridge University Press.

Ridley, M. (1999). Genome: The autobiography of a species in 23 chapters. Fourth Estate.

Robins, J., Greenland, S. and Breslow, N. (1986). A general estimator for the variance of the Mantel-Haenszel odds ratio. American Journal of Epidemiology, 124, 719–723.

Rothschild, M. and Stiglitz, J. (1976). Equilibrium in competitive insurance markets: An essay on the economics of imperfect information. The Quarterly Journal of Economics, 90(4), 630–649.

Strachan, T. and Read, A. (1999). Human Molecular Genetics 2. BIOS Scientific Publishers Ltd.

Sudbery, P. (1998). Human molecular genetics. Addison Wesley Longman Limited.

HM Treasury (2005). Economy charts and tables. Technical report, Pre-Budget Report.

Tunstall-Pedoe, H., Kuulasmaa, K., Mähönen, M., Tolonen, H., Ruokokoski, E. and Amouyel, P. (1999). Contribution of trends in survival and coronary event rates to changes in coronary heart disease mortality: 10 year results from 37 WHO MONICA project populations. The Lancet, 353, 1547–1557.

Von Neumann, J. and Morgenstern, O. (1944). Theory of games and economic behavior. Princeton University Press.

Watson, J. and Crick, F. (1953). Molecular structure of nucleic acids. Nature, 171, 737–738.

Woodward, M. (1999). Epidemiology: Study Design and Data Analysis. Chapman & Hall.

Xie, D. (2000). Power risk aversion utility functions. Annals of Economics and Finance, 1, 265–282.