THE IMPACT OF MULTIFACTORIAL GENETIC
DISORDERS ON LONG-TERM INSURANCE
By
Pradip Tapadar
Submitted for the Degree of
Doctor of Philosophy
at Heriot-Watt University
on Completion of Research in the
School of Mathematical and Computer Sciences
January 2007.
This copy of the thesis has been supplied on the condition that anyone who consults
it is understood to recognise that the copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without
the prior written consent of the author or the university (as may be appropriate).
I hereby declare that the work presented in this thesis was carried out by myself at Heriot-Watt University,
Edinburgh, except where due acknowledgement is made,
and has not been submitted for any other degree.
Pradip Tapadar (Candidate)
Professor Angus S. Macdonald (Supervisor)
Date
Contents
Acknowledgements
Abstract
Introduction
1 Genetics and Insurance
1.1 Introduction
1.2 Genes
1.3 Genetic Disorders and Insurance
1.3.1 Huntington’s Disease
1.3.2 Alzheimer’s Disease
1.3.3 Cancer
1.3.4 Cardiovascular disease
1.4 Genetics and Insurance Regulations
1.5 The UK Biobank Project
1.6 A UK Biobank Simulation Model
2 A Model for Heart Attack
2.1 Specification of the Model
2.2 The Heart Attack Transition Intensity
2.3 Mortality After First Heart Attacks
2.3.1 Literature Review
2.3.2 Data
2.3.3 Fitting a Parametric Function
2.3.4 Discussion of the Fitted Model
2.4 Mortality Before First Heart Attacks
3 Gene-Environment Interaction
3.1 Definition of Strata: A Simple Example
3.2 A Sample Realisation of UK Biobank
3.3 Epidemiological Analysis
3.4 An Actuarial Investigation
3.5 Premium Rating for Critical Illness Insurance
3.5.1 A Critical Illness Model
3.5.2 Premium Rating for Critical Illness Insurance
4 UK Biobank Simulation Results
4.1 Varying the Genetic and Environment Model
4.2 Outcomes of 1,000 Simulations: The Base Scenario
4.3 A Measure of Confidence
4.4 Results
4.5 Conclusions
5 Adverse Selection and Utility Theory
5.1 Risk and Insurance
5.2 Underwriting Risk
5.3 Multifactorial Disorders
5.4 Literature Review
5.5 Adverse Selection
5.6 Utility of Wealth
5.7 Coefficients of Risk-aversion
5.8 Families of Utility Functions
5.9 Estimates of Absolute and Relative Risk-aversion
6 Adverse Selection in a 2-state Insurance Model
6.1 A Simple Gene-environment Interaction Model
6.2 Single Premiums
6.3 Threshold Premium
6.4 The Additive Epidemiological Model
6.5 Immunity From Adverse Selection
6.6 The Multiplicative Epidemiological Model
7 Adverse Selection in a Critical Illness Insurance Model
7.1 A Heart Attack Model
7.2 Threshold Premium for Critical Illness Insurance
7.3 Premium Rates for Critical Illness Insurance
7.4 High Relative Risks
7.5 Conclusions
8 Conclusions
8.1 UK Biobank Simulation Study
8.2 Adverse Selection Issues
A Epidemiology
A.1 Introduction
A.2 Measuring risks
A.3 Models of Disease Association
A.4 Relative Risk and Odds Ratio
A.5 Analysis of Grouped Data
A.6 Analysis of Matched Studies
A.7 Effects of Combined Exposures
B Numerical Methods
B.1 Differential Equations
B.1.1 Introduction
B.1.2 Euler Method
B.1.3 Runge-Kutta Method
B.2 Random Numbers
B.2.1 Introduction
B.2.2 Uniform Deviates
B.2.3 The Transformation Method
B.2.4 The Rejection Method
References
List of Tables
2.1 Survival probabilities after first heart attack.
2.2 Parameter estimates.
2.3 Odds of dying within first 30 days, one year and two years following a first heart attack.
2.4 Adjusted odds ratios and the corresponding 95% confidence intervals of dying within first 30 days, one year and two years following a first heart attack according to Goldberg et al. (1998).
3.5 The factor ρs, in Equation (3.14), for each gene-environment combination.
3.6 The multipliers k s × ρuv for each stratum.
3.7 The true relative risks for each stratum, relative to the baseline ge stratum.
3.8 The simulated life histories of the first 20 (of 500,000) individuals showing their genders, exposure to environmental factors, genotypes and the times and types of all transitions made within 10 years.
3.9 Number of individuals in each state at the end of the 10-year follow-up period.
3.10 Odds ratios with respect to the ge stratum as baseline, based on a 1:5 matching strategy using all cases and 5-year age groups. Approximate 95% confidence intervals are shown in brackets. There were no cases among females age 45–49 in stratum GE.
3.11 The age-adjusted odds ratios calculated for both males and females.
3.12 The estimated multipliers cs for each stratum.
3.13 28-Day mortality rates, $q^m_{01}(x) = 1 - p^m_{01}(x)$, for males following heart attacks.
3.14 The true critical illness insurance premiums for different strata as a percentage of those for stratum ge.
3.15 The actuary’s estimated critical illness insurance premiums for different strata as a percentage of those for stratum ge.
4.16 The model parameters for different scenarios. Odds ratios are also shown.
4.17 The correlation matrix of the strata-specific premium rates for males aged 45 and policy term 15 years under the Base scenario, all cases included.
4.18 The correlation matrix of the premium ratings for males aged 45 and policy term 15 years under the Base scenario, all cases included.
4.19 The measure of overlap O for CI insurance premium ratings for males aged 45, with policy term 15 years, for different scenarios.
4.20 The measure of overlap O for CI insurance premium ratings for females aged 45 with policy term 15 years, for different scenarios.
4.21 The measure of overlap O for CI insurance premium ratings for males aged 45, with policy term 15 years, for different scenarios and a 1:1 matching strategy.
4.22 The number of simulations rejected due to the inability to calculate the odds ratios for a 1:1 matching strategy.
6.23 The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.5 and an additive model.
6.24 The proportions ω exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk k, using different utility functions.
6.25 The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.9 and an additive model.
6.26 The relative risk k above which persons in stratum ge with initial wealth W = £100,000 will not buy insurance, using ω = 0.9 and a multiplicative model.
7.27 The premium rates of critical illness contracts of duration 15 years.
7.28 P† for males, which solves Z(P) = 0, for different combinations of utility functions and losses, using initial wealth W = £100,000.
7.29 P† for females, which solves Z(P) = 0, for different combinations of utility functions and losses, using initial wealth W = £100,000.
7.30 The population average premium rate for CI insurance, P0, as if heart attack risk were absent (λ12 = 0).
7.31 The relative risk k above which males of different ages in stratum ge with initial wealth W = £100,000 will not buy critical illness insurance policies of term 15 years, where ω = 0.9.
7.32 The relative risk k above which females of different ages in stratum ge with initial wealth W = £100,000 will not buy critical illness insurance policies of term 15 years, where ω = 0.9.
7.33 The loss L0 in £,000 above which adverse selection cannot occur. Initial wealth W = £100,000.
7.34 q̄, the probability that a healthy person aged x has a heart attack before age x + t, for policy duration t = 15 years.
7.35 The proportions ω exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk k, using different utility functions, for males purchasing CI insurance.
7.36 The proportions ω exposed to each low-risk factor above which persons in the baseline stratum will buy insurance at the average premium regardless of the relative risk k, using different utility functions, for females purchasing CI insurance.
A.37 List of odds ratios obtained from the 2 × 4 table in Figure A.33.
A.38 Other measures based on the 2 × 4 table in Figure A.33.
List of Figures
2.1 A 4-state heart attack model.
2.2 The transition intensity of all first heart attacks, by gender.
2.3 Subset of the model in Figure 2.1 to study survival after heart attacks.
2.4 The plots of the data from Table 2.1.
2.5 The plots of $f(t) = 1/(1 + t^a)$ against t for values of a = 0.25, 0.50, 1.00, 2.00, 4.00.
2.6 The plots of survival probabilities, P22(x, x+t), against duration after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years.
2.7 The plots of transition intensities, λ24(x, t), against duration after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years.
2.8 Graphs of λ24(x, t), assigned to representative ages for each age group, and the force of mortality of the ELT15 life tables.
2.9 The plots of survival probabilities of men aged 50, 60, 70, 80 and 90 following ELT15.
2.10 The plots of survival probabilities of women aged 50, 60, 70, 80 and 90 following ELT15.
2.11 The plots of survival probabilities, of individuals aged 50, 60, 70, 80 and 90, over the first 30 days after a first heart attack.
2.12 The plots of survival probabilities of individuals aged 50, 60, 70, 80 and 90, who survived the first 30 days after a first heart attack.
2.13 4-state heart attack model - Grouping of states.
2.14 A 2-state mortality model.
2.15 The graph of the integrand in Equation 2.11.
2.16 The graph of the integrand in Equation 2.13.
2.17 Transition intensities of non-heart-attack deaths plotted along with ELT15 for both males and females.
3.18 A full critical illness model for gender s.
4.19 Scatter plots of CI insurance premium rates for strata gE, Ge and GE versus that of ge under the Base scenario for males aged 45 and policy term 15 years.
4.20 The scatter plots of the premium ratings Ge/ge and GE/ge versus gE/ge and the corresponding density plots for males aged 45 and policy term 15 years under the Base scenario, all cases included.
4.21 Marginal densities of premium ratings in the Base scenario (males) with different numbers of cases in the case-control study.
4.22 The empirical cumulative distribution function of the premium ratings gE/ge, Ge/ge and GE/ge for males aged 45 and policy term 15 years under the Base scenario.
4.23 Marginal densities of premium ratings in different scenarios (males), with 5,000 cases in the case-control study.
5.24 Utility of wealth for a risk averse individual.
6.25 A two state model.
7.26 A full critical illness model.
7.27 The ratio of heart attack transition intensity to total critical illness transition intensity, by gender.
A.28 A schematic diagram of a case-control study.
A.29 A 2-state model.
A.30 A 2 × 2 table for stratum k with corresponding probabilities.
A.31 A 2 × 2 table with data for stratum k.
A.32 The types of table for each case-control pair in a 1:1 matching.
A.33 A 2 × 4 table with data for stratum k.
B.34 The Exp(1) density and the majorising function with δ = 0.10.
B.35 The Exp(1) density and the majorising function with δ = 0.01.
B.36 The N(0,1) density and the majorising function with δ = 0.10.
B.37 The N(0,1) density and the majorising function with δ = 0.01.
B.38 Density estimates based on the simulated 50,000 random deviates from Exp(1).
B.39 Density estimates based on the simulated 50,000 random deviates from N(0,1).
Acknowledgements
First of all, I would like to thank my supervisor, Professor Angus Macdonald, for
his continuous support, guidance and encouragement for the entire duration of this
project. However busy, he always found time to meet regularly and discuss my work.
I found his constructive criticisms and eye for technical detail absolutely invaluable
throughout the course of the study. I would also like to thank Dr Delme Pritchard
for his suggestions and advice for the first half of the thesis.
This work was carried out at the Genetics and Insurance Research Centre at
Heriot-Watt University. I would like to thank the sponsors for funding, and members
of the Steering Committee for helpful comments. It has also been a pleasure to work
with my colleagues: Lu Li, Tunde Akodu, Laura MacCalman and Tushar Chatterjee.
My parents and my sister have encouraged and inspired me to pursue knowledge
to the best of my abilities all through my life. Without their love, support and
guidance, I would not have come this far. A special thanks to Bruce Porteous, a
guide and a friend, for his enthusiastic support.
Finally, I dedicate this thesis to my wife, Vaishnavi, for her unfailing love, patience and support. At my insistence, she has had to endure reading numerous
versions of the thesis at various stages of development. She provided me unconditional support, without which this thesis would not have been a reality. Thank
you.
Abstract
Rapid advances in genetic epidemiology and the setting up of large-scale cohort
studies, like the UK Biobank project, have shifted the focus from severe, but rare,
single gene disorders to less severe, but common, multifactorial disorders. This will
lead to the discovery of genetic risk factors for common diseases of major importance in insurance underwriting. Given this backdrop, we have two specific aims
for this thesis. In the first half of the thesis (also the subject matter of Macdonald
et al. (2006)), we analyse the impact of results emerging out of UK Biobank on the
insurance industry. In the second half (subject matter of Macdonald and Tapadar
(2006)), we consider the related adverse selection issues.
The UK Biobank project is a large-scale investigation of the combined effects of
genotype and environmental exposures on the risk of common diseases. It is intended
to recruit 500,000 subjects aged 40–69, to obtain medical histories and blood samples
from them at outset, and to follow them up for at least 10 years. This will have a
major impact on our knowledge of multifactorial genetic disorders, rather than the
rare but severe single-gene disorders that have been studied to date. The question
arises, what use may insurance companies make of this knowledge, particularly if
genetic tests can identify persons at different risk? We describe here a simulation
study of the UK Biobank project. We specify a simple hypothetical model of genetic
and environmental influences on the risk of heart attack. A single simulation of UK
Biobank consists of 500,000 life histories over 10 years; we suppose that case-control
studies are carried out to estimate age-specific odds ratios, and that an actuary uses
these odds ratios to parameterise a model of critical illness insurance. From a large
number of such simulations we obtain sampling distributions of premium rates in
different strata defined by genotype and environmental exposure. We conclude that
the ability of such a study reliably to discriminate between different underwriting
classes is limited, and depends on large numbers of cases being analysed.
As is the situation now in many countries, if genetic information continues to be
treated as private, adverse selection becomes possible. But it should occur only if
the individuals at lowest risk obtain lower expected utility by purchasing insurance
at the average price than by not insuring. We explore where this boundary may lie,
using a simple 2 × 2 gene-environment interaction model of epidemiological risk, in a
simplified 2-state insurance model and in a more realistic model of heart-attack risk
and critical illness insurance. Adverse selection does not appear unless purchasers
are relatively risk-seeking (compared with a plausible parameterisation) and insure
a small proportion of their wealth; or unless the elevated risks implied by genetic
information are implausibly high. In many cases adverse selection is impossible
if the low-risk stratum of the population is large enough. These observations are
strongly accentuated in the critical illness model by the presence of risks other than
heart attack, and the constraint that differential heart-attack risks must agree with
the overall population risk. We find no convincing evidence that adverse selection
is a serious insurance risk, even if information about multifactorial genetic disorders
remains private.
Introduction
Much of human genetics is concerned with studying the genetic contribution to
diseases, and this leads to a profound distinction between the single-gene disorders
and the multifactorial disorders.
(a) Single-gene disorders are caused, as their name suggests, by a defect in a single
gene. Because most genes are inherited in a simple way according to Mendel’s
laws, these diseases show characteristic patterns of inheritance from one generation to the next, known to geneticists and underwriters alike as a ‘family
history’. Single-gene disorders are quite rare but often severe.
(b) Multifactorial disorders are (mostly) common diseases, such as coronary heart
disease and cancers, whose onset or progression may be influenced by variations
in several genes, acting in concert with environmental differences. The effect
is likely to be quite slight, conferring an altered predisposition to the disease
rather than a radically different risk.
Most genetic epidemiology has, until now, concentrated on single-gene disorders.
One reason is that the clear patterns of Mendelian inheritance identified affected
families long before the direct examination of DNA, and location of the relevant
genes, became possible. So when these tools did emerge in the 1990s, geneticists
knew where to look; affected families were studied, genes were identified, and the
key epidemiological parameters were estimated. The parameter of most interest to
actuaries is the age-related penetrance, which is the probability that a person who
carries a risky version of the gene will have suffered onset of the disease by age x.
It is entirely analogous to the life table probability ${}_{x}q_{0}$. (Often, the risky versions
of the gene are called ‘mutations’, and a person carrying one is called a ‘mutation
carrier’ or just ‘carrier’.)
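To make the analogy with the life table concrete, the following is a minimal sketch, not taken from the thesis, of how a penetrance could be computed from an onset intensity as one minus the exponential of the negative integrated intensity. The function names and the intensity `carrier_hazard` are hypothetical and chosen purely for illustration.

```python
import math

def penetrance(hazard, x, steps=10_000):
    """Probability of onset by age x: 1 - exp(-integral_0^x hazard(t) dt).

    The integral is approximated with the trapezoidal rule; `hazard`
    is any non-negative function of age.
    """
    h = x / steps
    total = 0.5 * (hazard(0.0) + hazard(x))
    for i in range(1, steps):
        total += hazard(i * h)
    return 1.0 - math.exp(-total * h)

# Hypothetical onset intensity for mutation carriers (illustration only,
# not an estimate for any real disorder).
def carrier_hazard(age):
    return 0.0005 * math.exp(0.08 * age)

for age in (30, 50, 70):
    print(age, round(penetrance(carrier_hazard, age), 4))
```

With a constant intensity the result reduces to the familiar exponential form, which gives a quick check of the numerical integration.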
Studies of affected families are by definition retrospective in nature; families are
studied because they are known to be affected. Retrospective studies are subject to
uncontrolled sources of bias, precisely because they are based on a non-randomly
selected sample of the population; so they are, if possible, avoided in favour of
prospective studies, in which a properly randomised sample of healthy subjects is
followed forwards in time. Despite this health warning, retrospective studies of
single-gene disorders have been carried out for reasons of convenience, cost and necessity: the ready availability of known affected families was convenient and made
data collection relatively cheap; and the rarity of single-gene disorders made prospective studies impractical. Moreover, a prospective study would take many years to
yield results. Another consequence of the rarity of most single-gene disorders is that
most studies have had quite small sample sizes, but if the penetrance is high enough
this is tolerable. These studies have successfully led to many gene discoveries and a
lot of progress has been made in understanding a number of single-gene disorders.
Multifactorial disorders, which are influenced by more than one gene or by interactions between genes and the environment, are not so well-studied. Many disorders,
including cancer, heart disease, diabetes and Alzheimer’s disease, are believed to be
caused or influenced by complex interactions between multiple genes, environment
and lifestyle. The clear patterns of Mendelian inheritance are lost, and any familial
clustering of disease that may be observed could just as easily be caused by shared
environment as by shared genes. Therefore, there is no existing pool of known
affected families that can be studied straightaway. And, because the influence of
genetic variation may be slight (low penetrance), large samples will be needed to
detect such influence with any reliability.
At the risk of oversimplifying a little, single-gene disorders represent the genetical
research of the past, and multifactorial disorders represent the genetical research of
the future. Progress will need studies that are large-scale, prospective, and long-term (and therefore very expensive). These studies must capture both genetic and
environmental variation (and interactions) and relate them to the risks of common
diseases. This is extremely ambitious.
The proposed UK Biobank project aims to achieve this. This project will recruit
500,000 individuals aged 40 to 69, chosen as randomly as possible from the general
UK population, and collect data on them over a period of 10 years. We will discuss
the main features of UK Biobank in Section 1.5. A key point is that UK Biobank
aims only to collect data, not to analyse it. Its data will, in due course, be made
available to researchers interested in particular genes and particular diseases, who
will have to obtain separate funding for their studies in the usual way. This is
sensible because it is impossible to predict at outset just what combinations of
genes, environment and disease it will be most fruitful to study. Nevertheless, it
is necessary to have in mind the kinds of statistical studies that may, in future, be
carried out, so that UK Biobank can be set up to capture data of the correct form.
The presumption is that most studies will be case-control studies. We outline the
basics of case-control studies in Appendix A.
Given its size and significance, it is important to study the kind of results we
might expect to emerge out of UK Biobank. Our particular interest is in the implications of UK Biobank for insurance. There has been a lot of debate, often heated,
concerning genetics and insurance in the past 10 years, mainly focussed on single-gene disorders. We refer to Daykin et al. (2003) or Macdonald (2004) as sources.
It seems plausible that awareness of genetic issues will be heightened by enrolling
500,000 people into a high-profile genetic study. Insofar as insurance questions arise,
answers obtained from past actuarial research, based on single-gene disorders, may
be wholly inapplicable. But, since the single-gene disorders provide all the easily
grasped examples and paradigms, there is a risk that these examples and paradigms
will be grafted onto UK Biobank, however inappropriately, by the media if not by
the genetics community. It will then be unfortunate that, by its nature, UK Biobank
will not provide the evidence to refute such errors for 5–10 years.
We devote the first half of the thesis to modelling UK Biobank itself, so that
before a single person has been recruited, or gene sequenced, we may quantify the
implications of its outcomes for insurance. We choose critical illness (CI) insurance
as the simplest type of coverage, because the insured event is generally onset of
disease, and we need not model post-onset events (although as we shall see in Section
2, this is not entirely true in parameterising the model). We choose heart attack
(myocardial infarction) as the disease of interest, because this will certainly be a
major target of studies using UK Biobank data. Our approach is simple: simulate
500,000 random life histories, given an assumed model of genetic and environmental
influences on the hazard rate of heart attack. Then we may analyse these simulated
data just as an epidemiologist or an actuary may be expected to.
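A much-simplified sketch of such a simulation is given below. It assumes a constant baseline hazard, four gene-environment strata with illustrative relative risks, and no competing mortality. All parameter values and names are hypothetical, and the model developed in later chapters is age-dependent and has more states.

```python
import random

# Hypothetical relative risks for the four gene-environment strata
# (illustrative values only, not the parameters used in the thesis).
RELATIVE_RISK = {"ge": 1.0, "gE": 1.3, "Ge": 1.3, "GE": 1.7}

def simulate_cohort(n, base_hazard=0.005, p_gene=0.1, p_env=0.1,
                    follow_up=10.0, seed=1):
    """Simulate n life histories as (stratum, observed time, had_event)."""
    rng = random.Random(seed)
    histories = []
    for _ in range(n):
        gene = "G" if rng.random() < p_gene else "g"
        env = "E" if rng.random() < p_env else "e"
        stratum = gene + env
        rate = base_hazard * RELATIVE_RISK[stratum]
        onset = rng.expovariate(rate)      # time to first heart attack
        had_event = onset < follow_up      # otherwise censored at follow_up
        histories.append((stratum, min(onset, follow_up), had_event))
    return histories

# A smaller cohort than UK Biobank's 500,000, to keep the run short.
cohort = simulate_cohort(50_000)
for s in RELATIVE_RISK:
    size = sum(1 for h in cohort if h[0] == s)
    cases = sum(1 for h in cohort if h[0] == s and h[2])
    print(s, size, cases)
```

The per-stratum counts of individuals and cases printed at the end are the raw material from which a case-control study could then be drawn.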
At this stage a further complication appears, one that is all too familiar to actuarial researchers who have modelled single-gene disorders.
Actuaries almost never have access to the original data upon which genetic studies are based.
In the case of UK Biobank, Section 5.2 of the draft protocol
(www.ukbiobank.ac.uk/docs/draft protocol.pdf) says: “Data from the project
will not be accessible to the insurance industry or any other similar body.” This
means that actuarial researchers will have to rely on the published outcomes of
medical or epidemiological research projects that use the UK Biobank data. The
ideal, given the models actuaries typically use for pricing and reserving, would be
age-dependent onset rates or penetrances, corresponding to $\mu_x$ or $q_x$ in a life-table
study. Unfortunately, this is far in excess of what is usually published in a medical
study, because the questions addressed by such studies can often be answered by
much simpler statistics. And, it must be said, the estimation of $\mu_x$ or $q_x$ is very demanding of the data. Since we expect case-control studies to be the most common
approach to UK Biobank, we must take account of this in analysing our simulated
data. We may not, realistically, assume that the actuary can analyse directly the
500,000 simulated life histories. Instead, an epidemiologist must first carry out a
case-control study and publish the results, which most often will be expressed as
odds ratios (see Appendix A). Then the actuary must take these odds ratios and,
using whatever approximate methods come to hand, estimate onset rates or penetrances suitable for use in an actuarial model. We will model this process, with two
results:
(a) We will be able to estimate the impact on CI insurance premiums of representative multifactorial modifiers of heart attack risk.
(b) Having simulated the data from a known model of our own choosing, we can
assess the seriousness of the errors that must be made, in parameterising an
actuarial model from published odds ratios rather than from the raw data.
As mentioned before, previous actuarial studies have done exactly that (see
Macdonald and Pritchard (2000) for an example), but only in the context of
relatively high penetrances. We will be interested to see if robust actuarial
modelling of relatively low-penetrance disorders is possible using published case-control studies.
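To make the quantity at the centre of this two-stage process concrete, the following minimal Python sketch computes an odds ratio from a 2 × 2 table of case-control counts, together with an approximate 95% interval based on the usual large-sample standard error of the log odds ratio. The counts are hypothetical and the function name is our own; this is an illustration only, not the analysis carried out later in the thesis.

```python
from math import exp, log, sqrt


def odds_ratio(exposed_cases, unexposed_cases,
               exposed_controls, unexposed_controls, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2 x 2 table.

    The interval uses the large-sample standard error of log(OR);
    it assumes that all four counts are positive.
    """
    estimate = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    std_err = sqrt(1 / exposed_cases + 1 / unexposed_cases
                   + 1 / exposed_controls + 1 / unexposed_controls)
    lower = exp(log(estimate) - z * std_err)
    upper = exp(log(estimate) + z * std_err)
    return estimate, lower, upper


# Hypothetical counts: 100 cases (40 exposed) and 100 controls (20 exposed).
estimate, lower, upper = odds_ratio(40, 60, 20, 80)
print(f"OR = {estimate:.2f}, approximate 95% interval ({lower:.2f}, {upper:.2f})")
```

An actuary working only from published results would typically see the estimate and its interval, not the four underlying counts.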
The plan of the first half of the thesis is as follows. After a general introduction
in Section 1.1, we provide a basic overview of genetics in Section 1.2. Section 1.3
gives examples of a few well-known genetic disorders along with reviews of relevant
actuarial literature. Regulatory developments in the UK concerning genetics
and insurance are covered in Section 1.4. In Section 1.5 we describe the main
features of UK Biobank. In Section 1.6, we will introduce our general approach to
simulating UK Biobank. A specific multiple-state model representing heart attack
will be introduced in Section 2.1. The transition intensities underlying the model
will be developed in Sections 2.2–2.4.
In Section 3.1, we will hypothesise a simple 2 × 2 gene-environment interaction
model affecting the risk of heart attack. In Section 3.2, we present (in summary
form) a set of simulated UK Biobank data, namely 500,000 life histories. Then,
we analyse these simulated data, in two stages as described above. First, a model
epidemiologist will carry out a case-control study (actually, we will look at several
different case-control studies that may be carried out). This is presented in Section
3.3. Then, our model actuary will use these ‘published’ figures to construct CI
insurance models allowing for genetic variability and environmental exposures. The
actuarial investigation is discussed in Section 3.4. In Section 3.5, premium rates
based on these critical illness models will be calculated and compared for these
different subgroups.
Despite its great size, UK Biobank is essentially an unrepeatable single sample.
Any estimated quantity based upon its data is subject to the usual statistical sampling error — and a premium rate is just such an estimated quantity. It is to be
hoped that the large samples available from UK Biobank will reduce sampling error
to a low level. In reality, the designers of UK Biobank can only estimate the statistical power of representative case-control studies, which was certainly done before
choosing 500,000 lives as the sample size. We, however, with control over our simulated data, can assess directly the sampling properties of estimates based on UK
Biobank data. In particular, and of direct relevance to the criteria established in
the UK by the Genetics and Insurance Committee (GAIC), we can assess in statistical terms the reliability of CI premium rates based on UK Biobank data. We do
this simply by repeating the simulation of 500,000 life histories as many times as
necessary, and constructing the empirical distributions of derived quantities such as
odds ratios and premiums. This is in Section 4.
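The idea of repeating the simulation to obtain an empirical distribution can be sketched in Python as below. This is a toy illustration under assumptions of our own: a single binary exposure, fixed disease probabilities, a much smaller sample than 500,000 and the whole sample used in place of a sampled case-control design. All parameter values and names are hypothetical, and the addition of 0.5 to each cell is a standard continuity correction that we include only to avoid division by zero.

```python
import random
from statistics import mean, stdev


def simulate_odds_ratio(n, p_exposed, risk_unexposed, risk_exposed, rng):
    """Simulate n lives and return the odds ratio of disease given exposure."""
    exposed_cases = exposed_healthy = unexposed_cases = unexposed_healthy = 0
    for _ in range(n):
        exposed = rng.random() < p_exposed
        risk = risk_exposed if exposed else risk_unexposed
        case = rng.random() < risk
        if exposed and case:
            exposed_cases += 1
        elif exposed:
            exposed_healthy += 1
        elif case:
            unexposed_cases += 1
        else:
            unexposed_healthy += 1
    # Continuity correction: add 0.5 to every cell so the ratio is defined.
    return ((exposed_cases + 0.5) * (unexposed_healthy + 0.5)) / (
        (exposed_healthy + 0.5) * (unexposed_cases + 0.5))


rng = random.Random(2007)  # fixed seed so the sketch is reproducible
estimates = [simulate_odds_ratio(2000, 0.3, 0.05, 0.10, rng)
             for _ in range(100)]
print(f"mean OR = {mean(estimates):.2f}, standard deviation = {stdev(estimates):.2f}")
```

The spread of the simulated estimates is a direct, if crude, measure of the sampling error that a single study of the same size would carry.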
The second half of the thesis addresses the issues relating to adverse selection in
the context of multifactorial disorders. Insurance companies have developed sophisticated underwriting techniques to cope with the problems of adverse selection. The
principle behind underwriting is to identify key risk factors that stratify applicants
into reasonably homogeneous groups, for each of which the appropriate premium
rate can be charged. The risk of death or ill health is affected by, among other
things, age, gender, lifestyle and genotype. However, the use of certain risk factors
is sometimes controversial. In particular, this is true of factors over which individuals have no control, such as genotype. As a result, in many countries a ban has been
imposed, or moratorium agreed, limiting the use of genetic information. In the UK,
GAIC is providing guidance to insurers on the acceptable use of genetic test results.
As discussed earlier, disorders caused by mutations in single genes, which may
be severe and of late onset, but are rare, have been quite extensively studied in
the insurance literature. One reason is that the epidemiology of these disorders is
relatively advanced, because biological cause and effect could be traced relatively
easily. The conclusion has been that single-gene disorders, because of their rarity, do
not expose insurers to serious adverse selection in large enough markets. However,
this conclusion need not be valid for multifactorial disorders. The vast majority
of the genetic contribution to human disease will arise from combinations of gene
varieties (called ‘alleles’) and environmental factors, each of which might be quite
common, and each alone of small influence but together exerting a measurable effect
on the molecular mechanism of a disease. Although the epidemiology of multifactorial disorders is not very advanced, this should make progress in the next 5–10 years
through the very large prospective studies now beginning in several countries, like
the UK Biobank project. If these studies are successful in capturing both genetic
and environmental variations and interactions, and relate them to the risks of common diseases, the genetics and insurance debate will, in the fairly near future, shift
from single-gene to multifactorial disorders.
Any model used to study adverse selection risk must incorporate the behaviour
of the market participants. Most of those applied to single-gene disorders in the
past did so in a very simple and exaggerated way, assuming that the risk implied by
an adverse genetic test result was so great that its recipient would quickly buy life
or health insurance with very high probability. These assumptions were not based
on any quantified economic rationale, but since they led to minimal changes in the
price of insurance this probably did not matter. The same is not true if we try
to model multifactorial disorders. Then ‘adverse’ genotypes may imply relatively
modest excess risk but may be reasonably common, so the decision to buy insurance
is more central to the outcome.
Most research on adverse selection concentrates primarily on providing a proper
economic rationale for the impact, on the insurance market, of genetic tests for,
mainly, rare diseases. In this thesis, we try to bring together plausible quantitative
models for the epidemiology and the economic issues, in respect of more common
disorders, therefore affecting a much larger proportion of the insurer’s customer
base. We wish to find out under what circumstances adverse selection is likely to
occur.
The plan of the second half of the thesis is as follows. In Sections 5.1–5.3, we
provide background information on risk, insurance and underwriting. In Section
5.4, we review existing literature. Adverse selection in the context of multifactorial
disorders is defined in Section 5.5. A basic introduction to utility theory and estimates
of risk-aversion are given in Sections 5.6–5.9.
In Chapter 6, we develop techniques to determine the conditions leading to adverse selection for a 2×2 gene-environment interaction in a simple 2-state insurance
model. We study the effect of additive and multiplicative gene-environment interactions in Sections 6.4 and 6.6 respectively. The rôle played by
population proportion in each risk category is studied in Section 6.5.
In Chapter 7, we extend the results from the 2-state model to a CI insurance
model. We propose a simple model of a multifactorial disorder, with two genotypes
and two levels of environmental exposure, and either additive or multiplicative interactions between them. These factors affect the risk of myocardial infarction (heart
attack), and therefore the theoretical price of CI insurance. The situation here is slightly
different from the 2-state insurance model, in that there are risks, other than heart
attack, which affect CI insurance. Conclusions and suggestions for further work are
in Chapter 8.
We have also provided two appendices at the end for background information.
Appendix A gives a brief overview of epidemiology and Appendix B provides introductions to some relevant numerical methods.
Chapter 1
Genetics and Insurance
1.1
Introduction
With the discovery of genes, we are closer than ever before to a clearer understanding
of our biological roots; our place in the history of evolution. As it turns out, the
essence of life is embedded in the genes. Genes contain all the information necessary
to create a life form out of a single cell. They are the units of heredity passed
down from one generation to the next. They shape our physical characteristics and
behavioural patterns. In short, genes are key to life, the reasons for our existence.
But they cannot work alone; the environment plays an active role too. It is increasingly
becoming obvious that it is the interplay between genes and the environment that
shapes what we are.
Genes thrive on diversity. We, human beings, are all distinct from one another
and not just mere clones, thus proving the existence of wide variations even within a
single species. But diversity also brings with it its own complications. For example,
although all variations of the same gene are supposed to perform the same function,
they all do it in slightly different ways. Inevitably, this leads to differences in their
performance. In particular, underperformance can produce unwanted side-effects in
the form of genetic disorders.
This has practical implications in all spheres of human life. Here we are interested in the impact of genetic disorders on the insurance industry. Insurance in its
basic form is a simple principle of cooperation, where each individual in a group
contributes a small amount towards a common fund, which can be used to support
the few who suffer losses; a small price to pay for guaranteed support in times of
misfortune. However, even in this basic set-up, it is clear that insurance cannot be
provided to all at the same price. There will always be heterogeneity in risk profiles,
where a few individuals will be exposed to a greater risk of loss compared to the
rest. Then it will be unfair on the low-risk individuals to ask them to subsidise
the high-risk group. So, charging risk-based insurance premiums seems a sound
alternative.
The reasoning appears perfectly logical when smokers are charged a higher life
insurance premium. It can be argued that individuals who smoke choose to do so
of their own free will, fully understanding the related health hazards. However, the
same logic becomes untenable when applied to an individual who has inherited a
“faulty” gene from his parents. It might be obvious that he faces a greater risk, but
is it fair to penalise him for his own misfortune?
The answer is far from being straightforward. Apart from ethical and moral
issues, there are economic and political angles to it as well. Governments and
insurance regulators might find it difficult to let market economics take its own
course. But if they intervene, the outcome may not be entirely beneficial, as it will
ultimately be the general public who will pay for any market inefficiency.
Given this backdrop, our aim in this thesis will be to analyse the impact of gene-environment interactions on insurance from the perspective of both insurers and
consumers. We ask, for what types of gene-environment interactions:
(a) Can an insurer justify charging different premiums for different groups?
(b) Does an insurer face the risk of adverse selection?
But, before tackling these questions, we will provide some background information
in the remainder of this chapter. In Section 1.2, we provide an overview of genetics.
In Section 1.3, we provide examples of a few well-known genetic disorders. In Section
1.4, we will give a brief history of how regulations on genetics and insurance have
been shaped in the UK. A brief description of the UK Biobank project is in Section 1.5.
An outline of our proposed model of UK Biobank, to analyse the results that might
come out of the project, is given in Section 1.6.
1.2
Genes
“ ..., as the earth and ocean were probably peopled with vegetable productions
long before the existence of animals; and many families of these animals long
before other families of them, shall we conjecture that one and the same kind
of living filaments is and has been the cause of all organic life?”
This is a bold conjecture by Erasmus Darwin, in his book Zoönomia (Darwin
(1794)), more than 60 years before his famous grandson Charles Darwin produced
his epic, On the Origin of Species (Darwin (1859)). The conjecture by Erasmus
Darwin has turned out to be remarkably close to reality. But to arrive there
we have to start with the theory proposed by his grandson Charles. Charles Darwin
coined the term “natural selection” which he used to mean that each individual
has to struggle to survive where resources are limited. Individuals with the “best”
characteristics will be more likely to survive and those desirable traits will be passed
down through generations and will eventually be dominant in the population over
time.
When Charles Darwin proposed his theory of natural selection, it was at odds
with the existing model of blending inheritance, which predicted that an offspring
is an average of its parents. This would mean that an offspring of a tall parent and
a short parent will be of medium height, who will then pass on the trait of medium
height to the next generation and so on. So the tall and short traits will be lost
in future generations, and this contradicted the theory of natural selection which
required accumulation of desirable traits.
At around the same time, Gregor Johann Mendel was conducting his revolutionary experiments on pea plants. He noticed that if he crossed two pure contrasting
traits, the next generation hybrids showed only one trait, the dominant one. And
if he crossed only hybrids, the recessive trait re-appeared in 25% of cases. Mendel
realised that offspring inherit a pair of traits, one from each parent, of which the
dominant trait is expressed. This was a profound observation, which Mendel published in Mendel (1866). Unfortunately, his work remained largely unnoticed for
more than three decades before it was re-discovered in 1900.
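The 25% figure follows from simple enumeration. In the minimal Python sketch below, which uses a notation of our own (an upper-case letter for the dominant allele and a lower-case letter for the recessive one), each parent passes on one of its two alleles with equal probability, and the recessive trait is expressed only when both inherited alleles are recessive.

```python
from fractions import Fraction
from itertools import product


def recessive_fraction(parent1, parent2):
    """Fraction of offspring expressing the recessive trait.

    Each parent is a two-letter string such as "Aa"; upper case denotes
    the dominant allele and lower case the recessive allele.
    """
    offspring = list(product(parent1, parent2))  # four equally likely pairs
    recessive = [pair for pair in offspring
                 if all(allele.islower() for allele in pair)]
    return Fraction(len(recessive), len(offspring))


pure_cross = recessive_fraction("AA", "aa")    # two pure contrasting lines
hybrid_cross = recessive_fraction("Aa", "Aa")  # two hybrids
print(pure_cross, hybrid_cross)
```

Crossing the two pure lines gives only hybrids, all showing the dominant trait, while crossing two hybrids returns the recessive trait in one case out of four.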
Following re-discovery, Mendel’s laws of inheritance and Darwin’s natural selection
were hotly debated among the scientific community. While Darwinism demanded variety, Mendelism offered stability instead. The marriage between the
two theories happened only when Hermann Joseph Muller showed that mutations could be induced by subjecting
fruit-flies to X-rays. Once the conflict was resolved, scientists started wondering how
inherited traits are passed between generations. The breakthrough was finally made
by Watson and Crick (1953), who discovered the molecular structure of nucleic acids
and unravelled the rôle of deoxyribonucleic acid (DNA) in heredity.
In the rest of this section, we will provide a very brief introduction to molecular
genetics. This is not meant to be a comprehensive review of the subject but only
an overview of the fundamental concepts. For detailed discussions, Lewin (2000),
Pasternak (1999), Strachan and Read (1999) and Sudbery (1998) are standard textbooks on human genetics. For a popular exposition, please refer to Ridley (1999).
Unless specific references are provided, all material in this section, and the next, is
drawn from the above-mentioned sources.
All living creatures are made up of cells. The cell is the structural and functional
unit of all living beings and is sometimes called the building block of life. Some
organisms, like bacteria, are unicellular, while other complex life-forms, such as
human beings, are multicellular. A human body has an estimated 100 trillion cells.
Cells are made up of a number of subcellular components. Except red blood cells,
all cells in a human body contain a membrane-enclosed organelle called the nucleus.
Other subcellular components, like ribosomes, remain suspended outside the nucleus
in a jelly-like material called cytoplasm.
Leaving out the red blood cells along with the egg and sperm cells, each human
cell nucleus contains 23 pairs of filaments called chromosomes. As mentioned above,
red blood cells do not have nuclei. The egg cells and the sperm cells contain only
one of each pair of chromosomes, i.e., they have 23 chromosomes instead of 23 pairs.
An offspring is produced by fertilisation of an egg cell by a sperm cell, whereby all
chromosomes become paired again.
Inside a chromosome there exists a paired molecule called DNA, with two long
strands of sugar and phosphate running parallel to each other. Embedded on each
strand is a sequence of nucleotides or bases, which come in four varieties – adenine
(A), cytosine (C), guanine (G) and thymine (T). The two strands of a DNA molecule
are structured in such a way that if nucleotide A is positioned in a particular location
of a strand, the opposite strand will have nucleotide T at the same location. Similarly
for C and G. Because A has a great affinity for T, while C likes to pair with G,
the nucleotides on opposite strands form bonds between them; the bonded nucleotides
are called base-pairs. This produces the well-known structure of a double helix,
where the two strands of DNA stay intertwined with each other. Note that, as the
sequence of nucleotides in one strand is a complementary copy of the other, the
whole double-stranded sequence is described by the sequence of only one, chosen by
convention.
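The pairing rule means that either strand can be reconstructed from the other. As a minimal Python sketch (the function name is our own), the complementary strand can be read off position by position, as in the description above; we ignore the direction in which each strand is conventionally read.

```python
# Pairing rule: A with T, and C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_strand(strand):
    """Return the nucleotide opposite each position of the given strand."""
    return strand.upper().translate(COMPLEMENT)


original = "ACTCAGTTT"
opposite = complementary_strand(original)
print(original, opposite)
```

Applying the function twice returns the original sequence, which is why one strand suffices to describe the whole molecule.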
The sequence of nucleotides in DNA contains vital information on how to synthesise different types of proteins necessary for the existence of living creatures. Almost
everything in a human body is made of protein or made by them. So an efficient
mechanism for protein synthesis is critical for survival. On one of the two strands of
a DNA molecule, each sequence of three consecutive nucleotides, e.g. ACT, CAG,
TTT, is called a codon. Except for a few codons (which are used as stop signals),
all codons correspond to particular amino acids, which are the building blocks of
any protein. There are 64 possible codons, whereas there are only 20 amino acids.
So there are multiple codons which refer to the same amino acid.
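The count of 64 is simply the number of ordered triples drawn from four nucleotides. The short Python sketch below enumerates them; the three stop codons listed are those of the standard genetic code, written here in DNA letters, and the variable names are our own.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# Stop codons of the standard genetic code, written with DNA letters.
STOP_CODONS = {"TAA", "TAG", "TGA"}

codons = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]
sense_codons = [codon for codon in codons if codon not in STOP_CODONS]

print(len(codons), "possible codons, of which",
      len(sense_codons), "correspond to amino acids")
```

With more codons than amino acids, several codons necessarily map to the same amino acid.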
There are large stretches of DNA which do not contain any useful information;
only a small fraction of the complete DNA sequence appears to encode proteins. A
gene is a region of DNA that contains the code for synthesising a particular protein.
Even within a gene there are sections of meaningless information called introns, in
between sections of actual code, called exons.
When a cell needs to manufacture a particular protein, appropriate signals are
generated to identify the gene containing the recipe for the protein in question.
Then a complementary copy of that section of DNA is made to form a new single
stranded molecule called messenger ribonucleic acid (mRNA). This process is called
transcription. mRNA is very similar to a single strand of a DNA, except that the
nucleotide T in DNA is replaced by the nucleotide uracil (U) in mRNA. After transcription, mRNA is stripped of its introns and the exons are spliced together to form
a seamless code. The edited mRNA then moves out of the nucleus and approaches
a ribosome. Ribosomes translate the information contained in the mRNA into a
sequence of amino acids, which then folds up into a distinctive shape (depending
on the sequence) to form a protein. This is how a cell uses the code in DNA to
manufacture a protein it needs.
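The transcription and translation steps can be sketched in Python as follows. For simplicity, and as an assumption of ours rather than a description of the cell's machinery, we start from the coding strand (so that transcription reduces to replacing T by U), we ignore introns and splicing, and we use only a five-entry excerpt of the standard genetic code. The function names are hypothetical.

```python
# Excerpt of the standard genetic code (mRNA codons); illustration only.
CODON_TABLE = {
    "AUG": "Met",
    "CAG": "Gln",
    "UUU": "Phe",
    "ACU": "Thr",
    "UAA": "Stop",
}


def transcribe(coding_strand):
    """Return the mRNA sequence for a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three letters at a time until a stop codon is met."""
    amino_acids = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        amino_acids.append(residue)
    return amino_acids


protein = translate(transcribe("ATGCAGTTTACTTAA"))
print("-".join(protein))
```

A full implementation would need all 64 codons; a codon missing from this excerpt raises a KeyError.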
DNA can also replicate to produce two identical copies of itself. The technique
is similar to the one used for mRNA transcription. However, instead of working
only on a section of a particular strand, replication works on both strands of DNA
simultaneously. At first, the bonds between the base-pairs are broken to separate
the complementary strands. Simultaneously, two new strands are constructed with
appropriate nucleotides to form two identical double-stranded DNA molecules. This technique
is used to pass on genetic information from cell to cell (mitosis) and from generation
to generation (meiosis).
The discussion above depicts an idealised scenario. In reality, there are a number
of places where things can go wrong. For example, in the replication stage, one
nucleotide might get replaced by another by mistake. This can be critical if this
happens in the coding region of DNA. Unless the changed codon corresponds to the
same amino acid, the gene will not be able to synthesise the correct protein. This
can be disastrous depending on the function of the protein. Similar problems will
arise if one or more nucleotides are deleted from or inserted in the DNA sequence.
Any change to a DNA sequence is termed mutation.
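The effect of a single-nucleotide substitution depends on which codon results. The Python sketch below classifies three substitutions in the codon CAG, using a four-entry excerpt of the standard genetic code. The labels "synonymous", "missense" and "nonsense" are standard genetics terms that we introduce here for illustration, and the function name is our own.

```python
# Excerpt of the standard genetic code (DNA codons); illustration only.
CODON_TABLE = {"CAG": "Gln", "CAA": "Gln", "CCG": "Pro", "TAG": "Stop"}


def classify_substitution(original, mutated):
    """Classify a codon change by its effect on the encoded amino acid."""
    before, after = CODON_TABLE[original], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"   # same amino acid: protein unchanged
    if after == "Stop":
        return "nonsense"     # premature stop signal: protein truncated
    return "missense"         # different amino acid: protein altered


for mutated in ("CAA", "CCG", "TAG"):
    print("CAG ->", mutated, ":", classify_substitution("CAG", mutated))
```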
Although the consequences can be catastrophic, not all mutations are deleterious.
In fact, multiple variations of the same gene are quite common. These are called
alleles. The variations between alleles explain simple differences, like hair colour.
However, for a particular gene, one allele might produce a slightly different version of
a protein from the other alleles. This might turn out to be slightly better or worse at
performing a specific function. One might ask: why aren’t inefficient alleles getting
purged by natural selection? One answer might be that these alleles might be better
at doing other things. We have to wait until we fully understand the implications
of all interactions between different genes and the environment to truly appreciate
all the nuances of human genetics.
Let us now look at a few well-known genetic disorders.
1.3
Genetic Disorders and Insurance
1.3.1
Huntington’s Disease
Huntington’s disease (HD) or Huntington’s chorea is a rare neurological disorder. It
is named after the physician George Huntington, who described the disorder in detail
in a paper published in 1872. HD can strike before the age of twenty, and the early
symptoms include a slight deterioration of the intellectual faculties. Gradually,
physical symptoms appear in the form of jerky, uncontrollable, random movements,
collectively known as chorea. Patients also exhibit slowing of thought process, speech
impairment and inability to learn new skills. They descend into deep depression,
with occasional hallucinations and delusions.
The disorder has been traced to a particular gene in chromosome 4. As is the case
for many genes, this gene also has a large number of alleles. The alleles differ from
each other in the number of occurrences of a single codon CAG in the middle of the
gene. The number of CAG repeats can vary from six to over a hundred depending
on the allele. Individuals with 35 or fewer CAG repeats are safe from HD. For genes
with more than 35 copies of CAG, the DNA replication process becomes unstable
and the number of repeats can increase in successive generations. Because of the
progressive increase in repeat lengths, the disorder tends to increase in severity as
it passes from one generation to the next, and to trigger earlier onsets. Also, the
disorder is a dominant trait, so even a single affected allele from a parent is enough
to trigger HD. For individuals with 39 CAG repeats, there is a 90% probability of
first symptoms appearing before age 75. However with 50 CAG repeats, onset of
HD, on average, is at age 27. The disorder is incurable and takes 15-25 years to run
its full course.
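Counting CAG repeats in a sequence is straightforward to sketch in Python. The function names and the example sequence below are hypothetical, and the only threshold used is the figure of 35 repeats quoted above; this is an illustration, not a clinical test.

```python
import re


def longest_cag_run(sequence):
    """Return the largest number of consecutive CAG codons in the sequence."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def exceeds_threshold(repeats, threshold=35):
    """True if the repeat count is above the threshold quoted in the text."""
    return repeats > threshold


# Hypothetical sequence: 40 CAG repeats with a few flanking nucleotides.
example = "TT" + "CAG" * 40 + "CCG"
repeats = longest_cag_run(example)
print(repeats, exceeds_threshold(repeats))
```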
The codon CAG corresponds to the amino acid glutamine. It is a necessary
ingredient for the production of a protein called huntingtin. However, more than 39
CAG repeats produce a mutated form of the protein, which gradually accumulates
in neurone cells. This continuous aggregation causes the cells to die off in selected
regions of the brain and trigger HD.
Even before the actual discovery of the gene responsible for HD, it was obvious
that the disorder was hereditary in nature. Insurance companies offering health
insurance, like CI insurance, used family histories as an underwriting tool to protect
themselves from adverse selection. With the better understanding of the genetics
behind HD, insurance companies will be interested to find out if their underwriting
techniques could be improved further. This has been studied in detail in Gutiérrez
and Macdonald (2004).
The authors first estimated the age-dependent rates of onset of HD for males and
females with different CAG repeats. They had to take into account the severity of
the symptoms that would lead to a successful CI insurance claim. Then the authors
calculated the net level CI premium rates for both sexes with 36-50 CAG repeats.
They found that insurance companies, following standard underwriting guidelines,
will be unable to insure individuals with very long CAG repeats. This is particularly
true for younger individuals and longer policy durations. For comparison purposes,
the authors have also calculated premiums based on family history alone.
The authors then investigated the cost of adverse selection in case of a moratorium
on the use of genetic test results and also possibly family history. They found that
moratoria on genetic test results can lead to an increase of premiums of about 0.1%,
while including family history in the moratoria will increase premiums by 0.35%.
The whole exercise was repeated for a life insurance model. Although the results
show a discernible increase in the risk of mortality with increase in CAG repeats, the
impact is less severe than that in the context of CI insurance. The cost of adverse
selection arising from a moratorium on the use of genetic tests for HD was found to
be negligible for life insurance.
1.3.2
Alzheimer’s Disease
Alzheimer’s Disease (AD) got its name from a German psychiatrist Dr Alois
Alzheimer. In 1901, he interviewed a patient, Mrs Auguste D, who showed signs
of dementia, a medical term for progressive decline in cognitive functions affecting
memory, language and problem solving. The patient died in 1906, and Dr Alzheimer
along with his colleagues examined her anatomy and neuropathology. He found deposits of plaques on the outside of the neurones and severance of the connections
between the neurones. These have been identified as classical pathological signs of
AD.
AD is a disorder of old age, rarely affecting people less than 60 years old. The
early symptoms include short-term memory loss with a tendency to become less energetic or spontaneous. With the progression of the disease, patients start forgetting
well-known skills or objects or persons. At a later stage, the patients find it difficult
to perform the simplest of tasks and require constant supervision.
The Apolipoprotein E (ApoE) gene on chromosome 19 has been identified as a
risk factor for development of AD. The ApoE gene has three alleles ε2, ε3 and ε4,
found in the general population in the proportions 0.09, 0.77 and 0.14 respectively.
Individuals with an ε4 allele have a greater chance of developing AD; more
so if they have two ε4 alleles. In contrast, the ε2 allele appears to have a protective effect
against AD.
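The allele proportions quoted above can be turned into approximate genotype proportions if we assume Hardy-Weinberg equilibrium, that is, that the two alleles carried by an individual are drawn independently from the population. This assumption is ours and is made here only for illustration. The labels e2, e3 and e4 stand for ε2, ε3 and ε4, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement

# Allele proportions for ApoE as quoted in the text.
ALLELE_FREQ = {"e2": 0.09, "e3": 0.77, "e4": 0.14}


def genotype_frequencies(allele_freq):
    """Genotype proportions under the Hardy-Weinberg assumption."""
    frequencies = {}
    for a, b in combinations_with_replacement(sorted(allele_freq), 2):
        p, q = allele_freq[a], allele_freq[b]
        # Two copies of one allele: p * p; two different alleles: 2 * p * q.
        frequencies[(a, b)] = p * p if a == b else 2 * p * q
    return frequencies


genotypes = genotype_frequencies(ALLELE_FREQ)
for pair, frequency in genotypes.items():
    print(pair, round(frequency, 4))
```

Under this assumption, roughly 2% of the population would carry two ε4 alleles and about 22% would carry one ε3 and one ε4 allele.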
The difference between the alleles is that at two locations, two A nucleotides in
ε4 are replaced by two Cs in ε2. ε3 is intermediate. As these alleles produce slightly
different proteins, the protein derived from the ε4 allele appears to aid in the formation
of plaques in the neurones. Although the actual biochemical process is not well
understood, there is significant statistical evidence of a correlation.
Patients with AD can survive up to 15 years after the first symptoms are noticed.
This is of significant importance to the long-term care insurance market. Macdonald
and Pritchard (2000) and Macdonald and Pritchard (2001) are studies on the impact
of AD on long-term care insurance.
Macdonald and Pritchard (2000) proposed a multiple-state model for AD and
went on to estimate the transition intensities for different possible genotypes of
ApoE. Macdonald and Pritchard (2001) applied the model to calculate long-term
care insurance premiums. The authors found that insurers, if allowed to use ApoE
test results, would probably charge ratings of +25% and +50% for individuals with
one and two ε4 alleles in their genes, respectively. The authors also estimated the
cost of adverse selection if a moratorium is in place on using genetic test results.
They found that the cost will not exceed 5% of premiums and can probably be
ignored.
1.3.3
Cancer
The two genetic disorders discussed so far, HD and AD, are commonly known
as single-gene disorders. For each, there is a strong link between the disorder and
mutations in a particular gene. However, it is important to remember that with
advances in genetic research, it is quite possible that links with other genes and
environmental factors will come to light in future. HD and AD are unusual in the
sense that most common disorders are much more complex in nature and arise out
of interactions between a number of genes along with environmental factors. Cancer
is one such common multifactorial genetic disorder.
As we saw in the discussion of genes, all cells contain the necessary information to
replicate themselves. However, disorderly cell replication can lead to cell proliferation and ultimately to the production of malignant tumours. There is a complex mechanism
in place to protect against such an eventuality. Most notably, the tumour suppressor genes or anti-oncogenes identify any irregularities and produce a dampening or
repressive effect on the cell division cycle. If such repairs prove futile, the genes promote apoptosis, a kind of programmed cell death. Most tumour suppressor genes
can function even with one functional allele, i.e. both alleles of these genes must
be mutated before tumour suppression fails. In this section, we will consider two
such tumour suppressor genes – BRCA1 and BRCA2.
The BRCA1 gene is located on chromosome 17 and codes for a protein which regulates the cycle of cell division and inhibits uncontrolled growth of cells, in particular,
those that line the milk ducts in the breast. A large number of alleles of the BRCA1
gene have been identified, many of which are associated with an increased risk of
breast cancer. The BRCA2 gene, located on chromosome 13, has a function similar to the BRCA1 gene. Again a number of alleles of the BRCA2 gene have been
linked to increased risk of breast cancer. There are also studies which have linked
BRCA1 and BRCA2 genes with ovarian cancer. It is important to note here that
only about 5 to 10% of breast cancers are due to mutations in BRCA1 and BRCA2
genes, suggesting that most cases are sporadic in nature.
Macdonald et al. (2003a) studied the genetics of breast and ovarian cancer from
the perspective of a life and health insurance underwriter, who can only have access
to family histories (often incomplete) of prospective consumers. The authors developed a multiple-state model and estimated the transition intensities from UK population data. Using the model, they computed conditional probabilities of women
being BRCA1 and BRCA2 mutation carriers (individuals with alleles that confer a
greater risk of breast and ovarian cancer) given the family history. The authors
found that these probabilities are very sensitive to the estimates of mutation frequencies and penetrances. They concluded that it may not be appropriate to apply
risk estimates based on studies of high risk families to other groups.
Macdonald et al. (2003b) applied the model to CI insurance. The authors found
that if insurance underwriters had access to genetic test results, most BRCA1 and
BRCA2 mutation carriers will be uninsurable. On the other hand, if underwriting
is based on family history alone, only a few cases will exceed the usual underwriting
limits. If insurers were unable to use genetic test results or family history information
for underwriting, adverse selection was found to be significant in a small CI insurance
market, in case of high penetrances or if higher sums assured could be obtained.
1.3.4
Cardiovascular disease
In breast and ovarian cancer, the two genes involved accounted for a small number
of cases. Cancer can also be caused by mutations due to environmental factors,
such as exposure to harmful radiation. So, we have gradually shifted our focus from
simple, but rare, single-gene disorders to complex, but relatively common, multifactorial disorders. In this section, we will discuss one more common disorder —
cardiovascular disease.
Cardiovascular disease is a class of disease that involves the heart and blood
vessels. In one common form, fatty deposits (plaques) in the blood vessels make
them narrow and restrict blood flow. The plaques can sometimes rupture, forming
blood clots that obstruct the artery and stop blood flow to the heart muscles. This
is commonly known as myocardial infarction or heart attack. A number of risk
factors have been identified for cardiovascular disease. Large-scale studies have
found evidence that tobacco smoking can significantly increase an individual’s risk
of heart attack. For an example of such a study, see Woodward (1999). Among
other risk factors, hypercholesterolemia, or elevated cholesterol levels in the blood
stream, has been directly linked to heart attacks.
To understand hypercholesterolemia, we have to return to the ApoE gene on
chromosome 19. The function of the protein coded by the gene is to facilitate
transfer of fat and cholesterol from very low density lipoprotein (VLDL), which
carries fat and cholesterol from the liver to the cells that need them. If there is a
malfunction, much fat and cholesterol remain in the blood stream and form plaques
on the walls of arteries, which can ultimately lead to heart attacks.
The efficiency with which the ApoE gene carries out its function depends on
its alleles. It has been found that individuals with two ε4 alleles or two ε2 alleles
are at a heightened risk of cardiovascular disease compared to those who have at
least one copy of the ε3 allele. Of course, a low cholesterol diet can reduce the risk
considerably. So again we can see that external intervention plays an important rôle
in the efficient functioning of genes.
Clearly, cardiovascular disease is a multifactorial disorder. The risk not only depends on genetic factors (alleles of ApoE gene), but also environmental interactions
(smoking habits, dietary control etc.). We will study heart attack in much greater
detail in later chapters. In Chapter 2, we will develop a multiple-state model for
heart attack and estimate the transition intensities. In Chapter 3, we will show how
we can hypothecate a 2×2 gene-environment interaction based on this model. All
our subsequent analysis will be based on that model.
1.4 Genetics and Insurance Regulations
Insurance companies set premium rates based on the assumption that they have
access to all information relevant to the risk involved. If consumers can withhold
any information from an insurance company, there is a risk that the company will
face adverse selection. This is the basic principle behind underwriting insurance
risks. However, this is not the only consideration behind underwriting classifications.
There might be competitive pressure to charge different premium levels to different
groups of consumers. One such example is charging higher life insurance premiums
to smokers. It is unlikely that smokers, while purchasing insurance, will take into
account the adverse health effects of smoking and over-insure themselves to select
against an insurer. But once an insurer decides to charge differential premiums,
other insurers will have to follow suit, as then charging the average premium will
expose them to attracting only high risk consumers. For an in-depth discussion on
this topic, please refer to Macdonald (2004).
However, underwriting based on genetic test results is very different from the
smoker/non-smoker example. One’s own genes are a very private matter and discrimination based on such information has both moral and social implications. At
the same time, the possibility of adverse selection cannot be ruled out altogether.
Given this dilemma, governments and insurance regulators in different countries
have adopted different approaches to deal with the issue. Sweden, for example,
does not allow the use of genetic test results or family histories for underwriting.
Developments in the UK have been particularly interesting, as the scientific basis
for underwriting has come under fierce scrutiny. We will briefly recount the main
milestones in this section. For a more detailed discussion, please refer to Macdonald
(2003).
In 1997, the Human Genetics Advisory Committee (HGAC) asked the UK Government to impose a moratorium on the use of all genetic test results for insurance
underwriting purposes. The Government, instead, set up the Genetics and Insurance
Committee (GAIC), in 1999, to scrutinise the use of genetic tests in underwriting
on a case-by-case basis. In 2000, GAIC approved the use of genetic test results for
HD for life insurance contracts over £500,000. GAIC made it clear that insurance
companies could not ask individuals to undergo genetic tests for HD. Only if individuals have already been tested, can insurance companies ask for access to that
information. GAIC noted that it would actually enhance the access to insurance for
individuals with normal test results, but with family history of HD.
In the meantime, HGAC and other advisory bodies were merged to form the
Human Genetics Commission (HGC), which was particularly critical of the rôle
of GAIC. The Association of British Insurers (ABI), who were representing the
majority of UK insurers, also came in for some criticism. ABI advised its member
insurers that they could continue to use genetic test results unless their use had
been rejected by GAIC. Few agreed with this interpretation. The ABI then agreed
to restrict the use of test results to those that GAIC had approved.
In 2001, after more intense debate on the topic, the ABI withdrew all the applications it had made to GAIC (in respect of HD, breast and ovarian cancer and
AD) and agreed on a five year moratorium on the use of genetic test results. Under
the terms of the moratorium, customers will not be required to disclose the results
of predictive genetic tests for policies up to £500,000 for life insurance, £300,000
for health insurance and paying annual benefits of £30,000 for income protection
insurance. In 2005, the original moratorium was extended for five more years and
will be valid until 1 November, 2011.
The current Concordat and Moratorium on Genetics and Insurance, which came
into effect from 14 March, 2005, mentions that GAIC will continue to liaise with the
clinical genetics community, patient groups and experts in insurance and actuarial
science and monitor new developments relevant to genetics and insurance. In the
meantime, the UK Biobank project has been launched to analyse the impact of gene-environment interaction on common multifactorial disorders. With rapid advances
in genetics, aided by such large-scale population studies, it is likely that new facts
and evidence will come to light with regularity. In particular, it is important to
analyse the kind of results that might come out of UK Biobank and its implications
for the insurance industry.
1.5 The UK Biobank Project
The website http://www.ukbiobank.ac.uk/ is the main source of information on
UK Biobank. In particular, it provides a draft protocol. There (Section 1.2) it is
stated that:
“The main aim of the study is to collect data to enable the investigation of the
separate and combined effects of genetic and environmental factors (including
lifestyle, physiological and environmental exposures) on the risk of common
multifactorial disorders of adult life.”
UK Biobank is a cohort study, meaning that a large number of people will be
recruited, as randomly as possible, and then followed over time. The main features
of the study design are as follows:
(a) The cohort will consist of at least 500,000 men and women recruited from the
UK general population.
(b) The chosen age range is 40 to 69 (note that earlier versions, including the draft
protocol referred to above, proposed an age range 45 to 69).
(c) The initial follow-up period is 10 years.
(d) Participants will be recruited through their local general practitioners. Participants are expected to come from a broad range of socio-economic backgrounds
and regions throughout the UK, with a wide range of exposures to factors of
interest.
(e) The project will be conducted through the UK National Health Service.
(f) UK Biobank is funded by the Department of Health, the Medical Research
Council, the Scottish Executive and The Wellcome Trust. The cost is approximately £40 million.
People registered with participating general practices will be requested to join
the study by completing a self-administered questionnaire, attending an interview,
undergoing examination by a research nurse and giving a blood sample, to enable
DNA extraction at a later date. The protocol assumes that DNA extraction would
be deferred and done as and when genotyping is required.
The Office for National Statistics will provide routine follow-up data regarding
cause-specific mortality and cancer incidence. Hospitalisation and general practice
records will provide data regarding incident morbidity. Every two years a subset
of 2,000 participants and every five years the entire cohort will be re-surveyed by
postal questionnaire to update exposure data and to ascertain self-reported incident
morbidity.
It is envisaged that the main study design for assessing the combined effect of
environment and genotype will consist of a series of case-control studies (see Appendix A) nested within the cohort. Options for the selection of controls include
an individually matched design or a panel of controls selected at random from the
cohort, probably weighted by age and sex. An important principle underlying the
design of the study and the statistical methods that will be applied is to minimise the
assumptions made about the underlying nature of the relationship between genetic
and environmental factors and the risk of disease.
As a comprehensive prospective study with biological samples, UK Biobank is
expected to contribute substantially to international knowledge regarding the combined effects of genotype and exposure on the risk of disease. Its design means
that the study will provide a structure and resources for future research, and will
enable researchers to address current and unforeseen scientific questions. While UK
Biobank will collect and store the data, any analysis of the data in the future will
require further funding.
1.6 A UK Biobank Simulation Model
In this section we will outline how we plan to simulate the UK Biobank project.
We suppose that the study population is subdivided (or stratified) into subgroups with respect to: (a) different genotypes; (b) different levels of environmental
exposures; and (c) other relevant factors such as sex. Genotype defines discrete categories, and we suppose that any environmental exposures or other factors defined
on a continuous scale are grouped into discrete categories. Thus, we always have a
small number of discrete subgroups (or strata).
The life history of each participant will be represented by a multiple-state model,
with states and transitions defining onset and possibly progression of the disease of
interest. Some of the model parameters, namely the transition intensities, will be
different in different strata — most obviously those associated with the disease of
interest. These intensities are the key to the whole UK Biobank project, as well as
our study.
(a) The real-life epidemiologist wants to estimate them (or in practice, odds ratios) from UK Biobank data, given a hypothesis about the effect of measured
exposures on the disease.
(b) The real-life actuary wants to take the estimated intensities (or in practice,
approximate them from published odds ratios) and use them in pricing and
reserving.
(c) We wish to specify hypothetical but plausible dependencies of these intensities,
on genotype and other exposures, so that we can observe our model epidemiologist and model actuary at work.
The steps in simulating UK Biobank are then as follows.
(a) We choose the number of genotypes and the number of levels of environmental
exposure, and also the frequencies with which each appears in the population.
Thus we can model simple or complex genotypes and exposures, and allow them
to be more or less common or rare. These define the subgroups or strata. The
simplest example (used in the UK Biobank draft protocol) is to have two genotypes and two levels of environmental exposure. We also choose the intensities
of onset of heart attack in each stratum to reflect the strength of the association
between stratum and the risk of heart attack.
(b) We randomly ‘create’ 500,000 individuals, each equally likely to be male or
female, and with ages uniformly distributed in the range 40 to 69, and allocated
to strata at random according to the chosen frequencies.
(c) The life history of each individual is modelled by simulating the times of any
transitions between states in the model, as governed by the intensities. We
record the times of any transitions taking place within the 10-year follow-up
period of UK Biobank.
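Steps (a) to (c) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the stratum frequencies, the constant intensities, and the names `STRATUM_FREQUENCY`, `ONSET_INTENSITY`, `DEATH_INTENSITY`, `simulate_individual` and `simulate_cohort` are all hypothetical, ages are drawn as exact ages between 40 and 70, and only the first transition out of the Healthy state is recorded. The model developed in later chapters uses age-dependent intensities instead.

```python
import random

# Hypothetical 2x2 strata (genotype, exposure) with made-up population frequencies.
STRATUM_FREQUENCY = {
    ("g0", "e0"): 0.72, ("g0", "e1"): 0.18,
    ("g1", "e0"): 0.08, ("g1", "e1"): 0.02,
}
# Hypothetical constant annual intensities of heart attack onset in each stratum.
ONSET_INTENSITY = {
    ("g0", "e0"): 0.004, ("g0", "e1"): 0.006,
    ("g1", "e0"): 0.006, ("g1", "e1"): 0.012,
}
DEATH_INTENSITY = 0.01    # hypothetical constant force of mortality, all strata
FOLLOW_UP_YEARS = 10.0    # initial follow-up period of the study


def simulate_individual(rng):
    """Create one participant and simulate the first transition, if any."""
    sex = rng.choice(["male", "female"])      # equally likely male or female
    age = rng.uniform(40.0, 70.0)             # assumed: exact age, uniform on 40 to 70
    strata = list(STRATUM_FREQUENCY)
    weights = list(STRATUM_FREQUENCY.values())
    stratum = rng.choices(strata, weights=weights)[0]

    # With constant intensities the waiting times are exponential, and the
    # earlier of the two competing events is the one that is observed.
    time_to_onset = rng.expovariate(ONSET_INTENSITY[stratum])
    time_to_death = rng.expovariate(DEATH_INTENSITY)
    if min(time_to_onset, time_to_death) > FOLLOW_UP_YEARS:
        event, time = "censored", FOLLOW_UP_YEARS
    elif time_to_onset < time_to_death:
        event, time = "heart_attack", time_to_onset
    else:
        event, time = "death", time_to_death
    return {"sex": sex, "age": age, "stratum": stratum,
            "event": event, "time": time}


def simulate_cohort(size, seed=0):
    """Simulate `size` independent participants with a reproducible seed."""
    rng = random.Random(seed)
    return [simulate_individual(rng) for _ in range(size)]
```

Calling `simulate_cohort(500_000)` would mimic the full cohort size; a much smaller `size` is enough to inspect the structure of the simulated records.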
We implicitly assume that the 500,000 participants are independent in the statistical sense, which is unlikely to be true. The sample is so large that some related
individuals are likely to be recruited by chance, but also the method of recruitment (through general practices) guarantees some level of familial and geographical
clustering.
Chapter 2
A Model for Heart Attack
2.1 Specification of the Model
Heart attack, cancer and stroke are the three major illnesses generally covered under
a critical illness (CI) insurance contract. Other minor CIs, sometimes included in
the list, are:
(a) coronary artery bypass,
(b) major organ transplant,
(c) chronic kidney failure,
(d) multiple sclerosis, and
(e) total permanent disability.
Our main focus in this thesis will be on heart attacks. The objective is to build a simple but comprehensive model for heart attacks, which can then be used to represent
hypothetical, but plausible, multifactorial gene-environment interactions. We can
then subsequently analyse the impact of multifactorial disorders on CI insurance.
Hazards of heart attacks have been widely studied by a number of research programmes. The interested parties include clinical researchers, pharmaceutical industries, epidemiologists and also actuaries. However, as remits of these papers are
very different, it is difficult to develop a complete model of heart attacks from any
one of these reports. For example, Gutiérrez and Macdonald (2003) gives transition
intensities or hazard rates of an individual suffering a heart attack. The authors
were investigating CI insurance and the subject of interest was the incidence of different CIs. So, naturally, their analysis did not include post heart attack survivals.
On the other hand Capewell et al. (2000) investigates only survival after a heart
attack. In this chapter, our aim is to bring together all these results and develop a
multiple-state model, which will enable us to track individuals from their birth to
any incidence of heart attack and follow them up until they die.
[Figure 2.1: A 4-state heart attack model. States: 1 = Healthy, 2 = Heart Attack, 3 = Dead, 4 = Dead. Transitions: 1 → 2 with intensity λ12 (x), 1 → 3 with intensity λ13 (x), 2 → 4 with intensity λ24 (x, t).]
We propose a simple 4-state heart attack model given in Figure 2.1. All individuals are assumed to start in State 1, the Healthy state. From there, they may have a
heart attack and move to State 2, or die and move to State 3. As our ultimate goal
is to apply the model for CI insurance, we are interested in first heart attacks only,
because this will trigger a claim under a CI policy, so any subsequent heart attacks
are ignored. The only possible transition from the Heart Attack state is death. It
is convenient to distinguish deaths occurring after a heart attack, so States 3 and 4
are separate.
A basic introduction to multiple state models and transition intensities is given
in Appendix A. Please refer to Woodward (1999) and Breslow and Day (1980) for
a detailed discussion.
2.2 The Heart Attack Transition Intensity
Having specified the structure of the model, we now move on to parameterise
the transition intensities. First we specify the heart attack transition intensity in
the general population, denoted λ12 (x), separately for males and females. Gutiérrez
and Macdonald (2003) fitted parametric functions to the transition intensities of all
major critical illnesses, including heart attacks. The authors used numbers of first-ever cases of heart attacks between September 1991 and August 1992, taken from
McCormick et al. (1995). The exact exposed to risk is calculated and a parametric
function is fitted to it.
[Figure 2.2: The transition intensity of all first heart attacks, by gender. Axes: age (years) against transition intensity; separate curves for males and females.]
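For males, it is given by:
\[
\lambda_{12}(x) =
\begin{cases}
\exp(-13.2238 + 0.152568x) & \text{if } x \le 44 \\[4pt]
\dfrac{x - 44}{49 - 44} \times \lambda_{12}(49) + \dfrac{49 - x}{49 - 44} \times \lambda_{12}(44) & \text{if } 44 < x < 49 \\[4pt]
-0.01245109 + 0.000315605x & \text{if } x \ge 49
\end{cases}
\tag{2.1}
\]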
and for females, it is given by:
\[
\lambda_{12}(x) = \frac{0.598694}{\Gamma(15.6412)} \times 0.15317^{15.6412} \exp(-0.15317x)\, x^{14.6412}. \tag{2.2}
\]
These intensities are shown in Figure 2.2.
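As an illustration, Equations (2.1) and (2.2) can be evaluated directly. The sketch below is not taken from the original work: the function names `lambda12_male` and `lambda12_female` are hypothetical, and the female intensity is computed on the log scale (using `math.lgamma`) purely as a numerical convenience.

```python
import math


def lambda12_male(x):
    """Male first heart attack intensity at age x, following Equation (2.1)."""
    def exponential_part(age):
        return math.exp(-13.2238 + 0.152568 * age)

    def linear_part(age):
        return -0.01245109 + 0.000315605 * age

    if x <= 44:
        return exponential_part(x)
    if x >= 49:
        return linear_part(x)
    # Linear blend between the value at age 44 and the value at age 49.
    weight = (x - 44) / (49 - 44)
    return weight * linear_part(49) + (1 - weight) * exponential_part(44)


def lambda12_female(x):
    """Female first heart attack intensity at age x, following Equation (2.2)."""
    if x <= 0:
        return 0.0  # x ** 14.6412 vanishes at age 0
    shape, rate = 15.6412, 0.15317
    log_value = (math.log(0.598694) + shape * math.log(rate)
                 - rate * x + (shape - 1) * math.log(x)
                 - math.lgamma(shape))
    return math.exp(log_value)
```

The male function is continuous at ages 44 and 49 by construction, because the middle branch interpolates between the values of the two outer branches.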
2.3 Mortality After First Heart Attacks
We will now focus on what happens after an individual has experienced his or her
first heart attack. Figure 2.3 shows the part of the full 4-state model (Figure 2.1)
we are interested in.
[Figure 2.3: Subset of the model in Figure 2.1 to study survival after heart attacks. States: 2 = Heart Attack, 4 = Dead; transition 2 → 4 with intensity λ24 (x, t).]
Here, the individuals who have had their first heart attack, start off from State
2. We then observe these individuals until their death, at which point they move on
to State 4. We are interested in the transition intensity from State 2 to State 4.
In Section 2.3.1, we will review a number of published articles on survival after
first heart attacks. In Section 2.3.2, we will identify a study which we believe to be
the most appropriate for our model. In Section 2.3.3, we will propose a parametric
form for λ24 (x, t). And finally in Section 2.3.4, we will provide a discussion on the
fitted model and validate our model against other relevant data available in the
scientific literature.
2.3.1 Literature Review
There are a number of articles in published journals which study prognosis following
heart attacks. The articles vary widely in their scope and focus. There are articles
like Tunstall-Pedoe et al. (1999), which is an outstanding example of a population-based study, but concentrates only on short-term survival after heart attacks. As
our interest lies in both short and long-term prognosis, we will review articles which
observe the study subjects over longer periods of time.
Capewell et al. (2000) describes a retrospective cohort study involving 117,718
patients admitted to hospital with heart attacks in Scotland between 1986 and 1995.
This is one of the largest population-based investigations which deals with both
short and long-term prognosis following a first heart attack. The study classifies
individuals according to:
(a) age groups <55, 55–64, 65–74, 75–84, ≥85,
(b) gender,
(c) deprivation categories and
(d) co-morbidity.
Case-fatality rates are aggregated for each of these groups. So it is not possible
to model the transition intensity in terms of all these variables. However, we are
only interested in modelling post heart attack mortality in terms of age and gender.
The case fatality rates appear to be higher for women. The authors have confirmed
that this apparent high case fatality rate is due to the fact that the average age of
women in the study was significantly higher than that of men.
From the published case fatality rates based on age-groups, it is clear that the
rates depend on:
(a) the age at first heart attack; and
(b) the duration of survival after suffering first heart attack.
So we will model λ24 (x, t) as a function of x and t, where x is the age at first heart
attack and t is the survival duration post-first heart attack.
Goldberg et al. (1998) conducted a similar population-based investigation on patients admitted to all acute care hospitals in the Worcester, Massachusetts metropolitan area (1990 census population of 437,000) between 1975 and 1995. A total of
8,070 patients were studied in the investigation.
The study classified individuals according to the study periods 1975–78, 1981–
84, 1986–88, 1990–91 and 1993–95, and uses the same age-groups as Capewell et al.
(2000). The results published include:
(a) odds of dying during hospitalisation, and after 1 year and 2 years following
hospital discharge for all age-groups as compared to patients < 55 years;
(b) trends in the odds of dying during hospitalisation, and after 1 year and 2 years
following hospital discharge for each age-group;
(c) 1-year and 2-year death rates of hospital survivors by age-group; and
(d) a graph of long-term survival rates among hospital survivors by age-group.
Brønnum-Hansen et al. (2001) studied patients registered during 1982–91 in 11
municipalities in the western part of Copenhagen County, Denmark. During the
study period, the average size of the population was 202,000 and a total of 3,926
first heart attacks were registered. The patients were classified according to gender,
two age-groups (30–59 and 60–74) and three study periods (1982–84, 1985–87 and
1988–91). The published figures include:
(a) a table of fatal and non-fatal heart attack cases for each age-group and gender
covering the full duration;
(b) a table of standardised mortality ratios (quotient of observed to expected number of deaths) and excess death rates (observed minus expected number of
deaths per 1,000 person-years) by age-group and gender; and
(c) separate graphs of short-term (≤28 days) and long-term (28 days to 15 years)
survival probabilities for men and women.
The authors point out that according to their findings the age-adjusted case-fatality rates after a first heart attack do not differ between the sexes. This agrees
with the findings of Capewell et al. (2000).
Among these articles, Capewell et al. (2000) appears most relevant on three
counts. Firstly, the study population is Scottish, which makes the data the most
relevant for modelling heart attacks in the UK. Secondly, it has the largest study
population providing substantial credibility to the figures published. Thirdly, the
figures published in this article are presented in a suitable format and can be readily
used for parameterising λ24 (x, t).
Most of the data in Goldberg et al. (1998) and Brønnum-Hansen et al. (2001)
are presented in the form of graphs, odds ratios, standardised mortality ratios and
excess death rates. Although results in these formats are not suitable for directly
parameterising transition intensities, they can still be used as an independent check
of λ24 (x, t), which we will finally propose.
2.3.2 Data
As mentioned in the previous section, Capewell et al. (2000) provides case-fatality
rates for different age-groups for durations 30 days, 1 year, 5 and 10 years following
first heart attacks. We will represent the five age-groups <55, 55–64, 65–74, 75–84,
≥85 by single representative ages, namely, 50, 60, 70, 80 and 90 respectively. For our
calculations, we will transform the case-fatality rates into survival probabilities, by
subtracting the case-fatality rates from 1. The survival probabilities thus calculated
are given in Table 2.1.

Table 2.1: Survival probabilities after first heart attack.

Age      Representative   Duration after first heart attack
range    age              0 days   30 days   1 year   5 years   10 years
<55      50               1.000    0.949     0.921    0.834     0.737
55–64    60               1.000    0.880     0.827    0.672     0.528
65–74    70               1.000    0.771     0.677    0.465     0.312
75–84    80               1.000    0.641     0.499    0.255     0.133
≥85      90               1.000    0.545     0.351    0.123     0.052

[Figure 2.4: The plots of the data from Table 2.1. Axes: duration (years) after first heart attack against survival probability; curves P22 (x, x + t) for x = 50, 60, 70, 80 and 90.]
2.3.3 Fitting a Parametric Function
Based on the data, we will first parameterise the survival function following a first
heart attack. Let Pij (y, z) denote the conditional probability that a person is in
State j at age z given that he or she was in State i at age y. Table 2.1 gives the
survival probabilities P22 (x, x + t) for specific values of x and t, where x denotes
the age at first heart attack and t denotes the survival duration after the first heart
attack.
To get an initial idea of the shape of the functions we are dealing with, we plot
the data of Table 2.1 in Figure 2.4. For ease of comparison, the data-points for each
age-group are connected by straight lines.
As an initial guess of a suitable functional form, consider functions of the form
$f_a(t) = 1/(1 + t^a)$. Figure 2.5 shows $f_a(t)$ for a = 0.25, 0.50, 1.00, 2.00 and 4.00.
Note that for all values of a, $f_a(0) = 1$, $f_a(1) = 0.5$ and $f_a(+\infty) = 0$. The smaller
the value of a, the steeper is the initial descent of $f_a(t)$, but flatter is the descent
later.
[Figure 2.5: The plots of $f(t) = 1/(1 + t^a)$ against t, for 0 ≤ t ≤ 10, for values of a = 0.25, 0.50, 1.00, 2.00, 4.00.]
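The stated properties of this family of functions are easy to check numerically. The following is a small illustrative sketch; the function name `f` and the evaluation points are arbitrary choices made here, not part of the original analysis.

```python
def f(a, t):
    """The candidate functional form 1 / (1 + t**a), for a > 0 and t >= 0."""
    return 1.0 / (1.0 + t ** a)


# Every member of the family starts at 1 and passes through 0.5 at t = 1.
for a in (0.25, 0.50, 1.00, 2.00, 4.00):
    print(a, f(a, 0.0), f(a, 1.0))

# Drop over an early interval [0, 0.1] and over a later interval [1, 1.5].
for a in (0.25, 4.00):
    early_drop = f(a, 0.0) - f(a, 0.1)
    later_drop = f(a, 1.0) - f(a, 1.5)
    print(a, round(early_drop, 4), round(later_drop, 4))
```

For a = 0.25 the early drop is large and the later drop small, while for a = 4.00 the pattern is reversed, which is the behaviour that motivates combining a small and a large exponent.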
A quick glance at Figure 2.4 reveals that we require $P_{22}(x, x + t)$ for the older
ages to have both the initial and later descents steeper than that of the younger
ages. It is apparent that a better fit can be achieved by combining the properties of
$f_a(t)$ for both high and low values of a. So we propose an enhanced version of $f_a(t)$
for parameterising $P_{22}(x, x + t)$ as follows:
\[
P_{22}(x, x + t) = \frac{1}{1 + a_x \times t^{b_x} + c_x \times t^{d_x}}, \tag{2.3}
\]
where, without loss of generality, we assume $0 < b_x < 1$ and $d_x > 1$. Note that $a_x$
and $c_x$ are scaling parameters.
Clearly by definition, $P_{22}(x, x) = 1$. For each representative age, we have four
data-points (Table 2.1) and four parameters ($a_x$, $b_x$, $c_x$ and $d_x$) to estimate. Solving
these equations, we obtain the values of $a_x$, $b_x$, $c_x$ and $d_x$, given in Table 2.2.

Table 2.2: Parameter estimates.

Age      Representative
range    age              a        b        c        d
<55      50               0.0684   0.1040   0.0174   1.1919
55–64    60               0.1686   0.0911   0.0406   1.2280
65–74    70               0.4001   0.1237   0.0770   1.3370
75–84    80               0.8564   0.1732   0.1476   1.5504
≥85      90               1.5181   0.2431   0.3309   1.6727
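As a sanity check, the parameters of Table 2.2 can be substituted back into Equation (2.3) and compared with the data of Table 2.1. The sketch below is an illustration only: the names `PARAMETERS`, `OBSERVED`, `p22` and `max_abs_error` are hypothetical, and 30 days is taken as 30/365 of a year, which is an assumption since the text does not state the convention used.

```python
# Table 2.2: representative age -> (a, b, c, d).
PARAMETERS = {
    50: (0.0684, 0.1040, 0.0174, 1.1919),
    60: (0.1686, 0.0911, 0.0406, 1.2280),
    70: (0.4001, 0.1237, 0.0770, 1.3370),
    80: (0.8564, 0.1732, 0.1476, 1.5504),
    90: (1.5181, 0.2431, 0.3309, 1.6727),
}
# Table 2.1: survival probabilities at 30 days, 1 year, 5 years and 10 years.
DURATIONS = (30 / 365, 1.0, 5.0, 10.0)   # assumed: 30 days = 30/365 years
OBSERVED = {
    50: (0.949, 0.921, 0.834, 0.737),
    60: (0.880, 0.827, 0.672, 0.528),
    70: (0.771, 0.677, 0.465, 0.312),
    80: (0.641, 0.499, 0.255, 0.133),
    90: (0.545, 0.351, 0.123, 0.052),
}


def p22(age, t):
    """Fitted survival probability of Equation (2.3) at a representative age."""
    a, b, c, d = PARAMETERS[age]
    return 1.0 / (1.0 + a * t ** b + c * t ** d)


def max_abs_error():
    """Largest absolute gap between fitted and tabulated survival probabilities."""
    return max(abs(p22(age, t) - observed)
               for age, row in OBSERVED.items()
               for t, observed in zip(DURATIONS, row))


print(max_abs_error())
```

Because there are as many parameters as data-points at each age, any remaining discrepancy should be of the order of the rounding in the two tables.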
Given the parametric form of $P_{22}(x, x + t)$, the transition intensities $\lambda_{24}(x, t)$ can
be derived using:
\[
\lambda_{24}(x, t) = -\frac{d}{dt} \log P_{22}(x, x + t). \tag{2.4}
\]
For the derivation of the above expressions and the underlying assumptions, please
refer to Appendix A. Hence:
\[
\lambda_{24}(x, t) = \frac{a_x \times b_x \times t^{b_x - 1} + c_x \times d_x \times t^{d_x - 1}}{1 + a_x \times t^{b_x} + c_x \times t^{d_x}}. \tag{2.5}
\]
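The step from Equation (2.4) to Equation (2.5) can be checked numerically by differentiating the logarithm of the fitted survival function. The sketch below is illustrative only: it uses the age 50 parameters from Table 2.2, the function names are hypothetical, and the central finite difference is simply one convenient way of approximating the derivative.

```python
import math

# Parameters (a, b, c, d) for representative age 50, taken from Table 2.2.
AGE_50 = (0.0684, 0.1040, 0.0174, 1.1919)


def p22(t, a, b, c, d):
    """Survival probability of Equation (2.3)."""
    return 1.0 / (1.0 + a * t ** b + c * t ** d)


def lambda24(t, a, b, c, d):
    """Closed-form intensity of Equation (2.5); requires t > 0.

    Since b < 1, the term t ** (b - 1) grows without bound as t tends to 0.
    """
    numerator = a * b * t ** (b - 1) + c * d * t ** (d - 1)
    denominator = 1.0 + a * t ** b + c * t ** d
    return numerator / denominator


def lambda24_numeric(t, a, b, c, d, h=1e-6):
    """Equation (2.4) approximated by a central finite difference."""
    upper = math.log(p22(t + h, a, b, c, d))
    lower = math.log(p22(t - h, a, b, c, d))
    return -(upper - lower) / (2 * h)


for t in (0.5, 1.0, 5.0, 10.0):
    print(t, lambda24(t, *AGE_50), lambda24_numeric(t, *AGE_50))
```

The two columns of output should agree to several decimal places, which confirms that Equation (2.5) is the derivative required by Equation (2.4).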
Using the parameters from Table 2.2, the graphs of P22 (x, x + t) and λ24 (x, t)
are provided in Figures 2.6 and 2.7, respectively. The graphs of λ24 (x, t) for x =
50, 60, 70, 80 and 90 and the transition intensity for both genders from ELT15 are
given in Figure 2.8.
From the graphs, we observe that both P22 (x, x + t) and λ24 (x, t) differ significantly between age-groups. To extend the definition of the transition intensity to all
ages x and durations 0 ≤ t ≤ 10, we first assign λ24 (x, t) for each age-group to its representative age. Then define λ24 (x, t) = λ24 (50, t) for x < 50, λ24 (x, t) = λ24 (90, t)
for x > 90, and interpolate linearly in x between the given values for 50 < x < 90.
Capewell et al. (2000) do not give survival rates more than 10 years after the first
heart attack. For survival rates after more than 10 years, to ensure that the force of
mortality does not drop below general population mortality, we take the maximum
of λ24 (x, t) defined above and the general population mortality given by ELT15.
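The extension to all ages and durations can be sketched as follows. Several points are assumptions made for illustration rather than details given in the text: Equation (2.5) is extrapolated beyond 10 years before the comparison is made, the population mortality is evaluated at the attained age x + t, and `toy_population_mortality` is a made-up Gompertz-style placeholder that does not reproduce ELT15.

```python
import math

# Table 2.2: representative age -> (a, b, c, d).
PARAMETERS = {
    50: (0.0684, 0.1040, 0.0174, 1.1919),
    60: (0.1686, 0.0911, 0.0406, 1.2280),
    70: (0.4001, 0.1237, 0.0770, 1.3370),
    80: (0.8564, 0.1732, 0.1476, 1.5504),
    90: (1.5181, 0.2431, 0.3309, 1.6727),
}


def lambda24_representative(age, t):
    """Equation (2.5) at one of the representative ages; requires t > 0."""
    a, b, c, d = PARAMETERS[age]
    numerator = a * b * t ** (b - 1) + c * d * t ** (d - 1)
    return numerator / (1.0 + a * t ** b + c * t ** d)


def toy_population_mortality(age):
    """Hypothetical placeholder for the ELT15 force of mortality."""
    return 0.00005 * math.exp(0.09 * age)


def lambda24(x, t, population_mortality=None):
    """Intensity for any age x at first heart attack and any duration t > 0."""
    if x <= 50:
        value = lambda24_representative(50, t)
    elif x >= 90:
        value = lambda24_representative(90, t)
    else:
        lower = 50 + 10 * int((x - 50) // 10)   # representative age at or below x
        weight = (x - lower) / 10
        value = ((1 - weight) * lambda24_representative(lower, t)
                 + weight * lambda24_representative(lower + 10, t))
    if t > 10 and population_mortality is not None:
        # Beyond the observed 10 years, never fall below population mortality.
        value = max(value, population_mortality(x + t))
    return value
```

For example, `lambda24(55, 1.0)` is the average of the age 50 and age 60 intensities at duration 1 year, while `lambda24(50, 30.0, toy_population_mortality)` is floored at the placeholder mortality at attained age 80.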
[Figure 2.6: The plots of survival probabilities, P22 (x, x + t), against duration (years) after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years.]
[Figure 2.7: The plots of transition intensities, λ24 (x, t), against duration (years) after heart attacks for age-groups <55, 55–64, 65–74, 75–84, ≥85 years. The intensity axis is logarithmic.]
[Figure 2.8: Graphs of λ24 (x, t), assigned to representative ages for each age group, and the force of mortality of the ELT15 life tables (male and female). Axes: age (years) against transition intensity on a logarithmic scale.]
2.3.4 Discussion of the Fitted Model
First, let us compare the survival probabilities of the fitted model with that of the
general population. The survival probabilities of men and women aged 50, 60, 70, 80
and 90 following ELT15 are shown in Figures 2.9 and 2.10. These can now be
compared with the P22 (x, x + t) given in Figure 2.6.
For all ages, P22 (x, x + t) are lower for all durations as compared to those derived
from ELT15. However, the slope of P22 (x, x + t) is significantly lower than that of
ELT15 for longer durations. This seems to suggest that survival for a long duration
after a first heart attack implies better overall health as compared to the general
population.
We have also plotted P22 (x, x + t) over the first 30 days following a first heart
attack in Figure 2.11. This can be compared with Fig 1 of Brønnum-Hansen et al.
(2001), which gives the graphs of survival probabilities for men and women combined
for all ages over three different time periods. Although not directly comparable, the
graphs show similar features.
Figure 2.12 shows the survival probabilities for hospital survivors calculated from
the P22 (x, x + t). Again we find that these graphs show similar features when
compared with Fig 3 of Brønnum-Hansen et al. (2001) and Figure 2 of Goldberg
et al. (1998).
[Figure 2.9: The plots of survival probabilities of men aged 50, 60, 70, 80 and 90 following ELT15. Axes: duration (years) against survival probability.]
[Figure 2.10: The plots of survival probabilities of women aged 50, 60, 70, 80 and 90 following ELT15. Axes: duration (years) against survival probability.]
[Figure 2.11: The plots of survival probabilities, of individuals aged 50, 60, 70, 80 and 90, over the first 30 days after a first heart attack. Axes: duration (years) against survival probability.]
[Figure 2.12: The plots of survival probabilities of individuals aged 50, 60, 70, 80 and 90, who survived the first 30 days after a first heart attack. Axes: duration (from 1 month to 10 years) against survival probability.]

Table 2.3: Odds of dying within first 30 days, one year and two years following a
first heart attack.

Age        Duration
(years)    30 days   1 year   2 years
<55        1.00      1.00     1.00
55–64      2.35      2.19     2.12
65–74      4.49      4.09     3.80
75–84      7.04      6.34     5.73
≥85        8.92      8.21     7.28

Table 2.4: Adjusted odds ratios and the corresponding 95% confidence intervals of
dying within first 30 days, one year and two years following a first heart attack
according to Goldberg et al. (1998).

Age        Duration
(years)    30 days                1 year                2 years
<55        1.00 (–)               1.00 (–)              1.00 (–)
55–64      1.87 (1.30, 2.68)      1.78 (1.27, 2.51)     1.65 (1.25, 2.18)
65–74      4.00 (2.86, 5.60)      3.00 (2.18, 4.13)     2.83 (2.18, 3.68)
75–84      7.77 (5.55, 10.88)     4.55 (3.28, 6.30)     5.30 (4.05, 6.93)
≥85        11.67 (8.10, 16.81)    8.76 (6.12, 12.54)    10.57 (7.75, 14.42)
Finally, we calculate the odds of dying within the first 30 days, 1 year and 2 years
following a first heart attack. The numbers are given in Table 2.3. Most of these
fall within the 95% confidence intervals given in Tables II and IV of Goldberg et al.
(1998). For reference the relevant numbers are reproduced in Table 2.4.
Based on the discussion above, the proposed model appears to be consistent with
other relevant data relating to survival after first heart attack.
2.4 Mortality Before First Heart Attacks
Going back to the heart attack model proposed in Section 2.1, we have already
parameterised λ12(x) and λ24(x, t). In this section, we will complete the model by
parameterising λ13(x). This is the force of mortality affecting individuals who have
not had a heart attack.

Figure 2.13: 4-state heart attack model - Grouping of states. [Diagram: State 1 = Healthy moves to State 2 = Heart Attack with intensity λ12(x) and to State 3 = Dead with intensity λ13(x); State 2 moves to State 4 = Dead with intensity λ24(x, t).]
To parameterise λ13 (x), we make use of the mortality transition intensity affecting
all individuals in the general UK population. Mortality of the general UK population
is well studied and is analysed separately for males and females, and the latest
intensities are given in ELT15. To make use of ELT15 for our investigation, we need
to make the following observations.
The 4-state heart attack model introduced in Section 2.1 is reproduced in Figure
2.13. Note that individuals are alive in States 1 and 2; and they move to States
3 and 4 when they die. The grouping shown in Figure 2.13 using dashed lines,
produces a simple 2-state mortality model, given in Figure 2.14. Here, States 1 and
2 are combined to produce State 5, the Alive state, while States 3 and 4 are grouped
to form State 6, the Dead state. The resulting transition intensity from State 5
to State 6, λ56 (x), is the force of mortality for the general population as given by
ELT15 for respective genders.
Recalling that the notation Pij (y, z) denotes the conditional probability that a
person is in State j at age z given that he or she was in state i at age y, the
probability of an individual dying before attaining age x in the 2-state mortality
model can be expressed as:
\[
P_{56}(0, x) = 1 - P_{55}(0, x) = 1 - \exp\left( -\int_0^x \lambda_{56}(s)\,ds \right). \tag{2.6}
\]
Note that we can numerically compute the probability P56 (0, x) for all ages x, as
the transition intensity λ56 (x) is known and given by ELT15.
Figure 2.14: A 2-state mortality model. [Diagram: State 5 = Alive moves to State 6 = Dead with intensity λ56(x).]
Going back to our original 4-state heart attack model, we can express the same
probability of dying, in terms of the transition intensities pertinent to that model.
We will assume that all individuals belong to State 1 when they are born. Note that,
according to the definitions of the states in the heart attack model, all individuals
are born healthy, as individuals who are alive and have not suffered a heart attack
are classified as healthy. So in the 4-state heart attack model, the probability of
a person dying before attaining age x is given by:
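\[
P_{56}(0, x) = P_{13}(0, x) + P_{14}(0, x)
= \int_0^x \left[ P_{11}(0, z)\,\lambda_{13}(z) + \int_0^z P_{11}(0, y)\,\lambda_{12}(y)\,P_{22}(y, z)\,\lambda_{24}(y, z - y)\,dy \right] dz, \tag{2.7}
\]
where
\[
P_{11}(0, z) = \exp\left[ -\int_0^z \left( \lambda_{12}(y) + \lambda_{13}(y) \right) dy \right], \text{ and} \tag{2.8}
\]
\[
P_{22}(y, z) = \exp\left[ -\int_0^{z - y} \lambda_{24}(y, s)\,ds \right]. \tag{2.9}
\]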
We see that λ13(x) is the only unknown quantity above. So now we can solve
for λ13(x) numerically using Equation 2.7. The iterative algorithm used to solve
for λ13(x) is outlined below.
(a) For a given age x, let us assume that λ13 (y) is known for all y < x. Based on
this information, we will now solve for λ13 (x).
(b) Set an initial guess for the value of λ13 (x). The better the initial guess, the
faster will be the convergence to the solution. We have used simple linear
extrapolation based on the values of λ13 (x − δ) and λ13 (x − 2δ) for a small value
of δ > 0.
(c) The approximate value of λ13 (x) can then be used to calculate P11 (0, x). We
can now calculate P13 (0, x) + P14 (0, x) using Equation 2.7, assuming that λ13 (x)
and P11 (0, x) are known quantities. P56 (0, x) can be computed independently
and compared with the value of P13 (0, x) + P14 (0, x) thus obtained. Depending
on the magnitude and sign of the difference between these quantities, we can
refine our initial estimate of λ13 (x). Repeat this step with improved estimates
λ13 (x) until convergence is achieved.
(d) The above process can be used to calculate λ13 (x) for different ages progressively,
starting from age 0. As a starting value, we have assumed λ13 (0) = λ56 (0).
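Steps (a) to (c) for a single age can be sketched as follows. This is only an illustration under stated assumptions: the callable `gap(x, lam)` is a hypothetical stand-in for the difference between P13(0, x) + P14(0, x) from Equation 2.7 and P56(0, x) from Equation 2.6, and the secant update is one possible refinement rule (the text does not name the rule actually used).

```python
import math

def solve_lambda13(x, lam_prev2, lam_prev1, gap, tol=1e-12, max_iter=50):
    """Find lam such that gap(x, lam) is approximately zero.

    lam_prev2 and lam_prev1 are the values already obtained at ages
    x - 2*delta and x - delta (step (a)).
    """
    # Step (b): initial guess by linear extrapolation.
    guess = 2.0 * lam_prev1 - lam_prev2
    a, b = lam_prev1, guess
    if a == b:                       # avoid a degenerate starting pair
        b = a * 1.01 + 1e-8
    fa, fb = gap(x, a), gap(x, b)
    # Step (c): refine using the sign and size of the discrepancy
    # (secant rule, an assumption of this sketch).
    for _ in range(max_iter):
        if abs(fb) < tol or fb == fa:
            break
        a, b, fa = b, b - fb * (b - a) / (fb - fa), fb
        fb = gap(x, b)
    return b

# Toy check with a made-up "true" intensity (hypothetical, not from ELT15).
def toy_true(x):
    return 0.001 * math.exp(0.08 * x)

def toy_gap(x, lam):
    return (1.0 - math.exp(-lam)) - (1.0 - math.exp(-toy_true(x)))

lam_50 = solve_lambda13(50.0, toy_true(49.8), toy_true(49.9), toy_gap)
```

Step (d) then amounts to calling such a routine age by age, feeding each solution into the next call.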
In the above steps, we have to compute a number of integrals numerically, for
which we have used Romberg Integration. For a detailed discussion on Romberg
Integration see Press et al. (2002). The integration involving λ24(x, t) in Equation
2.7 requires special treatment. The section of the integral we are interested in is
given below:
\[
I = \int_0^z P_{11}(0, y)\,\lambda_{12}(y)\,P_{22}(y, z)\,\lambda_{24}(y, z - y)\,dy. \tag{2.10}
\]
For convenience, we make a transformation u = z − y, which gives us the following
integral:
\[
I = \int_0^z P_{11}(0, z - u)\,\lambda_{12}(z - u)\,P_{22}(z - u, z)\,\lambda_{24}(z - u, u)\,du. \tag{2.11}
\]
Recall from Section 2.3.3, that for all values of x and t ≤ 10, λ24 (x, t) is of the form
\[
\lambda_{24}(x, t) = \frac{a_x\, b_x\, t^{b_x - 1} + c_x\, d_x\, t^{d_x - 1}}{1 + a_x\, t^{b_x} + c_x\, t^{d_x}}, \tag{2.12}
\]
where 0 < b_x < 1 and d_x > 1. This implies that λ24(x, t) tends to ∞ as t → 0+. Also, the
smaller the value of b_x, the steeper is the initial descent of λ24(x, t). Convergence
is difficult to achieve for numerical integration of an unbounded function. If the
integral exists, it is easier to deal with a transformed integrand which is bounded
within the required range. For the type of function given in Equation 2.12 we can use
a transformation of the form w = u^α, where α < b_x for all x. For our computations,
we have chosen α = 0.05.

Figure 2.15: The graph of the integrand in Equation 2.11. [Plot: integrand (0 to 0.001) against u (0 to 80).]
Using this transformation, Equation 2.11 becomes
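\[
I = \int_0^{z^{\alpha}} P_{11}(0, z - w^{1/\alpha})\,\lambda_{12}(z - w^{1/\alpha})\,P_{22}(z - w^{1/\alpha}, z)\,\lambda_{24}(z - w^{1/\alpha}, w^{1/\alpha})\,\frac{1}{\alpha}\, w^{1/\alpha - 1}\,dw. \tag{2.13}
\]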
We show the effect of this transformation in Figures 2.15 and 2.16. Figure 2.15
gives the graph of the integrand in Equation 2.11 before the transformation and Figure 2.16 shows the graph of the integrand in Equation 2.13 after the transformation.
For both graphs, z has been set to 80.
From the figures we can see that the transformation has successfully converted
the unbounded function in Figure 2.15 to the bounded function in Figure 2.16. Now
we can successfully apply Romberg Integration to evaluate the transformed integral
in Equation 2.13.
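The effect of the substitution w = u^α can be illustrated on a toy integrand. The sketch below is not the thesis code: it uses u^(b−1) with b = 0.5 as a hypothetical stand-in for the singular behaviour of λ24 near zero duration, and a small hand-written Romberg routine rather than the one in Press et al. (2002).

```python
def romberg(f, a, b, max_levels=20, tol=1e-10):
    """Romberg integration: trapezoid refinements plus Richardson extrapolation."""
    table = [[0.5 * (b - a) * (f(a) + f(b))]]
    for k in range(1, max_levels):
        n = 2 ** k
        h = (b - a) / n
        # Only the new (odd-indexed) points need evaluating at each level.
        new_points = sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
        row = [0.5 * table[k - 1][0] + h * new_points]
        for j in range(1, k + 1):
            row.append(row[j - 1] + (row[j - 1] - table[k - 1][j - 1]) / (4 ** j - 1))
        table.append(row)
        if abs(row[-1] - table[k - 1][-1]) < tol:
            return row[-1]
    return table[-1][-1]

B = 0.5       # toy exponent, 0 < B < 1 (assumption of this sketch)
ALPHA = 0.05  # value of alpha quoted in the text, with ALPHA < B
Z = 80.0      # value of z used for Figures 2.15 and 2.16

def transformed_integrand(w):
    # u**(B - 1) du with u = w**(1/ALPHA) becomes (1/ALPHA) * w**(B/ALPHA - 1) dw.
    # It is written in this simplified form so that w = 0 can be evaluated.
    return (1.0 / ALPHA) * w ** (B / ALPHA - 1.0)

integral = romberg(transformed_integrand, 0.0, Z ** ALPHA)
exact = Z ** B / B   # closed form of the toy integral of u**(B - 1) over [0, Z]
```

In the original variable the toy integrand is unbounded at u = 0, so the trapezoid step at the lower limit cannot even be evaluated; after the substitution the integrand is a bounded power of w and the extrapolation converges in a few levels.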
Using the techniques outlined above, we have obtained estimates of λ13(x) for
both males and females. They are shown in Figure 2.17. For comparison, we
have also included the gender-specific forces of mortality given in ELT15.
Figure 2.16: The graph of the integrand in Equation 2.13. [Plot: transformed integrand (0 to 0.1) against w (0 to 1.2).]
Figure 2.17: Transition intensities of non-heart-attack deaths plotted along with
ELT15 for both males and females. [Two panels, one for males and one for females: transition intensity (0.0001 to 1, logarithmic scale) against age (0 to 80 years), each showing ELT15 and non-heart-attack deaths.]
Chapter 3
Gene-Environment Interaction
3.1 Definition of Strata: A Simple Example
The parameters of the heart attack model estimated above are supposed to apply
to the general population. However, the general population is divided into strata
according to genotype, environmental exposures and other factors, and we now
suppose that the intensity of heart attack λs12 (x) depends on the stratum s.
In this chapter, we will introduce the simplest possible gene-environment interactions into our model. We suppose that there is a single genetic locus with two
alleles, denoted G and g, therefore just two genotypes. Also, there are just two
levels of environmental exposures, denoted E and e (a simple example might be E
= ‘smoker’ and e = ‘non-smoker’). This simple model can be used as a stepping
stone to study higher dimensional multifactorial models. Note that the UK Biobank
draft protocol used the same assumptions in its examples, despite the fact that the
project aims to study complex multifactorial disorders. We will suppose that G and
E are adverse exposures, while g and e are beneficial. Therefore, we have four strata
for each sex — ge, gE, Ge and GE — and eight in total.
We must choose plausible values for the frequencies with which each stratum is
present in the population, and the stratum-specific heart attack intensities. Since,
unlike the study of single-gene disorders, we are considering common risk factors
for common diseases, let us assume that the probability that a person possesses
genotype G is 0.1, and the probability that a person has environmental exposure E
is also 0.1. Assuming independence, the four strata (for each sex) ge, gE, Ge and
GE occur with frequencies 0.81, 0.09, 0.09 and 0.01 respectively.

Table 3.5: The factor ρs, in Equation (3.14), for each gene-environment combination.

         E      e
G       1.3    1.1
g       0.9    0.7
We will suppose that the heart attack intensity in each stratum is proportional
to the population average intensity. For stratum s, set:
\[
\lambda^{s}_{12}(x) = k \times \rho_s \times \lambda_{12}(x), \tag{3.14}
\]
where λ12 (x) is the population intensity given in Section 2.2 and k×ρs is the constant
of proportionality for each stratum. We suppose, for clarity, that ρs does not depend
on sex, but the constant k does. Again, noting that our interest is in genotypes of
modest penetrance, we choose the values of ρs given in Table 3.5. Then, we choose k
so that the strata-specific heart attack intensities are consistent, in aggregate, with
the population heart attack intensities, for males and females separately. Let the
proportion of the healthy population in stratum s at age x be ws (x). Then:
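\[
\lambda_{12}(x + t) = \frac{\sum_s w_s(x) \times \exp\left( -\int_0^t \lambda^{s}_{12}(x + y)\,dy \right) \times \lambda^{s}_{12}(x + t)}{\sum_s w_s(x) \times \exp\left( -\int_0^t \lambda^{s}_{12}(x + y)\,dy \right)}. \tag{3.15}
\]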
Substituting Equation (3.14) in Equation (3.15), we get:
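\[
\lambda_{12}(x + t) = \frac{\sum_s w_s(x) \times \left[ \exp\left( -\int_0^t \lambda_{12}(x + y)\,dy \right) \right]^{k \rho_s} k \times \rho_s \times \lambda_{12}(x + t)}{\sum_s w_s(x) \times \left[ \exp\left( -\int_0^t \lambda_{12}(x + y)\,dy \right) \right]^{k \rho_s}}. \tag{3.16}
\]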
From Equation (3.16) we see that k ought to depend on a specific choice of age
x and duration t. However, to keep the model simple we will assume that k is
constant and calculate it from Equation (3.16) for a representative choice of age and
duration. Given that the UK Biobank protocol proposes an age range of 40 to 69 and
a follow-up period of 10 years, we have chosen x = 60 and t = 5. If we assume that
the weights ws(x) are equal to the population frequencies of each stratum, then for
males k = 1.317274 and for females k = 1.316406. The constants of proportionality
(k × ρs) in Equation (3.14) are given in Table 3.6 for future reference.

Table 3.6: The multipliers k × ρs for each stratum.

Stratum    Male     Female
ge         0.922    0.921
gE         1.186    1.185
Ge         1.449    1.448
GE         1.712    1.711

Table 3.7: The true relative risks for each stratum, relative to the baseline ge stratum.

Stratum    Male     Female
ge         1.000    1.000
gE         1.286    1.286
Ge         1.571    1.571
GE         1.857    1.857
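The calculation of k can be sketched as follows. Dividing Equation (3.16) by λ12(x + t) leaves a single equation in k, which depends on the population intensity only through the cumulative hazard Λ = ∫ λ12(x + y) dy over (0, t). The value of Λ used below is purely hypothetical (the fitted λ12 of Section 2.2 is not reproduced here), and bisection is an assumed choice of root-finder.

```python
import math

RHO = {"ge": 0.7, "gE": 0.9, "Ge": 1.1, "GE": 1.3}      # Table 3.5
P_G, P_E = 0.1, 0.1                                     # frequencies of G and E
W = {                                                   # independence assumption
    "ge": (1 - P_G) * (1 - P_E),
    "gE": (1 - P_G) * P_E,
    "Ge": P_G * (1 - P_E),
    "GE": P_G * P_E,
}

def solve_k(cum_hazard, lo=0.5, hi=3.0, tol=1e-12):
    """Solve sum_s w_s (k rho_s - 1) exp(-k rho_s Lambda) = 0 for k by bisection."""
    def g(k):
        return sum(W[s] * (k * RHO[s] - 1.0) * math.exp(-k * RHO[s] * cum_hazard)
                   for s in W)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

k_limit = solve_k(0.0)     # limiting case t -> 0: k = 1 / sum_s w_s rho_s
k_toy = solve_k(0.03)      # hypothetical cumulative hazard, for illustration only

# Table 3.6 follows from the values of k reported in the text.
K_REPORTED = {"Male": 1.317274, "Female": 1.316406}
multipliers = {sex: {s: round(k * RHO[s], 3) for s in RHO}
               for sex, k in K_REPORTED.items()}
```

In the limit of zero duration the equation gives k = 1/0.76 ≈ 1.3158, slightly below the reported values, which is the direction expected once the adverse strata are depleted over a positive duration.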
Having formulated a relationship between strata and the risk of heart attack, we
now consider the quantities likely to be estimated by epidemiologists. We have the
advantage of being able to compute their true values, because we know the true
intensities. From now on, we define the baseline population as the most common
stratum, namely the gene-environment combination ge.
(a) The relative risk in stratum s, with respect to the baseline stratum ge, is denoted
rs and is:
\[
r_s = \frac{k \times \rho_s}{k \times \rho_{ge}} = \frac{\rho_s}{\rho_{ge}}. \tag{3.17}
\]
The values of rs are given in Table 3.7.
(b) The odds ratio at age x in stratum s, with respect to the baseline stratum ge,
based on 1-year probabilities, is denoted ψs (x) and is given by:
\[
\psi_s(x) = \left( \frac{P^{s}_{12}(x, x + 1)}{1 - P^{s}_{12}(x, x + 1)} \right) \Bigg/ \left( \frac{P^{ge}_{12}(x, x + 1)}{1 - P^{ge}_{12}(x, x + 1)} \right) \tag{3.18}
\]
where P^s_12(x, x + 1) is the conditional probability that a person in stratum s
who was healthy at age x will suffer a heart attack before age x + 1.
We have verified (not shown here) that the odds ratios computed using Equation (3.18) do not vary significantly with age and are approximately equal to the
corresponding relative risks. The latter is not surprising, as we have used 1-year
probabilities to calculate the odds ratios.
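The closeness of the 1-year odds ratios to the relative risks can be checked with a small sketch. It assumes, purely for illustration, a constant baseline heart attack intensity over the year and ignores competing mortality; the baseline value is hypothetical and the function names are not from the thesis.

```python
import math

RHO = {"ge": 0.7, "gE": 0.9, "Ge": 1.1, "GE": 1.3}   # Table 3.5

def relative_risk(stratum):
    """Equation (3.17): r_s = rho_s / rho_ge."""
    return RHO[stratum] / RHO["ge"]

def one_year_odds_ratio(stratum, baseline_intensity):
    """Equation (3.18) under a constant-intensity, no-competing-risk assumption."""
    p_s = 1.0 - math.exp(-relative_risk(stratum) * baseline_intensity)
    p_ge = 1.0 - math.exp(-baseline_intensity)
    return (p_s / (1.0 - p_s)) / (p_ge / (1.0 - p_ge))

BASELINE = 0.005   # hypothetical 1-year intensity in stratum ge
comparison = {s: (relative_risk(s), one_year_odds_ratio(s, BASELINE)) for s in RHO}
```

Because the 1-year probabilities are small, the odds p/(1 − p) are close to p itself, so the odds ratio stays within a fraction of a percent of the relative risk.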
For details on relative risks and odds ratios, see Appendix A, or Woodward (1999)
or Breslow and Day (1980).
3.2 A Sample Realisation of UK Biobank
With the parameterised model, we simulated the life histories of 500,000 people
recruited to UK Biobank and followed up for 10 years. The life histories of the first
20 people are shown in Table 3.8. Consider person No.2 in Table 3.8. He is a male
with the adverse allele G and is exposed to the beneficial environment e. He entered
the study in State 1 as a healthy individual at age 58.74. During the follow-up
period he had a heart attack at age 63.89 and moved to State 2. Finally, he died at
age 63.94 and moved to State 4. The numbers of people in each state at the end of
the 10-year follow-up period are given in Table 3.9.
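A life history of the kind shown in Table 3.8 can be generated along the following lines. This is a deliberately simplified sketch: the intensities are taken as constants (hypothetical values), whereas the model of Chapter 2 uses age-dependent and duration-dependent intensities, and the function name is illustrative only.

```python
import random

def simulate_life_history(entry_age, lam12, lam13, lam24, follow_up=10.0):
    """Return [(age, state), ...] for one person, starting healthy in State 1.

    Constant intensities are an assumption of this sketch.
    """
    history = [(entry_age, 1)]
    t_attack = random.expovariate(lam12)   # time to first heart attack
    t_death = random.expovariate(lam13)    # time to death without heart attack
    t_first = min(t_attack, t_death)
    if t_first >= follow_up:
        return history                      # still healthy at end of follow-up
    if t_death < t_attack:
        history.append((entry_age + t_first, 3))
        return history
    history.append((entry_age + t_first, 2))
    t_after = random.expovariate(lam24)     # survival time after heart attack
    if t_first + t_after < follow_up:
        history.append((entry_age + t_first + t_after, 4))
    return history

random.seed(2007)
sample = [simulate_life_history(58.74, 0.01, 0.01, 0.5) for _ in range(1000)]
```

Each history lists the entry age in State 1 followed by any transitions within the 10-year follow-up, in the same layout as the rows of Table 3.8.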
3.3 Epidemiological Analysis
With 500,000 simulated life histories, we can now carry out one or more typical
epidemiological analyses. Apart from the life histories, the following information is
available to the epidemiologist:
(a) the framework of the UK Biobank project;
(b) the structure of the 4-state Heart Attack model given in Section 2.1;
(c) the transition intensities given in Sections 2.2–2.4;
(d) the stratum to which each person is allocated; and
(e) the proportion ws (x) of individuals in each stratum at a particular age x, say
60.
The UK Biobank protocol suggests that the combined effect of environment and
genotype be analysed using matched case-control studies nested within the cohort.
Table 3.8: The simulated life histories of the first 20 (of 500,000) individuals showing
their genders, exposure to environmental factors, genotypes and the times and types
of all transitions made within 10 years.

ID    Sex   E/e   G/g   Age     State   Age     State   Age     State
 1    M     e     g     41.10   1
 2    M     e     G     58.74   1       63.89   2       63.94   4
 3    M     e     g     52.27   1
 4    M     e     g     68.39   1
 5    F     e     G     60.94   1
 6    M     e     g     62.49   1
 7    M     e     g     55.50   1
 8    F     e     G     58.95   1
 9    M     e     g     65.67   1
10    M     e     g     49.79   1
11    F     E     g     45.43   1
12    F     e     g     57.58   1
13    F     e     g     59.68   1
14    F     E     g     55.14   1
15    F     e     g     42.93   1
16    M     e     g     56.23   1
17    F     e     g     62.84   1
18    M     e     g     62.29   1
19    F     e     g     43.69   1
20    M     e     g     45.16   1

Four further first transitions appear in the table, but the rows to which they belong cannot be identified here: age 63.81 to State 2, age 68.18 to State 3, age 61.57 to State 3, and age 69.58 to State 3.
Table 3.9: Number of individuals in each state at the end of the 10-year follow-up
period.

Sex      G/g   E/e   State 1   State 2   State 3   State 4   Total
Male     G     E       1,871       126       356       115     2,468
         G     e      17,579       928     3,219       934    22,660
         g     E      17,588       775     3,236       702    22,301
         g     e     162,474     5,426    29,610     5,002   202,512
Female   G     E       2,178        70       214        52     2,514
         G     e      19,746       397     2,021       408    22,572
         g     E      19,811       367     2,095       330    22,603
         g     e     178,718     2,320    18,891     2,441   202,370
Total                419,965    10,409    59,642     9,984   500,000
In a case-control study, the first step is to define the cases and controls. Here, clearly,
the cases are persons who had first heart attacks during the study period.
In real studies, epidemiologists will face problems such as missing data and cost
constraints, and in most circumstances they will use only a subset of all cases for
their analysis. Here, we have no such problems, unless we choose to model them.
So, in the first instance, we will include all cases in the analysis. Later, we will
consider the more realistic possibility that a subset of all cases is used.
An appropriate matching strategy is particularly important for a matched case-control study. Firstly, we match controls with cases by age. Suppose, for example,
that we are comparing stratum s with the baseline stratum ge, and that a case
entered the study at age x last birthday and had a heart attack at age x + t last
birthday. A matched control is a person chosen randomly from persons in these two
strata who also entered the study at age x last birthday and remained healthy at
least until age x + t + 1 last birthday. Once chosen as a control, that person cannot
be chosen as a control again. As controls are plentiful compared with cases, we will
match 5 controls to each case, called a 1:5 matching strategy. In Section 1.5, we
mentioned that the genotyping of individuals will be done as and when it is required.
So, it might be necessary to genotype a large number of people to ensure that enough
controls are available for a 1:5 case-control study. Other matching strategies with
fewer controls per case will obviously be cheaper to implement.
To calculate odds ratios, we need to group ages sensibly. Note that epidemiological studies often use quite wide age groups, much wider than actuaries are
accustomed to using. We will use 5-year age bands as a reasonable compromise between accuracy and sample size. Note that the definition of the ages of the controls
needs to be adjusted appropriately to maintain consistency. The results are given
in Table 3.10. We can see no particular trend with respect to age, so we calculate
the age-adjusted odds ratio for each stratum (a weighted average of the age-specific
odds ratios, see the Mantel-Haenszel method described in Appendix A or Woodward
(1999)), shown in Table 3.11.
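The age adjustment can be illustrated with the usual Mantel-Haenszel estimator for a set of stratified 2×2 tables. The sketch below assumes one table per age band with counts (a, b, c, d) = (exposed cases, unexposed cases, exposed controls, unexposed controls); the counts shown are made up, and the exact form used for the matched analysis is the one described in Appendix A, which is not reproduced here.

```python
def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio: sum(a*d/n) / sum(b*c/n) over the age bands."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

# Hypothetical counts for two age bands, each with an odds ratio of 2.
toy_tables = [(10, 20, 5, 20), (30, 10, 15, 10)]
pooled = mantel_haenszel_odds_ratio(toy_tables)
```

When every band has the same odds ratio the pooled estimate reproduces it; otherwise bands with more data receive more weight.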
We can compare the estimated age-adjusted odds ratios with the true odds ratios
given in Table 3.7. The estimates are better for strata gE and Ge where the numbers
of cases are higher than in stratum GE. However all the true odds ratios lie within
the 95% confidence intervals given in Table 3.11.

Table 3.10: Odds ratios with respect to the ge stratum as baseline, based on a
1:5 matching strategy using all cases and 5-year age groups. Approximate 95%
Confidence intervals are shown in brackets. There were no cases among females age
45–49 in stratum GE.

Males
Age      gE                     Ge                     GE
40–44    1.043 (0.527,2.065)    2.628 (1.561,4.423)    2.375 (0.712,7.917)
45–49    1.069 (0.816,1.400)    1.670 (1.317,2.118)    1.929 (0.940,3.959)
50–54    1.330 (1.117,1.583)    1.578 (1.336,1.865)    1.725 (1.121,2.654)
55–59    1.358 (1.168,1.579)    1.665 (1.448,1.914)    2.133 (1.486,3.062)
60–64    1.175 (1.020,1.352)    1.708 (1.507,1.935)    1.976 (1.417,2.753)
65–69    1.267 (1.116,1.438)    1.592 (1.416,1.789)    1.721 (1.251,2.368)
70–74    1.362 (1.179,1.574)    1.542 (1.348,1.764)    1.907 (1.334,2.726)
75–79    1.487 (1.160,1.907)    1.534 (1.187,1.983)    1.667 (0.910,3.052)

Females
Age      gE                     Ge                     GE
40–44    1.167 (0.301,4.520)    1.333 (0.463,3.836)    5.000 (0.313,79.942)
45–49    0.944 (0.523,1.702)    1.869 (1.139,3.067)    –
50–54    0.947 (0.659,1.361)    1.298 (0.929,1.814)    4.167 (1.800,9.644)
55–59    1.243 (0.967,1.597)    1.280 (0.999,1.641)    2.324 (1.282,4.211)
60–64    1.634 (1.343,1.988)    1.867 (1.538,2.267)    1.842 (1.112,3.053)
65–69    1.321 (1.111,1.571)    1.601 (1.359,1.887)    2.457 (1.637,3.689)
70–74    1.257 (1.045,1.511)    1.538 (1.296,1.825)    2.354 (1.528,3.626)
75–79    1.203 (0.893,1.620)    1.220 (0.896,1.659)    1.773 (0.788,3.986)

Table 3.11: The age-adjusted odds ratios calculated for both males and females.

Strata   Male                   Female
gE       1.285 (1.209,1.365)    1.298 (1.188,1.418)
Ge       1.625 (1.536,1.719)    1.538 (1.413,1.674)
GE       1.880 (1.620,2.182)    2.250 (1.814,2.790)
3.4 An Actuarial Investigation
The actuary starts with the model of Figure 2.1 in mind, and wishes to estimate the
intensity λs12 (x) for each stratum. We assume, realistically, that the best available
data are the published odds ratios. The ‘estimation’ procedure, therefore, consists of
finding a reasonably robust way to estimate transition intensities from odds ratios.
There is no simple mathematical relationship, so approximations must be made.
Supposing that the actuary knows the rates of heart attack in the general population λ12 (x) (separately for males and females) a simple assumption is that the heart
attack intensity for each stratum is proportional to λ12 (x). In stratum s, define:
\[
\gamma^{s}_{12}(x) = c_s(x) \times \lambda_{12}(x) \tag{3.19}
\]
where γ^s_12(x) is the actuary’s ‘estimate’ of λ^s_12(x). Assuming that the odds ratios
(denoted ψs(x)) are good approximations of the relative risks, which is reasonable
as long as the age groups are not too broad, we have:
\[
\psi_s(x) = \frac{\gamma^{s}_{12}(x)}{\gamma^{ge}_{12}(x)} = \frac{c_s(x)}{c_{ge}(x)} \tag{3.20}
\]
which leads to:
\[
c_s(x) = \psi_s(x) \times c_{ge}(x). \tag{3.21}
\]
As observed from Table 3.10, the odds ratios do not appear to depend strongly on
age. So we further assume that cs(x) is a constant cs for all ages (hence also ψs(x)
is a constant ψs), and therefore:
\[
c_s = \psi_s \times c_{ge} \tag{3.22}
\]
where ψs is the age-adjusted odds ratio. Thus Equation (3.19) becomes:
\[
\gamma^{s}_{12}(x) = c_{ge} \times \psi_s \times \lambda_{12}(x). \tag{3.23}
\]

Table 3.12: The estimated multipliers cs for each stratum.

Stratum    Male     Female
ge         0.918    0.920
gE         1.179    1.194
Ge         1.492    1.415
GE         1.726    2.070
Now Equation (3.16) can be written:
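\[
\lambda_{12}(x + t) = \frac{\sum_s w_s(x) \exp\left( -\int_0^t c_{ge}\, \psi_s\, \lambda_{12}(x + y)\,dy \right) c_{ge}\, \psi_s\, \lambda_{12}(x + t)}{\sum_s w_s(x) \exp\left( -\int_0^t c_{ge}\, \psi_s\, \lambda_{12}(x + y)\,dy \right)}. \tag{3.24}
\]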
Let us assume that at age x = 60, the ws (x) are given by the population frequencies of the respective strata. Now we can solve Equation (3.24) for the multiplier cge
for a particular choice of age x and any duration t. Then we can use Equation (3.22)
to obtain cs for s = gE, Ge and GE. We find (not shown here) that the results
are very similar for different values of t. In Table 3.12, we show the ‘estimated’ cs
for representative age x = 60 and duration t = 5, based on the age-adjusted
odds ratios in Table 3.11. These values can be compared with the true constants
of proportionality of the underlying model given in Table 3.6. They are in good
agreement for strata s = ge, gE and Ge. The agreement for stratum s = GE is not
so good, but it was based on a small number of cases, 241 males and 122 females.
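The step from odds ratios to multipliers can be retraced numerically. The sketch below applies Equation (3.22) to the reported values of c_ge and the age-adjusted odds ratios of Table 3.11, and also evaluates the zero-duration limit of Equation (3.24), c_ge = 1 / Σ w_s ψ_s, as a rough cross-check. The variable names are illustrative, and the limit is only an approximation to the x = 60, t = 5 calculation described in the text.

```python
W = {"ge": 0.81, "gE": 0.09, "Ge": 0.09, "GE": 0.01}   # population frequencies

PSI = {   # age-adjusted odds ratios, Table 3.11 (baseline ge = 1)
    "Male":   {"ge": 1.0, "gE": 1.285, "Ge": 1.625, "GE": 1.880},
    "Female": {"ge": 1.0, "gE": 1.298, "Ge": 1.538, "GE": 2.250},
}
C_GE = {"Male": 0.918, "Female": 0.920}                 # reported in Table 3.12

def estimated_multipliers(sex):
    """Equation (3.22): c_s = psi_s * c_ge."""
    return {s: C_GE[sex] * PSI[sex][s] for s in W}

def c_ge_zero_duration(sex):
    """Limit of Equation (3.24) as t -> 0 (an approximation, not the thesis value)."""
    return 1.0 / sum(W[s] * PSI[sex][s] for s in W)

TABLE_3_12 = {
    "Male":   {"ge": 0.918, "gE": 1.179, "Ge": 1.492, "GE": 1.726},
    "Female": {"ge": 0.920, "gE": 1.194, "Ge": 1.415, "GE": 2.070},
}
```

Small differences in the third decimal place arise because the published c_ge is itself rounded.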
3.5 Premium Rating for Critical Illness Insurance
3.5.1 A Critical Illness Model
The actuary will use the intensities γ^s_12(x) ‘estimated’ in Section 3.4 to calculate CI
insurance premiums. Gutiérrez & Macdonald (2003) obtained the following model
for critical illness insurance based on medical studies and population data. Full
references can be found in that paper. The structure of the model, as outlined
in the paper, is given in Figure 3.18. The relevant transition intensities are listed
below.
Figure 3.18: A full critical illness model for gender s. [Diagram: State 0 = Healthy, with transitions to State 1 = Heart Attack, State 2 = Cancer, State 3 = Stroke, State 4 = Other CI and State 5 = Dead, at intensities μ^s_01(x), μ^s_02(x), μ^s_03(x), μ^s_04(x) and μ^s_05(x) respectively.]
(a) For males, the age-dependent transition intensities governing the incidence of
heart attack are given below:
\[
\mu^{m}_{01}(x) =
\begin{cases}
\exp(-13.2238 + 0.152568x) & \text{if } x \le 44 \\[4pt]
\dfrac{x - 44}{49 - 44} \times \mu^{m}_{01}(49) + \dfrac{49 - x}{49 - 44} \times \mu^{m}_{01}(44) & \text{if } 44 < x < 49 \\[4pt]
-0.01245109 + 0.000315605x & \text{if } x \ge 49
\end{cases}
\tag{3.25}
\]
For females, the age-dependent transition intensities are:
\[
\mu^{f}_{01}(x) = \frac{0.598694 \times 0.15317^{15.6412} \exp(-0.15317x)\, x^{14.6412}}{\Gamma(15.6412)} \tag{3.26}
\]
We also need the 28-day survival factors following heart attacks. This relates to
the common contractual condition, that payment depends on surviving for 28
days. Let p^s_01(x) be the 28-day survival probabilities for gender s, and
q^s_01(x) = 1 − p^s_01(x). For females, at ages 20–80, q^f_01(x) = 0.21, and for males the values
are given in Table 3.13.

Table 3.13: 28-Day mortality rates, q^m_01(x) = 1 − p^m_01(x), for males following heart
attacks.

age x    q^m_01(x)    age x    q^m_01(x)    age x    q^m_01(x)    age x    q^m_01(x)
20–39    0.15         47–52    0.18         58–59    0.21         65–74    0.24
40–42    0.16         53–56    0.19         60–61    0.22         75–79    0.25
43–46    0.17         57       0.20         62–64    0.23         80+      0.26
The 28-day mortality rates given in Table 3.13 can be compared against the
survival probabilities obtained from Capewell et al. (2000) and given in Table
2.1. (Note that the odds ratios given in Table 2.3 are derived from the survival
probabilities in Table 2.1.) As compared with the Capewell et al. (2000) data,
the 28-day mortality rates in Table 3.13 appear slightly higher at younger ages
and lower for older ages. However, to maintain consistency with the CI insurance
model we will use the rates in Table 3.13 to calculate the CI insurance premium
rates.
(b) For males, the age-dependent transition intensities governing the incidence of
cancer are given below:
\[
\mu^{m}_{02}(x) =
\begin{cases}
\exp(-11.25 + 0.105x) & \text{if } x \le 51 \\[4pt]
\dfrac{x - 51}{60 - 51} \times \mu^{m}_{02}(60) + \dfrac{60 - x}{60 - 51} \times \mu^{m}_{02}(51) & \text{if } 51 < x < 60 \\[4pt]
-0.2591585 - 0.01247354x + 0.0001916916x^2 - 8.952933 \times 10^{-7} x^3 & \text{if } x \ge 60
\end{cases}
\tag{3.27}
\]
For females, the age-dependent transition intensities are:
\[
\mu^{f}_{02}(x) =
\begin{cases}
\exp(-10.78 + 0.123x - 0.00033x^2) & \text{if } x < 53 \\[4pt]
-0.01545632 + 0.0003805097x & \text{if } x \ge 53
\end{cases}
\tag{3.28}
\]
(c) For males, the age-dependent transition intensities governing the incidence of
stroke are given below:
\[
\mu^{m}_{03}(x) = \exp(-16.9524 + 0.294973x - 0.001904x^2 + 0.00000159449x^3) \tag{3.29}
\]
For females, the age-dependent transition intensities are:
\[
\mu^{f}_{03}(x) = \exp(-11.1477 + 0.081076x) \tag{3.30}
\]
We need the 28-day survival factors following stroke. Let p^s_03(x) be the 28-day
survival probabilities for gender s, and q^s_03(x) = 1 − p^s_03(x). For both males and
females, q^s_03(x) = 0.002x/0.9.
(d) The transition intensities for other minor causes of critical illnesses amount to
15% of those arising from cancer, heart attack and stroke. So the aggregate rate
of critical illness claims, for gender s, is:
µs (x) = 1.15(µs01 (x) × ps01 (x) + µs02 (x) + µs03 (x) × ps03 (x))
(3.31)
(e) Population mortality rates, given by English Life Tables No. 15 (ELT15) were
adjusted to exclude deaths which would have followed a critical illness insurance
claim.
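Parts of items (a) and (d) translate directly into code. The sketch below implements the male heart attack intensity of Equation (3.25), a lookup of the male 28-day mortality rates in Table 3.13, and the aggregate formula of Equation (3.31). The function names are illustrative, the use of age last birthday for the table lookup is an assumption, and the cancer and stroke intensities are left as inputs.

```python
import math

def mu01_male(x):
    """Male heart attack intensity, Equation (3.25)."""
    if x <= 44:
        return math.exp(-13.2238 + 0.152568 * x)
    if x >= 49:
        return -0.01245109 + 0.000315605 * x
    # Linear interpolation between ages 44 and 49.
    return ((x - 44) / (49 - 44)) * mu01_male(49) + ((49 - x) / (49 - 44)) * mu01_male(44)

# (upper age of band, q) pairs from Table 3.13; ages 80 and over use 0.26.
_Q01_MALE_BANDS = [(39, 0.15), (42, 0.16), (46, 0.17), (52, 0.18), (56, 0.19),
                   (57, 0.20), (59, 0.21), (61, 0.22), (64, 0.23), (74, 0.24),
                   (79, 0.25)]

def q01_male(x):
    """28-day mortality after heart attack for males (Table 3.13)."""
    age = int(x)                    # age last birthday (assumption of this sketch)
    if age < 20:
        raise ValueError("Table 3.13 starts at age 20")
    for upper, q in _Q01_MALE_BANDS:
        if age <= upper:
            return q
    return 0.26

def aggregate_ci_rate(mu01, p01, mu02, mu03, p03):
    """Equation (3.31): 15% loading for other critical illnesses."""
    return 1.15 * (mu01 * p01 + mu02 + mu03 * p03)
```

For example, `aggregate_ci_rate(mu01_male(60), 1 - q01_male(60), mu02, mu03, p03)` gives the male aggregate claim intensity at age 60 once values for the cancer and stroke terms are supplied.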
3.5.2 Premium Rating for Critical Illness Insurance
We will assume that all intensities except those for heart attack are as given here.
For heart attack, we use the intensities γ^s_12(x). We compute expected present values
by solving Thiele’s differential equations numerically, with a force of interest of
δ = 0.044017 (see Norberg (1995)).

Table 3.14: The true critical illness insurance premiums for different strata as a
percentage of those for stratum ge.

                     Males, term (years)            Females, term (years)
Stratum   Age      5      15     25     35         5      15     25     35
gE        45      112%   111%   109%   107%       103%   103%   104%   104%
          55      110%   108%   107%              104%   105%   105%
          65      107%   106%                     105%   106%
          75      106%                            106%
Ge        45      124%   121%   117%   115%       105%   107%   108%   108%
          55      119%   116%   114%              109%   110%   110%
          65      114%   112%                     111%   111%
          75      111%                            111%
GE        45      136%   131%   126%   122%       108%   110%   112%   112%
          55      129%   124%   121%              113%   115%   115%
          65      120%   118%                     116%   117%
          75      117%                            117%
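The premium calculation by Thiele's equations can be sketched for a stripped-down policy. The example assumes a single healthy state that is left either by a critical illness claim (paying a unit sum assured) or by death (paying nothing), uses a classical fourth-order Runge-Kutta scheme integrated backwards from the end of the term, and takes constant toy intensities so that a closed form is available for checking. None of these choices is claimed to match the thesis implementation.

```python
import math

DELTA = 0.044017   # force of interest quoted in the text

def thiele_epv(age, term, mu_claim, mu_death, benefit, rate, delta=DELTA, steps=2000):
    """EPV at outset of a lump sum `benefit` on claim plus a payment stream `rate`
    while healthy, from dV/dt = delta*V - rate - mu_claim*(benefit - V) + mu_death*V
    with V(term) = 0."""
    def dv(t, v):
        x = age + t
        return delta * v - rate - mu_claim(x) * (benefit - v) + mu_death(x) * v
    h = -term / steps               # negative step: integrate from t = term to 0
    t, v = term, 0.0
    for _ in range(steps):
        k1 = dv(t, v)
        k2 = dv(t + h / 2, v + h * k1 / 2)
        k3 = dv(t + h / 2, v + h * k2 / 2)
        k4 = dv(t + h, v + h * k3)
        v += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return v

# Toy constant intensities (hypothetical values, for checking only).
def toy_claim(x):
    return 0.01

def toy_death(x):
    return 0.005

epv_benefit = thiele_epv(45, 15, toy_claim, toy_death, benefit=1.0, rate=0.0)
epv_annuity = thiele_epv(45, 15, toy_claim, toy_death, benefit=0.0, rate=1.0)
premium_rate = epv_benefit / epv_annuity   # level premium per unit sum assured
```

With constant intensities the level premium rate payable continuously equals the claim intensity itself, which gives a convenient sanity check on the integrator.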
Table 3.14 shows the true premiums for the strata s = gE, Ge and GE, as a percentage of the premiums for stratum ge, for males and females and for different ages
and terms. Here, ‘true’ means that they have been computed using the intensities
of Chapter 2, not the actuary’s estimates. Table 3.15 then shows the corresponding
premiums, again as a percentage of those charged for stratum ge, using the actuary’s estimates from Section 3.4. The results are similar to those in Table 3.14.
Comparing the actuary’s estimates with the true CI insurance premiums, we can
see that the estimates are very good for stratum gE. For stratum Ge, the estimates
are also within ±2% of the true values. However, the estimates are not as accurate
for females in stratum GE. As mentioned before, stratum GE had relatively few
cases resulting in high volatility of the estimated values.
Table 3.15: The actuary’s estimated critical illness insurance premiums for different
strata as a percentage of those for stratum ge.

                     Males, term (years)            Females, term (years)
Stratum   Age      5      15     25     35         5      15     25     35
gE        45      112%   110%   109%   107%       103%   104%   104%   104%
          55      110%   108%   107%              105%   105%   105%
          65      107%   106%                     106%   106%
          75      106%                            106%
Ge        45      126%   123%   119%   116%       105%   106%   108%   108%
          55      121%   117%   115%              108%   109%   109%
          65      115%   113%                     110%   110%
          75      112%                            111%
GE        45      137%   132%   126%   123%       111%   115%   118%   118%
          55      129%   124%   121%              119%   121%   121%
          65      121%   119%                     124%   124%
          75      117%                            125%
Chapter 4
UK Biobank Simulation Results
4.1 Varying the Genetic and Environment Model
In the last chapter, we estimated parameters of a heart attack model and the resulting CI insurance premiums, based on a simulated realisation of UK Biobank. The
underlying ‘true’ model (chosen by us) was particularly simple — two genotypes,
two environmental exposures and proportional hazards of heart attack — and by
great good luck, our model epidemiologist hit upon exactly the correct hypotheses
in fitting his/her model. So it is not surprising that he/she obtained good parameter
estimates, with the possible exception of those in respect of the smallest stratum,
GE.
In reality, the epidemiologist faces more difficult problems:
(a) There is likely to be more than one gene, many with more than two variants,
as candidates for influencing the disease.
(b) Similarly, there are likely to be several environmental exposures of interest.
(c) Model mis-specification is always possible (indeed, it may be the norm).
(d) On grounds of cost, the number of cases and the number of controls per case
may be limited.
(e) As mentioned earlier, UK Biobank is a single unrepeatable sample, hence sampling error is present. Although 500,000 seems like a huge sample, it may not
be when smaller numbers of cases are sampled from within it.
In a simulation study, we are in a position to explore these problems. In particular,
we can address (d) and (e) above, because we can replicate the entire UK
Biobank dataset many times, and repeat the epidemiological and actuarial analyses
using each realisation. Thus we can estimate the sampling distributions of parameter
estimates and premium rates, while the analysis of the single realisation in Chapter
3 only gave us point estimates of the latter. (We did give approximate confidence
intervals of the estimated odds ratios, because they can be derived on theoretical
grounds. This is not possible for such a complicated function of the model parameters as a premium rate, and simulation is one of the few practical approaches.) We
concentrate on this question in the rest of this thesis, because it is directly relevant
to the approach adopted by GAIC in the UK, and likely to be adopted by similar
bodies elsewhere, which demands that the reliability of prognoses based on genetic
information must be demonstrated if it is to be used in any way. In the case of
multifactorial disorders, we assume that this requirement is to be interpreted in the
statistical sense rather than as applying to individual applicants. Our exploration
of (a), (b) and (c) above will be the subject of a future paper.
In addition to simulating many replications of UK Biobank, we will consider the
effect of stronger or weaker genetic and environmental effects, and of more common
and less common adverse genotypes. We call each such variant of the underlying
model a ‘scenario’, which should not be confused with the simulation procedure
discussed above. We will hold each scenario fixed, and then simulate outcomes of
UK Biobank under those assumptions.
We have already introduced one set of assumptions in Section 3.1, which we will
refer to as our Base scenario. The details of all the scenarios are given in Table 4.16.
The parameters that must be specified are:
(a) The population frequency of each stratum (the same for males and females).
(b) The parameters k for each sex and ρs for each stratum. Although ρs does not
depend on sex, for convenience Table 4.16 shows the combined constants of
proportionality k × ρs for each sex.
Although the odds ratios are derived quantities rather than parameters, they are
also shown in Table 4.16 for convenience.
Table 4.16: The model parameters for different scenarios. Odds ratios are also
shown.

                                      Penetrance              Frequency
Parameter       Stratum   Base        Low         High        Low         High
Population      ge        0.81        0.81        0.81        0.9025      0.64
frequency       gE        0.09        0.09        0.09        0.0475      0.16
                Ge        0.09        0.09        0.09        0.0475      0.16
                GE        0.01        0.01        0.01        0.0025      0.04
ρs              ge        0.70        0.85        0.55        0.70        0.70
                gE        0.90        0.95        0.85        0.90        0.90
                Ge        1.10        1.05        1.15        1.10        1.10
                GE        1.30        1.15        1.45        1.30        1.30
k (Male)        All       1.317274    1.136603    1.568090    1.370745    1.221620
k (Female)      All       1.316406    1.136463    1.564821    1.370230    1.220385
k × ρs          ge        0.922       0.966       0.862       0.960       0.855
(Male)          gE        1.186       1.080       1.333       1.234       1.099
                Ge        1.449       1.193       1.803       1.508       1.344
                GE        1.712       1.307       2.274       1.782       1.588
k × ρs          ge        0.921       0.966       0.861       0.959       0.854
(Female)        gE        1.185       1.080       1.330       1.233       1.098
                Ge        1.448       1.193       1.800       1.507       1.342
                GE        1.711       1.307       2.269       1.781       1.587
Odds Ratio      ge        1.000       1.000       1.000       1.000       1.000
                gE        1.286       1.118       1.545       1.286       1.286
                Ge        1.571       1.235       2.091       1.571       1.571
                GE        1.857       1.353       2.636       1.857       1.857

The Low and High Penetrance scenarios assume smaller and larger differences,
respectively, between the effects of the different strata, governed by ρs. The Low
and High Frequency scenarios assume that disadvantageous G genotype and E
environment have population frequencies half (0.05) or double (0.2) those in the baseline
scenario (0.1), respectively.
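The derived rows of Table 4.16 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the stratum frequencies are products of independent marginal frequencies for G and E, and that the tabulated odds ratios equal ρs divided by the value of ρs for stratum ge. Both assumptions agree with the tabulated figures to the precision shown, but neither is stated explicitly in this section. The function names are hypothetical.

```python
# Base-scenario parameters copied from Table 4.16.
RHO_BASE = {"ge": 0.70, "gE": 0.90, "Ge": 1.10, "GE": 1.30}
K_MALE_BASE = 1.317274


def stratum_frequencies(p_g, p_e):
    """Stratum frequencies, ASSUMING genotype and environment are independent."""
    return {
        "ge": (1 - p_g) * (1 - p_e),
        "gE": (1 - p_g) * p_e,
        "Ge": p_g * (1 - p_e),
        "GE": p_g * p_e,
    }


def derived_quantities(rho, k):
    """Return (k x rho_s, rho_s / rho_ge) for each stratum."""
    scaled = {s: k * r for s, r in rho.items()}
    # ASSUMPTION: the tabulated odds ratio behaves like rho_s / rho_ge.
    ratios = {s: r / rho["ge"] for s, r in rho.items()}
    return scaled, ratios


base_freq = stratum_frequencies(0.10, 0.10)
low_freq = stratum_frequencies(0.05, 0.05)
scaled, ratios = derived_quantities(RHO_BASE, K_MALE_BASE)
```

Rounding `scaled` and `ratios` to three decimals gives the k × ρs (Male) and Odds Ratio entries of the Base column.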
In Section 3.3, we noted that problems like missing values and cost constraints
might limit the number of cases that can be used for analysis. So we will also examine
the effect of limiting the number of cases used in the analysis. From Table 3.9,
around 20,000 individuals were eligible to be considered as cases (in that particular
realisation). For each scenario, we will show results based on 1,000, 2,500, 5,000 and
10,000 cases as well as those based on all cases.
4.2 Outcomes of 1,000 Simulations: The Base Scenario
We will make 1,000 simulations of UK Biobank. The outcomes will be the empirical
distributions of the parameters of the epidemiologist’s model, and of CI insurance
premium rates. Let us first consider the Base scenario, all cases included, for males
aged 45 taking out a CI insurance policy with term 15 years. Figure 4.19 shows
scatter plots of the CI insurance premium rates per unit sum assured for strata gE,
Ge and GE versus those of ge. More precisely, the outcome of the $i$th simulation is a
drawing $p^i = (p^i_{ge}, p^i_{gE}, p^i_{Ge}, p^i_{GE})$ from the sampling distribution of the 4-dimensional
random variable $P = (P_{ge}, P_{gE}, P_{Ge}, P_{GE})$, where $P_s$ is the premium rate in stratum
$s$.
The scatter plots show clearly that the premium rate pairs $(P_{ge}, P_{gE})$ and
$(P_{ge}, P_{Ge})$ are more strongly correlated than the pair $(P_{ge}, P_{GE})$. This is true, as
the correlation matrix given in Table 4.17 shows, but note that the scale of the
x-axis is greatly compressed compared with that of the y-axis. The reason they
are correlated is that, as outlined in Section 3.4, the actuary uses the three odds
ratios published by the epidemiologist, plus the overall population intensity of heart
attack, to obtain the heart attack intensities for the four strata, so the four premium
estimates are not independent. The reason that the correlations are negative is that
the overall level of the four intensities is adjusted so that their aggregate effect is
consistent with the general population. So, if the intensities in any of the strata are
high, the intensities in the others will tend to fall to restore consistency with the
aggregate intensity.

Figure 4.19: Scatter plots of CI insurance premium rates for strata gE, Ge and
GE versus that of ge under the Base scenario for males aged 45 and policy term 15
years. [Plot not reproduced. x-axis: Pge, from about 0.00865 to 0.00880; y-axis: PgE,
PGe and PGE, from about 0.009 to 0.013; one series for each of the pairs (Pge, PgE),
(Pge, PGe) and (Pge, PGE).]
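The rescaling argument behind the negative correlations can be illustrated with a deliberately simplified sketch. It is not the procedure of Section 3.4: it assumes that the odds ratios can be treated as relative intensities and that the population intensity is the frequency-weighted mean of the stratum intensities, ignoring any effect of differential survival. The function name and the population intensity of 0.01 are hypothetical.

```python
def stratum_intensities(pop_intensity, odds_ratios, frequencies):
    """Scale relative risks so that their weighted mean equals pop_intensity.

    ASSUMPTION: odds ratios are used directly as relative intensities.
    """
    weighted = sum(frequencies[s] * odds_ratios[s] for s in odds_ratios)
    base = pop_intensity / weighted  # intensity in the reference stratum ge
    return {s: base * r for s, r in odds_ratios.items()}


FREQ = {"ge": 0.81, "gE": 0.09, "Ge": 0.09, "GE": 0.01}
OR_TRUE = {"ge": 1.0, "gE": 1.286, "Ge": 1.571, "GE": 1.857}
# A hypothetical simulation in which the gE odds ratio is overestimated.
OR_HIGH_GE = dict(OR_TRUE, gE=1.5)

POP_INTENSITY = 0.01  # illustrative value only
reference = stratum_intensities(POP_INTENSITY, OR_TRUE, FREQ)
perturbed = stratum_intensities(POP_INTENSITY, OR_HIGH_GE, FREQ)
```

In `perturbed`, the intensity for gE rises while that for ge falls, because the weighted mean is held fixed. Since ge carries 81% of the population, most of the adjustment is absorbed there, which is in line with the larger negative correlations between ge and the other strata.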
Table 4.17: The correlation matrix of the strata-specific premium rates for males
aged 45 and policy term 15 years under the Base scenario, all cases included.

Stratum   ge        gE        Ge        GE
ge        1.000
gE        −0.604    1.000
Ge        −0.656    −0.123    1.000
GE        −0.194    −0.057    −0.095    1.000

Table 4.18: The correlation matrix of the premium ratings for males aged 45 and
policy term 15 years under the Base scenario, all cases included.

Stratum   RgE       RGe       RGE
RgE       1.000
RGe       0.095     1.000
RGE       0.013     −0.018    1.000

Figure 4.20: The scatter plots of the premium ratings Ge/ge and GE/ge versus
gE/ge and the corresponding density plots for males aged 45 and policy term 15
years under the Base scenario, all cases included. [Plots not reproduced. Scatter
panel: x-axis RgE, about 1.05 to 1.20; y-axis RGe and RGE, 1.0 to 1.8; series
(RgE, RGe) and (RgE, RGE). Density panel: x-axis Premium Ratings, 1.0 to 1.8;
y-axis Density, 0 to 30; curves for RgE, RGe and RGE.]

We also consider the premium rates for strata gE, Ge and GE as a proportion
of those for stratum ge, namely $P_{gE}/P_{ge}$, $P_{Ge}/P_{ge}$ and $P_{GE}/P_{ge}$. These correspond
to premium ratings, if we take the standard premium rate to be that of stratum
ge, and we will refer to them as such. For brevity, define $R_s = P_s / P_{ge}$ to be the
premium rating for stratum $s$ with respect to stratum ge. The correlation matrix
of these premium ratings is given in Table 4.18 and the corresponding scatter plots
are given in Figure 4.20. Both suggest correlations are small enough to neglect,
which means that instead of always considering the full joint distribution of the
premiums P , we can obtain all the information of interest by separate examination
of the marginal distributions of the premium ratings. The densities of these marginal
distributions are given in Figure 4.20. This immediately suggests a simple approach
to the questions that GAIC must ask, because the reliability of the premium rating
in each stratum — in terms of its distinguishability from the premium ratings in
the other strata — is revealed by the degree to which its marginal density overlaps
the marginal densities of the others. Presented with Figure 4.20, we might expect
GAIC to agree that strata Ge and GE had premium ratings distinct from that of
stratum gE, but to ask whether or not they had premium ratings reliably distinct
from each other.
4.3 A Measure of Confidence
Our precise formulation of the question that GAIC might now ask is: are the
marginal empirical distributions of premium ratings in different strata sufficiently
different to support charging different premiums (when doing so is allowed)? Hence,
we need some kind of measure of confidence in distinguishing one stratum from
another in terms of CI insurance premium ratings.
Statisticians normally use non-parametric tests, like the Kolmogorov-Smirnov
test, to check whether two underlying one-dimensional probability distributions differ
from one another by comparing their empirical distribution functions. However,
these types of test cannot be applied in a simulation exercise, as the power of
Kolmogorov-Smirnov type tests increases as the number of observations available for
each distribution increases. In a simulation exercise, it is possible to generate a large
number of estimates by repeating the experiment any number of times, thus
artificially increasing the power of the test. As a consequence, the Kolmogorov-Smirnov
test could not be used to distinguish one risk stratum from another. In the
remainder of this section, we will suggest a simple alternative measure to achieve
this.
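The objection to the Kolmogorov-Smirnov test can be made concrete with a small sketch. It compares two normal distributions whose means differ by 0.1 (an arbitrary illustrative choice), represents each "sample" by evenly spaced quantiles so that the result is deterministic, and uses the standard asymptotic approximation to the two-sample p-value. All function names are hypothetical, and nothing here is taken from the UK Biobank simulations.

```python
import math
from bisect import bisect_right
from statistics import NormalDist


def ks_statistic(xs, ys):
    """Largest absolute gap between the two empirical distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, u) / len(xs) - bisect_right(ys, u) / len(ys))
        for u in xs + ys
    )


def ks_pvalue(d, n, m):
    """Asymptotic two-sample p-value (Kolmogorov series approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.1:  # the truncated series is unreliable here; p is essentially 1
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2 * total))


def quantile_sample(dist, n):
    """Deterministic stand-in for a sample: n evenly spaced quantiles."""
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def pvalue_for(n):
    xs = quantile_sample(NormalDist(0.0, 1.0), n)
    ys = quantile_sample(NormalDist(0.1, 1.0), n)  # illustrative shift of 0.1
    return ks_pvalue(ks_statistic(xs, ys), n, n)
```

The distance between the two distributions is the same in both calls, yet `pvalue_for(100)` gives no evidence of a difference while `pvalue_for(10_000)` rejects equality decisively. The verdict therefore depends on how many simulations are run, not on how far apart the distributions are.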
Let $X$ and $Y$ be two continuous random variables with cumulative distribution
functions $F_X$ and $F_Y$ respectively. We can find $u$ such that $F_X(u) + F_Y(u) = 1$. If
the ranges of $X$ and $Y$ overlap, $u$ lies in both and is unique, otherwise any $u$ that
lies between their ranges will do. This can be rewritten as $F_X(u) = 1 - F_Y(u)$, or
$P[X \le u] = P[Y > u]$.
Without loss of generality, let us also assume that $F_X(u) \ge F_Y(u)$. Let us
define our measure of confidence to be $2 \times F_X(u) - 1$, which gives a measure of the
overlap of $F_X$ and $F_Y$. Denote this $O(X, Y)$, or just $O$ if the context is clear. If
$F_X(u) = F_Y(u) = 0.5$, then we are as unsure as we can be that $F_X$ and $F_Y$ are
distinct, and $O = 0$. As $F_X(u)$ increases to 1, the area of overlap decreases. If the
ranges of $X$ and $Y$ do not overlap at all, $F_X(u) = 1$ and we have high confidence in
deciding that $F_X$ and $F_Y$ are distinct; in this case $O = 1$. In this sense, $O$ measures
how confident the underwriter can be that the two distributions are different.
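As a minimal sketch, the measure O defined above could be estimated from two sets of simulated values as follows. The treatment of empirical (step) distribution functions is an assumption of this sketch: because their sum may jump past 1 without equalling it, the code takes the first pooled observation u at which the sum reaches 1 and returns the gap between the two distribution functions there, which equals 2 × FX(u) − 1 whenever the sum is exactly 1. The function name is hypothetical.

```python
from bisect import bisect_right


def overlap_measure(xs, ys):
    """Estimate O(X, Y) from two samples of simulated values."""
    xs, ys = sorted(xs), sorted(ys)
    nx, ny = len(xs), len(ys)
    if nx == 0 or ny == 0:
        raise ValueError("both samples must be non-empty")
    for u in sorted(xs + ys):
        cx = bisect_right(xs, u)  # number of x-values <= u
        cy = bisect_right(ys, u)  # number of y-values <= u
        # F_X(u) + F_Y(u) >= 1, tested in integer arithmetic to avoid rounding.
        if cx * ny + cy * nx >= nx * ny:
            return abs(cx / nx - cy / ny)


disjoint = overlap_measure([1, 2, 3], [10, 11, 12])
identical = overlap_measure([1, 2, 3, 4], [1, 2, 3, 4])
partial = overlap_measure([1, 2, 3, 4], [3, 4, 5, 6])
```

The three calls cover the limiting cases described in the text: samples with disjoint ranges give O = 1, identical samples give O = 0, and partially overlapping samples fall in between.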
4.4 Results
In this section, we simulate 1,000 realisations of UK Biobank under each scenario
outlined in Table 4.16. Our aim is to examine how reliably UK Biobank might
identify differences in premium ratings, as a body like GAIC might require. This is
measured by the three quantities O(RgE , RGe ), O(RGe , RGE ) and O(RgE , RGE ). We
have verified (not shown here) that these do not vary significantly by age or policy
term, so in Table 4.19, we present results for a representative policy for males aged
45 and policy term 15 years.

Table 4.19: The measure of overlap O for CI insurance premium ratings for males
aged 45, with policy term 15 years, for different scenarios.

Scenario          Cases     O(RgE, RGe)   O(RgE, RGE)   O(RGe, RGE)
Base              All       1.000         1.000         0.924
                  10,000    0.968         0.962         0.632
                  5,000     0.872         0.850         0.484
                  2,500     0.718         0.698         0.356
                  1,000     0.490         0.416         0.176
Low Penetrance    All       0.918         0.904         0.572
                  10,000    0.662         0.658         0.346
                  5,000     0.528         0.472         0.216
                  2,500     0.412         0.360         0.148
                  1,000     0.250         0.222         0.076
High Penetrance   All       1.000         1.000         0.992
                  10,000    1.000         0.998         0.844
                  5,000     0.984         0.970         0.692
                  2,500     0.906         0.886         0.540
                  1,000     0.688         0.658         0.354
Low Frequency     All       0.996         0.948         0.658
                  10,000    0.892         0.706         0.352
                  5,000     0.712         0.516         0.208
                  2,500     0.566         0.322         0.060
                  1,000     0.386         0.394         0.226
High Frequency    All       1.000         1.000         0.994
                  10,000    0.988         1.000         0.896
                  5,000     0.932         0.986         0.744
                  2,500     0.806         0.902         0.546
                  1,000     0.594         0.716         0.358
Note that it is impossible to calculate an odds ratio for a given age group unless
there is at least one case in that age group in each stratum. In some circumstances
some of the 1,000 simulations failed this criterion, and these were omitted from the
results in Table 4.19. Those affected were the Base and the Low Penetrance scenarios
with 1,000 cases (1 simulation omitted in each case) and the Low Frequency scenarios
with 2,500 and 1,000 cases (10 and 238 simulations omitted, respectively). We make
the following comments on Table 4.19:
(a) We saw in Section 4.2 that under the Base scenario, all cases included, the
densities of RGe and RGE overlap over a small region. This qualitative observation
is made more concrete by Table 4.19, which shows that O(RGe , RGE ) = 0.924 in
this case. By the definition O = 2 × FX (u) − 1, this means that there exists x such
that P[RGe ≤ x] = P[RGE > x] = (1 + 0.924)/2 = 0.962, and we (or GAIC) may have
high confidence in assigning these strata to different underwriting groups.
(b) Stratum GE is always the smallest, so the distribution of RGE is always the
most spread out. This is also evident from the scatter plots in Figure 4.20.
(c) We expect real case-control studies to use only a subset of cases, and Table 4.19
shows that the effect of this is very great. For example, in the Base scenario,
O(RGe , RGE ) falls from 0.924 to 0.176 as the number of cases used falls from ‘All’
to 1,000. Figure 4.21 shows, for the Base scenario, the marginal densities with
different numbers of cases. The densities overlap considerably if the number of
cases is small (and bear in mind that 1,000 cases is not a very small investigation
by normal standards).
(d) Figure 4.22 shows the empirical distribution functions of the premium ratings
for males under the Base scenario. For each premium rating, we show the effect
of using different numbers of cases. For example, if only 1,000 cases were used,
there is about a 30% chance that underwriters would incorrectly assume RGE
to be 150% or higher. If instead 10,000 cases were used the chance of making
this error is very small.
(e) Figure 4.23 shows, for 5,000 cases, the effect of the different scenarios. Reduced
frequency of the adverse genetic and environmental exposures, or reduced penetrance of the adverse genotype, both reduce the ability to discriminate between
different underwriting classes. Changes in the opposite direction improve the
discrimination. This qualitative observation is backed up in a more quantitative
way by Table 4.19.
Figure 4.21: Marginal densities of premium ratings in the Base scenario (males)
with different numbers of cases in the case-control study. [Plots not reproduced.
Five panels: Base scenario with All, 10,000, 5,000, 2,500 and 1,000 cases. Each panel
shows the densities of RgE, RGe and RGE; x-axis Premium Ratings, 1.0 to 1.8;
y-axis Density, 0 to 30.]

Figure 4.22: The empirical cumulative distribution function of the premium ratings
gE/ge, Ge/ge and GE/ge for males aged 45 and policy term 15 years under the
Base scenario. [Plots not reproduced. Three panels: RgE, RGe and RGE. Each panel
shows one curve for each of All, 10,000, 5,000, 2,500 and 1,000 cases; x-axis premium
rating, 1.0 to 2.5; y-axis Cumulative Distribution Function, 0.0 to 1.0.]

Figure 4.23: Marginal densities of premium ratings in different scenarios (males),
with 5,000 cases in the case-control study. [Plots not reproduced. Five panels: Base,
Low Penetrance, High Penetrance, Low Frequency and High Frequency, each with
5,000 cases. Each panel shows the densities of RgE, RGe and RGE; x-axis Premium
Ratings, 1.0 to 1.8; y-axis Density, 0 to 30.]

Table 4.20: The measure of overlap O for CI insurance premium ratings for females
aged 45 with policy term 15 years, for different scenarios.

Scenario          Cases     O(RgE, RGe)   O(RgE, RGE)   O(RGe, RGE)
Base              All       0.990         0.990         0.734
                  10,000    0.958         0.948         0.626
                  5,000     0.850         0.844         0.494
                  2,500     0.728         0.706         0.378
                  1,000     0.466         0.488         0.244
Low Penetrance    All       0.778         0.762         0.402
                  10,000    0.680         0.646         0.302
                  5,000     0.528         0.506         0.222
                  2,500     0.392         0.326         0.122
                  1,000     0.238         0.198         0.078
High Penetrance   All       1.000         1.000         0.906
                  10,000    1.000         0.998         0.836
                  5,000     0.992         0.984         0.696
                  2,500     0.914         0.884         0.484
                  1,000     0.716         0.656         0.320
Low Frequency     All       0.932         0.800         0.436
                  10,000    0.896         0.676         0.298
                  5,000     0.748         0.486         0.192
                  2,500     0.552         0.340         0.134
                  1,000     0.406         0.374         0.218
High Frequency    All       0.998         1.000         0.922
                  10,000    0.994         1.000         0.884
                  5,000     0.922         0.986         0.756
                  2,500     0.814         0.914         0.576
                  1,000     0.598         0.678         0.348

Table 4.20 gives the corresponding results for females. When a fixed number of
cases is used the results are very similar to those for males. This is as expected,
as we assumed that the effects of genotype and environmental exposures were the
same for males and females, albeit acting on different baseline risks of heart attack.
However, when all cases are included, the values of O are smaller than those for
males. This is because the lower incidence of heart attack among females results in
fewer cases, and therefore in estimates with higher variances.
Until now, we have used a 1:5 matching strategy for all case-control studies; that
is, five controls per case. However, cost constraints might dictate the use of fewer
controls. In Table 4.21, we show the values of O for males if a 1:1 matching strategy
is used. As expected, these are significantly lower under all scenarios.
Table 4.21: The measure of overlap O for CI insurance premium ratings for males
aged 45, with policy term 15 years, for different scenarios and a 1:1 matching strategy.

Scenario          Cases     O(RgE, RGe)   O(RgE, RGE)   O(RGe, RGE)
Base              All       0.990         0.990         0.774
                  10,000    0.886         0.872         0.454
                  5,000     0.740         0.720         0.374
                  2,500     0.554         0.544         0.248
                  1,000     0.378         0.400         0.222
Low Penetrance    All       0.808         0.820         0.456
                  10,000    0.558         0.526         0.220
                  5,000     0.372         0.378         0.188
                  2,500     0.288         0.308         0.184
                  1,000     0.232         0.204         0.048
High Penetrance   All       1.000         1.000         0.908
                  10,000    0.988         0.978         0.680
                  5,000     0.898         0.902         0.494
                  2,500     0.762         0.742         0.366
                  1,000     0.548         0.480         0.222
Low Frequency     All       0.954         0.856         0.474
                  10,000    0.738         0.558         0.284
                  5,000     0.574         0.464         0.228
High Frequency    All       1.000         1.000         0.950
                  10,000    0.944         0.986         0.746
                  5,000     0.826         0.932         0.592
                  2,500     0.668         0.802         0.456
                  1,000     0.474         0.594         0.306

As we mentioned when discussing Table 4.19, we may find simulations under
which the odds ratios cannot be calculated because of a lack of cases. Also, note
that the calculation of odds ratios requires the existence of enough exposed controls.
This is more demanding under a 1:1 matching strategy, as fewer controls are available
than in a 1:5 matching strategy. At first sight this is surprising; it ought to be easier to
find a smaller number of controls. This is true, but there is also a higher chance that
one of the cells in the 2 × 2 table used to calculate the odds ratio will be empty (see
Table A.31 in Appendix A). Table 4.22 shows the numbers of simulations rejected
for this reason. The numbers are very high for the Low Frequency scenarios where
1,000 and 2,500 cases were used. The values of O based on the remaining simulations
are not reliable and so these are not given in Table 4.21.
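The rejection rule can be sketched with the usual cross-product odds ratio from a 2 × 2 table. The cell layout below is the conventional one and is assumed here, since Table A.31 is not reproduced in this section; the function name and the counts are hypothetical, and any adjustment for matching is ignored.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio, or None if any cell of the 2 x 2 table is empty."""
    cells = (exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    if min(cells) == 0:
        # An empty cell gives an odds ratio of zero or an undefined one;
        # a simulation in this position would be rejected.
        return None
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Hypothetical counts: a rare exposure with five controls per case ...
with_five_controls = odds_ratio(3, 97, 5, 495)
# ... and the same study with one control per case, where no control is exposed.
with_one_control = odds_ratio(3, 97, 0, 100)
```

With few controls and a rare exposure, the exposed-controls cell is the one most likely to be empty, which is the effect that Table 4.22 records for the Low Frequency scenarios.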
Table 4.22: The number of simulations rejected due to the inability to calculate the
odds ratios for a 1:1 matching strategy.

                    Number of Cases
Scenario            All     10,000   5,000   2,500   1,000
Base                0       0        0       0       13
Low Penetrance      0       0        0       0       16
High Penetrance     0       0        0       0       36
Low Frequency       0       0        6       123     630
High Frequency      0       0        0       0       0

4.5 Conclusions
Earlier in this chapter, we asked the question: how well may UK Biobank distinguish
between different levels of risk associated with the influence of genes, environment
and their interactions on a given multifactorial disorder? Using a simple model
of heart attack as an example, we simulated the outcome of UK Biobank, each
simulation consisting of 500,000 life histories. Then we supposed that a model
epidemiologist carried out case-control studies using the UK Biobank data, and a
model actuary used the published odds ratios from these studies to parameterise a
pricing model for CI insurance.
We supposed that GAIC (in the UK) would approach the question of the reliability of any genetic test capable of detecting the genetic variation in terms of its
ability to allocate tested individuals to distinct underwriting classes. From each simulation of UK Biobank we could estimate the premium rates of a representative CI
insurance policy for each stratum defined by genotype and the environment, and for
each sex. From a large number of such simulations, we could estimate the sampling
distributions of premium ratings with respect to a chosen ‘standard’ underwriting
class.
For simplicity, we used only two genotypes and two levels of environmental exposure (as in the examples in the UK Biobank protocol). We used proportional
hazards of heart attack in different strata, and assumed that the model epidemiologist, in his/her analyses, hit upon the same model. Thus our results correspond
to the simplest possible hypothesis that might be investigated using UK Biobank,
and are free of model mis-specification on the part of the analyst, and of any noise,
nuisance parameters, or missing or contaminated data.
The parameters we chose as our baseline represented genetic and environmental
exposures that were fairly common (10% of the population with each adverse exposure) and had modest penetrance: the most adverse stratum (GE) and least adverse
stratum (ge) had intensities of heart attack 30% higher and 30% lower than average,
respectively. (For comparison, CI insurance underwriters typically might consider
an extra premium to be appropriate once the assessed premium exceeds about 25%
of the standard.) We also considered the effect of varying key parameters, as follows:
(a) The relative incidence rates of heart attack for each stratum.
(b) The population frequencies of each stratum.
(c) The number of cases used in the case-control study.
We defined a very simple measure of the extent to which two distributions overlap. We did not attempt to define a cut-off point, beyond which GAIC might deem
a genetic test to be insufficiently reliable to be used in underwriting, but the results
we obtained ranged across all values of this measure, showing that in some circumstances a genetic test would almost certainly be deemed reliable, and in other
circumstances it would almost certainly be deemed unreliable.
On the basis of this simple model, we conclude that the ability of case-control
studies based on UK Biobank to identify distinct CI underwriting classes was
marginal. If a very large number of cases was used, quite reliable discrimination
was achieved, but this is a very expensive option. If a more realistic number of
cases was used — a few thousands — the power to discriminate quickly diminished.
In particular, it was clear that if the effects of the adverse genotype and adverse
environment were any less than we had assumed, the power to discriminate would
be rather poor.
This conclusion ought to bring comfort to those who are worried about insurers’
use of genetic information, and to insurers themselves. This is particularly important
during the 5 to 10 years that must pass before UK Biobank itself starts to yield
results. We have found no support for the idea that very large-scale genetic studies
like UK Biobank will lead to significant changes in underwriting practice.
Our study has been very simple and idealised in several respects mentioned above.
Most obviously, our genetic model is not truly multifactorial, although it does allow
for a basic environmental interaction. Further research is in hand to extend the
model to a more realistic, though still hypothetical, representation of a multifactorial
genetic contribution to heart attack. Our aim will be to find out whether this will
strengthen or weaken the discriminatory power of genetic tests, along the lines that
GAIC has pioneered for single-gene disorders. Another point that will repay further
study is the possibility of model mis-specification.
Chapter 5
Adverse Selection and Utility Theory

5.1 Risk and Insurance
An individual faces financial risk all the time. Be it the risk of losing one’s home
to fire, flood or earthquake, or the loss of a steady source of income through failing
health, an individual is constantly exposed to large financial risks. Although the
probability of such a high-risk event is small, the resulting loss could be enormous
and potentially catastrophic for the individual.
Facing an uncertain future, an individual might do nothing and gamble on the
risk event not happening. Or, the individual can purchase insurance and pass the
risk on to an insurer at an appropriate price. So, which of the two options should
an individual choose? Economic studies, like Pratt (1964), show that individuals
are generally risk averse. If affordable, an individual would not gamble and would
opt for insurance protection. Of course, the price of insurance plays an important
role. If the insurance premium is set at the actuarially fair price for the risk, it can
be shown that a risk-averse individual would always put a higher value on insurance
as against gambling with the risk. Pursuing this further, it can also be proved that
risk-averse individuals are actually willing to pay more than the fair price for the
risk cover, up to a certain maximum. For more details on rational behaviour in
purchasing insurance coverage against a given risk, see Mossin (1968).
This risk-averse nature of individuals provides the business incentive for insurers to
operate in the market.
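A small numerical sketch may help to fix ideas about the maximum premium a risk-averse individual would pay. It assumes logarithmic utility of wealth, which is only one convenient risk-averse choice and is not prescribed by the text, and the wealth, loss and probability figures are invented for illustration. The maximum premium is the amount at which the utility of insured wealth equals the expected utility of remaining uninsured.

```python
import math


def max_premium_log_utility(wealth, loss, prob):
    """Largest premium acceptable to an individual with log utility of wealth."""
    # Expected utility of remaining uninsured.
    expected_utility = (prob * math.log(wealth - loss)
                        + (1 - prob) * math.log(wealth))
    # Insuring leaves wealth - premium for certain; solve
    # log(wealth - premium) = expected_utility for the premium.
    return wealth - math.exp(expected_utility)


WEALTH, LOSS, PROB = 100_000.0, 50_000.0, 0.01  # illustrative values only

fair_premium = PROB * LOSS
maximum_premium = max_premium_log_utility(WEALTH, LOSS, PROB)
```

Under these assumptions the actuarially fair premium is 500, while the maximum acceptable premium is about 691. The gap between the two is the margin referred to in the text.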
While a solitary individual prefers to insure against risks, insurers are in the
business of accepting risks. By pooling risks, an insurer can become virtually
risk-neutral. Because a risk-averse individual is also willing to pay more
than the fair price, the insurer can charge a premium which will not only cover the
expected cost of claims, but also contribute to its profit margin.
However, an insurer cannot charge an arbitrarily high price, for several reasons.
Firstly, beyond a certain maximum, even the risk-averse individuals will find the
price unattractive, which sets an upper limit for the premium that can be charged.
Moreover, in a competitive market, where individuals can choose between competing
products, they will always buy the cheapest one available, all else being equal. So,
competition ensures that insurance is sold at prices much lower than the upper limit
that the risk-averse individuals could have paid. In fact, in a competitive market,
the equilibrium position for all insurance companies is to charge the fair actuarial
price for the risk involved. Rothschild and Stiglitz (1976) provide a model for
risk-neutral insurance firms in a competitive market.
What can we infer from all this? If the insurance companies can only charge
an actuarially fair premium, could the knowledge that the consumers would have
actually paid more be used under some other circumstances? As we will show here,
the answer to that question is yes. In the remainder of the chapter, we will see
that in certain situations where insurers do not have access to consumers’ private
information, the upper limits for insurance premiums become relevant. We will
illustrate our results using CI insurance. But before proceeding further, we will
discuss the circumstances that lead to information asymmetry.
5.2 Underwriting Risk
Each individual is unique: their circumstances differ, and so do their risk
profiles. So, even if two individuals wish to purchase the same cover from the same
insurer, they might still find that they have to pay very different prices. Insurers
would always want to charge a premium which is at least equal to the actuarially fair
price which is commensurate with the risk they are accepting. Although competition
ensures that they cannot charge more than the fair price, they do not want to quote
a lower price, as this would result in losses. So consumers with a higher risk profile
would be expected to pay a higher premium than those with lower risks.
Charging an appropriate premium for a risk involves a good understanding of
how different factors contribute to the risk in question. The factors which have a
quantitative impact on the risk are identified and are commonly referred to as risk
factors. Different levels of exposures of these risk factors would indicate different
levels of risks. In other words, exposure levels of risk factors stratify an insurer’s
consumer base into a number of homogeneous groups of individuals. Appropriate
premiums can then be set for these groups of individuals based on their exposures
to these risk factors. As and when a potential consumer approaches the insurer
for cover, the individual’s exposure to the risk factors would dictate the premium
to be charged. This is, broadly, how underwriting strategies work for insurance
companies.
However, acquiring information on risk factors has its disadvantages. Firstly,
there are costs associated with it. A piece of information is only useful for underwriting purposes if the advantages outweigh the cost of acquiring it. The risk factors
which satisfy this economic criterion can then be used for underwriting purposes.
As more and more risk factors come to light through medical research, it is also an
evolving process. This is very relevant for recent developments in genetics, as the
rôle of genes in an individual’s health becomes clearer. However, as of now, genetic
tests are expensive, and further research is needed to establish the relative efficiency
of these tests as underwriting tools.
More importantly, though, there are ethical considerations in accessing private
information. Let us discuss this in the context of CI insurance. The risk of CI is
affected by, among other things, age, gender, lifestyle and genotype. Clearly, CIs are
more common at advanced ages. Medical research has also established differences
between CI incidences for males and females. The same is true for some lifestyle
factors like smoking habits. Use of these items of information for underwriting has
become standard and is widely practised in the industry.
However, the use of certain information on environmental exposures and genetic
test results has proved more controversial. Unlike smoking habits, there are some
environmental exposures which are beyond an individual’s control. And genes are
even more intrinsic to human beings, as individuals are born with them. Given
this backdrop, should insurers be allowed to use such information to discriminate
between individuals? In many countries, a ban has been imposed, or moratorium
agreed, limiting the use of genetic information. In the UK, GAIC is providing
guidance to insurers on the acceptable use of genetic information. As it stands now,
insurers are only allowed access to genetic test results for covers exceeding a certain
prescribed level.
Clearly, the regulators are now responsible for formulating policies on ethical
issues, while the insurers do not access genetic information for the majority of cases.
It is imperative here to understand the role of different types of genetic disorders
that might affect an individual’s health.
5.3
Multifactorial Disorders
We have discussed genetic disorders in detail in Chapter 1. In this section, we will
recount briefly the main issues.
Disorders caused by mutations in single genes, which may be severe and of late
onset, but are rare, have been quite extensively studied in the insurance literature,
see Macdonald and Pritchard (2000) for an example. One reason is that the epidemiology of these disorders is relatively advanced, because biological cause and effect
could be traced relatively easily. The conclusion has been that single-gene disorders
do not expose insurers to serious adverse selection, in large enough markets, because
of their rarity.
The vast majority of the genetic contribution to human disease, however, will
arise from combinations of gene varieties (called ‘alleles’) and environmental factors, each of which might be quite common, and each alone of small influence but
together exerting a measurable effect on the molecular mechanism of a disease.
Some combinations may be protective, others deleterious. These are the multifactorial disorders, and they are the future of genetics research. Their epidemiology is
not very advanced, but should make progress in the next 5–10 years through the
very large prospective studies now beginning in several countries. As discussed in
earlier chapters, one of the largest is the UK Biobank project, which will recruit
500,000 people aged 40 to 69 from the general population of the UK, and follow
them up for 10 years. The aim is to capture both
genetic and environmental variations and interactions, and relate them to the risks
of common diseases. If successful, the outcome will be much better knowledge of the
risks associated with complex genotypes. Thus the genetics and insurance debate
will, in the fairly near future, shift from single-gene to multifactorial disorders.
5.4
Literature Review
Any model used to study adverse selection risk must incorporate the behaviour of
the market participants. Most of those applied to single-gene disorders in the past
did so in a very simple and exaggerated way, assuming that the risk implied by
an adverse genetic test result was so great that its recipient would quickly buy life
or health insurance with very high probability. These assumptions were not based
on any quantified economic rationale, but since they led to minimal changes in the
price of insurance this probably did not matter. The same is not true if we try
to model multifactorial disorders. Then ‘adverse’ genotypes may imply relatively
modest excess risk but may be reasonably common, so the decision to buy insurance
is more central to the outcome.
Information asymmetry and adverse selection have been considered before. Doherty & Thistle (1996) pointed out that, under symmetric information, the private
value of information is negative and insurance deters people from taking diagnostic
tests. This is because, from an individual’s perspective, before undertaking the test,
the premium is a random variable and, being risk-averse, the individual will decide
against testing and opt for an average premium instead. On the other hand, if
the insurer cannot observe test results, acquiring information has a positive private
value as it enables an informed choice to be made. However, as insurers adjust their
premiums to guard against adverse selection, there is a loss of market efficiency. The
authors used a general insurance model to show that insurers can only allow partial
cover for the lowest risk group if positive (beneficial) test results cannot be reported.
If reporting verifiable positive test results was allowed, the lowest risk group could
buy full coverage at the lower price. However, uninformed individuals would pay the
same higher premium charged to the high risk group. Assuming costless information, this provides an incentive for taking diagnostic tests. The authors concluded
that the loss of efficiency in the insurance market should be weighed against the
increased value of private information.
Hoy & Polborn (2000) analysed the same problem in a life insurance model. As
life insurance companies do not share information, restricting insurance cover is not
a viable option against adverse selection. Instead the authors propose an income
protection model, which they then use to compute an optimal insurance coverage.
Under specific assumptions, they showed that for a fixed insurance premium, appetite for insurance cover increases with risk. The authors constructed scenarios
where the effect, on welfare, of a new test could go either way. A Pareto worsening happens when very high risk individuals opt for insurance only when the test
produces very bad news. This increases the average insurance premium for life
insurance buyers and worsens everybody’s situation. On the other hand, if the individuals with positive (beneficial) test results have lower risk than the average life
insurance buyer, then there is a Pareto improvement. The authors also investigated
a third scenario under which individuals who go for tests gain and those who do not
lose. As, currently, very few people have diagnostic genetic tests, individuals with
bad news can only move insurance premiums by very small amounts in practice.
However, the authors conclude that if tests become cheaper and widely available,
testing could lead to either Pareto improvements or worsening.
Hoy & Witt (2005) applied the results from Hoy & Polborn (2000) to the specific
case of the BRCA1/2 breast cancer genes. They simulated the market for 10-year
term life insurance policies targeted at women aged 35 to 39. They stratified the
consumer base into 13 risk categories based on family background information. This
information is also available to insurers. Then within each risk group, they checked
the impact of test results for BRCA1/2 genes on welfare effects, using iso-elastic
utility functions. The authors showed that in the presence of a high risk group, and
in the presence of information asymmetry, the equilibrium insurance premium can
be as high as 297% of the population weighted probability of death.
All these papers assume that the genetic epidemiology implies that genetic tests
carry very strong information about risk; true of some single-gene disorders but
unlikely to be so true of multifactorial disorders. They concentrate primarily on
providing a proper economic rationale for the impact, on the insurance market, of
genetic tests for, mainly, rare diseases. Here, we try to bring together plausible
quantitative models for the epidemiology and the economic issues, in respect of
more common disorders, therefore affecting a much larger proportion of the insurer’s
customer base. We wish to find out under what circumstances adverse selection is
likely to occur.
5.5
Adverse Selection
We suppose that individuals are risk-averse, have wealth W and aim to buy CI
insurance with sum assured L ≤ W . Their decision is governed by expected utility,
conditioned on the information available to them. Insurers, in a competitive market,
charge an actuarially fair premium P , equal to the expected present value of the
insured loss, conditioned on the information available to them. See for example
Hoy and Polborn (2000) for a similar market model. Because they are risk-averse,
individuals will be willing to pay a premium up to a maximum of P ∗ > P , provided
that they and the insurer have the same information. We can then consider the
effect of genetic information that is only available to applicants.
We propose a simple model of a multifactorial disorder, with two genotypes and
two levels of environmental exposure, and either additive or multiplicative interactions between them. These factors affect the risk of myocardial infarction (heart
attack), and therefore the theoretical price of CI insurance. However, these price differences are not very large. To begin with, the risk factors are not observable,
because the epidemiology is unknown, or the necessary genetic tests have not yet
been developed. Insurers therefore charge everyone the same premium, which is
the appropriate weighted average of the genotype and environment-specific premiums. Subsequently, genetic tests that accurately predict the risk become available,
but only to individuals; insurers are barred from asking about genotype. Adverse
selection therefore becomes a possibility.
5.6
Utility of Wealth
Utility theory has its roots in the early works of the utilitarian philosophers, including Bentham (1789) and Mill (1879). They proposed that people ought to desire
those things that will maximise their utility, where utility is measured in terms
of happiness or satisfaction gained from consumption of commodities. Among the
early applications, Daniel Bernoulli suggested the use of expected utility theory to
solve the St. Petersburg paradox. However, the first important breakthrough came
from Von Neumann and Morgenstern (1944), who used the assumptions of expected
utility maximisation in their formulation of game theory. The work of Nash (1950)
on optimum strategies for multiplayer games ushered in a new era and since then
utility theory has been at the forefront of economic research activity. For a full
exposition of utility theory, see Binmore (1991).
In this chapter, we will define utility functions as a measure of an individual’s
preference for wealth. In other words, an individual, hypothetically, assigns a value
U (w) to every amount of wealth w that she can possess. Figure 5.24 shows a
specimen utility function plotted against a person’s wealth. For this individual,
the utility of wealth, measured in terms of U (w), increases with wealth, w. This
is known as the non-satiation property, which states that more wealth is preferred
to less wealth. The other feature of the utility function is that it is concave, i.e.,
the rate of increase of U (w) slows down as the wealth goes up. In other words,
the marginal utility of wealth decreases as the wealth increases. For this individual
the value of an extra pound is greater when her existing wealth is £1 than when it is
£1,000,000. This is known as the risk-aversion property.
Figure 5.24: Utility of wealth for a risk averse individual. (Utility plotted against
wealth. The wealth axis marks W − L, W ∗ , W − qL and W ; the utility axis marks
u(W − L), u(W ∗ ) = (1 − q)u(W ) + qu(W − L), u(W − qL) and u(W ).)
Let us now formalise our definition of utility function in terms of the non-satiation
and risk-aversion properties. In mathematical terminology, the utility function for a
risk-averse individual is increasing and concave. Now, U (w) is concave on an interval
[a,b], if for any distinct points w1 and w2 in [a,b] and for any α in (0,1), we have,
U [αw1 + (1 − α)w2 ] > αU (w1 ) + (1 − α)U (w2 ).
(5.32)
If U (w) is twice-differentiable in [a,b], then a sufficient condition for
it to be concave in this strict sense on that interval is that the second derivative U″(w) < 0 for all
w in [a,b]. So a twice-differentiable utility function for a risk-averse individual has
the properties U′(w) > 0 (non-satiation property) and U″(w) < 0 (risk-aversion
property).
From the above formulation, it is not readily obvious how concavity of utility
functions relates to risk-aversion. To understand the relationship, let us assume that
an individual with a concave increasing utility function U (w), has initial wealth W
from which he might lose L with probability q. The ultimate wealth is the random
variable X, where X = W − L with probability q and X = W with probability 1 − q.
The expected utility of this gamble from the individual’s perspective is:
E[U (X)] = qU (W − L) + (1 − q)U (W ).
(5.33)
If he chooses, he can insure the risk for premium P , and accept W −P with certainty.
He should do so if:
U (W − P ) > E[U (X)] = qU (W − L) + (1 − q)U (W ).
(5.34)
In particular he should insure if the premium is equal to the expected loss qL since:
U (W − qL) = U (q(W − L) + (1 − q)W ) > qU (W − L) + (1 − q)U (W ).
(5.35)
The inequality of Equation 5.35 can also be verified from Figure 5.24, which shows
that an individual values certainty more than a gamble. He is more willing to forgo a
fixed loss of amount qL than to participate in the gamble. This is why the individual
is risk-averse.
In fact, we can see from Figure 5.24 that a risk-averse individual is willing to run
her wealth down further than that which is required for a fair actuarial premium.
If W ∗ is the amount of wealth for which:
U (W ∗ ) = (1 − q)U (W ) + qU (W − L)
(5.36)
then the individual will be ready to forgo a maximum of W − W ∗ in order to avoid the
gamble. As this is greater than the fair actuarial premium of qL:
P ∗ = W − W ∗ = W − U⁻¹[(1 − q)U (W ) + qU (W − L)]
(5.37)
is the maximum premium that this person will pay for insurance. So in a market
where competition drives insurers to charge the actuarially ‘fair’ premium qL, insurance will be bought, but this is not the limiting case; insurance will be bought
as long as the premium is less than P ∗ .
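As a numerical illustration of Equations 5.35 to 5.37, the following Python sketch computes the maximum acceptable premium P ∗ for a hypothetical individual with logarithmic utility. The function name max_premium and the figures W = 100,000, L = 50,000 and q = 0.1 are illustrative assumptions, not values taken from the analysis above.

```python
import math

def max_premium(U, U_inv, W, L, q):
    """Maximum premium P* = W - U^{-1}[(1 - q)U(W) + qU(W - L)] (Equation 5.37)."""
    expected_utility = (1 - q) * U(W) + q * U(W - L)
    return W - U_inv(expected_utility)

# Illustrative assumptions: log utility, wealth 100,000, loss 50,000, loss probability 0.1.
W, L, q = 100_000.0, 50_000.0, 0.1

fair_premium = q * L                                   # actuarially fair premium qL
p_star = max_premium(math.log, math.exp, W, L, q)      # risk-averse maximum premium

print(f"fair premium qL = {fair_premium:.2f}")
print(f"maximum premium P* = {p_star:.2f}")
```

Under these assumptions the computed P ∗ exceeds the fair premium qL, which is the gap that allows insurance to be bought at premiums somewhat above the actuarially fair level.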
5.7
Coefficients of Risk-aversion
Risk-aversion is the prerequisite for insurance. However, not all individuals have
the same attitude towards risk. Some individuals are more risk-averse than others.
In this section, we will define properties of utility functions which characterise an
individual’s attitude towards risk. For a comprehensive discussion on properties of
risk-averse utility functions, please refer to Pratt (1964).
Let us consider two utility functions U (w) and V (w), where for a > 0, V (w) =
aU (w) + b. In mathematical terminology, U (w) and V (w) are said to be related by
a positive affine transformation. How different are these two utility functions from
each other? If U (w) represents the utility function for a risk-averse individual, i.e.,
U′(w) > 0 and U″(w) < 0, then so does the function V (w), i.e., V′(w) > 0 and
V″(w) < 0. Now, assuming an initial wealth of W , if there is a risk of losing L with
probability q, how will decisions based on utility function U (w) be different from
those based on V (w)? Note that:
V⁻¹[qV (W − L) + (1 − q)V (W )] = V⁻¹[a{qU (W − L) + (1 − q)U (W )} + b]
= U⁻¹[qU (W − L) + (1 − q)U (W )].
(5.38)
From Equations 5.37 and 5.38, we can see that the maximum premium payable under
both these utility functions is the same. So in a way, a positive affine transformation
has preserved the inherent characteristics of these utility functions.
To understand the underlying mechanics, let us define the absolute risk-aversion
function for a utility function U (w), as follows:
\[
A_U(w) = -\frac{U''(w)}{U'(w)}. \tag{5.39}
\]
Clearly, for a positive affine transformation V (w) = aU (w) + b, we have:
\[
A_V(w) = -\frac{V''(w)}{V'(w)} = -\frac{aU''(w)}{aU'(w)} = -\frac{U''(w)}{U'(w)} = A_U(w). \tag{5.40}
\]
So a positive affine transformation leaves the absolute risk-aversion functions unaltered. Conversely, if we assume AU (w) = AV (w) for two risk-averse utility functions
U (w) and V (w), then we have:
\[
\frac{V'(w)}{U'(w)} = \frac{V''(w)}{U''(w)}. \tag{5.41}
\]
Let us now define:
\[
f(w) = \frac{V'(w)}{U'(w)}. \tag{5.43}
\]
Taking derivatives of both sides:
\[
f' = \frac{V''}{U'} - \frac{V' U''}{(U')^2} = \frac{V'}{U'}\left[\frac{V''}{V'} - \frac{U''}{U'}\right]. \tag{5.44}
\]
From Equation 5.41, f′ = 0, so f is a constant a, which is positive because U′ and V′
are both positive. Hence V′(w) = aU′(w), implying that V (w) = aU (w) + b where a > 0. So,
we can see that two utility functions have the same absolute risk-aversion function
if and only if they are related by a positive affine transformation. In other words, the absolute
risk-aversion function characterises a utility function up to a positive affine transformation.
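The invariance in Equation 5.40 can be checked numerically. The sketch below approximates the absolute risk-aversion function by central finite differences; the helper name absolute_risk_aversion, the step size h and the constants a = 3 and b = 7 are assumptions chosen purely for illustration.

```python
import math

def absolute_risk_aversion(U, w, h=1e-3):
    """Approximate A_U(w) = -U''(w)/U'(w) using central finite differences."""
    first = (U(w + h) - U(w - h)) / (2 * h)
    second = (U(w + h) - 2 * U(w) + U(w - h)) / (h * h)
    return -second / first

def U(w):
    # Log utility, for which the exact value is A_U(w) = 1/w.
    return math.log(w)

def V(w):
    # A positive affine transformation of U with a = 3 and b = 7 (illustrative values).
    return 3.0 * U(w) + 7.0

w = 2.0
a_u = absolute_risk_aversion(U, w)
a_v = absolute_risk_aversion(V, w)
print(a_u, a_v)
```

Both values should agree with each other and with the exact value 1/w, up to the finite-difference error.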
We will also introduce here a related quantity called the relative risk-aversion
function, defined as follows:
\[
R(w) = A_U(w)\,w = -\frac{U''(w)\,w}{U'(w)}. \tag{5.45}
\]
5.8
Families of Utility Functions
We introduce two families of utility functions which we will use in examples throughout the rest of the document.
(a) The Iso-Elastic utility functions are defined by:
\[
U_{I(\lambda)}(w) =
\begin{cases}
(w^{\lambda} - 1)/\lambda & \lambda < 1 \text{ and } \lambda \neq 0 \\
\log(w) & \lambda = 0.
\end{cases} \tag{5.46}
\]
The condition λ < 1 ensures concavity. Log-utility is the limiting case as λ → 0.
The family gets its name, iso-elastic, from the property that scaling wealth by
a factor k produces a utility function which is just a positive affine
transformation of the original utility function. In mathematical notation, for
all k > 0, there exist some functions f (k) > 0 and g(k), which are independent
of wealth w, such that:
U (kw) = f (k)U (w) + g(k).
(5.47)
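It is easy to verify that this family of utility functions satisfies iso-elasticity:
\[
U_{I(\lambda)}(kw) =
\begin{cases}
k^{\lambda} U_{I(\lambda)}(w) + (k^{\lambda} - 1)/\lambda & \lambda < 1 \text{ and } \lambda \neq 0 \\
U_{I(\lambda)}(w) + \log(k) & \lambda = 0.
\end{cases} \tag{5.48}
\]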
This property plays an important role, as we will see later that individuals with
iso-elastic utility functions put more emphasis on the proportion of loss under
risk than the actual amount of loss itself.
The absolute risk-aversion function of UI(λ) (w) is:
\[
A(w) = \frac{1-\lambda}{w} \tag{5.49}
\]
and the relative risk-aversion function is constant, R(w) = R = 1 − λ. Hence
higher λ means less risk aversion.
(b) The Negative Exponential family of utility functions is parameterised by a constant absolute risk-aversion function A(w) = A, as follows:
UN (A) (w) = − exp(−Aw), where A > 0.
(5.50)
Clearly, a higher value of A implies more risk aversion.
The Negative Exponential utility functions possess the interesting property that
they are invariant under any translation of wealth. In other words, for all k > 0,
there exist some functions f (k) > 0 and g(k), which are independent of wealth
w, such that:
U (k + w) = f (k)U (w) + g(k).
(5.51)
It is easy to verify that for Negative Exponential utilities,
UN (A) (k + w) = exp(−kA)UN (A) (w).
(5.52)
We will see later that this property ensures that individuals with Negative Exponential utility functions put all emphasis on the actual amount of loss, ignoring
completely their initial wealth.
The basic properties of these families of utility functions along with some simple
applications to portfolio optimisation are given in Norstad (1999).
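The scaling property of Equation 5.48 and the translation property of Equation 5.52 can be verified numerically. The following sketch is only an illustration: the function names iso_elastic and negative_exponential and all numerical inputs are assumptions introduced here.

```python
import math

def iso_elastic(lam):
    """Iso-elastic utility of Equation 5.46; assumes lam < 1 and wealth w > 0."""
    if lam == 0:
        return math.log
    return lambda w: (w ** lam - 1) / lam

def negative_exponential(A):
    """Negative exponential utility of Equation 5.50; assumes A > 0."""
    return lambda w: -math.exp(-A * w)

# Equation 5.48: scaling wealth by k gives a positive affine transformation.
lam, k, w = 0.5, 3.0, 2.0
U = iso_elastic(lam)
scaled_lhs = U(k * w)
scaled_rhs = k ** lam * U(w) + (k ** lam - 1) / lam

# The logarithmic case (lam = 0) of Equation 5.48.
log_lhs = iso_elastic(0)(k * w)
log_rhs = iso_elastic(0)(w) + math.log(k)

# Equation 5.52: translating wealth by an amount `shift` rescales the utility.
A, shift, wealth = 9e-5, 10_000.0, 100_000.0
N = negative_exponential(A)
shifted_lhs = N(shift + wealth)
shifted_rhs = math.exp(-shift * A) * N(wealth)

print(math.isclose(scaled_lhs, scaled_rhs),
      math.isclose(log_lhs, log_rhs),
      math.isclose(shifted_lhs, shifted_rhs))
```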
5.9
Estimates of Absolute and Relative Risk-aversion
To parameterise these utility functions, we need estimates of absolute or relative risk-aversion coefficients. Eisenhauer and Ventura (2003) pointed out that past research
was inconclusive; estimates of average relative risk-aversion coefficients ranged from
less than 1 to well over 40. Hoy and Witt (2005) illustrated their model using
iso-elastic utilities with R = 0.5, 1 and 3. We will adopt a similar strategy, as
follows.
Eisenhauer and Ventura (2003) estimated the risk-aversion function based on a
thought experiment conducted by the Bank of Italy for its 1995 Survey of Italian
Households’ Income and Wealth. Under certain assumptions, they estimated that
a person with an average annual income of 46.7777 million lira had absolute risk-aversion coefficient 0.1837 (per million lira), and relative risk-aversion coefficient 8.59.
Allowing for the sterling/lira exchange rate in 1995 (average £1 = 2570.60 lira
http://fx.sauder.ubc.ca/) and price inflation in the UK between July 1995 and
June 2006 (Retail Price Index 149.1 and 198.5, respectively) an average income of
46.7777 million lira in 1995 equates to about £24,226 in 2006, not very different
from the actual average of £25,810 (Jones (2005)).
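The conversion above is simple arithmetic, sketched here in Python using only the exchange rate and Retail Price Index figures quoted in the text. The variable names are illustrative.

```python
# Figures quoted in the text.
income_lira_1995 = 46.7777e6        # average annual income, in lira
lira_per_pound_1995 = 2570.60       # average 1995 exchange rate
rpi_july_1995 = 149.1               # UK Retail Price Index, July 1995
rpi_june_2006 = 198.5               # UK Retail Price Index, June 2006

# Convert to 1995 pounds, then inflate to 2006 pounds.
income_pounds_1995 = income_lira_1995 / lira_per_pound_1995
income_pounds_2006 = income_pounds_1995 * rpi_june_2006 / rpi_july_1995

print(f"1995 income in pounds: {income_pounds_1995:,.0f}")
print(f"2006 equivalent in pounds: {income_pounds_2006:,.0f}")
```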
We need utility functions of wealth, so an estimate of the wealth-income ratio
is required. Estimates of this ratio in the literature are quite varied. According to
Treasury (2005) in the U.K., it varies between 5 and 7 for total wealth, and between
2 and 4 for net financial wealth.
The Inland Revenue in the U.K. also publishes figures on personal wealth distribution (http://www.hmrc.gov.uk/stats/personal_wealth/menu.htm). Their latest
figure (for 2003) shows that 53% of the population has less than £50,000 and
83% has less than £100,000. As the distribution of wealth is positively skewed, we
will assume a total wealth of W = £100,000. This gives a wealth-income ratio of 4
which is consistent with the figures published by Treasury (2005).
(a) The absolute risk-aversion function depends on the unit of wealth. Given utility functions U (w) and V (w) related by U (cw) = V (w) for some constant c,
their absolute risk-aversion functions are related by AU (cw) = AV (w)/c. Using
the exchange and inflation rates above, we suppose that a Briton in 2006 has
absolute risk-aversion coefficient 8.967 × 10⁻⁵ ≈ 9 × 10⁻⁵ , denominated in 2006
pounds.
(b) The relative risk-aversion function does not depend on the unit of wealth and
so the estimate of 8.59 can be used without any adjustment. We will use a
rounded-off value of 9 henceforth.
The formulation of utility functions with non-constant relative risk-aversion is an
active area of research. Meyer and Meyer (2005) specified a form of marginal utility
function which gives decreasing relative risk-aversion. Xie (2000) proposed a power
risk-aversion utility function which can produce increasing, constant or decreasing
risk-aversion depending on its parameterisation. These specialised utility functions
are not yet in widespread use and we will not consider them further.
We will use the following utility functions for the purposes of illustration:
(a) Iso-elastic utilities with parameter λ = 0.5, 0 and −8, which correspond to
constant relative risk-aversion of 0.5, 1 and 9 respectively.
(b) Negative exponential utility with absolute risk-aversion coefficient
A = 9 × 10⁻⁵ .
Since iso-elastic utility with λ = −8 has absolute risk-aversion coefficient equal
to 9 × 10⁻⁵ when wealth is £100,000, our assumption of W = £100,000 allows us
to compare the two utility functions.
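To illustrate this comparison, the sketch below checks that Equation 5.49 gives A(W ) = 9 × 10⁻⁵ for λ = −8 at W = £100,000, and then evaluates the maximum premium of Equation 5.37 in closed form for each family. The loss of £10,000 and loss probability of 0.1 are assumptions chosen for illustration, as are the function names.

```python
import math

W = 100_000.0     # assumed total wealth, as in the text
lam = -8.0        # iso-elastic parameter (relative risk-aversion 9)
A = 9e-5          # negative exponential absolute risk-aversion

# Equation 5.49 evaluated at w = W.
abs_ra_iso_at_W = (1 - lam) / W

def max_premium_iso(W, L, q, lam):
    """P* from Equation 5.37 under iso-elastic utility with lam != 0."""
    certain_wealth = W * (q * (1 - L / W) ** lam + (1 - q)) ** (1 / lam)
    return W - certain_wealth

def max_premium_negexp(L, q, A):
    """P* from Equation 5.37 under negative exponential utility (independent of W)."""
    return math.log(q * math.exp(A * L) + (1 - q)) / A

# Illustrative assumptions: loss of 10,000 with probability 0.1.
L, q = 10_000.0, 0.1
p_iso = max_premium_iso(W, L, q, lam)
p_negexp = max_premium_negexp(L, q, A)
print(abs_ra_iso_at_W, round(p_iso, 1), round(p_negexp, 1))
```

For this modest loss the two maximum premiums are close, as one would expect when the two utility functions share the same absolute risk-aversion at the starting wealth.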
Chapter 6
Adverse Selection in a 2-state
Insurance Model
6.1
A Simple Gene-environment Interaction Model
We will illustrate the principles of underwriting long-term insurance in the presence
of a multifactorial disorder in the simple setting of the two-state continuous-time
model in Figure 6.25. We will also assume that all individuals have the same initial
wealth W and follow the same utility function of wealth U (w). The insured event
could be death or illness, and it is represented by transition from state A to state
B. The probability of transition is governed by the transition intensity λs (x), which
depends on age x, and the values of various risk factors which are labelled s (for
‘stratum’).
Figure 6.25: A two state model (state A to state B, with transition intensity λs (x)).
The risk factors arise from a 2 × 2 gene-environment interaction model. That is,
there are two genotypes, denoted G and g, and two levels of environmental exposure,
denoted E and e. We assume that G and E are adverse exposures while g and e
are beneficial. Therefore, there are four risk groups or strata, that we label ge, gE,
Ge and GE. Let the proportion of the population at a particular age (at which
an insurance contract is sold) in stratum s be ws . The epidemiology is defined as
follows.
(a) We assume proportional hazards, so for each stratum s there is a constant ks
such that λs (x)/λge (x) = ks for all ages x. Clearly kge = 1.
(b) We assume symmetry between genetic and environmental risks, as follows:
(1) The probability of possessing the beneficial gene g is the same as the probability of exposure to the beneficial environment e, each denoted ω. Assuming independence, wge = ω², wgE = wGe = ω(1 − ω) and wGE = (1 − ω)².
(2) We assume that kgE = kGe = k.
(c) The gene-environment interaction is represented by either an additive or a multiplicative model, as follows:
(1) Additive Model: kGE = kGe + kgE − kge = 2k − 1.
(2) Multiplicative Model: kGE = kGe kgE /kge = k².
See Woodward (1999) for a discussion of additive and multiplicative models.
Therefore, the epidemiology is fully defined by the parameters λge (x), ω and k
along with the choice of interaction.
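The 2 × 2 model above can be written down compactly. The following sketch builds the stratum proportions and relative risks from ω, k and the choice of interaction; the function name strata and the example values ω = 0.9 and k = 1.5 are illustrative assumptions.

```python
def strata(omega, k, interaction="additive"):
    """Return (weights, relative_risks) for the strata ge, gE, Ge and GE."""
    if interaction == "additive":
        k_GE = 2 * k - 1          # k_Ge + k_gE - k_ge
    elif interaction == "multiplicative":
        k_GE = k ** 2             # k_Ge * k_gE / k_ge
    else:
        raise ValueError("interaction must be 'additive' or 'multiplicative'")
    weights = {
        "ge": omega ** 2,
        "gE": omega * (1 - omega),
        "Ge": omega * (1 - omega),
        "GE": (1 - omega) ** 2,
    }
    relative_risks = {"ge": 1.0, "gE": k, "Ge": k, "GE": k_GE}
    return weights, relative_risks

# Illustrative values only.
w_add, k_add = strata(0.9, 1.5, "additive")
w_mult, k_mult = strata(0.9, 1.5, "multiplicative")
print(w_add, k_add["GE"], k_mult["GE"])
```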
6.2
Single Premiums
For simplicity, let the force of interest be δ = 0. (This is consistent with the
assumptions of Doherty and Thistle (1996), Hoy and Polborn (2000) and Hoy and
Witt (2005).) Then the single premium for an insurance contract of term n years,
with sum assured £1, sold to a person aged x who belongs to stratum s is:
\[
q_s = 1 - \exp\left[-\int_0^n \lambda_s(x+y)\,dy\right] = 1 - (1-q_{ge})^{k_s}. \tag{6.53}
\]
If the proportion of insurance purchasers aged x is the same as the proportion
in the population, ws (for example if the stratum is not known to applicants or to
insurers) observation of claim statistics will lead the insurer to charge a weighted
average premium rate
\[
\bar q = \sum_s w_s q_s = \sum_s w_s\left[1 - (1-q_{ge})^{k_s}\right] = 1 - \sum_s w_s (1-q_{ge})^{k_s} \tag{6.54}
\]
per unit sum assured. Given our assumption that the ks can all be expressed as
simple functions of k, the stratum-specific and average premium rates can also be
expressed as qs (k) and q̄(k). In particular, a neat expression can be derived using the
assumption of an additive model along with symmetry between genetic and environmental risks, set out in Section 6.1. Starting from Equation 6.54 and incorporating
these assumptions, we get:
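\begin{align*}
1 - \bar q(k) &= \sum_s w_s (1-q_{ge})^{k_s} \\
&= \omega^2(1-q_{ge}) + 2\omega(1-\omega)(1-q_{ge})^{k} + (1-\omega)^2(1-q_{ge})^{2k-1} \\
&= (1-q_{ge})\left[\omega^2 + 2\omega(1-\omega)(1-q_{ge})^{k-1} + (1-\omega)^2(1-q_{ge})^{2(k-1)}\right] \\
&= (1-q_{ge})\left[\omega + (1-\omega)(1-q_{ge})^{k-1}\right]^2. \tag{6.55}
\end{align*}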
Alternatively, given values of q̄, qge and ω, one can solve Equation 6.55 for k, using:
\[
k = 1 + \frac{\log\left[\left(\sqrt{\dfrac{1-\bar q}{1-q_{ge}}} - \omega\right) \Big/ (1-\omega)\right]}{\log(1-q_{ge})}. \tag{6.56}
\]
6.3
Threshold Premium
Suppose all individuals have initial wealth W and that the net effect of suffering the
insured event in the next n years is a loss of L. We assume partial insurance is not
possible, so that the individual insures against the full loss L or does not insure at
all. Define the loss ratio f = L/W . If no-one knows to which stratum they belong
everyone will be willing to pay a single premium of up to:
P ∗ = W − U⁻¹[q̄(k)U (W − L) + (1 − q̄(k))U (W )].
(6.57)
However, someone who knows they are in stratum s will be willing to pay a single
premium of up to:
Ps∗ = W − U⁻¹[qs (k)U (W − L) + (1 − qs (k))U (W )].
(6.58)
Ps∗ is smallest for stratum ge. So if the insurer, ignorant of the stratum, continues
to charge premium q̄(k)L, adverse selection will first appear if q̄(k)L > Pge∗ . That
is, if:
U (W − q̄(k)L) < qge (k)U (W − L) + (1 − qge (k))U (W ).
(6.59)
6.4
The Additive Epidemiological Model
Replace the inequality in Equation 6.59 with an equality and solve for k; this represents the relative risk (of each risk factor) with respect to stratum ge, above which
persons who know they are in stratum ge will cease to buy insurance. Doing this
with iso-elastic utility with λ ≠ 0 we obtain:
\[
(1 - \bar q(k) f)^{\lambda} = q_{ge}(1-f)^{\lambda} + (1 - q_{ge}). \tag{6.60}
\]
In the special case of logarithmic utility (iso-elastic utility with λ = 0) we obtain:
\[
1 - \bar q(k) f = (1-f)^{q_{ge}} \tag{6.61}
\]
and under negative exponential utility:
\[
e^{\bar q(k) A L} = q_{ge}\, e^{AL} + (1 - q_{ge}) \tag{6.62}
\]
in which wealth W does not appear. As expected, risk preferences characterised by
different utility functions produce different values of q̄(k). Once q̄(k) is obtained
for a particular utility function, the value of k can be derived from Equation 6.56.
Specifically, we have solved Equations 6.60, 6.61 and 6.62 for certain values of baseline risk qge and loss L, assuming an initial wealth of W = £100,000. Then using
ω = 0.5 (a uniform distribution across strata) and an additive model, we solve for
k. The results are in Table 6.23. We observe the following:
Table 6.23: The relative risk k above which persons in stratum ge with initial wealth
W = £100,000 will not buy insurance, using ω = 0.5 and an additive model.

Utility                                      Loss L in £’000
Function   qge       10       20       30       40       50       60       70       80       90
I(0.5)     0.1    1.025    1.053    1.085    1.122    1.165    1.217    1.284    1.373    1.513
           0.2    1.024    1.050    1.081    1.116    1.158    1.209    1.274    1.364    1.506
           0.3    1.022    1.047    1.076    1.110    1.150    1.200    1.263    1.352    1.497
           0.4    1.021    1.044    1.072    1.103    1.142    1.189    1.251    1.339    1.486
           0.5    1.019    1.041    1.066    1.096    1.132    1.178    1.238    1.324    1.472
Log        0.1    1.051    1.110    1.180    1.264    1.368    1.504    1.691    1.976    2.524
           0.2    1.048    1.104    1.170    1.250    1.350    1.479    1.659    1.939    2.488
           0.3    1.045    1.098    1.160    1.235    1.330    1.453    1.626    1.898    2.451
           0.4    1.042    1.091    1.149    1.220    1.308    1.425    1.590    1.854    2.413
           0.5    1.039    1.084    1.138    1.203    1.286    1.395    1.551    1.805    2.372
I(−8)      0.1    1.598    2.755    4.947    8.831   15.950        –        –        –        –
           0.2    1.546    2.512    4.153    6.972   14.430        –        –        –        –
           0.3    1.498    2.322    3.664    6.148        –        –        –        –        –
           0.4    1.451    2.163    3.313    5.810        –        –        –        –        –
           0.5    1.405    2.023    3.035    6.107        –        –        –        –        –
N (9e-5)   0.1    1.566    2.504    3.917    5.793    8.036   10.574   13.428   16.739   20.862
           0.2    1.516    2.292    3.337    4.617    6.119    7.911   10.226   13.900        –
           0.3    1.468    2.126    2.963    3.972    5.204    6.857    9.812        –        –
           0.4    1.423    1.987    2.684    3.536    4.655    6.519        –        –        –
           0.5    1.379    1.864    2.457    3.206    4.305    7.636        –        –        –
(a) For low loss ratios, even small relative risks k will cause people in the baseline stratum to opt against insurance. This is as expected as small losses are
relatively tolerable.
(b) As the loss ratio f increases, so does the relative risk at which adverse selection
appears. This is simply risk aversion at work.
(c) The higher the baseline risk qge for a given loss ratio f , the lower the relative
risk at which adverse selection appears. This is the result of a concave utility
function, as the fair actuarial price increases and depletes wealth.
(d) Lower risk-aversion, under iso-elastic utility, (λ = 0.5) means that smaller relative risks would discourage members of the baseline stratum from buying insurance
at the average premium, and for higher risk-aversion (λ = −8) the reverse is
true.
(e) We have assumed here that everyone has the same utility function and that
partial insurance is not possible. This meant that in our model, individuals
either insure or decide not to insure. In reality, it is possible that individuals
would opt for partial insurance, which we ignore here to keep the model simple.
Comparing iso-elastic (λ = −8) and negative exponential utilities, we see that the limiting
relative risks are broadly similar for smaller losses. For larger losses, however, iso-elastic utility functions have much greater limiting relative risks. This is because
risk-aversion increases as wealth falls under iso-elastic utility, while for negative
exponential utility it is constant. As the fair actuarial premium for bigger losses
increases and depletes wealth, risk-aversion under iso-elastic utility climbs above
that under negative exponential utility, with the result shown.
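The calculation behind Table 6.23 can be sketched as follows: solve Equation 6.60, 6.61 or 6.62 for q̄, then invert Equation 6.55 for k under the additive model. This is a minimal illustration with assumed function names, not the program used to produce the table; it returns None where no finite k exists.

```python
import math

def qbar_iso_elastic(q_ge, f, lam):
    """Average premium rate solving Equation 6.60 (lam != 0) or 6.61 (lam == 0)."""
    if lam == 0:
        return (1 - (1 - f) ** q_ge) / f
    return (1 - (q_ge * (1 - f) ** lam + (1 - q_ge)) ** (1 / lam)) / f

def qbar_neg_exponential(q_ge, A, L):
    """Average premium rate solving Equation 6.62."""
    return math.log(q_ge * math.exp(A * L) + (1 - q_ge)) / (A * L)

def limiting_k(qbar, q_ge, omega):
    """Invert Equation 6.55 for k (additive model); None if no finite k exists."""
    root = math.sqrt((1 - qbar) / (1 - q_ge))
    if root <= omega:
        return None
    return 1 + math.log((root - omega) / (1 - omega)) / math.log(1 - q_ge)

W, omega, A, q_ge = 100_000.0, 0.5, 9e-5, 0.1
L = 10_000.0
f = L / W

k_i05 = limiting_k(qbar_iso_elastic(q_ge, f, 0.5), q_ge, omega)
k_log = limiting_k(qbar_iso_elastic(q_ge, f, 0), q_ge, omega)
k_i8 = limiting_k(qbar_iso_elastic(q_ge, f, -8.0), q_ge, omega)
k_neg = limiting_k(qbar_neg_exponential(q_ge, A, L), q_ge, omega)
print(k_i05, k_log, k_i8, k_neg)

# A large loss under I(-8): no relative risk is high enough to deter purchase.
k_missing = limiting_k(qbar_iso_elastic(q_ge, 0.9, -8.0), q_ge, omega)
print(k_missing)
```

The first four values correspond to the qge = 0.1, L = 10 entries of Table 6.23, and the final case corresponds to one of its missing entries.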
6.5
Immunity From Adverse Selection
The missing entries in Table 6.23 mean that adverse selection never appears, whatever the relative risk k. Clearly, this must be related to the size of the high-risk
strata, and their ability, or otherwise, to move the average premium enough to affect
the baseline stratum. We may ask: given qge and f , is there some proportion wge
in the lowest risk stratum above which members of that stratum will always buy
insurance at the average premium rate? Begin by noting that:
\[
\lim_{k\to\infty} \bar q(k) = \lim_{k\to\infty} \sum_s w_s\left[1 - (1-q_{ge})^{k_s}\right] = w_{ge}\,q_{ge} + \sum_{s \neq ge} w_s = 1 - w_{ge}(1-q_{ge}) \tag{6.63}
\]
and that this limit is not a function of the ks and thus holds for additive and
multiplicative models. As a check, it can be easily verified from Equation 6.55,
that the limit is valid for additive models. Now, substituting this limiting value in
Equations 6.60 to 6.62, we can solve for wge as follows, for iso-elastic utility with
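λ ≠ 0:
\[
w_{ge} = \frac{1}{1-q_{ge}}\left[1 - \frac{1 - \left(q_{ge}(1-f)^{\lambda} + (1-q_{ge})\right)^{1/\lambda}}{f}\right], \tag{6.64}
\]
for logarithmic utility:
\[
w_{ge} = \frac{1}{1-q_{ge}}\left[1 - \frac{1 - (1-f)^{q_{ge}}}{f}\right] \tag{6.65}
\]
and for negative exponential utility:
\[
w_{ge} = \frac{1}{1-q_{ge}}\left[1 - \frac{\log\left[q_{ge}\,e^{AL} + (1-q_{ge})\right]}{AL}\right]. \tag{6.66}
\]
Values of ω = (wge)^(1/2) are given in Table 6.24. Values of ω < 0.5 in Table 6.24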
correspond to missing entries in Table 6.23. Table 6.24 shows just how uncommon
an adverse exposure has to be to avoid adverse selection.
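Equations 6.64 to 6.66 all have the form wge = (1 − q̄)/(1 − qge ), with q̄ taken from Equation 6.60, 6.61 or 6.62 respectively. The sketch below uses that observation to compute ω as the square root of wge; the function names are assumptions introduced for illustration.

```python
import math

def qbar_iso_elastic(q_ge, f, lam):
    """Limiting average premium rate from Equation 6.60 (lam != 0) or 6.61 (lam == 0)."""
    if lam == 0:
        return (1 - (1 - f) ** q_ge) / f
    return (1 - (q_ge * (1 - f) ** lam + (1 - q_ge)) ** (1 / lam)) / f

def qbar_neg_exponential(q_ge, A, L):
    """Limiting average premium rate from Equation 6.62."""
    return math.log(q_ge * math.exp(A * L) + (1 - q_ge)) / (A * L)

def omega_threshold(qbar, q_ge):
    """omega = sqrt(w_ge), where w_ge = (1 - qbar)/(1 - q_ge) by Equation 6.63."""
    return math.sqrt((1 - qbar) / (1 - q_ge))

W, A, q_ge = 100_000.0, 9e-5, 0.1

omega_i05 = omega_threshold(qbar_iso_elastic(q_ge, 10_000.0 / W, 0.5), q_ge)
omega_log = omega_threshold(qbar_iso_elastic(q_ge, 10_000.0 / W, 0), q_ge)
omega_neg = omega_threshold(qbar_neg_exponential(q_ge, A, 10_000.0), q_ge)
omega_i8_large = omega_threshold(qbar_iso_elastic(q_ge, 90_000.0 / W, -8.0), q_ge)
print(omega_i05, omega_log, omega_neg, omega_i8_large)
```

The first three values correspond to the qge = 0.1, L = 10 entries of Table 6.24, and the last to the I(−8) entry for L = 90.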
Assuming ω = 0.5 is perhaps extreme; it means that half the population possess
a significant genetic risk factor (modulated by environment) yet to be discovered.
This is by no means impossible, but we might expect most as-yet unknown risk
factors to affect a smaller proportion of the population, simply because they are as-yet unknown. So, we increase ω to 0.9, so that only 10% of individuals are exposed
to the adverse environment or possess the adverse gene. The relative risks k at
which adverse selection appears are given in Table 6.25. They are larger than in
Table 6.23 because the relative risk experienced by the smaller number of high-risk
individuals has to be much higher to have the same impact on the average premium.
Table 6.24: The proportions ω exposed to each low-risk factor above which persons
in the baseline stratum will buy insurance at the average premium regardless of the
relative risk k, using different utility functions.

Utility                          Loss L in £'000
Function   qge      10     20     30     40     50     60     70     80     90
I(0.5)     0.1   0.999  0.997  0.996  0.994  0.991  0.989  0.985  0.981  0.974
           0.2   0.997  0.994  0.991  0.987  0.983  0.977  0.970  0.961  0.947
           0.3   0.996  0.992  0.987  0.981  0.974  0.966  0.955  0.941  0.919
           0.4   0.995  0.989  0.982  0.974  0.965  0.954  0.940  0.920  0.890
           0.5   0.993  0.986  0.978  0.968  0.956  0.942  0.924  0.899  0.860
Log        0.1   0.997  0.994  0.991  0.986  0.981  0.974  0.965  0.951  0.926
           0.2   0.995  0.989  0.981  0.973  0.962  0.949  0.932  0.906  0.859
           0.3   0.992  0.983  0.972  0.960  0.945  0.925  0.900  0.863  0.798
           0.4   0.989  0.977  0.963  0.947  0.927  0.902  0.870  0.823  0.743
           0.5   0.987  0.972  0.954  0.934  0.910  0.880  0.841  0.786  0.693
I(−8)      0.1   0.969  0.916  0.830  0.719  0.603  0.496  0.398  0.304  0.203
           0.2   0.943  0.857  0.747  0.632  0.525  0.431  0.345  0.264  0.176
           0.3   0.919  0.812  0.693  0.580  0.480  0.393  0.315  0.241  0.161
           0.4   0.897  0.776  0.653  0.543  0.448  0.367  0.294  0.225  0.150
           0.5   0.878  0.746  0.622  0.515  0.424  0.347  0.279  0.213  0.142
N(9e-5)    0.1   0.971  0.927  0.868  0.802  0.738  0.682  0.635  0.595  0.562
           0.2   0.946  0.875  0.797  0.723  0.660  0.607  0.564  0.528  0.498
           0.3   0.923  0.835  0.748  0.673  0.612  0.562  0.522  0.488  0.461
           0.4   0.903  0.802  0.712  0.637  0.577  0.530  0.492  0.460  0.434
           0.5   0.884  0.775  0.682  0.608  0.551  0.505  0.468  0.439  0.414
Table 6.25: The relative risk k above which persons in stratum ge with initial wealth
W = £100,000 will not buy insurance, using ω = 0.9 and an additive model.

Utility                            Loss L in £'000
Function   qge       10      20      30      40      50      60      70      80      90
I(0.5)     0.1    1.126   1.269   1.433   1.625   1.855   2.140   2.511   3.033   3.899
           0.2    1.120   1.258   1.419   1.613   1.852   2.158   2.577   3.212   4.419
           0.3    1.113   1.246   1.404   1.599   1.847   2.180   2.668   3.502   5.689
           0.4    1.106   1.233   1.387   1.582   1.841   2.210   2.807   4.108       –
           0.5    1.099   1.218   1.367   1.562   1.833   2.250   3.055       –       –
Log        0.1    1.257   1.563   1.934   2.399   3.004   3.839   5.101   7.368  13.841
           0.2    1.246   1.546   1.923   2.418   3.107   4.170   6.164  13.981       –
           0.3    1.233   1.526   1.910   2.444   3.268   4.844       –       –       –
           0.4    1.220   1.504   1.894   2.482   3.555   8.317       –       –       –
           0.5    1.205   1.479   1.876   2.542   4.296       –       –       –       –
I(−8)      0.1    4.458  18.642       –       –       –       –       –       –       –
           0.2    4.823       –       –       –       –       –       –       –       –
           0.3    5.705       –       –       –       –       –       –       –       –
           0.4        –       –       –       –       –       –       –       –       –
           0.5        –       –       –       –       –       –       –       –       –
N(9e-5)    0.1    4.246  13.531       –       –       –       –       –       –       –
           0.2    4.514       –       –       –       –       –       –       –       –
           0.3    5.109       –       –       –       –       –       –       –       –
           0.4    7.984       –       –       –       –       –       –       –       –
           0.5        –       –       –       –       –       –       –       –       –
Table 6.26: The relative risk k above which persons in stratum ge with initial wealth
W = £100,000 will not buy insurance, using ω = 0.9 and a multiplicative model.

Utility                            Loss L in £'000
Function   qge       10      20      30      40      50      60      70      80      90
I(0.5)     0.1    1.125   1.265   1.424   1.608   1.825   2.090   2.431   2.907   3.701
           0.2    1.119   1.255   1.412   1.598   1.825   2.115   2.511   3.119   4.315
           0.3    1.113   1.243   1.398   1.586   1.824   2.144   2.617   3.447   5.660
           0.4    1.106   1.230   1.381   1.571   1.822   2.181   2.773   4.086       –
           0.5    1.098   1.216   1.362   1.553   1.817   2.229   3.037       –       –
Log        0.1    1.254   1.549   1.899   2.328   2.880   3.645   4.839   7.107  13.706
           0.2    1.243   1.533   1.892   2.360   3.018   4.065   6.086  13.967       –
           0.3    1.231   1.516   1.884   2.399   3.212   4.805       –       –       –
           0.4    1.218   1.495   1.873   2.449   3.527   8.314       –       –       –
           0.5    1.203   1.472   1.859   2.521   4.288       –       –       –       –
I(−8)      0.1    4.223  18.561       –       –       –       –       –       –       –
           0.2    4.723       –       –       –       –       –       –       –       –
           0.3    5.676       –       –       –       –       –       –       –       –
           0.4        –       –       –       –       –       –       –       –       –
           0.5        –       –       –       –       –       –       –       –       –
N(9e-5)    0.1    4.024  13.391       –       –       –       –       –       –       –
           0.2    4.410       –       –       –       –       –       –       –       –
           0.3    5.073       –       –       –       –       –       –       –       –
           0.4    7.981       –       –       –       –       –       –       –       –
           0.5        –       –       –       –       –       –       –       –       –

6.6 The Multiplicative Epidemiological Model
Unlike Equation 6.55 for additive models, we cannot derive a neat expression for q̄(k)
in multiplicative models. However, the equations can easily be solved numerically.
Table 6.26 shows relative risks above which adverse selection appears, assuming
ω = 0.9 and a multiplicative model. They can be compared with the values in Table
6.25. We observe the following:
(a) The missing entries are the same as in the additive model. This is because the
limiting values of q̄(k) and ω do not depend on the model structure.
(b) The relative risk in stratum GE is higher in the multiplicative model (k² >
2k − 1) so persons in the baseline stratum will be less tolerant towards any
given value of k. This is why the values in Table 6.26 are smaller than those in
Table 6.25.
(c) However the differences between the additive and multiplicative models are not
very large. If k ≈ 1, then k² ≈ 2k − 1, and for large values of ω (which arguably
is most realistic) the impact of stratum GE is relatively small. In view of this,
we will use only the additive model from now on.
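The comparison between the two model structures can be sketched numerically for logarithmic utility. The code below is an illustration under stated assumptions, not the procedure used to produce the tables: it takes stratum risks of the form 1 − (1 − qge)^ks (as in Equation 6.63), weights ω², ω(1 − ω), ω(1 − ω) and (1 − ω)², the largest acceptable average rate implied by Equation 6.65, and a simple bisection search; all function names are ours.

```python
def average_risk(k, q_ge, omega, model="additive"):
    """Population-average loss probability in the symmetric 2x2 model."""
    k_GE = 2.0 * k - 1.0 if model == "additive" else k * k
    s = 1.0 - q_ge
    survival = (omega ** 2 * s
                + 2.0 * omega * (1.0 - omega) * s ** k
                + (1.0 - omega) ** 2 * s ** k_GE)
    return 1.0 - survival

def max_rate_log(q_ge, f):
    """Largest average premium rate stratum ge accepts under log utility."""
    return (1.0 - (1.0 - f) ** q_ge) / f

def threshold_k(q_ge, f, omega, model="additive", k_max=1000.0, iters=200):
    """Smallest k at which stratum ge stops buying; None if no such k below k_max."""
    target = max_rate_log(q_ge, f)
    if average_risk(k_max, q_ge, omega, model) <= target:
        return None  # average rate never becomes unacceptable: immunity
    lo, hi = 1.0, k_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # average_risk is increasing in k, so bisection is valid
        if average_risk(mid, q_ge, omega, model) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for model in ("additive", "multiplicative"):
        print(model, threshold_k(q_ge=0.1, f=0.1, omega=0.9, model=model))
    print(threshold_k(q_ge=0.5, f=0.6, omega=0.9))  # a missing entry
```

For qge = 0.1, L = £10,000 and ω = 0.9 this should give thresholds close to the logarithmic-utility entries 1.257 (Table 6.25) and 1.254 (Table 6.26), with the multiplicative value the smaller of the two, and it returns None for qge = 0.5 and L = £60,000, where both tables show a missing entry.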
Chapter 7
Adverse Selection in a Critical Illness Insurance Model
7.1 A Heart Attack Model
We now model the specific example of CI insurance. We will focus on heart attack
risk, building upon the material developed in earlier chapters.
(a) We will use the CI insurance model developed by Gutiérrez and Macdonald
(2003), which we have already seen in Section 3.5.1. To recap, the authors
parameterised the CI model shown in Figure 7.26, using medical studies and
population data. Therefore, in particular, λ12 (x) denotes the rate of onset of
heart attacks in the general population (different for males and females).
(b) In Chapter 3, we assumed that a 2 × 2 gene-environment interaction affected
heart attack risk, with genotypes G and g, and environmental exposures E and
e, upper case representing higher risk. So there were four strata for each sex —
ge, gE, Ge and GE. We showed that it is possible to hypothecate assumptions
on strata-specific relative risks, in a way which is consistent with the rate of
onset in the general population. We will use a similar technique here.
Consider all healthy individuals aged x. If q̄ denotes the probability that a healthy
person aged x has a heart attack before age x + t, it can be calculated from the
heart attack transition intensity of the general population as follows:
\[
\bar{q} = 1 - \exp\left[ -\int_0^t \lambda_{12}(x + y)\, dy \right] \tag{7.67}
\]

[Figure 7.26: A full critical illness model. State 1 (Healthy) has transitions to
State 2 (Heart Attack), State 3 (Cancer), State 4 (Stroke), State 5 (Other CI) and
State 6 (Dead), with intensities λ12(x), λ13(x), λ14(x), λ15(x) and λ16(x) respectively.]

Now, for males and females separately, let c denote the relative risk in the baseline
stratum ge with respect to the general population, and let ks denote the relative
risk in stratum s with respect to stratum ge, in both cases assumed to be constant
at all ages (in other words, we assume a proportional hazards model). If we denote
the rate of onset of heart attack in stratum s by $\lambda^s_{12}(x)$, it is given by:
\[
\lambda^s_{12}(x) = c \times k_s \times \lambda_{12}(x). \tag{7.68}
\]
Suppose that at age x, the proportion of healthy individuals who are in stratum
s is ws. In stratum s, let qs be the probability that a healthy person aged x has a
first heart attack before reaching age x + t. Then using Equations 7.67 and 7.68, we
can show that:
\[
q_s = 1 - \exp\left[ -\int_0^t \lambda^s_{12}(x + y)\, dy \right] = 1 - (1 - \bar{q})^{c k_s}. \tag{7.69}
\]
Equating the weighted average probability over all strata with the population
probability, that is, $\bar{q} = \sum_s w_s q_s$, we have:
\[
\bar{q} = \sum_s w_s \left[ 1 - (1 - \bar{q})^{c k_s} \right]. \tag{7.70}
\]
Given the relative risks, the population proportions and the estimated λ12(x), we
can solve this for c, which fully specifies the stratum-specific intensities $\lambda^s_{12}(x)$.
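Equation 7.70 has no closed-form solution for c in general, but the right-hand side is increasing in c, so a one-dimensional root search suffices. The following is a minimal sketch using bisection; the function names, the illustrative value of q̄ and the stratum weights and relative risks in the example are assumptions made for the illustration, not values taken from the fitted model.

```python
def average_risk(c, q_bar, strata):
    """Right-hand side of Equation 7.70; strata is a list of (w_s, k_s) pairs."""
    return sum(w * (1.0 - (1.0 - q_bar) ** (c * k)) for w, k in strata)

def solve_c(q_bar, strata, iters=200):
    """Find c in [0, 1] such that the stratum-weighted risk equals q_bar.

    Assumes every k_s >= 1, so that the weighted risk at c = 1 is at least q_bar.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if average_risk(mid, q_bar, strata) < q_bar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    q_bar = 0.06  # hypothetical population probability of heart attack
    # hypothetical strata (ge, gE, Ge, GE): weights sum to 1, k_ge = 1
    strata = [(0.81, 1.0), (0.09, 2.0), (0.09, 2.0), (0.01, 3.0)]
    c = solve_c(q_bar, strata)
    print(c, average_risk(c, q_bar, strata))
```

The second printed value should reproduce q̄, and c falls below 1 whenever some stratum has ks > 1, reflecting that the baseline stratum must be at lower risk than the population average.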
7.2 Threshold Premium for Critical Illness Insurance
To extend the two-state insurance model of Section 6.1 to the CI model with six
states, we make some simplifying assumptions.
(a) We will model gene-environment interactions affecting heart attack risk alone,
leaving other intensities unaffected. This is not completely realistic, since many
known risk factors for heart disease are also risk factors for other disorders.
(b) The heart attack transition intensity is different for males and females. Figure
7.27 shows the ratio $\lambda_{12}(x) / \sum_{j=2}^{5} \lambda_{1j}(x)$ for both sexes. Heart attack is the
predominant CI among middle-aged men, while among women, heart attack is
increasingly prominent from age 30 onwards, but cancer is the dominant CI at
all ages. The ratio for males stays significantly higher than the ratio for females,
except at very high ages. Hence we might expect adverse selection to appear at
different relative risk thresholds for the two sexes.
7.3 Premium Rates for Critical Illness Insurance
As examples, we model single-premium CI insurance contracts of duration 15 years
sold to males and females aged 25, 35 and 45. First, assuming all transition
intensities are as given by Gutiérrez and Macdonald (2003), we compute the single
premiums as expected present values (EPVs) of the benefit payments by solving
Thiele’s differential equations (see Norberg (1995)) numerically. Again for
simplicity, we assume the force of interest δ = 0. Table 7.27 gives the CI premium rates
per unit sum assured for these contracts.

[Figure 7.27: The ratio of heart attack transition intensity to total critical illness
transition intensity, by gender. Horizontal axis: Age (years), 0 to 80; vertical axis:
Ratio, 0 to 1; curves: Male, Female.]

Table 7.27: The premium rates of critical illness contracts of duration 15 years.

Age       Male      Female
25     0.013787   0.018746
35     0.048413   0.049715
45     0.136363   0.110434

We make the same epidemiological assumptions as before, namely that kgE =
kGe = k; that an additive model (kGE = 2k − 1) applies, and that wge = ω²,
wgE = wGe = ω(1 − ω), and wGE = (1 − ω)², where ω = 0.9 (the more realistic
assumption); and also that initial wealth is W = £100,000. Given the relative risks,
we obtain c and hence the heart attack intensity for each sex and stratum as in
Section 7.1. This allows us to calculate stratum-specific premium rates.
Let Ps denote the single premium rate for unit CI insurance in stratum s. Note
that apart from the stratum-specific heart attack risk, Ps also covers the risk of
all other CIs, which are assumed to be the same for all strata. Let P̄ denote the
population average premium rate for unit CI insurance (the averaging being over all
strata for a given gender). As before, since we are ignoring interest rates and profit
margins, the various premium rates defined above are the same as the probabilities
of the event insured against. Then define a function Z(P ) of a premium P as follows:
Z(P ) = U (W − P̄ L) − [P U (W − L) + (1 − P )U (W )].
(7.71)
Note that Z(Pge ) < 0 is the condition under which adverse selection will appear,
equivalent to Equation 6.59 of Section 6.3. Or, let P † be the solution of Z(P ) = 0.
Then Pge < P † is the condition for adverse selection to appear. Tables 7.28 and
7.29 show P † for males and females respectively. It depends on the utility function
but not on the epidemiological model. For the 2-state model, Equation 6.59 was
central in our analysis. Given: (a) a model structure (additive or multiplicative),
the baseline risk qge , and the proportion ω with low values of each risk factor; and
(b) noting that the average risk q̄ was an increasing function of the relative risk
parameter k; we obtained a minimum value of k for which adverse selection first
appears.
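Because Z(P) in Equation 7.71 is linear in P, the root P† can be written down directly as the ratio [U(W) − U(W − P̄L)] / [U(W) − U(W − L)]. The sketch below evaluates this ratio for the utility functions used in this chapter; the helper names are ours, and the utility forms (w^λ/λ, log w and −exp(−Aw)) are the usual parameterisations, assumed here to match those defined earlier in the thesis.

```python
import math

def iso_elastic(lam):
    """Iso-elastic utility U(w) = w**lam / lam, for lam != 0."""
    return lambda w: w ** lam / lam

def neg_exponential(A):
    """Negative exponential utility U(w) = -exp(-A * w)."""
    return lambda w: -math.exp(-A * w)

log_utility = math.log

def p_dagger(U, W, L, P_bar):
    """Root of Z(P) = U(W - P_bar*L) - [P*U(W - L) + (1 - P)*U(W)]."""
    return (U(W) - U(W - P_bar * L)) / (U(W) - U(W - L))

if __name__ == "__main__":
    W, L = 100_000.0, 10_000.0
    P_bar = 0.013787  # Table 7.27, males aged 25
    for name, U in [("I(0.5)", iso_elastic(0.5)), ("Log", log_utility),
                    ("I(-8)", iso_elastic(-8.0)), ("N(9e-5)", neg_exponential(9e-5))]:
        print(name, round(p_dagger(U, W, L, P_bar), 6))
```

For males aged 25 and L = £10,000 the printed values should agree, up to rounding of the inputs, with the first column of Table 7.28 (0.013438, 0.013095, 0.008388 and 0.008554).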
Table 7.28: P† for males, which solves Z(P) = 0, for different combinations of utility
functions and losses, using initial wealth W = £100,000.

Utility                                        Loss L in £'000
Function  Age        10        20        30        40        50        60        70        80        90
I(0.5)    25   0.013438  0.013068  0.012674  0.012250  0.011788  0.011277  0.010695  0.010004  0.009102
          35   0.047229  0.045969  0.044622  0.043167  0.041577  0.039808  0.037788  0.035378  0.032216
          45   0.133321  0.130058  0.126534  0.122691  0.118448  0.113678  0.108172  0.101522  0.092679
Log       25   0.013095  0.012374  0.011620  0.010826  0.009980  0.009065  0.008055  0.006891  0.005423
          35   0.046062  0.043604  0.041019  0.038282  0.035353  0.032171  0.028636  0.024543  0.019348
          45   0.130316  0.123918  0.117108  0.109801  0.101879  0.093158  0.083326  0.071772  0.056865
I(−8)     25   0.008388  0.004503  0.002062  0.000773  0.000223  0.000045  0.000005  0.000000  0.000000
          35   0.029922  0.016319  0.007596  0.002893  0.000849  0.000174  0.000021  0.000001  0.000000
          45   0.087752  0.049912  0.024272  0.009674  0.002978  0.000642  0.000081  0.000004  0.000000
N(9e-5)   25   0.008554  0.004976  0.002733  0.001429  0.000719  0.000351  0.000167  0.000078  0.000036
          35   0.030512  0.018032  0.010061  0.005349  0.002734  0.001356  0.000656  0.000312  0.000146
          45   0.089459  0.055093  0.032069  0.017804  0.009517  0.004938  0.002504  0.001247  0.000613

Table 7.29: P† for females, which solves Z(P) = 0, for different combinations of utility
functions and losses, using initial wealth W = £100,000.

Utility                                        Loss L in £'000
Function  Age        10        20        30        40        50        60        70        80        90
I(0.5)    25   0.018273  0.017773  0.017239  0.016664  0.016038  0.015344  0.014554  0.013616  0.012389
          35   0.048500  0.047209  0.045827  0.044334  0.042702  0.040886  0.038813  0.036339  0.033093
          45   0.107899  0.105188  0.102269  0.099094  0.095600  0.091684  0.087179  0.081758  0.074580
Log       25   0.017809  0.016833  0.015811  0.014734  0.013586  0.012344  0.010971  0.009388  0.007390
          35   0.047304  0.044782  0.042131  0.039322  0.036315  0.033050  0.029420  0.025217  0.019880
          45   0.105398  0.100089  0.094460  0.088443  0.081945  0.074821  0.066825  0.057471  0.045463
I(−8)     25   0.011431  0.006150  0.002823  0.001060  0.000307  0.000062  0.000007  0.000000  0.000000
          35   0.030745  0.016778  0.007814  0.002978  0.000875  0.000180  0.000021  0.000001  0.000000
          45   0.070219  0.039438  0.018924  0.007438  0.002256  0.000479  0.000059  0.000003  0.000000
N(9e-5)   25   0.011657  0.006796  0.003740  0.001961  0.000989  0.000483  0.000231  0.000108  0.000050
          35   0.031351  0.018539  0.010350  0.005506  0.002817  0.001397  0.000677  0.000322  0.000151
          45   0.071593  0.043550  0.025029  0.013714  0.007231  0.003700  0.001849  0.000908  0.000439

Table 7.30: The population average premium rate for CI insurance, P0, as if heart
attack risk were absent (λ12 = 0).

Age       Male      Female
25     0.009821   0.018326
35     0.031290   0.046485
45     0.092818   0.097947

We would like to do the same for the CI insurance model. However, there are
important differences between the two models.
(a) In the 2-state model we specified the baseline risk and relative risks, and these
determined the average risk. In the CI insurance model, we specify the average
risk (given by the population heart attack risk) and the relative risks, and these
determine the baseline risk, in the form of the relative risk c. Clearly increasing
the relative risk k will cause c to fall, hence also the premium Pge . To make this
dependence clear, we will write c(k) and Pge (k) in this section. It will also be
useful to note that the probability qge of a heart attack similarly depends on k,
and write qge (k).
(b) However, unlike in the 2-state model, Pge (k) has a lower bound, denoted P0 ,
given by the population average premium rate for CI insurance as if heart attack
risk were absent (λ12 = 0 and c = 0). These values are shown in Table 7.30.
They do not depend on the epidemiological model or the utility function. Clearly
Pge (k) ≥ P0 , no matter how high k becomes. Thus we have two possibilities:
limk→∞ Pge (k) = P0 (equivalently limk→∞ c(k) = 0); or limk→∞ Pge (k) > P0
(equivalently limk→∞ c(k) > 0). We return to this point in Section 7.4.
(c) If Pge (k) is a strictly decreasing function, which it is for the utility functions
we are using, adverse selection is possible if limk→∞ Pge (k) < P † , and in such
cases we can solve Pge (k) = P † for the threshold value of k above which adverse
selection will appear. Tables 7.31 and 7.32 show these values for the various
utility functions and loss levels, for males and females respectively. The missing
values correspond to combinations of parameters such that limk→∞ Pge (k) > P † ,
for which adverse selection will not appear.
(d) Another consequence of this is that there is a level of insured loss, that we
denote L0, above which adverse selection cannot occur, because fixing L > L0
in Equation 7.71 and solving for P† yields a solution P† < Pge(k) for all k.
Table 7.33 gives the values of L0, for the usual utility functions and initial
wealth £100,000. The missing values in Tables 7.31 and 7.32 occur for losses
L > L0.

Table 7.31: The relative risk k above which males of different ages in stratum ge
with initial wealth W = £100,000 will not buy critical illness insurance policies of
term 15 years, where ω = 0.9.

Utility                                   Loss L in £'000
Function  Age       10       20       30       40       50       60       70       80       90
I(0.5)    25     1.484    2.111    2.960    4.183    6.117    9.698   18.869  105.569        –
          35     1.376    1.846    2.450    3.262    4.420    6.226    9.509   17.715   93.578
          45     1.389    1.886    2.544    3.456    4.808    7.027   11.388   24.239        –
Log       25     2.062    3.783    7.068   15.883  122.410        –        –        –        –
          35     1.808    2.998    4.917    8.530   17.855   98.596        –        –        –
          45     1.843    3.138    5.339    9.794   23.063  765.192        –        –        –
I(−8)     25         –        –        –        –        –        –        –        –        –
          35         –        –        –        –        –        –        –        –        –
          45         –        –        –        –        –        –        –        –        –
N(9e-5)   25         –        –        –        –        –        –        –        –        –
          35         –        –        –        –        –        –        –        –        –
          45         –        –        –        –        –        –        –        –        –

Table 7.32: The relative risk k above which females of different ages in stratum ge
with initial wealth W = £100,000 will not buy critical illness insurance policies of
term 15 years, where ω = 0.9.

Utility                                   Loss L in £'000
Function  Age       10       20       30       40       50       60       70       80       90
I(0.5)    25         –        –        –        –        –        –        –        –        –
          35     4.031   18.470        –        –        –        –        –        –        –
          45     2.293    4.710   10.770   52.668        –        –        –        –        –
Log       25         –        –        –        –        –        –        –        –        –
          35    15.856        –        –        –        –        –        –        –        –
          45     4.459   26.155        –        –        –        –        –        –        –
I(−8)     25         –        –        –        –        –        –        –        –        –
          35         –        –        –        –        –        –        –        –        –
          45         –        –        –        –        –        –        –        –        –
N(9e-5)   25         –        –        –        –        –        –        –        –        –
          35         –        –        –        –        –        –        –        –        –
          45         –        –        –        –        –        –        –        –        –

Table 7.33: The loss L0 in £'000 above which adverse selection cannot occur. Initial
wealth W = £100,000.

                       Utility Function
Gender   Age   I(0.5)    Log   I(−8)   N(9e-5)
Male     25     82.3    51.8     7.1      7.2
         35     92.3    62.6     9.2      9.5
         45     89.9    60.4     8.9      9.2
Female   25      8.9     4.5     0.5      0.5
         35     25.3    13.3     1.5      1.6
         45     43.4    23.9     2.9      2.9

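The values of L0 can be approximated by solving P†(L) = P0 for L. The sketch below does this by bisection, under two assumptions that hold for the cases tabulated here but are not proved in general: that Pge(k) tends to P0 as k increases (true when wge ≤ 1 − q̄, as with ω = 0.9), and that P† decreases as L increases, as it does throughout Tables 7.28 and 7.29. The function names and the search bracket are ours.

```python
import math

def iso_elastic(lam):
    """Iso-elastic utility U(w) = w**lam / lam, for lam != 0."""
    return lambda w: w ** lam / lam

def p_dagger(U, W, L, P_bar):
    """Root of Z(P) in Equation 7.71, which is linear in P."""
    return (U(W) - U(W - P_bar * L)) / (U(W) - U(W - L))

def loss_limit(U, W, P_bar, P0, iters=200):
    """Loss L0 (in pounds) at which P_dagger(L0) = P0; None if not bracketed."""
    lo, hi = 0.001 * W, 0.999 * W  # avoid L -> 0 (cancellation) and L = W
    if p_dagger(U, W, lo, P_bar) <= P0 or p_dagger(U, W, hi, P_bar) >= P0:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if p_dagger(U, W, mid, P_bar) > P0:
            lo = mid  # adverse selection still possible at this loss
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    W = 100_000.0
    P_bar, P0 = 0.013787, 0.009821  # males aged 25: Tables 7.27 and 7.30
    print(loss_limit(iso_elastic(0.5), W, P_bar, P0) / 1000.0)
    print(loss_limit(math.log, W, P_bar, P0) / 1000.0)
```

For males aged 25 the two printed values should be close to the entries 82.3 and 51.8 in Table 7.33; small differences are to be expected because the inputs are the rounded premium rates from Tables 7.27 and 7.30.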
The general pattern of threshold relative risks for males given in Table 7.31 is
similar to that in Chapter 6; what is of most interest are their absolute values, since
we have tried to suggest plausible models for both the risk model and the utility
functions.
(a) For iso-elastic utility with λ = −8 and negative exponential utility with parameter A = 9 × 10⁻⁵, we find no evidence at all of adverse selection.
(b) For all utility functions and at all loss levels, if adverse selection can appear, it
does so at higher levels of relative risk than under the two-state model. This is
because the impact of the gene and environment on heart attack risk is diluted by
the presence of the other CIs. Only for the lowest levels of loss are these relative
risks in the range that might be typical of relatively common multifactorial
disorders; by definition, we do not expect studies like UK Biobank to lead to
the discovery of hitherto unknown high risk genotypes.
(c) When adverse selection can appear, the relative risk threshold first decreases
and then increases with age. This is because among CIs the importance of heart
attack peaks at around age 45 as can be seen from Figure 7.27.
The threshold relative risks for females are given in Table 7.32. We observe the
following:
(a) The threshold relative risks are much higher than those for males, in all cases.
This is because heart attacks form a smaller proportion of all CIs for females,
so a larger increase in heart attack risk is needed to trigger adverse selection.
(b) As for males, at levels of absolute and relative risk-aversion that we regard as
most plausible (consistent with the Bank of Italy study) we find no evidence
that adverse selection is likely.
(c) In contrast to males, the threshold relative risks decrease with age. The reason
is clear from Figure 7.27; for females the relative importance of heart attack
increases with age.
(d) Adverse selection appears to be possible only for: (i) smaller losses; and (ii)
extremely low levels of risk aversion.
7.4 High Relative Risks
In Section 6.5, we considered relative risks that increased without limit, for the
simple 2-state insurance model. We saw that, even in this extreme case, if stratum
ge was large enough, adverse selection would not appear. In this section, we consider
high relative risks (of heart attack) in the CI insurance model.
We assume the heart attack rates in the general population λ12 (x) are fixed at
their estimated values (Gutiérrez and Macdonald (2003)). From Equation 7.70 we
obtain:
\[
1 - \bar{q} = 1 - \sum_s w_s \left[ 1 - (1 - \bar{q})^{c(k) k_s} \right]
            = w_{ge} (1 - \bar{q})^{c(k)} + \sum_{s \neq ge} w_s (1 - \bar{q})^{c(k) k_s}. \tag{7.72}
\]
Differentiation shows the right-hand side to be a decreasing function of c and of
each ks (s ≠ ge), all other quantities held constant in each case. Also, if c = 1 the
right-hand side is less than (1 − q̄) while if c = 0 it is greater than (1 − q̄). Hence,
as we increase the ks without limit, c must decrease, and being bounded below it
must have a limit. The limit could be zero or non-zero. We can easily see that if c
has a non-zero limit (necessarily positive) then the last term on the right-hand side
of Equation 7.72 vanishes and the limit must be:
\[
\lim_{\substack{k_s \to \infty \\ s \neq ge}} c(k) = 1 - \frac{\log w_{ge}}{\log(1 - \bar{q})} \tag{7.73}
\]
which in turn implies (1 − q̄) < wge. On the other hand if (1 − q̄) > wge, then c
cannot have non-zero limit, so the equation:
\[
\lim_{\substack{k_s \to \infty \\ s \neq ge}} \sum_{s \neq ge} w_s (1 - \bar{q})^{c(k) k_s} = (1 - \bar{q}) - w_{ge} \tag{7.74}
\]
holds. Since the left-hand side is finite, at least one of the products cks tends to a
finite limit as the ks → ∞. However, we have not specified here how the quantities
ks (s ≠ ge) jointly approach infinity, so the behaviour of c is not easy to analyse in
general. It is greatly simplified if the ks are simple functions of a single parameter
k, which is the case in our assumed epidemiological model (in which case we again
make explicit the dependence of c by writing c(k)). For example, under an additive
model with symmetry between genetic and environmental risks, Equation 7.72 can
be written as:
\[
1 - \bar{q} = \omega^2 (1 - \bar{q})^{c(k)} + 2\omega(1 - \omega)(1 - \bar{q})^{c(k) k} + (1 - \omega)^2 (1 - \bar{q})^{c(k)(2k - 1)}
            = (1 - \bar{q})^{c(k)} \left[ \omega + (1 - \omega)(1 - \bar{q})^{c(k)(k - 1)} \right]^2 \tag{7.75}
\]
therefore:
\[
k = 1 + \frac{\log\left[ (1 - \bar{q})^{(1 - c(k))/2} - \omega \right] - \log(1 - \omega)}{c(k) \log(1 - \bar{q})}. \tag{7.76}
\]
If ω² > (1 − q̄) then as k → ∞, the limiting value of c(k) is non-zero. Otherwise,
when ω² < (1 − q̄), c(k) → 0, and Equation 7.76 yields the finite limiting value:
\[
\lim_{k \to \infty} c(k)\, k = \frac{\log\left[ (1 - \bar{q})^{1/2} - \omega \right] - \log(1 - \omega)}{\log(1 - \bar{q})}. \tag{7.77}
\]
So, in summary:
\[
\lim_{k \to \infty} c(k) =
\begin{cases}
0 & \text{if } w_{ge} \leq (1 - \bar{q}) \\[4pt]
1 - \dfrac{\log w_{ge}}{\log(1 - \bar{q})} & \text{if } w_{ge} > (1 - \bar{q}).
\end{cases} \tag{7.78}
\]

Table 7.34: q̄, the probability that a healthy person aged x has a heart attack before
age x + t, for policy duration t = 15 years.

Age       Male      Female
25     0.004743   0.000541
35     0.021454   0.004299
45     0.059959   0.017616

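The limits in Equations 7.77 and 7.78 can be checked numerically by solving Equation 7.75 for c(k) at a large value of k. The following is a sketch under the additive, symmetric model only; the function names and the choice k = 10,000 are ours, and bisection is used because the right-hand side of Equation 7.75 is decreasing in c.

```python
import math

def solve_c(k, q_bar, omega, iters=200):
    """Solve Equation 7.75 for c(k) in [0, 1] by bisection."""
    log_s = math.log1p(-q_bar)  # log(1 - q_bar), negative
    target = 1.0 - q_bar

    def rhs(c):
        inner = omega + (1.0 - omega) * math.exp(c * (k - 1.0) * log_s)
        return math.exp(c * log_s) * inner ** 2

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > target:  # rhs is decreasing in c
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def limit_c(q_bar, omega):
    """Equation 7.78 with w_ge = omega**2."""
    w_ge = omega ** 2
    if w_ge <= 1.0 - q_bar:
        return 0.0
    return 1.0 - math.log(w_ge) / math.log1p(-q_bar)

def limit_ck(q_bar, omega):
    """Equation 7.77; requires omega**2 < 1 - q_bar."""
    return ((math.log(math.sqrt(1.0 - q_bar) - omega) - math.log(1.0 - omega))
            / math.log1p(-q_bar))

if __name__ == "__main__":
    q_bar, k = 0.059959, 1.0e4  # males aged 45, Table 7.34
    print(solve_c(k, q_bar, 0.99), limit_c(q_bar, 0.99))     # non-zero limit
    print(k * solve_c(k, q_bar, 0.9), limit_ck(q_bar, 0.9))  # c -> 0, c*k finite
```

With ω = 0.99 the baseline stratum is larger than 1 − q̄ and c(k) settles at the non-zero limit; with ω = 0.9 it is smaller, c(k) tends to zero and the product c(k)k approaches the value given by Equation 7.77.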
We want to find out if the baseline stratum ge can ever be large enough that
adverse selection will never appear, no matter how large k becomes. Hence we want
to understand the behaviour of limk→∞ Pge (k) as a function of wge . Equation 7.78
shows that we must treat separately the cases wge ≤ (1 − q̄) and wge > (1 − q̄).
Values of q̄ are given in Table 7.34. (Note that P0 + q̄ ≠ P̄, because in a competing
risks model removing one cause of decrement increases the probabilities of the other
decrements occurring.)
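The competing-risks remark can be illustrated with a toy calculation. The sketch below assumes constant transition intensities, which is not the fitted model of Gutiérrez and Macdonald (2003); the intensity values are hypothetical and chosen only to show the direction of the effect. With δ = 0, the premium rate is the probability that a CI event occurs before death within the term.

```python
import math

def ci_claim_probability(h, o, m, t):
    """P(a CI event occurs before death within t years), constant intensities.

    h: heart attack intensity, o: all other CI intensities combined,
    m: mortality intensity from the healthy state.
    """
    total = h + o + m
    return (h + o) / total * -math.expm1(-total * t)  # -expm1(-x) = 1 - exp(-x)

def heart_attack_probability(h, t):
    """Equation 7.67 with a constant intensity: q_bar = 1 - exp(-h * t)."""
    return -math.expm1(-h * t)

if __name__ == "__main__":
    h, o, m, t = 0.004, 0.006, 0.002, 15.0  # hypothetical annual intensities
    P_bar = ci_claim_probability(h, o, m, t)
    P0 = ci_claim_probability(0.0, o, m, t)  # heart attack risk removed
    q_bar = heart_attack_probability(h, t)
    print(P_bar, P0 + q_bar)
```

In this example P0 + q̄ exceeds P̄: once heart attack is removed as a decrement, lives remain exposed to the other CIs for longer, so P0 is larger than the share of P̄ attributable to those causes.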
(a) If P0 > P † the result is trivial, since limk→∞ Pge (k) ≥ P0 for any value of wge ,
and adverse selection can never occur.
(b) If P0 < P † adverse selection will occur if wge ≤ (1 − q̄), since then
limk→∞ Pge (k) = P0 .
(c) The non-trivial case is P0 < P† and wge > (1 − q̄), since then limk→∞ Pge(k) > P0.
We can show that limk→∞ Pge(k) is an increasing function of wge in this range,
because the limit of the heart attack probability limk→∞ qge(k) is increasing in wge
(use Equation 7.73 to write:
\[
\lim_{k \to \infty} q_{ge}(k) = \lim_{k \to \infty} \left[ 1 - (1 - \bar{q})^{c(k)} \right] = 1 - \frac{(1 - \bar{q})}{w_{ge}} \tag{7.79}
\]
and differentiate). The function limk→∞ Pge(k) is continuous and increases from
P0 to P̄ as wge increases from (1 − q̄) to 1, the upper limit being attained when
all the strata have collapsed into one, and c = 1. Since P† < P̄ for any concave
utility function, the intermediate value theorem guarantees that there exists a
unique value of wge such that limk→∞ Pge(k) = P†; that is, such that adverse
selection can never appear if wge exceeds this value.

Table 7.35: The proportions ω exposed to each low-risk factor above which persons
in the baseline stratum will buy insurance at the average premium regardless of the
relative risk k, using different utility functions, for males purchasing CI insurance.

Utility                           Loss L in £'000
Function  Age      10     20     30     40     50     60     70     80     90
I(0.5)    25    1.000  1.000  0.999  0.999  0.999  0.998  0.998  0.998      –
          35    0.999  0.998  0.998  0.997  0.996  0.995  0.993  0.992  0.990
          45    0.998  0.995  0.993  0.990  0.987  0.984  0.980  0.975      –
Log       25    1.000  0.999  0.999  0.998  0.998      –      –      –      –
          35    0.999  0.997  0.995  0.994  0.992  0.990      –      –      –
          45    0.996  0.991  0.986  0.981  0.976  0.970      –      –      –
I(−8)     25        –      –      –      –      –      –      –      –      –
          35        –      –      –      –      –      –      –      –      –
          45        –      –      –      –      –      –      –      –      –
N(9e-5)   25        –      –      –      –      –      –      –      –      –
          35        –      –      –      –      –      –      –      –      –
          45        –      –      –      –      –      –      –      –      –

Tables 7.35 and 7.36 give the threshold values of ω = wge^(1/2) above which no adverse
selection takes place, in the additive model with gene-environment symmetry, for
males and females respectively. Missing values indicate that adverse selection will
never appear. When it is possible, the threshold value of ω ranges from 0.970 to 1
for males and 0.992 to 0.999 for females. As the relative risks in Tables 7.31 and
7.32 are based on ω = 0.9, this explains the missing values in those tables.
This pattern is quite unexpected. If adverse selection can occur, then a large
enough baseline stratum does confer immunity from it, but it has to be very large
indeed, all but a few percent of the population. But once the threshold is crossed,
adverse selection cannot appear at all, even if very few people are in the baseline
stratum. This had no counterpart in the 2-state model, and it is caused by the
presence of substantial other risks not affected by the gene-environment variants.
Table 7.36: The proportions ω exposed to each low-risk factor above which persons
in the baseline stratum will buy insurance at the average premium regardless of the
relative risk k, using different utility functions, for females purchasing CI insurance.

Utility                           Loss L in £'000
Function  Age      10     20     30     40     50     60     70     80     90
I(0.5)    25        –      –      –      –      –      –      –      –      –
          35    0.999  0.998      –      –      –      –      –      –      –
          45    0.998  0.996  0.994  0.992      –      –      –      –      –
Log       25        –      –      –      –      –      –      –      –      –
          35    0.998      –      –      –      –      –      –      –      –
          45    0.996  0.993      –      –      –      –      –      –      –
I(−8)     25        –      –      –      –      –      –      –      –      –
          35        –      –      –      –      –      –      –      –      –
          45        –      –      –      –      –      –      –      –      –
N(9e-5)   25        –      –      –      –      –      –      –      –      –
          35        –      –      –      –      –      –      –      –      –
          45        –      –      –      –      –      –      –      –      –

7.5 Conclusions
Until now, genetical research on information asymmetry and adverse selection has
taken one of two routes — models of single-gene disorders and work on the economic welfare effects of genetic testing. Single-gene disorders, by their very nature,
are often severe and it is a reasonable first approximation to assume that private
information about risk makes insurance purchase highly likely. This is not so for
multifactorial disorders, where adverse gene-environment interactions are expected
to be much more common and lead to more modest risk differences. On the other
hand, the economic welfare approach concentrates primarily on efficiency losses in
the insurance market, and may be less concerned with the epidemiology. In this
chapter, we have represented multifactorial disorders using standard epidemiological
models and analysed circumstances leading to adverse selection, taking economic
factors into account in a simple way through expected utility.
Logarithmic utility, although popular, may not reflect all risk preferences very
well. In particular, Eisenhauer and Ventura (2003) showed that consumers’ risk-aversion is normally much greater than implied by logarithmic utility. We therefore
used utilities with both realistic and traditional risk-aversion coefficients to illustrate
our results.
We used a simple 2 × 2 gene-environment interaction model, assuming that
information on status within the model was available only to the consumers and
not to the insurer. Competition leads insurers to charge actuarially fair premiums,
based on expected losses given the information they have. Adverse selection will not
occur as long as members of the least risky stratum (who know their status) can
still increase their expected utility by insuring at the average price.
First, we studied a simple 2-state insurance model, with constant relative risks
in different risk strata defined by the gene-environment model and sex. We found
that adverse selection does not appear unless purchasers have relatively low risk aversion
(compared with what we think to be a plausible parameterisation) and insure only
a small proportion of their wealth; or unless the elevated risks implied by genetic
information are implausibly high, bearing in mind the nature of multifactorial risk.
In many cases adverse selection is impossible if the low-risk stratum is large enough,
these levels being quite compatible with plausible multifactorial disorders.
We applied the same gene-environment interaction model, assumed to affect the
risk of heart attacks, to CI insurance. As heart-attack risk is just part of the risk of
all CIs, the impact of the gene-environment risk factor was diluted, compared with
the 2-state insurance model where the total risk was influenced. Our results showed
complete absence of adverse selection at realistic risk-aversion levels, irrespective
of the stratum-specific risks. Moreover, the existence of risks other than of heart
attack, and the constraint of differential heart-attack risk to be consistent with the
average population risk, introduced a threshold effect absent from the 2-state model.
When adverse selection was possible at all (low risk aversion, low loss ratios) only
an unfeasibly high proportion of the population in the low-risk stratum would avoid
it, but when the threshold was crossed adverse selection vanished no matter what
the size of the low-risk stratum.
The results from both 2-state and CI insurance models suggest that in circumstances that are plausibly realistic, private genetic information, relating to multifactorial risks, that is available only to customers does not lead to adverse selection.
This conclusion is strongest in the more realistic CI insurance model.
We have not considered what might happen if insurers were allowed access to
this genetic information. The opportunity would then exist to underwrite using
that information. If one believed that social policy is best served by solidarity, the
important question is whether insurers would find it worthwhile to use the genetic
information. Further research would be useful, to investigate the costs of acquiring
and interpreting genetic information relating to common diseases, compared with
the benefits in terms of possibly more accurate risk classification, in both cases in
the context of multifactorial risk.
Chapter 8
Conclusions
In Chapter 1, we set out our broad objectives for the thesis — to analyse how
gene-environment interactions in multifactorial disorders might affect current underwriting practices of the insurance industry. Researchers have found that severe
single-gene disorders, due to their rarity, do not have a significant impact on insurance premiums. Equivalently, the extent of adverse selection was found to be
minimal. Multifactorial disorders, on the other hand, are much more common and
any medical development or breakthrough in this area is likely to have a major impact on the insurance industry. With the setting up of large-scale cohort studies,
like the UK Biobank project, specifically to concentrate on multifactorial disorders,
this has become a real possibility. Given this backdrop, we tackled two fundamental
questions in this thesis:
(a) In the next 5–10 years, as results start emerging from UK Biobank, what will
be the impact of these on the insurance industry?
(b) Given the risk-averse nature of insurance purchasers, at what levels of gene-environment interaction might an insurer face a realistic risk of adverse selection?
8.1 UK Biobank Simulation Study
In the first half of the thesis, we examined question (a). We chose heart attack as the
disorder of interest and hypothesised a simple 2×2 gene-environment interaction for
the risk of heart attack. This segregated the study population into four strata with
varying risk-profiles based on the impact of the respective genes and environmental
factors. As the rates of onset of heart attack are significantly different for males
and females, we analysed the results separately for each sex. Based on this model,
we randomly simulated 500,000 life histories to generate data similar to what is
expected to emerge out of the UK Biobank project. An epidemiological analysis was
then carried out on the simulated data using case-control studies. The results, thus
obtained, were then used in an actuarial model to calculate CI insurance premium
rates for all strata.
This led us to the question: how reliable are these estimates of premium rates,
on which insurers might base a justification for discriminating between individuals with
different genes and with exposure to different environmental factors? In particular,
GAIC and other interested parties would want insurers to provide factual evidence
and rigorously demonstrate the justification of underwriting strategies based on
genetic information. So, we looked at the empirical distributions of the estimated
premium rates generated by simulating many replications of UK Biobank. We noted
that the strata-specific premium rates, as a proportion of the baseline premiums, are
uncorrelated and the extent of overlap of the empirical densities provided a measure
of reliability.
The main conclusions from the analysis are as follows:
(a) Our Base scenario assumptions reflected fairly common adverse genetic and environmental exposures with modest penetrances. This is what we would expect
for most common multifactorial disorders. We found that, if epidemiologists
opted for an extensive study, which included all heart attack cases in conjunction with a 1:5 matching strategy, reliable discrimination could be achieved.
However, this is also an expensive option and case-control studies with such
large numbers of cases and controls may not be economically viable. Case-control studies with a few thousand cases coupled with a modest 1:1 matching
strategy, although realistic, quickly diminished the reliability of the estimates
and thus the power to discriminate.
(b) We also analysed the results by varying our assumptions of the frequencies and
penetrances of the adverse exposures. The reliability of the estimates fell
substantially when the proportion of adverse traits in the study population was
halved. Case-control studies also became increasingly harder to carry out as the
number of cases decreased and suitable matching controls with adverse traits
became rarer. Reduced penetrances had a similar impact with reduced ability
to discriminate between different risk-categories; the problems being more acute
for case-control studies with fewer cases and controls.
(c) The results were similar for both males and females when the number of cases
used in the case-control study was fixed in advance. If all cases were to be
included in the study, the estimates of premium rates for females were less
reliable than those for males. This is because heart attacks are rarer for females
and as a result total numbers of cases were fewer.
To summarise, we found that, unless the “adverse” genetic and environmental
factors are abundant or have significant penetrances, the inherent variability of
estimates obtained from case-control studies would make it difficult for insurers to
justify charging different premiums for different risk-groups. This result should bring
comfort to the regulators and other groups who are concerned about insurers using
genetic information to discriminate against the unfortunate few.
While carrying out our analysis we have made a number of simplifying assumptions to keep the problem tractable. Further research needs to be carried out to
analyse the implications of relaxing these assumptions. In particular:
(a) We have assumed a 2×2 gene-environment interaction, which is the simplest
of multifactorial models. However, most common disorders are likely to involve
higher order gene-environment interaction with complex interplay between multiple genes and environmental factors. Extending the simple 2×2 model to
general higher order interactions should produce interesting results.
(b) Caution should also be exercised in interpreting the results because they are
based on some idealised assumptions. In particular, we have ignored the problems of model mis-specification altogether. In reality, there are a number of
places where this can go wrong. As UK Biobank is essentially an unrepeatable
exercise, epidemiologists will have access to a single set of observations, based
on which they will propose their models. It is thus highly unlikely that the
model will be “accurate”. A poor choice of epidemiological model would lead to
erroneous results and will inevitably have additional knock-on effects for studies based on these results. Mis-specification can also occur when an actuary
tries to develop his or her own model based on the results published by the
epidemiologists. The true implications of these need to be investigated further.
8.2 Adverse Selection Issues
In the second half of the thesis, we tackled the adverse selection issues in the context of multifactorial disorders. In many countries, due to regulations or agreed
moratoria, genetic information is treated as private and insurers do not have access
to such information. This asymmetry of information can then lead to adverse selection if individuals in the lowest risk-category find the average premium, charged
by the insurer, unacceptably high. Of course, this would depend on a number of
factors including the degree of risk-aversion of these individuals. Our objective was
to analyse the different factors, and the levels of these, which would lead to adverse
selection.
First, we assumed a 2×2 gene-environment interaction in a simple 2-state insurance model. The factors of interest were:
(a) the baseline risk;
(b) the amount of loss insured as a proportion of total wealth;
(c) the proportion of individuals in the lowest risk-category; and
(d) the degree of absolute and relative risk-aversion.
For each of these factors, we analysed the levels of relative risks required to trigger
adverse selection. Our observations were:
(a) The higher the baseline risk, the lower is the level of relative risks of higher risk
strata at which adverse selection appears.
(b) As the amount of loss insured increased as a proportion of total wealth, higher
relative risks were required to trigger adverse selection.
(c) The more risk-averse the individuals, the higher are the relative risks required
for adverse selection.
(d) If the proportion of individuals in the lowest risk-category is high, relative risks
in other categories need to be large to move average premium rates so as to
trigger adverse selection. In fact, we found that if the lowest risk-category is
large enough, it is possible to achieve full immunity from adverse selection. Of
course, the levels at which this is attained depended on all other factors.
We then extended our results to a realistic example of a CI insurance model.
As in the UK Biobank simulation study, we hypothesised a 2×2 gene-environment
interaction on heart attack risk. We assumed that other illnesses covered under a CI
insurance contract remained unaffected by these genes and environmental factors.
The results from this model were along similar lines to those obtained from the
2-state insurance model. In particular:
(a) As the rates of onset of heart attack are different for males and females, we
analysed the impact separately. For females, the relative risks required for
adverse selection were substantially higher than those for males. This is because
heart attacks form only a small proportion of all CIs for females.
(b) The presence of other CIs diluted the impact of gene-environment interactions
on heart attack. As a result, the relative risks required for adverse selection were
generally much higher than those observed for the 2-state insurance model. In
fact, for individuals with empirical estimates of risk-aversion, adverse selection
did not appear at all.
(c) The existence of risks other than that of heart attack introduced a floor below
which CI insurance premiums could not fall even when risks of heart attack were
non-existent. This implied that when adverse selection was possible, immunity
from adverse selection was possible only at a very high proportion of population
in the lowest risk-category. Otherwise, adverse selection does not appear at all.
Results from both the 2-state insurance and CI insurance models confirm the key
message that under realistic assumptions, private genetic information does not lead
to adverse selection.
There are further research opportunities in a number of areas:
(a) As pointed out for the UK Biobank simulation model, extending the simple
2×2 gene-environment model to higher order interactions might also produce
interesting results on adverse selection issues.
(b) An insurer’s decision to use genetic test results, if permitted, will depend on
a number of issues including the actual cost of these tests. If the costs are as
high as they are now, it might not make economic sense to use genetic tests
for underwriting purposes. However, as tests become cheaper, in future, the
balance might tilt in the other direction. It might be of interest to find out the
levels of cost at which genetic testing becomes an affordable underwriting tool.
(c) In our analysis, we made a simplifying assumption that all individuals wish
to insure the same amount of loss, as a proportion of wealth, irrespective of
their risk-profiles. This assumption can be relaxed, as Hoy and Polborn (2000)
showed that under certain assumptions, the appetite for cover increases with
risk. The techniques developed in this thesis can be extended to incorporate
these assumptions and analyse the situations where high-risk individuals could
opt for increased cover. Further research in this area might produce interesting
results.
Appendix A
Epidemiology
A.1 Introduction
Epidemiology is the study of diseases which tries to answer two fundamental questions:
(a) What causes a disease?
(b) Who is affected by a disease?
There might be a number of factors whose interplay manifests itself in the form
of a disease. With the advent of genetic knowledge, researchers have found that
diseases can have genetic causes. In other words, an individual with
a particular gene might have a higher or lower probability of contracting a disease.
This fact does not, however, diminish the role played by the environment on disease
susceptibility. For example, it is well documented that there are more smokers than
non-smokers among lung cancer patients. These factors, genetic and environmental,
which precipitate a disease are called risk factors and form the primary subject
matter of an epidemiological investigation.
The second question tries to ascertain the distribution of a disease. Instead of
looking at the population as a whole, it can be stratified into groups, the analysis of
which may show variability in disease susceptibility by strata. In an epidemiological
study, the usual stratifications are based on age, sex, social class, marital status,
racial group, occupation and geographical location. However, it is vital not to
overlook any other form of stratification which could explain the variation better.
To answer the questions posed above, epidemiologists collect, analyse and interpret data from groups of individuals. The results thus obtained, and the
conclusions drawn from them, apply directly to the individuals from whom the data
are collected. However, it is natural to ask whether the results and conclusions can
be extended to a wider group. Of course, the ultimate goal of an epidemiological
study is to obtain results which can then be extended and be held to be valid for
the entire human population.
In practice, epidemiological investigations commence with an objective to obtain
results for a target population. For example, the UK Biobank protocol clearly states
that its objective is to investigate the risk of common multifactorial disorders of adult
life. So the target population here is the whole UK general population, and with a
wider focus – the entire human population.
Collecting data from the whole target population may not always be feasible. So
normally data is collected from a representative subset of the target population, the
study population. The UK Biobank project aims to collect data from a large cross-section of individuals, at least 500,000 men and women, from the general population
of the United Kingdom.
Once the study population is identified, the focus shifts to the collection of appropriate data for analysis. Ideally, each individual within the study population
should be followed up through time. Every instance of disease should be recorded
along with data on plausible risk factors. Such a detailed study, sometimes called
a cohort study, can then provide direct information on the sequence of happenings
demonstrating causality. Moreover, being so detailed, cohort studies can analyse
many diseases simultaneously.
However, cohort studies are often very expensive and time consuming. Also,
they are not ideal for studying rare diseases as they would require either a very
large study population or a very long time span.
For studying rare diseases, resources can be used more efficiently by employing
case-control studies. Unlike cohort studies, where we follow up every individual
within the study population prospectively for the entire duration of the study period,
in case-control studies individuals are chosen at the end of the study period according
to their disease status. This is why case-control studies are retrospective studies.
In case-control studies, the first step is to identify a number of cases, subjects
with the disease under consideration. The next step is to select a number of controls,
subjects who are free from the disease. Controls should be a representative sample
of those individuals in the study population who do not have the disease, but who
would have had the same chance as a case of being classified as a case had they become diseased. This
is best achieved by matching at the design stage. Matching will be discussed in
detail in a later section.
While selecting cases and controls, care needs to be taken that the definitions of
both cases and controls are precise and strictly adhered to during the course of the
investigation. The other important consideration is the possibility of bias that may
arise if the chance of having a particular risk factor among chosen cases is different
from that among all those with the disease in the study population. The same consideration for
bias needs to be given for controls.
The data from the cases and the controls are then analysed to determine the
effect of different risk factors on these two groups.
As is evident, case-control studies are quicker and cheaper. The resources are also
focused to study the more interesting subjects, the cases, in great detail, which is all
the more crucial for rare diseases. In the UK Biobank project, it is envisaged that
analysis will take the form of case-control studies nested within the study population.
A schematic diagram for a case-control study is given in Figure A.28.
A.2 Measuring risks
Before we start analysing the data, let us clarify what we are trying to measure. In
simple terms, the goal is to measure the risk of a disease. So to start with we need
a formal definition of risk.
The risk of a disease can be defined as the probability of an individual becoming
newly diseased given that the individual has the particular attribute or risk-factor
in question.
Figure A.28: A schematic diagram of a case-control study. (Diagram labels: Target Population, Study Population, All, Diseased, Healthy, Cases, Controls.)

We will introduce some notation here. Let $S(t)$ be a stochastic process which
records an individual’s state at time $t$. Let us also denote $P_{ij}(s, t)$ as the conditional
probability that the study subject is in state $j$ at time $t$, given that it was in state
$i$ at time $s$. In mathematical notation:
\[ P_{ij}(s, t) = \text{Prob}[S(t) = j \mid S(s) = i]. \tag{A.80} \]
The conditional probability defined above is also known as a transition probability. Using the transition probabilities, we can now define the transition intensity or
hazard rate, $\lambda_{ij}(t)$, as the instantaneous rate of change of probability at time $t$, of
moving from state $i$ to state $j$, given that the subject is in state $i$ at time $t$, i.e.,
\[ \lambda_{ij}(t) = \lim_{dt \to 0} \frac{P_{ij}(t, t + dt) - P_{ij}(t, t)}{dt}, \tag{A.81} \]
which can also be written as:
\[ P_{ij}(t, t + dt) = P_{ij}(t, t) + \lambda_{ij}(t) \times dt + o(dt). \tag{A.82} \]
The above definition can be simplified further by noting that a subject cannot remain
in two different states at any one particular instant of time, i.e.,
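\[ P_{ij}(t, t) = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j. \end{cases} \]

Figure A.29: A 2-state model, with a single transition from state 1 = Healthy to state 2 = Diseased at intensity $\lambda_{12}(t)$.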
Using the fact that $\sum_j P_{ij}(s, t) = 1$ for all $t \geq s$, we can derive a useful relationship between the transition intensities. If we sum both sides of Equation A.82 we
get,
\[ \sum_j P_{ij}(t, t + dt) = \sum_j P_{ij}(t, t) + \sum_j \lambda_{ij}(t) \times dt + o(dt), \tag{A.83} \]
which leads to,
\[ \sum_j \lambda_{ij}(t) = 0, \text{ or equivalently, } \lambda_{ii}(t) = -\sum_{j \neq i} \lambda_{ij}(t). \tag{A.84} \]
Before proceeding further, let us work our way through a simple model with two
states – Healthy and Diseased, where the names of the states refer to a particular
disease. Let us assume that an individual always starts off healthy. During the
course of the investigation, the individual can either stay healthy or contract the
disease and move on to the Diseased state. Once in the Diseased state, we will
assume that the individual cannot turn healthy again. Figure A.29 gives a pictorial
representation of this 2-state model.
The transition intensity, λ12 (t), gives the instantaneous rate of change of probability at time t of being diseased for a subject who is healthy up to time t. Let us
now derive a direct relationship between P12 (·) and λ12 (·) as follows. Using basic
probability theory,
\[ P_{12}(s, t + dt) = P_{11}(s, t) P_{12}(t, t + dt) + P_{12}(s, t) P_{22}(t, t + dt). \tag{A.85} \]
Using Equation A.82 and the fact that a subject cannot return to the Healthy state,
i.e. $P_{22}(t, t + dt) = 1$, we have:
\[ P_{12}(s, t + dt) = P_{11}(s, t) \left( P_{12}(t, t) + \lambda_{12}(t) dt + o(dt) \right) + P_{12}(s, t). \tag{A.86} \]
Noting that $P_{11}(s, t) = 1 - P_{12}(s, t)$ and $P_{12}(t, t) = 0$, we can rewrite the above
equation as follows:
\[ P_{12}(s, t + dt) - P_{12}(s, t) = (1 - P_{12}(s, t)) \times (\lambda_{12}(t) \times dt + o(dt)). \tag{A.87} \]
This leads to
\[ \frac{1}{1 - P_{12}(s, t)} \times \frac{d}{dt} P_{12}(s, t) = \lambda_{12}(t), \tag{A.88} \]
which can be solved, noting the boundary condition of $P_{12}(s, s) = 0$, to give:
\[ P_{12}(s, t) = 1 - \exp\left( -\int_s^t \lambda_{12}(u) \, du \right). \tag{A.89} \]
If the disease is rare, or the time period $t - s$ is short, we can use a Taylor series expansion to obtain the following approximate relationship between $P_{12}(\cdot)$ and
$\lambda_{12}(\cdot)$.
\[ P_{12}(s, t) \approx \int_s^t \lambda_{12}(u) \, du. \tag{A.90} \]
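As an illustration of Equations A.89 and A.90, the following sketch integrates an onset intensity numerically and compares the exact transition probability with the rare-disease approximation. The intensity function `onset_intensity`, its parameter values and the age range are hypothetical choices made for this example; they are not taken from the thesis.

```python
import math


def transition_probability(intensity, s, t, steps=10_000):
    """Return (P12(s, t), integrated intensity) for the 2-state model.

    The integral of the intensity over [s, t] is approximated with the
    trapezoidal rule; P12 then follows from Equation A.89.
    """
    h = (t - s) / steps
    total = 0.5 * (intensity(s) + intensity(t))
    for n in range(1, steps):
        total += intensity(s + n * h)
    integral = total * h
    return 1.0 - math.exp(-integral), integral


def onset_intensity(age):
    """Hypothetical onset intensity, rising exponentially with age."""
    return 0.0001 * math.exp(0.08 * (age - 30.0))


# Probability of onset between ages 30 and 40, and its approximation (A.90).
exact, approx = transition_probability(onset_intensity, 30.0, 40.0)
print(f"P12 from A.89: {exact:.6f}; approximation A.90: {approx:.6f}")
```

Because the integrated intensity is small here, the two values differ by roughly half the square of the integral, which is the first neglected term of the Taylor expansion.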
Moving on to a general multiple-state model, we can derive similar relationships
between transition probabilities and transition intensities. We will start off from a
generalised version of Equation A.85.
\[ P_{ij}(s, t + dt) = \sum_k P_{ik}(s, t) \times P_{kj}(t, t + dt). \tag{A.91} \]
Now using Equation A.82, as before, we have,
\[ P_{ij}(s, t + dt) = \sum_k P_{ik}(s, t) \times \left( P_{kj}(t, t) + \lambda_{kj}(t) \times dt + o(dt) \right), \tag{A.92} \]
which yields,
\[ \frac{d}{dt} P_{ij}(s, t) = \sum_k P_{ik}(s, t) \times \lambda_{kj}(t). \tag{A.93} \]
We will discuss ways to solve these differential equations in Appendix B.1.
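To make Equation A.93 concrete, here is a minimal sketch that steps the differential equations forward with a simple Euler scheme and checks the result against the closed form of Equation A.89 for a 2-state model. The Euler scheme, the function names and the linear intensity are assumptions made for this illustration; this is not necessarily the method described in Appendix B.1.

```python
import math


def forward_equations(intensity_matrix, n_states, s, t, steps=20_000):
    """Integrate dP_ij(s, t)/dt = sum_k P_ik(s, t) * lambda_kj(t) by Euler steps.

    intensity_matrix(t) returns an n x n list of lists whose rows sum to
    zero (Equation A.84). The starting value is P(s, s) = identity.
    """
    h = (t - s) / steps
    p = [[1.0 if i == j else 0.0 for j in range(n_states)]
         for i in range(n_states)]
    for n in range(steps):
        lam = intensity_matrix(s + n * h)
        p = [[p[i][j] + h * sum(p[i][k] * lam[k][j] for k in range(n_states))
              for j in range(n_states)]
             for i in range(n_states)]
    return p


def two_state(t):
    """Hypothetical 2-state model: Healthy -> Diseased, with no recovery."""
    lam12 = 0.002 + 0.0005 * t
    return [[-lam12, lam12], [0.0, 0.0]]


p = forward_equations(two_state, 2, 0.0, 10.0)
# Closed form from Equation A.89: the integral of lam12 over [0, 10] is 0.045.
closed_form = 1.0 - math.exp(-0.045)
print(f"Euler P12 = {p[0][1]:.6f}, closed form = {closed_form:.6f}")
```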
A.3 Models of Disease Association
In the previous section, we have formulated the risk of a disease through transition
probabilities and transition intensities. Now we will use these concepts to develop
models for measuring the effects of risk factors on a particular disease.
A risk-factor can have a number of levels. Suppose, we are interested in investigating the effect of smoking on lung cancer patients. Smoking habits can be
classified according to the average number of cigarettes smoked per day. The higher
the number, the higher is the level of exposure to the risk-factor of smoking. Investigations can then be performed to figure out how the risk of lung cancer differs
from one level of risk-factor to the other.
In the simplest situation, we can have two levels of a risk-factor where an individual is either exposed to the factor or not. In the lung cancer example, people can be
classified as smokers and non-smokers. Analysts will then investigate how smoking
increases the risk of lung cancer. Here we will concentrate primarily on this binary
set-up.
Initially we will develop models to study effects of one risk-factor at a time. To do
this, care needs to be taken that the results are not distorted by the effects of other
risk factors. One way to ensure this is to stratify the study population according
to the levels of these other possible risk factors, and then analyse the effect of
the risk-factor in question within each such stratum. Going back to the example of
investigating the effect of smoking on lung cancer, suppose we believe that age is also
a risk-factor. Following the strategy outlined above, the study population needs to
be stratified according to age-groups. We then examine the effect of smoking within
each such age-group.
Extending the notation from the previous section, let $\lambda^{uk}_{ij}$ denote the transition
intensity from state $i$ to state $j$ for exposure status $u$ and stratum $k$. We will assume
that $u$ can take values 1 or 0 depending on whether the individual is exposed to the
risk-factor or not.
One simple formulation is to study the excess risk or, more accurately, the excess
rate of risk, which is
\[ b^k_{ij} = \lambda^{1k}_{ij} - \lambda^{0k}_{ij}. \tag{A.94} \]
In most studies, the risk-factor in question is not the sole contributor to the risk of
the disease. Suppose that the total risk of the disease is the combined effect of the
risk-factor and some other general factors. In Equation A.94, by subtracting the
transition intensity of the unexposed group from that of the exposed group, we are
trying to eliminate the effects of those other general factors.
If our stratification is precise, then the difference represents the true effect of the
risk-factor in question. It should also remain stable from stratum to stratum. This
leads to the following simplification of Equation A.94:
\[ b_{ij} = \lambda^{1k}_{ij} - \lambda^{0k}_{ij}, \text{ for all } k. \tag{A.95} \]
The model in Equation A.95 is also known as the additive model.
An alternative model to study disease association is to study the ratios of transition intensities instead of the differences. The formulation is as follows:
\[ r^k_{ij} = \frac{\lambda^{1k}_{ij}}{\lambda^{0k}_{ij}}. \tag{A.96} \]
The ratio in Equation A.96 is known as the relative risk. Again under the assumption
that the effect of the general factors cancels out and the ratios remain stable from
stratum to stratum, we get the multiplicative model:
\[ r_{ij} = \frac{\lambda^{1k}_{ij}}{\lambda^{0k}_{ij}}, \text{ for all } k. \tag{A.97} \]
There is an interesting relationship between the additive and the multiplicative
model. If we take logarithms of both sides of Equation A.97, we get:
\[ \log r_{ij} = \log \lambda^{1k}_{ij} - \log \lambda^{0k}_{ij}. \tag{A.98} \]
Clearly, Equations A.95 and A.98 have the same structure, except for the scale.
This is why multiplicative models are sometimes also called log-linear models.

             Diseased                                          Healthy                                           Total
Exposed      $p \times P^{1k}_{ij}$                            $p \times Q^{1k}_{ij}$                            $p$
Unexposed    $q \times P^{0k}_{ij}$                            $q \times Q^{0k}_{ij}$                            $q$
Total        $p \times P^{1k}_{ij} + q \times P^{0k}_{ij}$     $p \times Q^{1k}_{ij} + q \times Q^{0k}_{ij}$     $1$

Figure A.30: A 2 × 2 table for stratum k with corresponding probabilities.
Another important fact to note here is that although all the models above are
specified in terms of the transition intensities, an equivalent formulation can be
achieved through transition probabilities. The relationship defined in Equation A.89
can be used for this purpose.
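The difference between the additive and multiplicative models can be seen in a small numerical sketch. The stratum labels and intensities below are hypothetical values chosen so that the multiplicative model holds exactly: the relative risk is the same in every stratum, while the excess rate is not.

```python
import math

# Hypothetical transition intensities by age stratum: (unexposed, exposed).
# The values are illustrative only and are not taken from the thesis.
strata = {
    "40-49": (0.001, 0.003),
    "50-59": (0.002, 0.006),
    "60-69": (0.004, 0.012),
}


def excess_rate(lam_unexposed, lam_exposed):
    """Excess rate b^k = lambda^{1k} - lambda^{0k} (Equation A.94)."""
    return lam_exposed - lam_unexposed


def relative_risk(lam_unexposed, lam_exposed):
    """Relative risk r^k = lambda^{1k} / lambda^{0k} (Equation A.96)."""
    return lam_exposed / lam_unexposed


excess = {k: excess_rate(*lams) for k, lams in strata.items()}
ratios = {k: relative_risk(*lams) for k, lams in strata.items()}
# On the log scale the multiplicative model is additive (Equation A.98).
log_diffs = {k: math.log(l1) - math.log(l0) for k, (l0, l1) in strata.items()}

for k in strata:
    print(f"{k}: excess rate = {excess[k]:.4f}, "
          f"relative risk = {ratios[k]:.2f}, "
          f"log difference = {log_diffs[k]:.4f}")
```

With these inputs the additive model of Equation A.95 would be a poor description, whereas the multiplicative model of Equation A.97 fits with a common relative risk of 3.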
A.4 Relative Risk and Odds Ratio
In the previous section, we introduced the concept of relative risk. In epidemiological
research, it has become the most frequently used measure for associating exposure
with disease. Here we will develop the concept further by introducing odds ratios.
Using notation similar to the one used for transition intensities in the previous
section, let us denote $P^{uk}_{ij}$ as the transition probability from state $i$ to state $j$, for
an individual from stratum $k$ and exposure status $u$. If we assume that $p$ is the
proportion of individuals exposed to the risk-factor in question, we can draw up the
2 × 2 table in Figure A.30 for stratum $k$, where $q = 1 - p$ and $Q^{uk}_{ij} = 1 - P^{uk}_{ij}$.
If the study period is reasonably short or the disease under consideration is
relatively rare, we can use the approximation given in Equation A.90 to obtain
the following relationship:
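\[ \frac{\lambda^{1k}_{ij}}{\lambda^{0k}_{ij}} \approx \frac{P^{1k}_{ij}}{P^{0k}_{ij}}. \tag{A.99} \]
Using this, along with the definition of relative risk, we get:
\[ r^k_{ij} = \frac{\lambda^{1k}_{ij}}{\lambda^{0k}_{ij}} \approx \frac{P^{1k}_{ij}}{P^{0k}_{ij}}. \tag{A.100} \]

             Diseased         Healthy          Total
Exposed      $a^k_{ij}$       $b^k_{ij}$       $m^{1k}_{ij}$
Unexposed    $c^k_{ij}$       $d^k_{ij}$       $m^{0k}_{ij}$
Total        $n^{1k}_{ij}$    $n^{0k}_{ij}$    $N^k_{ij}$

Figure A.31: A 2 × 2 table with data for stratum k.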
Let us now define the odds ratio $\psi^k_{ij}$, for stratum $k$, as the ratio of the odds of
disease in the exposed and non-exposed subgroups, i.e.,
\[ \psi^k_{ij} = (P^{1k}_{ij} / Q^{1k}_{ij}) \div (P^{0k}_{ij} / Q^{0k}_{ij}) = \frac{P^{1k}_{ij} Q^{0k}_{ij}}{P^{0k}_{ij} Q^{1k}_{ij}}. \tag{A.101} \]
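As a quick numerical check of this definition, the sketch below computes the odds ratio from assumed disease probabilities and compares it with the plain ratio of those probabilities. The probabilities are hypothetical; the point is only that the two measures are close when both probabilities are small and can differ widely otherwise.

```python
def odds(p):
    """Odds of disease for a disease probability p."""
    return p / (1.0 - p)


def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio of Equation A.101 for one stratum."""
    return odds(p_exposed) / odds(p_unexposed)


# Hypothetical (unexposed, exposed) disease probabilities.
scenarios = {"rare disease": (0.001, 0.003), "common disease": (0.2, 0.6)}

for label, (p0, p1) in scenarios.items():
    print(f"{label}: ratio of probabilities = {p1 / p0:.3f}, "
          f"odds ratio = {odds_ratio(p1, p0):.3f}")
```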
Again based on the assumption that the study period is short or the disease is rare,
we get $Q^{0k}_{ij} \approx Q^{1k}_{ij} \approx 1$. This leads to the following approximate relationship between
$\psi^k_{ij}$ and $r^k_{ij}$.
\[ \psi^k_{ij} = \frac{P^{1k}_{ij} Q^{0k}_{ij}}{P^{0k}_{ij} Q^{1k}_{ij}} \approx \frac{P^{1k}_{ij}}{P^{0k}_{ij}} \approx r^k_{ij}. \tag{A.102} \]
A.5 Analysis of Grouped Data
Using the theory developed above, let us now proceed to draw inference based on
actual data. Henceforth we will state most of the results without any proof. For
details, please refer to Breslow and Day (1980) and Woodward (1999).
Suppose we are investigating the effect of a risk-factor on a particular disease. To
avoid distortion of results due to other risk-factors, the study population is stratified
into a number of strata. For each stratum of the study population, the data can be
summarised in a 2 × 2 table, as given in Figure A.31.
From the data, we can obtain estimates of the transition probabilities as follows:
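\[ \hat{P}^{1k}_{ij} = \frac{a^k_{ij}}{a^k_{ij} + b^k_{ij}}, \qquad \hat{P}^{0k}_{ij} = \frac{c^k_{ij}}{c^k_{ij} + d^k_{ij}}. \tag{A.103} \]
These can then be used to derive an estimate of relative risk as follows:
\[ \hat{r}^k_{ij} = \frac{\hat{P}^{1k}_{ij}}{\hat{P}^{0k}_{ij}} = \frac{a^k_{ij} / (a^k_{ij} + b^k_{ij})}{c^k_{ij} / (c^k_{ij} + d^k_{ij})} = \frac{a^k_{ij} / m^{1k}_{ij}}{c^k_{ij} / m^{0k}_{ij}}. \tag{A.104} \]
Using a log transformation and normality assumption, the standard error can be
estimated by
\[ \widehat{\text{se}}(\log_e \hat{r}^k_{ij}) = \sqrt{\frac{1}{a^k_{ij}} - \frac{1}{a^k_{ij} + b^k_{ij}} + \frac{1}{c^k_{ij}} - \frac{1}{c^k_{ij} + d^k_{ij}}}. \tag{A.105} \]
The estimate and the estimated standard error can then be used to obtain approximate confidence intervals for $r^k_{ij}$. They can also be used to obtain p-values for testing
hypotheses on $r^k_{ij}$.
Similarly, estimates can be obtained for the odds ratio:
\[ \hat{\psi}^k_{ij} = \frac{\hat{P}^{1k}_{ij} \hat{Q}^{0k}_{ij}}{\hat{P}^{0k}_{ij} \hat{Q}^{1k}_{ij}} = \frac{\dfrac{a^k_{ij}}{(a^k_{ij} + b^k_{ij})} \cdot \dfrac{d^k_{ij}}{(c^k_{ij} + d^k_{ij})}}{\dfrac{c^k_{ij}}{(c^k_{ij} + d^k_{ij})} \cdot \dfrac{b^k_{ij}}{(a^k_{ij} + b^k_{ij})}} = \frac{a^k_{ij} d^k_{ij}}{b^k_{ij} c^k_{ij}}, \tag{A.106} \]
\[ \widehat{\text{se}}(\log_e \hat{\psi}^k_{ij}) = \sqrt{\frac{1}{a^k_{ij}} + \frac{1}{b^k_{ij}} + \frac{1}{c^k_{ij}} + \frac{1}{d^k_{ij}}}. \tag{A.107} \]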
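The estimators in Equations A.104 to A.107 are simple to compute from a single 2 × 2 table. The sketch below does so for a hypothetical stratum; the cell counts are invented for illustration, and the multiplier z = 1.96 is an assumption corresponding to an approximate 95% confidence level under the normality assumption.

```python
import math


def relative_risk_estimate(a, b, c, d):
    """Relative risk and se of its log (Equations A.104 and A.105)."""
    rr = (a / (a + b)) / (c / (c + d))
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, se_log


def odds_ratio_estimate(a, b, c, d):
    """Odds ratio and se of its log (Equations A.106 and A.107)."""
    return (a * d) / (b * c), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def confidence_interval(estimate, se_log, z=1.96):
    """Approximate interval on the original scale; z = 1.96 assumes 95%."""
    return estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log)


# Hypothetical stratum laid out as in Figure A.31:
# a, b = exposed diseased / healthy; c, d = unexposed diseased / healthy.
a, b, c, d = 30, 970, 10, 990
rr, se_rr = relative_risk_estimate(a, b, c, d)
psi, se_psi = odds_ratio_estimate(a, b, c, d)
rr_ci = confidence_interval(rr, se_rr)
psi_ci = confidence_interval(psi, se_psi)
print(f"relative risk = {rr:.3f}, interval = ({rr_ci[0]:.3f}, {rr_ci[1]:.3f})")
print(f"odds ratio    = {psi:.3f}, interval = ({psi_ci[0]:.3f}, {psi_ci[1]:.3f})")
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric about the point estimate.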
Again, approximate confidence intervals and p-values can be obtained for $\psi^k_{ij}$ using
these equations.
Note that the marginal totals, $m^{uk}_{ij}$, are meaningless for case-control studies, as
individuals are selected according to their disease status and not their exposure status. As a result relative risks cannot be estimated for case-control studies. However,
no such problem exists for the estimation of odds ratios as the marginal totals cancel
out. Moreover, if the disease is rare or if the study period is short, the odds ratios
are a good approximation to the relative risks. So for case-control studies we will only
concentrate on the estimation of odds ratios.
Until now, we have calculated odds ratios separately for each stratum. However,
if we can assume that there is a common true odds ratio for each stratum and
the differences in the observed odds ratios are purely due to chance variation, the
estimate of the common odds ratio is given by the Mantel-Haenszel estimate:
\[ \hat{\psi}_{ij} = \left( \sum_k \frac{a^k_{ij} d^k_{ij}}{N^k_{ij}} \right) \Bigg/ \left( \sum_k \frac{b^k_{ij} c^k_{ij}}{N^k_{ij}} \right). \tag{A.108} \]
The estimate of the standard error of loge ψij , as proposed by Robins et al. (1986),
has the following form:
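\[ \widehat{\text{se}}(\log_e \hat{\psi}_{ij}) = \sqrt{ \frac{\sum_k U^k_{ij} W^k_{ij}}{2 \left( \sum_k W^k_{ij} \right)^2} + \frac{\sum_k U^k_{ij} X^k_{ij} + \sum_k V^k_{ij} W^k_{ij}}{2 \sum_k W^k_{ij} \sum_k X^k_{ij}} + \frac{\sum_k V^k_{ij} X^k_{ij}}{2 \left( \sum_k X^k_{ij} \right)^2} }, \tag{A.109} \]
where, for stratum $k$,
\[ U^k_{ij} = \frac{a^k_{ij} + d^k_{ij}}{N^k_{ij}}, \quad V^k_{ij} = \frac{b^k_{ij} + c^k_{ij}}{N^k_{ij}}, \quad W^k_{ij} = \frac{a^k_{ij} d^k_{ij}}{N^k_{ij}}, \quad X^k_{ij} = \frac{b^k_{ij} c^k_{ij}}{N^k_{ij}}. \tag{A.110} \]
A.6 Analysis of Matched Studies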
In a case-control study, individuals are selected according to their disease status. If
cases and controls are chosen independently, there is a chance that the profiles of
the individuals in the control group will be different from that of the cases. This
difference will then feed into the analysis to distort the results.
Matching is a method which tackles this problem by choosing controls based on
the profiles of the cases. Matching uses the concept of stratification, introduced
in Section A.3, to subdivide the study population into a number of strata. The
cases are first classified according to the strata they come from. Controls are then
chosen in such a way that they have a distribution similar to that of the cases across
strata. This ensures that analysis can be done within each stratum, eliminating the
distortions arising out of the differences between strata.
No exposures                                  Two exposures
             Cases   Controls   Total                      Cases   Controls   Total
Exposed        0        0         0           Exposed        1        1         2
Unexposed      1        1         2           Unexposed      0        0         0
Total          1        1         2           Total          1        1         2

One exposure                                  One exposure
             Cases   Controls   Total                      Cases   Controls   Total
Exposed        1        0         1           Exposed        0        1         1
Unexposed      0        1         1           Unexposed      1        0         1
Total          1        1         2           Total          1        1         2

Figure A.32: The types of table for each case-control pair in a 1:1 matching.
However, care needs to be taken to guard against over-matching. As an extreme example, suppose that the study population is stratified for the risk-factor
in question. This will then result in the same distribution of cases and controls for
each exposure level. No conclusions can then be drawn from the analysis. So it
is important to leave aside the risk-factor in question while stratifying the study
population.
The simplest form of matching is 1:1 matching, or pair matching. Here for
each case, a control is chosen from the same stratum irrespective of the exposure
status. A case-control pair can then be identified with one of the four possibilities
shown in Figure A.32.
If we assume that each case-control pair represents a stratum and that there exists
a common odds ratio for all strata, we can derive the Mantel-Haenszel estimate using
Equation A.108.
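Let,
$t_u$ be the number of sets with $u$ exposures, and
$m_u$ be the number of sets with $u$ exposures in which the case is exposed.
Using these notations in Equation A.108 we get,
\[ \begin{aligned} \hat{\psi}_{ij} &= \left( \sum_k \frac{a^k_{ij} d^k_{ij}}{N^k_{ij}} \right) \Bigg/ \left( \sum_k \frac{b^k_{ij} c^k_{ij}}{N^k_{ij}} \right) \\ &= \frac{t_0 \times \frac{0}{2} + m_1 \times \frac{1}{2} + (t_1 - m_1) \times \frac{0}{2} + t_2 \times \frac{0}{2}}{t_0 \times \frac{0}{2} + m_1 \times \frac{0}{2} + (t_1 - m_1) \times \frac{1}{2} + t_2 \times \frac{0}{2}} \\ &= \frac{m_1}{t_1 - m_1}. \end{aligned} \tag{A.111} \]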
In other words, among the pairs in which exactly one of the case and the control
is exposed, the estimate is the ratio of the number of exposed cases to the number
of exposed controls. Note that the sets where both case and control are exposed,
or where both are unexposed, contain no extra information. So these terms are
eliminated from Equation A.111.
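The pair-matched estimate can be illustrated with a minimal Python sketch. The function name and the data are hypothetical and are not taken from the thesis; each pair is recorded as (case exposed, control exposed), and only the discordant pairs enter the ratio.

```python
def matched_pair_odds_ratio(pairs):
    """Mantel-Haenszel odds ratio for 1:1 matching, i.e. m1 / (t1 - m1).

    pairs: list of (case_exposed, control_exposed) booleans, one per pair.
    """
    # m1: discordant pairs in which the case is the exposed member.
    m1 = sum(1 for case, control in pairs if case and not control)
    # t1 - m1: discordant pairs in which the control is the exposed member.
    control_only = sum(1 for case, control in pairs if control and not case)
    if control_only == 0:
        raise ValueError("no pairs with only the control exposed")
    return m1 / control_only


# Hypothetical data: 30 + 10 discordant pairs and 60 concordant pairs.
pairs = ([(True, False)] * 30 + [(False, True)] * 10
         + [(True, True)] * 5 + [(False, False)] * 55)
print(matched_pair_odds_ratio(pairs))  # 3.0
```

The 60 concordant pairs can be changed freely without altering the result, which is the point made in the text.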
The standard error of the estimate can be derived using Equation A.109. However, when $t_1$ is small, Breslow and Day (1980) have provided a formula for an exact
100(1 − α)% confidence interval $(\psi_L, \psi_U)$, where
$$
\psi_L = \frac{m_1}{(t_1 - m_1 + 1)\, F_{\alpha/2}\big(2(t_1 - m_1 + 1),\, 2 m_1\big)},
\qquad
\psi_U = \frac{(m_1 + 1)\, F_{\alpha/2}\big(2(m_1 + 1),\, 2(t_1 - m_1)\big)}{t_1 - m_1}. \tag{A.112}
$$
Here Fα/2 (ν1, ν2) denotes the upper 100(α/2) percentile of the F distribution
with ν1 and ν2 degrees of freedom.
As it is highly likely that for a rare disease, there are more controls available than
there are cases, it is possible to develop a design where each case can be matched
to a number of controls, say c. Increasing c increases the efficiency of the estimates
as the standard errors fall. However, for each increase in c, the marginal increase in
efficiency decreases. So, 1:c matching is rarely performed with c greater than 5.
For 1:c matching, using techniques similar to the one used for 1:1 matching, we
can derive the Mantel-Haenszel estimate of the odds ratio as follows:
$$
\hat{\psi}_{ij} = \frac{\sum_{u=1}^{c} (c + 1 - u)\, m_u}{\sum_{u=1}^{c} u\, (t_u - m_u)}. \tag{A.113}
$$
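As a hedged illustration of Equation A.113, the following Python sketch computes the 1:c estimate from the counts $t_u$ and $m_u$. The function name and the numbers are hypothetical; the counts are passed as dictionaries keyed by the number of exposures u.

```python
def mh_odds_ratio_1c(t, m, c):
    """Mantel-Haenszel odds ratio for 1:c matching (Equation A.113).

    t[u]: number of matched sets with u exposures, u = 1..c
    m[u]: number of those sets in which the case is exposed
    """
    numerator = sum((c + 1 - u) * m[u] for u in range(1, c + 1))
    denominator = sum(u * (t[u] - m[u]) for u in range(1, c + 1))
    return numerator / denominator


# Hypothetical 1:2 design: 20 sets with one exposure, 10 sets with two.
t = {1: 20, 2: 10}
m = {1: 12, 2: 8}
print(mh_odds_ratio_1c(t, m, c=2))  # (2*12 + 1*8) / (1*8 + 2*2) = 32/12

# With c = 1 the formula collapses to the pair-matched ratio m1 / (t1 - m1).
print(mh_odds_ratio_1c({1: 40}, {1: 30}, c=1))  # 3.0
```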
Miettinen (1970) gives an approximate formula for the standard error of loge ψ̂ij :
$$
\widehat{\mathrm{se}}(\log_e \hat{\psi}_{ij}) =
\left[ \hat{\psi}_{ij} \sum_{u=1}^{c} \frac{u\,(c + 1 - u)\, t_u}{(u \hat{\psi}_{ij} + c + 1 - u)^2} \right]^{-0.5}. \tag{A.114}
$$
Sometimes in a 1:c matching, it is possible that data from a few controls may
not be available. In this situation, a case can be matched against a number of
controls which is not fixed but can vary between 1 and c. This then becomes a
1:variable matching design. Using similar techniques, estimates of the odds ratio
can be obtained.
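Let v denote the number of controls that are matched with any one case, where
v = 1, 2, · · · , c, and let $t_u^{(v)}$ and $m_u^{(v)}$ be the counterparts of $t_u$ and $m_u$ among the sets with v controls. The Mantel-Haenszel estimate is given by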
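$$
\hat{\psi}_{ij} = \frac{\sum_{v=1}^{c} \sum_{u=1}^{v} T_u^{(v)}}{\sum_{v=1}^{c} \sum_{u=1}^{v} B_u^{(v)}}, \tag{A.115}
$$
where,
$$
T_u^{(v)} = \frac{(v + 1 - u)\, m_u^{(v)}}{v + 1},
\qquad
B_u^{(v)} = \frac{u\, \big(t_u^{(v)} - m_u^{(v)}\big)}{v + 1}. \tag{A.116}
$$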
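Also, Equation A.114 can be generalised to obtain the standard error of $\log_e \hat{\psi}_{ij}$ for the estimate in
Equation A.115:
$$
\widehat{\mathrm{se}}(\log_e \hat{\psi}_{ij}) =
\left[ \hat{\psi}_{ij} \sum_{v=1}^{c} \sum_{u=1}^{v} \frac{u\,(v + 1 - u)\, t_u^{(v)}}{(u \hat{\psi}_{ij} + v + 1 - u)^2} \right]^{-0.5}. \tag{A.117}
$$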
The most general of all matching strategies is the many:many matching design.
Here a variable number of controls are matched against a variable number of cases.
Although conceptually more difficult, similar techniques can be used to derive the
Mantel-Haenszel estimate of the odds ratio.
Suppose that $m_{uk}^{(rs)}$ is the number of matched sets with r cases and s controls in
which there are u exposures to the risk-factor, k of which are exposed cases. Then
$$
\hat{\psi}_{ij} = \frac{\sum T_{uk}^{(rs)}}{\sum B_{uk}^{(rs)}}, \tag{A.118}
$$
where,
$$
T_{uk}^{(rs)} = \frac{k\,(s - u + k)\, m_{uk}^{(rs)}}{r + s},
\qquad
B_{uk}^{(rs)} = \frac{(u - k)(r - k)\, m_{uk}^{(rs)}}{r + s}. \tag{A.119}
$$
The standard error of the estimate can be derived using Equation A.109.
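The many:many estimate of Equations A.118 and A.119 can be sketched in Python as follows. This is an illustration only: the function name and the data are hypothetical, and each matched set is listed individually as a tuple (r, s, u, k) rather than through the counts $m_{uk}^{(rs)}$.

```python
def mh_odds_ratio_many_many(matched_sets):
    """Mantel-Haenszel odds ratio for many:many matching.

    matched_sets: list of (r, s, u, k) with r cases, s controls,
    u exposed members in total and k exposed cases.
    """
    numerator = 0.0
    denominator = 0.0
    for r, s, u, k in matched_sets:
        # exposed cases * unexposed controls / set size
        numerator += k * (s - u + k) / (r + s)
        # exposed controls * unexposed cases / set size
        denominator += (u - k) * (r - k) / (r + s)
    return numerator / denominator


# Pair matching is the special case r = s = 1.
pair_sets = [(1, 1, 1, 1)] * 30 + [(1, 1, 1, 0)] * 10 + [(1, 1, 2, 1)] * 5
print(mh_odds_ratio_many_many(pair_sets))  # 3.0

# A hypothetical mixture of 2:3 and 1:2 sets.
mixed_sets = [(2, 3, 2, 1), (2, 3, 3, 2), (1, 2, 1, 0), (1, 2, 1, 1)]
print(mh_odds_ratio_many_many(mixed_sets))
```

Writing the two terms as products of cell counts makes the link with the stratified 2 × 2 tables explicit.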
A.7 Effects of Combined Exposures
Until now, we have looked at models to study the effect of one particular risk-factor
at a time. However, in reality, all human diseases are caused by the combined
interactions of a number of risk factors. In Section A.1, we have briefly touched
upon gene-environment interactions, which study the combined effects of genetic
and environmental factors precipitating a disease. In this section, we will develop
models to analyse the effects of combined exposures on a disease.
Suppose we are interested in two risk factors A and B. Extending the notation
developed in Section A.3, let $\lambda_{ij}^{uvk}$ be the transition intensity from state i to state
j with exposure level u of risk-factor A and exposure level v of risk-factor B, for
stratum k. As before, in the binary set-up, u and v can take values 1 or 0 depending
on the exposure status. In a similar way, we can extend the notation of relative risk
to $r_{ij}^{uvk}$ and odds ratio to $\psi_{ij}^{uvk}$.
Using this notation, in the binary set-up, for stratum k, we can define:
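$$
r_{ij}^{11k} = \frac{\lambda_{ij}^{11k}}{\lambda_{ij}^{00k}}, \quad
r_{ij}^{10k} = \frac{\lambda_{ij}^{10k}}{\lambda_{ij}^{00k}}, \quad
r_{ij}^{01k} = \frac{\lambda_{ij}^{01k}}{\lambda_{ij}^{00k}}, \quad \text{and} \quad
r_{ij}^{00k} = \frac{\lambda_{ij}^{00k}}{\lambda_{ij}^{00k}} = 1. \tag{A.120}
$$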
Recall the definition of excess rate of risk in Section A.3. Based on the same
concept, for two risk factors, we can define three types of excess rates of risk, as
follows:
$\lambda_{ij}^{11k} - \lambda_{ij}^{00k}$ : When exposed to both A and B.
$\lambda_{ij}^{10k} - \lambda_{ij}^{00k}$ : When exposed to A but unexposed to B.
$\lambda_{ij}^{01k} - \lambda_{ij}^{00k}$ : When exposed to B but unexposed to A.
(A.121)
Now let us assume that the effect of risk-factor A is independent of the effect of
the risk-factor B, or in other words, there is no interaction between the risk factors.
Independence or non-interaction between risk factors can be interpreted in a number
of ways. One possible formulation is to assume that the joint effect of risk factors
A and B is additive, i.e.,
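$$
(\lambda_{ij}^{11k} - \lambda_{ij}^{00k}) = (\lambda_{ij}^{10k} - \lambda_{ij}^{00k}) + (\lambda_{ij}^{01k} - \lambda_{ij}^{00k}), \tag{A.122}
$$
which simplifies to:
$$
\lambda_{ij}^{11k} = \lambda_{ij}^{10k} + \lambda_{ij}^{01k} - \lambda_{ij}^{00k}. \tag{A.123}
$$
Dividing both sides of Equation A.123 by $\lambda_{ij}^{00k}$ gives:
$$
r_{ij}^{11k} = r_{ij}^{10k} + r_{ij}^{01k} - 1. \tag{A.124}
$$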
An alternative characterisation for the joint association is the multiplicative or
the log-linear model. Here we assume that the effects are additive on the log scale
of the transition intensities. Under this formulation, Equation A.122 transforms into:
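$$
\big( \log \lambda_{ij}^{11k} - \log \lambda_{ij}^{00k} \big)
= \big( \log \lambda_{ij}^{10k} - \log \lambda_{ij}^{00k} \big)
+ \big( \log \lambda_{ij}^{01k} - \log \lambda_{ij}^{00k} \big), \tag{A.125}
$$
which simplifies to:
$$
\log \frac{\lambda_{ij}^{11k}}{\lambda_{ij}^{00k}}
= \log \frac{\lambda_{ij}^{10k}}{\lambda_{ij}^{00k}}
+ \log \frac{\lambda_{ij}^{01k}}{\lambda_{ij}^{00k}}, \tag{A.126}
$$
which when re-written in terms of relative risks, is:
$$
\log(r_{ij}^{11k}) = \log(r_{ij}^{10k}) + \log(r_{ij}^{01k}), \tag{A.127}
$$
or equivalently,
$$
r_{ij}^{11k} = r_{ij}^{10k} \times r_{ij}^{01k}. \tag{A.128}
$$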
A    B    Cases         Controls
+    +    $a_{ij}^k$    $b_{ij}^k$
+    −    $c_{ij}^k$    $d_{ij}^k$
−    +    $e_{ij}^k$    $f_{ij}^k$
−    −    $g_{ij}^k$    $h_{ij}^k$

Figure A.33: A 2 × 4 table with data for stratum k.
So in the above model, the independence or non-interaction of risk factors implies
a multiplicative combination for the joint effect.
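The difference between the two notions of non-interaction is easy to see numerically. The short Python sketch below (function names and relative risks are hypothetical) evaluates the joint relative risk implied by Equations A.124 and A.128 for the same pair of single-exposure relative risks.

```python
def joint_relative_risk_additive(r10, r01):
    """Joint relative risk when excess rates of risk add (Equation A.124)."""
    return r10 + r01 - 1.0


def joint_relative_risk_multiplicative(r10, r01):
    """Joint relative risk under the log-linear model (Equation A.128)."""
    return r10 * r01


# Hypothetical single-exposure relative risks for A alone and B alone.
r10, r01 = 2.0, 3.0
print(joint_relative_risk_additive(r10, r01))        # 4.0
print(joint_relative_risk_multiplicative(r10, r01))  # 6.0
```

The two models agree only when at least one of the single-exposure relative risks equals 1, so data that fit one model will in general show apparent interaction under the other.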
Earlier in Section A.5, we have seen that in case-control studies, although relative
risks cannot be estimated directly, odds ratios can be calculated and used as good
approximations of relative risks. So we will use odds ratios, instead of relative risks,
to analyse the effects of combined exposures in case-control studies.
To study the effects of two risk factors A and B, the data can be summarised in
a 2 × 4 table, as given in Figure A.33, where ‘+’ implies exposure and ‘−’ implies
non-exposure.
Table A.37 lists all possible odds ratios that can be calculated from the data
given in Figure A.33. The first odds ratio, $\psi_{ij}^{11k}$, measures the joint effect of the
risk factors A and B. The next two odds ratios, $\psi_{ij}^{10k}$ and $\psi_{ij}^{01k}$, measure the effect
of one risk-factor at a time. The remaining four odds ratios, $\psi_{ij}^{1*k}$, $\psi_{ij}^{0*k}$, $\psi_{ij}^{*1k}$ and
$\psi_{ij}^{*0k}$, stratify the population based on the exposure level of one risk-factor and then
measure the effect of the other risk-factor. The asterisk, in the notation of these
last four odds ratios, denotes the risk-factor for which the effect is being measured.
For example, $\psi_{ij}^{1*k}$ is the odds ratio measuring the effect of exposure to B, for those
who are already exposed to A.
The odds ratios, $\psi_{ij}^{11k}$, $\psi_{ij}^{10k}$ and $\psi_{ij}^{01k}$, can also be used to measure the deviation
of the data from both additive and multiplicative models. The first two measures,
given in Table A.38, provide direct checks on deviation from these models. The
case only odds ratio gives an alternative measure to check departure from the multiplicative model. The control only odds ratio estimates exposure dependencies in the
underlying population. A discussion on these last two measures is given in Khoury
and Flanders (1996).
For a general discussion on the use of 2 × 4 tables for measuring combined exposures, please refer to Botto and Khoury (2001).

Table A.37: List of odds ratios obtained from the 2 × 4 table in Figure A.33.
Notation              Formula                                              Main Information
$\psi_{ij}^{11k}$     $a_{ij}^k h_{ij}^k / (b_{ij}^k g_{ij}^k)$            Effect of joint exposures versus none.
$\psi_{ij}^{10k}$     $c_{ij}^k h_{ij}^k / (d_{ij}^k g_{ij}^k)$            Effect of exposure to A alone versus none.
$\psi_{ij}^{01k}$     $e_{ij}^k h_{ij}^k / (f_{ij}^k g_{ij}^k)$            Effect of exposure to B alone versus none.
$\psi_{ij}^{1*k}$     $a_{ij}^k d_{ij}^k / (b_{ij}^k c_{ij}^k)$            Effect of exposure to B, given exposed to A.
$\psi_{ij}^{0*k}$     $e_{ij}^k h_{ij}^k / (f_{ij}^k g_{ij}^k)$            Effect of exposure to B, given unexposed to A.
$\psi_{ij}^{*1k}$     $a_{ij}^k f_{ij}^k / (b_{ij}^k e_{ij}^k)$            Effect of exposure to A, given exposed to B.
$\psi_{ij}^{*0k}$     $c_{ij}^k h_{ij}^k / (d_{ij}^k g_{ij}^k)$            Effect of exposure to A, given unexposed to B.
Table A.38: Other measures based on the 2 × 4 table in Figure A.33.
Other measures                 Formula
Multiplicative interaction     $\psi_{ij}^{11k} / (\psi_{ij}^{10k} \psi_{ij}^{01k})$
Additive interaction           $\psi_{ij}^{11k} - (\psi_{ij}^{10k} + \psi_{ij}^{01k} - 1)$
Case only odds ratio           $a_{ij}^k g_{ij}^k / (c_{ij}^k e_{ij}^k)$
Control only odds ratio        $b_{ij}^k h_{ij}^k / (d_{ij}^k f_{ij}^k)$
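The measures in Tables A.37 and A.38 can be computed directly from the eight cell counts of Figure A.33. The Python sketch below is a hedged illustration with made-up counts; the function name and the dictionary keys are hypothetical labels chosen here, not notation from the thesis.

```python
def two_by_four_measures(a, b, c, d, e, f, g, h):
    """Odds ratios and interaction measures from a 2 x 4 table.

    Cases / controls: (a, b) for A+B+, (c, d) for A+B-,
    (e, f) for A-B+ and (g, h) for A-B-.
    """
    psi11 = (a * h) / (b * g)   # joint exposure versus none
    psi10 = (c * h) / (d * g)   # A alone versus none
    psi01 = (e * h) / (f * g)   # B alone versus none
    return {
        "psi11": psi11,
        "psi10": psi10,
        "psi01": psi01,
        "multiplicative_interaction": psi11 / (psi10 * psi01),
        "additive_interaction": psi11 - (psi10 + psi01 - 1.0),
        "case_only_or": (a * g) / (c * e),
        "control_only_or": (b * h) / (d * f),
    }


# Hypothetical counts for one stratum.
measures = two_by_four_measures(a=40, b=10, c=30, d=20, e=20, f=20, g=10, h=50)
for name, value in measures.items():
    print(name, round(value, 4))
```

With these counts the joint odds ratio is 20, against 7.5 × 5 = 37.5 under the multiplicative model and 7.5 + 5 − 1 = 11.5 under the additive model. Algebraically, the multiplicative interaction equals the case only odds ratio divided by the control only odds ratio, which is why the case only odds ratio can stand in for it when exposures are independent among controls.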
Appendix B
Numerical Methods
B.1 Differential Equations
B.1.1 Introduction
In this section, we will briefly describe how the transition intensities introduced
earlier can be used to formulate a set of differential equations which can be solved
for the transition and occupation probabilities. For details, please refer to Press et al.
(2002). Here we will consider a general n-state model. Using the same definitions
and notations defined in the previous chapter, we have the following set of equations,
commonly referred to as the Kolmogorov forward equations:
$$
\frac{d}{dt} P_{ij}(s, t) = \sum_k P_{ik}(s, t)\, \lambda_{kj}(t), \tag{B.129}
$$
or in matrix notation,
$$
\mathbf{P}'(s, t) = \mathbf{P}(s, t) \times \mathbf{\Lambda}(t), \tag{B.130}
$$
with the boundary condition P(s, s) = I.
With arbitrary Λ(t), defined by typical life history events, we can only solve these
equations numerically and not explicitly. We now discuss some numerical methods
of solving differential equations.
B.1.2 Euler Method
The formula for the Euler method is:
$$
\mathbf{P}(s, t + h) = \mathbf{P}(s, t) + h \times \mathbf{P}'(s, t) \tag{B.131}
$$
which advances a solution from t to t + h. However, this method advances the
solution through an interval of length h using derivative information only at the
beginning of that interval. Although the method converges, it is inefficient and
asymmetric and is not normally recommended.
The Euler method can easily be improved upon by making use of an intermediate
solution to achieve greater accuracy. A simple approach is to find a solution at the
mid-point of the interval and to then obtain the solution at the end of the interval
as illustrated below.
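Define:
$$
\mathbf{K}_1 = h \times \mathbf{P}'(s, t) = h \times \mathbf{P}(s, t) \times \mathbf{\Lambda}(t), \tag{B.132}
$$
$$
\mathbf{K}_2 = h \times \left\{ \mathbf{P}(s, t) + \tfrac{1}{2} \mathbf{K}_1 \right\} \times \mathbf{\Lambda}\!\left(t + \tfrac{1}{2} h\right), \tag{B.133}
$$
leading to:
$$
\mathbf{P}(s, t + h) = \mathbf{P}(s, t) + \mathbf{K}_2 + O(h^3) \tag{B.134}
$$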
This method is sometimes referred to as the midpoint method and can be further
refined to give the fourth-order Runge-Kutta method which is outlined in the next
section.
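A minimal Python sketch of the Euler and midpoint steps is given below. It is an illustration rather than the code used for the thesis (which states that its programs are in C++): the helper names are hypothetical, and the two-state model with a Gompertz-type hazard and its parameter values are assumptions chosen only because the exact survival probability is then available for comparison.

```python
import math


def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def add_scaled(A, B, scale):
    """Return A + scale * B, element by element."""
    return [[a + scale * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]


def euler_step(P, t, h, intensity):
    # P(s, t + h) = P(s, t) + h * P(s, t) * Lambda(t)
    return add_scaled(P, mat_mul(P, intensity(t)), h)


def midpoint_step(P, t, h, intensity):
    slope_start = mat_mul(P, intensity(t))                 # K1 / h
    P_mid = add_scaled(P, slope_start, 0.5 * h)            # P + K1 / 2
    slope_mid = mat_mul(P_mid, intensity(t + 0.5 * h))     # K2 / h
    return add_scaled(P, slope_mid, h)                     # P + K2


def solve(step, intensity, s, t, n_steps):
    """Advance P(s, s) = I to P(s, t) in n_steps equal steps."""
    n = len(intensity(s))
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    h = (t - s) / n_steps
    for i in range(n_steps):
        P = step(P, s + i * h, h, intensity)
    return P


def intensity(t):
    """Two-state (alive, dead) model with an illustrative Gompertz hazard."""
    mu = 0.0005 * math.exp(0.09 * t)
    return [[-mu, mu], [0.0, 0.0]]


exact = math.exp(-(0.0005 / 0.09) * (math.exp(0.09 * 40.0) - 1.0))
p_euler = solve(euler_step, intensity, 0.0, 40.0, 400)[0][0]
p_midpoint = solve(midpoint_step, intensity, 0.0, 40.0, 400)[0][0]
print(exact, p_euler, p_midpoint)
```

For the same step length the midpoint result should sit much closer to the exact value than the Euler result, at the cost of one extra evaluation of Λ per step.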
B.1.3 Runge-Kutta Method
By far the most often used method is the classical fourth-order Runge-Kutta formula.
The steps are outlined below.
Define:
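$$
\begin{aligned}
\mathbf{K}_1 &= h \times \mathbf{P}'(s, t) = h \times \mathbf{P}(s, t) \times \mathbf{\Lambda}(t) \\
\mathbf{K}_2 &= h \times \left\{ \mathbf{P}(s, t) + \tfrac{1}{2} \mathbf{K}_1 \right\} \times \mathbf{\Lambda}\!\left(t + \tfrac{1}{2} h\right) \\
\mathbf{K}_3 &= h \times \left\{ \mathbf{P}(s, t) + \tfrac{1}{2} \mathbf{K}_2 \right\} \times \mathbf{\Lambda}\!\left(t + \tfrac{1}{2} h\right) \\
\mathbf{K}_4 &= h \times \left\{ \mathbf{P}(s, t) + \mathbf{K}_3 \right\} \times \mathbf{\Lambda}(t + h),
\end{aligned} \tag{B.135}
$$
leading to:
$$
\mathbf{P}(s, t + h) = \mathbf{P}(s, t) + \tfrac{1}{6} \mathbf{K}_1 + \tfrac{1}{3} \mathbf{K}_2 + \tfrac{1}{3} \mathbf{K}_3 + \tfrac{1}{6} \mathbf{K}_4 + O(h^5) \tag{B.136}
$$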
For any multiple state model, the transition intensities will form the fundamental
building blocks. So in almost all circumstances we will be able to define a set of differential equations specifying the problem, and numerical solutions can be computed
using the Runge-Kutta method.
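The fourth-order scheme can be sketched in Python in the same way. The three-state illness-death model and its constant intensities below are hypothetical, chosen so that the healthy-state occupancy probability has a closed form to compare against; the helper names are likewise illustrative.

```python
import math


def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def add_scaled(A, B, scale):
    """Return A + scale * B, element by element."""
    return [[a + scale * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]


def rk4_step(P, t, h, intensity):
    """One classical Runge-Kutta step for P' = P * Lambda(t)."""
    d1 = mat_mul(P, intensity(t))                                  # K1 / h
    d2 = mat_mul(add_scaled(P, d1, 0.5 * h), intensity(t + 0.5 * h))  # K2 / h
    d3 = mat_mul(add_scaled(P, d2, 0.5 * h), intensity(t + 0.5 * h))  # K3 / h
    d4 = mat_mul(add_scaled(P, d3, h), intensity(t + h))           # K4 / h
    n = len(P)
    return [[P[i][j] + h / 6.0 * (d1[i][j] + 2.0 * d2[i][j]
                                  + 2.0 * d3[i][j] + d4[i][j])
             for j in range(n)] for i in range(n)]


def solve_rk4(intensity, s, t, n_steps):
    """Advance P(s, s) = I to P(s, t) in n_steps equal steps."""
    n = len(intensity(s))
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    h = (t - s) / n_steps
    for i in range(n_steps):
        P = rk4_step(P, s + i * h, h, intensity)
    return P


def illness_death(t):
    """States: 0 healthy, 1 ill, 2 dead; constant illustrative intensities."""
    return [[-0.03, 0.02, 0.01],
            [0.0, -0.05, 0.05],
            [0.0, 0.0, 0.0]]


P = solve_rk4(illness_death, 0.0, 10.0, 100)
print(P[0])                          # occupancy probabilities from healthy
print(P[0][0], math.exp(-0.3))       # numerical versus exact P_00(0, 10)
```

Because every row of Λ sums to zero, each row of the computed P should still sum to one up to rounding, which is a convenient sanity check on any implementation.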
B.2 Random Numbers
B.2.1 Introduction
Generation of random numbers from a particular distribution forms one of the most
important tasks in a simulation exercise. This topic is covered in many textbooks
on numerical analysis. So this section is not meant to be an exhaustive discussion
on this topic. Rather the aim will be to provide a documentation of the methods
that we are going to use. For a fuller treatment of the topic, please refer to Press
et al. (2002).
In the next section, we will give a brief introduction to the generation of random
numbers from a uniform distribution. Then we will move on to other distributions
of interest, from which random numbers can be generated using suitable transformations. In the final section, we will outline a method that can be used for any
general continuous distribution.
B.2.2 Uniform Deviates
Standard libraries of all major programming languages provide random number
generators. In our case, we will concentrate primarily on C++, as all our programs
will be written in that programming language. C++ has inherited from the ANSI
C library a pair of routines, srand() and rand() for initialising and then generating
random numbers. The random number generator is initialised with a seed and
a sequence of random numbers can be generated based on that seed. Note that
the same initialising value of seed will always return the same sequence of random
numbers.
The rand() function of C++ is typically implemented as a linear congruential generator, which generates
a sequence of integers $I_1, I_2, \ldots$ each between 0 and m − 1 by the recurrence relation
$I_{j+1} = a I_j + c \pmod{m}$. Here m is called the modulus, and a and c are positive
integers called the multiplier and the increment respectively. ANSI C only requires that
m be at least 32768, which is too small an integer for any large scale
simulation exercise.
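The recurrence can be sketched in a few lines of Python. The default constants below (a = 16807, c = 0, m = 2^31 − 1) are the widely used "minimal standard" multiplicative choice; they are an assumption made for illustration and are not necessarily the constants of any particular rand() implementation.

```python
def lcg(seed, a=16807, c=0, m=2**31 - 1):
    """Yield the sequence I_{j+1} = (a * I_j + c) mod m, starting from seed."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state


def uniform_deviates(seed, count, m=2**31 - 1):
    """First `count` values of the sequence, scaled into (0, 1)."""
    generator = lcg(seed, m=m)
    return [next(generator) / m for _ in range(count)]


generator = lcg(seed=1)
print(next(generator), next(generator))   # 16807 282475249

# The same seed always reproduces the same sequence.
print(uniform_deviates(12345, 3) == uniform_deviates(12345, 3))  # True
```

With c = 0 the seed must be non-zero, since a state of zero would stay at zero forever.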
In Press et al. (2002), there is detailed discussion on efficient routines for random
number generation, salient features of which are listed below.
ran0 This routine is a simple linear congruential generator, which is satisfactory
for the majority of applications. However, it is not recommended because of
the presence of subtle serial correlations.
ran1 The routine uses the same algorithm as ran0. However, it shuffles the output
to remove low-order serial correlation. The routine ran1 passes those statistical
tests that ran0 is known to fail. However, it is 30% slower than ran0. This
routine is recommended for general use.
ran2 The ran2 routine uses a long period random number generator with the shuffle. It is recommended for generating more than 100,000,000 random numbers
in a single calculation, as it has a longer period than ran1. However, this
routine is only half as fast as ran0.
For our simulation exercise, we would need to generate a lot of random numbers.
So we will use ran2 for our simulation exercise. In Press et al. (2002), there is also
a discussion on ran4 which generates “extremely” good random deviates. However,
it is only half as fast as ran2 and we will not describe it here. Unlike rand() of C++
library which generates integers, ran0, ran1 and ran2 produce uniform random
deviates between 0.0 and 1.0 (exclusive of the endpoint values). Similar to the
rand() function all these random number generators require a seed to initiate the
sequence. If a seed is not provided, the seed will automatically be set to the time
of the machine clock.
B.2.3 The Transformation Method
In the last section, we have seen how we can generate uniform deviates using the
ran2 routine. Now we will see how we can use randomly generated uniform deviates
to produce random deviates from a specific distribution.
Let us first look at a simple discrete distribution — the Bernoulli distribution.
Let Y ∼ Ber(p), i.e. P [Y = 1] = p and P [Y = 0] = 1 − p. The following steps can
be used to generate random deviates from this distribution.
(a) Generate a random deviate x from a U (0, 1) distribution.
(b) If x < p produce 1, else produce 0 as the required random deviate from Ber(p).
Random deviates from Bin(n, p) can be produced by adding n independent
Ber(p) random deviates.
Random deviates from the $Multinomial(n, p_1, p_2, \ldots, p_n)$ can be generated as
follows:
(a) Generate a random deviate x from a U(0, 1) distribution.
(b) If $x \le p_1$ produce 1, else if $\sum_{j=1}^{k-1} p_j < x \le \sum_{j=1}^{k} p_j$ produce k as the required random
variate.
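The three discrete recipes above can be sketched in Python as follows. Python's random.Random stands in for the uniform generator (the thesis uses ran2), and the function names are hypothetical. The last function returns the category index k selected by step (b).

```python
import random


def bernoulli(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def binomial(n, p, rng):
    """Sum of n independent Ber(p) deviates."""
    return sum(bernoulli(p, rng) for _ in range(n))


def categorical(probs, rng):
    """Return k (1-based) such that the cumulative sum first reaches x."""
    x = rng.random()
    cumulative = 0.0
    for k, p_k in enumerate(probs, start=1):
        cumulative += p_k
        if x <= cumulative:
            return k
    return len(probs)  # guards against rounding in the cumulative sum


rng = random.Random(2024)  # fixed seed so that runs are repeatable
print(bernoulli(0.3, rng), binomial(10, 0.3, rng))
draws = [categorical([0.2, 0.5, 0.3], rng) for _ in range(20000)]
print([draws.count(k) / len(draws) for k in (1, 2, 3)])
```

The observed frequencies in the last line should be close to 0.2, 0.5 and 0.3.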
For continuous distributions, let us first consider U (a, b), a simple generalisation
of U (0, 1). We know that if we define Y = a + (b − a)X where X ∼ U (0, 1), then
Y ∼ U (a, b). So if we generate x from U (0, 1) and define y = a + (b − a)x, then y
is a random deviate from U (a, b). So we see that a simple linear transformation of
the U (0, 1) produces random deviates from U (a, b) distribution.
The next distribution of interest is the exponential distribution, Exp(λ). Here we
use the following transformation: Y = − log(1 − X)/λ. If X ∼ U(0, 1), then Y ∼ Exp(λ).
Note that in both the examples above, we have made use of the fact that for
any continuous random variable Y with cumulative distribution function F(·),
F(Y) ∼ U(0, 1). In other words, X = F(Y) ∼ U(0, 1). So,
$Y = F^{-1}(X)$ has the cumulative distribution function F(·).
A general method for producing random deviates from any random variable with
cumulative distribution function F (·) requires the following steps:
(a) Generate a random deviate x from a U (0, 1) distribution.
(b) Find y, such that, F (y) = x.
(c) Produce y as a random deviate from F (·).
The result above can be used to generate random deviates from a general distribution if the cumulative distribution function for that distribution can be inverted.
However, most distributions that we will be interested in do not have a cumulative
distribution function that can be inverted easily. In such cases, an iterative
root-finding method can be used to solve F(y) = x numerically.
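Both routes can be sketched in Python: a closed-form inverse for the exponential distribution with rate λ, and bisection as one possible iterative substitute when no closed form is available. The function names, the bracketing interval and the tolerance are illustrative assumptions.

```python
import math
import random


def exponential_deviate(rate, rng):
    """Closed-form inverse: F(y) = 1 - exp(-rate * y) solved for y."""
    return -math.log(1.0 - rng.random()) / rate


def inverse_cdf_bisection(cdf, x, lo, hi, tol=1e-10):
    """Find y in [lo, hi] with cdf(y) = x, for a non-decreasing cdf."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < x:
            lo = mid   # the root lies to the right of mid
        else:
            hi = mid   # the root lies at or to the left of mid
    return 0.5 * (lo + hi)


rng = random.Random(11)
sample = [exponential_deviate(2.0, rng) for _ in range(20000)]
print(sum(sample) / len(sample))   # close to the mean 1 / rate = 0.5

# The same distribution inverted numerically: the median is log(2) / 2.
median = inverse_cdf_bisection(lambda y: 1.0 - math.exp(-2.0 * y), 0.5, 0.0, 50.0)
print(median, math.log(2.0) / 2.0)
```

Bisection needs an interval known to contain the solution, so in practice the upper limit has to be chosen large enough that F(hi) exceeds every x that can be drawn.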
However, the algorithm above will not be efficient if F (y) is not easy to compute.
If this is the case then it is advisable to tabulate the values of F (y) at appropriately
short-spaced y’s, and use linear interpolation at intermediate points. Note that the
shorter the spacing between tabulated y’s the greater the accuracy but the larger
the space requirement.
As an example, let us assume that we know the age-dependent transition intensity
λ(x) for a particular hazard. Suppose we are interested in generating the waiting
time T for an individual aged a to make the relevant transition. We know that the
distribution function of T is given by
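$$
F(t) = 1 - \frac{\exp\left( -\int_0^{a+t} \lambda(s)\, ds \right)}{\exp\left( -\int_0^{a} \lambda(s)\, ds \right)}
     = 1 - \exp\left( -\int_a^{a+t} \lambda(s)\, ds \right), \qquad t \ge 0. \tag{B.137}
$$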
Unless we have a very simple form for λ(·), we will have to perform numerical
integration each time we need F (t). As this is inefficient and time consuming, we
evaluate the values F (t1 ), F (t2 ), . . . where tj+1 = tj + δ, δ being a small positive
number, say 0.01, and then store these values for ready reference.
Now following the algorithm outlined above, generate a uniform random variate
x and find t such that F (t) = x. One can choose an efficient search algorithm which
can minimise the search for the correct t. As we are searching within a bounded
interval the Bisection method can be used. The t thus obtained gives us the required
waiting time.
There is an important point to note here. Many of the transition intensities
that we will be working with may not have the property F(∞) = 1. This means
that there is a probability 1 − F(∞) that an individual will not make a transition
at all. This can be taken into account by generating a Bernoulli random variate
Y ∼ Ber(F(∞)), where Y = 0 will indicate that the individual will never make
the transition and Y = 1 will indicate otherwise. So the above algorithm will only
be implemented if Y = 1, as searching for a value of t is only required if a transition
is made; in that case x should be drawn from U(0, F(∞)), so that a solution of F(t) = x exists.
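The tabulate-then-search scheme can be sketched in Python as below. Several things here are assumptions made for illustration: the hazard and its parameters are hypothetical, the cumulative hazard is accumulated with the trapezoidal rule, the value of F at the end of the table stands in for F(∞), and the standard-library bisect module performs the bisection search. A single uniform deviate x does both jobs: x falls below F(∞) with probability F(∞), and given that it does, it is uniform on (0, F(∞)).

```python
import bisect
import math
import random


def tabulate_cdf(hazard, age, horizon, delta):
    """Tabulate F(t) = 1 - exp(-integral of hazard over (age, age + t))."""
    times, cdf, cumulative_hazard = [0.0], [0.0], 0.0
    n_steps = int(round(horizon / delta))
    for j in range(1, n_steps + 1):
        t_prev, t_next = (j - 1) * delta, j * delta
        # Trapezoidal rule on each small interval of length delta.
        cumulative_hazard += 0.5 * delta * (hazard(age + t_prev)
                                            + hazard(age + t_next))
        times.append(t_next)
        cdf.append(1.0 - math.exp(-cumulative_hazard))
    return times, cdf


def draw_waiting_time(times, cdf, rng):
    """Return a waiting time, or None if no transition occurs in the table."""
    x = rng.random()
    if x >= cdf[-1]:
        return None                      # the transition is never made
    j = bisect.bisect_right(cdf, x)      # cdf[j - 1] <= x < cdf[j]
    fraction = (x - cdf[j - 1]) / (cdf[j] - cdf[j - 1])
    return times[j - 1] + fraction * (times[j] - times[j - 1])


def hazard(x):
    """Illustrative hazard that dies away with age, so that F(inf) < 1."""
    return 0.05 * math.exp(-0.1 * (x - 30.0))


times, cdf = tabulate_cdf(hazard, age=30.0, horizon=100.0, delta=0.01)
rng = random.Random(5)
draws = [draw_waiting_time(times, cdf, rng) for _ in range(20000)]
never = sum(1 for w in draws if w is None) / len(draws)
print(cdf[-1], never)   # F at the horizon and the observed share of non-movers
```

The table is built once and reused for every individual of the same age, so the cost of the numerical integration is not repeated for each draw.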
For a Normal distribution, the cumulative distribution function is not easily
invertible. So a different transformation known as the Box-Muller transformation
is usually used to produce standard normal deviates. Consider the transformation
between two random deviates x1 , x2 from U (0, 1) and two quantities y1 , y2 ,
$$
y_1 = \sqrt{-2 \ln x_1}\, \cos 2 \pi x_2 \tag{B.138}
$$
$$
y_2 = \sqrt{-2 \ln x_1}\, \sin 2 \pi x_2 \tag{B.139}
$$
It can be shown that y1 , y2 are independent random deviates from the N (0, 1)
distribution.
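Equations B.138 and B.139 translate directly into code. In the Python sketch below the function name is hypothetical, and the first uniform deviate is taken as 1 − random() so that it lies in (0, 1] and the logarithm is always defined.

```python
import math
import random


def box_muller(rng):
    """Return two independent N(0, 1) deviates from two uniform deviates."""
    x1 = 1.0 - rng.random()          # in (0, 1], avoids log(0)
    x2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(x1))
    angle = 2.0 * math.pi * x2
    return radius * math.cos(angle), radius * math.sin(angle)


rng = random.Random(42)
sample = []
for _ in range(10000):
    sample.extend(box_muller(rng))

mean = sum(sample) / len(sample)
variance = sum((y - mean) ** 2 for y in sample) / (len(sample) - 1)
print(mean, variance)   # close to 0 and 1
```

A deviate from N(μ, σ²) then follows from the usual linear transformation μ + σy.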
B.2.4 The Rejection Method
The rejection method is a powerful, general technique for generating random deviates from a distribution whose density function p(·) is known and computable.
The rejection method does not require that the cumulative distribution function be
readily computable, much less the inverse of that function, which was required for
the transformation method described in the previous section.
The rejection method involves the following steps:
(a) Find a majorising function M(·), for which M(x) > p(x) for all x.
(b) Calculate the area A under the majorising function M(·), i.e. $A = \int_{-\infty}^{\infty} M(s)\, ds$.
(c) Generate a random deviate $x_1$ from U(0, A).
(d) Find y, such that $x_1 = \int_{-\infty}^{y} M(s)\, ds$.
(e) Generate a random variate $x_2$ from U(0, 1).
(f) If $x_2 < p(y)/M(y)$, produce y as the required random deviate from the distribution with density function p(·); otherwise return to step (c).
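Steps (a) to (f) can be sketched in Python for a simple bounded case. The target density p(x) = 6x(1 − x) on [0, 1] (a Beta(2, 2) density) and the constant majorising function M(x) = 1.6 are assumptions chosen for illustration; with a constant M the integral in step (d) inverts to y = x1 / M.

```python
import random


def target_density(x):
    """Beta(2, 2) density on [0, 1]; its maximum is 1.5 at x = 0.5."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


MAJORISER = 1.6            # (a) constant M(x) > p(x) on [0, 1]
AREA = MAJORISER * 1.0     # (b) area under M over the support [0, 1]


def rejection_deviate(rng):
    while True:
        x1 = rng.random() * AREA              # (c) x1 from U(0, A)
        y = x1 / MAJORISER                    # (d) invert the integral of M
        x2 = rng.random()                     # (e) x2 from U(0, 1)
        if x2 < target_density(y) / MAJORISER:
            return y                          # (f) accept
        # otherwise reject and return to step (c)


rng = random.Random(99)
sample = [rejection_deviate(rng) for _ in range(20000)]
print(sum(sample) / len(sample))   # close to the Beta(2, 2) mean of 0.5
```

The expected number of trials per accepted deviate equals the area A (here 1.6), which is why a majorising function that hugs the density closely is more efficient.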
As we have already seen how to generate uniform random deviates, the main
issue here is to obtain an appropriate majorising function. There are many different
ways one can define a majorising function and suitability of the majorising function
will also depend on the shape of p(·). Also, apart from the fact that the M(·) needs
to have the property that M(x) > p(x) for all x, it should also be easy to invert
$\int_{-\infty}^{y} M(s)\, ds$. Here, we will propose a general method of producing the majorising
function for any density function p(·).
Our aim will be to find a step function M (x) which will provide an upper envelope
for p(x). For this, we need to start off from any x = x0 , such that 0 < p(x0 ) < ∞.
Given x0 , we move on to x1 , such that M (x) on this interval is a constant and
exceeds p(x) for all x in that interval and the area under M (x) does not exceed a
pre-specified positive number. Once p(x) becomes smaller than a set tolerance level,
it is assumed that the tail of the distribution is reached and is approximated by an
exponential function. These steps are followed on both sides of x0 to +∞ and −∞.
The full algorithm is outlined below.
First find $x = x_0$ such that $0 < p(x_0) < \infty$. This will be the starting point for
setting our majorising function M(·). We set $M(x_0) = p(x_0)$. Now our algorithm will
set the values of M(·), first for $x > x_0$ and then for $x < x_0$. At each step it will
be required to calculate the derivative p′(x), for which any simple numerical differentiation method
can be used.
So for $x > x_0$ do the following:
1. Find $x_{+n}$ from $x_{+(n-1)}$, n = 1, 2, . . ., so that the area $|x_{+n} - x_{+(n-1)}| \times p(x_{+(n-1)})$ equals a
pre-defined small value δ > 0.
2. Depending on the values of $p'(x_{+(n-1)})$ and $p'(x_{+n})$ do one of the following:
(a) If $p'(x_{+(n-1)}) < 0$ and $p'(x_{+n}) < 0$, then set $M(x_{+(n-1)}) = p(x_{+(n-1)})$.
(b) If $p'(x_{+(n-1)}) > 0$ and $p'(x_{+n}) > 0$, then set $M(x_{+(n-1)}) = p(x_{+n})$.
(c) If $p'(x_{+(n-1)}) > 0$ and $p'(x_{+n}) < 0$, then set $M(x_{+(n-1)})$ as the minimum
of the following two terms, which extend the tangents at the two end-points across the interval:
• $p(x_{+(n-1)}) + |x_{+n} - x_{+(n-1)}| \times p'(x_{+(n-1)})$
• $p(x_{+n}) - |x_{+n} - x_{+(n-1)}| \times p'(x_{+n})$.
(d) Else set $M(x_{+(n-1)})$ as the maximum of $p(x_{+(n-1)})$ and $p(x_{+n})$.
3. If $p(x_{+n}) < \epsilon$ for a pre-specified small ε > 0, then
(a) If $p'(x_{+(n-1)}) < 0$, set
$$
M(x) = p(x_{+n}) \times e^{-(x - x_{+n})}, \qquad x > x_{+n} \tag{B.140}
$$
(b) If $p'(x_{+(n-1)}) > 0$, set
$$
M(x) =
\begin{cases}
p(x_{+(n-1)}) \times e^{(x - x_{+(n-1)})} & x_{+(n-1)} < x \le x_{+n} \\
0 & x > x_{+n}
\end{cases} \tag{B.141}
$$
and stop. Else continue.
Similarly for $x < x_0$ do the following:
1. Find $x_{-n}$ from $x_{-(n-1)}$, n = 1, 2, . . ., so that the area $|x_{-n} - x_{-(n-1)}| \times p(x_{-(n-1)})$ equals a pre-defined small value δ > 0.
2. Depending on the values of $p'(x_{-(n-1)})$ and $p'(x_{-n})$ do one of the following:
(a) If $p'(x_{-(n-1)}) < 0$ and $p'(x_{-n}) < 0$, then set $M(x_{-n}) = p(x_{-n})$.
(b) If $p'(x_{-(n-1)}) > 0$ and $p'(x_{-n}) > 0$, then set $M(x_{-n}) = p(x_{-(n-1)})$.
(c) If $p'(x_{-(n-1)}) < 0$ and $p'(x_{-n}) > 0$, then set $M(x_{-n})$ as the minimum of
the following two terms, which extend the tangents at the two end-points across the interval:
• $p(x_{-(n-1)}) - |x_{-n} - x_{-(n-1)}| \times p'(x_{-(n-1)})$
• $p(x_{-n}) + |x_{-n} - x_{-(n-1)}| \times p'(x_{-n})$.
(d) Else set $M(x_{-n})$ as the maximum of $p(x_{-(n-1)})$ and $p(x_{-n})$.
3. If $p(x_{-n}) < \epsilon$ for a pre-specified small ε > 0, then
(a) If $p'(x_{-(n-1)}) > 0$, set
$$
M(x) = p(x_{-n}) \times e^{(x - x_{-n})}, \qquad x < x_{-n} \tag{B.142}
$$
(b) If $p'(x_{-(n-1)}) < 0$, set
$$
M(x) =
\begin{cases}
p(x_{-(n-1)}) \times e^{-(x - x_{-(n-1)})} & x_{-n} < x \le x_{-(n-1)} \\
0 & x < x_{-n}
\end{cases} \tag{B.143}
$$
and stop. Else continue.
For values of x where M (·) is not defined above, define M (x) = M (y) where y is
the largest value less than x for which M (·) is defined.
It is easy to verify that M (·) defined above is easily invertible and has the property
M (x) > p(x) where x does not belong to the tail region. For the tails, we will assume
that the scaled exponential function majorises p(x). The exponential approximation
of the tails is satisfactory for most distributions which will be of interest to us.
However, this approach is not adequate for dealing with distributions with fat tails.
Now that we have obtained M (·) for a general p(·), we can use the rejection
method to generate random deviates from the general distribution with density
function p(·).
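A much simplified Python sketch of this construction is given below. It is not the full algorithm: it assumes a density that is decreasing for all $x \ge x_0$ and zero below $x_0$ (such as Exp(1) with $x_0 = 0$), so only case 2(a) and the tail rule 3(a) on the right-hand side are needed, and no numerical derivative is required. The function names and parameter values are hypothetical.

```python
import math
import random


def build_step_knots(p, x0, delta, eps):
    """Knots x_0 < x_1 < ... with each step of area delta under M.

    On [x_{n-1}, x_n) the majoriser equals p(x_{n-1}); the last knot is the
    first one at which p drops below eps, where the exponential tail starts.
    """
    knots = [x0]
    while p(knots[-1]) >= eps:
        knots.append(knots[-1] + delta / p(knots[-1]))
    return knots


def rejection_deviate(p, knots, delta, rng):
    n_steps = len(knots) - 1
    tail_height = p(knots[-1])
    # Each step has area delta; the tail p(x_N) * exp(-(x - x_N)) has area p(x_N).
    area = n_steps * delta + tail_height
    while True:
        x1 = rng.random() * area
        if x1 < n_steps * delta:
            j = min(int(x1 / delta), n_steps - 1)      # which step x1 falls in
            height = p(knots[j])
            y = knots[j] + (x1 - j * delta) / height   # invert within the step
            m_y = height
        else:
            u = min((x1 - n_steps * delta) / tail_height, 1.0 - 1e-12)
            y = knots[-1] - math.log(1.0 - u)          # invert the tail
            m_y = tail_height * math.exp(-(y - knots[-1]))
        if rng.random() * m_y < p(y):                  # accept with p(y) / M(y)
            return y


def exp1_density(x):
    return math.exp(-x) if x >= 0.0 else 0.0


knots = build_step_knots(exp1_density, x0=0.0, delta=0.01, eps=1e-3)
rng = random.Random(7)
sample = [rejection_deviate(exp1_density, knots, 0.01, rng) for _ in range(20000)]
print(len(knots) - 1, sum(sample) / len(sample))   # number of steps, mean near 1
```

Because every step has the same area δ, choosing a step only requires integer division of the uniform deviate by δ, which is what makes this majorising function cheap to invert.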
As an example, let us consider Exp(1) distribution. If we start from x0 = 1 and
set δ = 0.10, Figure B.34 shows how the majorising function M (x) will provide an
upper envelope for the exponential density p(x). Now if we change δ to 0.01, the
new majorising function M (x) is given in Figure B.35. Clearly, with δ = 0.01, the
majorising function is a very close approximation of the Exp(1) density function.
The important point to note here is that the only difference in the simulation
of random deviates in cases, δ = 0.10 and δ = 0.01 lies in the efficiency of the
method. It is quicker to compute the majorising function if δ is large. However,
this might mean generating a significantly large number of uniform deviates to get
a single random deviate from the target distribution. On the other hand, small δ
means a significant amount of time spent on computing M (x), but more efficiency
is achieved in terms of actual generation of random deviates. But since M(x) need
only be computed once, the following rule of thumb can be used: to generate a
large number of random deviates, use a small δ.
[Figure B.34: The Exp(1) density and the majorising function with δ = 0.10 (density plotted against x).]
For the N (0, 1) distribution, a similar exercise leads to the majorising functions
given in Figures B.36 and B.37.
The density estimates based on the simulated 50,000 random deviates obtained
from the Exp(1) and N (0, 1) distributions using the Rejection method with δ = 0.01
are given in Figures B.38 and B.39 respectively.
[Figure B.35: The Exp(1) density and the majorising function with δ = 0.01 (density plotted against x).]
[Figure B.36: The N(0,1) density and the majorising function with δ = 0.10 (density plotted against x).]
[Figure B.37: The N(0,1) density and the majorising function with δ = 0.01 (density plotted against x).]
[Figure B.38: Density estimates based on the simulated 50,000 random deviates from Exp(1).]
[Figure B.39: Density estimates based on the simulated 50,000 random deviates from N(0, 1).]
Bibliography
Arrow, K. (1963). Uncertainty and the welfare economics of medical care. American Economic Review, 53(5), 941–973.
Bentham, J. (1789). An introduction to the principles of morals and legislation.
Oxford University Press (1996).
Binmore, K. (1991). Fun and games: A text on game theory. Houghton Mifflin.
Botto, L. and Khoury, M. (2001). Commentary: Facing the challenge of gene-environment interaction: The two-by-four table and beyond. American Journal
of Epidemiology, 153, 1016–1020.
Breslow, N. and Day, N. (1980). Statistical Methods in Cancer Research: Volume
1 – The analysis of case-control studies. International Agency for Research on
Cancer.
Brønnum-Hansen, H., Jørgensen, T., Davidsen, M., Madsen, M., Osler,
M., Gerdes, L. and Schroll, M. (2001). Survival and cause of death after myocardial infarction: The Danish MONICA study. Journal of Clinical Epidemiology,
54, 1244–1250.
Capewell, S., Livingston, B., MacIntyre, K., Chalmers, J., Boyd, J.,
Finlayson, A., Redpath, A., Pell, J., Evans, C. J. and McMurray, J.
(2000). Trends in case-fatality in 117 718 patients admitted with acute myocardial
infarction in Scotland. European Heart Journal, 21, 1833–1840.
Darwin, C. (1859). On the origin of species by means of natural selection, or the
preservation of favoured races in the struggle for life. John Murray, Albemarle
Street, London.
Darwin, E. (1794). Zoönomia: or the laws of organic life. J. Johnson.
Daykin, C., Akers, D., Macdonald, A., McGleenan, T., Paul, D. and
Turvey, P. (2003). Genetics and insurance — some social policy issues (with
discussions). British Actuarial Journal, 9, 787–874.
Doherty, N. and Posey, L. (1998). On the value of a checkup: Adverse selection,
moral hazard and the value of information. Journal of Risk and Insurance, 65(2),
189–211.
Doherty, N. and Thistle, P. (1996). Adverse selection with endogenous information in insurance markets. Journal of Public Economics, 63, 83–102.
Eisenhauer, J. and Ventura, L. (2003). Survey measures of risk aversion and
prudence. Applied Economics, 35, 1477–1484.
Goldberg, R., McCormick, D., Gurwitz, J., Yarzebsky, J., Lessard, D.
and Gore, J. (1998). Age-related trends in short- and long-term survival after
acute myocardial infarction: A 20-year population-based perspective (1975-1995).
American Journal of Cardiology, 82, 1311–1317.
Gutiérrez, C. and Macdonald, A. (2003). Adult polycystic kidney disease and
critical illness insurance. North American Actuarial Journal, 7(2), 93–115.
Gutiérrez, C. and Macdonald, A. (2004). Huntington’s disease, critical illness
insurance and life insurance. Scandinavian Actuarial Journal, pages 279–313.
Hoy, M. and Polborn, M. (2000). The value of genetic information in the life
insurance market. Journal of Public Economics, 78, 235–252.
Hoy, M. and Witt, J. (2005). Welfare effects of banning genetic information in
the life insurance market: The case of BRCA1/2 genes. Technical report, University
of Guelph Discussion Paper 2005-5.
Jones, F. (2005). The effects of taxes and benefits on household income, 2004/05.
Technical report, Office for National Statistics.
Khoury, M. and Flanders, W. (1996). Nontraditional epidemiologic approaches
in the analysis of gene-environment interaction: case-control studies with no controls. American Journal of Epidemiology, 144, 207–213.
Lewin, B. (2000). Genes VII. Oxford University Press.
Macdonald, A. (2003). Moratoria on the use of genetic tests and family history
for mortgage-related life insurance. British Actuarial Journal, 9(1), 217–237.
Macdonald, A. (2004). Genetics and insurance management. In A. Sandström
(ed.) The Swedish Society of Actuaries: One Hundred Years. Svenska Aktuarieföreningen, Stockholm.
Macdonald, A. and Pritchard, D. (2000). A mathematical model of
Alzheimer’s disease and the ApoE gene. ASTIN Bulletin, 30, 69–110.
Macdonald, A. and Pritchard, D. (2001). Genetics, Alzheimer’s disease and
long-term care insurance. North American Actuarial Journal, 5(2), 54–78.
Macdonald, A., Pritchard, D. and Tapadar, P. (2006). The impact of multifactorial genetic disorders on critical illness insurance: A simulation study based
on UK Biobank. To appear in ASTIN Bulletin.
Macdonald, A. and Tapadar, P. (2006). Multifactorial genetic disorders and
adverse selection: Epidemiology meets economics. Submitted.
Macdonald, A., Waters, H. and Wekwete, C. (2003a). The genetics of breast
and ovarian cancer I: A model of family history. Scandinavian Actuarial Journal,
pages 1–27.
Macdonald, A., Waters, H. and Wekwete, C. (2003b). The genetics of breast
and ovarian cancer II: A model of family history. Scandinavian Actuarial Journal,
pages 28–50.
McCormick, A., Fleming, D. and Charlton, J. (1995). Morbidity Statistics
from General Practice: Fourth National Study 1991-1992. Series MB5 No. 3.
Washington, D.C.: OPCS, Government Statistical Service.
Mendel, G. (1866). Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden
Vereines in Brünn (Proceedings of the Natural History Society of Brünn), 4, 3–47.
Meyer, D. and Meyer, J. (2005). Risk preferences in multi-period consumption models, the equity premium puzzle and habit formation utility. Journal of
Monetary Economics, 52, 1497–1515.
Miettinen, O. (1970). Estimation of relative risk from individually matched series.
Biometrics, 26, 75–86.
Mill, J. (1879). Utilitarianism. Longmans, Green and Co.
Mossin, J. (1968). Aspects of rational insurance purchasing. Journal of Political
Economy, 76(4), 553–568.
Nash, J. (1950). The bargaining problem. Econometrica, 18(2),
155–162.
Norberg, R. (1995). Differential equations for moments of present values in life
insurance. Insurance: Mathematics and Economics, 17, 171–180.
Norstad, J. (1999). An introduction to utility theory. Unpublished manuscript at
http://homepage.mac.com/j.norstad.
Pasternak, J. (1999). An introduction to human molecular genetics: mechanisms
of inherited diseases. Fitzgerald Science Press.
Pratt, J. (1964). Risk aversion in the small and in the large. Econometrica, 32,
122–136.
Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. (2002). Numerical Recipes in C++. Cambridge University Press.
Ridley, M. (1999). Genome: The autobiography of a species in 23 chapters. Fourth
Estate.
Robins, J., Greenland, S. and Breslow, N. (1986). A general estimator for the
variance of the Mantel-Haenszel odds ratio. American Journal of Epidemiology,
124, 719–723.
Rothschild, M. and Stiglitz, J. (1976). Equilibrium in competitive insurance
markets: An essay on the economics of imperfect information. The Quarterly
Journal of Economics, 90(4), 629–649.
Strachan, T. and Read, A. (1999). Human Molecular Genetics 2. BIOS Scientific
Publishers Ltd.
Sudbery, P. (1998). Human molecular genetics. Addison Wesley Longman Limited.
Treasury, H. (2005). Economy charts and tables. Technical report, Pre-Budget
Report.
Tunstall-Pedoe, H., Kuulasmaa, K., Mähönen, M., Tolonen, H.,
Ruokokoski, E. and Amouyel, P. (1999). Contribution of trends in survival
and coronary event rates to changes in coronary heart disease mortality: 10-year
results from 37 WHO MONICA Project populations. The Lancet, 353, 1547–1557.
Von Neumann, J. and Morgenstern, O. (1944). Theory of games and economic
behavior. Princeton University Press.
Watson, J. and Crick, F. (1953). Molecular structure of nucleic acids. Nature,
171, 737–738.
Woodward, M. (1999). Epidemiology: Study Design and Data Analysis. Chapman
& Hall.
Xie, D. (2000). Power risk aversion utility functions. Annals of Economics and
Finance, 1, 265–282.