Survival analysis methods in Insurance
Applications in car insurance contracts
Abder OULIDI (1)
Jean-Marie MARION (2)
Hervé GANACHAUD (3)
Abstract
In this work, we are interested in survival models and their application to actuarial problems. We study in particular the Cox model and the Aalen model, which can accommodate time-dependent covariates and, for the Aalen model, covariate effects that vary with time; this yields more precise results on the lifespan of car insurance contracts. We are interested in the relationship between the lifespan of contracts and some predictive covariates. For example, the “Bonus-Malus” is a covariate influencing contracts, and we observed that the higher the “Bonus-Malus”, the higher the risk of cancellation. We also studied time-dependent covariates (“internal covariates” are generated by the individuals under study; “external covariates” are more easily observed since they are independent of the study subject, which applies to the “Bonus-Malus” because the insured keeps it when changing insurance company).
We compare the lifespan of car insurance contracts estimated by different survival models (non-parametric, parametric and semi-parametric models, with fixed and time-dependent covariates).
Keywords: Cox model, Aalen model, survival distributions, censored data, Kaplan-Meier,
lifespan of car insurance contracts.
(1) Institut de Mathématiques Appliquées – 44 Rue Rabelais – BP 808 – 49100 ANGERS CEDEX 01, oulidi@ima.uco.fr
(2) Institut de Mathématiques Appliquées – 44 Rue Rabelais – BP 808 – 49100 ANGERS CEDEX 01, jean-marie-marion@uco.fr
(3) Groupe MMA – DCTGP – 10 Boulevard Alexandre Oyon – 72030 LE MANS Cedex 9, ganachaud@groupe-mma.fr
1. Introduction
Car insurance is a mature market with a weak growth rate. Furthermore, new actors (bank-insurers, large retailers…) are joining the traditional players in this much sought-after sector. Confronted with strong competition, aggravated by the quasi-stability of the insurable motor vehicle population and by the advent of the future European prudential framework, insurers are led to develop optimal models for monitoring and managing their portfolios, among other things to build the loyalty of the most profitable customers and possibly to cancel some insurance contracts.
In this work, we apply survival models to actuarial problems. We are interested in the relationship between the lifespan of contracts and some predictive covariates. We study in particular models with time-dependent covariates, which yield more precise results on the lifespan of car insurance contracts. We apply these methods of survival analysis to an actuarial dataset.
The origin of survival analysis can be traced to early work on mortality tables, which was followed and expanded by statistical research for engineering applications. There are also other fields of application: medicine, biology and economics. In this paper, we use survival analysis models in an actuarial context.
In section 2 we give a brief overview of traditional survival models (non-parametric, parametric and semi-parametric models) with censoring.
Section 3 is dedicated to the Cox model with time-dependent covariates.
In section 4 we discuss the Aalen model, which allows covariate effects to vary with time.
In section 5, we consider a dataset from a French insurance company which contains information about car insurance contracts. We investigate the time from the conclusion (signing) of a car insurance contract until its cancellation. Several attributes of the policyholder are given. We compare survival models on this dataset. The Aalen model with time-dependent covariates yields more precise results on the lifespan of car insurance contracts.
2. Survival models
In prospective studies, the important feature is not only the outcome event but also the time to event, the survival time. In our setting, the survival time $T$ is the time from the conclusion of a contract (starting point) until its cancellation (ending event).
The distribution of $T$, viewed as a positive random variable, is characterized by the probability density function $f$ or the cumulative distribution function $F$.
The survival function is defined by $S(t) = P(T > t)$, and the hazard function, denoted $h$, is defined by
$$h(t) = \frac{f(t)}{S(t)} = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, P\left(T \in\, ]t, t + \Delta t] \mid T > t\right).$$
The hazard function specifies the instantaneous rate of contract cancellation at time $t$, given that the contract survives up to $t$.
The cumulative hazard function is defined by $H(t) = \int_0^t h(x)\,dx$.
We find that $S(t) = \exp\{-H(t)\}$, since $S(0) = 1$.
The functions $f$, $F$, $S$ and $h$ give mathematically equivalent specifications of the distribution of $T$.
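The relation between $S$ and $H$ can be checked in a few lines from the definitions above. The following is only a worked restatement, assuming that $F$ is differentiable with $F' = f$:

```latex
\begin{align*}
h(t) &= \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\ln S(t), \\
H(t) &= \int_0^t h(x)\,dx = -\ln S(t) + \ln S(0) = -\ln S(t), \\
S(t) &= \exp\{-H(t)\}, \qquad f(t) = h(t)\exp\{-H(t)\}.
\end{align*}
```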
A special source of difficulty in the analysis of survival data is the possibility that some individuals may not be observed for the full time to event. This problem is called censoring, and the associated censoring time is denoted by $C$. Censoring arises, for example, when observation is terminated before the occurrence of the event.
Since the cancellation of a contract is not always observed, we work with the pair $(Y, D)$, where $Y = \min(T, C)$ and $D = 1_{\{T \le C\}}$ is the censoring indicator.
Non-parametric models
In this section we discuss the analysis of survival data without parametric assumptions about the distribution of $T$. Our topic is the non-parametric estimation of the survival function.
Assume we have a right-censored sample of survival data $(T_1, \dots, T_n)$: what we actually observe is $Y_i = \min(T_i, C_i)$ together with $D_i = 1_{\{T_i \le C_i\}}$, where the $C_i$ $(1 \le i \le n)$ denote the right-censoring times.
Let $Y_{(1)}, \dots, Y_{(n)}$ be the ordered sample, with $D'_1, \dots, D'_n$ the correspondingly ordered indicators. Let $R(t)$ be the number of individuals at risk just prior to $t$ (these are the cases whose duration is at least $t$) and $M(Y_{(i)})$ the number of cancellations at $Y_{(i)}$. The estimator
$$\hat{S}(t) = \prod_{\{i;\, Y_{(i)} < t\}} \left(1 - \frac{M(Y_{(i)})}{R(Y_{(i)})}\right)$$
is called the product-limit estimator or Kaplan-Meier’s estimator for S (t ) - the most common
method of estimating the survival function.
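A minimal sketch of the product-limit computation, in Python with NumPy. The function name `kaplan_meier` and the toy data are illustrative assumptions, not part of the study. The product is taken over distinct cancellation times, so that tied cancellations are counted once through $M$, and the value returned for each time is the estimate just after that time.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Product-limit estimate of S at each distinct cancellation time.

    durations: observed times Y_i = min(T_i, C_i)
    observed:  indicators D_i (1 = cancellation observed, 0 = right censored)
    """
    y = np.asarray(durations, dtype=float)
    d = np.asarray(observed, dtype=int)
    event_times = np.unique(y[d == 1])  # sorted distinct cancellation times
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(y >= t)                 # R(t): duration at least t
        events = np.sum((y == t) & (d == 1))     # M(t): cancellations at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Toy sample (hypothetical values): six contracts, two of them censored.
times, surv = kaplan_meier([2, 3, 3, 5, 8, 9], [1, 1, 0, 1, 0, 1])
```

Censored contracts never trigger a factor of their own, but they remain in the at-risk count until their censoring time, which is how partial information is used.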
Parametric models
In parametric models, the distribution of the survival time $T$ is assumed to belong to a specified class of distributions. These distributions are described by a finite number of parameters, which are then estimated from a data set.
Let $t_1, \dots, t_n$ be a sample drawn from a distribution with density $f(x, \theta)$ of known form, where $\theta$ is a scalar or vector parameter.
What we actually observe is $y_1, \dots, y_n$, a possibly right- or left-censored set of observations. Parametric regression models are techniques for assessing the relationship between survival times and a set of explanatory variables (or covariates). For example, the “Bonus-Malus” and the age of the vehicle influence the lifespan of a car insurance contract.
A characteristic of survival data is that the response cannot be negative. This suggests that a
transformation of the survival time such as a log transformation may be necessary or that
specialized methods may be more appropriate than those that assume a normal distribution for
the error term.
The parametric model is of the form
$$\ln y_i = \tilde{x}_i \beta + \eta\, \varepsilon_i, \qquad i = 1, \dots, n,$$
where $\tilde{x}_i$ is the (transposed) vector of covariates of individual $i$, $\beta$ is a vector of unknown regression parameters, $\eta$ is an unknown scale parameter, and $\varepsilon_i$ is an error term.
The baseline distribution of the error term can be specified as one of several possible
distributions, including, but not limited to, the exponential, log normal, log logistic, and
Weibull distributions.
In parametric models, we estimate parameters β ,η and those of the ε i distribution. Finally,
we obtain the distribution of the survival time T .
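To illustrate how right censoring enters parametric estimation, here is a minimal sketch for the simplest special case, an exponential lifetime with rate $\lambda$ and no covariates. This choice is an assumption made for brevity; it is not the log-logistic model retained later. An observed cancellation contributes the density $\lambda e^{-\lambda y_i}$ to the likelihood and a censored contract contributes the survival probability $e^{-\lambda y_i}$, which gives the closed-form estimate $\hat{\lambda} = \sum_i d_i / \sum_i y_i$. The function names and the toy data are hypothetical.

```python
import numpy as np

def exponential_log_likelihood(rate, durations, observed):
    """Right-censored log-likelihood: sum(d) * log(rate) - rate * sum(y)."""
    y = np.asarray(durations, dtype=float)
    d = np.asarray(observed, dtype=float)
    return d.sum() * np.log(rate) - rate * y.sum()

def exponential_mle(durations, observed):
    """Closed-form maximiser: observed cancellations / total time at risk."""
    y = np.asarray(durations, dtype=float)
    d = np.asarray(observed, dtype=float)
    return d.sum() / y.sum()

# Toy sample (hypothetical values): 4 cancellations over 30 units of exposure.
y = [2, 3, 3, 5, 8, 9]
d = [1, 1, 0, 1, 0, 1]
rate_hat = exponential_mle(y, d)
```

For richer error distributions and for models with covariates there is generally no closed form, and the same censored likelihood is maximised numerically.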
Semi-parametric models
Semi-parametric models assume a parametric form for the effects of the explanatory variables on survival times and leave the underlying baseline survivor function unspecified. Among these models, the best known is the Cox regression model.
In this model, the hazard function of the survival time is given by
$$h(t \mid x) = h_0(t) \exp(\tilde{x} \beta)$$
where $h_0$ is an unspecified baseline hazard function, $\tilde{x}$ is the (transposed) vector of covariate values and $\beta$ is a vector of unknown regression parameters.
The covariates thus act multiplicatively on an unknown baseline hazard rate.
The Cox regression is a proportional hazards model. That is, with time-fixed covariates, the ratio of the hazard functions of any two individuals 1 and 2 obeys the relationship
$$\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp(\tilde{x}_1 \beta - \tilde{x}_2 \beta),$$
thus the “hazard ratio” is constant with respect to time $t$.
Let $S_0$ be the baseline survival function associated with $h_0$; we have $S(t \mid x) = [S_0(t)]^{\exp(\tilde{x} \beta)}$.
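This identity follows directly from $S = \exp\{-H\}$. As a short worked step, write $H_0(t) = \int_0^t h_0(u)\,du$ for the baseline cumulative hazard (a notation introduced here only for the derivation):

```latex
\begin{align*}
H(t \mid x) &= \int_0^t h_0(u)\exp(\tilde{x}\beta)\,du = \exp(\tilde{x}\beta)\,H_0(t), \\
S(t \mid x) &= \exp\{-\exp(\tilde{x}\beta)\,H_0(t)\}
             = \left[\exp\{-H_0(t)\}\right]^{\exp(\tilde{x}\beta)}
             = [S_0(t)]^{\exp(\tilde{x}\beta)}.
\end{align*}
```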
In order to estimate $\beta$, we observe the ordered sample $(y_{(1)}, \dots, y_{(n)})$ and we use the “partial likelihood function”
$$L(y_{(1)}, \dots, y_{(n)}; \beta) = \prod_{i=1}^{n} \left[ \frac{\exp(\tilde{x}_i \beta)}{\sum_{k \in R(y_{(i)})} \exp(\tilde{x}_k \beta)} \right]^{\delta_i}$$
where $\delta_i$ is the censoring indicator of observation $i$ and the risk set $R(y_{(i)})$ includes those contracts at risk for the event at time $y_{(i)}$, when the event was observed to occur for contract $i$ (or at which time contract $i$ was right censored), that is, contracts whose cancellation has not yet occurred and which have not yet been right censored.
(Notice that censored observations contribute no factor of their own to the likelihood, because for these observations the exponent $\delta_i = 0$; they still enter the risk sets of earlier cancellations.)
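As a minimal sketch of the partial likelihood for time-fixed covariates, the following Python function evaluates its logarithm at a given $\beta$. It assumes there are no tied cancellation times (ties would need a dedicated correction, which the text does not discuss); the function name and the toy data are hypothetical.

```python
import numpy as np

def cox_log_partial_likelihood(beta, durations, observed, X):
    """Log partial likelihood of the Cox model with time-fixed covariates.

    X has one row per contract and one column per covariate.
    """
    y = np.asarray(durations, dtype=float)
    d = np.asarray(observed, dtype=int)
    X = np.asarray(X, dtype=float)
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    eta = X @ beta                      # linear predictor x_i * beta
    loglik = 0.0
    for i in np.flatnonzero(d == 1):    # censored contracts add no term
        at_risk = y >= y[i]             # risk set at the i-th cancellation
        loglik += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return loglik

# Toy data (hypothetical): three contracts, one binary covariate.
ll_all_observed = cox_log_partial_likelihood([0.0], [1, 2, 3], [1, 1, 1], [[0], [1], [0]])
ll_one_censored = cox_log_partial_likelihood([0.0], [1, 2, 3], [1, 0, 1], [[0], [1], [0]])
```

In practice the estimate of $\beta$ is obtained by maximising this function numerically, for example with a Newton-type method.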
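Finally, we use $S(t \mid x) = [S_0(t)]^{\exp(\tilde{x} \beta)}$ to estimate $S$. One may write
$$\hat{S}(t \mid x) = \prod_{\{j;\, y_{(j)} < t\}} \hat{\nu}_j^{\exp(\tilde{x} \hat{\beta})}$$
where the $\hat{\nu}_j$ are solutions of the likelihood equations
$$\sum_{l \in D_j} \frac{\exp(\tilde{x}_l \hat{\beta})}{1 - \nu_j^{\exp(\tilde{x}_l \hat{\beta})}} = \sum_{l \in R(y_{(j)})} \exp(\tilde{x}_l \hat{\beta}), \qquad j = 1, \dots, z,$$
$z$ is the number of distinct lifetimes and $D_j$ is the set of contracts whose lifetime is actually observed (not censored) at the $j$-th of these.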
Remarks:
To detect violations of the proportional hazards assumption, the following methods are recommended:
- Log cumulative hazard rate:
We stratify on categorical variables. For each variable, we plot on the same graph the cumulative hazard rate curves of its strata against $t$ on a log scale and compare them. If the curves are parallel over time, this supports the proportional hazards assumption. If they cross, this is a blatant violation.
- Scaled Schoenfeld residuals:
The Schoenfeld residual is the difference between the covariate value of the contract cancelled at an “event time” and the expected value of the covariate over the risk set at that time. As an alternative to proportional hazards, Therneau and Grambsch consider time-varying coefficients $\beta(t) = \beta + \theta g(t)$ for some smooth function $g$. Given $g(t)$, they develop a score test of $(H_0)$: $\theta = 0$ based on a generalized least squares estimate of $\theta$.
Under $(H_0)$, the scaled residuals plotted against time are expected to show a constant function. If not, the “hazard ratio” is not constant with respect to time $t$.
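The first diagnostic above can be sketched as follows. The code uses the Nelson-Aalen estimate of the cumulative hazard (the running sum, over cancellation times, of cancellations divided by contracts at risk). This is one common choice and an assumption here, since the text does not say which estimator was plotted; the function name, the strata labels and the toy data are hypothetical.

```python
import numpy as np

def log_cumulative_hazard(durations, observed):
    """Log of the Nelson-Aalen cumulative hazard at each distinct cancellation time."""
    y = np.asarray(durations, dtype=float)
    d = np.asarray(observed, dtype=int)
    times = np.unique(y[d == 1])
    # Hazard increment at t: cancellations at t / contracts still at risk at t.
    increments = [np.sum((y == t) & (d == 1)) / np.sum(y >= t) for t in times]
    return times, np.log(np.cumsum(increments))

# Two strata of one categorical covariate (illustrative values only).
strata = {
    "low": ([2, 3, 3, 5, 8, 9], [1, 1, 0, 1, 0, 1]),
    "high": ([1, 2, 2, 4, 6, 7], [1, 1, 1, 0, 1, 1]),
}
curves = {name: log_cumulative_hazard(y, d) for name, (y, d) in strata.items()}
```

Plotting each pair in `curves` on one graph gives the comparison described above: under proportional hazards the curves of the different strata differ by a roughly constant vertical shift, whereas crossing curves contradict the assumption.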
When the proportional hazards assumption is violated, we can study the Cox model with time-dependent covariates and Aalen’s non-parametric additive hazards model.
3. The Cox model with time-dependent covariates
The Cox model can be extended to allow time-dependent covariates. It is often the case that the values of some explanatory variables in a survival analysis change over time (for example, the “Bonus-Malus” variable). It seems natural to use the covariate information that varies over time in an appropriate statistical model.
In this case, the Cox model with time-dependent covariates specifies that
$$h(t \mid x) = h_0(t) \exp(\tilde{x}(t) \beta)$$
where $\tilde{x}(t)$ is a time-dependent vector of covariate values.
We can distinguish between “internal” and “external” time-dependent covariates:
- For an “internal variable”, the reason for a change depends on “internal” characteristics or behaviour specific to the individual. For internal covariates, the usual relationship between the hazard function and the survival function no longer holds.
- In contrast, a variable is called an “external variable” if its values change primarily because of “external” characteristics of the environment that may affect several individuals simultaneously. For example, an external covariate is one that is not directly related to the cancellation of the car insurance contract.
The “partial likelihood function” of $\beta$ for this model is given by
$$L(y_{(1)}, \dots, y_{(n)}; \beta) = \prod_{i=1}^{n} \left[ \frac{\exp(\tilde{x}_i(y_{(i)}) \beta)}{\sum_{k \in R(y_{(i)})} \exp(\tilde{x}_k(y_{(i)}) \beta)} \right]^{\delta_i}$$
This formula looks almost identical to the one derived for time-independent covariates. The only difference is that at time $y_{(i)}$ the values of the time-dependent covariates at time $y_{(i)}$ are used, both for the contract cancelled at that time and for the contracts in the risk set at that time.
The estimates are obtained by maximizing the partial likelihood function. The major difficulty with time-dependent covariates in the Cox model is computational, because the risk sets used to form $L$ are more complicated (we need to know the exact value of the covariates at each cancellation time for all contracts “at risk”).
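The data requirement just described is visible in a minimal sketch of this partial likelihood. Here the covariate paths are supplied through a hypothetical callable `covariates_at(t)`, which must return the covariate values of every contract at time `t`; this interface, the function name and the toy data are assumptions made for illustration, and tied cancellation times are not handled.

```python
import numpy as np

def cox_td_log_partial_likelihood(beta, durations, observed, covariates_at):
    """Log partial likelihood of the Cox model with time-dependent covariates.

    covariates_at(t) returns an (n, p) array: row k holds x_k(t).
    """
    y = np.asarray(durations, dtype=float)
    d = np.asarray(observed, dtype=int)
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    loglik = 0.0
    for i in np.flatnonzero(d == 1):
        # Covariates of all contracts, evaluated at this cancellation time.
        eta = np.asarray(covariates_at(y[i]), dtype=float) @ beta
        at_risk = y >= y[i]
        loglik += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return loglik

# Toy example (hypothetical): the covariate of contract 1 grows linearly in time.
z = np.array([[0.0], [1.0], [0.0]])

def covariates_at(t):
    return t * z

value = cox_td_log_partial_likelihood([1.0], [1, 2, 3], [1, 1, 1], covariates_at)
```

If the covariate history of some at-risk contracts is unknown at a cancellation time, `covariates_at` cannot be evaluated and the corresponding factor of $L$ cannot be formed.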
4. Aalen’s additive regression model
The proportional hazards model assumes multiplicative effects of covariates on the hazard
function while the additive risk model assumes that the hazard function associated with a set
of covariates is the sum of a baseline hazard function and a regression function of covariates.
The conditional hazard rate at time $t$, given $x(t)$, can be modelled by the following linear model:
$$h(t \mid x(t)) = \beta_0(t) + \tilde{\beta}(t)\, x(t)$$
where $\beta_0(t)$ is a baseline hazard function, $x(t)$ is a vector of covariate values and $\beta(t) = (\beta_k(t))_{1 \le k \le p}$ is a vector of unknown time-varying regression coefficients.
Direct estimation of $\beta(t)$ is difficult. It is much easier to estimate the cumulative regression functions $B_k(t) = \int_0^t \beta_k(s)\,ds$, where $0 \le k \le p$. The estimators of the $B_k(t)$ are based on a least-squares technique.
A crude estimate of $\beta_k(t)$ is given by the slope of the estimate of $B_k(t)$. Better estimates of $\beta_k(t)$ can be obtained by using a smoothing technique.
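A minimal sketch of the least-squares idea, assuming time-fixed covariates for brevity: at each cancellation time, the increment of $B(t)$ is the ordinary least-squares solution of the regression of the cancellation indicator on the covariates of the contracts still at risk, and the estimate of $B(t)$ is the running sum of these increments. The function name and the toy data are hypothetical, and the loop stops as soon as the at-risk design matrix no longer has full column rank.

```python
import numpy as np

def aalen_cumulative_coefficients(durations, observed, X):
    """Least-squares estimate of the cumulative regression functions B_k(t).

    X is an (n, p) array of covariates. Returns the cancellation times used
    and an array whose row m holds (B_0, B_1, ..., B_p) just after time m.
    """
    y = np.asarray(durations, dtype=float)
    d = np.asarray(observed, dtype=int)
    X = np.asarray(X, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])    # intercept carries the baseline
    B = np.zeros(design.shape[1])
    times, path = [], []
    for i in np.argsort(y, kind="stable"):
        if d[i] != 1:
            continue                              # censoring adds no increment
        at_risk = y >= y[i]
        Z = design * at_risk[:, None]             # zero rows for contracts not at risk
        if np.linalg.matrix_rank(Z) < design.shape[1]:
            break                                 # increments no longer identifiable
        dN = np.zeros(n)
        dN[i] = 1.0                               # the contract cancelled at this time
        B = B + np.linalg.lstsq(Z, dN, rcond=None)[0]
        times.append(y[i])
        path.append(B.copy())
    return np.array(times), np.array(path)

# Toy data (hypothetical): four contracts, one binary covariate, no censoring.
times, B = aalen_cumulative_coefficients([1, 3, 2, 4], [1, 1, 1, 1], [[0], [0], [1], [1]])
```

Plotting each column of `B` against `times` gives the cumulative regression plots; a rising curve indicates that the covariate increases the hazard over that period, a falling curve that it decreases it.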
5. Application
The dataset we consider comes from a French insurance company and contains information about the lifespan of car insurance contracts.
After eliminating some records (for example, contracts in which the variable “first date of circulation of the vehicle” is unusable), the dataset consists of 1461 car insurance contracts.
All types of cancellation are observed, whether by the customer or by the insurance company. Consequently, cancellations are not homogeneous and a small bias in the estimated lifespan of contracts is possible.
The contracts were created between June 13th, 1974 and December 28th, 1995. The cancellation of a contract could only be observed after January 1st, 1996. For our analysis, the variable of interest is the contract’s lifetime. If the contract was cancelled before February 7th, 2006, we take the duration between the conclusion of the contract and its cancellation; otherwise we take the duration between the conclusion of the contract and February 7th, 2006 (thus we have right censoring).
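The construction of the observed duration and of the censoring indicator can be sketched as follows. The function name, the choice of days as the time unit and the example dates are assumptions for illustration; only the cut-off date of February 7th, 2006 comes from the text.

```python
from datetime import date

STUDY_END = date(2006, 2, 7)  # end of the observation period stated in the text

def contract_duration(start, cancellation=None, study_end=STUDY_END):
    """Return (duration in days, indicator).

    indicator is 1 if the cancellation is observed before study_end,
    0 if the contract is right censored at study_end.
    """
    if cancellation is not None and cancellation < study_end:
        return (cancellation - start).days, 1
    return (study_end - start).days, 0

# Hypothetical contracts: one cancelled, one still in force at the cut-off.
cancelled = contract_duration(date(1990, 1, 1), date(1990, 1, 11))
in_force = contract_duration(date(1995, 12, 28))
```

The resulting pairs play the role of $(Y_i, D_i)$ in the estimators of section 2.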
For each contract, several covariates are known: the age of the vehicle, the Bonus-Malus variable, the type of insurance, etc.
In this work, we present methods to estimate the lifespan of car insurance contracts (non-parametric, parametric and semi-parametric methods, with time-dependent covariates).
Results
Our main goal was to estimate the survival function of car insurance contracts.
- With no prior information on the survival function, we estimated it with the non-parametric Kaplan-Meier method.
- To introduce exogenous variables into the model, we considered parametric methods (linear regression models for the log-lifetime). The log-logistic model provided the best fit for the lifespan of car insurance contracts.
- A semi-parametric model, the Cox model, was considered. This model yields easily interpreted estimates of covariate effects, but the proportional hazards assumption must hold for these estimates to be valid.
First, the proportional hazards assumption was investigated by examining graphical diagnostics. We stratified on the exogenous variables (Bonus-Malus, age of vehicle, type of insurance) and plotted, on one graph per variable, the cumulative hazard rate curves against $t$ on a log scale. The Bonus-Malus curves, the age of vehicle curves and the type of insurance curves cross, and we deduce that the proportional hazards assumption is violated.
Secondly, the scaled Schoenfeld residuals and the test for time-varying coefficients were used to assess the proportional hazards assumption. For each covariate, we test whether the Cox model coefficient is independent of time. The results of the test indicate that the proportional hazards assumption is not satisfied.
We conclude that the Cox regression model is not adequate to describe these data.
Some variables change over time (the Bonus-Malus variable, for example). However, the Cox model with time-dependent covariates cannot be investigated here: in the dataset, the exact value of the Bonus-Malus covariate at each cancellation time is not known for all contracts “at risk”.
- Finally, we discussed Aalen’s additive regression model. For the $j$-th contract, the conditional hazard rate at time $t$, given $x_j(t)$, can be modelled by
$$h(t \mid x_j(t)) = \beta_0(t) + \sum_{k=1}^{3} \beta_k(t)\, x_{jk}(t).$$
The column vector $B(t)$, with elements $B_k(t) = \int_0^t \beta_k(s)\,ds$, $1 \le k \le 3$ (cumulative regression functions), will be estimated.
With our dataset, all coefficients are statistically significant. We then examine the plots of the cumulative regression functions for this dataset.
For example, we note that the higher the Bonus-Malus variable, the higher the risk of cancellation, over the entire time period.
We also note that the cumulative regression coefficient plot for the “Type of insurance” variable suggests an increase in the hazard rate that remains in effect over the first 6 years.
Conclusion
This work on the lifespan of car insurance contracts illustrates well-known methods of survival analysis applied to a non-life insurance portfolio. The insurance company can use these estimates of the survival function with covariates to improve, for example, the profitability of car insurance contracts.
References
COX D.R. and OAKES D. (1984), Analysis of survival data, London: Chapman and Hall.
DROESBEKE J.J., FICHET B. and TASSI P., eds. (1989), Analyse statistique des durées de vie: Modélisation et données censurées, Economica.
KALBFLEISCH J.D. and PRENTICE R.L. (1980), The statistical analysis of failure time data, New York: Wiley and Sons, Inc.
KAPLAN E.L. and MEIER P. (1958), Non parametric estimation from incomplete observations, J. Amer. Statist. Assoc. 53, pp. 457-481.
LI S. (1996), Survival analysis, Marketing Research 7(4), pp. 17-23.
PLANCHET F. and THEROND P. (2006), Modèles de durée. Applications actuarielles, Economica.
THERNEAU T.M. and GRAMBSCH P.M. (2001), Modeling Survival Data, Springer.