Survival analysis methods in Insurance: Applications in car insurance contracts

Abder OULIDI (1), Jean-Marie MARION (2), Hervé GANACHAUD (3)

(1) Institut de Mathématiques Appliquées, 44 Rue Rabelais, BP 808, 49100 ANGERS CEDEX 01, oulidi@ima.uco.fr
(2) Institut de Mathématiques Appliquées, 44 Rue Rabelais, BP 808, 49100 ANGERS CEDEX 01, jean-marie-marion@uco.fr
(3) Groupe MMA, DCTGP, 10 Boulevard Alexandre Oyon, 72030 LE MANS Cedex 9, ganachaud@groupe-mma.fr

Abstract

In this work, we are interested in survival models and their applications to actuarial problems. We particularly study the Cox model and the Aalen model, which allow covariate effects to vary with time (time-dependent covariates); this yields more precise results on the lifespan of car insurance contracts. We are interested in the relationship between the lifespan of contracts and some predictive covariates. For example, the "Bonus-Malus" is a covariate that influences contracts, and we noticed that the higher the "Bonus-Malus", the higher the risk of cancellation. We also study time-dependent covariates: "internal covariates", which are generated by the individuals under study, and "external covariates", which are more easily observed since they are independent of the subject under study (the "Bonus-Malus", for instance, is kept by the insured when changing insurance company). We compare the lifespans of car insurance contracts estimated by survival models (non-parametric, parametric and semi-parametric models, with fixed and time-dependent covariates).

Keywords: Cox model, Aalen model, survival distributions, censored data, Kaplan-Meier, lifespan of car insurance contracts.

1. Introduction

Car insurance is a mature market with a weak growth rate. Furthermore, in this much sought-after sector, new players (bank-insurers, large retailers…) are joining the traditional players.
Confronted with strong competition, aggravated by the quasi-stability of the insurable motor vehicle population and by the advent of the future European prudential framework, insurers are led to develop optimal models for the surveillance and management of their portfolios, among other things to build the loyalty of the most profitable customers and possibly to cancel some insurance contracts.

In this work, we use survival models and their applications to actuarial problems. We are interested in the relationship between the lifespan of contracts and some predictive covariates. We particularly study models with time-dependent covariates, which yield more precise results on the lifespan of car insurance contracts. We apply these methods of survival analysis to an actuarial dataset.

The origin of survival analysis can be traced to early work on mortality tables, which was followed and expanded by statistical research for engineering applications. There are also other fields of application: medicine, biology and economics. In this paper, we use models of survival analysis in an actuarial context.

In section 2 we give a brief overview of traditional survival models (non-parametric, parametric and semi-parametric models) with censoring. Section 3 is dedicated to the Cox model with time-dependent covariates. In section 4 we discuss the Aalen model, which allows covariate effects to vary with time. In section 5, we consider a dataset from a French insurance company which contains information about car insurance contracts. We investigate the time from the conclusion of a car insurance contract until its cancellation. Several attributes of the policyholder are given. We compare survival models on this dataset. The Aalen model with time-dependent covariates yields more precise results on the lifespan of car insurance contracts.

2. Survival models

In prospective studies, the important feature is not only the outcome event but also the time to event, the survival time. An example is the survival time $T$ from the conclusion (starting point) until the cancellation (ending event) of a contract. The distribution of $T$, viewed as a positive random variable, is characterized by the probability density function $f$ or the cumulative distribution function $F$.

The survival function is defined by $S(t) = P(T > t)$ and the hazard function, denoted $h$, by

$$h(t) = \frac{f(t)}{S(t)} = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, P\left(T \in \,]t, t + \Delta t] \mid T > t\right).$$

The hazard function specifies the instantaneous rate of contract cancellation at time $t$, given that the contract has survived up to $t$.

The cumulative hazard function is defined by $H(t) = \int_0^t h(x)\,dx$. We find that $S(t) = \exp\{-H(t)\}$, since $S(0) = 1$.

The functions $f$, $F$, $S$ and $h$ give mathematically equivalent specifications of the distribution of $T$.

A special source of difficulty in the analysis of survival data is the possibility that some individuals may not be observed for the full time to event. This problem is called censoring, and the associated variable is denoted by $C$. Censoring arises, for example, when observation is terminated before the occurrence of the event. Since the cancellation of a contract may not be observed, we work with the pair $(Y, D)$, where $Y = \min(T, C)$ and $D = 1_{\{T \le C\}}$ is the censoring indicator.

Non-parametric models

In this section we discuss the analysis of survival data without parametric assumptions about the distribution of $T$. Our topic is the non-parametric estimation of the survival function.

Assume we have a right-censored sample of survival data $(T_1, \dots, T_n)$; what we actually observe is $Y_i = \min(T_i, C_i)$ with $D_i = 1_{\{T_i \le C_i\}}$, where $C_i$ ($1 \le i \le n$) denote the right-censoring times. Let $Y_{(1)}, \dots, Y_{(n)}$ be the ordered sample, with $D'_1, \dots, D'_n$ the correspondingly ordered indicators.
Consider $R(t)$, the number of individuals at risk just prior to $t$ (these are the cases whose duration is at least $t$), and $M(Y_{(i)})$, the number of cancellations at $Y_{(i)}$. Then

$$\hat{S}(t) = \prod_{\{i;\; Y_{(i)} < t\}} \left(1 - \frac{M(Y_{(i)})}{R(Y_{(i)})}\right)$$

is called the product-limit estimator, or Kaplan-Meier estimator, of $S(t)$. It is the most common method of estimating the survival function.

Parametric models

In parametric models, the survival time $T$ belongs to a class of specified distributions. These distributions are described by a finite number of parameters, which are to be estimated from a dataset.

Let $t_1, \dots, t_n$ be a sample from a known distribution $f(x, \theta)$, where $\theta$ is a scalar or vector parameter. In practice, we observe $y_1, \dots, y_n$, a possibly right- or left-censored set of observations.

Parametric models, or regression procedures, are techniques for assessing the relationship between survival times and a set of explanatory variables (or covariates). For example, the "Bonus-Malus" and the age of the vehicle influence the lifespan of a car insurance contract.

A characteristic of survival data is that the response cannot be negative. This suggests that a transformation of the survival time, such as a log transformation, may be necessary, or that specialized methods may be more appropriate than those that assume a normal distribution for the error term. The parametric model is of the form

$$\ln y_i = \tilde{x}_i \beta + \eta\, \varepsilon_i, \qquad i = 1, \dots, n,$$

where $\tilde{x}_i$ is the (transposed) vector of covariates of individual $i$, $\beta$ is a vector of unknown regression parameters, $\eta$ is an unknown scale parameter, and $\varepsilon_i$ is an error term. The baseline distribution of the error term can be specified as one of several possible distributions, including, but not limited to, the exponential, log-normal, log-logistic and Weibull distributions. In parametric models, we estimate the parameters $\beta$, $\eta$ and those of the distribution of $\varepsilon_i$.
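To make the link between the log-linear regression model and the survival function concrete, here is a minimal sketch under one assumption: the error term $\varepsilon_i$ follows a standard logistic distribution, which gives the log-logistic model (one of the distributions listed above). The function name and the numerical values of the covariates, $\beta$ and $\eta$ are hypothetical and are not estimates from the paper.

```python
import math

def loglogistic_survival(t, x, beta, eta):
    """S(t | x) = P(T > t) when ln T = x.beta + eta * eps, eps standard logistic.

    Assumption: eps has CDF 1 / (1 + exp(-z)), so P(eps > z) = 1 / (1 + exp(z)).
    """
    linear_predictor = sum(xi * bi for xi, bi in zip(x, beta))
    z = (math.log(t) - linear_predictor) / eta
    return 1.0 / (1.0 + math.exp(z))

# Hypothetical covariates and parameters, for illustration only.
x = [1.0, 2.0]
beta = [0.5, 0.25]
eta = 0.5

for t in (1.0, math.e, 10.0):
    print(t, loglogistic_survival(t, x, beta, eta))
```

Under this assumption the survival function equals 0.5 at $t = \exp(\tilde{x}\beta)$ whatever the value of $\eta$, so $\exp(\tilde{x}\beta)$ is the median lifetime and $\eta$ only controls how quickly survival drops around it.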
Finally, we obtain the distribution of the survival time $T$.

Semi-parametric models

Semi-parametric models assume a parametric form for the effects of the explanatory variables on survival times and allow an unspecified form for the underlying survivor function. Among these models, the best known is the Cox regression model, in which the hazard function of the survival time is given by

$$h(t \mid x) = h_0(t) \exp(\tilde{x} \beta),$$

where $h_0$ is an unspecified baseline hazard function, $\tilde{x}$ is a (transposed) vector of covariate values and $\beta$ is a vector of unknown regression parameters. The effect of the covariates on survival is to act multiplicatively on some unknown baseline hazard rate.

The Cox regression is a proportional hazards model. That is, with time-fixed covariates, the ratio of the hazard functions of any two individuals, with covariate vectors $x_1$ and $x_2$, obeys the relationship

$$\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp(\tilde{x}_1 \beta - \tilde{x}_2 \beta),$$

thus the "hazard ratio" is constant with respect to time $t$.

Let $S_0$ be the baseline survival function associated with $h_0$; we have $S(t \mid x) = [S_0(t)]^{\exp(\tilde{x} \beta)}$.

In order to estimate $\beta$, we observe an ordered sample $(y_{(1)}, \dots, y_{(n)})$ and we use the "partial likelihood function"

$$L(y_{(1)}, \dots, y_{(n)}; \beta) = \prod_{i=1}^{n} \left[ \frac{\exp(\tilde{x}_i \beta)}{\sum_{k \in R(y_{(i)})} \exp(\tilde{x}_k \beta)} \right]^{\delta_i},$$

where $\delta_i$ is the censoring indicator of contract $i$ and the risk set $R(y_{(i)})$ includes the contracts at risk at time $y_{(i)}$, the time at which the event was observed for contract $i$ (or at which contract $i$ was right censored); that is, the contracts that have been neither cancelled nor right censored before that time. Notice that censored observations contribute no factor of their own to the likelihood, because for these observations the exponent $\delta_i = 0$.

Finally, we use $S(t \mid x) = [S_0(t)]^{\exp(\tilde{x} \beta)}$ to estimate $S$. One may write

$$\hat{S}(t \mid x) = \prod_{\{j;\; y_{(j)} < t\}} \hat{\nu}_j^{\exp(\tilde{x} \hat{\beta})},$$

where the $\hat{\nu}_j$ are the solutions of the likelihood equations

$$\sum_{l \in D_j} \frac{\exp(\tilde{x}_l \hat{\beta})}{1 - \nu_j^{\exp(\tilde{x}_l \hat{\beta})}} = \sum_{l \in R(y_{(j)})} \exp(\tilde{x}_l \hat{\beta}), \qquad j = 1, \dots, z,$$

where $z$ is the number of distinct lifetimes and $D_j$ is the set of contracts whose lifetime is actually observed at $y_{(j)}$.

Remarks: to detect a violation of the proportional hazards assumption, the following methods are recommended:

- Log cumulative hazard rate: we stratify on categorical variables. For each variable, we plot on the same graph the cumulative hazard rate curves against $t$ on a log scale and compare them. If the curves are parallel over time, this supports the proportional hazards assumption. If they cross, this is a blatant violation.
- Scaled Schoenfeld residuals: the Schoenfeld residual is the difference between the covariate at the "event time" and the expected value of the covariate at this time. As an alternative to proportional hazards, Therneau and Grambsch consider time-varying coefficients $\beta(t) = \beta + \theta g(t)$ for some smooth function $g$. Given $g(t)$, they develop a score test of $(H_0)$: $\theta = 0$ based on a generalized least squares estimate of $\theta$. Under $(H_0)$, the scaled residuals plotted against time should show a constant function. If not, the "hazard ratio" is not constant with respect to time $t$.

When the proportional hazards assumption is violated, we can study the Cox model with time-dependent covariates and Aalen's non-parametric additive hazards model.

3. The Cox model with time-dependent covariates

The Cox model can be extended to allow time-dependent covariates. It is often the case that the values of some explanatory variables in a survival analysis change over time (for example the "Bonus-Malus" variable). It seems natural to use covariate information that varies over time in an appropriate statistical model.
In this case, the Cox model with time-dependent covariates specifies that

$$h(t \mid x) = h_0(t) \exp(\tilde{x}(t) \beta),$$

where $\tilde{x}(t)$ is a time-dependent vector of covariate values.

We can distinguish between "internal" and "external" time-dependent covariates:

- For an "internal variable", the reason for a change depends on "internal" characteristics or behavior specific to the individual. With internal covariates, the usual relationship between the hazard function and the survival function no longer holds.
- In contrast, a variable is called an "external variable" if its values change primarily because of "external" characteristics of the environment that may affect several individuals simultaneously. For example, an external covariate is one that is not directly related to the cancellation of the car insurance contract.

The "partial likelihood function" of $\beta$ for this model is given by

$$L(y_{(1)}, \dots, y_{(n)}; \beta) = \prod_{i=1}^{n} \left[ \frac{\exp\left(\tilde{x}_i(y_{(i)}) \beta\right)}{\sum_{k \in R(y_{(i)})} \exp\left(\tilde{x}_k(y_{(i)}) \beta\right)} \right]^{\delta_i}.$$

This formula looks almost identical to the one derived for time-independent covariates. The only difference is that, at time $y_{(i)}$, the values of the time-dependent covariates at time $y_{(i)}$ are used, both for the contract cancelled at that time and for the contracts in the risk set at that time. The estimates are obtained by maximizing the partial likelihood function.

The major difficulty with time-dependent covariates in the Cox model is computational, because the risk sets used to form $L$ are more complicated (we need to know the exact value of the covariates at each cancellation time for all contracts "at risk").

4. Aalen's additive regression model

The proportional hazards model assumes multiplicative effects of the covariates on the hazard function, while the additive risk model assumes that the hazard function associated with a set of covariates is the sum of a baseline hazard function and a regression function of the covariates.
The conditional hazard rate at time $t$, given $x(t)$, can be modelled by the following linear model:

$$h(t \mid x(t)) = \beta_0(t) + \tilde{\beta}(t)\, x(t),$$

where $\beta_0(t)$ is a baseline hazard function, $x(t)$ is a vector of covariate values and $\beta(t) = (\beta_k(t))_{1 \le k \le p}$ is a vector of unknown regression functions.

Direct estimation of $\beta(t)$ is difficult. It is much easier to estimate the cumulative regression functions $B_k(t) = \int_0^t \beta_k(s)\,ds$, where $0 \le k \le p$. The estimators of the $B_k(t)$ are based on a least-squares technique. A crude estimate of $\beta_k(t)$ is given by the slope of the estimate of $B_k(t)$. Better estimates of $\beta_k(t)$ can be obtained by using smoothing techniques.

5. Application

The dataset we consider comes from a French insurance company and contains information about the lifespan of car insurance contracts. After eliminating some values (for example, in some contracts the variable "first date of circulation of the vehicle" could not be used), the dataset consists of 1461 car insurance contracts. All types of cancellation are observed, whether by the customer or by the insurance company. Consequently, cancellations are not homogeneous and a small deviation in the lifespan of contracts is possible.

The contracts were created between June 13th, 1974 and December 28th, 1995. The cancellation of a contract could only be observed after January 1st, 1996. For our analysis, the variable of interest is the contract's lifetime. If the contract was cancelled before February 7th, 2006, we take the duration between the conclusion of the contract and its cancellation; otherwise we take the duration between the conclusion of the contract and February 7th, 2006 (thus we have right censoring). For each contract several covariates are known: the age of the vehicle, the Bonus-Malus variable, the type of insurance….
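The censoring rule just described, combined with the product-limit estimator of section 2, can be sketched as follows. This is an illustrative sketch only: the function names and the four contracts are hypothetical (they are not taken from the dataset), durations are measured in days, and the product runs over cancellation times strictly smaller than $t$, as in the formula of section 2 (many references use "less than or equal" instead).

```python
from datetime import date

CENSORING_DATE = date(2006, 2, 7)  # end of observation, as stated in the text

def observe(start, cancelled):
    """Return (duration_in_days, d), with d = 1 if the cancellation is observed, else 0."""
    if cancelled is not None and cancelled < CENSORING_DATE:
        return (cancelled - start).days, 1
    # No cancellation before the cut-off: the lifetime is right censored.
    return (CENSORING_DATE - start).days, 0

def product_limit(observations, t):
    """Kaplan-Meier estimate at t: product of (1 - M(y)/R(y)) over cancellation times y < t."""
    event_times = sorted({y for y, d in observations if d == 1})
    s = 1.0
    for y in event_times:
        if y < t:
            m = sum(1 for u, d in observations if u == y and d == 1)  # M(y): cancellations at y
            r = sum(1 for u, _ in observations if u >= y)             # R(y): contracts at risk at y
            s *= 1.0 - m / r
    return s

# Hypothetical contracts: (conclusion date, cancellation date or None).
contracts = [
    (date(1990, 1, 1), date(2000, 1, 1)),
    (date(1995, 6, 1), None),                # still in force at the cut-off
    (date(1985, 3, 15), date(2007, 1, 1)),   # cancelled after the cut-off: censored
    (date(1992, 1, 1), date(1998, 1, 1)),
]
observations = [observe(start, cancelled) for start, cancelled in contracts]

for t in (2000, 3000, 4000):
    print(t, product_limit(observations, t))
```

Censored contracts never contribute a factor of their own, but they remain in the risk count $R(y)$ for every cancellation time up to their censoring time, which is why they cannot simply be dropped from the sample.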
In this work, we present methods to estimate the lifespan of car insurance contracts (parametric, non-parametric and semi-parametric methods, the latter with time-dependent covariates).

Results

Our main goal was to estimate the survival function of car insurance contracts.

- With no prior information on the survival function, we estimated this function with the non-parametric Kaplan-Meier method.
- To introduce exogenous variables into the model, we considered parametric methods (linear regression models). The log-logistic model provided the best model for the lifespan of car insurance contracts.
- A semi-parametric model, the Cox model, was considered. This model yields easily interpreted estimates of covariate effects, but the assumption of proportional hazards is necessary for these estimates to be valid. First, the proportional hazards assumption was investigated by examining graphical diagnostics. We stratified on the exogenous variables (Bonus-Malus, age of vehicle, type of insurance) and plotted, on one graph per variable, the cumulative hazard rate curves against $t$ on a log scale. The Bonus-Malus curves, the age of vehicle curves and the type of insurance curves all cross, and we deduce a violation of the proportional hazards assumption. Secondly, the scaled Schoenfeld residuals and the test for time-varying coefficients were used to assess the proportional hazards assumption. For each covariate, we tested whether the Cox model coefficient is time independent. The results of the test indicate that the proportional hazards assumption is not satisfied. We conclude that the Cox regression model is not an adequate model to describe these data. Some variables were changing over time (the Bonus-Malus variable, for example). Investigating the Cox model with time-dependent covariates is not possible: in the dataset, the exact value of the Bonus-Malus covariate at each cancellation time is unknown for the contracts "at risk".
- Finally, we discussed Aalen's additive regression model.
For the $j$th contract, the conditional hazard rate at time $t$, given $x_j(t)$, can be modelled by

$$h(t \mid x_j(t)) = \beta_0(t) + \sum_{k=1}^{3} \beta_k(t)\, x_{jk}(t).$$

The column vector $B(t)$, with elements $B_k(t) = \int_0^t \beta_k(s)\,ds$, $1 \le k \le 3$ (the cumulative regression functions), is then estimated. With our dataset, all coefficients are statistically significant. We then discuss the plots of the cumulative regression functions for this dataset. For example, we note that the more the Bonus-Malus variable increases, the more the risk of cancellation increases, over the entire time range. We also note that the cumulative regression coefficient plot for the "Type of insurance" variable suggests an increase in the hazard rate with increasing time that remains in effect over the first 6 years.

Conclusion

This work on the lifespan of car insurance contracts was an illustration of well-known methods of survival analysis applied to a non-life insurance portfolio. The insurance company can use these estimates of the survival function with covariates to improve, for example, the profitability of its car insurance contracts.

References

COX D.R. and OAKES D. (1984), Analysis of survival data, London: Chapman and Hall.
DROESBEKE J.J., FICHET B. and TASSI P., éditeurs (1989), Analyse statistique des durées de vie: Modélisation et données censurées, Economica.
KALBFLEISCH J.D. and PRENTICE R.L. (1980), The statistical analysis of failure time data, New York: Wiley and Sons, Inc.
KAPLAN E.L. and MEIER P. (1958), Non parametric estimation from incomplete observations, J. Amer. Statist. Assoc. 53, pp. 457-481.
LI S. (1996), Survival analysis, Marketing Research, 7(4), pp. 17-23.
PLANCHET F. and THEROND P. (2006), Modèles de durée. Applications actuarielles, Economica.
THERNEAU T.M. and GRAMBSCH P.M. (2001), Modeling Survival Data, Springer.