CAS COTOR Solution for Round 5
May 15th, 2008

By
Bai, Wei, wbaab@allstate.com, 847-402-9584
Xiao, Guangjin, gxiaa@allstate.com, 847-402-8618

Contents
Part One: Explanation of the Solution
Part Two: Details of the Solution
A. Fitting the Severity Distribution
B. Fitting the Model for the Number of Claims
C. Predicted Results and Conclusions
D. Appendix

Part One
Explanation of the Solution to CAS COTOR Challenge Round 5

The solution consists of two parts. The first part fits a heavy-tailed distribution to the loss data for Risks A, B and C for years 2004, 2005 and 2006, taking censoring and truncation into consideration. The deductible is 1,000 dollars for Risk A, 500 dollars for Risk B and 0 dollars for Risk C. The policy limit for all risks is 50,000 dollars. A deductible truncates the data and a policy limit censors it. Maximum likelihood estimation (MLE) was used to estimate the parameters of the heavy-tailed loss distributions from the data, adjusting for the deductible and the policy limit. The inflation rate was estimated simultaneously, as an additional parameter in the same MLE. Before fitting the loss distributions, each paid amount was converted to a ground-up loss by adding the deductible back. Ten different heavy-tailed distributions were fitted to the data. The best candidate was chosen based on statistical criteria (AIC, BIC and AICC) and on probability plots. The expected loss for the coming year (2007) was then calculated from the chosen loss distribution and its estimated parameters, including the inflation rate, with the policy limit increased to 75,000 dollars and the deductibles removed for all risks. This expected loss is then used to calculate the expected number of claims in the second part of the solution.
The second part fits a general linear regression model to determine the relationship between the number of claims and the average loss for a given MSA. This model was used to calculate the expected number of claims in the coming year (2007) together with its confidence and prediction intervals. A Box-Cox transformation was used to determine how to transform the target variable, the number of claims. The fourth root of the number of claims was used as the target in the general linear model. MSA and the average loss for a given MSA are both statistically significant predictors of the number of claims. A significant interaction was also found between average loss and MSA; that is, the marginal change in the number of claims per unit change in average loss differs by MSA. In the data provided, there is no data for MSA 85 for year 2004. In total, only 299 year and MSA combinations were available for fitting the linear regression model. We grouped MSA 85 with another MSA in the analysis so that its effect is statistically estimable and overall confidence intervals across all MSAs can be calculated.

Part Two
Solutions to CAS COTOR Challenge Round 5

Part A
Fitting the Loss Distribution with Censoring and Truncation for All Three Risks

The data for Risks A, B and C are first converted from payment data to loss data by adding back the deductible of each risk. For example, 1,000 dollars is added to each Risk A data point to obtain the loss, so the policy limit corresponds to a ground-up loss of 51,000 dollars for Risk A. The censoring and truncation of each individual risk are taken into account when writing the likelihood function used to estimate the different loss distributions. The data for all three risks are used together to estimate the parameters. The following shows how the likelihood function is affected when censoring and truncation exist.
o Contribution to the log-likelihood function from the observations with truncation only:

$$\ln\frac{f(x_i)}{S(d)} = \ln f(x_i) - \ln S(d), \qquad d = \begin{cases} 1000 & \text{for Risk A} \\ 500 & \text{for Risk B} \\ 0 & \text{for Risk C} \end{cases}$$

o Contribution to the log-likelihood function from the observations with both censoring and truncation:

$$\ln\frac{S(u)}{S(d)} = \ln S(u) - \ln S(d), \qquad d = \begin{cases} 1000 & \text{for Risk A} \\ 500 & \text{for Risk B} \\ 0 & \text{for Risk C} \end{cases} \qquad u = \begin{cases} 51000 & \text{for Risk A} \\ 50500 & \text{for Risk B} \\ 50000 & \text{for Risk C} \end{cases}$$

The table below shows the parameter estimates obtained in SAS for the different loss distributions, together with the fitting criteria.

| Distribution | Shape Parameter (Alpha) | Scale Parameter (Theta) | Inflation (1+Inflation Rate) | Third Parameter (Gamma) | AIC | AICC | BIC |
|---|---|---|---|---|---|---|---|
| Pareto | 2.8058 | 25296 | 1.0628 | | 96,095 | 96,095 | 96,115 |
| Inverse Pareto | 1.7760 | 3274.84 | 1.0538 | | 96,410 | 96,410 | 96,429 |
| Inverse Gamma | 0.6947 | 2529.62 | 0.9191 | | 97,553 | 97,553 | 97,573 |
| Inverse Weibull | 0.7621 | 4937.04 | 0.8066 | | 97,029 | 97,029 | 97,049 |
| Loglogistic | 1.3334 | 7100.22 | 1.0594 | | 96,181 | 96,181 | 96,200 |
| Paralogistic | 1.3052 | 9116.03 | 1.0588 | | 95,515 | 95,515 | 95,534 |
| Inverse Paralogistic | 1.2297 | 5607.23 | 1.0568 | | 96,230 | 96,230 | 96,249 |
| Burr | 2.5155 | 21756 | 1.0628 | 1.0285 | 96,096 | 96,096 | 96,122 |
| Inverse Burr | 0.5191 | 13127 | 1.0666 | 1.6729 | 96,110 | 96,110 | 96,136 |
| Weibull | 0.1958 | 324.01 | 0.3233 | | 100,384 | 100,384 | 100,403 |

Among the ten heavy-tailed loss distributions considered, the two-parameter Pareto and the Paralogistic give the two lowest BIC and AICC values (Pareto: AICC = 96,095 and BIC = 96,115; Paralogistic: AICC = 95,515 and BIC = 95,534). Burr and Inverse Burr also have comparatively low BIC and AICC values, but these are higher than those of Pareto and Paralogistic. Pareto and Paralogistic are therefore chosen as our candidate loss distributions at this stage. For the inflation rate, Pareto gives an estimate of 6.3% and Paralogistic gives an estimate of 5.88%. The base year for the inflation rate is 2004. Q-Q plots were constructed in SAS for both the Pareto and the Paralogistic distributions, taking truncation and censoring into consideration.
The graphs below show how well each distribution fits the loss data. There are six Q-Q plots in total: because each risk has different truncation and censoring points, a separate Q-Q plot was constructed for each risk under each distribution. A Q-Q plot that follows a 45-degree straight line indicates a very good fit. In the following graphs, red lines stand for the distribution we are using to fit the data and black lines stand for the true distributions.

Graph 1, Graph 2, Graph 3: Pareto Q-Q plots, one per risk

The Q-Q plots above show that Pareto fits the data very well for all three risks. The fit for Risk C is slightly worse than for Risks A and B, but it is still very good.

Graph 4, Graph 5, Graph 6: Paralogistic Q-Q plots, one per risk

The Paralogistic Q-Q plots above show that Paralogistic fits the data well for all three risks, but the fit is worse than Pareto's, since for every risk the red line lies further from the black line than in the corresponding Pareto plot. Combining the Q-Q plot results with the BIC and AICC criteria, the Pareto distribution is the final selection for the loss distribution for all individual Risks A, B and C. The following table summarizes the fitted loss distributions for Risks A, B and C.

Fitted Loss Distribution:

| Year | Distribution | Parameters |
|---|---|---|
| 2004 | Pareto | α = 2.0858, θ = 25926 |
| 2005 | Pareto | α = 2.0858, θ = 25926*1.063 = 27559 |
| 2006 | Pareto | α = 2.0858, θ = 25926*1.063^2 = 29296 |
| 2007 | Pareto | α = 2.0858, θ = 25926*1.063^3 = 31141 |

The 2007 fitted loss distribution is used to calculate the expected loss for year 2007 and, from it, to predict the expected number of claims in 2007. This is shown in Part B.

Part B
Fitting a Linear Regression Model to Predict the Expected Number of Claims for the Coming Year (2007)

In order to predict the expected number of claims for the coming year, a relationship between the number of claims and the average ground-up loss is established. A linear regression model is built for this purpose.
The data for the number of claims is summarized by year and MSA and is then used to fit the linear regression model. After summarizing the three years of data there are 299 data points instead of 300, because the loss data for MSA 85 for year 2004 is not provided. MSA 85 is grouped with MSA 12 in the analysis because the two behave similarly. The question states that “the average ground-up size of loss within any given MSA is correlated with the underlying propensity for a claim in that MSA”. We therefore assume that the trend between the number of claims and the average loss is different for every MSA. The interaction between MSA and average loss is included in our linear model to allow the slope of this linear relationship to vary by MSA, as well as the intercept (which is handled by the main effect of MSA). The statistical significance test in our model supports including this interaction. Please refer to the Type III table results for the chi-square tests.

A Box-Cox transformation was used to determine how to transform the data towards a normal distribution so that the linear regression technique can be used appropriately. The result is attached below.

The Box-Cox transformation produces a lambda of 0.25, which indicates that taking the fourth root of the number of claims brings its distribution close to normality. Normality tests were performed on the fourth root of the number of claims. The results are shown below.

Normality Test Table: Tests for Normality

| Test | Statistic | Value | p Value | |
|---|---|---|---|---|
| Shapiro-Wilk | W | 0.994287 | Pr < W | 0.3249 |
| Kolmogorov-Smirnov | D | 0.050431 | Pr > D | 0.0638 |
| Cramer-von Mises | W-Sq | 0.071095 | Pr > W-Sq | >0.2500 |
| Anderson-Darling | A-Sq | 0.448284 | Pr > A-Sq | >0.2500 |

Q-Q Normality Plot: fclaim against Normal Percentiles

The results show that the fourth root of the number of claims is consistent with a normal distribution at the 5% significance level.
It passed all four formal normality tests (Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling). The Q-Q plot for the fourth root of the number of claims is close to a 45-degree straight line. All of these results indicate that building a linear model for the fourth root of the number of claims is appropriate. The following linear regression model is fitted.

Fourth root (# Claims) = Intercept + Beta1*Average-Ground-Up-Loss + Beta2*MSA + Beta3*Average-Ground-Up-Loss*MSA + Error

Where:
Error is the random term, assumed to be independent and identically distributed (IID) and normally distributed.
Average-Ground-Up-Loss*MSA is the interaction between the two.

The linear regression ANOVA table is attached below.

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 199 | 32.8399995 | 0.16502512 | 6.33 | <.0001 |
| Error | 99 | 2.57920497 | 0.02605258 | | |
| Corrected Total | 298 | 35.41920447 | | | |

| R-Square | Coeff Var | Root MSE | fclaim Mean |
|---|---|---|---|
| 0.927181 | 8.415204 | 0.161408 | 1.918053 |

For the fitted linear regression model above, the overall R-square is about 0.93, which is very high. The F statistic is 6.33 (p < .0001), which is significant at the 5% significance level. The Type III analysis of the linear regression model below shows that MSA, average ground-up loss and the interaction between the two are all significant at the 5% significance level.

Type III Analysis

| Source | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| MSA | 98 | 339.81 | <.0001 |
| Average Ground-Up Loss | 1 | 4.84 | 0.0278 |
| Average Ground-Up Loss*MSA | 98 | 313.87 | <.0001 |

Model Diagnostic Check and Outliers

o The residual plot for our linear model is analyzed to check whether the model fits adequately.

Residual plot: ehat against yhat

o This plot shows that the residuals scatter randomly and symmetrically around the zero line, which supports our assumptions and indicates that the model fits very well and that only noise is left in the residuals.
o There appear to be no outliers in the data, since all the residuals in the residual plot fall within the three-sigma zone.

Based on the results above, we conclude that there is a significant linear relationship between the fourth root of the number of claims and the other variables (MSA and average ground-up loss). Fitting a linear model between them is statistically sound. The fitted linear regression equation is used to calculate the expected number of claims for the coming year, 2007.

Part C
Calculating the Expected Number of Claims for the Coming Year (2007)

The expected loss for 2007, with the policy limit increased to 75,000 dollars, is calculated as follows.

Expected Loss for 2007 with 75,000 Limit is:

$$E(X \wedge 75{,}000) = \frac{31141.2}{2.0858 - 1}\left[1 - \left(\frac{31141.2}{75000 + 31141.2}\right)^{2.0858 - 1}\right] = \$15173.05$$

The expected loss for 2007 ($15173.05) is then plugged into our linear model equation to calculate the predicted number of claims and the 95% confidence and prediction interval for each MSA. Note that our linear regression model uses the fourth root of the number of claims. Therefore, the confidence and prediction interval for the number of claims for each MSA is the fourth power of the confidence and prediction interval for the fourth root of the number of claims for that MSA. The overall confidence and prediction interval for 2007 is then the sum of the confidence and prediction intervals over all MSAs. This calculation is done in SAS. Please refer to the confidence interval calculation spreadsheet for details.

The expected number of claims and its 95% confidence and prediction interval are summarized in the table below.

| Lower Limit | Predicted Number Of Claims | Upper Limit |
|---|---|---|
| 1300.26 | 1902.86 | 2824.83 |

In summary, the predicted number of claims in 2007 is 1903 and the 95% confidence and prediction interval is (1301, 2825).
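The two calculation steps in this part can be sketched in Python as follows. This is only an illustration, not the SAS code or the spreadsheet referred to above: the function names and the (lower, predicted, upper) tuple layout are assumptions, and the input values are arbitrary numbers chosen so the results can be checked by hand, not the fitted parameters.

```python
def pareto_limited_expected_value(limit, alpha, theta):
    """E[X ^ limit] for the two-parameter Pareto with survival function
    S(x) = (theta / (x + theta))^alpha; valid for alpha != 1."""
    return theta / (alpha - 1.0) * (1.0 - (theta / (limit + theta)) ** (alpha - 1.0))


def back_transform_total(fourth_root_intervals):
    """Raise each MSA's (lower, predicted, upper) triple, given on the
    fourth-root scale, to the fourth power and sum each component over MSAs."""
    lower = sum(lo ** 4 for lo, _, _ in fourth_root_intervals)
    predicted = sum(mid ** 4 for _, mid, _ in fourth_root_intervals)
    upper = sum(hi ** 4 for _, _, hi in fourth_root_intervals)
    return lower, predicted, upper


# Illustrative inputs only: 2000 / 2 * (1 - 0.5^2) = 750
print(pareto_limited_expected_value(limit=2000.0, alpha=3.0, theta=2000.0))  # 750.0

# Two hypothetical MSAs on the fourth-root scale
print(back_transform_total([(1.0, 2.0, 3.0), (2.0, 2.0, 2.0)]))  # (17.0, 32.0, 97.0)
```

As the limit grows, the limited expected value approaches the unlimited Pareto mean theta / (alpha - 1), which gives a quick sanity check on any implementation.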
Part D
Appendix

Summarized Data Used For Linear Regression Based On The Data Provided: G:\CotorTests\ Cotor_Linear_model_Data.xls

SAS Program For Fitting Different Distributions and Linear Regressions

Spreadsheets for Parameter Estimates for Linear Regression and for Calculating The Expected Number of Claims and 95% Confidence and Prediction Intervals: G:\CotorTests\ Cotor_Round5_Parameter&Confidence.xls