Wei Bai and Guangjin Xiao

CAS COTOR Solution For
Round 5
May 15th 2008
By
Bai, Wei
wbaab@allstate.com
847-402-9584
Xiao, Guangjin
gxiaa@allstate.com
847-402-8618
Contents
Part One: Explanation of the Solution
Part Two: Details of the Solution
A. Fitting the Severity Distribution
B. Fitting the Model for the Number of Claims
C. Predicted Results and Conclusions
D. Appendix
Part One
Explanation of the Solution to CAS COTOR Challenge Round 5
The solution consists of two parts. The first part fits a heavy-tailed
distribution to the loss data for Risks A, B and C for years 2004, 2005 and 2006,
taking censoring and truncation into account. The deductible is 1,000 dollars for
Risk A, 500 dollars for Risk B and 0 dollars for Risk C. The policy limit is 50,000
dollars for all risks. A deductible truncates the data and a policy limit censors
the data. Maximum likelihood estimation (MLE) was used to estimate the parameters
of the heavy-tailed loss distributions, adjusting for the deductible and the policy
limit. The inflation rate was estimated simultaneously as an additional parameter
in the MLE. Before fitting the loss distributions, each paid amount was converted
to a ground-up loss by adding the deductible back. Ten heavy-tailed distributions
were fitted to the data. The best candidate was chosen using the statistical
criteria AIC, BIC and AICC together with probability plots. The expected loss for
the coming year (2007) was then calculated from the chosen loss distribution and
the estimated parameters, including the inflation rate, with the policy limit
increased to 75,000 dollars and the deductibles removed for all risks. This
expected loss is used in the second part to calculate the expected number of claims.
The second part fits a general linear regression model for the relationship
between the number of claims and the average loss within a given MSA. This model
was used to calculate the expected number of claims for the coming year (2007)
together with confidence and prediction intervals. A Box-Cox transformation was
used to determine how to transform the target variable, the number of claims. The
fourth root of the number of claims was used as the target in the general linear
model. MSA and the average loss within the MSA are both statistically significant
predictors of the number of claims. The interaction between average loss and MSA
is also significant, which means that the marginal change in the number of claims
per unit change in average loss differs by MSA. The data provided contain no
observations for MSA 85 in 2004, so only 299 year-and-MSA combinations were
available for fitting the linear regression model. We grouped MSA 85 with another
MSA in the analysis so that its effect is statistically estimable and overall
confidence intervals across all MSAs can be calculated.
Part Two
Solutions to CAS Challenge Cotor Round 5
Part A
Fitting Loss Distribution with Censoring and Truncation for All Three Risks


The payment data for Risks A, B and C are first converted to ground-up loss data
by adding back each risk's deductible. For example, 1,000 dollars is added to
every Risk A data point, so the policy limit for Risk A corresponds to a
ground-up loss of 51,000 dollars.
Censoring and truncation for each individual risk are taken into account when
writing the likelihood function for each loss distribution. The data for all
three risks are used together to estimate the parameters. The following shows
how the likelihood function is affected when censoring and truncation exist.
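o Contribution to the log-likelihood function from the observations with
truncation only (f is the density and S is the survival function):
$$\ln\frac{f(x_i)}{S(d)} = \ln f(x_i) - \ln S(d), \qquad d = \begin{cases} 1000 & \text{for Risk A} \\ 500 & \text{for Risk B} \\ 0 & \text{for Risk C} \end{cases}$$
o Contribution to the log-likelihood function from the observations with
both censoring and truncation:
$$\ln\frac{S(u)}{S(d)} = \ln S(u) - \ln S(d), \qquad d = \begin{cases} 1000 & \text{for Risk A} \\ 500 & \text{for Risk B} \\ 0 & \text{for Risk C} \end{cases} \qquad u = \begin{cases} 51000 & \text{for Risk A} \\ 50500 & \text{for Risk B} \\ 50000 & \text{for Risk C} \end{cases}$$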
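The two contributions above can be sketched in Python for the two-parameter Pareto. This is a minimal illustration rather than the SAS program used for the solution: it assumes the Pareto survival function S(x) = (theta / (x + theta))^alpha, ignores the inflation parameter that the solution estimates jointly, and the function names are hypothetical.

```python
import math

# Deductible d and ground-up censoring point u for each risk, as given in the text.
DEDUCTIBLE = {"A": 1000.0, "B": 500.0, "C": 0.0}
CENSOR_POINT = {"A": 51000.0, "B": 50500.0, "C": 50000.0}


def log_survival(x, alpha, theta):
    """ln S(x) for the two-parameter Pareto, S(x) = (theta / (x + theta)) ** alpha."""
    return alpha * (math.log(theta) - math.log(x + theta))


def log_density(x, alpha, theta):
    """ln f(x) for the two-parameter Pareto."""
    return math.log(alpha) + alpha * math.log(theta) - (alpha + 1.0) * math.log(x + theta)


def log_likelihood(claims, alpha, theta):
    """Sum the contributions of (risk, ground_up_loss) pairs.

    A loss at or above the censoring point u contributes ln S(u) - ln S(d);
    any other loss contributes ln f(x) - ln S(d).
    """
    total = 0.0
    for risk, loss in claims:
        d = DEDUCTIBLE[risk]
        u = CENSOR_POINT[risk]
        if loss >= u:
            total += log_survival(u, alpha, theta)
        else:
            total += log_density(loss, alpha, theta)
        total -= log_survival(d, alpha, theta)
    return total


# Hypothetical ground-up losses, one of them censored at the Risk A limit.
sample = [("A", 3200.0), ("A", 51000.0), ("B", 1800.0), ("C", 250.0)]
print(log_likelihood(sample, alpha=2.0, theta=20000.0))
```

Passing the negative of this function to a numerical optimizer would give maximum likelihood estimates of alpha and theta for whatever data are supplied.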



The tables below show the SAS parameter estimates and the fitting criteria for the
different loss distributions.

Parameter Estimates
Distribution           Shape (Alpha)   Scale (Theta)   Inflation (1+Inflation Rate)   Third Parameter (Gamma)
Pareto                 2.8058          25296           1.0628
Inverse Pareto         1.7760          3274.84         1.0538
Inverse Gamma          0.6947          2529.62         0.9191
Inverse Weibull        0.7621          4937.04         0.8066
Loglogistic            1.3334          7100.22         1.0594
Paralogistic           1.3052          9116.03         1.0588
Inverse Paralogistic   1.2297          5607.23         1.0568
Burr                   2.5155          21756           1.0628                         1.0285
Inverse Burr           0.5191          13127           1.0666                         1.6729
Weibull                0.1958          324.01          0.3233

Fitting Criteria
Distribution           AIC       AICC      BIC
Pareto                 96,095    96,095    96,115
Inverse Pareto         96,410    96,410    96,429
Inverse Gamma          97,553    97,553    97,573
Inverse Weibull        97,029    97,029    97,049
Loglogistic            96,181    96,181    96,200
Paralogistic           95,515    95,515    95,534
Inverse Paralogistic   96,230    96,230    96,249
Burr                   96,096    96,096    96,122
Inverse Burr           96,110    96,110    96,136
Weibull                100,384   100,384   100,403


Among the ten heavy-tailed loss distributions, the two-parameter Pareto and the
Paralogistic give the two lowest BIC and AICC values (Pareto: AICC = 96,095
and BIC = 96,115; Paralogistic: AICC = 95,515 and BIC = 95,534). Burr and Inverse
Burr also have low BIC and AICC values, but they are higher than those of Pareto
and Paralogistic. Pareto and Paralogistic are therefore chosen as the candidate
loss distributions at this stage. The estimated inflation rate is 6.3% under
Pareto and 5.88% under Paralogistic. The base year for the inflation
rate is 2004.
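For reference, the three criteria can be written as short Python functions using their standard definitions. The log-likelihoods and sample size behind the reported values are not given in this report, so the functions take them as arguments; only the BIC values in the dictionary are copied from the fitting criteria above.

```python
import math


def aic(loglik, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return 2.0 * k - 2.0 * loglik


def aicc(loglik, k, n):
    """Small-sample corrected AIC for n observations."""
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(loglik, k, n):
    """Bayesian information criterion for n observations."""
    return k * math.log(n) - 2.0 * loglik


# BIC values as reported in the fitting criteria; lower is better.
reported_bic = {
    "Pareto": 96115, "Inverse Pareto": 96429, "Inverse Gamma": 97573,
    "Inverse Weibull": 97049, "Loglogistic": 96200, "Paralogistic": 95534,
    "Inverse Paralogistic": 96249, "Burr": 96122, "Inverse Burr": 96136,
    "Weibull": 100403,
}
ranking = sorted(reported_bic, key=reported_bic.get)
print(ranking[:2])
```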
Q-Q plots were constructed in SAS for both the Pareto and Paralogistic
distributions, taking truncation and censoring into account. The graphs attached
below show how well each distribution fits the loss data. Because each risk has
different truncation and censoring points, a separate Q-Q plot was constructed
for each risk, giving six Q-Q plots in total. A Q-Q plot that follows a straight
45-degree line indicates a very good fit. In the following graphs, red lines
stand for the distribution we are using to fit the data and black lines stand
for the true distributions.
[Graphs 1, 2 and 3: Pareto Q-Q plots, one per risk]

The Q-Q plots above show that the Pareto fits the data very well for all three
risks. The fit for Risk C is slightly worse than for Risks A and B, but it is
still very good.
[Graphs 4, 5 and 6: Paralogistic Q-Q plots, one per risk]



The Paralogistic Q-Q plots above show that the Paralogistic fits the data well
for all three risks, but the fit is worse than the Pareto fit: for every risk the
red line lies farther from the black line than in the corresponding Pareto plot.
Combining the Q-Q plot results with the BIC and AICC criteria, the Pareto
distribution is the final selection for the loss distribution for Risks A, B and C.
The following table summarizes the results for the loss distributions for Risks
A, B and C.
Fitted Loss Distribution:
Year   Distribution   Parameters
2004   Pareto         α = 2.0858, θ = 25926
2005   Pareto         α = 2.0858, θ = 25926*1.063 = 27559
2006   Pareto         α = 2.0858, θ = 25926*1.063^2 = 29296
2007   Pareto         α = 2.0858, θ = 25926*1.063^3 = 31141
The 2007 fitted loss distribution is used to calculate the expected loss for
year 2007, which in turn is used to predict the expected number of claims in 2007.
This is shown in Parts B and C.
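The yearly scale parameters follow from the fact that the Pareto is a scale family: inflating every loss by a constant factor multiplies θ by that factor and leaves α unchanged. A small Python sketch of this trending, using the 2004 value and the 1.063 factor from the summary of fitted loss distributions (the function name is illustrative):

```python
BASE_YEAR = 2004
BASE_THETA = 25926.0   # 2004 scale parameter from the fitted loss distribution summary
TREND = 1.063          # 1 + inflation rate used in that summary


def theta_for_year(year):
    """Scale parameter trended from the base year at the estimated inflation rate."""
    return BASE_THETA * TREND ** (year - BASE_YEAR)


for year in range(2004, 2008):
    print(year, round(theta_for_year(year)))
```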
Part B
Fitting Linear Regression Model to Predict Expected Number of Claims for the
Coming Year (2007)



In order to predict the expected number of claims for the coming year, a
relationship between the number of claims and the average ground-up loss is
established using a linear regression model. The number of claims is summarized
by year and MSA, and the summarized data are used to fit the model. Three years
of data yield 299 data points instead of 300 because the loss data for MSA 85
for year 2004 were not provided. MSA 85 is grouped with MSA 12 in the analysis
because the two behave similarly.
According to the question, "the average ground-up size of loss within any given
MSA is correlated with the underlying propensity for a claim in that MSA". We
therefore assume that the relationship between the number of claims and average
loss differs by MSA. The interaction between MSA and Average Loss is included in
our linear model to allow the slope of this linear relationship to vary by MSA,
as well as the intercept (which is handled by the main effect of MSA). The
statistical significance test in our model supports including this interaction.
Please refer to the Chi-square tests in the Type III table.
A Box-Cox transformation was used to determine how to transform the data toward
a normal distribution so that linear regression can be applied appropriately.
The result is attached below.

The Box-Cox transformation produces a lambda of 0.25, which indicates that
taking the fourth root of the number of claims brings the distribution of the
number of claims close to normality.
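To illustrate how a Box-Cox lambda is chosen, the sketch below evaluates the profile log-likelihood of lambda on a grid for an intercept-only normal model. This is an assumption made for illustration: the SAS analysis may have profiled lambda within the regression model instead, and the data here are synthetic values built so that their fourth root is normal.

```python
import math
from statistics import NormalDist


def box_cox(y, lam):
    """Box-Cox transform of a positive value y."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def profile_loglik(values, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(values)
    z = [box_cox(y, lam) for y in values]
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(y) for y in values)


def best_lambda(values, grid):
    """Grid value of lambda with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(values, lam))


# Synthetic data: fourth powers of evenly spaced normal quantiles, so the
# fourth root of the data is normal by construction.
n = 300
normal = NormalDist(mu=2.0, sigma=0.3)
values = [normal.inv_cdf((i + 0.5) / n) ** 4 for i in range(n)]
grid = [i / 20 for i in range(0, 21)]   # 0.00, 0.05, ..., 1.00
print(best_lambda(values, grid))
```

On data constructed this way the selected lambda should fall at or next to 0.25, which corresponds to the fourth-root transformation.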
Normality tests were performed on the fourth root of the number of claims. The
results are shown below.

Normality Test Table
Tests for Normality
Test                 Statistic         p Value
Shapiro-Wilk         W     0.994287    Pr < W      0.3249
Kolmogorov-Smirnov   D     0.050431    Pr > D      0.0638
Cramer-von Mises     W-Sq  0.071095    Pr > W-Sq   >0.2500
Anderson-Darling     A-Sq  0.448284    Pr > A-Sq   >0.2500
[Q-Q Normality Plot: fourth root of the number of claims (fclaim) against Normal Percentiles]
The results show that normality of the fourth root of the number of claims is
not rejected at the 5% significance level: it passes all four formal normality
tests (Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling).
The Q-Q plot for the fourth root of the number of claims is close to a straight
45-degree line. These results indicate that building a linear model for the
fourth root of the number of claims is appropriate.
The following linear regression model is fitted:
Fourth root (# Claims) = Intercept + Beta1*Average-Ground-Up-Loss + Beta2*MSA + Beta3*Average-Ground-Up-Loss*MSA + Error
where Error is the random term, assumed to be independent and identically
distributed (IID) normal, and Average-Ground-Up-Loss*MSA is the interaction
between the two predictors.
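Read for a single MSA, the model gives that MSA its own intercept and its own slope on average ground-up loss. The Python sketch below shows this structure. The coefficient values are hypothetical placeholders, not the fitted estimates, which are in the parameter spreadsheet listed in the Appendix.

```python
def predicted_fourth_root(avg_loss, intercept, beta1, msa_effect, msa_slope):
    """Linear predictor for one MSA.

    msa_effect is the MSA main effect (Beta2 for that MSA) and msa_slope is
    the interaction coefficient (Beta3 for that MSA), so both the intercept
    and the slope on average loss vary by MSA.
    """
    return (intercept + msa_effect) + (beta1 + msa_slope) * avg_loss


def predicted_claims(avg_loss, intercept, beta1, msa_effect, msa_slope):
    """Back-transform the fourth-root prediction to a number of claims."""
    root = predicted_fourth_root(avg_loss, intercept, beta1, msa_effect, msa_slope)
    return root ** 4


# Hypothetical coefficients for one MSA, for illustration only.
print(predicted_claims(15000.0, intercept=1.5, beta1=0.00002,
                       msa_effect=0.1, msa_slope=0.00001))
```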

The linear regression ANOVA table is attached below.
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             199   32.8399995       0.16502512    6.33      <.0001
Error             99    2.57920497       0.02605258
Corrected Total   298   35.41920447

R-Square   Coeff Var   Root MSE   fclaim Mean
0.927181   8.415204    0.161408   1.918053
For the fitted linear regression model above, the overall R-square is about 0.93,
which is high. The F statistic is 6.33, which is significant at the 5%
significance level (p < .0001).
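The summary statistics can be checked directly from the sums of squares and degrees of freedom in the ANOVA table. The short Python check below uses only the standard definitions and the values reported there.

```python
import math

# Values copied from the ANOVA table.
ss_model, df_model = 32.8399995, 199
ss_error, df_error = 2.57920497, 99
ss_total = 35.41920447
mean_response = 1.918053   # fclaim Mean

ms_model = ss_model / df_model      # mean square for the model
ms_error = ss_error / df_error      # mean square error
f_value = ms_model / ms_error       # overall F statistic
r_square = ss_model / ss_total      # share of variation explained
root_mse = math.sqrt(ms_error)      # residual standard deviation
coeff_var = 100.0 * root_mse / mean_response

print(round(f_value, 2), round(r_square, 4), round(root_mse, 4), round(coeff_var, 2))
# prints: 6.33 0.9272 0.1614 8.42
```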
The Type III analysis of the linear regression model below shows that MSA, Average Ground-Up Loss and the interaction between the two are significant at the 5% significance level.
Type III Analysis
Source                       DF   Chi-Square   Pr > ChiSq
MSA                          98   339.81       <.0001
Average Ground-Up Loss       1    4.84         0.0278
Average Ground-Up Loss*MSA   98   313.87       <.0001
Model Diagnostic Check and Outliers
o The residual plot for our linear model is examined to check the model fit.
[Residual plot: residuals (ehat) against fitted values (yhat)]
o The plot shows that the residuals scatter randomly and symmetrically around
the zero line, which supports our assumptions and indicates that the model
fits well and that only noise is left in the residuals.
o There appear to be no outliers in the data, since all the residuals fall
within three standard deviations (three sigma) of zero.

Based on the results above, we conclude that there is a significant linear
relationship between the fourth root of the number of claims and the predictors
(MSA and Average Ground-Up Loss). Fitting a linear model between them is
statistically sound. The fitted linear regression equation is used to calculate
the expected number of claims for the coming year, 2007.
Part C
Calculating the Expected Number of Claims for the Coming Year (2007)

The expected loss for 2007 with the policy limit increased to 75,000 dollars is
calculated as follows.
Expected Loss for 2007 with 75,000 Limit:
$$E(X \wedge 75{,}000) = \frac{\theta}{\alpha-1}\left[1-\left(\frac{\theta}{75000+\theta}\right)^{\alpha-1}\right] = \frac{31141.2}{2.0858-1}\left[1-\left(\frac{31141.2}{75000+31141.2}\right)^{2.0858-1}\right] = \$15173.05$$
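The limited expected value formula for the Pareto can be sketched in Python and checked against a numerical integral of the survival function, since E(X ∧ u) equals the integral of S(x) from 0 to u. The function names are illustrative. Because the parameter estimate table in Part A and the fitted loss distribution summary print different digits for alpha and theta, both are left as arguments and no dollar result is asserted here.

```python
def pareto_limited_expected_value(limit, alpha, theta):
    """E(X ^ limit) for a two-parameter Pareto with alpha != 1."""
    return theta / (alpha - 1.0) * (1.0 - (theta / (limit + theta)) ** (alpha - 1.0))


def survival(x, alpha, theta):
    """Pareto survival function S(x) = (theta / (x + theta)) ** alpha."""
    return (theta / (x + theta)) ** alpha


def integrate_survival(limit, alpha, theta, steps=20_000):
    """Midpoint-rule integral of S(x) on [0, limit]; should match the closed form."""
    width = limit / steps
    return width * sum(survival((i + 0.5) * width, alpha, theta) for i in range(steps))


# Parameters as written in the formula above; supply other estimates as needed.
closed_form = pareto_limited_expected_value(75000.0, alpha=2.0858, theta=31141.2)
numeric = integrate_survival(75000.0, alpha=2.0858, theta=31141.2)
print(closed_form, numeric)
```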


The expected loss for 2007 ($15173.05) is then plugged into our linear model
equation to calculate the predicted number of claims and the 95% confidence and
prediction intervals for each MSA. Because our linear regression model uses the
fourth root of the number of claims, the confidence and prediction intervals
for the number of claims in each MSA are obtained by raising the limits of the
corresponding fourth-root intervals to the fourth power. The overall confidence
and prediction interval for 2007 is then the sum of the confidence and
prediction intervals over all MSAs. This calculation is done in SAS.
Please refer to the confidence interval calculation spreadsheet for details.
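The back-transformation and summation described above can be sketched as follows. The per-MSA numbers are hypothetical and the function names are illustrative. The actual limits come from the SAS output and the spreadsheet listed in the Appendix.

```python
def back_transform(interval):
    """Raise (lower, prediction, upper) on the fourth-root scale to the fourth power."""
    return tuple(value ** 4 for value in interval)


def overall_interval(per_msa_intervals):
    """Sum the back-transformed limits across MSAs, as the text describes."""
    totals = [0.0, 0.0, 0.0]
    for interval in per_msa_intervals:
        for position, value in enumerate(back_transform(interval)):
            totals[position] += value
    return tuple(totals)


# Hypothetical fourth-root-scale results for two MSAs: (lower, prediction, upper).
example = [(1.5, 2.0, 2.5), (1.0, 1.5, 2.0)]
print(overall_interval(example))
# prints: (6.0625, 21.0625, 55.0625)
```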
The expected number of claims and its 95% confidence and prediction interval are
summarized in the table below.
Lower Limit   Predicted Number of Claims   Upper Limit
1300.26       1902.86                      2824.83
In summary, the predicted number of claims in 2007 is 1903 and the 95%
confidence and prediction interval is (1301, 2825).
Part D
Appendix

Summarized Data Used For Linear Regression Based On The Data Provided
G:\CotorTests\Cotor_Linear_model_Data.xls

SAS Program For Fitting Different Distributions and Linear Regressions

Spreadsheets for Parameter Estimates for Linear Regression and for Calculating The
Expected Number of Claims and 95% Confidence and Prediction Intervals
G:\CotorTests\Cotor_Round5_Parameter&Confidence.xls