Predictive Modeling CAS Reinsurance Seminar Louise Francis, FCAS, MAAA May 7, 2007

Francis Analytics
Actuarial Data Mining Services
Predictive Modeling
CAS Reinsurance Seminar
May 7, 2007
Louise Francis, FCAS, MAAA
Louise.francis@data-mines.com
Francis Analytics and Actuarial Data Mining, Inc.
www.data-mines.com
Why Predictive Modeling?
• Better use of data than traditional methods
• Advanced methods for dealing with messy data now available
Data Mining Goes Prime Time
Becoming A Popular Tool In All Industries
Real Life Insurance Application – The “Boris Gang”
Predictive Modeling Family
Predictive Modeling
• Classical Linear Models
• GLMs
• Data Mining
Data Quality: A Data Mining Problem
• Actuary reviewing a database
A Problem: Nonlinear Functions
An Insurance Nonlinear Function:
Provider Bill vs. Probability of Independent Medical Exam
[Figure: "Value Prob IME" (vertical axis, about 0.30 to 0.90) plotted against Provider 2 Bill (horizontal axis, 0 to 11,368).]
Classical Statistics: Regression
• Estimation of parameters: fit the line that minimizes the squared deviation between actual and fitted values
$$\min \sum_i (Y_i - \hat{Y}_i)^2$$
[Figure: Workers Comp Severity Trend. Severity (vertical axis, $0 to $10,000) by Year (horizontal axis, 1990 to 2004), showing actual Severity and Fitted Y.]
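The least-squares criterion on this slide can be sketched in a few lines of Python. This is only an illustration: the severity values are made up rather than taken from the Workers Comp chart, and `fit_line` and `sse` are hypothetical helper names.

```python
def fit_line(x, y):
    """Return (intercept, slope) minimizing sum((y_i - (a + b*x_i))**2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared deviations between actual and fitted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical severities by year (illustrative only, not the slide's data).
years = [1990, 1992, 1994, 1996, 1998, 2000]
severity = [3000.0, 3400.0, 4100.0, 4500.0, 5200.0, 5600.0]

a, b = fit_line(years, severity)
fitted = [a + b * yr for yr in years]
print(f"slope per year: {b:.1f}, SSE: {sse(years, severity, a, b):.1f}")
```

Moving the slope or intercept away from the fitted values can only increase the sum of squared deviations, which is what "minimizes" means here.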
Generalized Linear Models
Common Links for GLMs
The identity link: $h(Y) = Y$
The log link: $h(Y) = \ln(Y)$
The inverse link: $h(Y) = \frac{1}{Y}$
The logit link: $h(Y) = \ln\left(\frac{Y}{1 - Y}\right)$
The probit link: $h(Y) = \Phi^{-1}(Y)$, where $\Phi$ denotes the standard normal CDF
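As a rough illustration, the common links can be written as plain Python functions. The dictionary name `LINKS` and the helper `inverse_logit` are hypothetical, and the probit link is taken here in its usual form as the inverse of the standard normal CDF.

```python
import math
from statistics import NormalDist

# Each link h maps the mean of the response onto the scale of the linear predictor.
LINKS = {
    "identity": lambda y: y,
    "log": lambda y: math.log(y),
    "inverse": lambda y: 1.0 / y,
    "logit": lambda y: math.log(y / (1.0 - y)),
    "probit": lambda y: NormalDist().inv_cdf(y),  # inverse standard normal CDF
}


def inverse_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


p = 0.25  # an illustrative probability, valid for every link above
for name, h in LINKS.items():
    print(f"{name:8s} h({p}) = {h(p):.4f}")
```

The logit and probit links are only defined for values strictly between 0 and 1, which is why they suit probabilities such as fraud indicators.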
Major Kinds of Data Mining
• Supervised learning
  – Most common situation
  – A dependent variable
    • Frequency
    • Loss ratio
    • Fraud/no fraud
  – Some methods
    • Regression
    • CART
    • Some neural networks
• Unsupervised learning
  – No dependent variable
  – Group like records together
    • A group of claims with similar characteristics might be more likely to be fraudulent
    • Ex: Territory assignment, Text Mining
  – Some methods
    • Association rules
    • K-means clustering
    • Kohonen neural networks
Desirable Features of a Data Mining Method
• Any nonlinear relationship can be approximated
• A method that works when the form of the nonlinearity is unknown
• The effect of interactions can be easily determined and incorporated into the model
• The method generalizes well on out-of-sample data
The Fraud Surrogates used as Dependent Variables
• Independent Medical Exam (IME) requested
• Special Investigation Unit (SIU) referral
– (IME successful)
– (SIU successful)
• Data: Detailed Auto Injury Claim Database for Massachusetts
• Accident Years (1995-1997)
Predictor Variables
• Claim file variables
– Provider bill, Provider type
– Injury
• Derived from claim file variables
– Attorneys per zip code
– Docs per zip code
• Using external data
– Average household income
– Households per zip
Different Kinds of Decision Trees
• Single Trees (CART, CHAID)
• Ensemble Trees, a more recent development (TREENET, RANDOM FOREST)
– A composite or weighted average of many trees (perhaps 100 or more)
Non Tree Methods
• MARS – Multivariate Adaptive Regression Splines
• Neural Networks
• Naïve Bayes (Baseline)
• Logistic Regression (Baseline)
Classification and Regression Trees (CART)
• Tree splits are binary
• If the dependent variable is numeric, the split is based on R² or on the sum or mean of squared errors
– For any variable, choose the two-way split of the data that reduces the MSE the most
– Do this for all independent variables
– Choose the variable that reduces the squared errors the most
• When the dependent variable is categorical, other goodness-of-fit measures (Gini index, deviance) are used
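The split search described above can be sketched as follows for a numeric dependent variable. This is a simplified illustration, not the CART software's implementation: the function names and the tiny data set are hypothetical, and real packages add pruning, surrogate splits and other refinements.

```python
def sse(values):
    """Sum of squared deviations of values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (cutpoint, sse_after) for the best binary split x < cutpoint."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # no split: SSE around the overall mean
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cutpoint exists between equal x values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2.0
        left = [yy for _, yy in pairs[:i]]
        right = [yy for _, yy in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (cut, total)
    return best


def best_variable(columns, y):
    """Pick the predictor whose best split leaves the smallest SSE."""
    results = {name: best_split(col, y) for name, col in columns.items()}
    name = min(results, key=lambda k: results[k][1])
    return name, results[name]


# Hypothetical claims: two predictors and paid losses as the dependent variable.
columns = {
    "bill": [100, 200, 300, 4000, 5000, 6000],
    "age": [30, 60, 25, 55, 40, 35],
}
paid = [1000, 1200, 1100, 9000, 9500, 10000]

variable, (cutpoint, sse_after) = best_variable(columns, paid)
print(variable, cutpoint, sse_after)
```

Applying the same search again within each resulting group is what grows the tree beyond the first split.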
CART – Example of 1st split on Provider 2 Bill,
With Paid as Dependent
1st Split
All Data: Mean = 11,224
  Bill < 5,021: Mean = 10,770
  Bill >= 5,021: Mean = 59,250
• For the entire database, the total squared deviation of paid losses around the predicted value (i.e., the mean) is 4.95 × 10^13. The SSE declines to 4.66 × 10^13 after the data are partitioned using $5,021 as the cutpoint.
• Any other partition of the provider bill produces a larger SSE than 4.66 × 10^13. For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 × 10^13.
Continue Splitting to get more homogeneous groups at terminal nodes
[Figure: tree with successive splits on mp2.bill (at 3.5, 544.5, 1443.5, 4055.5, 5151.5 and 16659); terminal-node values range from 0.02254 to 0.13330.]
Ensemble Trees: Fit More Than One Tree
• Fit a series of trees
• Each tree added improves the fit of the model
• Average or sum the results of the fits
• There are many methods to fit the trees and prevent overfitting
• Boosting: Iminer Ensemble and Treenet
• Bagging: Random Forest
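The two ensemble ideas named above can be sketched with one-split trees ("stumps"). This is a toy illustration under stated assumptions, not how Treenet or Random Forest is implemented: the function names, the learning rate and the data are hypothetical, and Random Forest additionally samples predictors at each split, which is omitted here.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree and return a prediction function."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    best = None  # (sse, cut, left_mean, right_mean)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, (pairs[i][0] + pairs[i - 1][0]) / 2.0, lm, rm)
    if best is None:  # all x values equal: fall back to the mean
        return lambda xi: overall
    _, cut, lm, rm = best
    return lambda xi: lm if xi < cut else rm


def boost(x, y, n_trees=50, rate=0.1):
    """Boosting: each stump is fit to current residuals; fits are summed."""
    base = sum(y) / len(y)
    stumps = []
    resid = [yi - base for yi in y]
    for _ in range(n_trees):
        stump = fit_stump(x, resid)
        stumps.append(stump)
        resid = [r - rate * stump(xi) for r, xi in zip(resid, x)]
    return lambda xi: base + rate * sum(s(xi) for s in stumps)


def bag(x, y, n_trees=50, seed=0):
    """Bagging: each stump is fit to a bootstrap sample; fits are averaged."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda xi: sum(s(xi) for s in stumps) / len(stumps)


# Hypothetical nonlinear relationship between a predictor and a fraud rate.
x = list(range(20))
y = [0.05 if xi < 5 else 0.12 if xi < 12 else 0.08 for xi in x]

boosted = boost(x, y)
bagged = bag(x, y)
print(round(boosted(3), 4), round(bagged(3), 4))
```

The small learning rate in `boost` is one common way to limit overfitting: each tree contributes only a fraction of its fit, so many trees are needed before the ensemble follows the training data closely.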
Treenet Prediction of IME Requested
[Figure: Treenet predicted probability of IME ("Value Prob IME", vertical axis, about 0.30 to 0.90) plotted against Provider 2 Bill (horizontal axis, 0 to 11,368).]
Neural Networks
Three Layer Neural Network
[Diagram: Input Layer (Input Data) → Hidden Layer (Process Data) → Output Layer (Predicted Value)]
Neural Networks
• Also minimizes squared deviation between fitted and actual values
• Can be viewed as a non-parametric, non-linear regression
Hidden Layer of Neural Network
(Input Transfer Function)
[Figure: Logistic Function for Various Values of w1. Curves for w1 = -10, -5, -1, 1, 5 and 10; X from -1.2 to 0.8 on the horizontal axis, function value from 0.0 to 1.0 on the vertical axis.]
The Activation Function
(Transfer Function)
• The sigmoid logistic function
$$f(Y) = \frac{1}{1 + e^{-Y}}$$
$$Y = w_0 + w_1 X_1 + w_2 X_2 + \ldots + w_n X_n$$
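A minimal sketch of the activation function and of how it is chained in a three-layer network follows. The function names and all weights are hypothetical, and fitting the weights (minimizing squared deviation) is not shown.

```python
import math


def logistic(y):
    """Sigmoid transfer function f(Y) = 1 / (1 + exp(-Y))."""
    return 1.0 / (1.0 + math.exp(-y))


def node_output(weights, inputs):
    """weights = [w0, w1, ..., wn]; inputs = [X1, ..., Xn]."""
    y = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return logistic(y)


def forward(hidden_weights, output_weights, inputs):
    """Three-layer network: inputs -> hidden nodes -> one output node."""
    hidden = [node_output(w, inputs) for w in hidden_weights]
    return node_output(output_weights, hidden)


# Effect of w1 on the steepness of the curve (w0 = 0, single input X).
for w1 in (-10, -5, -1, 1, 5, 10):
    print(w1, [round(node_output([0.0, w1], [x]), 3) for x in (-1.0, 0.0, 1.0)])
```

Larger absolute values of w1 make the curve steeper, and a negative w1 reverses its direction, which is how a hidden node can represent quite different nonlinear shapes.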
Neural Network:
Provider 2 Bill vs. IME Requested
[Figure: Fitted Neural Net Prediction (vertical axis, about 0.04 to 0.14) plotted against Provider 2 Bill (horizontal axis, 0 to 100,000).]
MARS: Provider 2 Bill vs. IME Requested
[Figure: MARS Predicted IME (vertical axis, about 0.05 to 0.13) plotted against Provider 2 Bill (horizontal axis, 0 to 4,000).]
How MARS Fits Nonlinear Function
• MARS fits a piecewise regression
– BF1 = max(0, X - 1,401.00)
– BF2 = max(0, 1,401.00 - X)
– BF3 = max(0, X - 70.00)
– Y = 0.336 + .145626E-03 * BF1 - .199072E-03 * BF2 - .145947E-03 * BF3
– BF1, BF2, BF3 are basis functions
• MARS uses statistical optimization to find the best basis function(s)
• A basis function is similar to a dummy variable in regression: like a combination of a dummy indicator and a linear independent variable
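The fitted equation on this slide can be evaluated directly. The sketch below uses the coefficients and knots exactly as given above; the function names `hinge` and `mars_predict` are hypothetical, and X is assumed to be the Provider 2 Bill amount.

```python
def hinge(value):
    """Basis-function building block: max(0, value)."""
    return max(0.0, value)


def mars_predict(x):
    """Evaluate the slide's fitted MARS equation for a bill amount x."""
    bf1 = hinge(x - 1401.0)
    bf2 = hinge(1401.0 - x)
    bf3 = hinge(x - 70.0)
    return 0.336 + 0.145626e-03 * bf1 - 0.199072e-03 * bf2 - 0.145947e-03 * bf3


for bill in (0, 70, 500, 1401, 4000):
    print(bill, round(mars_predict(bill), 4))
```

Each basis function is zero on one side of its knot and linear on the other, so the prediction is a chain of straight-line segments whose slope changes at 70 and at 1,401.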
Baseline Method: Naive Bayes Classifier
• Naive Bayes assumes that the features (predictor variables) are independent conditional on each category
• The probability that an observation X will have a specific set of values for the independent variables is the product of the conditional probabilities of observing each of the values given target category $c_j$, $j = 1$ to $m$ ($m$ typically 2)
$$P(X_1, X_2, \ldots, X_n \mid c_j) = \prod_i P(X_i = x_i \mid c_j)$$
where $x_1, x_2, \ldots, x_n$ are specific values for the independent variables
Naïve Bayes Formula
$$P(c_j \mid X_1, X_2, \ldots, X_n) = \frac{P(C = c_j, X_1, X_2, \ldots, X_n)}{P(X_1, X_2, \ldots, X_n)} \quad \text{(Bayes Rule)}$$
$$P(c_j \mid X_1, X_2, \ldots, X_n) = \frac{P(C = c_j) \prod_i P(X_i \mid c_j)}{P(X_1, X_2, \ldots, X_n)}$$
The denominator $P(X_1, X_2, \ldots, X_n)$ is a constant (it does not depend on the category $c_j$)
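The formula can be sketched for categorical predictors as follows. This is an illustrative toy, not the procedure used for the results in this presentation: the function names and the eight claim records are hypothetical, and no smoothing is applied for values unseen in training.

```python
from collections import Counter, defaultdict


def train(records, labels):
    """Estimate P(C = c) and the counts behind P(X_i = x | c)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond[c][i][x] = number of class-c records with X_i = x
    cond = defaultdict(lambda: defaultdict(Counter))
    for rec, c in zip(records, labels):
        for i, x in enumerate(rec):
            cond[c][i][x] += 1
    return priors, cond, class_counts


def posterior(record, priors, cond, class_counts):
    """Return P(c | record) for each class under conditional independence."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, x in enumerate(record):
            score *= cond[c][i][x] / class_counts[c]
        scores[c] = score
    total = sum(scores.values())  # plays the role of the constant denominator
    if total == 0.0:  # record has values never seen with any class
        return dict(priors)
    return {c: s / total for c, s in scores.items()}


# Hypothetical claims: (provider type, injury) with an IME-requested label.
records = [
    ("chiro", "strain"), ("chiro", "strain"), ("md", "strain"),
    ("md", "fracture"), ("chiro", "strain"), ("md", "fracture"),
    ("md", "strain"), ("md", "fracture"),
]
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]

priors, cond, class_counts = train(records, labels)
post = posterior(("chiro", "strain"), priors, cond, class_counts)
print(post)
```

Because the denominator is the same for every category, it is enough to compute the numerators and rescale them to sum to one.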
Advantages/Disadvantages
• Computationally efficient
• Under many circumstances has performed well
• Assumption of conditional independence often does not hold
• Can’t be used for numeric variables
Naïve Bayes Predicted IME vs. Provider 2 Bill
[Figure: Mean Probability IME (vertical axis, about 0.06 to 0.14) plotted against Provider 2 Bill (horizontal axis, 0 to 13,767).]
True/False Positives and True/False Negatives
(Type I and Type II Errors) The “Confusion” Matrix
• Choose a “cut point” in the model score.
• Claims > cut point, classify “yes”.
Sample Confusion Matrix: Sensitivity and Specificity (with “Yes” as the positive class)

                True Class
Prediction      No       Yes      Row Total
No              800      200      1,000
Yes             200      400      600
Column Total    1,000    600

                Correct  Total    Percent Correct
Sensitivity     400      600      67%
Specificity     800      1,000    80%
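The sample confusion matrix can be worked through in a few lines. The sketch assumes “Yes” is the positive class, uses the standard definitions of sensitivity (share of true “Yes” claims predicted “Yes”) and specificity (share of true “No” claims predicted “No”), and introduces a hypothetical `classify` helper for the cut-point rule.

```python
# Counts from the sample confusion matrix, with "Yes" as the positive class.
true_neg = 800   # true No, predicted No
false_pos = 200  # true No, predicted Yes
false_neg = 200  # true Yes, predicted No
true_pos = 400   # true Yes, predicted Yes

sensitivity = true_pos / (true_pos + false_neg)  # share of true Yes found
specificity = true_neg / (true_neg + false_pos)  # share of true No found
accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)


def classify(scores, cut_point):
    """Claims with a model score above the cut point are classified 'yes'."""
    return ["yes" if s > cut_point else "no" for s in scores]


print(f"sensitivity={sensitivity:.0%} specificity={specificity:.0%} "
      f"accuracy={accuracy:.0%}")
print(classify([0.05, 0.20, 0.45], cut_point=0.10))
```

Raising the cut point classifies fewer claims as “yes”, which tends to raise specificity and lower sensitivity.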
ROC Curves and Area Under the ROC Curve
• Want good performance on both sensitivity and specificity
• Sensitivity and specificity depend on the cut points chosen
– Choose a series of different cut points, and compute sensitivity and specificity for each of them
– Graph the results
• Plot sensitivity vs. 1 - specificity
• Compute an overall measure of “lift”, or area under the curve
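The procedure above can be sketched as follows. The scores and outcomes are made up, the function names are hypothetical, and the area is computed with the trapezoidal rule, which is one common choice rather than necessarily the one used for the results in this presentation.

```python
def roc_points(scores, actual):
    """Return (1 - specificity, sensitivity) pairs, one per cut point.

    actual holds 1 for "yes" and 0 for "no". Each distinct score is used as a
    threshold with "score >= threshold", which is the same as classifying
    "yes" above a cut point placed just below that score.
    """
    pos = sum(actual)
    neg = len(actual) - pos
    points = [(0.0, 0.0)]  # cut point above every score: nothing is "yes"
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, a in zip(scores, actual) if s >= cut and a == 1)
        fp = sum(1 for s, a in zip(scores, actual) if s >= cut and a == 0)
        points.append((fp / neg, tp / pos))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical model scores and true outcomes (1 = IME requested).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
actual = [1, 1, 0, 1, 0, 0, 1, 0]

points = roc_points(scores, actual)
print(points)
print(f"AUROC = {auroc(points):.3f}")
```

An area of 0.5 corresponds to a model no better than chance and 1.0 to a perfect ranking, so values are usually read against those two reference points.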
TREENET ROC Curve – IME
AUROC = 0.701
Ranking of Methods/Software – IME Requested
Method/Software          AUROC    Lower Bound   Upper Bound
Random Forest            0.7030   0.6954        0.7107
Treenet                  0.7010   0.6935        0.7085
MARS                     0.6974   0.6897        0.7051
S-PLUS Neural            0.6961   0.6885        0.7038
S-PLUS Tree              0.6881   0.6802        0.6961
Logistic                 0.6771   0.6695        0.6848
Naïve Bayes              0.6763   0.6685        0.6841
SPSS Exhaustive CHAID    0.6730   0.6660        0.6820
CART Tree                0.6694   0.6613        0.6775
Iminer Neural            0.6681   0.6604        0.6759
Iminer Ensemble          0.6491   0.6408        0.6573
Iminer Tree              0.6286   0.6199        0.6372
Some Software Packages That Can be Used
• Excel
• Access
• Free Software
– R
– Web based software
• S-Plus (similar to commercial version of R)
• SPSS
• CART/MARS
• Data Mining suites (SAS Enterprise Miner/SPSS Clementine)
References
• Derrig, R., Francis, L., “Distinguishing the Forest from the Trees: A Comparison of Tree Based Data Mining Methods”, CAS Winter Forum, March 2006, WWW.casact.org
• Derrig, R., Francis, L., “A Comparison of Methods for Predicting Fraud”, Risk Theory Seminar, April 2006
• Francis, L., “Taming Text: An Introduction to Text Mining”, CAS Winter Forum, March 2006, WWW.casact.org
• Francis, L.A., “Neural Networks Demystified”, Casualty Actuarial Society Forum, Winter, pp. 254-319, 2001
• Francis, L.A., “Martian Chronicles: Is MARS better than Neural Networks?”, Casualty Actuarial Society Forum, Winter, pp. 253-320, 2003b
• Dhar, V., Seven Methods for Transforming Corporate Data into Business Intelligence, Prentice Hall, 1997
• The web site WWW.data-mines.com has some tutorials and presentations