Data Mining and Actuarial Science

Data Mining Applications
in P&C Insurance
CASE Spring Meeting
April 12, 2005
Lijia Guo, PhD, ASA, MAAA
University of Central Florida
Agenda
• Introduction to data mining modeling
• Understanding the data mining process
• Data mining (DM) techniques
• Applications in P&C Insurance
• Case Study
Introduction – What is Data Mining?
• The process of exploring and analyzing large quantities of data in order to discover meaningful patterns and rules.
• Uses a variety of data analysis tools to discover relationships that may be used to make valid predictions.
• It is not a magic wand. You must:
  • Know your business
  • Understand your data
  • Understand the analytical methods
Introduction - DM Modeling
• An information discovery process
• Knowing your goals
• Understanding your data
• Choosing the right methods
• Understanding the limitations
• Validation and testing
• Making crucial business decisions
Introduction – DM Process
• Understand the Economics
• Define the Goal
• Identify Data Sources
• Prepare Data
• Transform Data
• Apply DM Models
• Validate DM Models
• Implement
Introduction – DM Goals
• Identifying responsive potential customers
• Identifying existing customers who are more likely to terminate
• Identifying high-risk purchasers
• Identifying the factors that cause large claims
• Identifying interactions among risk factors
Introduction – DM Process
DM Techniques
• Decision trees
• Logistic regression
• Neural networks
• Fuzzy logic
• Genetic algorithms
• Clustering
• Association discovery
• Sequence discovery
• Bayesian analysis
• Visualization
• Hybrid algorithms
DM Techniques -- Decision Trees
• What are decision trees?
  • Classify observations based on the values of nominal, binary, or ordinal targets
  • Predict outcomes for interval targets
  • Predict the appropriate decision when you specify decision alternatives
DM Techniques -- Decision Trees
Example: Classification of Surrender Risk
Income > $50,000? (yes or no)
  Yes: Job > 5 years? (yes or no)
    Yes: low risk
    No: high risk
  No: High debt? (yes or no)
    Yes: low risk
    No: high risk
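The example tree can be read as nested if/else rules. The following is a minimal Python sketch of that reading. The argument names (`income`, `years_on_job`, `high_debt`) are hypothetical, and the leaf labels are copied exactly as they appear on the slide.

```python
def classify_surrender_risk(income: float, years_on_job: float, high_debt: bool) -> str:
    """Walk the two-level example tree and return 'low risk' or 'high risk'."""
    if income > 50_000:
        # "Yes" branch of the root: split on job tenure.
        return "low risk" if years_on_job > 5 else "high risk"
    # "No" branch of the root: split on debt, with leaf labels as given on the slide.
    return "low risk" if high_debt else "high risk"
```

Each root-to-leaf path is one classification rule, which is why trees give direct insight into the decision process.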
DM Techniques -- Decision Trees
• Strengths and weaknesses
  • Insights into the decision-making process
  • Efficient, and thus suitable for large data sets
  • Relatively unstable
  • Difficult to detect linear or quadratic relationships
DM Techniques -- Logistic Regression
• What is logistic regression?
• How logistic regression works
  • Odds ratios
  • Each independent variable affects the logit linearly

$$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}, \quad i = 1, 2, \ldots, n.$$
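To make the logit relationship concrete, here is a small Python sketch that maps a linear predictor to a probability and back, and converts a coefficient to an odds ratio. The function names and any coefficient values passed to them are hypothetical and are not taken from the presentation.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predict_probability(beta0: float, betas: list, x: list) -> float:
    """Invert the logit: p = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j)))."""
    eta = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-eta))


def odds_ratio(beta_j: float) -> float:
    """Multiplicative change in the odds for a one-unit increase in x_j."""
    return math.exp(beta_j)
```

Because the logit is linear in each predictor, a one-unit change in a predictor multiplies the odds by the same factor regardless of the values of the other predictors.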
DM Techniques - Logistic Regression
• Strengths and weaknesses
• Maximum likelihood curve fitting
• Multiple logistic regression model
• Interaction-effect modifier
• Multinomial logistic regression model
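DM Techniques -- Neural Networks
• What are neural networks?
  • Input layer: a unit for each input variable
  • Hidden layer: hidden units (neurons)
  • Output layer: the target
[Figure: network diagram with inputs x1, x2, x3, hidden units H1 and H2, and output y; weights w11, ..., w32 connect the inputs to the hidden units, and w1, w2 connect the hidden units to the output]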
DM Techniques – Neural Networks

$$g_0^{-1}(E(y)) = w_0 + w_1 H_1 + w_2 H_2$$
$$H_1 = g_1(w_{01} + w_{11} x_1 + w_{21} x_2 + w_{31} x_3)$$
$$H_2 = g_2(w_{02} + w_{12} x_1 + w_{22} x_2 + w_{32} x_3)$$

• $g_0(\cdot)$: output activation function
• $g_i(\cdot)$: activation functions (nonlinear transformations)
• $w_{11}, w_{21}, \ldots, w_{32}, w_1, w_2$: weights
• $w_0, w_{01}, w_{02}$: biases
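The three equations describe a single forward pass through the network. The sketch below assumes tanh for the hidden activations and an identity output activation, so that E(y) = w0 + w1·H1 + w2·H2. Both choices, and all function and parameter names, are assumptions made for illustration.

```python
import math


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass for one hidden layer.

    hidden_weights[i] holds the weights from every input to hidden unit i,
    e.g. hidden_weights[0] = [w11, w21, w31] for H1.
    """
    hidden = [
        math.tanh(bias + sum(w * xj for w, xj in zip(weights, x)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Identity output activation (assumed): E(y) = w0 + w1*H1 + w2*H2.
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))
```

Training consists of choosing the weights and biases so that this output matches the observed targets as closely as possible.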
DM Techniques – Neural Networks
• How neural networks work
  • Processing elements
  • Training
  • Predicting
  • Activation functions
    • Logistic function: $l(\eta) = \dfrac{1}{1 + e^{-\eta}}$
    • Hyperbolic tangent: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
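The two activation functions are closely related: tanh is a rescaled logistic function, tanh(x) = 2·l(2x) − 1. A short Python sketch, with illustrative function names, that implements both formulas directly so the identity can be checked numerically:

```python
import math


def logistic(eta: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def tanh(x: float) -> float:
    """Hyperbolic tangent: maps any real number into (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def tanh_via_logistic(x: float) -> float:
    """The same function written as a rescaled logistic."""
    return 2.0 * logistic(2.0 * x) - 1.0
```

This direct form of `tanh` is for exposition only; for large |x| the library function `math.tanh` is the safer choice because it avoids overflow in the exponentials.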
DM Techniques -- Neural Networks
• Strengths and weaknesses
  • Accurate prediction for complex problems
  • Black-box prediction engine
  • Overtraining
  • Training speed
DM Techniques -- Hybrid Algorithms
• Problems with standard algorithms
• Advanced algorithms
• Discovery-driven approaches
• Mixture of algorithms
DM Applications in P&C Insurance
• Data Warehouse
• Underwriting
• Pricing/Rate Making
• Claim Scoring
• Risk Management
• Policy Level Analysis
• Variable Selection
Data Warehousing Example
[Figure: data warehouse flow diagram]
• Source data: transactions, surveys, demographics; pharmacy (Rx), hospital, physician, and medical claims
• Primary selection: WHO? (unique patient list)
• Secondary selection: WHAT DATA? (transactions, demographics, surveys)
• Tertiary selection: WHAT DOES THE TRANSACTION DATA TELL US? (service level table: service level variables, derived variables/flags)
• Group by patient into the summary level table (summary level variables)
• Summary: WHAT DO WE KNOW ABOUT THIS PATIENT?
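The "group by patient" step turns a service level table (one row per transaction) into a summary level table (one row per patient). A minimal Python sketch of that aggregation follows. The field names (`patient_id`, `source`, `paid`), the derived flag, and the sample rows are all hypothetical and invented for illustration.

```python
from collections import defaultdict


def summarize_by_patient(service_rows):
    """Collapse service-level rows into one summary-level record per patient."""
    summary = defaultdict(lambda: {"n_services": 0, "total_paid": 0.0, "has_rx": False})
    for row in service_rows:
        record = summary[row["patient_id"]]
        record["n_services"] += 1
        record["total_paid"] += row["paid"]
        # Derived flag: did this patient have any pharmacy (Rx) claim?
        record["has_rx"] = record["has_rx"] or row["source"] == "pharmacy"
    return dict(summary)


# Hypothetical service-level rows, invented for illustration only.
rows = [
    {"patient_id": "P1", "source": "pharmacy", "paid": 20.0},
    {"patient_id": "P1", "source": "hospital", "paid": 500.0},
    {"patient_id": "P2", "source": "physician", "paid": 80.0},
]
summary_table = summarize_by_patient(rows)
```

The summary level table is what the modeling techniques consume: one record per patient, with counts, totals, and flags as candidate input variables.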
DM in Insurance Underwriting
• Improving profit margin
• Gaining competitive edge
• Risk evaluation process
  • Lots of variables
  • Lots of interactions
• Easy-to-follow procedure
• Decision trees can be used
DM in Insurance Underwriting - Auto Driver’s Claim Information

Variable         Variable Type   Measurement Level   Description
Age              Continuous      Interval            Driver’s age in years
Car age          Continuous      Interval            Age of the car
Car type         Categorical     Nominal             Type of the car
Gender           Categorical     Binary              F=female, M=male
Coverage level   Categorical     Nominal             Policy coverage
Education        Categorical     Nominal             Education level of the driver
Location         Categorical     Nominal             Location of residence
Climate          Categorical     Nominal             Climate code for residence
Credit rating    Continuous      Interval            Credit score of the driver
ID               Input           Nominal             Driver’s identification number
No. of claims    Categorical     Nominal             Number of claims
DM in Insurance Underwriting
- Decision Tree Diagram
DM in Pricing/Rate Making
• Data: Auto Driver’s Claim Information
• Decision tree analysis to identify risk factors that predict profits, claims and losses
• Logistic regression applied to model
  • Claim frequency
  • Effect of each risk factor
DM in Pricing/Rate Making
Effect T-scores from the logistic regression
DM in Pricing/Rate Making - Assessment
• Assessment
  • Cross-model comparisons of expected to actual profits/losses
  • Independent of all other factors (sample size, ...)
• Lift charts
  • Compare the % claim-occurrence value to a random baseline model
  • Performance quality is shown by the degree to which the lift chart curve pushes upward and to the left
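A lift chart can be built by ranking records from highest to lowest predicted score and tracking how much of the total claim occurrence is captured in each successive slice. The sketch below is one common way to compute it, not necessarily the construction used for the charts in this presentation. The function names and inputs are hypothetical, and it assumes at least one claim in the data.

```python
def cumulative_capture(scores, outcomes, n_bins=10):
    """Cumulative share of all claims captured after each bin,
    with records sorted from highest to lowest predicted score."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total = sum(outcomes)  # assumed to be greater than zero
    n = len(ranked)
    captured, running, start = [], 0, 0
    for b in range(1, n_bins + 1):
        end = (b * n) // n_bins
        running += sum(outcome for _, outcome in ranked[start:end])
        captured.append(running / total)
        start = end
    return captured


def lift(captured):
    """Ratio of the model's capture rate to a random baseline,
    which captures (b / n_bins) of the claims after b bins."""
    n_bins = len(captured)
    return [c / ((b + 1) / n_bins) for b, c in enumerate(captured)]
```

A model with no predictive power has lift close to 1 in every bin; a curve that pushes upward and to the left corresponds to lift well above 1 in the first bins.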
DM in Pricing/Rate Making - Lift Chart for Logistic Regression
• Logistic regression
  • Captured 30% of the drivers in the 10th percentile
  • Better predictive power from about the 20th to the 80th percentiles
DM in Risk Management
• Reinsurance
  • To structure more effectively by segmentation
• Hedging
• Target retention and building loyalty
DM in Policy Level Analysis
• Retention analysis
• Profitability analysis
• Policyholder’s behavior
• DM methods used
  • Neural networks
  • Decision trees
  • Logistic regression
Applications – Variable Selection
• Problem: given $\{Y, X\}$, where $X = \{x_1, x_2, \ldots, x_N\}$
  • Find $F$ such that $F(X) \approx Y$
  • Find $Z \subset X$ and $F^*$ such that $F^*(Z) \approx Y$
• Improving model accuracy and efficiency
• Making crucial business decisions
Case Study - Group Insurance
• Identify ways to build upon the current manual rating structure, utilizing existing rating variables, to develop a practical tool to guide underwriting in rate adjustments
• Identify any new rating variables with significant predictive power
  • Data currently gathered but not utilized
  • Transformations of existing variables
  • Introduction of new rating variables (e.g. external financial data)
Case Study – Group Insurance
• Profit margin over an x-year period
• 128 input variables
• Principal Components Analysis applied
• 42 variables remain
• How to improve business profit?
Case Study - Goals
• Developing a practical underwriting tool
  • Detecting deviations
  • Identifying key drivers
• Improving model predictive power
  • Risk selection
Function Approximation

$$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \cdots + \beta_M T_M(X)$$

• $F_0$ is the initial guess
• Stagewise approximation
• Each stage is added to reduce the remaining error
• Each stage is a weak learner: a small tree
• Sequential adjustment
Regression Tree Example
Profit = 6.5%
  + 0.8% if AS > 421; -0.5% otherwise
  + 1.2% if male younger than 30; -1.1% otherwise
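Read as an additive model, the example starts from a base profit of 6.5% and adds one adjustment per small tree. A minimal Python sketch using the figures from the slide follows. The argument names are hypothetical, and `as_value` stands for the variable the slide labels "AS", which the source does not define.

```python
def predicted_profit_margin(as_value: float, is_male: bool, age: float) -> float:
    """Predicted profit margin in percent: base value plus two tree adjustments."""
    base = 6.5
    # First small tree: split on AS at 421.
    first_adjustment = 0.8 if as_value > 421 else -0.5
    # Second small tree: male and younger than 30 versus everyone else.
    second_adjustment = 1.2 if (is_male and age < 30) else -1.1
    return base + first_adjustment + second_adjustment
```

For instance, a record with AS above 421 that is male and under 30 receives both positive adjustments, giving 6.5 + 0.8 + 1.2 = 8.5 percent.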
Function Approximation
• GIVEN
  • $Y$: output; $X$: inputs or predictors
  • $L(Y, F)$: loss function
• ESTIMATE

$$F^*(X) = \arg\min_{F(X)} E_{Y,X}\left[ L(Y, F(X)) \right]$$
Classical Function Approximation

$$F = F(X, \beta), \quad \beta = \{\beta_j\}$$

• Solve for $\beta = \{\beta_j\}$ from

$$\min_{\beta} L(Y, F(X, \beta))$$
Nonparametric Function Approximation
• Initial guess: $\{F_0(X_i)\}$
• Compute the gradient

$$g = \left\{ \frac{\partial L}{\partial F(X_i)} \right\}_{i=1}^{N}$$

• Take a step in the steepest descent direction
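As a worked illustration of one steepest descent step, assume the squared-error loss $L = \frac{1}{N}\sum_{i=1}^{N}(Y_i - F(X_i))^2$ and a step size $\rho > 0$. The symbol $\rho$ is introduced here for illustration and does not appear on the slides.

```latex
\begin{aligned}
g_i &= \left.\frac{\partial L}{\partial F(X_i)}\right|_{F = F_0}
     = -\frac{2}{N}\,\bigl(Y_i - F_0(X_i)\bigr), \\
F_1(X_i) &= F_0(X_i) - \rho\, g_i
          = F_0(X_i) + \frac{2\rho}{N}\,\bigl(Y_i - F_0(X_i)\bigr).
\end{aligned}
```

Under this loss the negative gradient is proportional to the residual $Y_i - F_0(X_i)$, so a steepest descent step moves each fitted value toward its observed value.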
Gradient Boosting

$$L(\{F(X_i)\}) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - F(X_i))^2$$

• Initial guess: $\{F_0(X_i)\}$
• FOR m = 1 TO M
  • $g_m = \nabla L(\{F_{m-1}(X_i)\})$
  • Fit an L-node regression tree to the current residuals
  • For each node, calculate the node average residual $h_m(X_i)$
  • Update: $\{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i)$
• END
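The loop can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the case study: it handles a single numeric predictor, uses 2-node trees (stumps) for the L-node tree, takes the mean of Y as the initial guess, and applies no step-size shrinkage. All function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a 2-node regression tree on a single feature.

    Returns (threshold, left_value, right_value), where each value is the
    average residual in that node.
    """
    best = None
    # Candidate thresholds exclude the largest x so both nodes are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are equal: fall back to a single node.
        mean = sum(residuals) / len(residuals)
        return float("inf"), mean, mean
    return best[1], best[2], best[3]


def stump_value(stump, xi):
    """Node average residual h_m for a single observation."""
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def gradient_boost(x, y, n_stages=10):
    """Return (F0, stumps) fitted by stagewise residual fitting."""
    f0 = sum(y) / len(y)  # initial guess F_0 (assumed: the mean of Y)
    fitted = [f0] * len(y)
    stumps = []
    for _ in range(n_stages):
        # For squared error, residuals are proportional to the negative gradient.
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Update: F_m = F_{m-1} + h_m
        fitted = [fi + stump_value(stump, xi) for fi, xi in zip(fitted, x)]
    return f0, stumps


def predict(model, xi):
    """Sum the initial guess and every stage's adjustment."""
    f0, stumps = model
    return f0 + sum(stump_value(stump, xi) for stump in stumps)
```

Each stage is a weak learner on its own; the predictive power comes from summing many small sequential adjustments.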
Case Study
[Figure: two-predictor dependence plot for PROFIT_MARGIN]
Case Study
[Figure: two-predictor dependence plot for PROFIT_MARGIN]
Case Study - Single Stats and Variable Importance

Input        Additive   Multiplicative   Importance
Variable 1   0.2679     0.2690           100.00
Variable 2   0.2779     0.3203            75.23
Variable 3   0.1456     0.1771            54.65
Variable 4   0.2263     0.2469            47.41
Variable 5   0.1059     0.1425            42.81
Variable 6   0.2741     0.2847            34.81
Variable 7   0.1289     0.1306            34.27
Variable 8   0.0797     0.0864            25.35
Variable 9   0.1129     0.1148            23.37
Case Study - Pair Stats and Variable Importance

Variables                 Additive   Multiplicative
Variable 1 & Variable 2   0.3714     0.3847
Variable 2 & Variable 3   0.3704     0.4066
Variable 2 & Variable 4   0.3686     0.4010
Variable 2 & Variable 7   0.3401     0.3856
Variable 3 & Variable 4   0.2795     0.3137
Variable 3 & Variable 6   0.2895     0.3082
Variable 4 & Variable 7   0.2417     0.2592
Variable 5 & Variable 6   0.2622     0.2766
Variable 6 & Variable 7   0.2904     0.3066
Predictive Modeling
• Predicts deviations from expected profitability (used 9 variables)
• Practical guide for underwriters to use for rate adjustments
• New variables identified to have strong predictive power
• Improve business profit (20% profit margin)
Importance of Multiple Techniques
• Robust model with high predictive accuracy
• Practical constraints
• Algorithm complexity
• Ease of understanding of results
Is Data Mining for you?
• Defining the goals
• Understanding your data
• Using multiple techniques
• Improving your decision-making process
• Gaining competitive edges!
Thank you!