KL workshop Part 1

Health Insurance
Conference 2012
Predictive Modelling
“GLMs and beyond GLMs”
Singapore – May 2012
Xavier Conort
AGENDA
1. GLMs: The Good, the Bad and the Ugly
2. Trees or how to detect interactions
3. GLM(ixed)M or how to handle variables with
a large number of categories
4. Regularized GLMs or how to handle texts or
data with a large number of predictors
5. The PRIDIT method or how to cope with
no or little information on the response
Gear Analytics
GLMs are a standard, but be aware of their limitations
GOOD
• Recognized as a standard in the insurance industry
• Accommodate responses with skewed distributions
• Lots of literature and readily available software solutions
• Simple mathematical formula, easy to implement
• Diagnostic tools and confidence intervals
• Parametric models (good when you know your data well)
BAD
• Need to pre-process data (missing values, outliers, dimension reduction)
• The assumptions underlying GLMs may not hold:
– Independence of observations
– Appropriateness of the link function
– Appropriateness of the error function
• Risk of relying too much on diagnostic tools. Need to test on unseen cases.
• Iterative modelling process is time-consuming and complex
UGLY
• GLMs do not automatically account for interactions if you don’t specify them in the model structure. Think about latitude and longitude: is it correct to assume North-East effect = North × East effects?
• GLMs will provide you with estimates even if the standard errors are unreasonably high
• GLMs (and other supervised learning techniques) work only if you have reliable information on the response. This information is not always available. Think fraud detection!
How smart actuaries detect potential
interactions
• luck
• intuition
• descriptive analysis
• experience
• market practices
• Machine Learning techniques based on decision trees
Regression trees are known to detect
interactions
…but usually have lower predictive
power than GLMs and are unstable.
By construction, regression trees partition the feature space into a set of
rectangles and then produce a multitude of local interactions
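To make the "rectangles" idea concrete, here is a minimal sketch of a greedy regression tree in plain Python. It is an illustration only, not a production algorithm: the function names (`sse`, `best_split`, `grow`, `predict`) and the toy data are assumptions made for this example. The response is non-zero only when both predictors are large, which is a pure interaction, and a depth-2 tree recovers it as one rectangle.

```python
def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)

def best_split(X, y):
    """Return (cost, feature, threshold) of the best axis-parallel split, or None."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            if not left or not right:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    return best

def grow(X, y, depth):
    """A leaf is a float (mean response); a node is (feature, threshold, left, right)."""
    split = best_split(X, y) if depth > 0 else None
    if split is None or split[0] >= sse(y) - 1e-12:  # no useful split: make a leaf
        return sum(y) / len(y)
    _, j, t = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < t]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= t]
    return (j, t,
            grow([r for r, _ in left], [v for _, v in left], depth - 1),
            grow([r for r, _ in right], [v for _, v in right], depth - 1))

def predict(tree, row):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] < t else right
    return tree

# Toy data (hypothetical): the response is 10 only when BOTH predictors exceed 0.5.
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
X = [[a, b] for a in grid for b in grid]
y = [10.0 if a > 0.5 and b > 0.5 else 0.0 for a, b in X]
tree = grow(X, y, depth=2)
```

The fitted tree splits on the first predictor and then, only inside the upper region, on the second one. No interaction term had to be specified in advance.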
Random Forest will provide you with higher predictive power…
… but less interpretability
A Random Forest is:
• a collection of decision trees that are individually weak but only loosely correlated, because each tree has been trained on a bootstrapped dataset with a random selection of predictors; their predictions are then averaged (think about the wisdom of crowds)
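The two sources of randomness can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: each "tree" is a one-split stump, and the random predictor subset is drawn once per tree, whereas common Random Forest implementations grow deeper trees and redraw the subset at every split. All names and the toy data are hypothetical.

```python
import random

def fit_stump(X, y, features):
    """Best single split (lowest squared error) among the allowed features."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            cost = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or cost < best[0]:
                best = (cost, j, t, ml, mr)
    if best is None:  # no valid split: fall back to a constant prediction
        m = sum(y) / len(y)
        return (features[0], float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] < t else mr

def fit_forest(X, y, n_trees=50, n_features=1, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap: sample rows with replacement
        feats = rng.sample(range(p), n_features)        # random selection of predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Toy data (hypothetical): column 0 drives the response, column 1 is pure noise.
noise = random.Random(1)
X = [[i / 40, noise.random()] for i in range(40)]
y = [1.0 if row[0] > 0.5 else 0.0 for row in X]
forest = fit_forest(X, y)
```

Trees that happen to draw the noise column contribute little, while the others vote for the true signal, so the averaged prediction still separates high and low values of the informative predictor.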
Boosted Regression Trees, or how to learn slowly, step by step
• BRTs (also called Gradient Boosting Machines) combine boosting and decision trees:
– The boosting algorithm gradually increases emphasis on poorly modelled observations. It minimizes a loss function (the deviance, as in GLMs) by adding, at each step, a new simple tree fitted only to the residuals of the current model
– The contribution of each tree is shrunk by a small learning rate (well below 1) to give more stable fitted values for the final model
– To further improve predictive performance, each new tree is fitted on a random subset of the data (an idea borrowed from bagging).
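The three ingredients (fit to residuals, shrink, subsample) can be sketched as follows. This is a minimal illustration assuming a squared-error loss (the Gaussian deviance), one predictor, and one-split trees; for Poisson or binomial responses the "residuals" would be replaced by the negative gradient of the corresponding deviance. Function names, default parameter values and the toy data are assumptions made for the example.

```python
import random

def fit_stump(x, r):
    """One-split regression tree on a single predictor: (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi < t]
        right = [ri for xi, ri in zip(x, r) if xi >= t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, t, ml, mr)
    if best is None:  # no valid split: constant stump
        m = sum(r) / len(r)
        return (float("inf"), m, m)
    return best[1:]

def stump_value(stump, xi):
    t, ml, mr = stump
    return ml if xi < t else mr

def boost(x, y, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    f0 = sum(y) / len(y)                       # start from the overall mean
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        # random subset of the data (drawn without replacement) for this tree
        idx = rng.sample(range(len(y)), max(2, int(subsample * len(y))))
        resid = [y[i] - pred[i] for i in idx]  # what the current model still gets wrong
        stump = fit_stump([x[i] for i in idx], resid)
        stumps.append(stump)
        # shrink the new tree's contribution by the learning rate
        pred = [p + learning_rate * stump_value(stump, xi) for p, xi in zip(pred, x)]
    return f0, learning_rate, stumps

def predict(model, xi):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_value(s, xi) for s in stumps)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy data (hypothetical): a smooth non-linear relationship.
x = [i / 50 for i in range(50)]
y = [(xi - 0.5) ** 2 for xi in x]
model = boost(x, y)
mse_constant = mse(y, [sum(y) / len(y)] * len(y))
mse_boosted = mse(y, [predict(model, xi) for xi in x])
```

Comparing `mse_boosted` with `mse_constant` shows how many small, shrunken steps add up to a close fit of a curve that no single stump could represent.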
Why do I love BRTs?
• BRTs can be fitted to a variety of response types (Gaussian,
Poisson, Binomial)
• The best fit (interactions included) is detected automatically by the machine
• BRTs learn non-linear functions without the need to specify them
• BRT outputs have some GLM flavour and provide insight into the relationship between the response and the predictors
• BRTs avoid much data cleaning because of their
– ability to accommodate missing values
– immunity to monotone transformations of predictors, extreme outliers and irrelevant predictors
BRTs’ partial dependence plots represent the effect of each predictor after accounting for the effects of the other predictors
[Figure: partial dependence plots, annotated "Non-linear relationship detected automatically"]
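A partial dependence curve can be computed for any fitted model by fixing one predictor at a grid value for every row, predicting, and averaging. The sketch below shows that calculation; `toy_model` is a hypothetical stand-in for a fitted BRT and the data are made up for illustration.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction when `feature` is forced to each grid value in turn."""
    curve = []
    for v in grid:
        total = 0.0
        for row in X:
            modified = list(row)       # copy so the original data is untouched
            modified[feature] = v      # fix the predictor of interest
            total += predict(modified)
        curve.append(total / len(X))   # average over the other predictors' observed values
    return curve

def toy_model(row):
    """Hypothetical fitted model: non-linear in column 0, linear in column 1."""
    return row[0] ** 2 + 2.0 * row[1]

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curve = partial_dependence(toy_model, X, feature=0, grid=[0.0, 1.0, 2.0])
```

Plotting `curve` against the grid gives the partial dependence plot for the first predictor; here it follows the quadratic shape of the toy model, shifted by the average effect of the second predictor.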
Plot of interactions fitted by BRTs
BRTs’ prediction formula
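Let’s consider one numerical predictor $X_{num}$ and one categorical predictor $X_{cat}$ (with two levels)
• GLMs’ prediction formula will be
$$\hat{Y} = g^{-1}\big(\beta_0 + \beta_{num} X_{num} + \beta_{cat}\, I(X_{cat}=1)\big)$$
with $g$ the link function
• BRTs’ prediction formula is more complex and less easily implementable
$$\hat{Y} = g^{-1}\big(\beta_0 + \beta_{num1}\, I(X_{num}<\alpha_1) + \beta_{num2}\, I(X_{num}<\alpha_2) + \dots + \beta_{cat}\, I(X_{cat}=1) + \beta_{int1}\, I(X_{num}<\gamma_1 \,\&\, X_{cat}=0) + \beta_{int2}\, I(\gamma_2<X_{num}<\gamma_3 \,\&\, X_{cat}=1) + \dots\big)$$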
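Written as code, the difference between the two formulas is easy to see. The sketch below assumes a log link (so the inverse link is the exponential) and uses made-up coefficients and cut points; none of these numbers come from a fitted model.

```python
import math

def glm_predict(x_num, x_cat, b0=-2.0, b_num=0.03, b_cat=0.4):
    """GLM with log link: one slope for the numeric predictor, one shift for the category."""
    eta = b0 + b_num * x_num + b_cat * (x_cat == 1)
    return math.exp(eta)

def brt_predict(x_num, x_cat):
    """BRT-style predictor with log link: a sum of step functions (hypothetical values)."""
    eta = -2.0
    eta += 0.5 * (x_num < 25)                        # main-effect steps on the numeric predictor
    eta += 0.2 * (x_num < 40)
    eta += 0.4 * (x_cat == 1)                        # categorical main effect
    eta += 0.3 * (x_num < 30 and x_cat == 0)         # local interaction terms
    eta += -0.1 * (50 < x_num < 70 and x_cat == 1)
    return math.exp(eta)
```

The GLM needs three numbers to be deployed, whereas a real BRT needs every split point and leaf value of every tree, which is why it is harder to implement in a rating engine.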
How smart actuaries handle factors with a large number of categories
• In GLMs, predictors with many levels (e.g. territory, car models) and little statistical material aren’t credibility adjusted.
• GLM diagnostics will only alert you: relativities of levels with little exposure sit squarely in the middle of wide confidence intervals driven by large standard errors.
• In practice, actuaries apply ad hoc credibility adjustments before deploying the model
• Generalized Linear Mixed Models (GLMMs) can accomplish this credibility adjustment by modelling both fixed and random effects, and provide credibility estimates automatically.
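The link between random effects and credibility is easiest to see in the simplest special case: a Gaussian model with a random intercept per level and known variance components. There, the random-effect estimate is the classical credibility formula, with weight Z = n / (n + within-level variance / between-level variance). The sketch below illustrates only that special case; a real GLMM estimates the variances from the data, and the numbers used here are hypothetical.

```python
def credibility_estimate(level_mean, n_obs, grand_mean, within_var, between_var):
    """Shrink a level's observed mean towards the grand mean.

    Z is close to 0 for levels with little exposure and close to 1 for large ones.
    """
    z = n_obs / (n_obs + within_var / between_var)
    return grand_mean + z * (level_mean - grand_mean), z

# Hypothetical territories with the same observed relativity but different exposure.
small_territory, z_small = credibility_estimate(1.5, n_obs=10, grand_mean=1.0,
                                                within_var=4.0, between_var=0.04)
large_territory, z_large = credibility_estimate(1.5, n_obs=10000, grand_mean=1.0,
                                                within_var=4.0, between_var=0.04)
```

The small territory is pulled most of the way back to the overall mean, while the large one keeps almost all of its own experience.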
How smart actuaries handle data with a large number of variables
• GLMs are sensitive to multicollinearity and provide you with estimates for every single predictor, which leads to over-fitting and results that are not robust
• By fitting Regularized GLMs, you can automatically select the most relevant predictors and accommodate multicollinearity by introducing a penalty in the loss function (the deviance) to minimize. Here, for a Gaussian error:
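The formula shown on the original slide is not preserved in this text. One widely used form, assumed here for illustration, is the elastic-net objective, in which $\lambda \ge 0$ controls the strength of the penalty and $\alpha \in [0,1]$ mixes the lasso penalty ($\alpha = 1$, which sets some coefficients exactly to zero) with the ridge penalty ($\alpha = 0$, which shrinks correlated coefficients together):

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

The symbols $N$, $x_i$, $\lambda$ and $\alpha$ are notation introduced for this sketch and may differ from the slide's.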
The penalty effect in a
regularized GLM
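The effect of the penalty on a single coefficient can be sketched under a simplifying assumption: one standardized predictor (or orthonormal predictors) and a Gaussian loss of the form ½(b − β)², where b is the unpenalized estimate. In that case the penalized estimates have closed forms. The function names below are made up for this illustration.

```python
def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero (variable selection)

def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage."""
    return b / (1.0 + lam)

strong_lasso = lasso_shrink(3.0, 1.0)   # a large coefficient is reduced by lam
weak_lasso = lasso_shrink(0.5, 1.0)     # a small coefficient is dropped
strong_ridge = ridge_shrink(3.0, 1.0)   # ridge shrinks but never drops
```

The lasso-type penalty removes weak predictors altogether, while the ridge-type penalty only pulls every coefficient towards zero.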
How to make use of texts
• Usually, punctuation and numbers are removed and words are stemmed, but this varies with the application.
• Rare and very frequent words are discarded
• A document-term matrix is produced:
– Incidence or frequency matrix
• The matrix is sometimes scaled to put more emphasis on rare but predictive words
• Regularized GLMs are applied to the matrix (sometimes with 5,000 columns!).
• Alternative: Support Vector Machine
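The text-preparation steps above can be sketched with the Python standard library. Everything here is illustrative: the claim notes are invented, `crude_stem` is a deliberately naive stand-in for a real stemmer, the thresholds `min_df` and `max_df_ratio` are arbitrary, and the scaling shown (term count times log inverse document frequency) is one common choice rather than necessarily the one used by the author.

```python
import math
import re
from collections import Counter

def crude_stem(word):
    """Very rough suffix stripping; a real project would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, drop punctuation and numbers, then stem."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]

def document_term_matrix(docs, min_df=2, max_df_ratio=0.75):
    tokens = [tokenize(d) for d in docs]
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in tokens for term in set(doc))
    # discard rare and very frequent terms
    vocab = sorted(t for t, n in df.items()
                   if n >= min_df and n / len(docs) <= max_df_ratio)
    counts = [[c[t] for t in vocab] for c in map(Counter, tokens)]   # frequency matrix
    idf = [math.log(len(docs) / df[t]) for t in vocab]
    tfidf = [[c * w for c, w in zip(row, idf)] for row in counts]    # emphasize rarer terms
    return vocab, counts, tfidf

# Hypothetical claim notes.
docs = [
    "Driver injured, whiplash claimed 2 days later.",
    "Windscreen cracked by a stone; repaired.",
    "Whiplash injury claimed, no police report.",
    "Car parked, windscreen damaged.",
]
vocab, counts, tfidf = document_term_matrix(docs)
```

Each row of `counts` (or `tfidf`) is one document and each column one retained term, which is the matrix a regularized GLM would then take as its predictors.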
How smart insurance companies handle fraud
• GLMs are sometimes presented as a potential technique in fraud detection but in practice they fail because:
– the history of fraud cases is insufficient and incomplete
– they do not detect kinds of fraud that have not been detected before
• In practice, companies use a series of red flags (based on categorical and numerical variables) but fail to have a single indicator
• Numerous actuarial articles in the past years have presented an unsupervised technique (no label to train on) called PRIDIT as an efficient actuarial way to make use of those operational red flags
PRIDIT technique basic ideas
• Transform all numerical and ordinal red flags to the same scale (values between -1 and 1) using RIDIT statistics (based on the cumulative distribution)
Level   Cumulative distribution   Ridit score
1       0.2                       -0.8
2       0.4                       -0.4
3       0.6                        0
4       0.8                        0.4
5       1                          0.8
• Apply Principal Component Analysis (PCA) to the RIDIT
scores to produce a single indicator
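The scores in the table are consistent with the usual RIDIT definition: for each level, the proportion of observations below that level minus the proportion above it. The sketch below applies that definition, assuming (as the cumulative distribution in the table implies) that each of the five levels holds 20% of the observations. The function name is made up for this example.

```python
def ridit_scores(proportions):
    """Score of level i = (share of observations below i) - (share above i)."""
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])
        above = sum(proportions[i + 1:])
        scores.append(below - above)
    return scores

scores = ridit_scores([0.2, 0.2, 0.2, 0.2, 0.2])
rounded = [round(s, 2) for s in scores]
```

Up to floating-point rounding, `rounded` matches the Ridit score column of the table, and every score lies between -1 and 1 whatever the original scale of the red flag.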
But what is PCA’s basic idea?
Maximize the variance of the data projected onto a few axes
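For a single axis, this idea can be sketched without any library: the direction of maximum variance is the leading eigenvector of the covariance matrix, which power iteration finds by repeatedly multiplying a vector by that matrix. This is a minimal sketch, assuming the data are not constant and the starting vector is not orthogonal to the answer; the function name and toy data are hypothetical.

```python
import math

def first_principal_component(X, n_iter=100):
    """Return (direction, scores) for the first principal component."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    # sample covariance matrix
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p                 # starting direction
    for _ in range(n_iter):                      # power iteration
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # projection of each (centered) observation onto the axis: one score per row
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores

# Toy data (hypothetical): the second column is exactly twice the first.
X = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(X)
```

Because the two columns move together perfectly, a single axis captures all the variance, and `scores` orders the rows along it. In PRIDIT the columns would be the RIDIT scores of the red flags.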
Example (1/4)
• Suppose we want to combine all the information
related to fraud
We compute their ridit scores to put them all on the same scale (including numeric variables)
From: Fraud Classification Using Principal Component Analysis of RIDITs, The Journal of Risk and Insurance, 2002, Vol. 69, No. 3, 341-371
Example (2/4)
• We apply PCA to replace many variables with a
score
– We look for the factor that explains the most variance
(captures most of the correlation) for the set of
variables
– The extracted factor will be a weighted average of the
variables (a score)
• That score can be used to sort claims
– More effort can be spent on claims more likely to be
fraudulent or abusive
Example (3/4)
One can decide to investigate claim 3 first, claim 7 next, and pay the rest of the
10 claims (or, if resources allow, investigate in increasing PRIDIT score order
until resources are exhausted).
Example (4/4)
Factor loadings are also sometimes used to explain
the importance of variables
Component Matrix (a)
Variable        Component 1
SIU             .248
Police Report   .220
At Fault        .709
Legal Rep       .752
Medical Audit   .341
Prior Claim     .406
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
Does it work?
• Based on papers by US actuaries: yes!
• There appears to be a strong relationship between the PRIDIT score and the suspicion that a claim is fraudulent or abusive
• The Society of Actuaries even funded a study which extends the use of the PRIDIT technique to the measurement of hospital quality