Data Mining – Best Practices
Part #2
Richard Derrig, PhD,
Opal Consulting LLC
CAS Spring Meeting
June 16-18, 2008
Data Mining
Data Mining, also known as Knowledge Discovery in Databases (KDD), is the
process of automatically searching large
volumes of data for patterns. In order to
achieve this, data mining uses
computational techniques from statistics,
machine learning and pattern recognition.
 www.wikipedia.org
AGENDA
Predictive v Explanatory Models
Discussion of Methods
Example: Explanatory Models for
Decision to Investigate Claims
The “Importance” of Explanatory
and Predictive Variables
An Eight Step Program for Building
a Successful Model
Predictive v Explanatory Models
Both are of the form: Target or Dependent
Variable is a Function of Feature or
Independent Variables that are related to
the Target Variable
Explanatory Models assume all Variables
are Contemporaneous and Known
Predictive Models assume all Variables
are Contemporaneous and Estimable
Desirable Properties of a Data Mining
Method:
Any nonlinear relationship between target
and features can be approximated
A method that works when the form of the
nonlinearity is unknown
The effect of interactions can be easily
determined and incorporated into the model
The method generalizes well on
out-of-sample data
Major Kinds of Data Mining Methods
Supervised learning
  Most common situation
  Target variable
    Frequency
    Loss ratio
    Fraud/no fraud
  Some methods
    Regression
    Decision Trees
    Some neural networks
Unsupervised learning
  No Target variable
  Group like records together (clustering)
  A group of claims with similar characteristics might be more likely to be of similar risk of loss
  Ex: Territory assignment
  Some methods
    PRIDIT
    K-means clustering
    Kohonen neural networks
The Supervised Methods and Software
Evaluated
1) TREENET
2) Iminer Tree
3) S-PLUS Tree
4) CART
5) S-PLUS Neural
6) Iminer Neural
7) Iminer Ensemble
8) MARS
9) Random Forest
10) Exhaustive CHAID
11) Naïve Bayes (Baseline)
12) Logistic regression (Baseline)
Decision Trees
In decision theory (for example risk
management), a decision tree is a graph of
decisions and their possible consequences,
(including resource costs and risks) used to
create a plan to reach a goal. Decision trees are
constructed in order to help with making
decisions. A decision tree is a special form of
tree structure.
 www.wikipedia.org
CART – Example of 1st split on Provider 2 Bill, With Paid as Dependent
1st Split
All Data: Mean = 11,224
  Bill < 5,021: Mean = 10,770
  Bill >= 5,021: Mean = 59,250
For the entire database, the total squared deviation of paid losses around the
predicted value (i.e., the mean) is 4.95 x 10^13. The SSE declines to 4.66 x 10^13
after the data are partitioned using $5,021 as the cutpoint.
Any other partition of the provider bill produces a larger SSE than 4.66 x 10^13.
For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 x 10^13.
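The split search described above can be sketched as follows. This is a minimal illustration, not the CART software itself: it assumes a single numeric predictor, tries every distinct value as a cutpoint, and keeps the one with the smallest combined SSE. The `bill` and `paid` arrays are hypothetical toy values, not the Massachusetts claim data.

```python
import numpy as np

def sse(y):
    """Sum of squared deviations of y around its own mean."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(x, y):
    """Return (cutpoint, total SSE) for the partition x < cut / x >= cut with the smallest SSE."""
    best_cut, best_sse = None, sse(y)
    # Candidate cutpoints: every distinct value except the smallest,
    # so that both sides of the split are non-empty.
    for cut in np.unique(x)[1:]:
        left, right = y[x < cut], y[x >= cut]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_cut, best_sse = cut, total
    return best_cut, best_sse

# Hypothetical toy data: provider bill and paid loss for seven claims.
bill = np.array([500.0, 1200.0, 3000.0, 4800.0, 5200.0, 9000.0, 15000.0])
paid = np.array([2000.0, 2500.0, 3000.0, 3500.0, 20000.0, 22000.0, 25000.0])

cut, split_sse = best_split(bill, paid)
print(f"cutpoint={cut}, SSE before={sse(paid):.0f}, SSE after={split_sse:.0f}")
```

A full tree would apply the same search recursively to each partition and across all predictors.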
Different Kinds of Decision Trees
Single Trees (CART, CHAID)
Ensemble Trees, a more recent development
(TREENET, RANDOM FOREST)
A composite or weighted average of many trees
(perhaps 100 or more)
There are many methods to fit the trees and prevent
overfitting
  Boosting: Iminer Ensemble and Treenet
  Bagging: Random Forest
Neural Networks
Three Layer Neural Network
Input Layer (Input Data) -> Hidden Layer (Process Data) -> Output Layer (Predicted Value)
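A forward pass through such a three-layer network might look like the sketch below. The sigmoid activation, the layer sizes, and the random (untrained) weights are all assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer -> output layer (predicted value)."""
    hidden = sigmoid(x @ w_hidden + b_hidden)   # hidden layer processes the input data
    return sigmoid(hidden @ w_out + b_out)      # output layer produces a score in (0, 1)

# Hypothetical sizes and untrained random weights, for illustration only.
rng = np.random.default_rng(0)
n_inputs, n_hidden = 4, 3
w_hidden = rng.normal(size=(n_inputs, n_hidden))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=(n_hidden, 1))
b_out = np.zeros(1)

x = rng.normal(size=(5, n_inputs))              # five records, four features each
scores = forward(x, w_hidden, b_hidden, w_out, b_out)
print(scores.shape)
```

In practice the weights would be fitted to the target variable rather than drawn at random.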
NEURAL NETWORKS
Self-Organizing Feature Maps
T. Kohonen 1982-1990 (Cybernetics)
Reference vectors of features map to OUTPUT format in a topologically faithful
way. Example: map onto a 40x40 two-dimensional square.
An iterative process adjusts all reference vectors in a “neighborhood” of the
nearest one. Neighborhood size shrinks over iterations.
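The iterative adjustment can be sketched as below. This is one common formulation of a self-organizing map, not necessarily the one used for the slides: the Gaussian neighborhood weighting, the linear decay schedules, and the parameter names (`grid`, `n_iter`, `lr0`) are assumptions.

```python
import numpy as np

def train_som(data, grid=(10, 10), n_iter=200, lr0=0.5, seed=0):
    """Fit a small self-organizing map; returns the grid of reference vectors."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))   # one reference vector per map cell
    # Map coordinates of every cell, used to measure distance on the grid itself.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                          # learning rate decays over iterations
        radius = max(radius0 * (1.0 - frac), 1.0)        # neighborhood size shrinks over iterations
        x = data[rng.integers(len(data))]                # pick one record at random
        dists = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(np.argmin(dists), dists.shape)  # nearest reference vector
        grid_dist = np.linalg.norm(coords - np.array(winner), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
        influence[grid_dist > radius] = 0.0              # only the neighborhood is adjusted
        weights += lr * influence[..., None] * (x - weights)
    return weights

# Hypothetical data: 50 records with 3 features scaled to [0, 1).
demo_data = np.random.default_rng(1).random((50, 3))
som = train_som(demo_data, grid=(5, 5), n_iter=100)
print(som.shape)
```

Passing `grid=(40, 40)` would correspond to the 40x40 square mentioned in the example.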
FEATURE MAP
SUSPICION LEVELS
[Figure: feature map shaded by suspicion level; legend bands 0-1, 1-2, 2-3, 3-4, 4-5]
FEATURE MAP
SIMILARITY OF A CLAIM
[Figure: feature map shaded by similarity to a claim; legend bands 0-1, 1-2, 2-3, 3-4, 4-5]
DATA MODELING EXAMPLE: CLUSTERING
 Data on 16,000
Medicaid providers
analyzed by
unsupervised neural net
 Neural network
clustered Medicaid
providers based on
100+ features
 Investigators validated a
small set of known
fraudulent providers
 Visualization tool
displays clustering,
showing known fraud
and abuse
 Subset of 100 providers
with similar patterns
investigated: Hit rate >
70%
© 1999 Intelligent Technologies Corporation
Cube size proportional to annual Medicaid revenues
Multivariate Adaptive Regression Splines
(MARS)
MARS fits a piecewise linear regression
  BF1 = max(0, X - 1,401.00)
  BF2 = max(0, 1,401.00 - X)
  BF3 = max(0, X - 70.00)
  Y = 0.336 + .145626E-03 * BF1 - .199072E-03 * BF2 - .145947E-03 * BF3
BF1, BF2, BF3 are basis functions
MARS uses statistical optimization to find the best basis
function(s)
A basis function is similar to a dummy variable in regression:
like a combination of a dummy indicator and a linear
independent variable
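The fitted function above can be evaluated directly. The sketch below only restates the three basis functions and coefficients from the slide; the function name `mars_predict` is hypothetical.

```python
def mars_predict(x):
    """Evaluate the piecewise linear MARS fit shown above at a single value x."""
    bf1 = max(0.0, x - 1401.0)    # active only above the knot at 1,401
    bf2 = max(0.0, 1401.0 - x)    # active only below the knot at 1,401
    bf3 = max(0.0, x - 70.0)      # active only above the knot at 70
    return 0.336 + 0.145626e-03 * bf1 - 0.199072e-03 * bf2 - 0.145947e-03 * bf3

for x in (0.0, 70.0, 1401.0, 5000.0):
    print(x, round(mars_predict(x), 6))
```

Each basis function is zero on one side of its knot and linear on the other, which is why the fit changes slope at 70 and at 1,401.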
Baseline Methods:
Naive Bayes Classifier
Logistic Regression
Naive Bayes assumes the feature (predictor)
variables are independent conditional on
each category
Logistic Regression assumes the log-odds of the
target is linear in the feature (predictor)
variables
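The two baseline assumptions can be written out in a few lines. This is a sketch under the usual textbook formulations: logistic regression with the log-odds of the target linear in the features, and Naive Bayes with class-conditional likelihoods multiplied together. All coefficients and probabilities in the usage lines are hypothetical.

```python
import math

def logistic_probability(features, intercept, coefs):
    """P(yes) when the log-odds are a linear function of the features."""
    log_odds = intercept + sum(b * x for b, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

def naive_bayes_posterior(prior_yes, likelihoods_yes, likelihoods_no):
    """P(yes | features) when features are independent given the class.

    likelihoods_yes[i] is P(feature i's observed value | yes); likewise for no.
    """
    joint_yes = prior_yes * math.prod(likelihoods_yes)
    joint_no = (1.0 - prior_yes) * math.prod(likelihoods_no)
    return joint_yes / (joint_yes + joint_no)

# Hypothetical inputs, for illustration only.
print(logistic_probability([0.0, 0.0], intercept=0.0, coefs=[1.0, 2.0]))
print(naive_bayes_posterior(0.5, [0.8, 0.5], [0.2, 0.5]))
```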
REAL CLAIM FRAUD
DETECTION PROBLEM
Classify all claims
Identify valid classes
  Pay the claim
  No hassle
  Visa Example
Identify (possible) fraud
  Investigation needed
Identify “gray” classes
  Minimize with “learning” algorithms
The Fraud Surrogates used as Target
Decision Variables
Independent Medical Exam (IME)
requested
Special Investigation Unit (SIU) referral
IME successful
SIU successful
DATA: Detailed Auto Injury Closed Claim
Database for Massachusetts
Accident Years (1995-1997)
DM
Databases -> Scoring Functions -> Graded Output
  Non-Suspicious Claims: Routine Claims
  Suspicious Claims: Complicated Claims
ROC Curve
Area Under the ROC Curve
Want good performance on both sensitivity and
specificity
Sensitivity and specificity depend on the cut points
chosen for the binary target (yes/no)
Choose a series of different cut points, and
compute sensitivity and specificity for each of
them
Graph results
  Plot sensitivity vs 1 - specificity
  Compute an overall measure of “lift”, or area
  under the curve
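The procedure above can be sketched in a few lines. This is an illustrative implementation, not the software used for the results that follow: it uses every distinct score as a cut point and integrates with the trapezoidal rule. The labels and scores in the usage lines are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (1 - specificity, sensitivity) for a series of cut points."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Cut points from high to low, padded so the curve runs from (0, 0) to (1, 1).
    cuts = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1], [-np.inf]))
    tpr, fpr = [], []
    for cut in cuts:
        pred = scores > cut                                     # above the cut point: classify "yes"
        tpr.append(np.sum(pred & y_true) / np.sum(y_true))      # sensitivity
        fpr.append(np.sum(pred & ~y_true) / np.sum(~y_true))    # 1 - specificity
    return np.array(fpr), np.array(tpr)

def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical labels (1 = yes) and model scores.
fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auroc(fpr, tpr))
```

An area of 0.5 corresponds to a model no better than chance, and 1.0 to perfect separation.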
True/False Positives and True/False
Negatives: The “Confusion” Matrix
Choose a “cut point” in the model score.
Claims > cut point, classify “yes”.
Sample Confusion Matrix: Sensitivity and Specificity

                    True Class
Prediction      No       Yes      Row Total
No              800      200      1,000
Yes             200      400      600
Column Total    1,000    600

                Correct  Total    Percent Correct
Specificity     800      1,000    80%
Sensitivity     400      600      67%
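The percentages follow directly from the four cell counts. The sketch below assumes "yes" is the positive class, so that sensitivity is the share of true "yes" claims predicted "yes" and specificity is the share of true "no" claims predicted "no".

```python
# Cell counts from the sample confusion matrix, with "yes" as the positive class.
true_neg, false_neg = 800, 200    # predicted "no":  true class no / yes
false_pos, true_pos = 200, 400    # predicted "yes": true class no / yes

sensitivity = true_pos / (true_pos + false_neg)    # 400 / 600
specificity = true_neg / (true_neg + false_pos)    # 800 / 1,000
accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)

print(f"sensitivity={sensitivity:.0%} specificity={specificity:.0%} accuracy={accuracy:.0%}")
# prints: sensitivity=67% specificity=80% accuracy=75%
```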
TREENET ROC Curve – IME
AUROC = 0.701
Logistic ROC Curve – IME
AUROC = 0.643
Ranking of Methods/Software – IME Requested

Method/Software          AUROC    Lower Bound  Upper Bound
Random Forest            0.7030   0.6954       0.7107
Treenet                  0.7010   0.6935       0.7085
MARS                     0.6974   0.6897       0.7051
SPLUS Neural             0.6961   0.6885       0.7038
S-PLUS Tree              0.6881   0.6802       0.6961
Logistic                 0.6771   0.6695       0.6848
Naïve Bayes              0.6763   0.6685       0.6841
SPSS Exhaustive CHAID    0.6730   0.6660       0.6820
CART Tree                0.6694   0.6613       0.6775
Iminer Neural            0.6681   0.6604       0.6759
Iminer Ensemble          0.6491   0.6408       0.6573
Iminer Tree              0.6286   0.6199       0.6372
Variable Importance (IME) Based on Average of Methods
Important Variable Summarizations for IME
Tree Models, Other Models and Total

Variable                      Variable Type  Total Score  Total Score Rank
Health Insurance              F              16529        1
Provider 2 Bill               F              12514        2
Injury Type                   F              10311        3
Territory                     F              5180         4
Provider 2 Type               F              4911         5
Provider 1 Bill               F              4711         6
Attorneys Per Zip             DV             2731         7
Report Lag                    DV             2650         8
Treatment Lag                 DV             2638         9
Claimant per City             DV             2383         10
Provider 1 Type               F              1794         11
Providers per City            DV             1708         12
Attorney                      F              1642         13
Distance MP1 Zip to Clt Zip   DV             1134         14
AGE                           F              1048         15
Avg. Household Price/Zip      DM             907          16
Emergency Treatment           F              660          17
Income Household/Zip          DM             329          18
Providers/Zip                 DV             288          19
Household/Zip                 DM             242          20
Policy Type                   F              4            21
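Tree Score Rank and Other Score Rank (values in source order; row alignment with the variables is not recoverable):
2, 1, 3, 4, 6, 5, 7, 10, 13, 12, 9, 11, 8, 1,
3, 2, 7, 4, 5, 14, 8, 6, 9, 13, 11, 16, 18, 17,
10, 12, 16, 14, 15, 20, 19, 21, 15, 18, 20, 17, 19, 21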
Claim Fraud Detection Plan
STEP 1: SAMPLE: Systematic benchmark of a random sample of claims.
STEP 2: FEATURES: Isolate red flags and other sorting characteristics.
STEP 3: FEATURE SELECTION: Separate features into objective and subjective, early, middle and late arriving, acquisition cost levels, and other practical considerations.
STEP 4: CLUSTER: Apply unsupervised algorithms (Kohonen, PRIDIT, Fuzzy) to cluster claims, examine for needed homogeneity.
STEP 5: ASSESSMENT: Externally classify claims according to objectives for sorting.
STEP 6: MODEL: Supervised models relating selected features to objectives (logistic regression, Naïve Bayes, Neural Networks, CART, MARS).
STEP 7: STATIC TESTING: Model output versus expert assessment, model output versus cluster homogeneity (PRIDIT scores) on one or more samples.
STEP 8: DYNAMIC TESTING: Real time operation of acceptable model, record outcomes, repeat steps 1-7 as needed to fine tune model and parameters. Use PRIDIT to show gain or loss of feature power and changing data patterns, tune investigative proportions to optimize detection and deterrence of fraud and abuse.