Predictive Modeling
Spring 2005 CAMAR Meeting
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
www.data-mines.com

Objectives
• Introduce predictive modeling
• Why use it?
• Describe some methods in depth: trees, neural networks, clustering
• Apply the methods to fraud data

Predictive Modeling Family
• Predictive modeling encompasses classical linear models, GLMs, and data mining

Why Predictive Modeling?
• Better use of insurance data
• Advanced methods for dealing with messy data are now available

Major Kinds of Modeling
• Supervised learning
  • Most common situation
  • A dependent variable: frequency, loss ratio, fraud/no fraud
  • Some methods: regression, CART, some neural networks
• Unsupervised learning
  • No dependent variable
  • Group like records together; a group of claims with similar characteristics might be more likely to be fraudulent
  • Some methods: association rules, k-means clustering, Kohonen neural networks

Two Big Specialties in Predictive Modeling
• GLMs: regression, logistic regression, Poisson regression
• Data mining: trees, neural networks, clustering

Modeling Process
• Internal and external data → data cleaning and other preprocessing → build model → validate and test model → deploy model

Data Complexities Affecting Insurance Data
• Nonlinear functions
• Interactions
• Missing data
• Correlations
• Non-normal data

Kinds of Applications
• Classification
• Prediction

The Fraud Study Data
• 1993 Automobile Insurers Bureau closed Personal Injury Protection claims
• Dependent variables
  • Suspicion score: expert assessment of the likelihood of fraud or abuse, a number from 0 to 10
  • 5 categories, used to create a binary indicator
• Predictor variables
  • Red flag indicators
  • Claim file variables

Introduction of Two Methods
• Trees: sometimes known as CART (Classification and Regression Trees)
• Neural networks: we will introduce the backpropagation neural network

Decision Trees
• Recursively partition the data
• Often sequentially bifurcate the data, but can split into more groups
• Apply a goodness-of-fit statistic to select the best partition at each step
• Select the partition that results in the largest improvement to the goodness-of-fit statistic

Goodness of Fit Statistics
• Chi square, used by CHAID (Fish, Gallagher, Monroe, Discussion Paper Program, 1990):
  \chi^2 = \sum_{i,k} (Observed_{ik} - Expected_{ik})^2 / Expected_{ik}
• Deviance, used by CART:
  D_i = -2 \sum_k n_{ik} \log(p_{ik})  (categorical)
  D = \sum_{j \in cases} (y_j - \mu_j)^2  (or RSS for continuous variables)
• Gini measure, used by CART, where i is an impurity measure:
  i(t) = 1 - \sum_k p_k^2
  \Delta i(t, s) = i(t) - p_L i(t_L) - p_R i(t_R)
• Entropy, used by C4.5:
  I(E) = -\log_2(E/N) = -\log_2(p_E)
  H = -\sum_k p_k \log_2(p_k)

An Illustration from Fraud Data: Gini Measure

  Legal Representation   No Fraud   Fraud   Total
  No                          626      80     706
  Yes                         269     425     694
  Total                       895     505    1400
  Percent                     64%     36%

First Split
• All claims: p(fraud) = 0.36
• Legal Rep = Yes: p(fraud) = 0.612
• Legal Rep = No: p(fraud) = 0.113

Example (cont.)
• Root node impurity: 1 - \sum_k p_k^2 = 0.461

  Legal Rep   p(No Fraud)   p(Fraud)   1 - sum p^2   Row %
  No                0.887       0.113         0.201   50.4%
  Yes               0.388       0.612         0.475   49.6%

• Weighted impurity after the split: 0.201 x 0.504 + 0.475 x 0.496 = 0.337
• Improvement: 0.461 - 0.337 = 0.124
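The improvement calculation above can be verified with a few lines of code. This is a minimal sketch, not part of the original slides; the class counts are taken from the legal-representation table shown earlier.

    # Recompute the Gini impurities and the split improvement for the
    # legal-representation example (counts from the 1993 AIB fraud data table).

    def gini(counts):
        """Gini impurity i(t) = 1 - sum_k p_k^2 for a list of class counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    no_legal = [626, 80]                 # no legal rep: (no fraud, fraud)
    legal    = [269, 425]                # legal rep:    (no fraud, fraud)
    root     = [626 + 269, 80 + 425]     # all claims

    n = sum(root)
    p_left, p_right = sum(no_legal) / n, sum(legal) / n

    improvement = gini(root) - (p_left * gini(no_legal) + p_right * gini(legal))
    print(round(gini(root), 3), round(improvement, 3))  # 0.461 and 0.124, as on the slide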
Example of Nonlinear Function: Suspicion Score vs. 1st Provider Bill
[Figure: neural network fit of SUSPICION vs. provider bill]

An Approach to Nonlinear Functions: Fit a Tree
[Tree diagram: splits on mp1.bill at 1279.5, 153, 842.5, and 2389; terminal node predictions 0.3387, 1.2850, 2.2550, 3.6430, and 4.4270]

Fitted Curve from Tree
[Figure: fitted curve from the tree, fraud score prediction vs. provider bill]

Neural Networks
• Developed by artificial intelligence experts, but now also used by statisticians
• Based on how neurons function in the brain

Neural Networks
• Fit by minimizing the squared deviation between fitted and actual values
• Can be viewed as a non-parametric, nonlinear regression
• Often thought of as a "black box": due to the complexity of the fitted model, it is difficult to understand the relationship between the dependent and predictor variables

The Backpropagation Neural Network
• Three-layer neural network: input layer (input data) → hidden layer (processes data) → output layer (predicted value)

Neural Network
• Fits a nonlinear function at each node of each layer:
  h = f(X; w_0, w_1, ..., w_n) = f(w_0 + w_1 x_1 + ... + w_n x_n) = 1 / (1 + e^{-(w_0 + w_1 x_1 + ... + w_n x_n)})

The Logistic Function
[Figure: logistic function for various values of w_1 (w_1 = -10, -5, -1, 1, 5, 10)]

Universal Function Approximator
• The backpropagation neural network with one hidden layer is a universal function approximator
• Theoretically, with a sufficient number of nodes in the hidden layer, any continuous nonlinear function can be approximated

Nonlinear Function Fit by Neural Network
[Figure: neural network fit of SUSPICION vs. provider bill]

Interactions
• The functional relationship between a predictor variable and the dependent variable depends on the value of one or more other variables
[Figure: neural network predicted value vs. provider bill, by injury type (panels for inj.type 01 through 05)]

Interactions
• Neural networks: the hidden nodes play a key role in modeling the interactions
• CART: partitions the data, and the partitions capture the interactions

Simple Tree of Injury and Provider Bill
[Tree diagram: splits on mp1.bill (1279.5, 153, 2017.5, 2675.5) and injtype; terminal node predictions range from 0.14 to 4.80]
[Figure: tree-fitted response vs. mp1.bill, by injury type (panels for injtype 1, 2, 4, 5, 6, 7, 8, 10, and 99)]

Missing Data
• Occurs frequently in insurance data
• There are some sophisticated methods for addressing this (e.g., the EM algorithm)
• CART finds surrogates for variables with missing values
• Neural networks have no explicit procedure for missing values

More Complex Example
• Dependent variable: expert's assessment of the likelihood that the claim is legitimate (a classification application)
• Predictor variables: a combination of claim file variables (age of claimant, legal representation) and red flag variables (injury is strain/sprain only, claimant has a history of previous claims)
• Used an enhancement on CART known as boosting (see the sketch below)
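The slides do not include code, so the sketch below is only an illustration of the boosted-tree idea, using scikit-learn's GradientBoostingClassifier on randomly generated stand-in data. The predictor names (LEGALREP, TRTLAG, AGE, INJ01) mimic variables from the study, but the data, the target definition, and the choice of GradientBoostingClassifier are assumptions for illustration, not the software or data actually used in the study.

    # Sketch of a boosted-tree classifier, in the spirit of "an enhancement on
    # CART known as boosting". All data here are randomly generated stand-ins.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 1400
    X = pd.DataFrame({
        "LEGALREP": rng.integers(0, 2, n),    # represented by an attorney (0/1)
        "TRTLAG":   rng.exponential(30, n),   # days from accident to treatment
        "AGE":      rng.integers(16, 80, n),  # claimant age
        "INJ01":    rng.integers(0, 2, n),    # strain/sprain only (0/1)
    })
    # Hypothetical binary target, loosely tied to two of the predictors.
    y = (0.8 * X["LEGALREP"] + 0.5 * X["INJ01"] + rng.normal(0, 0.5, n) > 0.9).astype(int)

    # Hold out part of the sample for testing, as described later in the slides.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))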
Red Flag Predictor Variables

  Subject      Indicator Variable   Description
  Accident     ACC01                No report by police officer at scene
               ACC04                Single vehicle accident
               ACC09                No plausible explanation for accident
               ACC10                Claimant in old, low-valued vehicle
               ACC11                Rental vehicle involved in accident
               ACC14                Property damage was inconsistent with accident
               ACC15                Very minor impact collision
               ACC16                Claimant vehicle stopped short
               ACC19                Insured felt set up, denied fault
  Claimant     CLT02                Had a history of previous claims
               CLT04                Was an out-of-state accident
               CLT07                Was one of three or more claimants in vehicle
  Injury       INJ01                Injury consisted of strain or sprain only
               INJ02                No objective evidence of injury
               INJ03                Police report showed no injury or pain
               INJ05                No emergency treatment was given
               INJ06                Non-emergency treatment was delayed
               INJ11                Unusual injury for auto accident
  Insured      INS01                Had history of previous claims
               INS03                Readily accepted fault for accident
               INS06                Was difficult to contact/uncooperative
               INS07                Accident occurred soon after effective date
  Lost Wages   LW01                 Claimant worked for self or a family member
               LW03                 Claimant recently started employment

Claim File Variables (available early in the life of the claim)

  Variable   Description
  AGE        Age of claimant
  RPTLAG     Lag from date of accident to date reported
  TREATLAG   Lag from date of accident to earliest treatment by service provider
  AMBUL      Ambulance charges
  PARTDIS    Claimant partially disabled
  TOTDIS     Claimant totally disabled
  LEGALREP   Claimant represented by an attorney

Neural Network Measure of Variable Importance
• Look at the weights to the hidden layer
• Compute sensitivities: a measure of how much the predicted value's error increases when the variables are excluded from the model one at a time

Variable Importance

  Rank   Variable   Importance
  1      LEGALREP        100.0
  2      TRTLAG           69.7
  3      AGE              54.5
  4      ACC04            44.4
  5      INJ01            42.1
  6      INJ02            39.4
  7      ACC14            35.8
  8      RPTLAG           32.4
  9      AMBUL            29.3
  10     CLT02            23.9

Testing: Hold Out Part of Sample
• Fit the model on 1/2 to 2/3 of the data
• Test the fit of the model on the remaining data
• Needs a large sample

Testing: Cross-Validation
• Hold out 1/n (say 1/10) of the data
• Fit the model to the remaining data
• Test on the portion of the sample held out
• Do this n (say 10) times and average the results
• Used for moderate sample sizes
• Jackknifing is similar to cross-validation

Results of Classification on Test Data

  Neural Network   Fitted 0   Fitted 1
  Actual 0            81.5%      18.5%
  Actual 1            26.7%      73.3%

  Tree             Fitted 0   Fitted 1
  Actual 0            77.3%      22.7%
  Actual 1            14.3%      85.7%

Unsupervised Learning
• Common method: clustering
• No dependent variable; records are grouped into classes with similar values on the variables
• Start with a measure of similarity or dissimilarity
• Maximize dissimilarity between members of different clusters

Dissimilarity (Distance) Measures
• Euclidean distance (i, j = records; k = variable):
  d_{ij} = [ \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 ]^{1/2}
• Manhattan distance:
  d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|

Binary Variables

                      Column = 1   Column = 0   Total
  Row Variable = 1    a            b            a + b
  Row Variable = 0    c            d            c + d
  Total               a + c        b + d

Binary Variables: Matching Coefficients
• Simple matching: d = (b + c) / (a + b + c + d)
• Rogers and Tanimoto: d = 2(b + c) / ((a + d) + 2(b + c))

Results for 2 Clusters

  Cluster   Lawyer   Back or Sprain Claim   Chiro or PT   Prior Claim
  1            77%                    73%           56%           26%
  2             3%                    29%           14%            1%

  Cluster   Suspicious Claim   Average Suspicion Score
  1                      56%                      2.99
  2                       3%                      0.21

Beginners Library
• Berry, Michael J. A., and Linoff, Gordon, Data Mining Techniques, John Wiley and Sons, 1997
• Kaufman, Leonard, and Rousseeuw, Peter, Finding Groups in Data, John Wiley and Sons, 1990
• Smith, Murray, Neural Networks for Statistical Modeling, International Thomson Computer Press, 1996

Data Mining
CAMAR Spring Meeting
Louise Francis, FCAS, MAAA
Louise_francis@msn.com
www.data-mines.com