Data Mining – Best Practices Part #2
Richard Derrig, PhD, Opal Consulting LLC
CAS Spring Meeting, June 16-18, 2008

Data Mining

Data Mining, also known as Knowledge Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns. In order to achieve this, data mining uses computational techniques from statistics, machine learning and pattern recognition.
www.wikipedia.org

AGENDA
- Predictive v Explanatory Models
- Discussion of Methods
- Example: Explanatory Models for Decision to Investigate Claims
- The “Importance” of Explanatory and Predictive Variables
- An Eight Step Program for Building a Successful Model

Predictive v Explanatory Models
Both are of the form: Target or Dependent Variable is a Function of Feature or Independent Variables that are related to the Target Variable
- Explanatory Models assume all Variables are Contemporaneous and Known
- Predictive Models assume all Variables are Contemporaneous and Estimable

Desirable Properties of a Data Mining Method
- Any nonlinear relationship between target and features can be approximated
- A method that works when the form of the nonlinearity is unknown
- The effect of interactions can be easily determined and incorporated into the model
- The method generalizes well on out-of-sample data

Major Kinds of Data Mining Methods
Supervised learning
- Most common situation
- Target variable: Frequency, Loss ratio, Fraud/no fraud
- Some methods: Regression, Decision Trees, Some neural networks
Unsupervised learning
- No Target variable
- Group like records together (Clustering)
- A group of claims with similar characteristics might be more likely to be of similar risk of loss
- Ex: Territory assignment
- Some methods: PRIDIT, K-means clustering, Kohonen neural networks

The Supervised Methods and Software Evaluated
1) TREENET
2) Iminer Tree
3) SPLUS Tree
4) CART
5) S-PLUS Neural
6) Iminer Neural
7) Iminer Ensemble
8) MARS
9) Random Forest
10) Exhaustive Chaid
11) Naïve Bayes (Baseline)
12) Logistic reg (Baseline)

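To make the unsupervised "group like records together" idea concrete, here is a minimal k-means sketch on a single feature. It is an illustration only: the function name `kmeans_1d`, the starting centers, and the claim amounts are hypothetical and do not come from the presentation's data or software.

```python
def kmeans_1d(values, centers, iterations=20):
    """Basic k-means (Lloyd's algorithm) on scalars.

    `centers` holds the starting centroids (an assumption of this sketch).
    """
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:  # no movement: converged
            break
        centers = new_centers
    return centers, groups


# Hypothetical claim amounts, not taken from the presentation.
claims = [900, 1100, 1000, 5200, 4800, 5000]
centers, groups = kmeans_1d(claims, centers=[0, 10000])
print(centers)
print(groups)
```

No target variable is used anywhere: the grouping comes only from the similarity of the records themselves, which is what separates this from the supervised methods listed above.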
Decision Trees

In decision theory (for example risk management), a decision tree is a graph of decisions and their possible consequences (including resource costs and risks), used to create a plan to reach a goal. Decision trees are constructed in order to help with making decisions. A decision tree is a special form of tree structure.
www.wikipedia.org

CART – Example of 1st split on Provider 2 Bill, With Paid as Dependent

1st Split
All Data: Mean = 11,224
- Bill < 5,021: Mean = 10,770
- Bill >= 5,021: Mean = 59,250

For the entire database, the total squared deviation of paid losses around the predicted value (i.e., the mean) is 4.95 x 10^13. The SSE declines to 4.66 x 10^13 after the data are partitioned using $5,021 as the cutpoint. Any other partition of the provider bill produces a larger SSE than 4.66 x 10^13. For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 x 10^13.

Different Kinds of Decision Trees
- Single Trees (CART, CHAID)
- Ensemble Trees, a more recent development (TREENET, RANDOM FOREST)
  - A composite or weighted average of many trees (perhaps 100 or more)
  - There are many methods to fit the trees and prevent overfitting
  - Boosting: Iminer Ensemble and Treenet
  - Bagging: Random Forest

Neural Networks

Three Layer Neural Network
- Input Layer (Input Data)
- Hidden Layer (Process Data)
- Output Layer (Predicted Value)

NEURAL NETWORKS
Self-Organizing Feature Maps
T. Kohonen 1982-1990 (Cybernetics)
- Reference vectors of features map to OUTPUT format in a topologically faithful way.
- Example: Map onto a 40x40 2-dimensional square.
- Iterative process adjusts all reference vectors in a “neighborhood” of the nearest one.
- Neighborhood size shrinks over iterations.

FEATURE MAP: SUSPICION LEVELS
[Figure: feature map grid shaded by suspicion level; legend bands 0-1, 1-2, 2-3, 3-4, 4-5]

FEATURE MAP: SIMILARITY OF A CLAIM
[Figure: feature map grid shaded by similarity; legend bands 0-1, 1-2, 2-3, 3-4, 4-5]

DATA MODELING EXAMPLE: CLUSTERING
- Data on 16,000 Medicaid providers analyzed by unsupervised neural net
- Neural network clustered Medicaid providers based on 100+ features
- Investigators validated a small set of known fraudulent providers
- Visualization tool displays clustering, showing known fraud and abuse
- Subset of 100 providers with similar patterns investigated: Hit rate > 70%
[Figure: cube size proportional to annual Medicaid revenues. © 1999 Intelligent Technologies Corporation]

Multivariate Adaptive Regression Splines (MARS)
- MARS fits a piecewise linear regression
    BF1 = max(0, X - 1,401.00)
    BF2 = max(0, 1,401.00 - X)
    BF3 = max(0, X - 70.00)
    Y = 0.336 + 0.145626E-03 * BF1 - 0.199072E-03 * BF2 - 0.145947E-03 * BF3
- BF1, BF2, BF3 are basis functions
- MARS uses statistical optimization to find best basis function(s)
- Basis function similar to dummy variable in regression.
- Like a combination of a dummy indicator and a linear independent variable

Baseline Methods: Naive Bayes Classifier, Logistic Regression
- Naive Bayes assumes the feature (predictor) variables are independent conditional on each category
- Logistic Regression assumes the log odds of the target are linear in the feature (predictor) variables

REAL CLAIM FRAUD DETECTION PROBLEM
- Classify all claims
- Identify valid classes
  - Pay the claim
  - No hassle
  - Visa Example
- Identify (possible) fraud
  - Investigation needed
- Identify “gray” classes
  - Minimize with “learning” algorithms

The Fraud Surrogates used as Target Decision Variables
- Independent Medical Exam (IME) requested
- Special Investigation Unit (SIU) referral
- IME successful
- SIU successful
DATA: Detailed Auto Injury Closed Claim Database for Massachusetts, Accident Years (1995-1997)

DM
- Databases
- Scoring Functions
- Graded Output: Non-Suspicious Claims (Routine Claims), Suspicious Claims (Complicated Claims)

ROC Curve: Area Under the ROC Curve
- Want good performance both on sensitivity and specificity
- Sensitivity and specificity depend on cut points chosen for binary target (yes/no)
- Choose a series of different cut points, and compute sensitivity and specificity for each of them
- Graph results
  - Plot sensitivity vs 1-specificity
  - Compute an overall measure of “lift”, or area under the curve

True/False Positives and True/False Negatives: The “Confusion” Matrix
- Choose a “cut point” in the model score.
- Claims > cut point, classify “yes”.

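The cut-point procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function names and the toy scores and labels are hypothetical, and the area is computed with the trapezoid rule over the (1-specificity, sensitivity) points.

```python
def confusion_counts(scores, labels, cut):
    """Classify score > cut as "yes" (1) and tally TP, FP, TN, FN."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s > cut else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(scores, labels):
    """One (1 - specificity, sensitivity) point per cut point, high to low."""
    cuts = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for cut in cuts:
        tp, fp, tn, fn = confusion_counts(scores, labels, cut)
        sensitivity = tp / (tp + fn)   # share of true "yes" flagged "yes"
        specificity = tn / (tn + fp)   # share of true "no" flagged "no"
        points.append((1 - specificity, sensitivity))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical model scores and true classes (1 = yes, 0 = no).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(confusion_counts(scores, labels, cut=0.5))
print(auroc(roc_points(scores, labels)))
```

Sweeping the cut point from the highest score down to below the lowest traces the curve from (0, 0) to (1, 1); a model no better than chance would give an area near 0.5.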
Sample Confusion Matrix: Sensitivity and Specificity

                  True Class
Prediction        No        Yes       Total
No                800       200       1,000
Yes               200       400       600
Total             1,000     600

                  Correct   Total     Percent Correct
Specificity       800       1,000     80%
Sensitivity       400       600       67%

TREENET ROC Curve – IME
[Figure: ROC curve, AUROC = 0.701]

Logistic ROC Curve – IME
[Figure: ROC curve, AUROC = 0.643]

Ranking of Methods/Software – IME Requested

Method/Software          AUROC    Lower Bound   Upper Bound
Random Forest            0.7030   0.6954        0.7107
Treenet                  0.7010   0.6935        0.7085
MARS                     0.6974   0.6897        0.7051
SPLUS Neural             0.6961   0.6885        0.7038
S-PLUS Tree              0.6881   0.6802        0.6961
Logistic                 0.6771   0.6695        0.6848
Naïve Bayes              0.6763   0.6685        0.6841
SPSS Exhaustive CHAID    0.6730   0.6660        0.6820
CART Tree                0.6694   0.6613        0.6775
Iminer Neural            0.6681   0.6604        0.6759
Iminer Ensemble          0.6491   0.6408        0.6573
Iminer Tree              0.6286   0.6199        0.6372

Variable Importance (IME) Based on Average of Methods
Important Variable Summarizations for IME: Tree Models, Other Models and Total

Variable                       Variable type   Total Score   Total Score Rank
Health Insurance               F               16529         1
Provider 2 Bill                F               12514         2
Injury Type                    F               10311         3
Territory                      F               5180          4
Provider 2 Type                F               4911          5
Provider 1 Bill                F               4711          6
Attorneys Per Zip              DV              2731          7
Report Lag                     DV              2650          8
Treatment Lag                  DV              2638          9
Claimant per City              DV              2383          10
Provider 1 Type                F               1794          11
Providers per City             DV              1708          12
Attorney                       F               1642          13
Distance MP1 Zip to Clt Zip    DV              1134          14
AGE                            F               1048          15
Avg. Household Price/Zip       DM              907           16
Emergency Treatment            F               660           17
Income Household/Zip           DM              329           18
Providers/Zip                  DV              288           19
Household/Zip                  DM              242           20
Policy Type                    F               4             21

Tree Score Rank and Other Score Rank (values as extracted; row alignment not recoverable):
2 1 3 4 6 5 7 10 13 12 9 11 8 1 3 2 7 4 5 14 8 6 9 13 11 16 18 17 10 12 16 14 15 20 19 21 15 18 20 17 19 21

Claim Fraud Detection Plan
STEP 1: SAMPLE: Systematic benchmark of a random sample of claims.
STEP 2: FEATURES: Isolate red flags and other sorting characteristics.
STEP 3: FEATURE SELECTION: Separate features into objective and subjective, early, middle and late arriving, acquisition cost levels, and other practical considerations.
STEP 4: CLUSTER: Apply unsupervised algorithms (Kohonen, PRIDIT, Fuzzy) to cluster claims, examine for needed homogeneity.
STEP 5: ASSESSMENT: Externally classify claims according to objectives for sorting.
STEP 6: MODEL: Supervised models relating selected features to objectives (logistic regression, Naïve Bayes, Neural Networks, CART, MARS).
STEP 7: STATIC TESTING: Model output versus expert assessment, model output versus cluster homogeneity (PRIDIT scores) on one or more samples.
STEP 8: DYNAMIC TESTING: Real time operation of acceptable model, record outcomes, repeat steps 1-7 as needed to fine tune model and parameters. Use PRIDIT to show gain or loss of feature power and changing data patterns, tune investigative proportions to optimize detection and deterrence of fraud and abuse.
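
Step 6 lists CART among the supervised models. As one illustration of what such a model does at its first step, the sketch below searches for the single cutpoint on one feature that minimizes the total squared deviation (SSE) of the target, in the spirit of the first-split example given earlier for Provider 2 Bill. The function names and the bill and paid amounts are hypothetical; this is not the CART software's implementation.

```python
def sse(values):
    """Sum of squared deviations of `values` around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (cutpoint, total SSE) for the split x < cut vs x >= cut
    that minimizes SSE of y summed over the two sides."""
    best = None
    # Candidate cutpoints: every distinct x except the smallest,
    # so that both sides of the split are non-empty.
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best


# Hypothetical provider bills and paid losses, not from the study data.
bill = [1000, 2000, 3000, 6000, 8000]
paid = [9000, 11000, 10000, 60000, 58000]

print(sse(paid))               # SSE before any split
print(best_split(bill, paid))  # best cutpoint and SSE after the split
```

A full tree repeats this search on each resulting partition and across all features; the ensemble methods average many such trees.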