Data Mining Applications in P&C Insurance
CASE Spring Meeting, April 12, 2005
Lijia Guo, PhD, ASA, MAAA
University of Central Florida

Agenda
- Introduction to data mining modeling
- Understanding the data mining process
- Data mining (DM) techniques
- Applications in P&C insurance
- Case study

Introduction – What is Data Mining?
- Process of exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
- Uses a variety of data analysis tools to discover relationships that may be used to make valid predictions.
- It is not a magic wand:
  - Must know your business
  - Understand your data
  - Understand the analytical methods

Introduction - DM Modeling
An information discovery process.
- Knowing your goals
- Understanding your data
- Choosing the right methods
- Understanding the limitations
- Validation and testing
- Making crucial business decisions

Introduction – DM Process
1. Understand the economics
2. Define the goal
3. Identify data sources
4. Prepare data
5. Transform data
6. Apply DM models
7. Validate DM models
8. Implement

Introduction – DM Goals
- Identifying responsive potential customers
- Identifying existing customers who are more likely to terminate
- Identifying high-risk purchasers
- Identifying the factors that cause large claims
- Identifying interactions among risk factors

Introduction – DM Process
[Figure: DM process diagram]

DM Techniques
- Decision trees
- Logistic regression
- Neural networks
- Fuzzy logic
- Genetic algorithms
- Clustering
- Association discovery
- Sequence discovery
- Bayesian analysis
- Visualization
- Hybrid algorithms

DM Techniques -- Decision Trees
What are decision trees?
- Classify observations based on the values of nominal, binary, or ordinal targets
- Predict outcomes for interval targets
- Predict the appropriate decision when you specify decision alternatives

DM Techniques -- Decision Trees
Example: classification of surrender risk
Income > $50,000? (Yes or No)
- Yes: Job > 5 years? (Yes or No)
  - Yes: low risk
  - No: high risk
- No: High debt? (Yes or No)
  - Yes: low risk
  - No: high risk

DM Techniques -- Decision Trees
Strengths and weaknesses
- Insights into the decision-making process
- Efficient, and thus suitable for large data sets
- Relatively unstable
- Difficult to detect linear or quadratic relationships

DM Techniques -- Logistic Regression
- What is logistic regression
- How logistic regression works
- Odds ratios
- Each independent variable affects the logit linearly:

$$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}, \quad \text{where } i = 1, 2, \ldots, n.$$

DM Techniques - Logistic Regression
Strengths and weaknesses
- Maximum likelihood curve fitting
- Multiple logistic regression model
- Interaction-effect modifier
- Multinomial logistic regression model

DM Techniques -- Neural Networks
What are neural networks?
[Figure: network diagram with inputs x1, x2, x3, hidden units H1 and H2, and output y, with connection weights w11 ... w32 into the hidden units and w1, w2 into the output]
- Input layer: a unit for each input variable
- Hidden layer: hidden units (neurons)
- Output layer: the target

DM Techniques – Neural Networks

$$g_0^{-1}(E(y)) = w_0 + w_1 H_1 + w_2 H_2$$
$$H_1 = g_1(w_{01} + w_{11} x_1 + w_{21} x_2 + w_{31} x_3)$$
$$H_2 = g_2(w_{02} + w_{12} x_1 + w_{22} x_2 + w_{32} x_3)$$

- $g_0(\cdot)$: output activation function.
- $g_i(\cdot)$: activation functions (nonlinear transformations).
- $w_{11}, w_{21}, \ldots, w_{32}, w_1, w_2$: weights
- $w_0, w_{01}, w_{02}$: biases

DM Techniques – Neural Networks
How neural networks work
- Processing elements
- Training
- Predicting
Activation functions
- Logistic function: $l(x) = \dfrac{1}{1 + e^{-x}}$
- Hyperbolic tangent: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

DM Techniques -- Neural Networks
Strengths and weaknesses
- Accurate prediction for complex problems
- Black-box prediction engine
- Overtraining
- Training speed

DM Techniques -- Hybrid Algorithms
- Problems with standard algorithms
- Advanced algorithms
- Discovery-driven approaches
- Mixture of algorithms

DM Applications in P&C Insurance
- Data warehouse
- Underwriting
- Pricing/rate making
- Claim scoring
- Risk management
- Policy level analysis
- Variable selection

Data Warehousing Example
[Figure: data warehouse flow diagram. Labels recovered from the slide:]
- Sources: transactions (pharmacy claims (Rx), hospital claims, med claims, physician claims), surveys, demographics
- Primary selection: WHO? (unique patient list)
- Secondary selection: WHAT DATA? (transactions, demographics, surveys)
- Tertiary selection: WHAT DOES THE TRANSACTION DATA TELL US? (service level table, service level variables, derived variables/flags, group by patient)
- Summary: WHAT DO WE KNOW ABOUT THIS PATIENT? (summary level table, summary level variables)

DM in Insurance Underwriting
- Improving profit margin
- Gaining competitive edge
- Risk evaluation process
- Lots of variables
- Lots of interactions
- Easy-to-follow procedure
- Decision trees can be used

DM in Insurance Underwriting - Auto Driver’s Claim Information

| Variable       | Variable Type | Measurement Level | Description                     |
|----------------|---------------|-------------------|---------------------------------|
| Age            | Continuous    | Interval          | Driver’s age in years           |
| Car age        | Continuous    | Interval          | Age of the car                  |
| Car type       | Categorical   | Nominal           | Type of the car                 |
| Gender         | Categorical   | Binary            | F=female, M=male                |
| Coverage level | Categorical   | Nominal           | Policy coverage                 |
| Education      | Categorical   | Nominal           | Education level of the driver   |
| Location       | Categorical   | Nominal           | Location of residence           |
| Climate        | Categorical   | Nominal           | Climate code for residence      |
| Credit rating  | Continuous    | Interval          | Credit score of the driver      |
| ID             | Input         | Nominal           | Driver’s identification number  |
| No. of claims  | Categorical   | Nominal           | Number of claims                |

DM in Insurance Underwriting - Decision Tree Diagram
[Figure: decision tree diagram]

DM in Pricing/Rate Making
Data: Auto Driver’s Claim Information
- Decision tree analysis to identify risk factors that predict profits, claims, and losses
- Logistic regression applied to model:
  - Claim frequency
  - Effect of each risk factor

DM in Pricing/Rate Making
[Figure: effect T-scores from the logistic regression]

DM in Pricing/Rate Making - Assessment
Assessment
- Cross-model comparisons of the expected to actual profits/losses
- Independent of all other factors (sample size, ...)
- Lift charts
  - Compare the % claim-occurrence value to a random baseline model
  - Performance quality is demonstrated by the degree to which the lift chart curve pushes upward and to the left

DM in Pricing/Rate Making - Lift Chart for Logistic Regression
[Figure: lift chart for logistic regression]
Logistic regression:
- Captured 30% of the drivers in the 10th percentile
- Better predictive power from about the 20th to the 80th percentiles

DM in Risk Management
- Reinsurance
  - To structure more effectively by segmentation
- Hedging
- Target retention and building loyalty

DM in Policy Level Analysis
- Retention analysis
- Profitability analysis
- Policyholder’s behavior
DM methods used
- Neural networks
- Decision trees
- Logistic regression

Applications – Variable Selection
Problem: given $\{Y, X\}$, where $X = \{x_1, x_2, \ldots, x_N\}$
- Find $F$ such that $F(X) \approx Y$
- Find $Z \subset X$ and $F^*$ such that $F^*(Z) \approx Y$
- Improving model accuracy and efficiency
- Making crucial business decisions

Case Study - Group Insurance
- Identify ways to build upon the current manual rating structure, utilizing existing rating variables, to develop a practical tool to guide underwriting in rate adjustments
- Identify any new rating variables with significant predictive power
  - Currently gathered, but not utilized, data
  - Transformations of existing variables
  - Introduce new rating variables (e.g. external financial data)

Case Study – Group Insurance
- Profit margin over an x-year period
- 128 input variables
- Principal Components Analysis applied
- 42 variables remain
- How to improve business profit?

Case Study - Goals
- Developing a practical underwriting tool
- Detecting deviations
- Identifying key drivers
- Improving model predictive power
- Risk selection

Function Approximation

$$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \cdots + \beta_M T_M(X)$$

- $F_0$ is the initial guess
- Stagewise approximation
- Each stage is added to reduce the remaining error
- Each stage is a weak learner: a small tree
- Sequential adjustment

Regression Tree Example

$$\text{Profit} = 6.5\% + \begin{cases} +0.8\%, & \text{if AS} > 421 \\ -0.5\%, & \text{otherwise} \end{cases} + \begin{cases} +1.2\%, & \text{if male younger than 30} \\ -1.1\%, & \text{otherwise} \end{cases}$$

Function Approximation
GIVEN
- $Y$: output, and $X$: inputs or predictors
- $L(Y, F)$: loss function
ESTIMATE

$$F^*(X) = \arg\min_{F(X)} E_{Y,X}\left[L(Y, F(X))\right]$$

Classical Function Approximation
- $F = F(X, \beta)$, with $\beta = \{\beta_j\}$
- Solve $\{\beta_j\}$ from $\min_{\beta} L(Y, F(X, \beta))$

Nonparametric Function Approximation
- Initial guess $\{F_0(X_i)\}_{i=1}^{N}$
- Compute $g = \left\{ \dfrac{\partial L}{\partial F(X_i)} \right\}_{i=1}^{N}$
- Take a step in the steepest descent direction

Gradient Boosting

$$L(\{F(X_i)\}) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - F(X_i))^2$$

Initial guess $\{F_0(X_i)\}$
FOR m = 1 TO M
  - Compute $g_m = \nabla L(\{F_{m-1}(X_i)\})$
  - Fit an L-node regression tree to the current residuals
  - For each given node, calculate the node average residual $h_m(X_i)$
  - Update: $\{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i)$
END

Case Study
[Figure: two-predictor dependence for PROFIT_MARGIN]

Case Study
[Figure: two-predictor dependence for PROFIT_MARGIN]

Case Study - Single Stats and Variable Importance

| Input      | Additive | Multiplicative | Importance |
|------------|----------|----------------|------------|
| Variable 1 | 0.2679   | 0.2690         | 100.00     |
| Variable 2 | 0.2779   | 0.3203         | 75.23      |
| Variable 3 | 0.1456   | 0.1771         | 54.65      |
| Variable 4 | 0.2263   | 0.2469         | 47.41      |
| Variable 5 | 0.1059   | 0.1425         | 42.81      |
| Variable 6 | 0.2741   | 0.2847         | 34.81      |
| Variable 7 | 0.1289   | 0.1306         | 34.27      |
| Variable 8 | 0.0797   | 0.0864         | 25.35      |
| Variable 9 | 0.1129   | 0.1148         | 23.37      |

Case Study - Pair Stats and Variable Importance

| Variables               | Additive | Multiplicative |
|-------------------------|----------|----------------|
| Variable 1 & Variable 2 | 0.3714   | 0.3847         |
| Variable 2 & Variable 3 | 0.3704   | 0.4066         |
| Variable 2 & Variable 4 | 0.3686   | 0.4010         |
| Variable 2 & Variable 7 | 0.3401   | 0.3856         |
| Variable 3 & Variable 4 | 0.2795   | 0.3137         |
| Variable 3 & Variable 6 | 0.2895   | 0.3082         |
| Variable 4 & Variable 7 | 0.2417   | 0.2592         |
| Variable 5 & Variable 6 | 0.2622   | 0.2766         |
| Variable 6 & Variable 7 | 0.2904   | 0.3066         |

Predictive Modeling
- Predicts deviations from expected
  profitability (used 9 variables)
- Practical guide for underwriters to use for rate adjustments
- New variables identified to have strong predictive power
- Improve business profit (20% profit margin)

Importance of Multiple Techniques
- Robust model with high predictive accuracy
- Practical constraints
- Algorithm complexity
- Ease of understanding of results

Is Data Mining for you?
- Defining the goals
- Understanding your data
- Using multiple techniques
- Improving your decision-making process
- Gaining competitive edges!

Thank you!
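
Supplementary sketch: gradient boosting

Returning to the Gradient Boosting slide, the loop described there (initial guess, fit a small tree to the current residuals, update the fit) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the model used in the case study: the trees are one-split stumps, the learning rate `nu` is an added parameter, and the four-row data set is hypothetical, built from the numbers on the Regression Tree Example slide (columns: AS, and a 0/1 flag for "male younger than 30").

```python
# Minimal gradient boosting sketch with squared-error loss (pure Python).
# Assumptions (not from the slides): one-split trees (stumps), the
# learning rate `nu`, and the toy data at the bottom are illustrative.

def mean(values):
    return sum(values) / len(values)

def fit_stump(X, residuals):
    """Pick the single split (feature j, threshold t) that minimizes the
    squared error of the residuals; each leaf predicts its mean residual."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            lm, rm = mean(left), mean(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant
        m = mean(residuals)
        return lambda row: m
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm

def gradient_boost(X, y, n_stages=50, nu=0.5):
    """Stagewise fit: F_m = F_{m-1} + nu * h_m, with F_0 = mean(y)."""
    f0 = mean(y)
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_stages):
        # For squared-error loss the negative gradient is (up to a
        # constant factor) the residual Y_i - F_{m-1}(X_i).
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        h = fit_stump(X, residuals)
        stumps.append(h)
        preds = [pi + nu * h(row) for pi, row in zip(preds, X)]
    return lambda row: f0 + nu * sum(h(row) for h in stumps)

def mse(model, X, y):
    return mean([(yi - model(row)) ** 2 for row, yi in zip(X, y)])

# Hypothetical data: profit (%) built from the regression tree example.
X = [[400, 0], [400, 1], [450, 0], [450, 1]]
y = [6.5 + (0.8 if a > 421 else -0.5) + (1.2 if m == 1 else -1.1)
     for a, m in X]

model = gradient_boost(X, y)
print(round(model([450, 1]), 3), mse(model, X, y))
```

On this toy data every stage halves whichever additive effect is currently larger, so the training error shrinks geometrically. In practice the number of stages, the tree size, and the learning rate would be chosen by validation on held-out data, in line with the "Validation and testing" step of the DM process.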