Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches
CAS Annual Meeting
Louise Francis
Francis Analytics and Actuarial Data Mining, Inc.
www.data-mines.com
Louise.francis@data-mines.com

Objectives
• Review GIRO Data Quality Working Party Paper
• Discuss Data Quality in Predictive Modeling
• Focus on a key part of data preparation: Exploratory data analysis

GIRO Working Party Paper
• Literature Review
• Horror Stories
• Survey
• Experiment
• Actions
• Concluding Remarks

Insurance Data Management Association
The IDMA is an American organisation which promotes professionalism in the Data Management discipline through education, certification and discussion forums.
The IDMA web site:
• Suggests publications on data quality,
• Describes a data certification model, and
• Contains Data Management Value Propositions which document the value to various insurance industry stakeholders of investing in data quality
http://www.idma.org

Horror Stories – Non-Insurance
• Heart-and-Lung Transplant – wrong blood type
• Bombing of Chinese Embassy in Belgrade
• Mars Orbiter – confusion between imperial and metric units
• Fidelity Mutual Fund – withdrawal of dividend
• Porter County, Indiana – Tax Bill and Budget Shortfall

Horror Stories – Rating/Pricing
Examples faced by ISO:
• Exposure recorded in units of $10,000 instead of $1,000
• Large insurer reporting personal auto data as miscellaneous, so that it was omitted from ratemaking calculations
• One company reporting all its Florida property losses as fire (including hurricane years)
• Mismatched coding for policy and claims data

Horror Stories – Katrina
• US weather models underestimated the costs of Katrina by approx. 50% (Westfall, 2005)
• 2004 RMS study highlighted exposure data that was:
    • Out-of-date
    • Incomplete
    • Mis-coded
• Many flood victims had no flood insurance after being told by agents that they were not in flood risk areas.

Data Quality Survey of Actuaries
Purpose: Assess the impact of data quality issues on the work of general insurance actuaries
2 questions:
• percentage of time spent on data quality issues
• proportion of projects adversely affected by such issues

Results - Percentage of Time
Employer            No.   Mean    Median   Min    Max
Insurer/Reinsurer   17    26.4%   25.0%    5.0%   50.0%
Consultancy         13    27.1%   25.0%    7.5%   60.0%
Other               8     23.4%   12.5%    2.0%   75.0%
All                 38    26.0%   25.0%    2.0%   75.0%

Results - Percentage of Projects
Employer            No.   Mean    Median   Min     Max
Insurer/Reinsurer   15    27.9%   20.0%    5.0%    60.0%
Consultancy         13    43.3%   35.0%    10.0%   100.0%
Other               8     22.6%   20.0%    1.0%    50.0%
All                 36    32.3%   30.0%    1.0%    100.0%

Survey Conclusions
• Data quality issues have a significant impact on the work of general insurance actuaries
    • about a quarter of their time is spent on such issues
    • about a third of projects are adversely affected
• The impact varies widely between different actuaries, even those working in similar organisations
• Limited evidence to suggest that the impact is more significant for consultants

Hypothesis
Uncertainty of actuarial estimates of ultimate incurred losses based on poor data is significantly greater than that of estimates based on good data

Data Quality Experiment
• Examine the impact of incomplete and/or erroneous data on actuarial estimates of ultimate losses and loss reserves
• Use real data with simulated limitations and/or errors and observe the potential error in the actuarial estimates

Data Used in Experiment
• Real data for primary private passenger bodily injury liability business for a single no-fault state
• Eighteen (18) accident years of fully developed data; thus, true ultimate losses are known

Actuarial Methods Used
• Paid chain ladder models
• Incurred chain ladder models
• Frequency-severity models
• Inverse power curve for tail factors
• No judgment used in applying methods

Completeness of Data Experiments
Vary size of the sample; that is,
1) All years
2) Use only 7 accident years
3) Use only last 3
diagonals

Data Error Experiments
Simulated data quality issues:
1. Misclassification of losses by accident year
2. Early years not available
3. Late processing of financial information
4. Paid losses replaced by crude estimates
5. Overstatements followed by corrections in following period
6. Definition of reported claims changed

Measure Impact of Data Quality
• Compare Estimated to Actual Ultimates
• Use Bootstrapping to evaluate effect of different samples on results

Results – Experiment 1
• More data generally reduces the volatility of the estimation errors

Results – Experiment 2
• Extreme volatility, especially for estimates based on paid data
• Actuaries' ability to recognise and account for data quality issues is critical
• Actuarial adjustments to the data may never fully correct for data quality issues

Results – Bootstrapping
• Less dispersion in results for error-free data
• Standard deviation of estimated ultimate losses greater for the modified data (data with errors)
• Confirms original hypothesis

Conclusions Resulting from Experiment
• Greater accuracy and less variability in actuarial estimates when:
    • Quality data used
    • Greater number of accident years used
• Data quality issues can erode or even reverse the gains of increased volumes of data:
    • If errors are significant, more data may worsen estimates due to the propagation of errors for certain projection methods
• Significant uncertainty in results when:
    • Data is incomplete
    • Data has errors

Predictive Modeling & Exploratory Data Analysis

Modeling Process
[Figure: modeling process diagram with stages Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]

Data Preprocessing
• Data sets
• Select Data: rationale for inclusion
• Clean Data: data cleaning report
• Construct Data: data attributes
• Integrate Data: merged records
• Generate records
• Format Data

Exploratory Data Analysis
• Typically the first step in analyzing data
• Makes heavy use of graphical techniques
• Also makes use of simple descriptive
statistics
• Purpose
    • Find outliers (and errors)
    • Explore structure of the data

Example Data
Private passenger auto
Some variables are:
• Age
• Gender
• Marital status
• Zip code
• Earned premium
• Number of claims
• Incurred losses
• Paid losses

Some Methods for Numeric Data
• Visual
    • Histograms
    • Box and Whisker Plots
    • Stem and Leaf Plots
• Statistical
    • Descriptive statistics
    • Data spheres

Histograms
• Can do them in Microsoft Excel

Histograms of Age Variable: Varying Window Size
[Figure: three histograms of Age (16 to 85), each drawn with a different window width]

Formula for Window Width
$h = \dfrac{3.5\,\sigma}{N^{1/3}}$
σ = standard deviation
N = sample size
h = window width

Example of Suspicious Value
[Figure: histogram of License Year; x-axis License Year (600 to 1,800), y-axis Frequency (0 to 25,000)]

Box and Whisker Plot
[Figure: box and whisker plot of normally distributed Age, labelling the median, the 25th and 75th percentiles, the +1.5*midspread whisker, and outliers]

Plot of Heavy Tailed Data
[Figure: plot of Paid Losses; axis Paid Loss (-10,000 to 110,000)]

Heavy Tailed Data – Log Scale
[Figure: plot of Paid Loss on a log scale (10^1 to 10^7)]

Box and Whisker Example
[Figure: Frequency (0 to 1,000) by Age (16 to 96)]

Descriptive Statistics
Analysis ToolPak
Statistic            Policyholder Age
Mean                 36.9
Standard Error       0.1
Median               35.0
Mode                 32.0
Standard Deviation   13.2
Sample Variance      174.4
Kurtosis             0.5
Skewness             0.7
Range                84
Minimum              16
Maximum              100
Sum                  1114357
Count                30226
Largest(2)           100
Smallest(2)          16

Descriptive Statistics
License Year has minimum and maximum values that are impossible
License Year (N 30,250; Valid N 30,250): Minimum 490, Maximum 2,049, Mean 1,990, Std.
Deviation 16.3

Data Spheres: The Mahalanobis Distance Statistic
$MD = (x - \mu)' \Sigma^{-1} (x - \mu)$
x is a vector of variables
μ is a vector of means
Σ is a variance-covariance matrix

Screening Many Variables at Once
• Plot of Longitude and Latitude of zip codes in data
• Examination of outliers indicated drivers in CA and PR even though policies were written in only one mid-Atlantic state

Records With Unusual Values Flagged
Policy ID  Mahalanobis Depth  Percentile of Mahalanobis  Age  License Year  Number of Cars  Number of Drivers  Model Year  Incurred Loss
22244      59                 100                        27   1997          3               6                  1994        4,456
6159       60                 100                        22   2001          2               6                  1993        0
22997      65                 100                        NA   NA            2               1                  1954        0
5412       61                 100                        17   2003          3               6                  1994        0
30577      72                 100                        43   1979          3               1                  1952        0
28319      8,490              100                        30   490           1               1                  1987        0
27815      55                 100                        44   1976          -1              0                  1959        0
16158      24                 100                        82   1938          1               1                  1989        61,187
4908       25                 100                        56   1997          4               4                  2003        35,697
28790      24                 100                        82   2039          1               1                  1985        27,769

Categorical Data: Data Cubes

Categorical Data
• Data Cubes
• Usually frequency tables
• Search for missing values coded as blanks
Gender    Frequency   Percent
(blank)   5,054       14.3
F         13,032      36.9
M         17,198      48.7
Total     35,284      100

Categorical Data
Table highlights inconsistent coding of marital status
Marital Status   Frequency   Percent
(blank)          5,053       14.3
1                2,043       5.8
2                9,657       27.4
4                2           0
D                4           0
M                2,971       8.4
S                15,554      44.1
Total            35,284      100

Screening for Missing Data
                 BUSINESS TYPE   Gender   Age      License Year
N Valid          35,284          35,284   30,242   30,250
N Missing        0               0        5,042    5,034
Percentiles 25                            27.00    1,986.00
Percentiles 50                            35.00    1,996.00
Percentiles 75                            45.00    2,000.00

Blanks as Missing
Valid     Frequency   Percent   Valid Percent   Cumulative Percent
(blank)   5,054       14.3      14.3            14.3
F         13,032      36.9      36.9            51.3
M         17,198      48.7      48.7            100.0
Total     35,284      100.0     100.0

Conclusions
• Anecdotal horror stories illustrate possible dramatic impact of data quality problems
• Data quality survey suggests data quality issues impose a significant cost on actuarial work
• Data Quality Working Party urges actuaries to become data quality advocates within their organizations
• Techniques from Exploratory Data
Analysis can be used to detect data glitches

Library for Getting Started
• Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003
• Francis, L.A., “Dancing with Dirty Data: Methods for Exploring and Cleaning Data”, CAS Winter Forum, March 2005, www.casact.org
• The Data Quality Working Party Paper will be posted on the GIRO web site in about nine months
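
Sketch: Window Width for a Histogram
The window-width formula from the histogram slides (h = 3.5 × standard deviation / cube root of N) can be tried out in a few lines of Python. This is a minimal sketch, not the presenter's own code: the function names and the list of ages are hypothetical, and the sample standard deviation is assumed.

```python
import statistics


def window_width(values):
    """Return h = 3.5 * s / N**(1/3) for a list of numeric values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    s = statistics.stdev(values)  # sample standard deviation (an assumption)
    return 3.5 * s / n ** (1.0 / 3.0)


def histogram_counts(values, h):
    """Count values in consecutive windows of width h, starting at min(values)."""
    if h <= 0:
        raise ValueError("window width must be positive")
    low = min(values)
    n_bins = int((max(values) - low) // h) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - low) // h)] += 1
    return counts


# Hypothetical ages, not taken from the presentation's data set.
ages = [16, 18, 21, 25, 27, 30, 32, 32, 35, 36, 40, 44, 51, 58, 67, 85]
h = window_width(ages)
counts = histogram_counts(ages, h)
print(f"h = {h:.1f}, counts = {counts}")
```

Applied to the Policyholder Age statistics reported in the deck (standard deviation 13.2, count 30,226), the formula gives a window of roughly 1.5 years, since the cube root of 30,226 is about 31.15.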
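
Sketch: Flagging Values Outside the Whiskers
The box and whisker slide marks points beyond 1.5 × midspread from the quartiles as outliers. The following is one way that rule might be coded, assuming the "inclusive" quartile method of Python's statistics module (other quartile conventions give slightly different fences). The function names and most of the license years are invented for illustration.

```python
import statistics


def whisker_fences(values, k=1.5):
    """Return (lower, upper) fences: the quartiles extended by k * midspread."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    midspread = q3 - q1  # interquartile range
    return q1 - k * midspread, q3 + k * midspread


def flag_outliers(values, k=1.5):
    """Return the values lying outside the fences, in their original order."""
    lower, upper = whisker_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical license years: 490 and 2049 echo the impossible minimum and
# maximum in the descriptive statistics; the other values are invented.
license_years = [1976, 1979, 1985, 1986, 1990, 1994, 1996,
                 1997, 2000, 2001, 2003, 490, 2049]
print(flag_outliers(license_years))
```

On this small sample the quartiles are 1985 and 2000, so the fences sit at 1962.5 and 2022.5 and only 490 and 2049 are flagged.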
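
Sketch: Mahalanobis Distance for Two Variables
The Mahalanobis statistic MD = (x − μ)' Σ⁻¹ (x − μ) from the data spheres slide can be illustrated for two variables without any matrix library, because a 2×2 matrix has a closed-form inverse. This sketch follows the formula as written on the slide (no square root) and assumes the sample variance-covariance matrix. The function name is hypothetical. The (Age, License Year) pairs come from the flagged-records table, omitting the record with NA values, so the distances will not match the depths in that table, which were computed on the full data set with more variables.

```python
import statistics


def mahalanobis_2d(points):
    """Return MD = (x - mu)' Sigma^-1 (x - mu) for each (a, b) pair in points."""
    a_vals = [p[0] for p in points]
    b_vals = [p[1] for p in points]
    mu_a, mu_b = statistics.fmean(a_vals), statistics.fmean(b_vals)
    n = len(points)

    # Sample variance-covariance matrix [[s_aa, s_ab], [s_ab, s_bb]].
    s_aa = sum((a - mu_a) ** 2 for a in a_vals) / (n - 1)
    s_bb = sum((b - mu_b) ** 2 for b in b_vals) / (n - 1)
    s_ab = sum((a - mu_a) * (b - mu_b) for a, b in points) / (n - 1)

    det = s_aa * s_bb - s_ab ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    distances = []
    for a, b in points:
        da, db = a - mu_a, b - mu_b
        # Quadratic form using the closed-form inverse of a 2x2 matrix.
        distances.append((s_bb * da * da - 2 * s_ab * da * db + s_aa * db * db) / det)
    return distances


# (Age, License Year) pairs from the flagged-records table, NA record omitted.
records = [(27, 1997), (22, 2001), (17, 2003), (43, 1979), (30, 490),
           (44, 1976), (82, 1938), (56, 1997), (82, 2039)]
md = mahalanobis_2d(records)
worst = max(range(len(records)), key=md.__getitem__)
print(records[worst], round(md[worst], 2))
```

Even in this tiny sample, the record with License Year 490 receives the largest distance, which is the behaviour the screening slide relies on.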
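
Sketch: Frequency Table with Blanks Shown as Missing
The categorical data slides search for missing values coded as blanks by tabulating every code, including the empty one. A minimal sketch of that tabulation follows. The function name and the "(blank)" label are assumptions made for illustration, and the input list is rebuilt from the Gender counts shown in the deck.

```python
from collections import Counter


def frequency_table(codes):
    """Return rows of (label, frequency, percent), labelling blank codes as '(blank)'."""
    cleaned = [
        "(blank)" if c is None or not str(c).strip() else str(c).strip()
        for c in codes
    ]
    counts = Counter(cleaned)
    total = len(cleaned)
    return [
        (label, n, round(100.0 * n / total, 1))
        for label, n in sorted(counts.items())
    ]


# Gender codes rebuilt from the counts on the frequency-table slide.
gender = [""] * 5054 + ["F"] * 13032 + ["M"] * 17198
for label, n, pct in frequency_table(gender):
    print(f"{label:8} {n:>7,} {pct:>6.1f}")
```

The same function applied to a marital status column would list codes such as 1, 2, 4, D, M and S side by side, which is how the inconsistent coding shows up.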