Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. Louise_francis@msn.com www.data-mines.com Objectives Discuss topic of data quality Present methods for screening data for problems Errors Missing data Present methods of fixing problems Data Quality: A Problem Actuary doing reconciliation on February 15 It’s Not Just Us “In just about any organization, the state of information quality is at the same low level” Olson, Data Quality AAA Standards of Practice AAA SOP: Review data for completeness, accuracy and relevance IDMA and CAS White paper. Evaluate data for Validity accuracy reasonableness completeness Example Data Private passenger auto Some variables are: Age Gender Marital status Zip code Earned premium Number of claims Incurred losses Paid losses Screening Data Many of the Methods have been in use for a while Pioneered in field of exploratory data analysis More recently – missing data methods for dealing with some quality problems Some Methods for Numeric Data Visual Histograms Box and Whisker Plots Statistical Descriptive statistics Data spheres Histograms Can do them in Microsoft Excel Histograms Frequencies for Age Variable Bin 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 More Frequency 2853 3709 4372 4366 4097 3588 2707 1831 1140 615 397 271 148 83 32 12 5 Histograms of Age Variable Varying Window Size 20 80 15 60 10 40 5 20 0 0 16 33 51 Age 68 85 16 23 30 37 44 51 57 64 71 78 85 Age 40 30 20 10 0 16 24 31 39 47 54 62 70 77 85 Age Formula for Window Width h 3.5 1 3 N standard deviation N=sample size h =window width Example of Suspicious Value 25,000 Frequency 20,000 15,000 10,000 5,000 0 600 900 1200 License Year 1500 1800 Discrete-Numeric Data 40,000 Frequency 30,000 20,000 10,000 0 0 50,000 Paid Losses 100,000 Filtered Data Filter out Unwanted Records 1,200 1,000 Frequency 800 600 400 200 0 0 20,000 40,000 60,000 Paid Losses 80,000 100,000 Box and Whisker Plot 80 Normally Distributed Age 60 Outliers 40 +1.5*midspread 20 75th Percentile 0 median -20 25th Percentile 600 Frequency Box and Whisker Example 1,000 800 400 200 0 96 92 90 88 86 84 82 80 78 76 74 72 70 68 66 64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 20 18 16 Age Plot of Heavy Tailed Data Paid Losses 110000 90000 Paid Loss 70000 50000 30000 10000 -10000 Heavy Tailed Data – Log Scale 107 106 Paid Loss 105 104 103 102 101 Descriptive Statistics N License Year 30,250 Valid N 30,250 Minimum 490 Maximum 2,049 Mean 1,990 Std. Deviation 16.3 Mahalanobis Distance 1 MD ( x μ) ' Σ ( x μ) x is a vector of variables μ is a vector of means Σ is a variance-covariance matrix Data Spheres Example: Longitude and Latitude Sample from Highest Percentile Mahalanobis Percentile of License Policy ID Depth Mahalanobis Age Year 22244 6159 22997 5412 30577 28319 27815 16158 4908 28790 59 60 65 61 72 8,490 55 24 25 24 100 100 100 100 100 100 100 100 100 100 27 22 NA 17 43 30 44 82 56 82 1997 2001 NA 2003 1979 490 1976 1938 1997 2039 Number Number of Cars of Drivers 3 2 2 3 3 1 -1 1 4 1 6 6 1 6 1 1 0 1 4 1 Model Year 1994 1993 1954 1994 1952 1987 1959 1989 2003 1985 Incurred Loss 4,456 0 0 0 0 0 0 61,187 35,697 27,769 Categorical Data: Data Cubes Example – Marital Status 1 2 4 D M S Total Frequency Percent 5,053 14.3 2,043 5.8 9,657 27.4 2 0 4 0 2,971 8.4 15,554 44.1 35,284 100 Screening for Missing Data BUSINESS TYPE Age License Year 35,284 35,284 30,242 30,250 0 0 5,042 5,034 25 27.00 1,986.00 Percentiles 50 35.00 1,996.00 75 45.00 2,000.00 N Valid Gender Missing Blanks as Missing Frequency Valid Percent Valid Percent Cumulative Percent 5,054 14.3 14.3 14.3 F 13,032 36.9 36.9 51.3 M 17,198 48.7 48.7 100.0 Total 35,284 100.0 100.0 Types of Missing Values Missing completely at random Missing at random Informative missing Methods for Missing Values Drop record if any variable used in model is missing Drop variable Data Imputation Other CART, MARS use surrogate variables Expectation Maximization Imputation A method to “fill in” missing value Use other variables (which have values) to predict value on missing variable Involves building a model for variable with missing value Y = f(x1,x2,…xn) Example: Age Variable About 14% of records missing values Imputation will be illustrated with simple regression model Age = a+b1X1+b2X2…bnXn Model for Age Source Corrected Model Intercept ClassCode CoverageType ModelYear No of Vehicles No of drivers Error Total Corrected Total Tests of Between-Subjects Effects Dependent Variable: Age Type III Sum of Squares df Mean Square F Sig. 3,218,216 24 134,092 1,971.2 0.000 9,255 1 9,255 136.0 0.000 3,198,903 18 177,717 2,612.4 0.000 876 3 292 4.3 0.005 7,245 1 7,245 106.5 0.000 2,365 1 2,365 34.8 0.000 3,261 1 3,261 47.9 0.000 2,055,243 30,212 68 46,377,824 30,237 5,273,459 30,236 Censorship Problem Property and casualty insurance data is typically censored We do not know final settlement value for data Adjustments must be made to avoid erroneous models Use ultimates Mix adjust Example From Ignoring Censorship Table 21 Age (Years) Closed Claim Severity Percent of Claims 1 500 25% 2 1,000 50% 3 5,000 15% 4 10,000 10% Distribution of claims and average settlement amounts by age. Table 22 New Company Accident Year Age Severity Percent 2003 1 500 100% Average Severity 500 Industry Accident Year Age Severity Percent 2003 1 500 25% 2002 2 1,000 50% 2001 3 5,000 15% 2000 4 10,000 10% Average Severity 2,375 Metadata Data about data Detailed description of the variables in the file, their meaning and permissible values Marital Status Value 1 2 4 D M S Blank Description Married, data from source 1 Single, data from source 1 Divorced, data from source 1 Divorced, data from source 2 Married, data from source 2 Single, data from source 2 Marital status is missing Conclusions Data quality is significant problem in insurance and in other industries Statistical methods can be used to detect and remediate data quality problems How do we get better data? Conclusions “In the end, the best defense is relentless monitoring of data and metadata” Dasu and Johnson, Exploratory Data Mining and Data Cleaning