Dancing With Dirty Data: Methods for Exploring and Cleaning Data

advertisement
Dancing With Dirty Data:
Methods for Exploring and
Cleaning Data
2005 CAS Ratemaking Seminar
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
Louise_francis@msn.com
www.data-mines.com
Objectives


Discuss topic of data quality
Present methods for screening data for
problems



Errors
Missing data
Present methods of fixing problems
Data Quality: A Problem

Actuary doing reconciliation on February 15
It’s Not Just Us

“In just about any organization, the state of
information quality is at the same low level”

Olson, Data Quality
AAA Standards of Practice


AAA SOP: Review data for
completeness, accuracy and relevance
IDMA and CAS White paper. Evaluate
data for




Validity
accuracy
reasonableness
completeness
Example Data


Private passenger auto
Some variables are:








Age
Gender
Marital status
Zip code
Earned premium
Number of claims
Incurred losses
Paid losses
Screening Data

Many of the Methods have been in use
for a while


Pioneered in field of exploratory data
analysis
More recently – missing data methods for
dealing with some quality problems
Some Methods for Numeric Data

Visual



Histograms
Box and Whisker Plots
Statistical


Descriptive statistics
Data spheres
Histograms

Can do them in
Microsoft Excel
Histograms
Frequencies for Age Variable
Bin
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
More
Frequency
2853
3709
4372
4366
4097
3588
2707
1831
1140
615
397
271
148
83
32
12
5
Histograms of Age Variable
Varying Window Size
20
80
15
60
10
40
5
20
0
0
16
33
51
Age
68
85
16 23 30 37 44 51 57 64 71 78 85
Age
40
30
20
10
0
16 24 31 39 47 54 62 70 77 85
Age
Formula for Window Width
h
3.5
1
3
N
  standard deviation
N=sample size
h =window width
Example of Suspicious Value
25,000
Frequency
20,000
15,000
10,000
5,000
0
600
900
1200
License Year
1500
1800
Discrete-Numeric Data
40,000
Frequency
30,000
20,000
10,000
0
0
50,000
Paid Losses
100,000
Filtered Data
Filter out Unwanted Records
1,200
1,000
Frequency
800
600
400
200
0
0
20,000
40,000
60,000
Paid Losses
80,000
100,000
Box and Whisker Plot
80
Normally Distributed Age
60
Outliers
40
+1.5*midspread
20
75th Percentile
0
median
-20
25th Percentile
600
Frequency
Box and Whisker Example
1,000
800
400
200
0
96
92
90
88
86
84
82
80
78
76
74
72
70
68
66
64
62
60
58
56
54
52
50
48
46
44
42
40
38
36
34
32
30
28
26
24
22
20
18
16
Age
Plot of Heavy Tailed Data
Paid Losses
110000
90000
Paid Loss
70000
50000
30000
10000
-10000
Heavy Tailed Data – Log Scale
107
106
Paid Loss
105
104
103
102
101
Descriptive Statistics
N
License Year
30,250
Valid N
30,250
Minimum
490
Maximum
2,049
Mean
1,990
Std.
Deviation
16.3
Mahalanobis Distance
1
MD  ( x  μ) ' Σ ( x  μ)
x is a vector of variables
μ is a vector of means
Σ is a variance-covariance matrix
Data Spheres

Example: Longitude and Latitude
Sample from Highest Percentile
Mahalanobis Percentile of
License
Policy ID
Depth
Mahalanobis Age
Year
22244
6159
22997
5412
30577
28319
27815
16158
4908
28790
59
60
65
61
72
8,490
55
24
25
24
100
100
100
100
100
100
100
100
100
100
27
22
NA
17
43
30
44
82
56
82
1997
2001
NA
2003
1979
490
1976
1938
1997
2039
Number Number
of Cars of Drivers
3
2
2
3
3
1
-1
1
4
1
6
6
1
6
1
1
0
1
4
1
Model
Year
1994
1993
1954
1994
1952
1987
1959
1989
2003
1985
Incurred
Loss
4,456
0
0
0
0
0
0
61,187
35,697
27,769
Categorical Data: Data Cubes
Example – Marital Status
1
2
4
D
M
S
Total
Frequency Percent
5,053
14.3
2,043
5.8
9,657
27.4
2
0
4
0
2,971
8.4
15,554
44.1
35,284
100
Screening for Missing Data
BUSINESS
TYPE
Age
License Year
35,284
35,284
30,242
30,250
0
0
5,042
5,034
25
27.00
1,986.00
Percentiles 50
35.00
1,996.00
75
45.00
2,000.00
N
Valid
Gender
Missing
Blanks as Missing
Frequency
Valid
Percent
Valid Percent
Cumulative
Percent
5,054
14.3
14.3
14.3
F
13,032
36.9
36.9
51.3
M
17,198
48.7
48.7
100.0
Total
35,284
100.0
100.0
Types of Missing Values



Missing completely at random
Missing at random
Informative missing
Methods for Missing Values




Drop record if any variable used in
model is missing
Drop variable
Data Imputation
Other


CART, MARS use surrogate variables
Expectation Maximization
Imputation



A method to “fill in” missing value
Use other variables (which have values)
to predict value on missing variable
Involves building a model for variable
with missing value

Y = f(x1,x2,…xn)
Example: Age Variable


About 14% of records missing values
Imputation will be illustrated with
simple regression model

Age = a+b1X1+b2X2…bnXn
Model for Age
Source Corrected Model
Intercept
ClassCode
CoverageType
ModelYear
No of Vehicles
No of drivers
Error
Total
Corrected Total
Tests of Between-Subjects Effects
Dependent Variable: Age
Type III Sum of Squares
df
Mean Square
F
Sig.
3,218,216
24
134,092 1,971.2 0.000
9,255
1
9,255
136.0 0.000
3,198,903
18
177,717 2,612.4 0.000
876
3
292
4.3 0.005
7,245
1
7,245
106.5 0.000
2,365
1
2,365
34.8 0.000
3,261
1
3,261
47.9 0.000
2,055,243
30,212
68
46,377,824
30,237
5,273,459
30,236
Censorship Problem



Property and casualty insurance data is
typically censored
We do not know final settlement value
for data
Adjustments must be made to avoid
erroneous models


Use ultimates
Mix adjust
Example From Ignoring Censorship
Table 21
Age (Years) Closed Claim Severity Percent of Claims
1
500
25%
2
1,000
50%
3
5,000
15%
4
10,000
10%
Distribution of claims and average settlement amounts by age.
Table 22
New Company
Accident Year Age Severity Percent
2003 1
500
100%
Average Severity
500
Industry
Accident Year Age Severity Percent
2003 1
500
25%
2002 2
1,000
50%
2001 3
5,000
15%
2000 4 10,000
10%
Average Severity
2,375
Metadata


Data about data
Detailed description of the variables in
the file, their meaning and permissible
values
Marital Status Value
1
2
4
D
M
S
Blank
Description
Married, data from source 1
Single, data from source 1
Divorced, data from source 1
Divorced, data from source 2
Married, data from source 2
Single, data from source 2
Marital status is missing
Conclusions



Data quality is significant problem in
insurance and in other industries
Statistical methods can be used to
detect and remediate data quality
problems
How do we get better data?
Conclusions

“In the end, the best defense is
relentless monitoring of data and
metadata”

Dasu and Johnson, Exploratory Data
Mining and Data Cleaning
Download