Variable Reduction for Predictive Modeling with Clustering Bob Sanche March 14, 2006

Variable Reduction for Predictive
Modeling with Clustering
Casualty Actuarial Society
Seminar on Ratemaking
Bob Sanche
March 14, 2006
© 2006 Towers Perrin
Contents
• Data Storage and Amount of Predictive Variables
• Predictive Modeling and Model Generalization
• Dimension Reduction
• Goal of Variable Clustering
• What Is Clustering?
• Variable Clustering
• When Does Variable Clustering Occur During the Predictive Modeling Process?
• Example
Data Storage and Predictive Variables
• Data storage economics
  “In 1956, IBM sold its first magnetic disk system, RAMAC (Random Access Method of Accounting and Control). It used 50 24-inch metal disks, with 100 tracks per side. It could store 5 megabytes of data and cost $10,000 per megabyte. (As of 2005, disk storage costs less than $1 per gigabyte).”
  http://en.wikipedia.org/wiki/History_of_computing_hardware
• 1 gigabyte ≈ 130 numeric characteristics for 1 million policies, stored for about $1
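The gigabyte figure can be checked with a few lines of arithmetic. The sketch below assumes each numeric characteristic is stored as an 8-byte double-precision value and uses a decimal gigabyte; neither assumption is stated on the slide.

```python
# Assumption (not from the slide): one numeric characteristic = 8 bytes (double precision).
BYTES_PER_NUMERIC = 8
POLICIES = 1_000_000
CHARACTERISTICS = 130
GIGABYTE = 10**9  # decimal gigabyte

total_bytes = BYTES_PER_NUMERIC * POLICIES * CHARACTERISTICS
size_gb = total_bytes / GIGABYTE
print(f"{size_gb:.2f} GB")  # prints "1.04 GB"
```

Under these assumptions, 130 characteristics for one million policies occupy roughly one gigabyte, which matches the slide's order of magnitude.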
Data Storage and Predictive Variables
• New data sources
  - Data warehousing
  - External sources (demographics, meteorological, etc.)
  - Policyholder, household or company information
  - Agency
  - Other
• Data storage economics and availability of data
  - Increase the number of predictive variables
• Data mining paradigm
  - Additional inputs add lift to the model
Predictive Modeling and Model Generalization
• A predictive model is created from a number of predictors that are likely to influence future results
  - $Y = \alpha_1 X_1 + \dots + \alpha_n X_n + \beta$
  - n is the number of predictors in the universe of all available predictors
• Goal of predictive modeling
  - Estimate the coefficients $\alpha_1, \dots, \alpha_n$ and $\beta$
  - Be predictive of future results
    - Model generalizes well over time
• Model complexity → Overfitting
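The link between model complexity and overfitting can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the author's analysis: the response depends on one predictor only, 25 irrelevant predictors are added, and the helper names (`solve`, `fit`, `mse`, `simulate`) are hypothetical. Ordinary least squares is solved through the normal equations in plain Python.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit(rows, y):
    """Least-squares coefficients [beta, alpha_1, ..., alpha_k] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def mse(coef, rows, y):
    """Mean squared error of the fitted linear model on (rows, y)."""
    preds = [coef[0] + sum(a * v for a, v in zip(coef[1:], r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def simulate(n_obs, n_noise, rng):
    """Simulated data: y depends on the first column only; the rest are irrelevant."""
    rows = [[rng.gauss(0, 1) for _ in range(1 + n_noise)] for _ in range(n_obs)]
    y = [2.0 * r[0] + rng.gauss(0, 1) for r in rows]
    return rows, y

rng = random.Random(0)
train_rows, train_y = simulate(40, 25, rng)
test_rows, test_y = simulate(1000, 25, rng)

results = {}
for m in (1, 26):  # m = number of predictors used in the model
    coef = fit([r[:m] for r in train_rows], train_y)
    results[m] = (mse(coef, [r[:m] for r in train_rows], train_y),
                  mse(coef, [r[:m] for r in test_rows], test_y))
    print(m, "predictors: train MSE", round(results[m][0], 3),
          "holdout MSE", round(results[m][1], 3))
```

The larger model always fits the training sample at least as well, but with 40 observations and 26 predictors its holdout error is expected to be clearly worse, which is the generalization problem the slide refers to.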
Dimension Reduction
• Need to reduce model complexity
• Dimension reduction
  - Clustering (K-Means)
    - Rows
  - Variable clustering
    - Columns
    - Alternatives to variable clustering
      - PCA and factor analysis
      - Difficult to interpret and deploy
Goal of Variable Clustering
• Reduce the number of variables
  - Irrelevant variables are more difficult to identify than redundant variables
  - $Y = \alpha_1 X_1 + \dots + \alpha_m X_m + \beta$
  - where $m < n$
• Why do we want to reduce the number of variables?
  - Improve efficiency of the predictive modeling process
    - Time to develop the model
    - Interpretation of the results
    - Reduce variance of the model estimates
• Demographics example
  - Average household size, median household size, proportion of families, median vehicles per household
    - could be replaced by only one variable
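The point about the variance of the model estimates can be made concrete with a standard ordinary least squares identity. It is a textbook result, not taken from the slides, and the symbols $N$ (number of observations), $s_j^2$ (sample variance of $X_j$), $\sigma^2$ (error variance) and $R_j^2$ (the R-squared from regressing $X_j$ on the other predictors) are introduced here for illustration only.

```latex
\operatorname{Var}(\hat{\alpha}_j)
  = \frac{\sigma^2}{(N-1)\, s_j^2} \cdot \frac{1}{1 - R_j^2}
% Worked case: two predictors with correlation 0.95 give R_j^2 = 0.9025,
% so the inflation factor is 1 / (1 - 0.9025) \approx 10.3.
```

When redundant variables such as the household measures above are kept together, $R_j^2$ is close to 1 and the second factor becomes large; replacing them with one representative removes that inflation.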
What Is Clustering?
• “Cluster Analysis is a set of methods for constructing a sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual” (B.S. Everitt, The Cambridge Dictionary of Statistics, 1998)
• Divide a set of data (or variables) into groups with similar characteristics
• Unsupervised learning technique
• Useful only when there is redundancy in the data
What Is Clustering?
• Similarity measured by distance or correlation metrics
• Types of clustering
  - Hierarchical clustering
    - Agglomerative
    - Divisive
  - Partitive (optimization) clustering
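As an illustration of agglomerative hierarchical clustering applied to variables with a correlation-based similarity, here is a minimal sketch. It assumes a distance of 1 - |r| and average linkage, both chosen for the example rather than taken from the slides, and it runs on simulated data: the column names echo the example later in the deck, but the values and the function `agglomerate` are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def agglomerate(columns, n_clusters):
    """Average-linkage agglomerative clustering of variables, distance = 1 - |r|."""
    names = list(columns)
    dist = {(a, b): 1 - abs(pearson(columns[a], columns[b]))
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]  # start with one cluster per variable

    def linkage(c1, c2):
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # merge the two closest clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Simulated data: two hypothetical latent drivers, each observed twice with noise.
rng = random.Random(1)
n = 500
winter = [rng.gauss(0, 1) for _ in range(n)]
urban = [rng.gauss(0, 1) for _ in range(n)]
columns = {
    "snow_days": [w + 0.3 * rng.gauss(0, 1) for w in winter],
    "annual_snow": [w + 0.3 * rng.gauss(0, 1) for w in winter],
    "population_density": [u + 0.3 * rng.gauss(0, 1) for u in urban],
    "car_density": [u + 0.3 * rng.gauss(0, 1) for u in urban],
}
clusters = agglomerate(columns, 2)
print(clusters)
```

A divisive method would work in the opposite direction, starting from one cluster containing every variable and splitting it.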
Variable Clustering
• Variable clustering divides a set of numeric* variables into clusters. A cluster representing a large set of variables can be replaced by a single member (cluster representative).
  * Hamming distance for categorical variables
• Selection of the cluster representative
  - $1 - R^2 \text{ ratio} = (1 - R^2_{\text{own}}) / (1 - R^2_{\text{nearest}})$
  - Intuitively, we want the cluster representative to be as closely correlated to its own cluster ($R^2_{\text{own}} \to 1$) and as uncorrelated to the nearest cluster ($R^2_{\text{nearest}} \to 0$). Therefore, the optimal representative of a cluster is a variable where the $1 - R^2$ ratio tends to zero.
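The ratio can be computed directly once each cluster is summarized by a single component. The sketch below is one possible reading, not the author's implementation: it assumes the cluster component is the average of the standardized member variables (a centroid component, which presumes the members are positively correlated), since the slides do not say how the component is formed. The data are simulated and all names (`a_tight`, `b_one`, `one_minus_r2_ratios`, and so on) are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def standardize(x):
    """Rescale a sequence to mean 0 and standard deviation 1."""
    n = len(x)
    mean = sum(x) / n
    sd = sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]

def centroid(columns, members):
    """Assumed cluster component: row-wise mean of the standardized members."""
    z = [standardize(columns[name]) for name in members]
    return [sum(vals) / len(vals) for vals in zip(*z)]

def one_minus_r2_ratios(columns, clusters):
    """Return {variable: (R2 own, R2 nearest, 1-R2 ratio)}; needs at least two clusters."""
    comps = [centroid(columns, members) for members in clusters]
    table = {}
    for own, members in enumerate(clusters):
        for name in members:
            r2 = [pearson(columns[name], comp) ** 2 for comp in comps]
            r2_own = r2[own]
            r2_nearest = max(v for k, v in enumerate(r2) if k != own)
            table[name] = (r2_own, r2_nearest, (1 - r2_own) / (1 - r2_nearest))
    return table

# Simulated data: cluster A has three noisy copies of one factor, cluster B two of another.
rng = random.Random(2)
n = 1000
f1 = [rng.gauss(0, 1) for _ in range(n)]
f2 = [rng.gauss(0, 1) for _ in range(n)]
columns = {
    "a_tight": [v + 0.2 * rng.gauss(0, 1) for v in f1],
    "a_mid": [v + 0.5 * rng.gauss(0, 1) for v in f1],
    "a_loose": [v + 1.0 * rng.gauss(0, 1) for v in f1],
    "b_one": [v + 0.3 * rng.gauss(0, 1) for v in f2],
    "b_two": [v + 0.3 * rng.gauss(0, 1) for v in f2],
}
clusters = [["a_tight", "a_mid", "a_loose"], ["b_one", "b_two"]]
table = one_minus_r2_ratios(columns, clusters)
for name, (r2_own, r2_nearest, ratio) in table.items():
    print(f"{name:8s} own={r2_own:.3f} nearest={r2_nearest:.3f} ratio={ratio:.3f}")

# The representative of a cluster is the member with the smallest ratio.
representative = min(clusters[0], key=lambda name: table[name][2])
print("representative of cluster A:", representative)
```

In this simulation the least noisy member of cluster A ends up with the smallest ratio, which is the behavior the selection rule is meant to capture.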
When Does Variable Clustering Occur During the Predictive Modeling Process?
• SEMMA process for data mining
  - Sample
  - Explore
  - Modify
  - Model
  - Assess
Example
3 CLUSTERS

                                     R-squared with
Cluster    Variable              Own Cluster  Next Closest  1-R² Ratio
Cluster 1  Rain Days                  0.5995        0.0426      0.4183
           Snow Days                  0.8976        0.0317      0.1095
           Annual Snow                0.8940        0.0314      0.1095
Cluster 2  Population Density         0.9804        0.0228      0.0201
           Car Density                0.9804        0.0113      0.0199
Cluster 3  Population Growth          0.6459        0.0911      0.3896
           Legal Expenditures         0.6459        0.0013      0.3546
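The ratio column can be recomputed from the two R-squared columns, and the smallest ratio in each cluster then points to a candidate representative. The sketch below assumes the seven variables are listed in cluster order (three in Cluster 1, then two each in Clusters 2 and 3); this grouping is inferred from the matching own-cluster R-squared values rather than stated in the source. The candidates it prints are not necessarily the ones the author selected.

```python
# (cluster, variable, R2 with own cluster, R2 with next closest, printed 1-R2 ratio)
rows = [
    ("Cluster 1", "Rain Days", 0.5995, 0.0426, 0.4183),
    ("Cluster 1", "Snow Days", 0.8976, 0.0317, 0.1095),
    ("Cluster 1", "Annual Snow", 0.8940, 0.0314, 0.1095),
    ("Cluster 2", "Population Density", 0.9804, 0.0228, 0.0201),
    ("Cluster 2", "Car Density", 0.9804, 0.0113, 0.0199),
    ("Cluster 3", "Population Growth", 0.6459, 0.0911, 0.3896),
    ("Cluster 3", "Legal Expenditures", 0.6459, 0.0013, 0.3546),
]

recomputed = {}
best = {}  # cluster -> (variable with the smallest recomputed ratio, ratio)
for cluster, name, r2_own, r2_next, printed in rows:
    ratio = (1 - r2_own) / (1 - r2_next)
    recomputed[name] = ratio
    print(f"{name:20s} printed={printed:.4f} recomputed={ratio:.4f}")
    if cluster not in best or ratio < best[cluster][1]:
        best[cluster] = (name, ratio)

for cluster, (name, ratio) in best.items():
    print(f"{cluster}: candidate representative = {name} ({ratio:.4f})")
```

Six of the seven rows reproduce the printed ratio to within rounding. The exception is Snow Days, where the recomputed value is about 0.1058 against a printed 0.1095, so one of the figures in that row may have been altered in transcription.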
Example
[Figure: diagram of the variable clustering. One axis lists the name of each variable or cluster (Population Density, Legal Expenditures, Snow Days, Annual Snow Accumulation, Rain Days, Population Growth, Car Density); the other shows the proportion of variance explained, from 1.00 down to 0.70.]
Conclusion
• Need for dimension reduction for model generalization
• Variable clustering reduces the number of variables available for predictive modeling (GLM, etc.)
• The predictive modeling process using variable clustering
  - Avoids overfitting
  - Increases interpretability
  - Reduces time for modeling