Taming Text: An Introduction to
Text Mining
CAS 2006 Ratemaking Seminar
Prepared by
Louise Francis
Francis Analytics and Actuarial Data Mining, Inc.
April 1, 2006
Louise_francis@msn.com
www.data-mines.com
Objectives
• Present a new data mining technology
• Show how the technology uses a combination of
  • String processing functions
  • Common multivariate procedures available in most statistical software
• Present a simple example of text mining
• Discuss practical issues for implementing the methods
Actuarial Rocket Science
 Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection, and other applications
 The methods are typically applied to large, complex databases
 One of the newest of these is text mining
Major Kinds of Modeling
 Supervised learning
   Most common situation
   A dependent variable
     Frequency
     Loss ratio
     Fraud/no fraud
   Some methods
     Regression
     CART
     Some neural networks
 Unsupervised learning
   No dependent variable
   Group like records together
     A group of claims with similar characteristics might be more likely to be fraudulent
     Ex: Territory assignment, text mining
   Some methods
     Association rules
     K-means clustering
     Kohonen neural networks
Text Mining: Uses Growing in Many Areas
 ECHELON Program
Lots of Information, but no Data
Example: Claim Description Field
INJURY DESCRIPTION
BROKEN ANKLE AND SPRAINED WRIST
FOOT CONTUSION
UNKNOWN
MOUTH AND KNEE
HEAD, ARM LACERATIONS
FOOT PUNCTURE
LOWER BACK AND LEGS
BACK STRAIN
KNEE
Objective
 Create a new variable from free-form text
 Use words in the injury description to create an injury code
 The new injury code can be used in a predictive model or in other analyses
A Two-Step Process
 Use string manipulation functions to parse the text
   Search for blanks, commas, periods, and other word separators
   Use the separators to extract words
   Eliminate stopwords
 Use multivariate techniques to cluster like terms together into the same injury code
   Cluster analysis
   Factor and principal components analysis
Parsing a Claim Description Field with Microsoft Excel String Functions

(1)  Full Description:        BROKEN ANKLE AND SPRAINED WRIST
(2)  Total Length:            31
(3)  Location of Next Blank:  7
(4)  First Word:              BROKEN
(5)  Remainder Length 1:      24
(6)  Remainder 1:             ANKLE AND SPRAINED WRIST
(7)  2nd Blank:               6
(8)  2nd Word:                ANKLE
(9)  Remainder Length 2:      18
(10) Remainder 2:             AND SPRAINED WRIST
(11) 3rd Blank:               4
(12) 3rd Word:                AND
(13) Remainder Length 3:      14
(14) Remainder 3:             SPRAINED WRIST
(15) 4th Blank:               9
(16) 4th Word:                SPRAINED
(17) Remainder Length 4:      5
(18) Remainder 4:             WRIST
(19) 5th Blank:               0
(20) 5th Word:                WRIST
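The same stepwise logic can be sketched in Perl; this illustrative snippet mirrors the Excel FIND/LEFT/MID steps above with index and substr, peeling off one word at a time:

use strict;
use warnings;

my $desc = "BROKEN ANKLE AND SPRAINED WRIST";
my $remainder = $desc;
while (length($remainder) > 0) {
    my $blank = index($remainder, " ");          # like Excel FIND(" ", ...)
    my $word = $blank >= 0
             ? substr($remainder, 0, $blank)     # like LEFT(...): take the first word
             : $remainder;                       # no blank left: this is the last word
    print "$word\n";
    $remainder = $blank >= 0
               ? substr($remainder, $blank + 1)  # like MID(...): drop word and blank
               : "";
}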
Extraction Creates Binary Indicator Variables

INJURY DESCRIPTION                BROKEN  ANKLE  AND  SPRAINED  WRIST  FOOT  CONTUSION  UNKNOWN  NECK  BACK  STRAIN
BROKEN ANKLE AND SPRAINED WRIST     1       1     1      1        1     0       0          0       0     0      0
FOOT CONTUSION                      0       0     0      0        0     1       1          0       0     0      0
UNKNOWN                             0       0     0      0        0     0       0          1       0     0      0
NECK AND BACK STRAIN                0       0     1      0        0     0       0          0       1     1      1
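A minimal Perl sketch of this step, using the toy descriptions and term list from the table above (not the paper's actual data):

use strict;
use warnings;

my @terms = qw(BROKEN ANKLE AND SPRAINED WRIST FOOT CONTUSION UNKNOWN NECK BACK STRAIN);
my @descriptions = (
    "BROKEN ANKLE AND SPRAINED WRIST",
    "FOOT CONTUSION",
    "UNKNOWN",
    "NECK AND BACK STRAIN",
);
for my $desc (@descriptions) {
    my %present = map { $_ => 1 } split(/ /, $desc);   # words appearing in this claim
    my @row = map { $present{$_} ? 1 : 0 } @terms;     # one 0/1 indicator per term
    print join(",", @row), "\n";
}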
Eliminate Stopwords
 Common words with no meaningful content
 Stopwords: A, And, Able, About, Above, Across, Aforementioned, After, Again
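Stopword removal is a simple filter; a minimal Perl sketch, using only the stopword fragment listed above:

use strict;
use warnings;

my %stopwords = map { uc($_) => 1 } qw(a and able about above across aforementioned after again);
my @words = qw(BROKEN ANKLE AND SPRAINED WRIST);
my @kept  = grep { !$stopwords{$_} } @words;   # keep only non-stopwords
print "@kept\n";                               # prints: BROKEN ANKLE SPRAINED WRIST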
Stemming: Identify Synonyms and Words with Common Stem

Parsed words: HEAD, INJURY, LACERATION, NONE, KNEE, BRUISED, UNKNOWN, TWISTED, L, LOWER, LEG, BROKEN, ARM, FRACTURE, R, FINGER, FOOT, INJURIES, HAND, LIP, ANKLE, RIGHT, HIP, KNEES, SHOULDER, FACE, LEFT, FX, CUT, SIDE, WRIST, PAIN, NECK, INJURED
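A minimal Perl sketch of one way to collapse such terms; the synonym map below is an illustrative assumption, not the paper's actual dictionary:

use strict;
use warnings;

# Map synonyms and inflected forms to one canonical term (assumed rules)
my %synonym = (
    BROKEN   => 'FRACTURE',
    FX       => 'FRACTURE',
    INJURIES => 'INJURY',
    INJURED  => 'INJURY',
    KNEES    => 'KNEE',
);
for my $word (qw(BROKEN KNEES INJURIES FX WRIST)) {
    my $stem = exists $synonym{$word} ? $synonym{$word} : $word;
    print "$word -> $stem\n";
}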
Dimension Reduction
The Two Major Categories of Dimension Reduction
 Variable reduction
   Factor Analysis
   Principal Components Analysis
 Record reduction
   Clustering
 Other methods tend to be developments on these
Correlated Dimensions
[Scatterplot of Ultimate Loss (000s) versus Ultimate ALAE (000s), showing the two dimensions are correlated]
Clustering
 Common methods: k-means and hierarchical clustering
 No dependent variable: records are grouped into classes with similar values on the variables
 Start with a measure of similarity or dissimilarity
 Maximize dissimilarity between members of different clusters
Dissimilarity (Distance) Measures – Continuous Variables

Euclidean distance:

$$d_{ij} = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2}$$

Manhattan distance:

$$d_{ij} = \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|$$

where i, j index records and k indexes the variables.
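A minimal Perl sketch of both distance measures (the toy records are illustrative, not the paper's data):

use strict;
use warnings;

sub euclidean {
    my ($x, $y) = @_;
    my $sum = 0;
    $sum += ($x->[$_] - $y->[$_])**2 for 0 .. $#$x;   # sum of squared differences
    return sqrt($sum);
}
sub manhattan {
    my ($x, $y) = @_;
    my $sum = 0;
    $sum += abs($x->[$_] - $y->[$_]) for 0 .. $#$x;   # sum of absolute differences
    return $sum;
}
print euclidean([0, 0], [3, 4]), "\n";   # 5
print manhattan([0, 0], [3, 4]), "\n";   # 7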
Binary Variables

                      Record 2
                      1    0
Record 1        1     a    b
                0     c    d
Binary Variables
 Simple matching:

$$d = \frac{b+c}{a+b+c+d}$$

 Rogers and Tanimoto:

$$d = \frac{2(b+c)}{(a+d)+2(b+c)}$$
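A minimal Perl sketch that tabulates a, b, c, and d for two records of 0/1 indicators and evaluates both measures (the example vectors are illustrative):

use strict;
use warnings;

sub binary_counts {
    my ($x, $y) = @_;
    my ($a, $b, $c, $d) = (0, 0, 0, 0);
    for my $k (0 .. $#$x) {
        if    ($x->[$k] == 1 && $y->[$k] == 1) { $a++ }   # both 1
        elsif ($x->[$k] == 1 && $y->[$k] == 0) { $b++ }   # mismatch
        elsif ($x->[$k] == 0 && $y->[$k] == 1) { $c++ }   # mismatch
        else                                   { $d++ }   # both 0
    }
    return ($a, $b, $c, $d);
}
my ($a, $b, $c, $d) = binary_counts([1,1,0,0], [1,0,1,0]);
print "simple matching: ", ($b + $c) / ($a + $b + $c + $d), "\n";
print "Rogers-Tanimoto: ", 2*($b + $c) / (($a + $d) + 2*($b + $c)), "\n";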
K-Means Clustering
 Determine ahead of time how many clusters or groups you want
 Use the dissimilarity measure to assign all records to one of the clusters

Cluster Number   back   contusion   head   knee   strain   unknown   laceration
1                0.00   0.15        0.12   0.13   0.05     0.13      0.17
2                1.00   0.04        0.11   0.05   0.40     0.00      0.00
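Under the hood, k-means alternates an assignment step and a centroid-update step. A minimal Perl sketch (the two-dimensional toy data, k = 2, and starting centroids are illustrative assumptions, not the paper's injury data):

use strict;
use warnings;

sub dist2 {    # squared Euclidean distance between two records
    my ($x, $y) = @_;
    my $s = 0;
    $s += ($x->[$_] - $y->[$_])**2 for 0 .. $#$x;
    return $s;
}

my @records   = ([0,0], [0,1], [5,5], [6,5]);
my @centroids = ([0,0], [5,5]);            # k = 2 starting points
my @assign    = (-1) x @records;

for my $iter (1 .. 20) {
    my $changed = 0;
    # Assignment step: each record goes to its nearest centroid
    for my $i (0 .. $#records) {
        my ($best, $bestd) = (0, dist2($records[$i], $centroids[0]));
        for my $c (1 .. $#centroids) {
            my $d = dist2($records[$i], $centroids[$c]);
            ($best, $bestd) = ($c, $d) if $d < $bestd;
        }
        $changed = 1 if $assign[$i] != $best;
        $assign[$i] = $best;
    }
    last unless $changed;    # stop when assignments are stable
    # Update step: each centroid becomes the mean of its records
    for my $c (0 .. $#centroids) {
        my @members = grep { $assign[$_] == $c } 0 .. $#records;
        next unless @members;
        for my $k (0 .. $#{$centroids[$c]}) {
            my $m = 0;
            $m += $records[$_][$k] for @members;
            $centroids[$c][$k] = $m / @members;
        }
    }
}
print "record $_ -> cluster $assign[$_]\n" for 0 .. $#records;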
Hierarchical Clustering
 A stepwise procedure
 At the beginning, each record is its own cluster
 Combine the most similar records into a single cluster
 Repeat the process until there is only one cluster containing every record
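A minimal Perl sketch of the agglomerative procedure on toy data; single linkage is used here as an assumption, since the paper does not specify a linkage rule:

use strict;
use warnings;

sub dist {    # Euclidean distance between two points
    my ($x, $y) = @_;
    my $s = 0;
    $s += ($x->[$_] - $y->[$_])**2 for 0 .. $#$x;
    return sqrt($s);
}

my @points   = ([0,0], [0,1], [5,5], [6,5]);
my @clusters = map { [$_] } 0 .. $#points;   # each record starts as its own cluster

while (@clusters > 1) {
    # Find the pair of clusters with the smallest single-linkage distance
    my ($bi, $bj, $bd);
    for my $i (0 .. $#clusters - 1) {
        for my $j ($i + 1 .. $#clusters) {
            for my $p (@{$clusters[$i]}) {
                for my $q (@{$clusters[$j]}) {
                    my $d = dist($points[$p], $points[$q]);
                    ($bi, $bj, $bd) = ($i, $j, $d) if !defined($bd) || $d < $bd;
                }
            }
        }
    }
    print "merge [@{$clusters[$bi]}] + [@{$clusters[$bj]}] at distance $bd\n";
    push @{$clusters[$bi]}, @{$clusters[$bj]};   # combine the two clusters
    splice(@clusters, $bj, 1);
}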
Hierarchical Clustering Example
[Dendrogram for 10 terms, plotting the rescaled distance (0–25) at which clusters combine: arm, foot, and leg; laceration and contusion; head and knee; unknown and back; and strain]
How Many Clusters?
 Use statistics on the strength of the relationship to variables of interest
A Statistical Test for the Number of Clusters
 Schwarz Bayesian Information Criterion

$$X \sim N(\mu, \Sigma) \tag{2.8}$$

where X is a vector of random variables, μ is the centroid (mean) of the data, and Σ is the variance-covariance matrix.

$$\mathrm{BIC} = \log L(X, M) - \tfrac{1}{2}\,\lambda\, p \log(N) \tag{2.9}$$

where log L(X, M) is the log-likelihood function for a model, p is the number of parameters, N is the number of records, and λ is a penalty parameter, often equal to 1.
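As a sketch, the BIC of (2.9) is a one-line computation; the log-likelihood, parameter count, and record count below are placeholders, not values from the paper:

use strict;
use warnings;

sub bic {
    my ($logL, $p, $N, $lambda) = @_;
    $lambda = 1 unless defined $lambda;          # penalty parameter, often equal to 1
    return $logL - 0.5 * $lambda * $p * log($N); # formula (2.9)
}
print bic(-1200, 14, 2000), "\n";                # hypothetical model with 14 parameters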
Final Cluster Selection

Variable     Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6  Cluster 7  Weighted Average
Back         0.000      0.022      0.000      1.000      0.000      0.681      0.034      0.163
Contusion    0.000      1.000      0.000      0.000      0.000      0.021      0.000      0.134
Head         0.000      0.261      0.162      0.000      0.065      0.447      0.034      0.120
Knee         0.095      0.239      0.054      0.043      0.258      0.043      0.103      0.114
Strain       0.000      0.000      0.000      1.000      0.065      0.000      0.483      0.114
Unknown      0.277      0.000      0.000      0.000      0.000      0.000      0.000      0.108
Laceration   0.000      0.022      1.000      0.000      0.000      0.000      0.000      0.109
Leg          0.000      0.087      0.135      0.000      0.032      0.000      0.655      0.083
Use New Injury Code in a Logistic Regression to Predict Serious Claims

$$Y = B_0 + B_1\,\mathrm{Attorney} + B_2\,\mathrm{Injury\_Group}$$

Y = Claim Severity > $10,000

Mean Probability of Serious Claim vs. Actual Value

Actual Value   Avg Prob
1              0.31
0              0.01
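For illustration only, a Perl sketch of scoring a claim once such a model has been fit; the coefficient values below are hypothetical, and the fitted logistic model applies the inverse-logit link to the linear predictor above:

use strict;
use warnings;

sub prob_serious {
    my ($attorney, $injury_group) = @_;
    my ($b0, $b1, $b2) = (-3.0, 1.5, 0.8);         # hypothetical coefficients
    my $z = $b0 + $b1 * $attorney + $b2 * $injury_group;
    return 1 / (1 + exp(-$z));                      # logistic (inverse logit) link
}
printf "%.3f\n", prob_serious(1, 2);                # attorney involved, injury group 2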
Software for Text Mining – Commercial Software
 Most major software companies, as well as some specialists, sell text mining software
   These products tend to be for large, complicated applications, such as classifying academic papers
   They also tend to be expensive
 One inexpensive product reviewed by The American Statistician had disappointing performance
Software for Text Mining – Free Software
 A free product, TMSK, was used for much of the paper's analysis
 Parts of the analysis were done in widely available software packages, SPSS and S-Plus (R)
 Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org)
Software Used for Text Mining
 Parse terms / feature creation: Perl, TMSK, S-PLUS, SPSS
 Prediction: SPSS, S-PLUS, SAS
Perl
 Free, open-source programming language
 www.perl.org
 Used widely for text processing
 Perl for Dummies gives a good introduction
Perl Functions for Parsing

use strict;
use warnings;

# Read a claims file and split each claim description into words
my $TheFile = "GLClaims.txt";
open(my $INFILE, '<', $TheFile) or die "File not found: $TheFile";

# Initialize variables
my $Linecount = 0;
my @alllines  = ();
while (<$INFILE>) {
    my $Theline = $_;
    chomp($Theline);                       # drop the trailing newline
    $Linecount = $Linecount + 1;
    my $Linelength = length($Theline);     # length of the current claim description
    my @Newitems = split(/ /, $Theline);   # split on blanks to extract words
    print "@Newitems \n";
    push(@alllines, [@Newitems]);          # keep the parsed words for later steps
} # end while
close($INFILE);
References
 Hoffman, P., Perl for Dummies, Wiley, 2003.
 Weiss, S., Indurkhya, N., Zhang, T., and Damerau, F., Text Mining, Springer, 2005.
Questions?