Taming Text: An Introduction to Text Mining
CAS 2006 Ratemaking Seminar
Prepared by Louise Francis
Francis Analytics and Actuarial Data Mining, Inc.
April 1, 2006
Louise_francis@msn.com
www.data-mines.com

Objectives
• Present a new data mining technology
• Show how the technology uses a combination of:
  • String processing functions
  • Common multivariate procedures available in most statistical software
• Present a simple example of text mining
• Discuss practical issues for implementing the methods

Actuarial Rocket Science
• Sophisticated predictive modeling methods are gaining acceptance for pricing, fraud detection and other applications
• The methods are typically applied to large, complex databases
• One of the newest of these is text mining

Major Kinds of Modeling
• Supervised learning
  • Most common situation
  • A dependent variable: frequency, loss ratio, fraud/no fraud
  • Some methods: regression, CART, some neural networks
• Unsupervised learning
  • No dependent variable; group like records together
  • A group of claims with similar characteristics might be more likely to be fraudulent
  • Examples: territory assignment, text mining
  • Some methods: association rules, k-means clustering, Kohonen neural networks

Text Mining: Uses Growing in Many Areas
• Example: the ECHELON program

Lots of Information, but No Data
Example: Claim Description Field

INJURY DESCRIPTION
BROKEN ANKLE AND SPRAINED WRIST
FOOT CONTUSION
UNKNOWN
MOUTH AND KNEE
HEAD, ARM LACERATIONS
FOOT PUNCTURE
LOWER BACK AND LEGS
BACK STRAIN
KNEE

Objective
• Create a new variable from free-form text
• Use the words in the injury description to create an injury code
• The new injury code can be used in a predictive model or in other analyses

A Two-Step Process
• Use string manipulation functions to parse the text (sketched in Perl below)
  • Search for blanks, commas, periods and other word separators
  • Use the separators to extract words
  • Eliminate stopwords
• Use multivariate techniques to cluster like terms together into the same injury code
  • Cluster analysis
  • Factor and principal components analysis

Parsing a Claim Description Field with Microsoft Excel String Functions

(1)  Full Description        BROKEN ANKLE AND SPRAINED WRIST
(2)  Total Length            31
(3)  Location of Next Blank  7
(4)  First Word              BROKEN
(5)  Remainder Length 1      24
(6)  Remainder 1             ANKLE AND SPRAINED WRIST
(7)  2nd Blank               6
(8)  2nd Word                ANKLE
(9)  Remainder Length 2      18
(10) Remainder 2             AND SPRAINED WRIST
(11) 3rd Blank               4
(12) 3rd Word                AND
(13) Remainder Length 3      14
(14) Remainder 3             SPRAINED WRIST
(15) 4th Blank               9
(16) 4th Word                SPRAINED
(17) Remainder Length 4      5
(18) Remainder 4             WRIST
(19) 5th Blank               0
(20) 5th Word                WRIST

Extraction Creates Binary Indicator Variables

INJURY DESCRIPTION               BROKEN  ANKLE  AND  SPRAINED  WRIST  FOOT  CONTUSION  UNKNOWN  NECK  BACK  STRAIN
BROKEN ANKLE AND SPRAINED WRIST     1      1     1      1        1      0       0         0       0     0      0
FOOT CONTUSION                      0      0     0      0        0      1       1         0       0     0      0
UNKNOWN                             0      0     0      0        0      0       0         1       0     0      0
NECK AND BACK STRAIN                0      0     1      0        0      0       0         0       1     1      1

Eliminate Stopwords
• Common words with no meaningful content
• Examples: a, and, able, about, above, across, aforementioned, after, again

Stemming: Identify Synonyms and Words with a Common Stem
Parsed words: HEAD, INJURY, LACERATION, NONE, KNEE, BRUISED, UNKNOWN, TWISTED, L, LOWER, LEG, BROKEN, ARM, FRACTURE, R, FINGER, FOOT, INJURIES, HAND, LIP, ANKLE, RIGHT, HIP, KNEES, SHOULDER, FACE, LEFT, FX, CUT, SIDE, WRIST, PAIN, NECK, INJURED
• For example, INJURY, INJURIES and INJURED share a common stem, KNEES maps to KNEE, and L, LEFT, R, RIGHT and FX are abbreviations or synonyms to be consolidated
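To make the parsing and stopword steps concrete, here is a minimal Perl sketch (Perl is the tool this presentation later recommends for parsing). It is an illustration, not the paper's actual code: it assumes the claim descriptions sit one per line in GLClaims.txt, the file read in the Perl example near the end of the talk, and it uses only the short stopword list above, where a real application would use a much fuller one.

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative stopword list taken from the slide above
my %stop = map { $_ => 1 }
           qw(a and able about above across aforementioned after again);

my @parsed;    # one array of cleaned words per claim description
open(my $in, '<', 'GLClaims.txt') or die "File not found";
while (my $line = <$in>) {
    chomp $line;
    # Split on blanks, commas and periods, then drop stopwords
    my @words = grep { length($_) && !$stop{$_} }
                split /[\s,.]+/, lc $line;
    push @parsed, \@words;
}
close $in;

After this runs, @parsed holds a cleaned word list for each claim, ready for the extraction step.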
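The extraction step then turns those word lists into the binary indicator variables illustrated above: one 0/1 column per distinct term. This continues the sketch; the names @parsed, @terms and @indicators are illustrative, not from the paper.

# Collect the vocabulary of distinct terms across all claims
my %vocab;
$vocab{$_} = 1 for map { @$_ } @parsed;
my @terms = sort keys %vocab;

# One row of 0/1 term indicators per claim
my @indicators;
for my $words (@parsed) {
    my %seen = map { $_ => 1 } @$words;
    push @indicators, [ map { $seen{$_} ? 1 : 0 } @terms ];
}

Each row of @indicators is a vector of term flags; these rows are the records that the clustering methods below group into injury codes.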
Dimension Reduction
• The two major categories of dimension reduction:
  • Variable reduction: factor analysis, principal components analysis
  • Record reduction: clustering
• Other methods tend to be developments of these

Correlated Dimensions
[Scatterplot of Ultimate Loss (000s) vs. Ultimate ALAE (000s), showing that the two dimensions are correlated]

Clustering
• Common methods: k-means and hierarchical clustering
• No dependent variable: records are grouped into classes with similar values on the variables
• Start with a measure of similarity or dissimilarity
• Maximize dissimilarity between members of different clusters

Dissimilarity (Distance) Measures – Continuous Variables
Euclidean distance:  d_{ij} = \left[ \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right]^{1/2}
Manhattan distance:  d_{ij} = \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|
where i and j index records and k = 1, ..., m indexes the variables.

Binary Variables
Cross-classify two records by their values on each variable:

                   Record 2
                  1        0
Record 1   1      a        c
           0      b        d

Binary Variables: Dissimilarity Measures
Sample matching:      d = \frac{b + c}{a + b + c + d}
Rogers and Tanimoto:  d = \frac{2(b + c)}{(a + d) + 2(b + c)}

K-Means Clustering
• Determine ahead of time how many clusters or groups you want
• Use the dissimilarity measure to assign all records to one of the clusters

Cluster centers for a two-cluster solution (each value is the proportion of claims in the cluster containing the term):

Term         Cluster 1   Cluster 2
back           0.00        1.00
contusion      0.15        0.04
head           0.12        0.11
knee           0.13        0.05
strain         0.05        0.40
unknown        0.13        0.00
laceration     0.17        0.00
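The two binary dissimilarity measures translate directly into code. A minimal Perl sketch that works on 0/1 rows such as @indicators above (the subroutine names are mine, not the paper's):

# Count the 2x2 agreement cells for two binary records (array refs):
# a = both 1, d = both 0, b and c = the two kinds of disagreement
sub mismatch_counts {
    my ($x, $y) = @_;
    my ($a, $b, $c, $d) = (0, 0, 0, 0);
    for my $k (0 .. $#$x) {
        if    ($x->[$k] && $y->[$k])   { $a++ }
        elsif ($x->[$k])               { $c++ }
        elsif ($y->[$k])               { $b++ }
        else                           { $d++ }
    }
    return ($a, $b, $c, $d);
}

sub sample_matching {     # (b + c) / (a + b + c + d)
    my ($a, $b, $c, $d) = mismatch_counts(@_);
    return ($b + $c) / ($a + $b + $c + $d);
}

sub rogers_tanimoto {     # 2(b + c) / ((a + d) + 2(b + c))
    my ($a, $b, $c, $d) = mismatch_counts(@_);
    return 2 * ($b + $c) / (($a + $d) + 2 * ($b + $c));
}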
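And a bare-bones k-means sketch over the binary indicator rows. This illustrates the algorithm only; it is not the TMSK, S-PLUS or SPSS routine used for the paper's analysis, and Euclidean distance with random starting centers is my choice for the sketch. Note that each updated center is the vector of term proportions within its cluster, which is exactly the form of the cluster-center tables shown here.

use List::Util qw(shuffle sum);

sub euclid {    # Euclidean distance from a record to a cluster center
    my ($x, $c) = @_;
    return sqrt(sum map { ($x->[$_] - $c->[$_])**2 } 0 .. $#$x);
}

sub kmeans {
    my ($rows, $k, $iters) = @_;
    my @centers = (shuffle @$rows)[0 .. $k - 1];   # random starting centers
    my @assign;
    for (1 .. $iters) {
        # Assignment step: each record joins its nearest center
        for my $i (0 .. $#$rows) {
            my @d = map { euclid($rows->[$i], $_) } @centers;
            ($assign[$i]) = sort { $d[$a] <=> $d[$b] } 0 .. $k - 1;
        }
        # Update step: each center becomes the mean of its members,
        # i.e. the proportion of its claims containing each term
        for my $j (0 .. $k - 1) {
            my @members = grep { $assign[$_] == $j } 0 .. $#$rows;
            next unless @members;
            $centers[$j] = [
                map { my $t = $_;
                      sum(map { $rows->[$_][$t] } @members) / @members
                } 0 .. $#{ $rows->[0] }
            ];
        }
    }
    return (\@assign, \@centers);
}

Called as, for example, my ($assign, $centers) = kmeans(\@indicators, 2, 20); to produce a two-cluster solution like the one tabulated above.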
Hierarchical Clustering
• A stepwise procedure
• At the beginning, each record is its own cluster
• Combine the most similar records into a single cluster
• Repeat the process until there is only one cluster containing every record

Hierarchical Clustering Example
[Dendrogram for 10 terms (rescaled distance cluster combine, 0 to 25): arm, foot, leg, laceration, contusion, head, knee, unknown, back, strain]

How Many Clusters?
• Use statistics on the strength of the relationship to variables of interest

A Statistical Test for the Number of Clusters: The Schwarz Bayesian Information Criterion

X \sim N(\mu, \Sigma)   (2.8)

where X is a vector of random variables, \mu is the centroid (mean) of the data and \Sigma is the variance-covariance matrix.

\mathrm{BIC} = \log L(X, M) - \lambda \, \frac{p}{2} \log(N)   (2.9)

where \log L(X, M) is the log-likelihood function for a model, p is the number of parameters, N is the number of records, and \lambda is a penalty parameter, often equal to 1.

Final Cluster Selection

Term        Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6  Cluster 7  Wtd. Avg.
back          0.000      0.022      0.000      1.000      0.000      0.681      0.034      0.163
contusion     0.000      1.000      0.000      0.000      0.000      0.021      0.000      0.134
head          0.000      0.261      0.162      0.000      0.065      0.447      0.034      0.120
knee          0.095      0.239      0.054      0.043      0.258      0.043      0.103      0.114
strain        0.000      0.000      0.000      1.000      0.065      0.000      0.483      0.114
unknown       0.277      0.000      0.000      0.000      0.000      0.000      0.000      0.108
laceration    0.000      0.022      1.000      0.000      0.000      0.000      0.000      0.109
leg           0.000      0.087      0.135      0.000      0.032      0.000      0.655      0.083

Use the New Injury Code in a Logistic Regression to Predict Serious Claims

Y = \beta_0 + \beta_1 \,\mathrm{Attorney} + \beta_2 \,\mathrm{Injury\_Group}

where the target Y indicates a claim severity greater than $10,000.

Mean Probability of a Serious Claim vs. Actual Value

Actual Value   Avg. Prob.
     1            0.31
     0            0.01

Software for Text Mining – Commercial Software
• Most major software companies, as well as some specialists, sell text mining software
• These products tend to be for large, complicated applications, such as classifying academic papers
• They also tend to be expensive
• One inexpensive product reviewed by The American Statistician had disappointing performance

Software for Text Mining – Free Software
• A free product, TMSK, was used for much of the paper's analysis
• Parts of the analysis were done in widely available software packages, SPSS and S-PLUS (R)
• Many of the text manipulation functions can be performed in Perl (www.perl.com) and Python (www.python.org)

Software Used for Text Mining
• Parse terms: Perl
• Feature creation: Perl, TMSK, S-PLUS, SPSS
• Prediction: SPSS, S-PLUS, SAS

Perl
• Free, open-source programming language: www.perl.org
• Used a lot for text processing
• Perl for Dummies gives a good introduction

Perl Functions for Parsing

# Read the claim file and split each line into words
$TheFile = "GLClaims.txt";
open(INFILE, $TheFile) or die "File not found";
# Initialize variables
$Linecount = 0;
@alllines  = ();
while (<INFILE>) {
    $Theline = $_;
    chomp($Theline);                    # drop the trailing newline
    $Linecount = $Linecount + 1;
    $Linelength = length($Theline);     # length of the current line
    @Newitems = split(/ /, $Theline);   # split on blanks to extract words
    print "@Newitems \n";
    push(@alllines, [@Newitems]);       # save the word list for this line
} # end while
close(INFILE);

References
Hoffman, P., Perl for Dummies, Wiley, 2003.
Weiss, S., Indurkhya, N., Zhang, T. and Damerau, F., Text Mining, Springer, 2005.

Questions?