R In Actuarial Pricing Teams
Chibisi Chima-Okereke
Mango Solutions
E-mail: cchima-okereke@mango-solutions.com

Agenda
• Current software in actuarial analysis
• What is R?
• R as a functional language
• Basic Examples
• Actuarial pricing GLM Example
• Challenges and opportunities

Actuarial Survey
UK Actuaries & CAS (Casualty Actuarial Society)
Source: Palisade 2006 (@Risk): http://www.palisade.com/downloads/pdf/Pryor.pdf
Survey charts:
• Geographical Area
• Main Areas Of Work
• Main area of work in which software is used
• Percentage of respondents using each package
• Use of Statistical Packages: percentage of statistical package users using individual packages

R is the programming language of statistics
Why should it not be the programming language of Actuaries?
Inadequate current incumbents
• VBA: huge versioning issues and inadequate data manipulation and statistical function capabilities
• Excel: inappropriate for analysis
• Proprietary Actuarial Software: no granular access to processing outputs
R offers so much in terms of data manipulation and statistical models

Spreadsheets are unstructured computer programs: The Risks Of Using Spreadsheets for Statistical Analysis (IBM White Paper): http://public.dhe.ibm.com/common/ssi/ecm/en/imw14297usen/IMW14297USEN.PDF

Excel
• Very labour intensive
• Excel spreadsheets are unstructured computer programs
• Problems with checking calculations, and types of errors which can be silent and unknown
• Do your spreadsheets start to grind to a halt with rather moderate sets of data?
• Versioning: Excel files can be over 50 MB each, whereas script versions are a few KB. Imagine this across your network and the wasted space it encourages
• Linking spreadsheets: stability issues etc.
• VBA: versioning problems; inadequate for data analysis and most useful purposes. Harsh but true?

What is R?
People have described R as:
• A big calculator?
• A programming language?
• A rapid prototyping tool?
• A free SAS?
• Statistical Analysis Tool?

Useful R Features
• Open source, object-oriented and functional programming language based on S, designed for manipulating data/objects and carrying out statistical analysis
• Easy connections to external programs and databases, e.g. RODBC: very stable, dynamic SQL queries etc.
• Massive library of tools: >>3400 packages
• GUIs can be created in a straightforward way with the gWidgets package (GTK+, RGtk2)
• Easy output formats: all picture files, data formats, even Excel!

Current Actuarial R Packages
• actuar (loss distributions)
• ChainLadder
• lifecontingencies
• LifeTables
http://cran.r-project.org/web/packages/

Functional Programming
Reference: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
• apply(data, index, function)
• lapply(list, function)
• aggregate(data, by, FUN)
• mapply(function(arg1, arg2), vector(arg1), vector(arg2), ...)
• by(data, indices, function)
More “advanced/powerful”: the {plyr} package (Hadley Wickham) extends the apply functionality

{plyr}
Author: Hadley Wickham http://www.jstatsoft.org/v40/i01/paper

Input \ Output   Array   Data Frame   List    Discarded
Array            aaply   adply        alply   a_ply
Data Frame       daply   ddply        dlply   d_ply
List             laply   ldply        llply   l_ply

a*ply(.data, .margins, .fun, ...)
d*ply(.data, .variables, .fun, ...)
l*ply(.data, .fun, ...)

Example Data
Data Source (Simulated): Modern Actuarial Risk Theory Using R: Kaas, Goovaerts, Dhaene, and Denuit.
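The apply-family functions listed under Functional Programming can be sketched on a tiny hand-made policy table. This is only an illustration: the data frame `policies` and its columns (Year, Region, Exposure, Claims) are assumptions made up for the sketch, not the simulated data set used in the talk.

```r
# Hypothetical policy data; the column names and values are assumptions.
policies <- data.frame(
  Year     = rep(2001:2002, each = 4),
  Region   = rep(c("North", "South"), times = 4),
  Exposure = rep(c(1, 0.5), times = 4),
  Claims   = c(0, 1, 2, 0, 1, 1, 0, 3)
)

# lapply: apply a function to each element of a list
# (here, one data frame per year produced by split)
claimsByYear <- lapply(split(policies, policies$Year),
                       function(d) sum(d$Claims))

# aggregate: total claims and exposure for each Year/Region combination
totals <- aggregate(cbind(Claims, Exposure) ~ Year + Region,
                    data = policies, FUN = sum)

# mapply: element-wise function of two vectors (claims per unit of exposure)
totals$Frequency <- mapply(function(n, e) n / e,
                           totals$Claims, totals$Exposure)

# apply: function over the columns (margin 2) of a matrix
columnMeans <- apply(as.matrix(policies[, c("Exposure", "Claims")]), 2, mean)

# by: apply a function to subsets of a data frame
freqByRegion <- by(policies, policies$Region,
                   function(d) sum(d$Claims) / sum(d$Exposure))

print(totals)
```

Each call takes the function as an argument and returns a new object, so the steps can be chained without explicit loops.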
Dynamic SQL Query Example

    library(RODBC)

    # Fit a Poisson claim-count GLM to one year of data pulled from the database
    doMyAnalysis <- function(myYear = 2001){
      # Build the SQL query for the requested year
      sqlString <- paste("SELECT * FROM policyClaims WHERE Year='", myYear, "'", sep = "")
      myData <- sqlQuery(channel = odbcConnect(dsn = "InsuranceData"), query = sqlString)
      odbcCloseAll()
      myGlm <- glm(noclaims ~ age + bonusmalus + region + mileage, data = myData,
                   offset = log(exposure), family = poisson(link = "log"))
      # Coefficient table: estimate, standard error, z value, p-value
      myCoeffs <- coef(summary(myGlm))
      theNames <- colnames(myCoeffs)
      myCoeffs <- data.frame(myCoeffs)
      myCoeffs <- data.frame(rownames(myCoeffs), myYear, myCoeffs)
      colnames(myCoeffs) <- c("Coeff", "Year", theNames)
      print(myYear)
      # Return only the first row (the intercept)
      return(myCoeffs[1, ])
    }

    # Run the analysis for each year and stack the results
    analysisOutPut <- lapply(2001:2010, doMyAnalysis)
    analysisOutPut <- do.call(rbind, analysisOutPut)
    rownames(analysisOutPut) <- 1:nrow(analysisOutPut)

Dynamic SQL Query Analysis Combination Example

Coeff      Year  Estimate  Std. Error  z value  Pr(>|z|)
Intercept  2001  -0.76     0.03        -24.68   0.00
Intercept  2002  -0.77     0.03        -24.92   0.00
Intercept  2003  -0.80     0.03        -25.65   0.00
Intercept  2004  -0.78     0.03        -25.17   0.00
Intercept  2005  -0.80     0.03        -25.91   0.00
Intercept  2006  -0.76     0.03        -24.92   0.00
Intercept  2007  -0.70     0.03        -23.03   0.00
Intercept  2008  -0.76     0.03        -24.67   0.00
Intercept  2009  -0.79     0.03        -25.30   0.00
Intercept  2010  -0.75     0.03        -24.46   0.00

Plotting Analysis

    # Histogram of gross incurred claims for one Year/BonusMalus subset
    myFun <- function(x){
      hist(x$GrossIncurred, col = "blue", xlab = "GIC",
           main = paste("Histogram of GIC for bonus malus \n group ", x$BonusMalus[1],
                        " and year ", x$Year[1], sep = ""))
    }

    # Write one histogram per Year/BonusMalus combination to a single PDF
    pdf(file = paste(myFolder, "myPlots.pdf", sep = ""), width = 7, height = 7)
    by(policyTable, list("Year" = policyTable$Year, "BonusMalus" = policyTable$BonusMalus), FUN = myFun)
    dev.off()

C:\Users\cchima-okereke\Documents\R\RScripts\ActuarialPricing\tmp\myPlots.pdf

GUI In R (claimsExploreR)
[Figure: Histogram of claim counts with BonusMalus and Age. Panels: Age (>65, 18-23, 24-64) by Year (2001 to 2010); x-axis: Exposure weighted claims count; y-axis: Frequency; BonusMalus groups 1 to 14.]
[Figure: Boxplots of exposure weighted severity with BonusMalus and Age. Panels: Age (>65, 18-23, 24-64) by Year (2001 to 2010); x-axis: BonusMalus (1 to 14); y-axis: Exposure Weighted Severity (Log Scale).]

GLM Models in Pricing
• Poisson: Frequency
• Gamma: Severity
• Negative Binomial for frequency {MASS}
• Tweedie combines frequency and severity {statmod}

Variable Selection Criteria
What metrics shall we use to include/exclude variables?
• Information Criteria: AIC, BIC (multiple flavours)
• Significance of variable: Chi-squared/F-test
• Consistency measures
• Other Measures

Automation Algorithms
What mechanics will we use to select/exclude variables?
• Forward Algorithm
• Backward Algorithm
• Some other bespoke method

Actuarial Pricing in R
Any statistical or data analysis process can be implemented in R, but we will think specifically about GLMs
Example:

    glm(Claims ~ Location + CarType + Age + ..., data = myData, family = poisson(link = "log"), offset = log(Exposure))

But actuarial pricing is also the whole decision-making process around the GLM ...

Automated pricing Process Structure in R
Claim Counts analysis
• Load data from database
• Carry out pre-specified step algorithm with variable aggregation
• Variable selection criteria
• Check variable consistency
• Decide to reject/accept variable
Severity analysis
Obtain Final Models
Continuously writing desired outputs: PDF, log files, documentation, model plots, coefficients etc.

Automated Actuarial Pricing
We need to define the consolidation structure for categorical variables, e.g.

Location 1  Location 2  Location 3  Location 4
North       North       North       North
N.East      North       North       North
N.West      N.West      N.West      North
S.West      S.West      S.West      South
S.East      S.East      South       South
South       South       South       South

Outputting Results
R has perhaps the most extensive choices for outputs of analysis
• Link to Excel
• Text files, e.g. CSV etc.
• Charting Output: picture files: jpeg, tiff, png, pdf, etc.
• Report generation: PDF (Sweave, LaTeX), Word
• PowerPoint direct output
• Printing log reports of process

Example Process

Effects package
[Figure: Effects plot of Age and Bonus Malus. Panels: Age >65, Age 24-64, Age 18-23; x-axis: Bonus Malus (1 to 14); y-axis: Relativity (%).]
effects package from John Fox: http://www.jstatsoft.org/v08/i15/paper

Example Process: Final Model
Final Charts
Final Model

Potential Scheme for analytical process
• Data residing in some database
• Connect to R: RODBC, RPostgreSQL, RODM etc.
• Carry out analysis in R
• Write results to PDF, any picture format, push to LaTeX, Excel, CSV, etc.

Advantages of R for GLM Analysis
R offers a complete statistical, data processing, and analysis environment
• Standard actuarial GLM techniques are available, e.g. splines, interaction terms etc.
• The best plotting functions of any statistical package
• More advanced techniques are available: GAM, GMM, GNM, GHMM, MCMC methods; too many packages to list here!
• Bespoke methods and new actuarial techniques can be readily implemented in R while they are unavailable in standard actuarial software
• Easy to integrate and fully customisable in any analytical environment
• Complete array of statistical/analysis tools: clustering, neural nets, GRM, tree models, bootstrapping, Bayesian techniques, ODE/PDE, HMMs, contingency tables, survival analysis, copulas, extreme value analysis, geospatial analysis and visualisation

Challenges & Opportunities
• If you are new to R, do something small to begin with to test R out
• IT support for R
• There is great need for training and generation of material to enable actuarial analysts to use R
• For mere mortals (like me) the learning curve is tough and the documentation appears ambiguous
• R & Hadoop and R & Oracle
See me later for live R demos
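As a small first exercise of the kind suggested above, the claim-count step of the pricing process (Poisson GLM with exposure offset, then a backward algorithm using AIC) could be sketched as follows. This is a minimal sketch on simulated data, not the process shown in the talk: the data frame `myData`, its columns, and the `Noise` variable are assumptions invented for illustration, and `step()` from base R stands in for whatever bespoke selection algorithm a team would actually use.

```r
# Simulated data purely for illustration; all names and values are assumptions.
set.seed(42)
n <- 2000
myData <- data.frame(
  Age      = factor(sample(c("18-23", "24-64", ">65"), n, replace = TRUE)),
  Location = factor(sample(c("North", "South"), n, replace = TRUE)),
  Noise    = rnorm(n),              # unrelated to claims by construction
  Exposure = runif(n, 0.2, 1)
)

# Simulated claim counts: the true frequency depends on Age only
lambda <- myData$Exposure * exp(-1.5 + 0.6 * (myData$Age == "18-23"))
myData$Claims <- rpois(n, lambda)

# Poisson frequency model with log link; the exposure enters as an offset
fullModel <- glm(Claims ~ Age + Location + Noise + offset(log(Exposure)),
                 data = myData, family = poisson(link = "log"))

# Backward algorithm with AIC as the variable selection criterion
finalModel <- step(fullModel, direction = "backward", trace = 0)

# Relativities (%) of each remaining level against the base level
relativities <- 100 * exp(coef(finalModel)[-1])

print(formula(finalModel))
print(round(relativities, 1))
```

Writing the offset inside the formula keeps it in place while `step()` drops terms; consistency checks and the consolidation of categorical levels would be layered on top of this.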