Introduction to Generalized Linear Models
2007 CAS Predictive Modeling Seminar
Prepared by Louise Francis
Francis Analytics and Actuarial Data Mining, Inc.
www.data-mines.com
Louise_francis@msn.com
October 11, 2007

Objectives
• Give a gentle introduction to linear models
• Illustrate some simple applications of linear models
• Address some practical modeling issues
• Show features common to LMs and GLMs

The Predictive Modeling Family
• Predictive modeling encompasses classical linear models, GLMs, and data mining

Many Aspects of Linear Models Are Intuitive
[Figure: claim severity by year]

An Introduction to Linear Regression
[Figure: severity by year with a fitted trend line]

Intro to Regression, Cont.
• Regression fits the line that minimizes the squared deviation between actual and fitted values:
  $\min \sum_i (Y_i - \hat{Y}_i)^2$

Some Work-Related Liability Data
• Closed claims from the Texas Dept. of Insurance
• Variables include:
  – Total award
  – Initial indemnity reserve
  – Policy limit
  – Attorney involvement
  – Lags: closing lag, report lag
  – Injury type: sprain, back injury, death, etc.
• The data, along with some of the analysis, will be posted on the internet

Simple Illustration
[Figure: scatterplot of total settlement vs. initial indemnity reserve]

How Strong Is the Linear Relationship? The Correlation Coefficient
• Varies between -1 and 1
• Zero = no linear correlation

                         lnInitialIndemnityRes  lnTotalAward  lnInitialExpense  lnReportlag
  lnInitialIndemnityRes   1.000
  lnTotalAward            0.303                 1.000
  lnInitialExpense        0.118                 0.227         1.000
  lnReportlag            -0.112                 0.048         0.090             1.000

Scatterplot Matrix
[Figure: scatterplot matrix, prepared with the Excel add-in XLMiner]

Excel Does Regression
• Install the Data Analysis ToolPak (an add-in that comes with Excel)
• Click Tools, Data Analysis, Regression

How Good Is the Fit?
• First step: compute the residuals; residual = actual - fitted

  Y = lnTotalAward   Predicted   Residual
  10.13              11.76       -1.63
  14.08              12.47        1.61
  10.31              11.65       -1.34

• Sum the squares of the residuals (SSE)
• Compute the total variance of the data with no model (SST)

Goodness-of-Fit Statistics
• R² (SS Regression / SS Total): the percentage of variance explained
• Adjusted R²: R² adjusted for the number of coefficients in the model
• Note: SSE = sum of squared errors; MS = mean square error
[Figure: the R² statistic in the Excel regression output]

Significance of the Regression
• F statistic: (mean square of the regression) / (mean square of the residual)
• df of the numerator = k = the number of predictor variables
• df of the denominator = N - k - 1
• Reported in the ANOVA (Analysis of Variance) table

Goodness-of-Fit Statistics: t Statistics
• Use the SE of a coefficient to determine whether it is significant
• The SE of a coefficient is a function of s, the standard error of the regression
• The test uses the t-distribution
• It is customary to drop a variable if its coefficient is not significant

t-Statistic: Are the Intercept and Coefficient Significant?

                          Coefficients  Standard Error  t Stat   P-value
  Intercept               10.343        0.112           92.122   0
  lnInitialIndemnityRes    0.154        0.011           13.530   8.21E-40

Other Diagnostics: Residual Plot
• Plot the independent variable (or the predicted value) against the residuals
• Points should scatter randomly around zero
• If not, the regression assumptions are violated
[Figure: predicted vs. residual for data with normally distributed, randomly generated errors, showing the random scatter expected of well-behaved residuals]
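For readers who prefer a scriptable alternative to the Excel ToolPak, the sketch below reproduces the same simple regression and its diagnostics in Python with statsmodels. The file name txclaims.csv and the column names TotalAward and InitialIndemnityRes are assumptions for illustration, not the actual posted data set.

```python
# Minimal sketch: the log-log simple regression and the diagnostics
# discussed above (residuals, SSE, SST, R-squared, F and t statistics).
# The CSV name and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

claims = pd.read_csv("txclaims.csv")
y = np.log(claims["TotalAward"])                            # lnTotalAward
X = sm.add_constant(np.log(claims["InitialIndemnityRes"]))  # intercept + lnInitialIndemnityRes

fit = sm.OLS(y, X).fit()
print(fit.summary())               # coefficients, SEs, t stats, R-squared, F, ANOVA pieces

residuals = y - fit.fittedvalues   # residual = actual - fitted
sse = (residuals ** 2).sum()       # sum of squared residuals (SSE)
sst = ((y - y.mean()) ** 2).sum()  # total sum of squares, no model (SST)
print("R-squared by hand:", 1 - sse / sst)
```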
What May Residuals Indicate?
• If the absolute size of the residuals increases as the predicted value increases, this may indicate nonconstant variance
• Nonconstant variance may indicate a need to log the dependent variable
• It may call for a weighted regression
• It may indicate a nonlinear relationship

Standardized Residuals: Finding Outliers
$z_i = \dfrac{y_i - \hat{y}_i}{s_e}$, where $s_e = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - k - 1}}$
[Figure: standardized residuals by observation, ranging from about -4 to 6 over roughly 2,000 observations]

Outliers
• May represent errors
• May be legitimate but have undue influence on the regression
• Can downweight outliers, with weights inversely proportional to the variance of the observation

Robust Regression
• Based on absolute deviations
• Based on lower weights for more extreme observations

Non-Linear Relationships
[Figure: a non-linear relationship between two variables]
• Suppose the relationship between the dependent and independent variable is non-linear
• Linear regression requires a linear relationship

Transformation of Variables
• Apply a transformation to the dependent variable, the independent variable, or both
• Examples: Y' = log(Y); X' = log(X); X' = 1/X; Y' = Y^(1/2)

Transformation of Variables: Skewness of Distribution
• Use exploratory data analysis to detect skewness and heavy tails
• After a log transformation the data are much less skewed and look more like the Normal, though some skewness remains
[Figures: box plot and histogram of lnTotalAward]

Transformation of Variables, Cont.
• Suppose claim severity is a function of the log of the report lag
• Compute X' = log(Report Lag)
• Regress severity on X'

Categorical Independent Variables: The Other Linear Model: ANOVA

  Average of TotalSettlementAmountOrCourtAward by Injury
  Injury                      Total
  Amputation                567,889
  Back injury               168,747
  Brain damage              863,485
  Burns, chemical         1,097,402
  Burns, heat               801,748
  Circulatory condition     302,500

(The table above was created with Excel pivot tables.)

The Model
• The model is Y = a_i, where i is a category of the independent variable
• a_i is the mean of category i
[Pivot chart: "Average Severity By Injury" — average trended severity for BRUISE, BURN, CRUSHING, CUT/PUNCT, EYE, FRACTURE, OTHER, SPRAIN, and STRAIN, with the nine category means labeled Y = a_1 through Y = a_9]
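A short sketch of the standardized-residual screen just described; the helper function, the simulated data, and the cutoff of 3 are illustrative choices, not part of the original analysis.

```python
# Minimal sketch: standardized residuals z_i = (y_i - yhat_i) / s_e,
# with s_e computed from the residual sum of squares on N - k - 1
# degrees of freedom, matching the formula above.
import numpy as np

def standardized_residuals(y, yhat, k):
    """z_i = (y_i - yhat_i) / s_e for a regression with k predictors."""
    resid = y - yhat
    s_e = np.sqrt((resid ** 2).sum() / (len(y) - k - 1))  # standard error of regression
    return resid / s_e

# Simulated example: flag observations whose standardized residual
# exceeds 3 in absolute value (a common, if arbitrary, screen).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 2000)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 2000)
yhat = 2.0 + 0.5 * x                       # stand-in for fitted values
z = standardized_residuals(y, yhat, k=1)
print("potential outliers:", np.where(np.abs(z) > 3)[0])
```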
Two Categories
• Model: Y = a_i, where i is a category of the independent variable
• In traditional statistics we compare a_1 to a_2

If Only Two Categories: t-Test for the Significance of the Independent Variable

                                  Variable 1     Variable 2
  Mean                            124,002        440,758
  Variance                        2.35142E+11    1.86746E+12
  Observations                    354            1,448
  Hypothesized Mean Difference    0
  df                              1,591
  t Stat                          -7.17
  P(T<=t) one-tail                0.00
  t Critical one-tail             1.65
  P(T<=t) two-tail                0.00
  t Critical two-tail             1.96

• Use the t-test from the Excel Data Analysis ToolPak

More Than Two Categories
• Use an F-test instead of a t-test
• With more than two categories, the procedure is called an Analysis of Variance (ANOVA)

Fitting ANOVA With Two Categories Using a Regression
• Create a dummy variable for attorney involvement
• The variable is 1 if an attorney is involved and 0 otherwise

  AttorneyInvolvement-Insurer  Attorney  TotalSettlement
  Y                            1            25,000
  Y                            1         1,300,000
  Y                            1            30,000
  N                            0            42,500
  Y                            1            25,000
  N                            0            30,000
  Y                            1            36,963
  Y                            1           145,000
  N                            0           875,000

More Than 2 Categories
• If there are k categories, create k-1 dummy variables
• Dummy_i = 1 if the claim is in category i, and 0 otherwise
• The kth category is 0 on all of the dummies; its mean is captured by the intercept of the regression
(A worked sketch of this coding appears after the Similarities with GLMs slide below.)

Design Matrix

  Injury Code  Injury_Backinjury  Injury_Multipleinjuries  Injury_Nervouscondition  Injury_Other
  1            0                  0                        0                        0
  1            0                  0                        0                        0
  12           1                  0                        0                        0
  11           0                  1                        0                        0
  17           0                  0                        0                        1

(On the original slide, the dummy variables in the top table were hand coded; those in the bottom table were created by XLMiner.)

Regression Output for a Categorical Independent Variable
[Figure: Excel regression output]

A More Complex Model: Multiple Regression
• Let Y = a + b1*X1 + b2*X2 + … + bn*Xn + e
• The X's can be numeric variables or categorical dummies

Multiple Regression
Y = a + b1*(Initial Reserve) + b2*(Report Lag) + b3*(Policy Limit) + b4*(Age) + Σ_i c_i*Attorney_i + Σ_k d_k*Injury_k + e

SUMMARY OUTPUT

  Regression Statistics
  Multiple R           0.49844
  R Square             0.24844
  Adjusted R Square    0.24213
  Standard Error       1.10306
  Observations         1802

  ANOVA        df     SS        MS      F
  Regression   15     718.36    47.89   39.360
  Residual     1786   2173.09   1.22
  Total        1801   2891.45

                                Coefficients  Standard Error  t Stat   P-value
  Intercept                     10.052        0.156           64.374   0.000
  lnInitialIndemnityRes          0.105        0.011            9.588   0.000
  lnReportlag                    0.020        0.011            1.887   0.059
  Policy Limit                   0.000        0.000            4.405   0.000
  Clmt Age                      -0.002        0.002           -1.037   0.300
  Attorney                       0.718        0.068           10.599   0.000
  Injury_Backinjury             -0.150        0.075           -1.995   0.046
  Injury_Braindamage             0.834        0.224            3.719   0.000
  Injury_Burnschemical           0.587        0.247            2.375   0.018
  Injury_Burnsheat               0.637        0.175            3.645   0.000
  Injury_Circulatorycondition    0.935        0.782            1.196   0.232

More Than One Categorical Variable
• For each categorical variable with k categories, create k-1 dummy variables
• The category left out becomes the "base" category; its value is contained in the intercept
• The model is Y = a_i + b_j + … + e, or Y = u + a_i + b_j + … + e, where a_i and b_j are offsets to u and e is a random error term

Correlation of Predictor Variables: Multicollinearity
• Predictor variables are assumed uncorrelated; assess with a correlation matrix
• Remedies for multicollinearity:
  – Drop one or more of the highly correlated variables
  – Use factor analysis or principal components to produce a new variable that is a weighted average of the correlated variables
  – Use stepwise regression to select the variables to include

Similarities with GLMs

  Linear Models                            GLMs
  Transformation of variables              Link functions
  Dummy coding for categorical variables   Dummy coding for categorical variables
  Residuals                                Deviance
  Test significance of coefficients        Test significance of coefficients
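The sketch below works through the k-1 dummy coding and a small multiple regression of the kind summarized above. The settlement amounts and attorney flags echo the small example table earlier in this section; the reserve, report-lag, and injury columns are invented for illustration.

```python
# Minimal sketch: k-1 dummy coding for a categorical variable plus a
# multiple regression mixing numeric predictors and dummies.
# Reserve, lag, and injury values are invented; settlements and
# attorney flags come from the small example table above.
import numpy as np
import pandas as pd
import statsmodels.api as sm

claims = pd.DataFrame({
    "TotalSettlement":     [25000, 1300000, 30000, 42500, 25000,
                            30000, 36963, 145000, 875000],
    "InitialIndemnityRes": [10000, 250000, 15000, 20000, 9000,
                            12000, 18000, 60000, 300000],
    "ReportLag":           [30, 400, 60, 90, 15, 45, 120, 200, 365],
    "Attorney":            [1, 1, 1, 0, 1, 0, 1, 1, 0],
    "Injury":              ["Sprain", "BackInjury", "Sprain", "Other", "BackInjury",
                            "Other", "Sprain", "BackInjury", "Other"],
})

# drop_first=True keeps k-1 dummies; the dropped category ("BackInjury",
# first alphabetically) becomes the base absorbed into the intercept.
X = pd.get_dummies(claims[["InitialIndemnityRes", "ReportLag", "Attorney", "Injury"]],
                   columns=["Injury"], drop_first=True, dtype=float)
X["InitialIndemnityRes"] = np.log(X["InitialIndemnityRes"])
X["ReportLag"] = np.log(X["ReportLag"])
X = sm.add_constant(X)
y = np.log(claims["TotalSettlement"])

fit = sm.OLS(y, X).fit()
print(fit.params)   # intercept, numeric slopes, and injury offsets
```

In practice the base category is chosen deliberately (often the largest class) rather than alphabetically.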
Introductory Modeling Library Recommendations
• Berry, W., Understanding Regression Assumptions, Sage University Press
• Iversen, R. and Norpoth, H., Analysis of Variance, Sage University Press
• Fox, J., Regression Diagnostics, Sage University Press
• Shmueli, Patel and Bruce, Data Mining for Business Intelligence: Concepts, Applications and Techniques in Microsoft Office Excel with XLMiner, Wiley, 2007