Introduction to Predictive Modeling with Examples
Nationwide Insurance Company, November 2
D. A. Dickey

Cool <------------------------> Nerdy
"Analytics" = "Statistics"
"Predictive Modeling" = "Regression"

Part 1: Simple Linear Regression

"If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others."
http://www.ofesite.com/spirit/palm/lines/linelife.htm

Wilson & Mather, JAMA 229 (1974): X = life line length, Y = age at death.

proc sgplot data=life;
   scatter y=age x=line;
   reg y=age x=line;
run;

Result: Predicted Age at Death = 79.24 - 1.367(lifeline)
(Is this "real"??? Is this repeatable???)

We use LEAST SQUARES: the "best" line is the one that minimizes the sum of squared residuals. Here the squared residuals sum to 9609.

[Figure: error sum of squares SSq as a function of slope and intercept, truncated at SSq = 9700]

Best for this sample -- but is it the true relationship for everyone? SAS PROC REG will compute it. What other lines might be the true line for everyone? Probably not the purple one. The red one has slope 0 (no effect). Is the red line unreasonable? Can we reject H0: slope is 0?

Simulation: Age at Death = 67 + 0(life line) + e
Error e has a normal distribution with mean 0 and variance 200.
Simulate 20 cases with n = 50 bodies each.

NOTE: Regression equations:
Age(rep:1)  = 80.56253 - 1.345896*line
Age(rep:2)  = 61.76292 + 0.745289*line
Age(rep:3)  = 72.14366 - 0.546996*line
Age(rep:4)  = 95.85143 - 3.087247*line
Age(rep:5)  = 67.21784 - 0.144763*line
Age(rep:6)  = 71.0178  - 0.332015*line
Age(rep:7)  = 54.9211  + 1.541255*line
Age(rep:8)  = 69.98573 - 0.472335*line
Age(rep:9)  = 85.73131 - 1.240894*line
Age(rep:10) = 59.65101 + 0.548992*line
Age(rep:11) = 59.38712 + 0.995162*line
Age(rep:12) = 72.45697 - 0.649575*line
Age(rep:13) = 78.99126 - 0.866334*line
Age(rep:14) = 45.88373 + 2.283475*line
Age(rep:15) = 59.28049 + 0.790884*line
Age(rep:16) = 73.6395  - 0.814287*line
Age(rep:17) = 70.57868 - 0.799404*line
Age(rep:18) = 72.91134 - 0.821219*line
Age(rep:19) = 55.46755 + 1.238873*line
Age(rep:20) = 63.82712 + 0.776548*line

Our fit, Predicted Age at Death = 79.24 - 1.367(lifeline), would NOT be unusual if there is no true relationship.

[Figure: distribution of t under H0]

Conclusion:
- Estimated slopes vary.
- The standard deviation of the estimated slopes is the (estimated) "standard error."
- Compute t = (estimate - hypothesized)/standard error.
- The p-value is the probability of a larger |t| when the hypothesis is correct (e.g. 0 slope); it is the sum of the two tail areas.
- Traditionally, p < 0.05 means we conclude the hypothesized value is wrong; p > 0.05 is inconclusive.

proc reg data=life; model age=line; run;

                     Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1             79.23341         14.83229      5.34     <.0001
line        1             -1.36697          1.59782     -0.86     0.3965

[Figure: t distribution under H0; the tail areas of 0.19825 beyond -0.86 and beyond +0.86 sum to the p-value 0.39650]

Conclusion: insufficient evidence against the hypothesis of no linear relationship.

H0: Innocence                         H0: True slope is 0 (no association)
H1: Guilt                             H1: True slope is not 0
Beyond reasonable doubt: P < 0.05     Here: P = 0.3965

Simulation: Age at Death = 67 + 0(life line) + e
Error e has a normal distribution with mean 0 and variance 200. WHY variance 200?
Simulate 20 cases with n = 50 bodies each.
We want an estimate of the variability around the true line. The true variance (200 here) is unknown in practice, so we use sums of squared residuals (SS).
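The null simulation above can be reproduced with a short SAS program. A minimal sketch, assuming nothing beyond the slide's setup; the data set name sim, the seed, and the range of life line lengths are my choices, not from the talk:

data sim;
   do rep = 1 to 20;                            /* 20 replicate studies          */
      do body = 1 to 50;                        /* n = 50 bodies each            */
         line = 8 + 4*ranuni(1234);             /* arbitrary life line lengths   */
         age  = 67 + 0*line + sqrt(200)*rannor(1234);  /* true slope 0, var 200  */
         output;
      end;
   end;
run;
proc reg data=sim outest=fits noprint;          /* one fitted line per replicate */
   by rep;
   model age = line;
run;
proc print data=fits; var rep Intercept line; run;

The printed intercepts and slopes scatter around 67 and 0, just as the 20 NOTE: equations above do.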
The sum of squared residuals from the mean is "SS(total)" = 9755.
The sum of squared residuals around the line is "SS(error)" = 9609.
(1) SS(total) - SS(error) is SS(model) = 146.
(2) The variance estimate is SS(error)/(degrees of freedom) = 200.
(3) SS(model)/SS(total) is R-square, i.e. the proportion of variability "explained" by the model.

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        146.51753     146.51753      0.73   0.3965
Error             48       9608.70247     200.18130
Corrected Total   49       9755.22000

Root MSE 14.14854    R-Square 0.0150

Part 2: Multiple Regression

Issues:
(1) Testing joint importance versus individual significance.
    A two-engine plane can still fly if engine #1 fails; it can also still fly if engine #2 fails. Neither engine is critical individually, but they are jointly critical (can't omit both!!).
(2) Prediction versus modeling individual effects.
(3) Collinearity (correlation among inputs).

Example: a hypothetical company's sales Y depend on TV advertising X1 and radio advertising X2.
Y = b0 + b1X1 + b2X2 + e

data sales;
   length sval $8;
   length cval $8;
   input store TV radio sales;
   (more code)
cards;
 1  869  868   9089
 2  836  820   8290
(more data)
40  969  961  10130
;

proc g3d data=sales;
   scatter radio*TV=sales / shape=sval color=cval zmin=8000;
run;

[Figure: 3-D scatterplot of sales against TV and radio budgets]

Conclusion: we can predict well with just TV, just radio, or both!

SAS code:
proc reg data=next; model sales = TV radio;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         32660996      16330498    358.84   <.0001   (can't omit both)
Error             37          1683844         45509
Corrected Total   39         34344840

Root MSE 213.32908    R-Square 0.9510 -- explaining 95% of the variation in sales.

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            531.11390        359.90429      1.48     0.1485
TV           1              5.00435          5.01845      1.00     0.3251   (can omit TV)
radio        1              4.66752          4.94312      0.94     0.3512   (can omit radio)

Estimated Sales = 531 + 5.0 TV + 4.7 radio, with error variance 45509 (standard deviation 213).
TV is approximately equal to radio, so, approximately,
Estimated Sales = 531 + 9.7 TV, or
Estimated Sales = 531 + 9.7 radio.
Setting TV = radio (an approximate relationship):
Estimated Sales = 531 + 9.7 TV    -- is this the BEST TV line?
Estimated Sales = 531 + 9.7 radio -- is this the BEST radio line?

proc reg data=stores;
   model sales = TV;
   model sales = radio;
run;

Analysis of Variance (sales on TV)
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         32620420      32620420    718.84   <.0001
Error             38          1724420         45379
Corrected Total   39         34344840

Root MSE 213.02459    R-Square 0.9498

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            478.50829        355.05866      1.35     0.1857
TV           1              9.73056          0.36293     26.81     <.0001

*********************************************************************************************

Analysis of Variance (sales on radio)
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         32615742      32615742    716.79   <.0001
Error             38          1729098         45503
Corrected Total   39         34344840

Root MSE 213.31333    R-Square 0.9497

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            612.08604        350.59871      1.75     0.0889
radio        1              9.58381          0.35797     26.77     <.0001
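All three of those models can be fit in one PROC REG call, and the Type I and Type II sums of squares discussed next can be requested with the SS1 and SS2 model options. A minimal sketch, assuming the 40-store data set is named sales; the model labels are my own:

proc reg data=sales;
   Both:      model sales = TV radio / ss1 ss2;  /* joint model with Type I and II SS */
   TVonly:    model sales = TV;                  /* simple regression on TV           */
   Radioonly: model sales = radio;               /* simple regression on radio        */
run;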
Sums of squares capture the variation explained by each variable.
Type I: how much when it is added to the model (in the order listed)?
Type II: how much when all other variables are present (as if it had been added last)?

Parameter Estimates (TV entered first)
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|     Type I SS   Type II SS
Intercept    1            531.11390        359.90429      1.48     0.1485    3964160640        99106
TV           1              5.00435          5.01845      1.00     0.3251      32620420        45254
radio        1              4.66752          4.94312      0.94     0.3512         40576        40576

***********************************************************************************

Parameter Estimates (radio entered first)
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|     Type I SS   Type II SS
Intercept    1            531.11390        359.90429      1.48     0.1485    3964160640        99106
radio        1              4.66752          4.94312      0.94     0.3512      32615742        40576
TV           1              5.00435          5.01845      1.00     0.3251         45254        45254

Summary: good predictions are given by
Sales = 531 + 5.0 x TV + 4.7 x Radio, or
Sales = 479 + 9.7 x TV, or
Sales = 612 + 9.6 x Radio, or (lots of others).
Why the confusion? The evil multicollinearity!! (correlated X's)

Those Mysterious "Degrees of Freedom" (DF)

The first Martian gives information about average height but 0 information about variation. The 2nd Martian gives the first piece of information (DF) about the error variance around the mean. n Martians give n-1 DF for error (variation).
[Figure: Martian heights scattered around their mean]
For a line (Martian weight regressed on height): 2 points give no information on the variation of the errors; n points give n-2 error DF.
[Figure: Martian weight vs. height with fitted line]

How Many Table Legs? (regress Y on X1, X2)
Three legs will all touch the floor; the fourth leg gives the first chance to measure error (the first error DF). Fitting a plane leaves n-3 (= 37 here) error DF (2 "model" DF, n-1 = 39 "total" DF).

Source            DF   Sum of Squares   Mean Square
Model              2         32660996      16330498
Error             37          1683844         45509
Corrected Total   39         34344840

Regress Y on X1 X2 ... X7: n-8 error DF (7 "model" DF, n-1 "total" DF).

Grades vs. IQ and Study Time

data tests;
   input IQ Study_Time Grade;
   IQ_S = IQ*Study_Time;
cards;
105 10 75
110 12 79
120  6 68
116 13 85
122 16 91
130  8 79
114 20 98
102 15 76
;
proc reg data=tests; model Grade = IQ;
proc reg data=tests; model Grade = IQ Study_Time;

Grade on IQ alone:
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             62.57113         48.24164      1.30     0.2423
IQ           1              0.16369          0.41877      0.39     0.7094

Grade on IQ and Study_Time:
Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1              0.73655         16.26280      0.05     0.9656
IQ             1              0.47308          0.12998      3.64     0.0149
Study_Time     1              2.10344          0.26418      7.96     0.0005

Contrast: TV advertising loses significance when radio is added; IQ gains significance when study time is added.

Model for Grades: Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time.
Question: does an extra hour of study really deliver 2.10 points for everyone, regardless of IQ? The current model only allows this.

proc reg; model Grade = IQ Study_Time IQ_S;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3        610.81033     203.60344     26.22   0.0043
Error              4         31.06467       7.76617
Corrected Total    7        641.87500

Root MSE 2.78678    R-Square 0.9516

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1             72.20608         54.07278      1.34     0.2527
IQ             1             -0.13117          0.45530     -0.29     0.7876
Study_Time     1             -4.11107          4.52430     -0.91     0.4149
IQ_S           1              0.05307          0.03858      1.38     0.2410

"Interaction" model:
Predicted Grade = 72.21 - 0.13 x IQ - 4.11 x Study Time + 0.053 x IQ x Study Time
                = (72.21 - 0.13 x IQ) + (-4.11 + 0.053 x IQ) x Study Time
IQ = 102 predicts Grade = (72.21 - 13.26) + (5.41 - 4.11) x Study Time = 58.95 + 1.30 x Study Time.
IQ = 122 predicts Grade = (72.21 - 15.86) + (6.47 - 4.11) x Study Time = 56.35 + 2.36 x Study Time.
[Figure: the two predicted lines, with slopes 1.30 (IQ = 102) and 2.36 (IQ = 122)]
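Those two lines can be reproduced by plugging the two IQ values into the fitted equation. A minimal SAS sketch; the data set name slopes is my own, and the coefficients are the rounded ones from the slide:

data slopes;
   do IQ = 102, 122;
      intercept = 72.21 - 0.13*IQ;    /* intercept of the Study_Time line */
      slope     = -4.11 + 0.053*IQ;   /* slope of the Study_Time line     */
      output;
   end;
run;
proc print data=slopes noobs; run;    /* 58.95, 1.30 and 56.35, 2.36 (rounded) */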
Adding the interaction makes everything insignificant (individually)! Do we need to omit insignificant terms until only significant ones remain? Has an acquitted defendant proved his innocence? Common sense trumps statistics!

Part 3: Diagnosing Problems in Regression

The main problems are
- multicollinearity (correlation among inputs), and
- outliers.

proc corr; var TV radio sales; run;

Pearson Correlation Coefficients, N = 40
Prob > |r| under H0: Rho=0
              TV       radio       sales
TV       1.00000     0.99737     0.97457
                      <.0001      <.0001
radio    0.99737     1.00000     0.97450
          <.0001                  <.0001
sales    0.97457     0.97450     1.00000
          <.0001      <.0001

[Figure: TV $ and radio $ plotted on principal component axes -- the long axis is P1, the short axis is P2]

Principal Components:
(1) Center and scale the variables to mean 0, variance 1. Call these X1 (TV) and X2 (radio).
(2) With n variables the total variation is n (n = 2 here).
(3) Find the most variable linear combination P1 = __X1 + __X2.
(4) The variances are 1.9973 out of 2 (along the P1 axis) and 0.0027 out of 2 (along the P2 axis), so the standard deviations are sqrt(1.9973) = 1.413 and sqrt(0.0027) = 0.052.

The ratio of standard deviations (27.6) is the "condition number": a large value means an unstable regression. Rule of thumb: a ratio of 1 is perfect, > 30 is problematic. The spread along the long axis is 27.6 times that along the short axis.

Variance Inflation Factor:
(1) Regress predictor i on all the others, getting an r-square Ri2.
(2) The VIF for variable i is 1/(1 - Ri2); it measures collinearity.
(3) VIF > 10 is a problem.

Example:
proc reg data=sales; model sales = TV radio / vif collinoint;

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1            531.11390        359.90429      1.48     0.1485                    0
TV           1              5.00435          5.01845      1.00     0.3251            190.65722
radio        1              4.66752          4.94312      0.94     0.3512            190.65722

Collinearity Diagnostics (intercept adjusted)
                                        --Proportion of Variation--
Number   Eigenvalue   Condition Index          TV        radio
1           1.99737           1.00000     0.00131      0.00131
2           0.00263          27.57948     0.99869      0.99869

We have a MAJOR problem! (Note: other diagnostics besides the VIF and condition number are available.)
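Those numbers can be checked by hand (my arithmetic, not from the slides). With only two predictors, regressing TV on radio gives Ri2 = 0.99737^2 = 0.99475, so VIF = 1/(1 - 0.99475) = 190, matching the printed 190.65722 up to rounding of the correlation. Likewise the condition index is sqrt(1.99737/0.00263) = sqrt(759.5) = 27.6, matching 27.57948.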
Another problem: outliers.

[Figure: scatterplot of TV vs. radio for the 40 stores -- the points hug a straight line]

Example: add one point to the TV-radio data: TV 1021, radio 954, sales 9020.

proc reg; model sales = TV radio / p r;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         33190059      16595030    314.07   <.0001
Error             38          2007865         52839
Corrected Total   40         35197924

Root MSE 229.86639    R-Square 0.9430

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            689.01260        382.52628      1.80     0.0796
TV           1             -6.28994          2.90505     -2.17     0.0367
radio        1             15.78081          2.86870      5.50     <.0001

Note that this one added point flips the TV coefficient from +5.0 to -6.3!

       Dependent   Predicted                Std Error    Student
Obs     Variable       Value     Residual    Residual   Residual   Cook's D
39          9277        9430    -153.4358       225.3     -0.681      0.006
40         10130        9759     370.5848       226.1      1.639      0.030
41          9020        9322    -301.8727       121.9     -2.476      5.224  ???????

The ordinary residual for store 41 is not too bad (-301.87).

PRESS residuals:
(1) Remove store i, with sales Y(i).
(2) Fit the model to the other 40 stores.
(3) Get the model prediction P(i) for store i.
(4) The PRESS residual is Y(i) - P(i).

[Figure: regular (O) and PRESS (dot) residuals by store number; store 41 stands out]

proc reg data=raw;
   model sales = TV radio;
   output out=out1 r=r press=press;
run;

[Figure: the data viewed along the P2 axis (2nd principal component)]

Part 4: Classification Variables (dummy variables, indicator variables)

Model for NC crashes involving deer:
Predicted Accidents = 1181 + 2579 X11, where X11 is 1 in November and 0 elsewhere.
Interpretation: in November, predict 1181 + 2579(1) = 3760; in any other month, predict 1181 + 2579(0) = 1181.
1181 is the average of the other months; 2579 is the added November effect (vs. the average of the others).

proc reg data=deer; model deer = X11;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         30473250      30473250     90.45   <.0001
Error             58         19539666        336891
Corrected Total   59         50012916

Root MSE 580.42294    R-Square 0.6093

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           1181.09091         78.26421     15.09     <.0001
X11          1           2578.50909        271.11519      9.51     <.0001

Looks like December and October need dummies too!

proc reg data=deer; model deer = X10 X11 X12;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         46152434      15384145    223.16   <.0001
Error             56          3860482         68937
Corrected Total   59         50012916

Root MSE 262.55890    R-Square 0.9228

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            929.40000         39.13997     23.75     <.0001
X10          1           1391.20000        123.77145     11.24     <.0001
X11          1           2830.20000        123.77145     22.87     <.0001
X12          1           1377.40000        123.77145     11.13     <.0001

The average of January through September is 929 crashes per month. Add 1391 in October, 2830 in November, and 1377 in December.

What the heck -- let's do all but one (we need an "average of the rest," so we must leave out at least one).

proc reg data=deer; model deer = X1 X2 ... X10 X11;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             11         48421690       4401972    132.79   <.0001
Error             48          1591226         33151
Corrected Total   59         50012916

Root MSE 182.07290    R-Square 0.9682

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           2306.80000         81.42548     28.33     <.0001
X1           1           -885.80000        115.15301     -7.69     <.0001
X2           1          -1181.40000        115.15301    -10.26     <.0001
X3           1          -1220.20000        115.15301    -10.60     <.0001
X4           1          -1486.80000        115.15301    -12.91     <.0001
X5           1          -1526.80000        115.15301    -13.26     <.0001
X6           1          -1433.00000        115.15301    -12.44     <.0001
X7           1          -1559.20000        115.15301    -13.54     <.0001
X8           1          -1646.20000        115.15301    -14.30     <.0001
X9           1          -1457.20000        115.15301    -12.65     <.0001
X10          1             13.80000        115.15301      0.12     0.9051
X11          1           1452.80000        115.15301     12.62     <.0001

With every month but December getting a dummy, the "average of the rest" is just the December mean, 2307. Subtract 886 in January, add 1452 in November. October (X10) is not significantly different from December.
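The monthly dummies (and the date variable used next) can be built in one data step. A minimal sketch; the input data set name deer0 is my own, and it assumes a SAS date variable date and a crash count deer:

data deer;
   set deer0;                          /* one row per month: date, deer count */
   array X{11} X1-X11;                 /* dummies for Jan ... Nov; Dec left out */
   do m = 1 to 11;
      X{m} = (month(date) = m);        /* 1 in month m, 0 otherwise */
   end;
   drop m;
run;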
[Figure: residuals by date, negative early and positive late]

Add date (days since Jan 1, 1960, in SAS) to capture the trend.

proc reg data=deer; model deer = date X1 X2 ... X10 X11;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             12         49220571       4101714    243.30   <.0001
Error             47           792345         16858
Corrected Total   59         50012916

Root MSE 129.83992    R-Square 0.9842

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          -1439.94000        547.36656     -2.63     0.0115
X1           1           -811.13686         82.83115     -9.79     <.0001
X2           1          -1113.66253         82.70543    -13.47     <.0001
X3           1          -1158.76265         82.60154    -14.03     <.0001
X4           1          -1432.28832         82.49890    -17.36     <.0001
X5           1          -1478.99057         82.41114    -17.95     <.0001
X6           1          -1392.11624         82.33246    -16.91     <.0001
X7           1          -1525.01849         82.26796    -18.54     <.0001
X8           1          -1618.94416         82.21337    -19.69     <.0001
X9           1          -1436.86982         82.17106    -17.49     <.0001
X10          1             27.42792         82.14183      0.33     0.7399
X11          1           1459.50226         82.12374     17.77     <.0001
date         1              0.22341          0.03245      6.88     <.0001

The trend is 0.22 more accidents per day (about 1 per 5 days) and is significantly different from 0.

Part 5: Logistic Regression

The problem: the response is binary -- yes or no, accident or no accident, claim or no claim, at fault or not at fault. The prediction is a predicted probability (of fault, for example).

The logistic idea (p = probability of the fabric igniting):
- Map p in (0,1) to L on the whole real line: L = ln(p/(1-p)).
- Model L as linear in flame exposure time: Predicted L = a + b(time).
- Given X, compute L = a + bX, then p = e^L/(1+e^L), i.e. p(i) = e^(a+bXi)/(1+e^(a+bXi)).
- Write down p(i) if observation i is a response, 1-p(i) if not.
- Multiply all n of these together to get the likelihood function Q(a,b); find the a and b that maximize it.

Example: Ignition
- Flame exposure time = X; ignited Y=1, did not ignite Y=0.
- Y=1 at X = 11, 12, 14, 15, 17, 25, 30.
- Y=0 at X = 3, 5, 9, 10, 13, 16.
- Sorting by X: Q = (1-p)(1-p)(1-p)(1-p)pp(1-p)pp(1-p)ppp, where the p's are all different: p = f(exposure time).
- Find a, b to maximize the likelihood Q(a,b).

[Figure: likelihood surface Q(a,b), maximized near a = -2.6, b = 0.23]

Example: Shuttle Missions
- O-rings failed in the Challenger disaster; the liftoff temperature was low.
- Prior flights showed "erosion" and "blowby" in the O-rings.
- Feature: temperature at liftoff.
- Target: problem (1), i.e. erosion or blowby, vs. no problem (0).
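The ignition fit can be reproduced with PROC LOGISTIC. A minimal sketch; the data set name ignite is my own, and the fitted intercept and slope should land near the a = -2.6, b = 0.23 marked on the likelihood surface above:

data ignite;               /* exposure time x, ignition indicator y */
   input x y @@;
cards;
11 1 12 1 14 1 15 1 17 1 25 1 30 1
 3 0  5 0  9 0 10 0 13 0 16 0
;
proc logistic data=ignite;
   model y(event='1') = x;  /* models P(ignite); maximizes Q(a,b) */
run;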