Chapter 6: Multiple Linear Regression
Data Mining for Business Intelligence
Shmueli, Patel & Bruce

Topics
- Explanatory vs. predictive modeling with regression
- Example: prices of Toyota Corollas
- Fitting a predictive model
- Assessing predictive accuracy
- Selecting a subset of predictors

Explanatory Modeling
- Goal: explain the relationship between the predictors (explanatory variables) and the target
- The familiar use of regression in data analysis
- Model goal: fit the data well and understand the contribution of each explanatory variable to the model
- "Goodness of fit": R2, residual analysis, p-values

Predictive Modeling
- Goal: predict target values in other data where we have predictor values but not target values
- The classic data mining context
- Model goal: optimize predictive accuracy
- Train the model on training data; assess performance on validation (hold-out) data
- Explaining the role of the predictors is not the primary purpose (though it is useful)

Example: Prices of Toyota Corollas (ToyotaCorolla.xls)
- Goal: predict prices of used Toyota Corollas based on their specifications
- Data: prices of 1442 used Toyota Corollas, with their specification information

Data Sample (showing only the variables to be used in the analysis)

Price  Age  KM     Fuel_Type  HP   Metallic  Automatic  cc    Doors  Quarterly_Tax  Weight
13500  23   46986  Diesel      90  1         0          2000  3      210            1165
13750  23   72937  Diesel      90  1         0          2000  3      210            1165
13950  24   41711  Diesel      90  1         0          2000  3      210            1165
14950  26   48000  Diesel      90  0         0          2000  3      210            1165
13750  30   38500  Diesel      90  0         0          2000  3      210            1170
12950  32   61000  Diesel      90  0         0          2000  3      210            1170
16900  27   94612  Diesel      90  1         0          2000  3      210            1245
18600  30   75889  Diesel      90  1         0          2000  3      210            1245
21500  27   19700  Petrol     192  0         0          1800  3      100            1185
12950  23   71138  Diesel      69  0         0          1900  3      185            1105
20950  25   31461  Petrol     192  0         0          1800  3      100            1185

Variables Used
- Price (in Euros)
- Age (in months, as of 8/04)
- KM (kilometers)
- Fuel_Type (Diesel, Petrol, CNG)
- HP (horsepower)
- Metallic color (1 = yes, 0 = no)
- Automatic transmission (1 = yes, 0 = no)
- cc (cylinder volume)
- Doors
- Quarterly_Tax (road tax)
- Weight (in kg)

Preprocessing
- Fuel type is categorical and must be transformed into binary (dummy) variables:
  Diesel (1 = yes, 0 = no) and CNG (1 = yes, 0 = no)
- No dummy is needed for "Petrol" (the reference category)

Subset of the records selected for the training partition (limited number of variables shown)

Id  Model                                            Price  Age_08_04  Mfg_Month  Mfg_Year  KM     Fuel_Type_Diesel  Fuel_Type_Petrol
1   Corolla 2.0 D4D HATCHB TERRA 2/3-Doors           13500  23         10         2002      46986  1                 0
4   Corolla 2.0 D4D HATCHB TERRA 2/3-Doors           14950  26          7         2002      48000  1                 0
5   Corolla 2.0 D4D HATCHB SOL 2/3-Doors             13750  30          3         2002      38500  1                 0
6   Corolla 2.0 D4D HATCHB SOL 2/3-Doors             12950  32          1         2002      61000  1                 0
9   Corolla 1800 T SPORT VVT I 2/3-Doors             21500  27          6         2002      19700  0                 1
10  Corolla 1.9 D HATCHB TERRA 2/3-Doors             12950  23         10         2002      71138  1                 0
12  Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors  19950  22         11         2002      43610  0                 1
17  Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors      22750  30          3         2002      34000  0                 1

Partition: 60% training data / 40% validation data

The Fitted Regression Model
(Note that this output includes both a Fuel_Type_Diesel and a Fuel_Type_Petrol dummy, so CNG serves as the reference category in the fitted model.)

Input variable    Coefficient    Std. Error   p-value     SS
Constant term     -3608.418457   1458.620728  0.0137      97276410000
Age_08_04         -123.8319168   3.367589     0           8033339000
KM                -0.017482      0.00175105   0           251574500
Fuel_Type_Diesel  210.9862518    474.9978333  0.6571036   6212673
Fuel_Type_Petrol  2522.066895    463.6594238  0.00000008  4594.9375
HP                20.71352959    4.67398977   0.00001152  330138600
Met_Color         -50.48505402   97.85591125  0.60614568  596053.75
Automatic         178.1519013    212.0528565  0.40124047  19223190
cc                0.01385481     0.09319961   0.88188446  1272449
Doors             20.02487946    51.0899086   0.69526076  39265060
Quarterly_Tax     16.7742424     2.09381151   0           160667200
Weight            15.41666317    1.40446579   0           214696000
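The slides show XLMiner output. Below is a minimal sketch of the same fit in Python, assuming the data are available as a file named ToyotaCorolla.csv with the column names shown in the data sample above; since the 60/40 partition is drawn at random, the estimates will not match the table exactly.

```python
# Minimal sketch of the 12-predictor fit (the slides use XLMiner).
# Assumes ToyotaCorolla.csv with the column names from the data sample above.
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

df = pd.read_csv("ToyotaCorolla.csv")

predictors = ["Age_08_04", "KM", "Fuel_Type", "HP", "Met_Color", "Automatic",
              "cc", "Doors", "Quarterly_Tax", "Weight"]

# Dummy-code the categorical Fuel_Type; dropping the CNG dummy makes CNG the
# reference category, matching the Diesel/Petrol dummies in the output above.
X = pd.get_dummies(df[predictors], columns=["Fuel_Type"])
X = X.drop(columns=["Fuel_Type_CNG"]).astype(float)
X = sm.add_constant(X)              # constant term
y = df["Price"]

# 60% training / 40% validation partition
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=1)

model = sm.OLS(y_train, X_train).fit()
print(model.summary())              # coefficients, std. errors, p-values
```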
Error Reports

Training Data scoring - Summary Report
- Total sum of squared errors: 1514553377
- RMS error: 1325.527246
- Average error: -0.000426154

Validation Data scoring - Summary Report
- Total sum of squared errors: 1021587500
- RMS error: 1334.079894
- Average error: 116.3728779

Predicted Values
- Predicted price computed using the regression coefficients
- Residual = actual price minus predicted price

Predicted Value  Actual Value  Residual
15863.86944      13750         -2113.869439
16285.93045      13950         -2335.930454
16222.95248      16900         677.047525
16178.77221      18600         2421.227789
19276.03039      20950         1673.969611
19263.30349      19600         336.6965066
18630.46904      21500         2869.530964
18312.04498      22500         4187.955022
19126.94064      22000         2873.059357
16808.77828      16950         141.2217206
15885.80362      16950         1064.196384
15873.97887      16250         376.0211263
15601.22471      15750         148.7752903
15476.63164      15950         473.3683568
15544.83584      14950         -594.835836
15562.25552      14750         -812.2555172
15222.12869      16750         1527.871313
17782.33234      19000         1217.667664

Distribution of Residuals
- Roughly symmetric distribution
- Some outliers

Selecting Subsets of Predictors
- Goal: find a parsimonious model (the simplest model that performs sufficiently well)
  - More robust
  - Higher predictive accuracy
- Exhaustive search
- Partial search algorithms: forward selection, backward elimination, stepwise

Exhaustive Search
- All possible subsets of predictors are assessed (single predictors, pairs, triplets, etc.)
- Computationally intensive (see the code sketch below)
- Judge the subsets by adjusted R2:

  R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)

  where n is the number of records and p the number of predictors; the (n - 1)/(n - p - 1) factor is the penalty for the number of predictors

Forward Selection
- Start with no predictors
- Add predictors one by one (at each step, add the one with the largest contribution)
- Stop when the next addition is not statistically significant

Backward Elimination
- Start with all predictors
- Successively eliminate the least useful predictor, one at a time
- Stop when all remaining predictors have a statistically significant contribution

Stepwise
- Like forward selection, except that at each step we also consider dropping predictors that are no longer statistically significant

Backward Elimination (showing the last 7 models)
- Model 1: Constant, Age_08_04
- Model 2: Constant, Age_08_04, Weight
- Model 3: Constant, Age_08_04, KM, Weight
- Model 4: Constant, Age_08_04, KM, Fuel_Type_Petrol, Weight
- Model 5: Constant, Age_08_04, KM, Fuel_Type_Petrol, Quarterly_Tax, Weight
- Model 6: Constant, Age_08_04, KM, Fuel_Type_Petrol, HP, Quarterly_Tax, Weight
- Model 7: Constant, Age_08_04, KM, Fuel_Type_Petrol, HP, Automatic, Quarterly_Tax, Weight

The top model has a single predictor (Age_08_04), the second model has two predictors, and so on.

All 12 Models
[Table of the 12 candidate models produced by the subset selection run]

Diagnostics for the 12 Models
A good model has:
- A high adjusted R2
- Mallows' Cp roughly equal to the number of coefficients in the model (number of predictors + 1)

Next Step
- Subset selection methods give candidate models that might be "good models"
- They do not guarantee that the "best" model found is indeed best
- Even the "best" model can have insufficient predictive accuracy
- The candidates must therefore be run and their predictive accuracy assessed (in XLMiner, click "choose subset")
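Before turning to one of those candidates, here is a rough sketch of how the exhaustive search judged by adjusted R2 (from the Exhaustive Search slide above) could be scripted in Python, assuming the X_train and y_train objects from the earlier fitting sketch are available; forward selection, backward elimination, and stepwise would instead add or drop one predictor at a time.

```python
# Sketch: exhaustive search over all predictor subsets, scored by adjusted R2.
# Assumes X_train (with a 'const' column) and y_train from the fitting sketch.
from itertools import combinations
import statsmodels.api as sm

def adj_r2(fit, n, p):
    # R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    return 1 - (1 - fit.rsquared) * (n - 1) / (n - p - 1)

candidates = [c for c in X_train.columns if c != "const"]
best = {}                                    # best subset found for each size p
for p in range(1, len(candidates) + 1):
    for subset in combinations(candidates, p):
        exog = sm.add_constant(X_train[list(subset)])
        fit = sm.OLS(y_train, exog).fit()
        score = adj_r2(fit, n=len(y_train), p=p)
        if p not in best or score > best[p][0]:
            best[p] = (score, subset)

for p, (score, subset) in sorted(best.items()):
    print(f"{p:2d} predictors  adj-R2 = {score:.4f}  {subset}")
```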
Model with Only 6 Predictors

The Regression Model

Input variable    Coefficient    Std. Error   p-value     SS
Constant term     -3874.492188   1415.003052  0.00640071  97276411904
Age_08_04         -123.4366303   3.33806777   0           8033339392
KM                -0.01749926    0.00173714   0           251574528
Fuel_Type_Petrol  2409.154297    319.5795288  0           5049567
HP                19.70204735    4.22180223   0.00000394  291336576
Quarterly_Tax     16.88731384    2.08484554   0           192390864
Weight            15.91809368    1.26474357   0           281026176

Training Data scoring - Summary Report (model fit)
- Total sum of squared errors: 1516825972
- RMS error: 1326.521353
- Average error: -0.000143957

Validation Data scoring - Summary Report (predictive performance; compare to the 12-predictor model)
- Total sum of squared errors: 1021510219
- RMS error: 1334.029433
- Average error: 118.4483556

Summary
- Linear regression models are very popular tools, not only for explanatory modeling but also for prediction
- A good predictive model has high predictive accuracy (to a practically useful level)
- Predictive models are built on a training data set and evaluated on a separate validation data set
- Removing redundant predictors is key to achieving predictive accuracy and robustness
- Subset selection methods help find "good" candidate models; these candidates should then be run and their predictive accuracy assessed (see the sketch below)
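To illustrate that final point, a small sketch, again assuming the objects from the earlier sketches, that runs the full 12-predictor model and the 6-predictor candidate and compares their RMS errors on the validation partition:

```python
# Sketch: run two candidate models and compare their validation RMS errors.
# Assumes X_train, X_valid, y_train, y_valid from the earlier sketches.
import numpy as np
import statsmodels.api as sm

cols6 = ["const", "Age_08_04", "KM", "Fuel_Type_Petrol",
         "HP", "Quarterly_Tax", "Weight"]

full  = sm.OLS(y_train, X_train).fit()          # 12-predictor model
small = sm.OLS(y_train, X_train[cols6]).fit()   #  6-predictor model

def rmse(fit, X, y):
    residuals = y - fit.predict(X)
    return float(np.sqrt((residuals ** 2).mean()))

print("12 predictors, validation RMS error:", rmse(full, X_valid, y_valid))
print(" 6 predictors, validation RMS error:", rmse(small, X_valid[cols6], y_valid))
```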