ICS422 Applied Predictive Analytics [3-0-0-3]
Multiple Linear Regression
Class 19
Presented by Dr. Selvi C, Assistant Professor, IIIT Kottayam

REGIONAL DELIVERY SERVICE
• Let's assume that you are the owner of Regional Delivery Service, Inc. (RDS), a small business that offers same-day delivery for letters, packages, and other small cargo.
• You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs. Therefore some trips will have more than one delivery.
• As the owner, you would like to be able to estimate how long a trip will take based on two factors: 1) the total distance of the trip in miles and 2) the number of deliveries that must be made during the trip.

RDS DATA AND VARIABLE NAMING
• To conduct your analysis you take a random sample of 10 past trips and record three pieces of information for each trip: 1) total miles traveled, 2) number of deliveries, and 3) total travel time in hours.

  milesTraveled (x1)   numDeliveries (x2)   travelTime(hrs) (y)
  89                   4                    7.0
  66                   1                    5.4
  78                   3                    6.6
  111                  6                    7.4
  44                   1                    4.8
  77                   3                    6.4
  80                   3                    7.0
  66                   2                    5.6
  109                  5                    7.3
  76                   3                    6.4

• Remember that in this case, you would like to be able to predict the total travel time using both the miles traveled and the number of deliveries on each trip.
• In what way does travel time DEPEND on the first two measures?
• Travel time is the dependent variable; miles traveled and number of deliveries are independent variables.

Multiple Linear Regression
• Simple linear regression is one-to-one: one IV -> one DV.
• Multiple linear regression extends simple linear regression to many-to-one: IV1 + IV2 + … -> DV.

New Considerations
• Adding more independent variables to a multiple regression does not mean the regression will be "better" or offer better predictions; in fact it can make things worse. This is called OVERFITTING.
• The addition of more independent variables creates more relationships among them. So not only are the independent variables potentially related to the dependent variable, they are also potentially related to each other. When this happens, it is called MULTICOLLINEARITY.
• The ideal is for all of the independent variables to be correlated with the dependent variable but NOT with each other.
• Because of multicollinearity and overfitting, there is a fair amount of pre-work to do BEFORE conducting a multiple regression analysis if one is to do it properly:
  • Correlations
  • Scatter plots
  • Simple regressions

Multicollinearity
• Travel time is the dependent variable; miles traveled and number of deliveries are independent variables.
• Some independent variables, or sets of independent variables, are better at predicting the dependent variable than others. Some contribute nothing.

Multiple Regression Model
• Multiple regression model:
  y = β0 + β1x1 + β2x2 + … + βpxp + ε
• Multiple regression equation (the error term ε is assumed to have expected value zero):
  E(y) = β0 + β1x1 + β2x2 + … + βpxp
• Estimated multiple regression equation:
  ŷ = b0 + b1x1 + b2x2 + … + bpxp

Coefficient Interpretation

INTERPRETING COEFFICIENTS
• ŷ = 27 + 9x1 + 12x2
• x1 = capital investment ($1000s)
• x2 = marketing expenditures ($1000s)
• ŷ = predicted sales ($1000s)
• In multiple regression, each coefficient is interpreted as the estimated change in y corresponding to a one-unit change in that variable, when all other variables are held constant.
• So in this example, $9,000 is an estimate of the expected increase in sales (y) corresponding to a $1,000 increase in capital investment (x1) when marketing expenditures (x2) are held constant.
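As a side note (not part of the original slides), here is a minimal sketch of fitting the estimated multiple regression equation to the RDS data in Python, assuming pandas and statsmodels are installed; the column names simply mirror the table above.

```python
# Minimal sketch: fit travelTime on milesTraveled and numDeliveries (RDS data).
import pandas as pd
import statsmodels.api as sm

rds = pd.DataFrame({
    "milesTraveled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "travelTime":    [7.0, 5.4, 6.6, 7.4, 4.8, 6.4, 7.0, 5.6, 7.3, 6.4],
})

X = sm.add_constant(rds[["milesTraveled", "numDeliveries"]])  # adds the b0 column
fit = sm.OLS(rds["travelTime"], X).fit()
print(fit.params)    # b0, b1, b2 of the estimated equation
print(fit.summary()) # R-squared, t-stats, p-values, confidence intervals
```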
• Multiple regression is an extension of simple linear regression.
• Two or more independent variables are used to predict/explain the variance in one dependent variable.
• Two problems may arise:
  • Overfitting
  • Multicollinearity
• Overfitting is caused by adding too many independent variables; they account for more variance but add nothing to the model.
• Multicollinearity happens when some or all of the independent variables are correlated with each other.
• In multiple regression, each coefficient is interpreted as the estimated change in y corresponding to a one-unit change in that variable, when all other variables are held constant.

REGIONAL DELIVERY SERVICE
• Let's assume again that you are the owner of Regional Delivery Service, Inc. (RDS), which offers same-day delivery for letters, packages, and other small cargo.
• You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs. Therefore some trips will have more than one delivery.
• As the owner, you would now like to be able to estimate how long a delivery will take based on three factors: 1) the total distance of the trip in miles, 2) the number of deliveries that must be made during the trip, and 3) the daily price of gas/petrol in U.S. dollars.

Multicollinearity

MULTIPLE REGRESSION
Conducting multiple regression analysis requires a fair amount of pre-work before actually running the regression. Here are the steps:
1. Generate a list of potential variables: independent(s) and dependent
2. Collect data on the variables
3. Check the relationships between each independent variable and the dependent variable using scatterplots and correlations
4. Check the relationships among the independent variables using scatterplots and correlations
5. (Optional) Conduct simple linear regressions for each IV/DV pair
6. Use the non-redundant independent variables in the analysis to find the best-fitting model
7. Use the best-fitting model to make predictions about the dependent variable

RDS DATA AND VARIABLE NAMING
To conduct your analysis you take a random sample of 10 past trips and record four pieces of information for each trip: 1) total miles traveled, 2) number of deliveries, 3) the daily gas price, and 4) total travel time in hours.

  milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)   travelTime(hrs) (y)
  89                   4                    3.84            7.0
  66                   1                    3.19            5.4
  78                   3                    3.78            6.6
  111                  6                    3.89            7.4
  44                   1                    3.57            4.8
  77                   3                    3.57            6.4
  80                   3                    3.03            7.0
  66                   2                    3.51            5.6
  109                  5                    3.54            7.3
  76                   3                    3.25            6.4
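Steps 3 and 4 of the pre-work can be carried out directly on this table. A minimal sketch (not from the original slides), assuming pandas and matplotlib are installed:

```python
# Pre-work sketch for steps 3 and 4: correlations and scatterplots.
import pandas as pd
import matplotlib.pyplot as plt

rds = pd.DataFrame({
    "milesTraveled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gasPrice":      [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
    "travelTime":    [7.0, 5.4, 6.6, 7.4, 4.8, 6.4, 7.0, 5.6, 7.3, 6.4],
})

print(rds.corr().round(3))  # IV vs DV and IV vs IV correlations in one matrix

# One scatterplot per IV/DV pair (step 3); swap in IV pairs for step 4
for iv in ["milesTraveled", "numDeliveries", "gasPrice"]:
    rds.plot.scatter(x=iv, y="travelTime", title=f"travelTime vs {iv}")
plt.show()
```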
SCATTERPLOT SUMMARY
• Dependent variable vs independent variables:
  • travelTime (y) appears highly correlated with milesTraveled (x1)
  • travelTime (y) appears highly correlated with numDeliveries (x2)
  • travelTime (y) DOES NOT appear highly correlated with gasPrice (x3)
• Since gasPrice (x3) does NOT APPEAR CORRELATED with the dependent variable, we would not use that variable in the multiple regression.
• Note: for now, we will keep gasPrice in and then take it out later for learning purposes.

Multicollinearity Check
• Independent variable vs independent variable:
  • numDeliveries (x2) APPEARS highly correlated with milesTraveled (x1); this is multicollinearity
  • milesTraveled (x1) does not appear highly correlated with gasPrice (x3)
  • gasPrice (x3) does not appear correlated with numDeliveries (x2)
• Since numDeliveries is HIGHLY CORRELATED with milesTraveled, we would not use BOTH in the multiple regression; they are redundant.
• Note: for now, we will keep both in and then take one out later for learning purposes.

Correlations
  Correlation           milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)
  numDeliveries (x2)    0.956
  gasPrice (x3)         0.356                0.498
  travelTime(hrs) (y)   0.928                0.916                0.267

Scatterplot (DV vs IV)
[Scatterplots of travelTime against each IV: r = 0.928 (milesTraveled), r = 0.916 (numDeliveries), r = 0.267 (gasPrice)]

IV Scatterplots (Multicollinearity)
[Scatterplots of each IV pair: r = 0.956 (milesTraveled vs numDeliveries), r = 0.356 (milesTraveled vs gasPrice), r = 0.498 (numDeliveries vs gasPrice)]

Understanding
• Correlation analysis confirms the conclusions reached by visual examination of the scatterplots.
• Redundant multicollinear variables:
  • milesTraveled and numDeliveries are both highly correlated with each other and therefore redundant; only one should be used in the multiple regression analysis.
• Non-contributing variables:
  • gasPrice is NOT correlated with the dependent variable and should be excluded.
• For the sake of learning, we are going to break the rules and include all three independent variables in the regression at first.
• Then we will remove the problematic independent variables, as we should, and watch what happens to the regression results.
• We will also perform simple regressions with the dependent variable to use as a baseline (again, for the sake of learning).

Example
Blood-pressure data for 20 individuals (BP = blood pressure, BSA = body surface area):

  BP    Weight   Age   BSA    Dur    Pulse   Stress
  105   85.4     47    1.75   5.1    63      33
  115   94.2     49    2.10   3.8    70      14
  116   95.3     49    1.98   8.2    72      10
  117   94.7     50    2.01   5.8    73      99
  112   89.4     51    1.89   7.0    72      95
  121   99.5     48    2.25   9.3    71      10
  121   99.8     49    2.25   2.5    69      42
  110   90.9     47    1.90   6.2    66      8
  110   89.2     49    1.83   7.1    69      62
  114   92.7     48    2.07   5.6    64      35
  114   94.4     47    2.07   5.3    74      90
  115   94.1     49    1.98   5.6    71      21
  114   91.6     50    2.05   10.2   68      47
  106   87.1     45    1.92   5.6    67      80
  125   101.3    52    2.19   10.0   76      98
  114   94.5     46    1.98   7.4    69      95
  106   87.0     46    1.87   3.6    62      18
  113   94.5     46    1.90   4.3    70      12
  110   90.5     48    1.88   9.0    71      99
  122   95.7     56    2.09   7.0    75      99

Pairwise correlations:

           BP        Age       Weight    BSA       Dur       Pulse     Stress
  BP       1
  Age      0.659093  1
  Weight   0.950068  0.407349  1
  BSA      0.865879  0.378455  0.875305  1
  Dur      0.292834  0.343792  0.200650  0.130540  1
  Pulse    0.721413  0.618764  0.659340  0.464819  0.401514  1
  Stress   0.163901  0.368224  0.034355  0.018446  0.311640  0.506310  1
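The same pairwise-correlation screen can be automated. A sketch under stated assumptions: the 20-row table above is stored in a file named bloodpress.csv (a hypothetical name), and pandas is installed.

```python
# Sketch: flag highly correlated predictor pairs before model fitting.
# "bloodpress.csv" is a hypothetical file holding the 20-row table above.
import pandas as pd

bp = pd.read_csv("bloodpress.csv")
corr = bp.drop(columns="BP").corr()  # correlations among predictors only

# Report predictor pairs whose |r| exceeds a chosen cutoff, e.g. 0.8
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if abs(corr.loc[a, b]) > 0.8:
            print(a, b, round(corr.loc[a, b], 3))  # expect: Weight BSA 0.875
```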
Variance Inflation Factor (VIF)
• A variance inflation factor (VIF) provides a measure of multicollinearity among the independent variables in a multiple regression model.
• Detecting multicollinearity is important because, while multicollinearity does not reduce the explanatory power of the model, it does reduce the statistical significance of the independent variables.
• A large VIF on an independent variable indicates a highly collinear relationship with the other variables; this should be considered or adjusted for in the structure of the model and the selection of independent variables.
• The VIF for predictor k is VIFk = 1 / (1 - Rk²), where Rk² is the R-squared from regressing predictor k on all the other predictors.

Variance Inflation Factor (VIF): full model
SUMMARY OUTPUT (BP regressed on all six predictors)

Regression Statistics
  Multiple R         0.998073
  R Square           0.996150
  Adjusted R Square  0.994373
  Standard Error     0.407229
  Observations       20

ANOVA
              df   SS         MS         F         Significance F
  Regression  6    557.8441   92.97402   560.641   6.4E-15
  Residual    13   2.155858   0.165835
  Total       19   560

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   VIF
  Intercept   -12.8705       2.55665          -5.03412   0.000229   -18.3938    -7.34717
  Age         0.703259       0.049606         14.17696   2.76E-09   0.596093    0.810426    1.76
  Weight      0.969920       0.063108         15.36909   1.02E-09   0.833582    1.106257    8.417035
  BSA         3.776491       1.580151         2.389956   0.032694   0.362783    7.190199    5.33
  Dur         0.068383       0.048441         1.411663   0.181534   -0.03627    0.173035    1.24
  Pulse       -0.08448       0.051609         -1.63702   0.125594   -0.19598    0.02701     4.41
  Stress      0.005572       0.003412         1.63277    0.126491   -0.00180    0.012943    1.834845

• Three of the VIFs (8.42, 5.33, and 4.41) are fairly large. The VIF for the predictor Weight, for example, tells us that the variance of the estimated coefficient of Weight is inflated by a factor of 8.42 because Weight is highly correlated with at least one of the other predictors in the model.
• The output also reports the underlying R-squared values for Weight (0.881193, so VIF = 1/(1 - 0.881193) = 8.42) and Stress (0.454995, so VIF = 1.83).

Variance Inflation Factor (VIF): Weight as dependent variable
SUMMARY OUTPUT (Weight regressed on the other five predictors)

Regression Statistics
  Multiple R         0.938719
  R Square           0.881193
  Adjusted R Square  0.838762
  Standard Error     1.724594
  Observations       20

ANOVA
              df   SS         MS         F         Significance F
  Regression  5    308.8389   61.76777   20.7677   5.05E-06
  Residual    14   41.63913   2.974223
  Total       19   350.478

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
  Intercept   19.67444       9.464742         2.078708   0.056513   -0.62541    39.97429
  Age         -0.14464       0.206491         -0.70048   0.495103   -0.58752    0.298235
  BSA         21.42165       3.464586         6.183034   2.38E-05   13.99086    28.85245
  Dur         0.008696       0.205134         0.042394   0.966783   -0.43127    0.448665
  Pulse       0.557697       0.159853         3.488813   0.003615   0.214847    0.900548
  Stress      -0.023000      0.013079         -1.75834   0.100520   -0.05105    0.005054

Note that the R-squared of 0.881193 here is exactly the Rk² that produces Weight's VIF of 8.42 in the previous output.

Understanding
• We see that the predictors Weight and BSA are highly correlated (r = 0.875). We can choose to remove either predictor from the model; the decision of which one to remove is often a scientific or practical one.
• For example, if the researchers here are interested in using their final model to predict the blood pressure of future individuals, their choice should be clear. Which of the two measurements, body surface area or weight, do you think would be easier to obtain?
• If weight is an easier measurement to obtain than body surface area, then the researchers would be well advised to remove BSA from the model and leave Weight in the model.
• Reviewing the pairwise correlations again, we see that the predictor Pulse also exhibits fairly strong marginal correlations with several of the predictors, including Age (r = 0.619), Weight (r = 0.659), and Stress (r = 0.506). Therefore, the researchers could also consider removing the predictor Pulse from the model.
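In practice the VIFs need not be obtained by running one auxiliary regression per predictor by hand. A minimal sketch using statsmodels' built-in helper (the CSV file name is hypothetical, as before):

```python
# Sketch: VIFs for the blood-pressure predictors via statsmodels.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

bp = pd.read_csv("bloodpress.csv")  # hypothetical file from the earlier sketch
X = sm.add_constant(bp[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])

# Index 0 is the constant column, so the predictors start at index 1
for i, name in enumerate(X.columns[1:], start=1):
    print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.2f}")
# Expect roughly: Age 1.76, Weight 8.42, BSA 5.33, Dur 1.24, Pulse 4.41, Stress 1.83
```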
After removal
With BSA and Pulse dropped, the model is refit on Weight, Age, Dur, and Stress (the same 20 observations as before). Pairwise correlations among the remaining variables:

           BP        Weight    Age       Dur       Stress
  BP       1
  Weight   0.950068  1
  Age      0.659093  0.407349  1
  Dur      0.292834  0.200650  0.343792  1
  Stress   0.163901  0.034355  0.368224  0.311640  1

SUMMARY OUTPUT

Regression Statistics
  Multiple R         0.995934
  R Square           0.991884
  Adjusted R Square  0.989719
  Standard Error     0.550462
  Observations       20

ANOVA
              df   SS         MS         F          Significance F
  Regression  4    555.4549   138.8637   458.2834   1.76E-15
  Residual    15   4.545126   0.303008
  Total       19   560

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   VIF
  Intercept   -15.8698       3.195296         -4.96662   0.000169   -22.6804    -9.05922
  Weight      1.034128       0.032672         31.65228   3.76E-15   0.964491    1.103766    1.234653
  Age         0.683741       0.061195         11.17309   1.14E-08   0.553306    0.814176    1.47
  Dur         0.039889       0.064486         0.618561   0.545485   -0.09756    0.177339    1.20
  Stress      0.002184       0.003794         0.575777   0.573304   -0.00590    0.010270    1.24

(Weight's VIF again comes from its R-squared of 0.190056 against the other predictors: 1/(1 - 0.190056) = 1.23.)

• The remaining variance inflation factors are quite satisfactory: hardly any variance inflation remains.
• Incidentally, in terms of adjusted R-squared we did not lose much by dropping the two predictors BSA and Pulse from our model: the adjusted R-squared decreased only to 98.97% from the original 99.44%.

Dummy Variable

Scenario – Dummy Variable
• You are an analyst for a small company that develops house-pricing models for independent realtors. To generate your models you use publicly available data such as list price, square footage, number of bedrooms, number of bathrooms, etc.
• You are interested in another question: Is the public high school in the neighborhood "Exemplary" (the highest rating), and how is that rating related to home price?
• High school rating is not quantitative, it is qualitative (categorical). For each home the high school is either exemplary or not: yes or no. (A coding sketch follows; the coded data is shown after it.)
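A minimal sketch (not from the slides) of turning the yes/no rating into a 0/1 indicator with pandas; the column names simply mirror the table below.

```python
# Sketch: coding a two-category variable as a single 0/1 dummy.
import pandas as pd

homes = pd.DataFrame({
    "price1000s": [145, 69.9, 315, 144.9],                  # first rows only
    "school": ["Not exemplary", "Not exemplary", "Exemplary", "Not exemplary"],
})
# 2 categories -> 2 - 1 = 1 dummy variable
homes["exempHS"] = (homes["school"] == "Exemplary").astype(int)
print(homes)
```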
  Price1000s (y)   sqft (x1)   High school      exempHS (x2)
  145              1872        Not exemplary    0
  69.9             1954        Not exemplary    0
  315              4104        Exemplary        1
  144.9            1524        Not exemplary    0
  134.9            1297        Not exemplary    0
  369              3278        Exemplary        1
  95               1192        Not exemplary    0
  228.9            2252        Exemplary        1
  149              1620        Not exemplary    0
  295              2466        Exemplary        1
  388.5            3188        Exemplary        1
  75               1061        Not exemplary    0
  130              1195        Not exemplary    0
  174              1552        Exemplary        1
  334.9            2901        Exemplary        1

DUMMY VARIABLES
• In many situations we must work with categorical independent variables.
• In regression analysis we call these dummy or indicator variables.
• For a variable with n categories there are always n - 1 dummy variables.
• Exemplary/Not exemplary: there are 2 categories, so 2 - 1 = 1 dummy variable.
• North/South/East/West: there are 4 categories, so 4 - 1 = 3 dummy variables.

Region Variable Coding
  Region   x1   x2   x3
  North    1    0    0
  South    0    1    0
  East     0    0    1
  West     0    0    0

Interpretation
• E(y) = β0 + β1x1 + β2x2 (estimated regression equation)
• Expected value of home price given the high school is Not exemplary (x2 = 0):
  E(y | Not exemplary) = β0 + β1x1 + β2(0) = β0 + β1x1
• Expected value of home price given the high school is Exemplary (x2 = 1):
  E(y | Exemplary) = β0 + β1x1 + β2(1) = (β0 + β2) + β1x1

Regression Analysis
• sqft: every square foot is associated with an increase in price of 0.0621 ($1000s), or $62.10 per square foot.
• exempHS: on average, a home in an area with an exemplary high school is associated with a $98,600 higher price.

SUMMARY OUTPUT

Regression Statistics
  Multiple R         0.925716
  R Square           0.856950
  Adjusted R Square  0.833109
  Standard Error     44.65395
  Observations       15

ANOVA
              df   SS         MS         F          Significance F
  Regression  2    143340.7   71670.36   35.94346   8.57E-06
  Residual    12   23927.7    1993.975
  Total       14   167268.4

                Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
  Intercept     27.07494       33.68996         0.80365    0.437229   -46.3292    100.4790
  sqft (x1)     0.062066       0.020324         3.05383    0.010013   0.017784    0.106348
  exempHS (x2)  98.64787       35.96319         2.743023   0.017831   20.29080    177.0049

Interpretation
• E(y) = 27.1 + 0.0621x1 + 98.6x2 (estimated regression equation)
• Expected value of home price given the high school is Not exemplary (x2 = 0):
  E(y | Not exemplary) = 27.1 + 0.0621x1
• Expected value of home price given the high school is Exemplary (x2 = 1):
  E(y | Exemplary) = (27.1 + 98.6) + 0.0621x1 = 125.7 + 0.0621x1

• You are an analyst for a small company that develops house-pricing models for independent realtors. To generate your models you use publicly available data such as list price, square footage, number of bedrooms, number of bathrooms, etc.
• You are interested in two questions:
  1. Is the public high school in the neighborhood "Exemplary" (the highest rating), and how is that rating related to home price?
  2. In what region of the city (N, S, E, W) is the home located, and how is that related to home price?
• This data is not quantitative, it is qualitative (categorical), so we will need to use dummy variables in our regression (see the sketch below).
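A minimal sketch of generating the region dummies with pandas: drop_first=True keeps n - 1 columns, leaving one region (East, the first alphabetically) as the baseline, which matches the coding in the next table. The column name is illustrative.

```python
# Sketch: n - 1 dummy columns for a four-category variable.
import pandas as pd

homes = pd.DataFrame({
    "region": ["South", "North", "South", "North", "East",
               "South", "West", "North", "West", "South",
               "South", "South", "East", "South", "South"],
})
dummies = pd.get_dummies(homes["region"], drop_first=True).astype(int)
print(dummies.head())  # columns North, South, West; East is the baseline
```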
  Price1000s (y)   sqft (x1)   exempHS (x2)   Region   South (x3)   West (x4)   North (x5)
  145              1872        0              South    1            0           0
  69.9             1954        0              North    0            0           1
  315              4104        1              South    1            0           0
  144.9            1524        0              North    0            0           1
  134.9            1297        0              East     0            0           0
  369              3278        1              South    1            0           0
  95               1192        0              West     0            1           0
  228.9            2252        1              North    0            0           1
  149              1620        0              West     0            1           0
  295              2466        1              South    1            0           0
  388.5            3188        1              South    1            0           0
  75               1061        0              South    1            0           0
  130              1195        0              East     0            0           0
  174              1552        1              South    1            0           0
  334.9            2901        1              South    1            0           0

[Charts omitted: predicted prices compared by high-school rating and by region]
• For a home with 2,500 square feet, the predicted price is at a premium when the high school is exemplary and lower when it is not.
• The predicted price is lower for the West region, higher for the South region, and moderate for the other two.

SUMMARY OUTPUT

Regression Statistics
  Multiple R         0.940779
  R Square           0.885066
  Adjusted R Square  0.821214
  Standard Error     46.2179
  Observations       15

ANOVA
              df   SS         MS         F          Significance F
  Regression  5    148043.6   29608.72   13.86115   0.000528
  Residual    9    19224.85   2136.094
  Total       14   167268.4

                Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
  Intercept     51.30076       42.29369         1.212965    0.256018   -44.37422   146.9757
  sqft (x1)     0.065128       0.021546         3.022765    0.014415   0.016388    0.113868
  exempHS (x2)  102.1551       40.13882         2.545046    0.031449   11.35483    192.9555
  South (x3)    -32.1221       44.47480         -0.722254   0.488479   -132.7311   68.48689
  West (x4)     -20.8704       46.34628         -0.450315   0.663135   -125.7130   83.97213
  North (x5)    -61.8466       43.87802         -1.409512   0.192289   -161.1055   37.41240

Interpretation
• E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 (estimated regression equation)
• Expected value of home price given the high school is Not exemplary (x2 = 0) and the home is in the West (x4 = 1):
  E(y | Not exemplary, West) = β0 + β1x1 + β2(0) + β3(0) + β4(1) + β5(0) = β0 + β1x1 + β4
• With the estimates above: E(y | Not exemplary, West) = 51.3 + 0.0651x1 - 20.9 = 30.4 + 0.0651x1

• It appears from this data and its analysis:
  • Higher square footage is a significant predictor of higher home price.
  • Being in a district with an exemplary high school is a significant predictor of higher home price.
  • Region is NOT a significant predictor of home price.

Any Queries?

Thank you