Multiple Regression

Multiple regression is a statistical technique used to examine the relationship between three or more continuous variables when one of the variables can be thought of as being predicted by the others. Whereas simple linear regression has only one independent (predictor) variable, multiple regression has two or more independent variables.

The following table (a selection of countries and indicators from the somewhat larger Excel file LifeExpectancywithallnotes.xlsx, downloaded from the World Health Organization, http://www.who.int/whosis/en/, using the most recent data available) contains life expectancy at birth along with five factors that may influence how long people live. Using these factors, we might develop a model to predict length of life. Blank cells indicate values not reported for that country.

Location | Life expectancy (years) at birth, both sexes | Adult literacy rate (%) | Hospital beds (per 10 000 population) | Number of physicians | Incidence of tuberculosis (per 100 000 population per year) | Newborns with low birth weight (%)
Australia | 82 | | 40 | 47875 | 6 | 7
Bangladesh | 63 | 47.5 | 3 | 42881 | 225 | 30
Botswana | 52 | 81.2 | 24 | 715 | 551 | 10
Brazil | 72 | 88.6 | 26 | 198153 | 50 | 10
Canada | 81 | | 34 | 62307 | 5 | 6
Chile | 78 | 95.7 | 23 | 17250 | 15 | 5
China | 73 | 90.9 | 22 | 1862630 | 99 | 6
Cuba | 78 | 99.8 | 49 | 66567 | 9 | 6
Czech Republic | 77 | | 84 | 36595 | 10 | 7
Denmark | 79 | | 38 | 19287 | 8 | 5
Egypt | 68 | 71.4 | 22 | 179900 | 24 | 12
France | 81 | | 73 | 207277 | 14 | 7
Germany | 80 | | 83 | 284427 | 6 | 7
Ghana | 57 | 57.9 | 9 | 3240 | 203 | 11
Greece | 80 | 96 | 47 | 55556 | 18 | 8
India | 63 | 61 | | 645825 | 168 | 30
Indonesia | 68 | 90.4 | | 29499 | 234 | 9
Israel | 81 | | 60 | 25138 | 8 | 8
Italy | 81 | 98.4 | 40 | 215000 | 7 | 6
Japan | 83 | | 141 | 270371 | 22 | 8
Jordan | 71 | 91.1 | 19 | 13460 | 5 | 10
Kuwait | 78 | 93.3 | 19 | 4840 | 24 | 7
Mexico | 74 | 91.6 | 10 | 195897 | 21 | 9

The model would follow the format

y = b0 + b1(x1) + b2(x2) + b3(x3) + b4(x4) + b5(x5)

In the life expectancy example, y would represent life expectancy and there would be five Xs (adult literacy rate (%), hospital beds per 10,000 population, number of physicians, tuberculosis rate, and percentage of newborns with low birth weight). The b terms represent the slope of y with the corresponding x while holding each of the other x variables in the model constant. In simple linear regression, b represents the change in y per unit change in x. In multiple regression, however, b represents the change in y per unit change in the corresponding x after taking into account the effects of the other x variables in the model.

Performing the multiple regression with the 42 countries in the Excel file results in the following prediction equation (where y represents the predicted life expectancy, x1 represents the adult literacy rate, x2 represents the number of hospital beds per 10,000 population, x3 represents the number of physicians, x4 represents the tuberculosis rate, and x5 represents the percentage of newborns with low birth weight):

y = 47.26 + .33(x1) - .076(x2) - .0000003(x3) - .029(x4) - .059(x5)

After developing the multiple regression equation, values for the Xs can be substituted into it and the predicted y can be computed.

To see how multiple regression is performed in Excel, please watch the MultipleRegression video clip. The file used for this example is Baseball2009.xlsx, and the annotated output shown in the video clip is interpreted below.

As in simple linear regression, R square is the amount of variation in y that is explained by the Xs. In our baseball example, the five independent variables explain slightly more than 87% of the variation in wins. The remaining 13% of the variation in wins is attributable to factors not included in the model.
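To make the prediction step concrete, the short sketch below substitutes one country's values from the table above (Chile) into the fitted life-expectancy equation. The coefficients come directly from the equation shown earlier; the function name and the use of Python rather than Excel are purely illustrative.

```python
# Predicted life expectancy from the fitted equation
#   y = 47.26 + .33(x1) - .076(x2) - .0000003(x3) - .029(x4) - .059(x5)
# (illustrative Python; the course example performs this step in Excel)

def predict_life_expectancy(literacy, beds, physicians, tb, low_birth_weight):
    """Return the predicted life expectancy (years) for one country."""
    return (47.26
            + 0.33 * literacy             # x1: adult literacy rate (%)
            - 0.076 * beds                # x2: hospital beds per 10,000 population
            - 0.0000003 * physicians      # x3: number of physicians
            - 0.029 * tb                  # x4: tuberculosis incidence per 100,000 per year
            - 0.059 * low_birth_weight)   # x5: newborns with low birth weight (%)

# Chile's values from the table above
predicted = predict_life_expectancy(95.7, 23, 17250, 15, 5)
print(round(predicted, 1))  # about 76.4 years, versus Chile's observed life expectancy of 78
```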
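For readers curious how coefficients and an R square like those reported above are produced outside of Excel, here is a minimal ordinary-least-squares sketch using numpy. The file path, sheet layout, and column names are assumptions made for illustration; they are not the actual labels in LifeExpectancywithallnotes.xlsx.

```python
# Minimal ordinary-least-squares sketch (illustrative only).
# File path and column names are assumptions, not the workbook's actual labels.
import numpy as np
import pandas as pd

predictors = ["AdultLiteracy", "HospitalBeds", "Physicians",
              "TBIncidence", "LowBirthWeight"]          # hypothetical column names

df = pd.read_excel("LifeExpectancywithallnotes.xlsx")    # requires an Excel reader such as openpyxl
df = df.dropna(subset=predictors + ["LifeExpectancy"])   # keep only complete rows

# Design matrix: a column of ones (intercept) followed by the five predictors
X = np.column_stack([np.ones(len(df))] + [df[c].to_numpy() for c in predictors])
y = df["LifeExpectancy"].to_numpy()

coef, *_ = np.linalg.lstsq(X, y, rcond=None)             # [intercept, b1, ..., b5]
fitted = X @ coef
r_square = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print("intercept and slopes:", np.round(coef, 4))
print("R square:", round(r_square, 3))
```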
If the model is statistically significant, the significance (p-values) of the individual predictors can be examined. In the baseball example above, because the model is significant, we can proceed to examine the significance of the five Xs; we find that four of them are statistically significant, with Opponents Batting Average being the only predictor that is not. If the overall regression model is not statistically significant, it is not appropriate to examine the statistical significance of the individual independent variables.

Other Multiple Regression Considerations

A general rule of thumb for multiple regression is that ideally there should be a minimum of 25 observations per independent variable. So we would need at least 125 observations for our baseball problem. Regression coefficients are very unstable with fewer observations, so our confidence in the above model would be limited.

Another consideration, as with the other statistics we have covered in this course, is that simply because a relationship is statistically significant does not mean that it is important or useful. Although R square provides evidence regarding the usefulness of the regression equation, in the end the importance or usefulness of a particular result is a management decision!

A common problem with multiple regression arises when two or more of the independent variables are highly inter-correlated. This problem is known as multicollinearity and is particularly troublesome when trying to determine the statistical significance of the individual predictors. When simply developing a prediction model or determining how much variance in the dependent variable is explained by a group of independent variables, multicollinearity is not a problem. Although dealing with multicollinearity is beyond the scope of this course, ways the problem can be addressed include 1) combining the inter-correlated variables, 2) collecting more observations, 3) eliminating one of the offending variables from the model, or 4) standardizing the independent variables.

Data is sometimes not very clean; in other words, data sets often contain errors or missing values. In many studies observations are not obtainable, or people simply do not answer every item in a questionnaire. In software specifically designed for statistical analysis, such as SAS (http://www.sas.com/) or SPSS (http://www.spss.com/), missing values are automatically excluded from the analysis. In the regression routines in Excel, however, this is not the case. Thus, in performing regression (linear and multiple) using Excel, remember that if the routine does not work, a common cause is missing data. The remedy is to manually delete the observation(s) with missing responses for that particular analysis. Note: In large data sets with considerable missing data, sorting the data, deleting the rows with missing values, and then repeating this process simplifies the clean-up.

Numerous regression techniques have been developed for specific purposes. Several of these are designed to determine, from a larger set of candidates, the predictors that maximize the variance explained in the dependent variable. Examples of these specialized techniques include stepwise regression (both forward and backward) and best-subset regression.
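The checks described in this section can also be scripted. The sketch below, again in Python and using the statsmodels library, illustrates the workflow under the same assumed file path and column names as the earlier sketch: drop rows with missing values (the step Excel forces you to do by hand), fit the model, look at the overall significance before the individual p-values, and screen for multicollinearity with a correlation matrix. The same pattern would apply to the Baseball2009.xlsx example.

```python
# Illustrative workflow for the considerations above (hypothetical file path and column names).
import pandas as pd
import statsmodels.api as sm

predictors = ["AdultLiteracy", "HospitalBeds", "Physicians", "TBIncidence", "LowBirthWeight"]

df = pd.read_excel("LifeExpectancywithallnotes.xlsx")
df = df.dropna(subset=predictors + ["LifeExpectancy"])   # Excel's regression tool cannot skip missing cells

X = sm.add_constant(df[predictors])                      # add the intercept term
model = sm.OLS(df["LifeExpectancy"], X).fit()

# Examine the overall model first; only if it is significant are the
# individual predictor p-values worth interpreting.
print("overall model p-value (F test):", model.f_pvalue)
print("R square:", model.rsquared)
print(model.pvalues)                                     # p-values for the intercept and each predictor

# Highly inter-correlated predictors signal possible multicollinearity.
print(df[predictors].corr().round(2))
```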