Notes AGR 206 — Revised: 2/6/2016

Chapter 6. Multiple Linear Regression

6:1 Multiple Linear Regression.

Multiple Linear Regression (MLR) is one of the most commonly used and useful statistical methods. MLR is particularly helpful for observational studies, where treatments are not imposed and therefore the multiple explanatory variables tend not to be orthogonal to each other. The concept of orthogonality is used repeatedly in this subject and in multivariate statistics. Two variables are said to be orthogonal to each other if they have no covariance. In other words, they occupy completely separate subspaces within the observation or sample space, which in geometric terms means that they are "perpendicular" to each other.

Figure 1. [Diagram: ANOVA shown as a special case within REGRESSION.] Multiple linear regression is one of the most general statistical techniques. Most techniques, particularly ANOVA for any type of experimental design and treatment structure, are special cases of MLR.

In these lectures we will cover almost all basic aspects of MLR, and some more advanced topics. First, the general concept and the different uses of MLR are presented. Although there are no "formulas" in that section, it is probably one of the most important things to learn, because the choice among methods within MLR, such as selection of variables and biased regression, depends completely on the goals of the analysis. Second, as usual, we define the model and the typical assumptions, which leads immediately into the screening of data and the adequacy of the model. Collinearity is one of the most difficult problems in MLR, so a good deal of detail is offered there, and two methods for dealing with collinearity, Ridge Regression and Principal Components Regression, are introduced later on. Finally, we study methods to decide how to reduce the number of variables in the model, and how to validate the model for the specific purpose of the study.

6:2 Concept & Uses of MLR.

Through MLR one can obtain a least-squares estimate of the relationship between one dependent variable and many independent or explanatory variables. This relationship can involve both quantitative (continuous) and qualitative (class) variables. We will focus on the analysis of data where most or all of the variables are continuous.

Rawlings et al. (1998) list six different uses for a regression equation:

1. Description of the behavior of Y in a specific data set. The validity of the model is limited to the data in the sample, and there is no need to refine the model or eliminate variables. In the Spartina data set, did biomass tend to increase or decrease with increasing pH?

2. Prediction and estimation of future responses in the population represented by the sample. This objective concerns only the estimation of expected responses, not parameters. In this case, the model can be refined and the number of variables reduced to lower the cost of obtaining the sample of X's necessary to make the estimations and predictions. How can we predict biomass in the marshes without having to clip?

3. Extrapolation outside the range of the data used to develop the model. This is similar to point 2, with the additional complication that there will be predictions outside the original multivariate range of X's. Emphasis should be on validation and periodic maintenance of the model by obtaining new observations. Can we predict biomass in an increasing number of estuaries?

4. Estimation of parameters.
The precision of the estimates can be increased by eliminating redundant and unrelated variables, but there is a risk of introducing too much bias in the estimates by removing variables that are important. Assuming we know that the relationships are linear and without interactions, how much does pH affect biomass?

5. Control of a system or process. This implies that a cause-and-effect relationship must be established between the X's and Y. If no causality is demonstrated, there is no basis for trying to control Y by changing X. If I change the pH, will biomass change by a predictable amount?

6. Understanding of processes. This is related to point 4, where the parameter estimates indicate how much and in which direction any given X affects Y. However, the linear model may be too restrictive for the real process, which may be non-linear. MLR can help identify variables or groups of variables that are important for further modeling. Why is there a predictably different expected biomass when pH, etc., change?

6:3 Model and Assumptions.

6:3.1 Linear, additive model to relate Y to p independent variables.

Note: here, p is the number of variables, but some authors use p for the number of parameters, which is one more than the number of variables because of the intercept.

Yi = β0 + β1 Xi1 + … + βp Xip + εi

where the εi are normal and independent random variables with common variance σ². In matrix notation the model and solution are exactly the same as for SLR:

Y = Xβ + ε        b = (X'X)⁻¹ X'Y

6:3.1.1 All equations used for SLR (in matrix notation) apply without change. The equations are easiest to write in matrix notation:

s²{b} = MSE (X'X)⁻¹

Ŷh = Xh b

s²{Ŷh} = MSE Xh (X'X)⁻¹ Xh'

where Xh is a row vector whose first element is 1 and whose remaining elements are the values of each X for which the expected Y is being estimated. In general, for estimating L'b,

s²{L'b} = MSE L' (X'X)⁻¹ L

where L is a column vector of coefficients selected by the user to reflect the value or Ho of interest.

6:3.2 Response plane and error.

Just as in SLR errors are measured as the vertical deviations from each observation to the prediction line, in MLR they are measured as the "vertical" (i.e., parallel to the axis of the response variable) distance from the observation to the prediction plane (Figure 2). The residuals are calculated as in simple linear regression: ei = Yi − Ŷi, where Ŷi is the estimate of E{Yi}.

Figure 2. Example of a regression plane for two independent variables.
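As an aside, the matrix formulas in 6:3.1.1 are easy to verify numerically outside JMP. The following Python/numpy sketch uses simulated placeholder data (the numbers and names are arbitrary, not from any course data set) to compute b, s²{b}, and the standard error of Ŷh exactly as written above.

import numpy as np

# Simulated placeholder data: n observations, p = 3 predictors.
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column of 1's for the intercept
beta = np.array([10.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=2.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                 # b = (X'X)^-1 X'Y
e = y - X @ b                         # residuals
MSE = e @ e / (n - p - 1)             # error df = n minus the number of parameters
s2_b = MSE * XtX_inv                  # s2{b} = MSE (X'X)^-1
se_b = np.sqrt(np.diag(s2_b))         # standard errors of the coefficients

# Estimated mean response and its standard error at a point Xh
# (row vector whose first element is 1): s2{Yhat_h} = MSE Xh (X'X)^-1 Xh'.
Xh = np.array([1.0, 0.5, -0.2, 1.1])
Yhat_h = Xh @ b
se_Yhat_h = np.sqrt(MSE * Xh @ XtX_inv @ Xh)
print(b, se_b, Yhat_h, se_Yhat_h)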
6:4 Performing MLR in JMP: the basics.

Performing MLR in JMP is extremely simple. By showing how simple it is to get almost all the output you need to work through an MLR situation, I place the emphasis on understanding and interpreting the results. Take, for example, the Body Fat example from Neter et al. (1996). The response variable Y is body fat content; X1 is triceps skinfold thickness, X2 is thigh circumference, and X3 is midarm circumference. The data are in the file bodyfat.jmp. In this example, MLR is applied for prediction and estimation (#2 of the list above): to predict body fat in people without having to use more time-consuming and expensive methods.

In order to practice and fully follow the example, it is recommended that you open the bodyfat.jmp data and perform the operations as they are described in the text. Select Analyze -> Fit Model and, in the window, enter Y in the Y box and all three X's in the Model Effects box. The model specification should look like the one in the picture. The Personality should be Standard Least Squares. Choose the Minimal Report and click Run Model.

In the results window, click on the red triangle by Response Y and select Estimates -> Sequential Tests, then Correlation of Estimates. Next, from the same drop-down menu at the red triangle, select Row Diagnostics -> Plot Residuals by Predicted, then Press, and then Durbin-Watson Test. For this example, assume that the samples are in the temporal order in which they were obtained, so we can test for temporal independence of the errors. Now select Save Columns -> Studentized Residuals, Hats, Cook's D Influence, and Mean Confidence Interval. Finally, right-click (Windows) or CTRL-click (Mac) on the body of the Parameter Estimates table and, in the drop-down menu that appears, select all the columns that were not already checked. At this point almost all the information needed for a complete interpretation and evaluation of the MLR is displayed in the results window or saved in the original data table.

1. R² and Ra². R-square is the coefficient of determination, and it represents the proportion of the variation in the response variable (body fat) that is explained by the three explanatory variables.

R² = 1 − SSE / SSTotal

This is a measure of how good the model is, but it has the problem that it always increases as more variables or parameters are added to the model. Unless we add a variable whose sample values are perfectly correlated with variables already in the model, even the addition of superfluous variables will make R² increase. On the other hand, Ra² is "adjusted" for the loss of degrees of freedom in the error as more variables are added to the model. Thus, the adjusted Ra² will decrease when we add variables that do not explain enough additional variance in Y to justify the loss of error degrees of freedom. (A computational sketch following point 13 shows how R², Ra², and several of the other quantities below can be reproduced outside JMP.)

Ra² = 1 − MSE / [SSTotal / (n − 1)]

2. The Analysis of Variance table shows the sums of squares, degrees of freedom, and mean squares for the whole model, the residual or error, and the total. The F ratio tests the Ho that the full model is no better than a simple mean at describing the data. A probability value lower than 0.05 (or your preferred critical level) leads to a rejection of that hypothesis and opens the possibility of determining which parameters are significantly different from zero, or in other words, which variables explain a significant part of the variance in Y.

3. The Parameter Estimates table is critical for MLR. The second column contains the values of the estimated parameters or partial regression coefficients. Each coefficient represents the expected change in Y for each unit of change in the corresponding X, with all other variables held constant. These values of the partial regression coefficients depend on the units used for the variables in the model; therefore, they are not directly comparable.

4. The standard error for each estimate is given in the next column. Recall that the estimated partial regression coefficients are linear combinations of normal random variables. Thus, they are normal random variables themselves. The printed standard errors are used to test whether the estimated parameters are significantly different from zero, and to construct confidence intervals for their expected values.

5. The t-ratio is the estimated parameter divided by its standard error. It tests the Ho: parameter = 0.
Note that at this point things get more interesting: in the previous table we saw that the Ho: all parameters = 0 was rejected, yet in this table not a single parameter is significantly different from zero. What is happening?!

6. The confidence intervals for the parameters are calculated by adding to and subtracting from the estimated value the product of the standard error and the t value for P = 0.05 (two-tailed).

[JMP Fit Model report for the body fat data, with callouts 1–13 marking the items discussed in this list.]

7. The standardized betas, or estimated standardized partial regression coefficients, are the coefficients obtained from regressing standardized Y on the standardized X's. Typically we use bi' to represent them. Standardization is obtained by subtracting the mean and dividing by the standard deviation. It is not actually necessary to run the new regression, because

bi' = bi (sXi / sY)

where bi is the coefficient for Xi in the regular MLR, sXi is the standard deviation of Xi, and sY is the standard deviation of the response variable. The standardized partial regression coefficients can be used to compare the magnitude of the effects of different predictors on the response variable. The value of b' for a predictor X is the change in Y, in standard deviations of Y, associated with an increase in X equal to one standard deviation of X.

8. The VIF is the variance inflation factor. This is a statistic used to assess multicollinearity (correlation among the predictors), one of the main problems of MLR in observational studies. Each predictor has an associated VIF. The variance inflation factor of Xi is the reciprocal of 1 minus the proportion of variance of Xi explained by the rest of the X's:

VIFi = 1 / (1 − Ri²)

where Ri² is the proportion of variance of Xi explained by the other X's. The VIF gets its name from the fact that it is the factor by which the variance of each estimated coefficient is inflated due to the collinearity among the X's. That's right: collinearity (correlation) among the X variables increases the variance of the b's without necessarily changing the overall R² of the model. This is behind the apparent paradox described in point 5, where no parameter was significantly different from zero based on the Type III SS, yet the model was highly significant.

9. The Effects Test shows the sum of squares, df, and F test for each variable in the equation. Once again, these tests do not reject the Ho: parameter = 0 for any of the b's. These tests are constructed with the Type III sum of squares, which is the default in JMP. A detailed introduction to the types of SS is given below; at this point suffice it to say that the Type III SS considers only that part of the variation in Y that is exclusively explained by the factor under consideration.

10. The PRESS statistic is the prediction error sum of squares, a statistic similar to the SSE. Models with smaller PRESS are better. PRESS is calculated as the sum of squared errors where the error for each observation is computed from a model fitted with that observation held out of the data set (jackknifing). PRESS is used for model building and variable selection.

PRESSp = Σ (Yi − Ŷi(i))²,  with the sum taken over i = 1, …, n

The subscript p indicates that there are p variables in the model, and i(i) indicates that the predicted value for observation i was obtained without observation i in the data set.

11. Correlation of estimates.
The fact that the b's (we use b's as short for "estimated partial regression coefficients") have correlations once again emphasizes their random nature. The correlation between pairs of estimated coefficients also relates to the problem of collinearity. Usually, when there is a high correlation between two estimated coefficients, each will also have a high variance. Intuitively, this means that minor random changes in the sample data can be accommodated by the model through large associated changes in the two b's. The b's vary widely from sample to sample, but the overall model fit is equally good across samples!

12. The Sequential Tests are based on the Type I SS. These are calculated according to the order in which the variables were entered into the model. Thus, variables entered first get "full credit" for any previously unexplained variance in Y that they can explain; later variables only get a chance to explain the variance in Y not already explained by earlier X's. The sum of the Type I SS equals the SS for the whole model (SSRegression). Type I SS are useful for understanding the relationships among variables in the model and for testing effects when there is a predetermined order in which variables should be entered. For example, suppose that you are developing a method to estimate plant mass in a crop or an herbaceous native plant community. The method has to be quick and inexpensive. Because you are interested in prediction, you are not concerned with the variance of the b's. Canopy height is a very quick and inexpensive non-destructive measure, so you would include it in all models first and use its Type I SS to assess its explanatory power.

13. Durbin-Watson statistic. This is a test of first-order autocorrelation in the residuals. It assumes that the data are ordered by time or by spatial position on a transect. If the data have no time sequence or position on a transect, this statistic is meaningless. A small P value (say, less than 0.01) indicates significant temporal or spatial autocorrelation in the errors, which violates the assumption of independence.
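The quantities discussed in points 1, 7, 8, 10, and 13 can also be reproduced outside JMP. The Python sketch below is only an illustration: it assumes the body fat data have been exported to a hypothetical CSV file (bodyfat.csv) with columns named Y, X1, X2, and X3, and it uses the statsmodels library rather than the JMP platform described above.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Hypothetical CSV export of the body fat data with columns Y, X1, X2, X3.
dat = pd.read_csv("bodyfat.csv")

fit = smf.ols("Y ~ X1 + X2 + X3", data=dat).fit()
print(fit.rsquared, fit.rsquared_adj)        # point 1: R2 and adjusted R2
print(fit.params, fit.bse, fit.tvalues)      # points 3-5: b's, standard errors, t ratios

# Point 7: standardized coefficients b'_i = b_i * sXi / sY.
b_std = fit.params[["X1", "X2", "X3"]] * dat[["X1", "X2", "X3"]].std() / dat["Y"].std()

# Point 8: variance inflation factors (column 0 of the design matrix is the intercept).
exog = fit.model.exog
vif = [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])]

# Point 10: PRESS from the hat diagonals, using e_i(i) = e_i / (1 - h_i).
h = fit.get_influence().hat_matrix_diag
press = np.sum((fit.resid / (1 - h)) ** 2)

# Point 13: Durbin-Watson statistic (meaningful only if rows are in time or transect order).
dw = durbin_watson(fit.resid)
print(b_std, vif, press, dw)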
6:5 Four Types of Sum of Squares.

When X variables are not orthogonal, which is usually the case when data do not come from designed experiments, it is not possible to assign a unique portion of the total sum of squares to each variable. In order to understand the explanatory power and importance of each X, four different types of sum of squares are used. Thus, understanding and interpreting the partition of SS performed in MLR and ANOVA requires that one understand the different types of SS, particularly Types I and III. SAS defines four types of sum of squares (SAS Institute, Inc. 1990; SAS/STAT User's Guide). In the vast majority of analyses, the relevant SS are Type I and Type III; I mention all the types for completeness and understanding. This discussion applies to any linear model, including all ANOVA models. In order to understand the differences among the types of SS, it is necessary to consider the potential sources of differences in their values.

1. First, the simplest situation, where all SS are equal in value, is a designed experiment where all effects are orthogonal, all treatment combinations are observed (no missing cells or combinations), and all cells have the same number of observations (balanced experiment): I = II = III = IV.

2. Second, the next level of complexity is when not all effects are orthogonal, but all cells are represented and have the same number of observations: I ≠ II = III = IV.

3. Third, when cell sizes are not equal and effects are not orthogonal, but all cells are represented: I ≠ II ≠ III = IV.

4. If, in addition to the conditions in case 3, there are missing cells: I ≠ II ≠ III ≠ IV.

Type I SS are the sequential sums of squares. They depend on the order in which effects (or variables) are included in the model. The order is determined by the order in which effects are specified in the MODEL statement in SAS, or the order in which they are entered into the Model Effects box in the Fit Model platform of JMP. The Type I SS for any effect is the sum of squares of the Y variable explained by that effect that is not explained by any of the effects previously entered in the model. Thus, it is completely dependent on the order in which effects are entered, except when the effects are orthogonal to each other. The sum of the Type I SS equals the total SS for the whole model (SSRegression = Σ Type I SS). Type I SS and Type III SS are always the same for the last effect entered in the model.

Type II and III SS are the partial sums of squares for an effect. This partial SS is the SS explained by the effect under consideration that is not explained by any of the other effects or variables. The use of F tests based on Type II and III SS is conservative in the sense that it completely ignores the variation in the response variable that can be explained by more than one effect. The sum of the Type II/III SS can be anything between 0 and the total SS of the model or regression. In MLR (in AGR206) we deal mostly with continuous variables (no grouping variables), so there are no cells and no possibility of unequal cell sizes; thus, Type II and Type III SS will typically be the same. In general, Type I and Type II/III are the most commonly used SS.

Figure 3. Schematic representation of the different types of sum of squares relevant for multiple linear regression. This is figure 7.1 from Neter et al. (1996).

Figure 3 illustrates the partitioning of the total SS using Type I SS for the regression of Y on two X's, as the X's are entered into the model in different orders. A semi-standard notation is used in which commas separate effects or variables included in the model, and bars separate the variable or variables in question from those already in the model. Thus, SSE(X1) is the SS of the error when X1 is in the model, SSR(X1, X2) is the SS of the regression when both variables are in the model, and SSR(X2 | X1) is the increment in the SSR (and thus the reduction in SSE) caused by including X2 after X1. Usually, the Type III SS for a variable will be smaller than its Type I SS when that variable is entered first. However, this is not the case for "suppressor" variables, which have a larger Type III than Type I SS. This fact still escapes my intuition and imagination. An excellent explanation of the different types of sum of squares can be found in the General Linear Models sections of StatSoft's electronic textbook (http://www.statsoft.com/textbook/stathome.html).
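To make the distinction concrete, the following Python sketch computes Type I (sequential) and Type III (partial) SS for a two-predictor regression by fitting the relevant reduced models and differencing their error SS. The data are simulated placeholders; only the logic matters.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data with a response Y and two correlated (non-orthogonal) predictors.
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
dat = pd.DataFrame({"X1": x1, "X2": x2})
dat["Y"] = 3 + 2 * dat["X1"] - dat["X2"] + rng.normal(size=n)

def sse(formula):
    """Residual (error) sum of squares for the given model."""
    return smf.ols(formula, data=dat).fit().ssr

sse_null = sse("Y ~ 1")          # SSE with no predictors = SSTotal
sse_x1 = sse("Y ~ X1")
sse_x2 = sse("Y ~ X2")
sse_full = sse("Y ~ X1 + X2")

# Type I (sequential) SS, entering X1 first: SSR(X1), then SSR(X2 | X1).
typeI_x1 = sse_null - sse_x1
typeI_x2 = sse_x1 - sse_full

# Type III (partial) SS: what each variable explains beyond all the others.
typeIII_x1 = sse_x2 - sse_full   # SSR(X1 | X2)
typeIII_x2 = sse_x1 - sse_full   # SSR(X2 | X1); same as Type I for the last variable entered

print(typeI_x1, typeI_x2, typeIII_x1, typeIII_x2)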
The concept of extra sums of squares, particularly the Type III SS, allows the construction of general linear tests for individual variables or groups of variables. For example, one may want to test whether the regression coefficients for X1 and X3 are zero in a model with five independent variables, Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + ε. The F test would be:

F = [SSR(X1, X3 | X2, X4, X5) / 2] / [SSE(X1, X2, X3, X4, X5) / (n − 6)]

6:6 Partial correlations and plots.

When many explanatory variables are present, as in MLR, it is difficult to visually assess the relationship between Y and any one of the X's by looking at the scatterplots, because the effects of that variable may be masked by the effects of the other variables. In this case, it is necessary to use the concepts of partial regression, partial correlation, and partial plots. The partial correlation between Y and X is defined as the correlation between these two variables when all other variables are controlled for. The partial correlation and regression may reveal unexpected relationships. This concept is illustrated in Figure 4. Because the predictors exhibit a high positive correlation and X2 has a strong negative effect on Y2, the effect of X1 on Y2 is reversed and appears to be negative, while in fact it is positive. The partial correlation between a response variable Y and one predictor out of many is calculated in the Multivariate platform of JMP.

Correlations
            X1        X2        Y2
X1      1.0000    0.8868   -0.4870
X2      0.8868    1.0000   -0.6721
Y2     -0.4870   -0.6721    1.0000

Partial correlations (partialed with respect to all other variables)
            X1        X2        Y2
X1          ·     0.8651    0.3185
X2      0.8651        ·    -0.5951
Y2      0.3185   -0.5951        ·

Figure 4. Effects of correlations among the predictors on the apparent simple correlations with the response variable. These correlations and partial correlations are based on a random sample from a data set where X1 and X2 follow a bivariate normal distribution with correlation equal to 0.9. The true model relating Y to the predictors is Y = 0.3 X1 − 0.7 X2 + ε, where ε has a normal distribution with variance equal to 0.25.

The partial correlation between two variables, say Y and X1 for the Body Fat example, is the correlation between Y and X1 after correcting for the linear relationships between Y and the other X's and between X1 and the other X's. To understand the meaning of the partial correlation intuitively, consider how it can be calculated. First, regress Y on the other X's, in this example X2 and X3; save the residuals and call them eY(X2,X3). Second, regress X1 on X2 and X3; save the residuals and call them eX1(X2,X3). Finally, correlate eY(X2,X3) and eX1(X2,X3). The results from JMP are presented below.

[JMP output: partial correlation in the Body Fat example.]

The residuals obtained in the first two steps can be used to assess the linearity and strength of the relationship between Y and X1, controlling for the other variables. Partial regression plots represent the marginal effect of each X variable on Y, given that all other X's are in the model. In the case of Y and X1 in the example above, we plot eY(X2, X3) vs. eX1(X2, X3).

Figure 6-5. Partial regression plot of Y and X1: residuals of Y vs. residuals of X1. No assumptions are violated.

The partial regression is simply the regression of the residuals of Y on the residuals of X1. The scatterplot should be inspected as a regular residual plot, looking out for non-linearity, heterogeneity of variance, and outliers. Here the relationship appears to be linear, with homogeneous variance over levels of X1, and without outliers.
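The three-step calculation described above can be written out directly. As before, this Python sketch assumes a hypothetical CSV export (bodyfat.csv) with columns Y, X1, X2, and X3; it is not the JMP output itself.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical CSV export of the body fat data with columns Y, X1, X2, X3.
dat = pd.read_csv("bodyfat.csv")

# Step 1: residuals of Y after regressing it on the other X's.
e_y = smf.ols("Y ~ X2 + X3", data=dat).fit().resid
# Step 2: residuals of X1 after regressing it on the other X's.
e_x1 = smf.ols("X1 ~ X2 + X3", data=dat).fit().resid
# Step 3: the partial correlation of Y and X1 is the simple correlation of the two residual series.
partial_r = e_y.corr(e_x1)
print(partial_r)

# Plotting e_y against e_x1 gives the partial regression (added-variable) plot for X1.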
6:7 Generalized MLR.

6:7.1 Linear, and intrinsically linear models.

Linearity refers to the parameters. Thus, a model is linear if it can be expressed as a linear combination of parameters. The model can involve any functions of the X's as long as these functions do not involve parameters whose true values are unknown. As a result, not all linear models represent a hyperplane in the variable space.

Yi = β0 + β1 f1(Xi1) + … + βp fp(Xip) + εi        (generic linear model)

For example, the first function could be sin[3X + ln(2 + 0.75X)]. In particular, powers of X are used as these functions, creating polynomials of any degree desired. Intrinsically linear models are models that seem not to be linear but that, by rearranging terms or changing variables, can be expressed as in the generic equation above.

6:7.2 Polynomial regression.

Polynomial regression is a special case where the functions are powers of X. More complex and general models include interactions among variables. These interactions can be incorporated simply as new variables resulting from the product of the values of the interacting variables. By doing a substitution of variables one can see models with interactions as regular linear models (see the sketch at the end of the chapter). For example, a polynomial in X1 of order 3 can be expressed as a regular linear model:

Y = β0 + β1 X1 + β2 X1² + β3 X1³

Letting X2 = X1² and X3 = X1³:

Y = β0 + β1 X1 + β2 X2 + β3 X3

Likewise, the interaction between two different variables can be transformed into a new variable and approached as regular MLR. An interaction between two variables, say X1 and X2, means that the effect of either one on Y depends on the level of the other. For example, let Y = 2 X1 X2. An increase in the value of X2 by one unit will make Y change by 2 X1. When X1 = 1, the effect of the unit change in X2 is an increase in Y of 2 units; when X1 = 3, the effect of the unit change in X2 is an increase in Y of 6 units.

NOTE: the concept and intuition of statistical interaction is essential. Two variables have an interaction when the effect of one of them on the response depends on the level of the other. In the clover example used for homework 2, there was an interaction between group and days. This means that the effect of temperature depended on the age of the plants. When plants were very young, the temperature groups did not differ in plant size. After the plants had a chance to grow for a while, the average sizes became different among groups. The same fact can be stated by saying that the effect of time on plant size (growth rate) depended on the group.
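To close the chapter, here is a small Python sketch of the substitution-of-variables idea from 6:7.2: powers and products of the predictors are created as new columns, and the model is then fit as a regular MLR. The data are simulated placeholders.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data with two predictors and a response built with curvature and an interaction.
rng = np.random.default_rng(2)
n = 40
dat = pd.DataFrame({"X1": rng.uniform(0, 10, size=n), "X2": rng.uniform(0, 10, size=n)})
dat["Y"] = 1 + 0.5 * dat["X1"] + 0.1 * dat["X1"] ** 2 + 2 * dat["X1"] * dat["X2"] + rng.normal(size=n)

# Substitution of variables: the square and the product become ordinary new columns,
# and the model is then fit as a regular MLR.
dat["X1sq"] = dat["X1"] ** 2
dat["X1X2"] = dat["X1"] * dat["X2"]
fit = smf.ols("Y ~ X1 + X1sq + X1X2", data=dat).fit()
print(fit.params)

# The coefficient on X1X2 quantifies the interaction: the expected change in Y for a
# one-unit increase in X2 equals b_X1X2 * X1, so it depends on the level of X1.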