Correlated Explanatory Variables

If there are very many variables, it is likely that some of them will be highly correlated. This implies that some variables, or sets of variables, are measuring similar things. If all are included in a prediction formula, there will be considerable redundancy, with many variables merely contributing to the chance variation in the formula. The task of selecting a reasonable subset of variables is not easy. While there are a number of more or less automatic techniques available, they are notoriously unstable and susceptible to influence by exceptional cases. In addition, causal interpretations of prediction formulas in terms of the variables remaining after a selection procedure are unreliable. For pure prediction purposes, this is not a serious problem. However, attempts to use such causal interpretations are likely to mislead.

Even with small numbers of explanatory variables, correlations among them can lead to problems, particularly with causal interpretation. Consider the Jobtime case study. Following deletion of exceptional cases, the following regression was calculated.

Regression Analysis: Jobtime versus Units, Ops per Unit, T-Ops, Rushed?

17 cases used, 3 cases contain missing values

Predictor          Coef    SE Coef       T      P
Constant         44.216     9.080     4.87  0.000
Units          -0.06931    0.02853   -2.43  0.032
Ops per Unit     9.8286     0.8873   11.08  0.000
T-Ops          0.107795   0.004114   26.20  0.000
Rushed?         -37.960      3.857   -9.84  0.000

S = 7.41272

In earlier discussion of this case study, we decided that this gave a reasonable prediction formula, with a prediction error of approximately ±2 days. This is helpful to customers, who want to know reasonably precisely when they can expect a delivery, so that they can arrange their own production schedules accordingly.

Looking at the coefficient estimates, however, there is a problem. The estimated Units coefficient is negative, which seems to imply that jobs with more units take less time to complete. Clearly, such an implication is nonsense. So, what is going on?

The explanation may be found in the correlation matrix.

Correlations: Jobtime, Units, Ops per Unit, T-Ops, Rushed?

               Jobtime    Units  Ops per Unit    T-Ops
Units            0.787
Ops per Unit    -0.039   -0.524
T-Ops            0.961    0.909        -0.270
Rushed?         -0.342   -0.193         0.197   -0.243

Before examining the correlations, note that the diagonal of this matrix is missing; the correlation of each variable with itself is 1. Also note that the upper triangle elements are missing; corr(X,Y) = corr(Y,X), so the matrix is symmetric and the lower triangle is sufficient. There is a case to be made for rounding to 2 significant digits.

Correlations: Jobtime, Units, Ops per Unit, T-Ops, Rushed?

               Jobtime    Units  Ops per Unit    T-Ops
Units             0.79
Ops per Unit     -0.04    -0.52
T-Ops             0.96     0.91         -0.27
Rushed?          -0.34    -0.19          0.20    -0.24

The first column gives the correlations of the response, Jobtime, with the explanatory variables. (This corresponds to the first row of the scatterplot matrix, which shows scatter plots of the response against the explanatory variables.) Note that the strongest correlation is with T_Ops, with a value of 0.96, very close to 1. This simply reflects the fact that the simple linear regression of Jobtime on T_Ops is very highly significant¹, t-value = 26.2. Units shows a correlation of 0.79, also relatively high. The corresponding t-value is 4.94, also highly significant.

¹ This reflects the fact that simple regression coefficients and correlation coefficients are related; see SA, p. 208. The t test of the hypothesis β = 0 is also a test of the hypothesis ρ = 0.

Exercise 1: Calculate the simple linear regressions of Jobtime on each of T_Ops and Units. Confirm the corresponding t-values.
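As a check on Exercise 1, the t-value for a simple linear regression slope can be recovered directly from the correlation coefficient via the standard identity t = r√((n − 2)/(1 − r²)). For Units, assuming the correlation 0.787 is computed from the same 17 cases used in the regression above, this gives

$$ t = r\sqrt{\frac{n-2}{1-r^{2}}} = 0.787 \times \sqrt{\frac{17-2}{1-0.787^{2}}} \approx 4.94, $$

in agreement with the value quoted above.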
Note, however, the negative correlations between Jobtime and each of Ops per Unit and Rushed. These seem to suggest that Jobtime decreases with Ops per Unit and is lower for Rushed jobs. While the latter makes sense, the former appears not to.

Exercise 2: Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.

Correlations among explanatory variables

The individual correlations of the explanatory variables with the response do not shed much light on the unexpected pattern of regression coefficients in the multiple regression. Some clues begin to emerge when we inspect the correlations of the explanatory variables among themselves, as shown in the remaining columns of the correlation matrix (corresponding to the remaining rows of the scatterplot matrix). Note that the highest correlation in this subset of the correlation matrix, with a value of 0.91, is that between Units and T_Ops, the two explanatory variables most highly correlated with the response.

The fact that these two explanatory variables are so highly correlated may be interpreted by saying that they measure similar aspects of the jobs involved, or that they convey closely related information concerning the jobs. Thus, including both of these variables in the multiple regression involves some degree of redundancy. In choosing values for the regression coefficients, a balance must be struck over how much information about the process to take from each variable. In some instances, this balancing act will entail offsetting a positive contribution from one variable with a negative contribution from another, hence the possibility of a negative coefficient for a variable logically positively related to the response.

Multiple correlations among explanatory variables

With many explanatory variables, there is the possibility of more complex relationships involving several variables. To allow for this possibility, it is necessary to calculate multiple correlations between variables. Ultimately, this reduces to calculating a multiple correlation coefficient of each explanatory variable with all of the others. In practice, this may be done by calculating a value of R² for each of the explanatory variables regressed on all of the others. For example, the value of R² for Units, given the other explanatory variables, may be read from the results of the regression of Units on the others.

Regression Analysis: Units versus Ops per Unit, T-Ops, Rushed?

Predictor          Coef    SE Coef      T      P
Constant         177.62      73.26   2.42  0.031
Ops per Unit    -22.147      6.058  -3.66  0.003
T-Ops           0.13534    0.01382   9.80  0.000
Rushed?           31.54      36.47   0.86  0.403

S = 72.0706   R-Sq = 91.4%   R-Sq(adj) = 89.5%

The value of R² in this case is 91.4%. Repeating this for the other variables leads to:

R²_Units = 91%
R²_Ops per Unit = 55%
R²_T_Ops = 89%
R²_Rushed = 13%

Exercise 3: Confirm the calculation of the above R² values.

This set of R² values may be taken as measuring the full extent of correlation between the explanatory variables. A popular convention indicates that such correlation is problematic if the largest such R² value exceeds 90%. Here, R²_Units does, and R²_T_Ops is very close.
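The R² values above can also be computed directly by regressing each explanatory variable on the others. The sketch below is a minimal illustration in Python using statsmodels; the file name jobtime.csv and the column names are hypothetical and would need to match the actual data set.

import pandas as pd
import statsmodels.api as sm

explanatory = ["Units", "OpsPerUnit", "TOps", "Rushed"]

# Hypothetical file and column names; adjust to match the actual data.
jobs = pd.read_csv("jobtime.csv").dropna(subset=explanatory)

for var in explanatory:
    others = [v for v in explanatory if v != var]
    X = sm.add_constant(jobs[others])   # the other explanatory variables, plus a constant
    fit = sm.OLS(jobs[var], X).fit()    # regress this explanatory variable on the others
    print(f"R-sq for {var} on the others: {100 * fit.rsquared:.1f}%")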
Variance inflation factors

A somewhat more mathematical expression of this problem may be seen by expressing these R² values in a different way and relating them to the standard errors of the regression coefficients in the prediction formula for the response. It turns out that the more correlation there is between the explanatory variables, the greater are these standard errors and so the less precisely the regression coefficients are estimated from the data. This phenomenon is described as variance inflation (really, standard error inflation). This parallels the difficulty, discussed above, of interpreting the regression coefficients in the presence of correlations among the explanatory variables; not only does interpretation of individual regression coefficients become difficult, but estimating the values of those regression coefficients also becomes difficult.

To see how variance inflation arises, consider the formula for the standard error of the regression coefficients in the regression of the response, Y, on explanatory variables X_1, ..., X_{p-1}:

$$ SE(\hat{\beta}_k) = \frac{\sigma}{s_k\sqrt{n-1}}\sqrt{\frac{1}{1-R_k^2}}, \qquad 1 \le k \le p-1. $$

Here, σ is the error standard deviation, s_k is the standard deviation of variable X_k, and R_k² is the R² value from the regression of X_k on the other explanatory variables, expressed as a proportion (and not as a percentage, as in standard regression output).

Note that, if R_k² = 0, that is, if X_k is not correlated with the other X variables, then

$$ SE(\hat{\beta}_k) = \frac{\sigma}{s_k\sqrt{n-1}}, \qquad 1 \le k \le p-1. $$

As R_k² deviates from 0, that is, as X_k becomes more correlated with the other X variables, R_k² gets closer to 1, 1 − R_k² gets closer to 0, 1/(1 − R_k²) increases towards infinity, and so SE(β̂_k) increases towards infinity. The factor √(1/(1 − R_k²)) may be referred to as a standard error inflation factor. Traditionally,

$$ VIF_k = \frac{1}{1-R_k^2} $$

is referred to as the variance inflation factor.

Exercise 4: Calculate the variance inflation factors for the Jobtime case study.

Traditionally, correlated X variables are regarded as problematic if the largest VIF exceeds 10, corresponding to the largest R² value exceeding 90%.

Multicollinearity

Mathematicians refer to the problem of highly correlated variables as the multicollinearity problem. This arises from the special case in which one (or more) of the explanatory variables is perfectly correlated with the other explanatory variables. This may be visualised easily in the case of two explanatory variables, say X1 and X2. In that case, X1 and X2 are perfectly linearly related, with no deviations from a straight line; recall SA, Figure 6.12, p. 211. X1 and X2 are said to be collinear. It follows that the correlation of Y with each of them is the same, the simple linear regressions of Y on each of them are equivalent, and adding the other to the simple linear regression of Y on one of them will make no difference.

When there are several explanatory variables, the linear relationships may be more complicated; for example, a linear combination of one subset may be perfectly related to a linear combination of another subset. In such circumstances, the term multicollinearity is used.

Perfectly collinear data may cause computational problems for regression software; it may refuse to fit a model. Software may be programmed to detect multicollinearity in special cases. For example, Data Desk will detect the presence of a superfluous indicator variable and fit a 0 coefficient. In practice, perfect multicollinearity rarely arises in observational studies, but near perfect multicollinearity can cause computational problems and will cause interpretational problems.
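As a rough check on Exercise 4, variance inflation factors can also be computed directly. The sketch below assumes the same hypothetical file and column names as the earlier sketch and uses statsmodels' variance_inflation_factor, which computes 1/(1 − R_k²) for each explanatory variable.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

explanatory = ["Units", "OpsPerUnit", "TOps", "Rushed"]

# Hypothetical file and column names, as in the earlier sketch.
jobs = pd.read_csv("jobtime.csv").dropna(subset=explanatory)

# Include the constant so that each VIF corresponds to the usual 1 / (1 - R_k^2).
X = sm.add_constant(jobs[explanatory])
for i, var in enumerate(explanatory, start=1):   # column 0 is the constant
    vif = variance_inflation_factor(X.values, i)
    print(f"VIF for {var}: {vif:.1f}")           # a VIF above 10 flags problematic correlation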