Chapter 7: Multiple Regression Analysis

Multiple regression involves the use of more than one independent (predictor) variable to predict a dependent variable.

Dependent variable (Y): the variable that is being predicted or explained by the regression equation.
Independent (predictor) variables (X1, X2, ..., Xk): the variables that do the predicting or explaining.

Good predictor variables (i) are correlated with the response variable (Y), and (ii) are not correlated with the other predictor variables (X's). If two predictor variables are highly correlated with each other, they explain the same variation in Y, and adding the second variable will not improve the forecast. This condition is called "multicollinearity".

Estimating Correlation
Use the Data Analysis / Correlation tool in Excel to estimate the correlations.

Multiple Regression Model
Statistical model: Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε
Mean response: E(Y) = μY = β0 + β1X1 + β2X2 + . . . + βkXk, i.e. Y = μY + ε
The ε's are error components. The errors are assumed to be (i) independent, (ii) normally distributed, and (iii) to have mean zero and an unknown standard deviation.

Estimated (sample) regression equation: Y = b0 + b1X1 + b2X2 + . . . + bkXk + e
Prediction equation for Y: Ŷ = b0 + b1X1 + b2X2 + . . . + bkXk,
where b0 = the y-intercept and bi = the slope for Xi, i = 1, 2, ..., k.
The slope coefficients b1, b2, ..., bk are referred to as partial (or net) regression coefficients.

Estimation (Least Squares Method)
Y = observation, Ŷ = fit, error (residual) = e = Y – Ŷ
SSE = sum of squared errors = Σ(Y – Ŷ)² = Σ[Y – (b0 + b1X1 + b2X2 + . . . + bkXk)]²
The least squares method chooses the values of b0, b1, ..., bk that minimize the sum of squared errors (SSE). In Excel, use the LINEST array function to determine these estimates. (A worked Python sketch of these calculations, through the F and t tests and a prediction interval, appears after the dummy-variable definition below.)

Decomposition of variance
Y = fit + residual = Ŷ + (Y – Ŷ)
Subtracting the mean Ȳ from both sides: (Y – Ȳ) = (Ŷ – Ȳ) + (Y – Ŷ)
Then Σ(Y – Ȳ)² = Σ(Ŷ – Ȳ)² + Σ(Y – Ŷ)², i.e. SST = SSR + SSE, where
SST = total sum of squares = Σ(Y – Ȳ)²
SSR = regression sum of squares = Σ(Ŷ – Ȳ)²
SSE = error sum of squares = Σ(Y – Ŷ)² = SST – SSR

ANOVA Table
Source        Sum of squares    df        Mean square           F
Regression    SSR               k         MSR = SSR/k           F = MSR/MSE
Error         SSE               n-k-1     MSE = SSE/(n-k-1)
Total         SST               n-1

Standard error of estimate: s_y.x = √MSE

Coefficient of determination: R² = SSR/SST = proportion of the variation in Y explained by the regression.
Multiple correlation coefficient: R = √R² = correlation between Y and Ŷ (0 ≤ R ≤ 1).

Significance of Regression
The regression is said to be significant if at least one slope coefficient is not equal to zero, so the hypotheses to test are:
H0: β1 = β2 = . . . = βk = 0 (regression is NOT significant)
Ha: at least one βi ≠ 0 (regression IS significant)
Assumptions:
1. Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε
2. The errors ε are independent.
3. The errors ε have constant variance.
4. The errors ε are normally distributed.
F test for significance of regression: F = MSR/MSE
t test for significance of an individual predictor variable: t = bi / s_bi, where s_bi = estimated standard deviation of bi.

Forecasting Y
Point prediction: Ŷ = b0 + b1X1 + b2X2 + . . . + bkXk
Interval prediction: Ŷ ± t·s_f, where s_f = standard error of the forecast, with df = n-k-1.

Dummy variables
A dummy (indicator) variable is a column of just zeros and ones, with one representing a condition or category and zero representing its opposite. Dummy (indicator) variables are used to determine the relationship between qualitative predictor (independent) variables and a dependent (response) variable.
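The following is a minimal sketch, in Python (NumPy/SciPy) rather than Excel's LINEST, of the calculations described above: the least-squares fit, the SST/SSR/SSE decomposition, the F and t statistics, and a point/interval prediction. The data, the new observation (X1 = 5.5, X2 = 1), and the 95% confidence level are made-up assumptions for illustration, and the standard error of the forecast uses the usual formula s_f = s_y.x·√(1 + x0′(X′X)⁻¹x0), which the notes above do not spell out.

    import numpy as np
    from scipy import stats

    # Illustrative data: n = 8 observations, k = 2 predictors (X2 is a dummy variable).
    X1 = np.array([3.0, 5.0, 2.0, 8.0, 7.0, 4.0, 6.0, 9.0])
    X2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
    Y  = np.array([9.0, 14.0, 7.0, 22.0, 17.0, 13.0, 16.0, 25.0])

    n, k = len(Y), 2
    X = np.column_stack([np.ones(n), X1, X2])       # design matrix with an intercept column

    # Least squares estimates b0, b1, b2 (the values that minimize SSE).
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ b                                   # fitted values

    # Decomposition of variance: SST = SSR + SSE.
    SST = np.sum((Y - Y.mean()) ** 2)
    SSE = np.sum((Y - Y_hat) ** 2)
    SSR = SST - SSE

    MSR, MSE = SSR / k, SSE / (n - k - 1)
    s_yx = np.sqrt(MSE)                             # standard error of estimate
    R2 = SSR / SST                                  # coefficient of determination
    F = MSR / MSE                                   # F test for significance of regression

    # t statistics t_i = b_i / s_bi, with s_bi from the diagonal of MSE * (X'X)^-1.
    XtX_inv = np.linalg.inv(X.T @ X)
    s_b = np.sqrt(MSE * np.diag(XtX_inv))
    t_stats = b / s_b

    # Point and 95% interval prediction at the assumed new observation (X1 = 5.5, X2 = 1).
    x0 = np.array([1.0, 5.5, 1.0])
    y_point = x0 @ b
    s_f = s_yx * np.sqrt(1.0 + x0 @ XtX_inv @ x0)   # standard error of the forecast
    t_crit = stats.t.ppf(0.975, df=n - k - 1)
    lower, upper = y_point - t_crit * s_f, y_point + t_crit * s_f

    print("b =", b.round(3), " R2 =", round(R2, 3), " F =", round(F, 2))
    print("t =", t_stats.round(2))
    print("prediction:", round(y_point, 2), " 95% interval:", (round(lower, 2), round(upper, 2)))

The coefficients, R², F, and sums of squares can be cross-checked against the output of Excel's LINEST (with its stats argument set to TRUE) on the same data.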
Example 7.5 (page 293)
Model: Y = β0 + β1X1 + β2X2 + ε
where Y = job performance rating, X1 = aptitude test score, and X2 = gender (0 = female, 1 = male).
Prediction equation: Ŷ = b0 + b1X1 + b2X2
This single prediction equation is equivalent to the following two equations:
1. For X2 = 1 (males): Ŷ = b0 + b1X1 + b2
2. For X2 = 0 (females): Ŷ = b0 + b1X1
Interpretation of b2: the estimated difference in job performance rating between males and females, for a given aptitude test score.

Multicollinearity
A linear relationship between two or more independent variables (X's) is called multicollinearity. The strength of multicollinearity is measured by the variance inflation factor (VIF):
VIFj = 1 / (1 – Rj²), j = 1, 2, ..., k
where Rj² = the coefficient of determination from a regression with Xj as the dependent variable and the remaining k-1 X's as the independent variables.
If a given X variable is not correlated with the X's already in the regression, the corresponding Rj² will be small and VIFj will be close to 1, indicating that multicollinearity is not a problem. On the other hand, if a given X variable is highly correlated with the X's already in the regression, the corresponding Rj² will be high (close to 1) and VIFj will be large, indicating a multicollinearity problem. When multicollinearity is present, the b estimates and the corresponding standard errors change considerably when the given X is included in the regression.
Use the user-defined VIF array function to determine the VIF values. (A Python sketch of the VIF calculation appears after the stepwise procedure below.)

Example 7.7 (page 298)

Selecting the "best" regression equation
Selecting the best regression equation involves developing an equation with as many useful predictor variables (X's) as possible, yet with as few predictors as possible in order to minimize cost.
Steps to follow:
1. Select a complete list of potential predictors (X's).
2. Screen out the predictor variables that do not seem appropriate. Four reasons a predictor may not be appropriate: (i) it is not directly related to the response variable, (ii) it may be subject to measurement errors, (iii) it may have multicollinearity with another predictor variable, and (iv) it may be difficult to measure accurately.
3. Shorten the list of predictors so as to obtain the "best" regression equation.

Model selection methodologies
(a) All possible regressions
Goal: develop the best possible regression for each number of predictor (X) variables, i.e., with one, two, three, etc. predictors.
(b) Stepwise regression
Goal: develop the best possible regression with as few predictor (X) variables as possible.
Procedure (a simplified sketch of the forward steps appears after this list):
1. The predictor variable with the largest correlation with Y is entered into the regression first.
2. Of the remaining X variables, the one that would increase the F statistic the most, provided the increase is at least a specified minimum amount, is added to the regression equation.
3. Once an additional X variable is included in the regression equation, the individual contributions of the X variables already in the equation are re-checked for significance using F. If any such F is smaller than a threshold minimum value, the corresponding X variable is removed from the regression equation.
4. Steps 2 and 3 are repeated until no more X variables can be added or removed.
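Below is a minimal sketch of the VIF calculation in Python/NumPy, standing in for the user-defined Excel VIF array function mentioned above. The helper names (r_squared, vif) and the illustrative predictor data are assumptions; X3 is deliberately built as a near-linear combination of X1 and X2 so that the resulting VIF values come out large.

    import numpy as np

    def r_squared(y, X):
        # R^2 from regressing y on X (X already includes an intercept column).
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

    def vif(X_vars):
        # VIF_j = 1 / (1 - R_j^2) for each column of X_vars (no intercept column).
        n, k = X_vars.shape
        out = []
        for j in range(k):
            Xj = X_vars[:, j]                          # treat X_j as the response
            others = np.delete(X_vars, j, axis=1)      # remaining k-1 predictors
            X_design = np.column_stack([np.ones(n), others])
            out.append(1.0 / (1.0 - r_squared(Xj, X_design)))
        return np.array(out)

    # Illustrative predictors: X3 is nearly a linear combination of X1 and X2,
    # so the corresponding VIF values should come out large.
    rng = np.random.default_rng(0)
    X1 = rng.normal(size=30)
    X2 = rng.normal(size=30)
    X3 = 2 * X1 - X2 + rng.normal(scale=0.05, size=30)
    print(vif(np.column_stack([X1, X2, X3])).round(1))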
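The next sketch illustrates only the forward part of the stepwise procedure (steps 1-2); the removal check in step 3 is omitted for brevity, and the entry threshold F_IN = 4.0 is an assumption rather than a value from the text. At each pass, the candidate predictor with the largest partial F statistic is added, provided that F is at least the threshold.

    import numpy as np

    def sse(y, X):
        # Sum of squared errors from the least squares fit of y on X.
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ b
        return r @ r

    def forward_stepwise(y, X_vars, F_IN=4.0):
        # Add, one at a time, the candidate X with the largest partial F,
        # as long as that F is at least F_IN; stop otherwise.
        n, k = X_vars.shape
        selected, remaining = [], list(range(k))
        while remaining:
            X_cur = np.column_stack([np.ones(n)] + [X_vars[:, j] for j in selected])
            sse_cur = sse(y, X_cur)
            best_j, best_F = None, 0.0
            for j in remaining:
                X_new = np.column_stack([X_cur, X_vars[:, j]])
                sse_new = sse(y, X_new)
                df_err = n - (len(selected) + 1) - 1   # error df after adding X_j
                F = (sse_cur - sse_new) / (sse_new / df_err)
                if F > best_F:
                    best_j, best_F = j, F
            if best_j is None or best_F < F_IN:
                break                                  # no candidate clears the threshold
            selected.append(best_j)
            remaining.remove(best_j)
        return selected

    # Illustrative data: y depends on columns 0 and 2 only, so those two
    # columns should be the ones selected.
    rng = np.random.default_rng(1)
    X_vars = rng.normal(size=(40, 5))
    y = 3 * X_vars[:, 0] - 2 * X_vars[:, 2] + rng.normal(size=40)
    print("selected columns:", forward_stepwise(y, X_vars))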
Example 7.8 (page 300)
Y = one month's sales
X1 = selling aptitude test score
X2 = age, in years
X3 = anxiety test score
X4 = experience, in years
X5 = high school GPA

Regression diagnostics and residual analysis
Histogram of residuals: checks the normality assumption; moderate deviation from a bell shape is permitted.
Residuals (y-axis) vs. fitted values Ŷ (x-axis): checks the linearity assumption; if the plot is not completely random, a transformation may be considered.
Residuals vs. explanatory variables (X): also checks the linear model and the constant-variance assumption.
Residuals over time (for time series): used for time-series data; checks all assumptions.
Autocorrelation of residuals: checks the independence of the residuals.

Leverage of the i-th data point (hii)
0 < hii < 1; leverage depends only on the predictors, not on Y.
Rule of thumb: hii > 3(k+1)/n is considered high.
A high-leverage point is an outlier among the X's, and high-leverage points will unduly influence the estimated parameter values.

Outlying or extreme Y values
Residual: ei = Y – Ŷ
Standardized residual: ei / s_ei, where s_ei = s_y.x √(1 – hii)
If |ei / s_ei| > 2, the residual is considered extreme.
(A Python sketch of the leverage and standardized-residual calculations appears at the end of these notes.)

Forecasting Caveats
Over-fitting: too many X's in the model. Rule of thumb: have at least 10 observations for each X variable in the model.
Useful regression: the F ratio must be at least four times the corresponding critical value for the model to yield useful results.
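To make the leverage and standardized-residual rules concrete, here is a minimal self-contained sketch in Python/NumPy. The simulated data are an assumption; observation 0 is deliberately given extreme X values so that it shows up as a high-leverage point.

    import numpy as np

    # Simulated data; observation 0 gets extreme X values on purpose.
    rng = np.random.default_rng(2)
    n, k = 20, 2
    X_vars = rng.normal(size=(n, k))
    X_vars[0, :] = [6.0, -6.0]
    Y = 1.0 + 2.0 * X_vars[:, 0] + X_vars[:, 1] + rng.normal(size=n)

    X = np.column_stack([np.ones(n), X_vars])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    e = Y - X @ b                                   # residuals

    # Leverages h_ii are the diagonal of the hat matrix H = X (X'X)^-1 X'.
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)

    MSE = (e @ e) / (n - k - 1)
    s_yx = np.sqrt(MSE)
    std_resid = e / (s_yx * np.sqrt(1.0 - h))       # standardized residual e_i / s_ei

    print("high leverage points (h_ii > 3(k+1)/n):", np.where(h > 3 * (k + 1) / n)[0])
    print("extreme residuals (|std resid| > 2):", np.where(np.abs(std_resid) > 2)[0])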