Chapter 8. Multiple Regression

A. Multiple Regression Model

Many applications of regression analysis involve situations in which there is more than one independent variable. A regression model that contains more than one independent variable is called a multiple regression model.

Multiple Linear Regression Model

$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$

- $Y$ is the dependent variable
- $X_1, X_2, \ldots, X_k$ are the independent variables (known constants)
- $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown regression coefficients (parameters)
- $\varepsilon$ is the random error

Assumptions on the random error $\varepsilon$

$\varepsilon \overset{iid}{\sim} \text{Normal}[0, \sigma^2]$, equivalently $Y \overset{iid}{\sim} \text{Normal}[\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k,\ \sigma^2]$, with $\sigma^2$ unknown:

1) the mean of the error variable is 0 ($E[\varepsilon] = 0$)
2) the variance of $\varepsilon$ is $\sigma^2$ ($V[\varepsilon] = \sigma^2$)
3) the errors are independent
4) the errors are normally distributed

B. Estimation

Suppose that we have $n$ sets of observations $(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i)$, $i = 1, \ldots, n$, to be used to estimate $\beta_0, \beta_1, \ldots, \beta_k$ in the multiple linear regression model

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \quad (i = 1, 2, \ldots, n)$

Estimating the Coefficients by the Method of Least Squares

Choose as estimators of $\beta_0, \beta_1, \ldots, \beta_k$ the values $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ that minimize

$SS = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik})^2$

▪ Least Squares Estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$

Find $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ that satisfy $\dfrac{\partial SS}{\partial \beta_j} = 0$, $j = 0, 1, 2, \ldots, k$

▶ $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$: let the computer produce these values

The fitted or estimated regression model is

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik} \quad (i = 1, 2, \ldots, n)$

▪ Residual $e_i$

$e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}$

Estimating $\sigma^2$

The residuals $e_i = y_i - \hat{y}_i$ are used to obtain an estimate of $\sigma^2$.

▪ SSE (error sum of squares): $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik})^2$

▪ MSE (mean square error): $\hat{\sigma}^2 = MSE = \dfrac{SSE}{n-k-1}$ is an estimator of $\sigma^2$, since $E[MSE] = E\left[\dfrac{SSE}{n-k-1}\right] = \sigma^2$

▶ $SSE$, $MSE$ ($\hat{\sigma}^2$): let the computer produce these values

C. Hypothesis Tests in Multiple Regression

In simple linear regression, the t-test and the F-test provide the same conclusion; that is, if the null hypothesis is rejected, we conclude that $\beta_1 \neq 0$. In multiple regression, the t-test and the F-test have different purposes.

1) The F-test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables (a test for overall significance).
2) The t-test is used to determine whether each of the individual independent variables is significant. A separate t-test is conducted for each of the independent variables in the model (a test for individual significance).

Use of t-Tests

For each independent variable $X_1, X_2, \ldots, X_k$, we can test whether there is enough evidence of a linear relationship between it and the dependent variable $Y$ for the entire population.

Hypothesis (testing the significance of coefficients)

$H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$ $(j = 1, \ldots, k)$

Failure to reject $H_0: \beta_j = 0$ is equivalent to concluding that there is no linear relationship between $X_j$ and $Y$. Alternatively, rejecting $H_0: \beta_j = 0$ implies that there is enough evidence of a linear relationship between $X_j$ and $Y$.

Sampling Distribution of the Test Statistic (under $H_0: \beta_j = 0$)

$T = \dfrac{\hat{\beta}_j - 0}{S.E.[\hat{\beta}_j]} \sim t[n-k-1]$

▶ $S.E.[\hat{\beta}_j]$: let the computer produce these values

Rejection Rule

Critical Value Approach: Reject $H_0$ if $|T| = \left|\dfrac{\hat{\beta}_j}{S.E.[\hat{\beta}_j]}\right| > t_{\alpha/2,\,n-k-1}$
p-value Approach: Reject $H_0$ if $p\text{-value} < \alpha$

※ $100(1-\alpha)\%$ Confidence Interval on the slope $\beta_j$:

$\left(\hat{\beta}_j - t_{\alpha/2,\,n-k-1}\,S.E.[\hat{\beta}_j],\ \ \hat{\beta}_j + t_{\alpha/2,\,n-k-1}\,S.E.[\hat{\beta}_j]\right)$
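The notes repeatedly say "let the computer produce these values." As a minimal sketch of what the computer does for Sections B and C, the following Python code fits the least squares coefficients, estimates $\sigma^2$ by $MSE$, and carries out the individual t-tests and confidence intervals. The data are simulated purely for illustration; none of the variable names or numbers come from the notes.

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only: n = 30 observations, k = 2 predictors
rng = np.random.default_rng(0)
n, k = 30, 2
X = rng.normal(size=(n, k))
y = 5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Design matrix with an intercept column; the least squares solution minimizes SS
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Residuals e_i = y_i - yhat_i, SSE, and MSE = SSE / (n - k - 1)
resid = y - Xd @ beta_hat
SSE = resid @ resid
MSE = SSE / (n - k - 1)        # estimator of sigma^2

# S.E.[beta_hat_j]: square roots of the diagonal of MSE * (X'X)^{-1}
se = np.sqrt(np.diag(MSE * np.linalg.inv(Xd.T @ Xd)))

# T = beta_hat_j / S.E.[beta_hat_j] ~ t[n-k-1] under H0: beta_j = 0
t_stat = beta_hat / se
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - k - 1)

# 100(1 - alpha)% confidence intervals on each beta_j, with alpha = 0.05
t_crit = stats.t.ppf(0.975, df=n - k - 1)
for j in range(k + 1):
    lo, hi = beta_hat[j] - t_crit * se[j], beta_hat[j] + t_crit * se[j]
    print(f"beta_{j}: estimate {beta_hat[j]:7.4f}  se {se[j]:.4f}  "
          f"t {t_stat[j]:7.3f}  p {p_val[j]:.4f}  CI ({lo:.3f}, {hi:.3f})")
```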
The F-test

Hypothesis (analysis of variance approach to testing significance of regression)

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_1:$ at least one $\beta_j \neq 0$ $(j = 1, 2, \ldots, k)$

If $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ is rejected, the test gives us sufficient statistical evidence to conclude that one or more of the parameters is not equal to zero and that the overall relationship between $Y$ and the set of independent variables $X_1, X_2, \ldots, X_k$ is significant. However, if $H_0$ cannot be rejected, we do not have sufficient evidence to conclude that a significant relationship is present.

Sampling Distribution of the Sums of Squares

The procedure partitions the total variability in the response variable into meaningful components as the basis for the test.

- Decomposition of variation

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

⇒ $SST$ (total sum of squares) is partitioned into $SSR$ (regression sum of squares) and $SSE$ (error sum of squares)

▶ $SST$, $SSR$: let the computer produce these values

- Decomposition of degrees of freedom

$(n-1) = (k) + (n-k-1)$

⇒ $SST$'s degrees of freedom are partitioned into $SSR$'s degrees of freedom and $SSE$'s degrees of freedom

- $SSR/\sigma^2$ and $SSE/\sigma^2$ are independently distributed chi-square random variables with $k$ and $(n-k-1)$ degrees of freedom, respectively (by Cochran's Theorem).

- We can show that $E\left[\dfrac{SSE}{n-k-1}\right] = \sigma^2$ and that, under $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$, $E\left[\dfrac{SSR}{k}\right] = \sigma^2$. If $H_0$ is not true, $E\left[\dfrac{SSR}{k}\right] > \sigma^2$.

Test Statistic

$F = \dfrac{MSR}{MSE} = \dfrac{SSR/k}{SSE/(n-k-1)} \sim F[k,\ n-k-1]$

⇒ Under $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$, $MSR \approx MSE$.
⇒ If $MSR$ is significantly larger than $MSE$, then reject $H_0$ (always upper tail).

Rejection Rule

Critical Value Approach: Reject $H_0$ if $F = \dfrac{MSR}{MSE} > F_{\alpha,\,k,\,n-k-1}$
p-value Approach: Reject $H_0$ if $p\text{-value} < \alpha$

ANOVA Table

Source of     Sum of     Degrees of    Mean Square           F
Variation     Squares    Freedom
Regression    SSR        k             MSR = SSR/k           F = MSR/MSE
Error         SSE        n-k-1         MSE = SSE/(n-k-1)
Total         SST        n-1

D. Estimating Values of the Dependent Variable

As was the case with simple linear regression, we can use the multiple regression equation in two ways: we can produce a prediction interval for a particular value of $y$, and we can produce a confidence interval estimate of the expected value of $y$.

▶ $100(1-\alpha)\%$ confidence interval for the mean response $E[y_0]$, and $100(1-\alpha)\%$ prediction interval for a new observation $y_0$: let the computer produce these values

E. Adequacy of the Regression Model

Using $R^2$ in the Multiple Regression Model

Adding independent variables causes the prediction error to become smaller, thus reducing $SSE$. When $SSE$ becomes smaller, $SSR$ becomes larger, causing $R^2 = \dfrac{SSR}{SST}$ to increase.

⇒ The $R^2$ value for a regression can be made arbitrarily high simply by including more and more predictors in the model.

Adjusted $R^2$

The adjusted $R^2$ statistic essentially penalizes the analyst for adding terms to the model. It is an easy way to guard against overfitting, that is, including independent variables that are not really useful. The adjusted $R^2$ is given by

$\text{Adj } R^2 = 1 - \dfrac{MSE}{MST} = 1 - \dfrac{SSE/(n-k-1)}{SST/(n-1)}$

⇒ Because $SSE/(n-k-1)$ is the mean square error and $SST/(n-1)$ is a constant, Adj $R^2$ will increase when a variable is added to the model only if the new variable reduces the mean square error.
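The ANOVA quantities, $R^2$, adjusted $R^2$, and the intervals of Section D are likewise produced by software. Here is a minimal sketch using the statsmodels library, with the same kind of simulated data as in the earlier sketch; the new point x0 is hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data for illustration only (as in the earlier sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=30)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# Overall F-test of H0: beta_1 = ... = beta_k = 0
print("F =", fit.fvalue, " p-value =", fit.f_pvalue)

# Decomposition SST = SSR + SSE
# (statsmodels calls SSR `ess` and SSE `ssr`: explained vs. residual SS)
print("SSR =", fit.ess, " SSE =", fit.ssr, " SST =", fit.ess + fit.ssr)

# R^2 = SSR/SST and Adj R^2 = 1 - MSE/MST
print("R^2 =", fit.rsquared, " Adj R^2 =", fit.rsquared_adj)

# Section D: CI for the mean response E[y0] and prediction interval for y0
# at a hypothetical new point (intercept, x1 = 0.5, x2 = -0.2)
x0 = [[1.0, 0.5, -0.2]]
print(fit.get_prediction(x0).summary_frame(alpha=0.05))
# mean_ci_lower/upper = confidence interval for the mean response;
# obs_ci_lower/upper  = prediction interval for a new observation
```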
F. Regression Diagnostics (Residual Analysis)

The Required Conditions for the Validity of Regression Analysis

Estimation of the model parameters requires that the errors be uncorrelated normal random variables with mean zero and constant variance.

Regression Diagnostics (examining the adequacy of the regression model)

Most departures from the required conditions can be diagnosed by examining the residuals $e_i = y_i - \hat{y}_i$ $(i = 1, 2, \ldots, n)$.

1) Non-normality ⇒ histogram of the residuals $e_i$
2) Heteroscedasticity ⇒ residual plots (plot the residuals $e_i$ against $\hat{y}_i$, $i = 1, 2, \ldots, n$)
3) Non-independence (autocorrelation) of the error variable ⇒ residual plots (plot the residuals $e_i$ against $\hat{y}_i$ or against $i$, $i = 1, 2, \ldots, n$)

Patterns for Residual Plots: (figure of typical residual-plot patterns not reproduced here)

G. Regression Diagnostics (Multicollinearity)

In most regression problems, we find that there are dependencies among the independent variables $X_1, \ldots, X_k$. In situations where these dependencies are strong, we say that multicollinearity exists.

The Effect of Multicollinearity

The variance of the coefficient estimate $\hat{\beta}_j$ $(j = 1, 2, \ldots, k)$ can be expressed as

$Var[\hat{\beta}_j] = \dfrac{\sigma^2}{S_{x_j x_j}(1 - R_j^2)}$, where $S_{x_j x_j} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$

and $R_j^2$ is the coefficient of determination resulting from regressing $x_j$ on the other $k-1$ regressor variables. The stronger the linear dependency of $x_j$ on the remaining regressor variables, and hence the stronger the multicollinearity, the larger the value of $R_j^2$ will be.

⇒ The variance of $\hat{\beta}_j$ is inflated by the quantity $\dfrac{1}{1 - R_j^2}$.

As a result, multicollinearity can have some negative effects on the estimates of the regression coefficients:

- the individual coefficients may not be statistically significant, even though the overall regression equation is strong and its predictive ability good
- the relative magnitudes and even the signs of the coefficients may defy interpretation
- the values of the individual regression coefficients may change radically with the removal or addition of a predictor variable in the equation

Detection of the Presence of Multicollinearity

1) The variance inflation factors $VIF(\beta_j)$ are very useful measures of multicollinearity. Some authors have suggested that if any variance inflation factor $VIF(\beta_j)$, $j = 1, 2, \ldots, k$, exceeds 10, multicollinearity is a problem (see the sketch after this list).

▪ Variance inflation factor for $\beta_j$: $VIF(\beta_j) = \dfrac{1}{1 - R_j^2}$, $j = 1, 2, \ldots, k$

2) If the F-test for significance of regression is significant, but the tests on the individual regression coefficients are not significant, multicollinearity may be present.
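As a minimal sketch of detection method 1), the code below computes each $VIF(\beta_j)$ directly from its definition, regressing $x_j$ on the remaining predictors to obtain $R_j^2$. The simulated data are constructed so that two of the three predictors are nearly collinear; all names and numbers are illustrative only.

```python
import numpy as np

def vif(X):
    """VIF(beta_j) = 1 / (1 - R_j^2), where R_j^2 is the R^2 from
    regressing column j of X on the remaining columns (plus an intercept)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2_j = 1.0 - (resid @ resid) / np.sum((xj - xj.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2_j))
    return factors

# Simulated predictors: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=30)
x2 = x1 + 0.05 * rng.normal(size=30)
x3 = rng.normal(size=30)

for j, v in enumerate(vif(np.column_stack([x1, x2, x3])), start=1):
    print(f"VIF(beta_{j}) = {v:8.2f}", "<-- exceeds 10" if v > 10 else "")
```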
Multiple Regression Examples

Example 1

In order to determine whether or not the sales volume of a company ($Y$, in millions of dollars) is related to advertising expenditures ($X_1$, in millions of dollars) and the number of salespeople ($X_2$), data were gathered for 10 years. Part of the regression output is shown below.

Predictor    Coefficient    Standard Error
Constant     7.0174         1.8972
X1           8.6233         2.3968
X2           0.0858         0.1845

Analysis of Variance
Source        Degrees of Freedom    Sum of Squares    Mean Square    F
Regression    ?                     321.11            ?              ?
Error         ?                     63.39             ?

a) Use the above results to write the regression equation that can be used to predict sales.
b) Estimate the sales volume for an advertising expenditure of 3.5 million dollars and 45 salespeople. Give your answer in dollars.
c) At $\alpha = 0.05$, test to determine whether the fitted equation developed in Part a) represents a significant relationship between the independent variables and the dependent variable.
d) At $\alpha = 0.05$, test to see whether $\beta_1$ is significantly different from zero.
e) Determine the multiple coefficient of determination.
f) Compute the adjusted coefficient of determination.

Example 2

The following is part of the output of a regression analysis involving sales ($Y$, in millions of dollars), advertising expenditures ($X_1$, in thousands of dollars), and the number of salespeople ($X_2$) for a corporation:

Analysis of Variance
Source        Degrees of Freedom    Sum of Squares    Mean Square    F
Regression    2                     822.088           ?              ?
Error         7                     736.012           ?

a) At the $\alpha = 0.05$ level of significance, test to determine whether the model is significant; that is, determine whether there exists a significant relationship between the independent variables and the dependent variable.
b) Determine the multiple coefficient of determination.
c) Determine the adjusted multiple coefficient of determination.
d) What was the sample size for this regression analysis?

Example 3

Below you are given partial computer output based on a sample of 12 observations relating the number of personal computers sold by a computer shop per month ($Y$), unit price ($X_1$, in thousands of dollars), and the number of advertising spots ($X_2$) used on a local television station.

Predictor    Coefficient    Standard Error
Constant     17.145         7.865
X1           -0.104         3.282
X2           1.376          0.250

a) Use the output shown above to write an equation that can be used to predict the monthly sales of computers.
b) Interpret the coefficients of the estimated regression equation found in Part a).
c) If the company charges $2,000 for each computer and uses 10 advertising spots, how many computers would you expect them to sell?
d) At $\alpha = 0.05$, test to determine whether the price is a significant variable.
e) At $\alpha = 0.05$, test to determine whether the number of advertising spots is a significant variable.

Example 4

Below you are given partial computer output relating the price of a company's stock ($Y$, in dollars), the Dow Jones Industrial Average ($X_1$), and the stock price of the company's major competitor ($X_2$, in dollars).

Analysis of Variance
Source        Degrees of Freedom    Sum of Squares    Mean Square    F
Regression    ?                     ?                 ?              ?
Error         20                    40                ?
Total         ?                     800

a) What was the sample size for this regression analysis?
b) At the $\alpha = 0.05$ level of significance, test to determine whether the model is significant; that is, determine whether there exists a significant relationship between the independent variables and the dependent variable.
c) Determine the multiple coefficient of determination.

Example 5

A microcomputer manufacturer has developed a regression model relating sales ($Y$, in tens of thousands of dollars) to three independent variables: price per unit (Price, in hundreds of dollars), advertising (ADV, in thousands of dollars), and the number of product lines (Lines). Part of the regression output is shown below.

Predictor    Coefficient    Standard Error
Constant     1.0211         22.8752
Price        -0.1524        0.1411
ADV          0.8849         0.2886
Lines        -0.1463        1.5340

Analysis of Variance
Source        Degrees of Freedom    Sum of Squares    Mean Square    F
Regression    ?                     2708.61           ?              ?
Error         14                    2840.51           ?

a) Use the above results to write the regression equation that can be used to predict sales.
b) If the manufacturer has 10 product lines, advertising of $40,000, and a price per unit of $3,000, what is your estimate of sales? Give your answer in dollars.
c) Compute the coefficient of determination and fully interpret its meaning.
d) At $\alpha = 0.05$, test to see whether there is a significant relationship between sales and unit price.
e) At $\alpha = 0.05$, test to see whether there is a significant relationship between sales and the number of product lines.
f) Is the regression model significant? (Perform an F-test.)
g) Fully interpret the meaning of the regression coefficient of price per unit, that is, the slope for the price per unit.
h) What was the sample size for this analysis?
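Exercises of this kind can be checked with a few lines of code. As an illustration only (this solution sketch is not part of the original problem set), the following works through Parts a) through d) of Example 2 using nothing but the two rows of its ANOVA table.

```python
from scipy import stats

# From Example 2's ANOVA table
SSR, df_reg = 822.088, 2      # k = 2 independent variables
SSE, df_err = 736.012, 7      # n - k - 1 = 7

# Part a): F = MSR/MSE, compared against F[k, n-k-1]
MSR, MSE = SSR / df_reg, SSE / df_err
F = MSR / MSE
p = stats.f.sf(F, df_reg, df_err)          # upper-tail p-value
print(f"F = {F:.3f}, p-value = {p:.4f}")   # reject H0 at alpha = 0.05 if p < 0.05

# Parts b) and c): R^2 = SSR/SST and Adj R^2 = 1 - MSE/MST
SST = SSR + SSE
MST = SST / (df_reg + df_err)              # SST has n - 1 = k + (n-k-1) df
print(f"R^2 = {SSR / SST:.4f}, Adj R^2 = {1 - MSE / MST:.4f}")

# Part d): n - 1 = df_reg + df_err, so the sample size is
print("n =", df_reg + df_err + 1)
```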