PLS 802
Spring 2016
Professor Jacoby

R² and Analysis of Variance (ANOVA) in Multiple Regression

For purposes of this handout, we will assume a multiple regression analysis with two independent variables (k = 2). Note, however, that all of the concepts generalize to any number of independent variables, as long as k + 1 < n. The basic regression equation is as follows:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + e_i$$

I. The Variance of the Predicted Values:

$$\hat{Y}_i = a + b_1 X_{1i} + b_2 X_{2i}$$

We can see that Ŷ is a linear combination. Therefore, its variance is obtained as follows:

$$\text{var}(\hat{Y}) = b_1^2\,\text{var}(X_1) + b_2^2\,\text{var}(X_2) + 2 b_1 b_2\,\text{cov}(X_1, X_2)$$

Note that there are no terms involving a on the right-hand side of the preceding expression, because its "variable" is a constant which has no variance and does not covary with any other variables.

II. The Variance of Y:

The dependent variable, Y, is a linear combination of the independent variables and the residual term from the regression equation. Therefore, its variance can be broken down as follows:

$$\text{var}(Y) = b_1^2\,\text{var}(X_1) + b_2^2\,\text{var}(X_2) + \text{var}(e) + 2 b_1 b_2\,\text{cov}(X_1, X_2) + 2 b_1\,\text{cov}(X_1, e) + 2 b_2\,\text{cov}(X_2, e)$$

Once again, there are no terms containing a on the right-hand side because its variable is a constant, with no variance or nonzero covariances with any other terms. Also, the OLS estimation procedure guarantees that the residual will be uncorrelated with all of the independent variables in the equation. Therefore, the preceding equation can be expressed as follows:

$$\text{var}(Y) = b_1^2\,\text{var}(X_1) + b_2^2\,\text{var}(X_2) + \text{var}(e) + 2 b_1 b_2\,\text{cov}(X_1, X_2)$$

We can rearrange the terms in the preceding equation, and substitute in some of the earlier results, to produce the following:

$$\text{var}(Y) = b_1^2\,\text{var}(X_1) + b_2^2\,\text{var}(X_2) + 2 b_1 b_2\,\text{cov}(X_1, X_2) + \text{var}(e)$$

$$\text{var}(Y) = \text{var}(\hat{Y}) + \text{var}(e)$$

Thus, the total variance in Y can be broken down into a sum of the variance in Ŷ (the "explained" variance) and the variance in e (the residual or "unexplained" variance).

III. Sums of Squares and the R² Goodness of Fit Measure:

We can obtain the sums of squares from the previous equation, as follows (sums are taken over all n observations in the sample, so limits of summation are not shown):

$$\text{var}(Y) = \text{var}(\hat{Y}) + \text{var}(e)$$

$$(n-1)\,\text{var}(Y) = (n-1)\,\text{var}(\hat{Y}) + (n-1)\,\text{var}(e)$$

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum e_i^2$$

$$\text{TSS} = \text{RegSS} + \text{RSS}$$

Just as in the bivariate case, we can define R² as the explained sum of squares, expressed as a proportion of the total sum of squares. This is usually interpreted as the proportion of variance in the dependent variable "explained" by the independent variables in the regression equation. Note that the R² in a multiple regression equation is equal to the squared bivariate correlation between Y and Ŷ:

$$R^2 = \frac{\text{RegSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} = r^2_{Y,\hat{Y}}$$
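These identities are easy to verify numerically. The following is a minimal sketch (not part of the original handout) in Python with numpy only; the simulated data, coefficient values, and variable names are illustrative assumptions, not material from the course.

```python
# Minimal numerical check of Sections I-III, assuming simulated data;
# the variable names and data-generating values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)          # deliberately correlated with X1
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(size=n)

# OLS fit with an intercept: columns of X are (1, X1, X2)
X = np.column_stack([np.ones(n), X1, X2])
a, b1, b2 = np.linalg.lstsq(X, Y, rcond=None)[0]
Y_hat = a + b1 * X1 + b2 * X2
e = Y - Y_hat

# Section I: var(Y_hat) = b1^2 var(X1) + b2^2 var(X2) + 2 b1 b2 cov(X1, X2)
rhs = (b1**2 * np.var(X1, ddof=1) + b2**2 * np.var(X2, ddof=1)
       + 2 * b1 * b2 * np.cov(X1, X2)[0, 1])
print(np.isclose(np.var(Y_hat, ddof=1), rhs))                  # True

# Section II: var(Y) = var(Y_hat) + var(e), since cov(X1, e) = cov(X2, e) = 0
print(np.isclose(np.var(Y, ddof=1),
                 np.var(Y_hat, ddof=1) + np.var(e, ddof=1)))   # True

# Section III: R^2 = RegSS/TSS = 1 - RSS/TSS = squared correlation of Y and Y_hat
TSS = np.sum((Y - Y.mean()) ** 2)
RegSS = np.sum((Y_hat - Y.mean()) ** 2)
RSS = np.sum(e ** 2)
R2 = RegSS / TSS
print(np.isclose(R2, 1 - RSS / TSS))                           # True
print(np.isclose(R2, np.corrcoef(Y, Y_hat)[0, 1] ** 2))        # True
```

Any standard regression routine reports the same quantities; the point of computing them by hand here is only to mirror the algebra above.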
IV. The ANOVA Table for a Regression Equation:

Each sum of squares is associated with its own degrees of freedom. Like the sums of squares, the total degrees of freedom can be broken down into a sum of the degrees of freedom associated with RegSS (often called "df_Regression" or "df_Model" and equal to k, the number of independent variables) and the degrees of freedom associated with RSS (often called "df_Residual" and equal to n − k − 1).

A sum of squares divided by its degrees of freedom is called a "mean squares." We can obtain the "mean squares total" (equal to the sample variance of Y), the "mean squares model" (also known as the "regression mean squares" or RegMS), and the "mean squares residual" (RMS, also known as the "mean squared error").

This information is often presented in tabular form. The resultant table is called "the analysis of variance for the regression" or "the ANOVA table" for the equation. It usually looks like the following (although the actual numerical values obtained from the OLS estimates would be substituted into the cells of the table):

| Source     | Sum of Squares                 | Degrees of Freedom | Mean Squares                        |
|------------|--------------------------------|--------------------|-------------------------------------|
| Regression | $\sum (\hat{Y}_i - \bar{Y})^2$ | $k$                | $\sum (\hat{Y}_i - \bar{Y})^2 / k$  |
| Residual   | $\sum e_i^2$                   | $n - k - 1$        | $\sum e_i^2 / (n - k - 1)$          |
| Total      | $\sum (Y_i - \bar{Y})^2$       | $n - 1$            | $\sum (Y_i - \bar{Y})^2 / (n - 1)$  |
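For readers who want to see where those cells come from computationally, here is a small companion sketch (again Python with numpy, again on made-up data; nothing in it comes from the original handout) that fills in the ANOVA table for a two-predictor regression.

```python
# Sketch of building the ANOVA table by hand for k = 2 predictors;
# the data below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 2
X1, X2 = rng.normal(size=n), rng.normal(size=n)
Y = 1.0 + 0.6 * X1 + 0.3 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])
Y_hat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - Y_hat

# (Source, Sum of Squares, Degrees of Freedom); Mean Squares = SS / df
rows = [
    ("Regression", np.sum((Y_hat - Y.mean()) ** 2), k),
    ("Residual",   np.sum(e ** 2),                  n - k - 1),
    ("Total",      np.sum((Y - Y.mean()) ** 2),     n - 1),
]

print(f"{'Source':<12}{'SS':>12}{'df':>6}{'MS':>12}")
for source, ss, df in rows:
    print(f"{source:<12}{ss:>12.3f}{df:>6}{ss / df:>12.3f}")
```

Note that the mean squares in the Total row is just the sample variance of Y, and that the Regression and Residual degrees of freedom (k and n − k − 1) sum to the Total degrees of freedom, n − 1.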