REGRESSION ANALYSIS OF VARIANCE Consider the usual regression model Yi = 0 + 1 xi1 + 2 xi2 + … + k xik + i for i = 1, 2, 3, …, n with the usual assumptions on the 1, 2, …, n : These are unobserved random quantities. These are statistically independent of each other. E i = 0 and Var i = 2 for each i. These are normally distributed. The final assumption, regarding normal distributions, is needed to get exact tests and confidence intervals. The matrix form of the model is Y = X + in which the design matrix X is n-by-p. We use p = k + 1. Let PX be the projection matrix into £col(X), the space spanned by the columns of X. If X has full rank p, then £col(X) is a space of dimension p, and we can certainly write PX = X (X´ X)-1 X´ Certainly PX is symmetric. You can check that PX is a projection matrix, as PX = PX PX . If X does not have full rank, then PX is still uniquely defined, but it cannot be computed in the form just given. (The computation of PX in that case will not be discussed here.) The remainder of this document will use PX = X (X´ X)-1 X´ , along with the assumption that £col(X) has dimension p. Let the fitted vector be Y = PX Y = X (X´ X)-1 X´ Y . This vector of course lies in the space £col(X). Since 1 is a column of X, it follows that PX 1 = 1. Equivalently, 1´ PX = 1´. Thus we can show that 1´ Y = 1´ PX Y = 1´ Y In words, the statement above says “sum of the observed = sum of the fitted.” The residual e is defined as e = Y - Y = Y - PX Y = (I - PX ) Y. This lies in the n - p dimensional space which is orthogonal to £col(X). In words, “residual = observed minus fitted.” Page 1 gs2011 REGRESSION ANALYSIS OF VARIANCE In fact, the vectors Y, Y , and e form the corners of a right triangle. This gives the Pythagorean statement Y´ Y = Y ´ Y + e´ e The vectors lie in spaces of dimensions n, p, and n - p, respectively. This right triangle result is very easy to verify using PX = X (X´ X)-1 X´ . The regression analysis of variance table, however, is based on a somewhat different right triangle. The regression exercise is directed toward understanding the relationship between Y and the k independent variables. The design matrix X usually contains a column 1, corresponding to the intercept 0, and there is no inferential interest in 0. Accordingly, we remove the means by using the “mean sweeper” matrix M, where 1 M = I - n F1 G G = G G G H 1 n 1 n 1 n 1 n 1n 1 1 n 1n 1n 1 n 1n 1 1n 1n 1n 1 n 1n 1 1n I J J J J J K By direct calculation, you can see that FY Y I G Y YJ G J MY = G Y YJ G J G J G HY Y JK 1 2 3 n The matrix M is also a projection, and it’s easily confirmed that M M = M . Also, M´ = M. Certainly M 1 = 0. n You can also see that (M Y)´ (M Y) = Y´ M Y = cY Y h= 2 i SYY . i 1 Because the matrix X contains the column 1, it follows that PX 1 = 1, (I – PX) M = M (I - PX) = (I - PX). Note that PX M = M PX . Now apply the projection PX to M Y: M Y = PX (M Y) + (I - PX ) (M Y ) Page 2 gs2011 REGRESSION ANALYSIS OF VARIANCE However, (I - PX ) (M Y ) = [ (I - PX ) M ]Y = (I - PX ) Y = e, so we write this as M Y = PX (M Y) + e A Pythagorean result applies here as well: (M Y)´ (M Y) = [ PX (M Y) ]´ [ PX (M Y) ] + e´ e or (M Y)´ (M Y) = (M Y)´ PX PX M Y + e´ e = (M Y)´ PX M Y + e´ e It is this final form that is the basis of the regression analysis of variance. Observe that M Y lies in a space of dimension n - 1; as a consequence PX M Y lies in a space of dimension p - 1 = k. You might also note that PX M Y = M PX Y = M Y . These results are usually laid out in an analysis of variance table in this format: Source of Variation Degrees of Freedom Sum Squares Mean Squares Regression k SSregr = (M Y )´( M Y ) MSregr = Residual n-k-1 SSresid = e´ e Total n-1 SStot = SYY = (M Y)´(M Y) SSregr k SSresid MSresid = n k 1 F F= MSregr MSresid The Degrees of Freedom column and the Sum Squares column show exact additions. The Mean Squares column is the division of the Sum Squares by Degrees of Freedom. It is conventional not to show the division in the Total row. The F statistic is used to test the hypothesis H 0 : 1 = 0, 2 = 0, …, k = 0. Rejection of this hypothesis is rather common, and the rejection simply means that you have significant evidence that at least one of the ’s is not zero. Observe that the intercept 0 is not part of this inference. Page 3 gs2011