Econ 388   R. Butler   2014 rev   Lecture 4   Multivariate 2

I. Recap on Projections

The vector of least squares residuals is ê = Y − Xβ̂. Inserting β̂ = (X'X)⁻¹X'Y gives

ê = Y − X(X'X)⁻¹X'Y = (I − X(X'X)⁻¹X')Y = (I − P_X)Y = M_X Y.

The n×n matrices P_X and M_X are symmetric (P_X = P_X', M_X = M_X') and idempotent (P_X = P_X P_X, M_X = M_X M_X), and they are called projection matrices because they 'project' Y onto the space spanned by X and onto the space orthogonal to X, respectively. Recall that M_X and P_X are orthogonal: P_X M_X = M_X P_X = 0. So least squares partitions the vector Y into two orthogonal parts,

Y = P_X Y + M_X Y = Xβ̂ + ê = projection onto X + residual (projection onto the space orthogonal to X).

II. Partitioned Regression and Partial Regression

Table 1: Projections everywhere. Throughout, for any regressor matrix W, P_W = W(W'W)⁻¹W' and M_W = I − W(W'W)⁻¹W', and 1 is the vector of ones associated with the constant term. The normal equations come from minimizing the residual vector: parameter values are chosen so that the residual vector is orthogonal to linear combinations of the right hand side predictor variables.

Row 1. Finding the mean
  Model: Y = α̂1 + ê
  Normal equation: 1'ê = 0, or 1'(Y − α̂1) = 0, or Σ(yᵢ − α̂) = 0, or Σyᵢ = α̂Σ1 = α̂n
  Estimator: α̂ = ȳ
  Notes: decomposes the vector Y into its mean, α̂ = ȳ, and its deviations from the mean, ê = Y − ȳ1 (order n×1).

Row 2. Simple regression (X has order n×1)
  Model: Y = β̂₀1 + β̂₁X + ê
  Normal equations: 1'ê = 0 and X'ê = 0, or 1'(Y − β̂₀1 − β̂₁X) = 0 and X'(Y − β̂₀1 − β̂₁X) = 0, that is, Σ(yᵢ − β̂₀ − β̂₁xᵢ) = 0 and Σxᵢ(yᵢ − β̂₀ − β̂₁xᵢ) = 0; see chapter 2 of the text and appendix A for the solution.
  Estimators: β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ)/Σ(xᵢ − x̄)² and β̂₀ = ȳ − β̂₁x̄
  Notes: β̂₁ is the ratio of the sample covariance of (x, y) to the sample variance of x; β̂₀ is the mean of y once the effect of x is taken out.

Row 3. General regression (X has order n×k)
  Model: Y = Xβ̂ + ê
  Normal equations: X'ê = 0, that is, k equations in k unknowns: X'(Y − Xβ̂) = 0
  Estimator: β̂ = (X'X)⁻¹X'Y
  Notes: β̂ generalizes the simple model: a ratio of partial covariances to partial variances.

Row 4. Centered R² (Y and the Xs deviated from their means)
  Model: M_1 Y = (M_1 X)β̂ + ê (why?)
  Derivation: since M_1 Y and M_1 X are deviations from the means,
  (M_1 Y)'(M_1 Y) = ((M_1 X)β̂ + ê)'((M_1 X)β̂ + ê) = ((M_1 X)β̂)'((M_1 X)β̂) + ê'ê + ((M_1 X)β̂)'ê + (((M_1 X)β̂)'ê)',
  and the last two right hand side terms are zero (why?). So we can square both sides and simplify:
  Σ(yᵢ − ȳ)² = Σ((ŷᵢ − ȳ) + (yᵢ − ŷᵢ))² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)², or SST = SSM + SSR.
  Result: centered R² = SSM/SST = 1 − SSR/SST
  Notes: in our class, almost everything is done with the centered R²; the adjusted R² compares two regressions with the same Y and n but with different Xs.

Row 5. Frisch theorem (the Xs partitioned into two arbitrary subsets X1, X2)
  Model: Y = X1β̂₁ + X2β̂₂ + ê
  Procedure: preliminary cleansing of Y (of X1 correlations): M_X1 Y = Y − X1(X1'X1)⁻¹X1'Y, the part of Y uncorrelated with X1. Preliminary cleansing of X2 (of X1 correlations): M_X1 X2 = X2 − X1(X1'X1)⁻¹X1'X2, the part of X2 uncorrelated with X1. Then regress cleansed Y on cleansed X2.
  Estimator: β̂₂ = ((M_X1 X2)'(M_X1 X2))⁻¹(M_X1 X2)'(M_X1 Y) = (X2'M_X1 X2)⁻¹(X2'M_X1 Y), where the symmetry and idempotency of M_X1 give the last equivalence.
  Notes: this holds even for subsets of variables (not just a single variable); the resulting coefficients are the cleansed (of correlation with X1) partial relationships between X2 and Y.

This last result is "partialing" or "netting" out the effect of X1. For this reason, the coefficients in a multiple regression are often called the partial regression coefficients.
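Before turning to the results below, the first two rows of Table 1 are easy to verify numerically. The following is a minimal Stata sketch (not part of the original handout); the use of Stata's bundled auto data, with price playing the role of y and weight the role of x, is purely illustrative:

* Row 1: regressing y on a constant alone recovers the sample mean.
sysuse auto, clear
quietly regress price
display _b[_cons]
quietly summarize price
display r(mean)
* Row 2: the OLS slope equals the sample covariance of (x, y) over the variance of x.
quietly correlate price weight, covariance
display r(cov_12)/r(Var_2)
quietly regress price weight
display _b[weight]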
Based on the Frisch theorem, we also have the following results.

IIa. Regression with a Constant Term
The slopes in a multiple regression that contains a constant term can be obtained by transforming the data to deviations from their means and then regressing the variable Y in deviation form on the explanatory variables X, also in deviation form (the deviations-from-means row of Table 1).

IIb. Partial Regression and Partial Correlation Coefficients
Consider the following sample regression functions, with least squares coefficients b̂, d̂, and ĉ, and residual vectors ê and û:

Y = Xb̂ + ê
Y = Xd̂ + Zĉ + û

A new variable Z is added to the original regression, where ĉ is a scalar (a regression coefficient). (Here Z is observed and included in the analysis; in the omitted variable bias analysis given in the next lecture, it is not included in the sample regression but should have been.) From the normal equations of the second regression we have

d̂ = (X'X)⁻¹X'(Y − Zĉ) = b̂ − (X'X)⁻¹X'Zĉ.

Therefore,

û = Y − Xd̂ − Zĉ
  = Y − X[b̂ − (X'X)⁻¹X'Zĉ] − Zĉ
  = (Y − Xb̂) + X(X'X)⁻¹X'Zĉ − Zĉ
  = ê − M_X Zĉ.

Now,

û'û = ê'ê + ĉ²(Z'M_X Z) − 2ĉZ'M_X ê
    = ê'ê + ĉ²(Z'M_X Z) − 2ĉZ'M_X (M_X Y)
    = ê'ê + ĉ²(Z'M_X Z) − 2ĉZ'M_X (Xd̂ + Zĉ + û)
    = ê'ê + ĉ²(Z'M_X Z) − 2ĉ²(Z'M_X Z)      (since M_X X = 0 and Z'M_X û = (M_X Z)'û = 0)
    = ê'ê − ĉ²(Z'M_X Z) ≤ ê'ê.

Hence we obtain the useful result: if ê'ê is the sum of squared residuals when Y is regressed on X and û'û is the sum of squared residuals when Y is regressed on X and Z, then

û'û = ê'ê − ĉ²(Z'M_X Z) ≤ ê'ê,

where ĉ is the coefficient on Z in the longer regression, and M_X Z = [I − X(X'X)⁻¹X']Z is the vector of residuals when Z is regressed on X.
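This sum-of-squared-residuals result is easy to check numerically. Here is a hedged Stata sketch (not part of the original handout); the choice of Stata's bundled auto data, with weight as the X variable and length as the added Z, is only for illustration:

sysuse auto, clear
quietly regress price weight               // short regression: Y on X (constant included)
scalar rss_short = e(rss)                  // e'e
quietly regress length weight              // Z regressed on X
predict mz, residuals                      // M_X Z
generate mz2 = mz^2
quietly summarize mz2
scalar zmz = r(sum)                        // Z'M_X Z
quietly regress price weight length        // long regression: Y on X and Z
scalar chat = _b[length]                   // c-hat, the coefficient on Z
display e(rss)                             // u'u from the long regression ...
display rss_short - chat^2*zmz             // ... matches e'e - c-hat^2 (Z'M_X Z)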
IIc. Goodness of Fit and the Analysis of Variance

R² again, with more details on the derivation. (This is the coefficient of determination; it is actually the centered R-square, and when we talk about "R-square" in normal econometrese we mean the centered R-square.) It is the proportion of the variation in the dependent (response) variable that is 'explained' by the variation in the independent variables. The idea is that the variation in the dependent variable, called the total sum of squares,

SST = Σ(Yᵢ − Ȳ)² = ||M_1 Y||²   (sums over i = 1, …, n),

can be divided into two parts: 1) a part that is explained by the regression model (the "sum of squares explained by the regression model", SSM, sometimes labeled SSE), and 2) a part that is unexplained by the model (the "sum of squared residuals", SSR). Some of the math is given in Wooldridge, chapter 2, where he uses the following algebra result:

Σ(Aᵢ + Bᵢ)² = Σ(Aᵢ² + 2AᵢBᵢ + Bᵢ²) = ΣAᵢ² + 2ΣAᵢBᵢ + ΣBᵢ²,

with Aᵢ ≡ Ŷᵢ − Ȳ and Bᵢ ≡ Yᵢ − Ŷᵢ ≡ ûᵢ. The step that is omitted is showing that Σ(Ŷᵢ − Ȳ)ûᵢ = 0. This boils down pretty quickly to ΣŶᵢûᵢ − ȲΣûᵢ = ΣŶᵢûᵢ − 0, so you just have to show that ΣŶᵢûᵢ = 0 in order to prove the result. It is zero because OLS residuals are constructed to be uncorrelated with the predicted values of Y: since the residuals are chosen to be uncorrelated with (orthogonal to) the independent variables, they are orthogonal to any linear combination of the independent variables, including the predicted value of Y. So a very reasonable measure of goodness of fit would be the explained sum of squares (explained by the regression) relative to the total sum of squares.

In fact, this is just what R² is:

R² = SSM/SST = 1 − SSR/SST.

It is the share of the variation in the dependent variable that is explained by the regression model (namely, explained by all of the independent variables). R² = 1 is a perfect fit (a deterministic relationship); R² = 0 indicates no relationship between Y and the slope regressors in the data matrix X.

The dark side of R²: R² can only rise when another variable is added, as indicated above. (We can get to the same SST = SSM + SSR result more quickly by applying the M_1 orthogonal projection operator, which takes deviations from the mean as seen above, to the standard sample regression function.)

Adjusted R-Squared and a Measure of Fit
One problem with R² is that it will never decrease when another variable is added to a regression equation. To correct this, the adjusted R² (denoted adj-R²) was created by Theil (one of my teachers at the Univ of Chicago):

adj-R² = 1 − [ê'ê/(n − k)] / [Y'M_1 Y/(n − 1)],

with the connection between R² and adj-R² being

adj-R² = 1 − [(n − 1)/(n − k)](1 − R²).

The adjusted R² may rise or fall with the addition of another independent variable, so it is used to compare models with the same dependent variable, with the sample size fixed and the number of right hand side variables changing. (A short Stata check of this formula appears just after the quiz below.)

[[[[ Do You Want a Whole Hershey Bar? Well, consider the following diagram for a simple regression model when answering these three multiple choice questions:

[Diagram: a simple regression of Y on X, with a solid horizontal bar at Ȳ, dotted lines, and short thick line segments as described in the questions below]

1. The sum of the squared differences between where the solid bar (Ȳ) hits the Y axis and where the dotted lines hit the Y axis:
a. is the SST (sum of squares total)
b. is Σ(Yᵢ − Ȳ)²
c. is an estimate of the variance of Y (when dividing by n − 1)
d. all of the above
2. The sum of the squares of the distances represented by the short, thick lines is:
a. SST (sum of squares total)
b. SSM (explained sum of squares)
c. SSR (residual sum of squares)
d. none of the above
3. The sum of squares mentioned in question 1 will equal the sum of squares mentioned in question 2 when
a. β̂₀ and β̂₁ are both equal to zero
b. β̂₁ is equal to zero, regardless of the intercept coefficient's value
c. β̂₀ is equal to zero, regardless of the slope coefficient's value
d. men at BYU are encouraged to wear beards ]]]]]
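Returning to the adjusted R², here is a quick Stata check (not part of the original handout) that the statistic reported by regress matches Theil's formula; the auto-data regression is purely illustrative:

sysuse auto, clear
quietly regress price weight length
scalar nobs = e(N)
scalar kparm = e(df_m) + 1                          // slope coefficients plus the intercept
display e(r2_a)                                     // adjusted R2 as reported by -regress-
display 1 - (nobs - 1)/(nobs - kparm)*(1 - e(r2))   // 1 - [(n-1)/(n-k)](1 - R2)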
III. In summation: Three Really Cool Results from Regression Projections

Really Cool Result 1: Decomposition Theorem
Any vector Y can be decomposed into two orthogonal parts, a part explained by X and a part unexplained by X (orthogonal to X); this is the Pythagorean theorem in the regression context (X⊥ denotes the space orthogonal to X).

[Diagram: Y decomposed into P_X Y, lying along X, and M_X Y, lying along X⊥, with the origin at 0]

||Y||² = ||P_X Y||² + ||M_X Y||² (this is just the Pythagorean theorem),

which is easily derived from the basic results on the orthogonal projectors, namely:
a. P_X and M_X are idempotent and symmetric operators;
b. P_X M_X = 0 (they are orthogonal, at right angles, so the inner product of anything they jointly project is zero); and
c. I = P_X + M_X (there is the part explained by X, the P_X part, and the part unexplained by X, the M_X part, and together they sum to the identity matrix I, by definition). The identity matrix I transforms a variable into itself (i.e., it leaves the matrix or vector unchanged).

Proof of the Pythagorean theorem, given the decomposition:

Y'Y = Y'IY = Y'(P_X + M_X)Y = Y'P_X Y + Y'M_X Y = (P_X Y)'(P_X Y) + (M_X Y)'(M_X Y)

(the last equality follows from the idempotency and symmetry properties), or

||Y||² = ||P_X Y||² + ||M_X Y||².

Really Cool Result 2: Projections on Ones (Means and Deviations)
If 1 is the n×1 vector of ones, then P_1 X = X̄1 (a vector with the mean of X in every entry) and M_1 X = X − X̄1 (the vector of deviations of X from its mean). Since these lie in orthogonal spaces, their inner (dot) product is zero; that is, by construction they are uncorrelated.

Proof (I is the n×n identity matrix, and M_1 is also n×n). The orthogonal projection onto the vector of constants is P_1 = 1(1'1)⁻¹1'. Since 1'1 = n, we have (1'1)⁻¹ = 1/n, so

P_1 = 1(1'1)⁻¹1' = (1/n)11',

an n×n matrix with every element equal to 1/n. For X = (X₁, X₂, …, Xₙ)', every element of P_1 X equals ΣᵢXᵢ/n = X̄, so

P_1 X = (X̄, X̄, …, X̄)' = X̄1,

and

M_1 X = (I − P_1)X = X − P_1 X = (X₁ − X̄, X₂ − X̄, …, Xₙ − X̄)'.
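The first two Really Cool Results are easy to see in the data. Below is a short Stata sketch (not in the original notes) that checks the Pythagorean decomposition ||Y||² = ||P_X Y||² + ||M_X Y||² and the projection-on-ones result; the auto-data variables are again just illustrative choices:

sysuse auto, clear
quietly regress price weight length
predict py, xb                        // P_X Y, the fitted values
predict my, residuals                 // M_X Y, the residuals
generate y2 = price^2
generate py2 = py^2
generate my2 = my^2
quietly summarize y2
scalar ssy = r(sum)                   // ||Y||^2
quietly summarize py2
scalar ssfit = r(sum)                 // ||P_X Y||^2
quietly summarize my2
scalar ssres = r(sum)                 // ||M_X Y||^2
display ssy
display ssfit + ssres                 // equals ||Y||^2 (up to rounding)
quietly summarize price
generate dev = price - r(mean)        // M_1 Y, deviations of Y from its mean
quietly summarize dev
display r(sum)                        // deviations sum to (numerically) zero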
Really Cool Result 3: The Partial Derivative Result for Matrices (Frisch theorem)
When we have a multivariate regression with more than one slope regressor (right hand side independent variable), how do we interpret the coefficients? Write the linear regression as

6)  Y = X1β₁ + X2β₂ + residuals

and suppose that X1 is just one variable (so X2 contains the intercept and the other variables). Taking the partial derivative of Y with respect to X1 gives β₁. This suggests that β₁ is the change in Y when we increase X1 by one unit, holding all other variables constant (the same interpretation as in partial differentiation). And so it is: β₁ is the slope of Y with respect to X1, all other things equal.

Now suppose that X1 contains m variables, so the β₁ coefficient vector has m elements:

         [ x11  x21  ...  xm1 ] [ β1 ]
         [ x12  x22  ...  xm2 ] [ β2 ]
X1 β1 =  [  .    .          . ] [  . ]
         [  .    .          . ] [  . ]
         [ x1n  x2n  ...  xmn ] [ βm ]

In what sense are these jointly "partial effects" (analogous to taking partial derivatives)? For this arbitrary division of the regressors in equation (6), that is, with β₁ and β₂ of any arbitrary dimension (from 1 to k, where k < n, the sample size), we have the Frisch theorem. Alongside the regression in equation (6), consider another regression:

7)  M_X2 Y = (M_X2 X1)β + residuals

Then the following are always true: a) the estimate of β₁ from equation (6) will be identical to the estimate of β from equation (7); b) the residuals from these two equations will also be identical; and so c) β̂₁ is the estimate of the effect of X1 on Y after the influence of X2 (that is, the influence of all the variables in X2) has been factored out.

If we apply the Frisch theorem to the simple regression model, with X1 (in equation 6) = X, the slope variable, and X2 (in equation 6) = 1, the vector of ones that accounts for the intercept coefficient, then the regression of Y deviated from its mean (by projecting it orthogonal to 1) on X deviated from its mean (by projecting it orthogonal to 1) should yield the slope coefficient from the simple regression. And so it does, as the following derivation indicates (applying the matrix formula for a regression vector to the specification in equation 7, where M_X2 = M_1):

β̂₁ = ((M_1 X)'(M_1 X))⁻¹ (M_1 X)'(M_1 Y) = [(X − X̄1)'(Y − Ȳ1)] / [(X − X̄1)'(X − X̄1)],

where

1 = (1, 1, …, 1)',   M_1 Y = (Y₁ − Ȳ, Y₂ − Ȳ, …, Yₙ − Ȳ)',   M_1 X = (X₁ − X̄, X₂ − X̄, …, Xₙ − X̄)'.

Remember that M_1 takes deviations from the mean, so its output is orthogonal to 1; that is, 1'(M_1 Y) = 0 and 1'(M_1 X) = 0, since deviations sum to zero. That is why

β̂₁ = ((M_1 X)'(M_1 X))⁻¹ (M_1 X)'(M_1 Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²,   with X̄ = ΣXᵢ/n and Ȳ = ΣYᵢ/n,

which are the formulas in the book and from Lecture 1 for the simple regression model.

IV. A. Illustration Using Stata Code with the Simple Wage-College Regression Example

Meaning of the slope coefficient, β₁, in the simple model:

8)  Y = β₀ + β₁X + ε, or Y = β₀1 + β₁X + ε, and M_1 Y = (M_1 X)β₁ + M_1 ε

9)  STATA code (for the 3-observation example):

# delimit ;
clear;
input y x;
1.3 .5 ;
0 -.3 ;
.5 1.4 ;
end;
generate ones=1;
regress y x;
predict resids, residuals;
list y x resids;
regress y ones, noconstant;
predict residy, residuals;
regress x ones, noconstant;
predict residx, residuals;
regress residy residx;
regress residy residx, noconstant;
predict resids2, residuals;
list y x resids resids2;

[[[Do You Want a Whole Hershey Bar?
1. The Frisch (or partial derivative) theorem says (among other things) that:
a. the education coefficient in a wage regression has the same value regardless of which other regressors are included in the analysis
b. the education coefficient in a simple wage regression with a constant will have the same value as the coefficient in the simplest regression without a constant
c. the education coefficient in a wage regression is an estimate of the effect of education on wages after the influence of the other variables (in the regression) has been factored out
d. all of the above
2. If the regression space (X) consists of only a single vector of ones (denoted 1), then
a. the orthogonal projection of Y on 1, P_1 Y, produces a vector of means of Y
b. the orthogonal projection onto the space orthogonal to 1, M_1 Y, produces a vector of deviations of Y from its mean
c. P_1 M_1 = 0
d. all of the above
3. Since M_1 is an orthogonal projection, and hence symmetric and idempotent, it follows that for any vectors X and Y:
a. M_1 X = M_1 Y
b. M_1 X = I − M_1 Y
c. (M_1 X)'Y = (M_1 X)'(M_1 Y) = X'(M_1 Y)
d. none of the above ]]]]]
IV. B. Woody Restaurant Example

Find β₂ (suppose X2 is competitors, with X1 = income and population; notice that the roles of X2 and X1 in the theorem are arbitrary) and the residuals from

Y = X1β₁ + X2β₂ + residuals,

and then find β₂ (and the residuals) from

M_X1 Y = (M_X1 X2)β₂ + residuals.

Here's the STATA code to check this out:

# delimit ;
infile revenue income pop competitor using "e:\classrm_data\woody3.txt", clear;
generate ones=1;
regress revenue income pop competitor;
predict resids, residuals;
list resids;
regress revenue income pop;
predict residy, residuals;
regress competitor income pop;
predict residx, residuals;
regress residy residx, noconstant;
predict resids2, residuals;
list revenue resids resids2;

V. Correlations From the Regression Projections [adapted from Wonnacott, p. 306]

[Diagram: the vectors Y and X, with the angles θ, φ, and λ between Y, X, and their projections]

cos θ   = simple correlation r_YX
cos φ   = partial correlation r_YX|1 (also the centered R in this example)
cos λ   = multiple correlation R
cos² λ  = squared multiple correlation R² (the uncentered R²)

Note: |cos θ| ≤ |cos λ|, so r_YX ≤ R; and |cos φ| ≤ |cos λ|, so r_YX|1 ≤ R.

Appendix on the "Frisch" (or Partial Derivative) Theorem: A Picture of the Theorem in Three Dimensions

The partial derivative theorem for regression models says that for the following two regressions,

Y = X1β̂₁ + X2β̂₂ + ε̂   and   M_X1 Y = (M_X1 X2)β̂ + ε̂_MX1,

it is the case that β̂₂ = β̂ and ε̂ = ε̂_MX1, where the variables in the second regression are the residuals from regressing Y and X2 on X1, respectively: Y = X1γ̂ + M_X1 Y and X2 = X1φ̂ + M_X1 X2. So first, we show just the first regression:

[Diagram: upper side view of the regression of Y on X1 and X2, and top view of the X1, X2 plane]

Now we do the second regression (using the residuals from the X1 regressions) by itself, essentially decomposing Y and X2 in terms of X1:

[Diagram: upper side view of the regression of M_X1 Y on M_X1 X2, and top view of the X1, X2 plane]

And finally we add them together so that the equivalencies become apparent: obviously ε̂ = ε̂_MX1 by inspection of the graphs below, and β̂₂ = β̂ by the law of similar triangles. In fact, that is all the partial derivative (Frisch) theorem really is: a restatement of the Pythagorean theorem and elementary properties of geometry (in Euclidean space).

[Diagram: the two pictures overlaid, upper side view of the regression and top view of the X1, X2 plane]

General Results on the Frisch Theorem

The theorem applies whenever there are two or more regressors (right hand side terms) in our model. Consider an arbitrary division of the regressors and write the model as

6)  Y = X1β₁ + X2β₂ + residuals.

Let M_X1 Y be the projection of Y onto the space orthogonal to X1, or in other words, the residuals from the regression of Y on X1 (this is analogous to the decomposition of Y into the part explained by the whole X matrix, Y = P_X Y + M_X Y, but it is a different decomposition, since X1 is of lower dimension than X, and M_X1 projects onto a higher-dimensional space than M_X):

Y =        P_X1 Y          +        M_X1 Y
    (part of Y explained by X1)   (part of Y orthogonal to X1)

Similarly, if X1 has dimension n×z and X2 has dimension n×(k−z), then M_X1 X2 contains the residuals from the regression(s) of X2 on X1 (the columns of X2 are the dependent variables and the columns of X1 are the independent variables); M_X1 has dimension n×n and X1 has dimension n×z, so M_X1 X2 has dimension n×(k−z).
Then consider another regression:

7)  M_X1 Y = (M_X1 X2)β + residuals.

The Frisch theorem says that a) the estimate of β₂ from equation (6) will be identical to the estimate of β from equation (7); b) the residuals from these two equations will also be identical; c) β̂₂ is the estimate of the effect of X2 on Y after the influence of X1 has been factored out; and d) as a result of a and b, we have

Y'M_X1 Y − Y'M_X Y = Y'P_(M_X1 X) Y, or
||M_X1 Y||² − ||M_X Y||² = ||P_(M_X1 X) Y||², or
||M_X1 Y||² = ||P_(M_X1 X) Y||² + ||M_X Y||².

Applying this result to the case where X1 is just the intercept (the vector of ones we denote as 1) and X is all the regressors, including the intercept, result (d) becomes

Y'M_1 Y − Y'M_X Y = Y'P_(M_1 X) Y, or
||M_1 Y||² − ||M_X Y||² = ||P_(M_1 X) Y||², or
||M_1 Y||² = ||P_(M_1 X) Y||² + ||M_X Y||²,

that is, the variation in Y (where variation is calculated by deviating Y from its mean using the vector of ones, M_1 Y) can be split into the variation in Y due to variation in the Xs (more specifically, that part of the Xs not associated with the constant, P_(M_1 X) Y) and the variation in Y that is unexplained by all the Xs (including the intercept, M_X Y). Equivalently, the total sum of squares for Y (SST) equals the part of the variation explained by variation in the model (SSM, where the Xs are the model and M_1 X deviates the Xs from their means, or takes variations of the Xs) plus the part left unexplained, captured by the sum of squared residuals (SSR). A short Stata check of this decomposition appears at the very end of these notes, after the SAS appendix.

Appendix: SAS code (for the sample model of ln(wages) on college):

data one;
  input y x;
  ones=1;
cards;
1.3 .5
0 -.3
.5 1.4
;
proc reg; model y=x; output out=two r=resids; run;
proc print; var y x resids; run;
proc reg; model y=ones/ noint; output out=two r=residy; run;
proc reg; model x=ones/ noint; output out=two r=residx; run;
proc reg; model residy=residx/ noint; output out=two r=resids2; run;
proc print; var y x resids resids2; run;

Here's the SAS code to check the Woody's Restaurant data results:

data one;
  infile "e:\classrm_data\woody.txt";
  input revenue income pop competitor;
  ones=1;
run;
proc print; run;
proc reg; model revenue=income pop competitor; output out=two r=resids; run;
proc print; var resids; run;
proc reg; model revenue=income pop; output out=two r=residy; run;
proc reg; model competitor=income pop; output out=two r=residx; run;
proc reg; model residy=residx/ noint; output out=two r=resids2; run;
proc print; var revenue resids resids2; run;
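Finally, as promised above, here is a short closing Stata check (not part of the original handout) of the decomposition ||M_1 Y||² = ||P_(M_1 X) Y||² + ||M_X Y||², that is, SST = SSM + SSR, along with the centered R² = SSM/SST; once more the auto-data regression is only an example:

sysuse auto, clear
quietly regress price weight length
scalar ssm = e(mss)                   // model (explained) sum of squares, SSM
scalar ssr = e(rss)                   // residual sum of squares, SSR
quietly summarize price
scalar sst = (r(N) - 1)*r(Var)        // total sum of squares, sum of (y_i - ybar)^2
display sst
display ssm + ssr                     // SST = SSM + SSR
display e(r2)
display ssm/sst                       // centered R2 = SSM/SST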