Stat 404
A Matrix Approach to Linear Regression
A. Elements of matrix algebra (or linear algebra)
1. A matrix is an array of numbers. (Note: Pedhazur uses bold letters to denote matrices. I
shall use capital letters.)
1
2
B
3

4
1 4 
A

 2 3
5
6
8

7
2. Numbers within a matrix are its elements and are assigned two subscripts: the first for
row, the second for column.
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \qquad\qquad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \\ b_{41} & b_{42} \end{bmatrix}$$
3. The dimension of a matrix refers to the number of its rows and columns. For example, A
is of dimension 2x2; B is of dimension 4x2.
4. The transpose of a matrix is obtained by interchanging its rows and columns. That is, if $b_{ij}$ is the element in the ith row and the jth column of B, the element in the jth row and the ith column of $B^T$ will take the value of $b_{ij}$. (Note: Pedhazur uses the notation, B', instead of $B^T$.) For example, …
$$B^T = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 8 & 7 \end{bmatrix}$$
5. A square matrix has the same number of columns as rows. The matrix, A, is a square
matrix.
6. A symmetric matrix is a square matrix within which, for each element, $a_{ij} = a_{ji}$. An example of a symmetric matrix is …
$$S = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}$$
7. The diagonal elements of a square matrix are those elements having the same row and column number. The elements $s_{11} = 1$, $s_{22} = 4$, and $s_{33} = 6$ are the diagonal elements of the matrix, S.
8. A diagonal matrix is a square matrix that has all off-diagonal elements equal to zero. For
example, …
1 0 0
D  0 2 0
0 0 3
9. The identity matrix is a diagonal matrix with all diagonal elements equal to one (1). For
example, in a 3x3 matrix …
1 0 0
I 3  0 1 0
0 0 1
2
10. A vector is a matrix with only one column. Often vectors are assigned lower case letters
with a tilde (~) underneath. (Note: Pedhazur uses bold lower case letters to denote
vectors. I shall use lower case letters with tildes underneath to denote vectors.)
 2
 4
w 
~
6 
 
8 
1
v  3
~
5
11. ADDITION and SUBTRACTION
a. Only matrices of the same dimension can be added. Matrices of the same dimension
are said to be conformable in addition. For example, let …
$$\underset{2 \times 3}{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \qquad\qquad \underset{2 \times 3}{B} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix}$$
b. Their sum is …
$$A + B = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} & a_{13}+b_{13} \\ a_{21}+b_{21} & a_{22}+b_{22} & a_{23}+b_{23} \end{bmatrix}$$
c. Their difference is …
$$A - B = \begin{bmatrix} a_{11}-b_{11} & a_{12}-b_{12} & a_{13}-b_{13} \\ a_{21}-b_{21} & a_{22}-b_{22} & a_{23}-b_{23} \end{bmatrix}$$
d. Matrix addition is …
i. commutative: A + B = B + A
ii. associative: (A + B) + C = A + (B + C)
12. MULTIPLICATION
a. To multiply a matrix by a scalar, s, all elements of the matrix are multiplied by s.
Thus …
$$sA = As = \begin{bmatrix} sa_{11} & sa_{12} & sa_{13} \\ sa_{21} & sa_{22} & sa_{23} \end{bmatrix}$$
b. Matrix multiplication is NOT commutative. That is, in general A * B ≠ B * A. Thus the sequence in which matrices are multiplied is important.
c. Two matrices are conformable in multiplication if the number of columns in the first-sequenced matrix equals the number of rows in the second-sequenced matrix. For example, …
1
2
* A 
B
3
4x2 2x2

4
 b11a11  b12 a 21
b a  b a
  21 11 22 21
b31a11  b32 a 21

b41a11  b42 a 21
5
6 1 4
*
8 2 3

7
b11a12  b12 a 22   1 *1  5 * 2 
b21a12  b22 a 22  2 *1  6 * 2 

b31a12  b32 a 22   3 *1  8 * 2 
 
b41a12  b42 a 22  4 *1  7 * 2 
11
14

19

18
4
19 
26

36

37
1* 4  5 * 3 
2 * 4  6 * 3
3 * 4   8 * 3
4 * 4  7 * 3
d. Note that the dimension of the matrix produced has the number of rows of the first
matrix and the number of columns of the second. In this case, multiplying a 4x2
matrix by a 2x2 matrix yields a 4x2 matrix. If the second matrix had been 2x3, the
resulting matrix would have been 4x3.
e. Although not commutative, matrix multiplication does support …
i. associativity: (A * B) * C = A * (B * C)
and
ii. distributivity: A(B + C) = A * B + A * C .
f. Also note that $(A * B)^T = B^T * A^T$.
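As a quick numerical check of the arithmetic above, the product B * A can be reproduced with NumPy (a minimal sketch, assuming NumPy is available; A and B are the example matrices from item 1):

```python
import numpy as np

# Example matrices from item 1 above.
A = np.array([[1, 4],
              [2, 3]])          # 2x2
B = np.array([[1, 5],
              [2, 6],
              [3, 8],
              [4, 7]])          # 4x2

# B (4x2) times A (2x2) is conformable and yields a 4x2 matrix.
print(B @ A)
# [[11 19]
#  [14 26]
#  [19 36]
#  [18 37]]

# Property 12.f: the transpose of a product reverses the order of the factors.
print(np.allclose((B @ A).T, A.T @ B.T))   # True
```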
13. A determinant can be calculated for any square matrix. We shall only discuss how to
calculate the determinant of a 2x2 matrix. Let …
$$\underset{2 \times 2}{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}.$$
The determinant of A is $\det(A) = (a_{11} * a_{22}) - (a_{12} * a_{21})$.
Thus if $A = \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix}$, then det(A) = 3 − 8 = −5.
14. Whenever perfect collinearity exists among the rows or columns of a square matrix, its determinant equals zero. A matrix with perfect collinearity can be generated by setting one column of a matrix equal to a multiple of another. For example, note that multiplying the first column of the matrix A by four yields its second column:
$$A = \begin{bmatrix} 1 & 4 \\ 2 & 8 \end{bmatrix} \qquad\qquad \det(A) = 8 - 8 = 0$$
Perfect collinearity need not involve only two columns of a matrix. If every element in one of the matrix's columns equals exactly a weighted sum of two or more of the matrix's other columns, you have perfect multicollinearity.
a. When the determinant of a matrix equals zero, the matrix is said to be singular. (In
multiple regression, singularity results whenever two or more independent variables
are perfectly collinear—an extreme case of multicollinearity. As we shall soon see,
within the context of multiple regression, singularity is "undesirable.")
b. Nonsingular matrices are said to be full rank.
15. For every nonsingular square matrix, A, its inverse matrix is a matrix, $A^{-1}$, such that $A * A^{-1} = A^{-1} * A = I$, where I is the identity matrix.
a. If $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$, then …
$$A^{-1} = \begin{bmatrix} \dfrac{a_{22}}{\det(A)} & \dfrac{-a_{12}}{\det(A)} \\[2mm] \dfrac{-a_{21}}{\det(A)} & \dfrac{a_{11}}{\det(A)} \end{bmatrix}.$$
Thus if $A = \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix}$, then det(A) = 3 − 8 = −5 and
$$A^{-1} = \begin{bmatrix} \dfrac{3}{-5} & \dfrac{-4}{-5} \\[1mm] \dfrac{-2}{-5} & \dfrac{1}{-5} \end{bmatrix} = \begin{bmatrix} -\dfrac{3}{5} & \dfrac{4}{5} \\[1mm] \dfrac{2}{5} & -\dfrac{1}{5} \end{bmatrix}.$$
Now note that
$$A^{-1} * A = \begin{bmatrix} -\dfrac{3}{5} & \dfrac{4}{5} \\[1mm] \dfrac{2}{5} & -\dfrac{1}{5} \end{bmatrix} * \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},$$
as promised.
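The determinant and inverse arithmetic above can be verified numerically (a sketch, assuming NumPy; A is the example matrix just used, and the perfectly collinear matrix from item 14 is included for contrast):

```python
import numpy as np

A = np.array([[1., 4.],
              [2., 3.]])
print(np.linalg.det(A))        # approximately -5, as computed above
A_inv = np.linalg.inv(A)
print(A_inv)                   # [[-0.6  0.8]
                               #  [ 0.4 -0.2]], i.e., [[-3/5, 4/5], [2/5, -1/5]]
print(A_inv @ A)               # the 2x2 identity matrix (within rounding error)

# A matrix with perfect collinearity: its second column is 4 times its first.
C = np.array([[1., 4.],
              [2., 8.]])
print(np.linalg.det(C))        # approximately 0; np.linalg.inv(C) would fail
```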
b. Moreover, the inverse of a matrix does not exist whenever its determinant equals zero. For example, notice that in the 2x2 case every element of $A^{-1}$ has det(A) in its denominator, so when det(A) = 0 the elements of $A^{-1}$ are undefined.
16. Finally, the trace of a square matrix is the sum of its diagonal elements. Two aspects of
the trace will be used at the very end of these lecture notes:
a. The trace of an identity matrix equals its dimension. For example, the identity matrix, $I_3$, shown in item 9 above is of dimension 3, the sum of its diagonal elements is 3, and thus its trace equals 3.
b. As long as their sequence remains the same, a series of matrices that produce a square
matrix can be multiplied starting at any point in the sequence, and the traces of their
products will be equivalent. Thus, for example, the trace of AB equals the trace of
BA, as long as the number of columns in A equals the number of rows in B and the
number of rows in A equals the number of columns in B.1
1
Let the dimension of A be m-by-n and the dimension of B be n-by-m. The diagonal elements
of AB will be (a11*b11 + a12*b21 + … + a2n*bn2), (a21*b12 + a22*b22 + … + a2n*bn2), ... (am1*b1m
+ am2*b2m + … + amn*bnm). The diagonal elements of BA will be (b11*a11 + b12*a21 + … +
b2m*am2), (b21*a12 + b22*a22 + … + b2m*am2), … (bn1*a1n + bn2*a2n + … + bnm*amn). In each case,
when a trace is obtained, the same n-times-m products are being summed. The diagonal
elements of the AB matrix are simply the sums of different subsets of these products than are the
diagonal elements of the BA matrix. For example, a11*b11 is added into the 1,1 cell of both AB
and BA, whereas a12*b21 is added into AB’s 1,1 cell but into BA’s 2,2 cell. In brief, the trace is
the sum of the products between the matrices of all pairs of cells that share either the same row
or the same column.
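The equal-traces property in item 16.b is easy to confirm numerically (a sketch, assuming NumPy; A and B here are arbitrary conformable matrices, not the earlier examples):

```python
import numpy as np

rng = np.random.default_rng(404)
A = rng.normal(size=(3, 5))    # m-by-n
B = rng.normal(size=(5, 3))    # n-by-m

# AB is 3x3 and BA is 5x5, yet the two traces are identical.
print(np.trace(A @ B))
print(np.trace(B @ A))         # same value, within rounding error
```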
B. We now return to the assumptions of linear regression, except that they will now be
expressed in matrix form (Part 1).
1. Linearity: X and Y (or $\underset{\sim}{y}$) are related as $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$, which is shorthand for …
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{N1} & X_{N2} & \cdots & X_{Nk} \end{bmatrix} * \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}, \text{ where}$$
the $\beta_i$ are unknown population parameters (i.e., true slopes in the population),
each $\varepsilon_i$ is an unknown disturbance (i.e., the true deviation of $Y_i$ from the linear equation, $\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik}$), and
X and Y list the knowable values for a set of variables regarding all units of analysis in a population of size, N.
A few comments:
a. In multiple regression analysis, we estimate the unknown $\underset{\sim}{\beta}$ and $\underset{\sim}{\varepsilon}$ based on a sample of size, n.
b. Note that when we stop talking about true characteristics of a population and begin speaking of estimates of these parameters, we shift from using Greek to Roman letters (a shift away from these letters' earlier use to distinguish standardized from unstandardized slopes). Greek letters represent "unknowns" that are estimated in an analysis, whereas lower-case Roman letters correspond to estimates of these unknowns that are based on a sample. Thus, $b_i$ estimates the parameter, $\beta_i$, and $e_i$ estimates the disturbance, $\varepsilon_i$. (Note that unlike our previous lecture notes, the "hat" symbol, ^, is no longer used to distinguish estimates from parameters. More generally, "hats" are never placed above symbols for matrices.)
c. The linearity assumption presupposes that all variables are interval- or ratio-level
measures (i.e., that all variables have units like number of children, degrees
Centigrade, etc.).
2. Normality: The εi are normally distributed about the (true) regression line.
a. One consequence of this assumption is that the expected value of $\underset{\sim}{e}$ is $\underset{\sim}{0}$, where $\underset{\sim}{0}$ is a vector of zeros.
b. In expectation notation this is written as $E(\underset{\sim}{e}) = \underset{\sim}{0}$. Of course, now we need to become acquainted with "expectation notation."
C. Mathematical expectation
1. You will find that expectation notation and summation notation are very similar. In fact,
the two notations are identical, except that expectations are based on probability theory
and summations describe concrete manipulations of one’s data.
2. The fundamental principle behind the idea of expectation is the concept of a random
variable, Y, that assumes its value according to the outcome of a chance event. For
example, one might consider the chance event of the number of children born to whoever is the first person to be drawn at random from a specific population. In this
case, a random variable, Y, might be defined as …
Y = the number of children born to person #1 .
More precisely, if we let S be the sample space for the chance outcomes from a random
sampling of persons, a random variable is a function (i.e., a rule of correspondence) that
associates with each element of S exactly 1 real number.
3. In this example, S consists of all possible numbers of offspring. The random variable Y
can be thought of as the rule that defines how each element of S can be paired with
exactly 1 real number. For example, this rule might be, “If Person 4’s questionnaire
contains a ‘2’ on the blank line at the end of Questionnaire Item #5, then assign the
integer, 2, to the fourth element of S.”
a. If the distribution of Y is discrete, then the expectation (or expected value) of Y is defined (≡) as follows:
$$E(Y) \equiv \sum_{i=1}^{m} Y_i \Pr(Y = Y_i),$$
where the $Y_i$ comprise the set of all of the possible, m, distinct values of a discrete random variable (in the case of number of offspring, these values can only be nonnegative integers),[2]
where $\Pr(Y = Y_i) = \dfrac{N_i}{N}$ (here, $N_i$ is the number of people from the population who are contained in the ith group [e.g., those with 2 children] and N is the population size), and
where $\sum_i \Pr(Y = Y_i) = 1$.

[2] This discussion is limited to discrete random variables, which assume at most a finite number (although, if one considers Genghis Khan, a potentially large one for the number of children born to males) of possible values.
b. Thus an expectation is a kind of weighted sum of values. To illustrate, consider a
small population consisting only of 70 people: 15 with no children, 20 with one child,
20 with two children, and 15 with three children. The expected number of children
from any person drawn at random from this population would be calculated as
follows:
 15   20   20   15  105
Y    Yi Pr Yi   0   1   2   3  
 1.5   ,
 70   70   70   70  70
which is, of course, what you would expect. Right?
c. Yet our population sizes are usually much larger than 70, and the probabilities for
different outcomes are generally unknown. Such summations are usually concrete
calculations performed on all cases in one’s sample. Expectation notation allows us
to speak hypothetically of such calculations, “as if they were to be done.”
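The weighted sum in item 3.b can be reproduced directly in Python (a minimal sketch; the counts are the ones from the 70-person example above):

```python
# Counts of people with 0, 1, 2, and 3 children in the example population (N = 70).
values = [0, 1, 2, 3]
counts = [15, 20, 20, 15]
N = sum(counts)

# E(Y) = sum of Y_i * Pr(Y = Y_i), with Pr(Y = Y_i) = N_i / N.
expected_Y = sum(y * (n_i / N) for y, n_i in zip(values, counts))
print(expected_Y)    # 1.5
```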
4. You should also be aware of additional rules of expectation mathematics. In the
following rules (which are given without proof) assume that “a” is a constant real number
and that X, Y, and all Xi are random variables with respective expectations E(X), E(Y),
and E(Xi):
$$E(a) = a$$
$$E(aX) = aE(X)$$
$$E(X + a) = E(X) + a$$
$$E(X + Y) = E(X) + E(Y)$$
$$\mathrm{Var}(X) = \sigma_X^2 = E\big[(X - E(X))^2\big] = E(X^2) - [E(X)]^2$$
$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y), \quad \text{IFF (if and only if) X and Y are statistically independent}$$
$$E(XY) = E(X) * E(Y), \quad \text{IFF X and Y are statistically independent}$$
$$\mathrm{Cov}(XY) = E(XY) - E(X)E(Y)$$
Note that, given the next-to-last rule, it follows that when X and Y are statistically independent, the covariance, Cov(XY), equals zero. (The converse does not hold in general: a zero covariance indicates only that X and Y are uncorrelated, not necessarily that they are independent.)
5. Now we can return to our discussion of the assumptions of linear regression.
D. The assumptions of linear regression in matrix form (Part 2).
2. Normality (continued)
c. Recall that within the linear model (i.e., given that one is assuming a linear relation
between Y and X), the normality assumption implies (in part) that …
$$E(\underset{\sim}{e}) = \underset{\sim}{0}.$$
d. Differently put, if the εi are normally (actually only symmetrically, at this point)
distributed and if Y and X are linearly related (such that the normal distributions of
the εi are centered on the regression line for all values of the Xs), then for any
combination of X-values the εi would have zero mean and consequently the expected
value of $\underset{\sim}{e}$ would be $\underset{\sim}{0}$.
3. Randomness: The Y’s (and thus the e’s) are statistically independent.
a. Consider what ramifications this assumption has for …
$$E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) = E\begin{bmatrix} e_1^2 & e_1 e_2 & \cdots & e_1 e_n \\ e_2 e_1 & e_2^2 & \cdots & e_2 e_n \\ \vdots & \vdots & \ddots & \vdots \\ e_n e_1 & e_n e_2 & \cdots & e_n^2 \end{bmatrix},$$
where $\underset{\sim}{e} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$ is the vector of sampled deviations of Y from the true linear relation between X and Y.
b. In particular, note that since $E(\underset{\sim}{e}) = \underset{\sim}{0}$, it follows that for each $e_i$ (with its particular combination of values among the variables in X), $\mathrm{Var}(e_i) = E\big[(e_i - E(e_i))^2\big] = E(e_i^2)$.
c. Also note that since the randomness assumption implies that the e's are statistically independent, we know that for $i \neq j$, $E(e_i e_j) = E(e_i) * E(e_j) = 0 * 0 = 0$.
d. Thus with the assumption that $E(\underset{\sim}{e}) = \underset{\sim}{0}$, the randomness assumption can be stated as follows:
$$E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) = \begin{bmatrix} \sigma_{e_1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{e_2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{e_n}^2 \end{bmatrix}$$
4. Homoscedasticity: The e's have a constant variance, $\sigma^2$. With this assumption we can further specify that …
$$E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) = \sigma^2 I_n,$$
where $I_n$ is an identity matrix of dimension, n-by-n.
This expression is commonly referred to with the phrase, "The $e_i$ are independently and identically distributed (or IID)."
5. It must also be assumed that the Xs are fixed, or at least that they have been measured without error. In matrix and expectation notations, this implies that …
$$E(X) = X$$
6. One must also assume that the errors, $\underset{\sim}{e}$, are not correlated with the Xs. That is, …
$$E(X^T \underset{\sim}{e}) = E(X^T) * E(\underset{\sim}{e})$$
7. Finally, we must add the assumption regarding the design matrix, X, that …
X is full column rank.
That is to say, no column of X is a linear combination of the other columns of X.
8. The next page is a “Handout” on which these assumptions are summarized, and where
indications are given of how to verify whether or not the assumptions are met. Thereafter
we shall turn to a discussion of how these assumptions are used in justifying how we
estimate slopes in multiple regression analysis.
Stat 404
Assumptions of Linear Regression: Verifying that They Are Met

| Assumption | Verification that assumption is met |
| --- | --- |
| $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$ | Plot Y by each $X_i$ (i = 1…k) and examine for linearity. |
| $E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n$ (this assumption implies both randomness and homoscedasticity in asserting that the $\varepsilon_i$ are IID, i.e., independently and identically distributed). | a. Examine plots for tracking when the X-variable is time-related. b. Examine plots for heteroscedasticity. c. Critically inspect the sampling design for evidence that two or more values of Y may not have been the result of independent random selections. |
| The $\varepsilon_i$ are normally distributed, and thus $E(\underset{\sim}{e}) = \underset{\sim}{0}$. | Examine plots for serious outliers. |
| $E(X) = X$ | Determine whether X was fixed experimentally, or whether the Xs were defensibly measured without error. |
| $E(X^T\underset{\sim}{e}) = E(X^T)E(\underset{\sim}{e})$ | Ensure that X and $\underset{\sim}{\varepsilon}$ are not related to some causally prior factor. |
| X is full column rank. | Verify that the computer can calculate $\underset{\sim}{b}$. |
E. Solving linear equations
1. In regression analysis we begin with a set of "n" linear equations of the form, …
$$Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_k X_{ik} + e_i.$$
2. When you have the same number of distinct equations as unknowns, you can find a
unique solution for the unknowns. (By distinct equations is implied that no two
equations are multiples of each other.) Note that in the set of equations described by
$\underset{\sim}{y} = X\underset{\sim}{b} + \underset{\sim}{e}$, there are (at most) "n" distinct equations with k+1 unknowns.
3. Consider the following two equations with two unknowns:
$$3 = 2b_1 + 3b_2$$
$$7 = 4b_1 + 5b_2$$
4. To solve for $b_1$ and $b_2$, we begin by expressing the equations in matrix form:
$$\underset{\sim}{y} = \begin{bmatrix} 3 \\ 7 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix} * \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = X\underset{\sim}{b}$$
The next step is to find $X^{-1}$ and then, premultiplying both sides of the equation by $X^{-1}$, we get $X^{-1}\underset{\sim}{y} = X^{-1}X\underset{\sim}{b} = \underset{\sim}{b}$.
5. Here are the calculations for solving these two equations:
a. det(X) = 2*5 − 4*3 = −2
b. $X^{-1} = \begin{bmatrix} \dfrac{5}{-2} & \dfrac{-3}{-2} \\[1mm] \dfrac{-4}{-2} & \dfrac{2}{-2} \end{bmatrix} = \begin{bmatrix} -\dfrac{5}{2} & \dfrac{3}{2} \\[1mm] 2 & -1 \end{bmatrix}$
c. $\underset{\sim}{b} = X^{-1}\underset{\sim}{y} = \begin{bmatrix} -\dfrac{5}{2} & \dfrac{3}{2} \\[1mm] 2 & -1 \end{bmatrix} * \begin{bmatrix} 3 \\ 7 \end{bmatrix} = \begin{bmatrix} -\dfrac{15}{2} + \dfrac{21}{2} \\[1mm] 6 - 7 \end{bmatrix} = \begin{bmatrix} 3 \\ -1 \end{bmatrix}$
Thus $b_1 = 3$ and $b_2 = -1$ are the solution to the equations.
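The same solution can be obtained numerically (a sketch, assuming NumPy; X and y are the two-equation example above):

```python
import numpy as np

X = np.array([[2., 3.],
              [4., 5.]])
y = np.array([3., 7.])

# Premultiply y by the inverse of X, as in step 4 above.
print(np.linalg.inv(X) @ y)      # [ 3. -1.]

# In practice, np.linalg.solve gives the same answer without
# explicitly forming the inverse.
print(np.linalg.solve(X, y))     # [ 3. -1.]
```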
6. Note that whenever you have more unknowns than equations, there is NO SINGLE solution for the unknowns. For example, $2b_3 + 3b_4 = 3$ has an infinite number of solutions. (You can pick any number for $b_3$ and the value of $b_4$ follows.) However, when you have more equations than unknowns, there are as many solutions as there are subsets of "as many equations as unknowns." The problem now becomes one of selecting from among the many possible solutions.
F. Ordinary Least Squares (OLS)
1. When solving "n" equations with "k" unknowns (and when n > k), MANY solutions for $\underset{\sim}{b}$ are possible. Statisticians have developed numerous criteria for choosing from among these many possible solutions. The two most commonly-used of these criteria are maximum likelihood (which we shall not cover in this course) and ordinary least squares (which we shall cover).
2. In brief, when using OLS one chooses that matrix, $\underset{\sim}{b}$, that minimizes $\underset{\sim}{e}^T\underset{\sim}{e}$. (Note that whereas $\underset{\sim}{e}\,\underset{\sim}{e}^T$ is an n-by-n matrix, $\underset{\sim}{e}^T\underset{\sim}{e}$ is a single number.)
3. We must use calculus to find these values for the elements of $\underset{\sim}{b}$.
a. We begin by expressing $\underset{\sim}{e}^T\underset{\sim}{e}$ in terms of $\underset{\sim}{b}$. Noting that $\underset{\sim}{e} = \underset{\sim}{y} - X\underset{\sim}{b}$, …
$$\underset{\sim}{e}^T\underset{\sim}{e} = \left(\underset{\sim}{y} - X\underset{\sim}{b}\right)^T\left(\underset{\sim}{y} - X\underset{\sim}{b}\right) = \underset{\sim}{y}^T\underset{\sim}{y} - \underset{\sim}{y}^T X\underset{\sim}{b} - \underset{\sim}{b}^T X^T\underset{\sim}{y} + \underset{\sim}{b}^T X^T X\underset{\sim}{b} = \underset{\sim}{y}^T\underset{\sim}{y} - 2\underset{\sim}{b}^T X^T\underset{\sim}{y} + \underset{\sim}{b}^T X^T X\underset{\sim}{b}$$
(Given that $\underset{\sim}{y}^T X\underset{\sim}{b} = \underset{\sim}{b}^T X^T\underset{\sim}{y}$.)
b. Taking first derivatives with respect to $\underset{\sim}{b}$, we get
$$\frac{\partial}{\partial \underset{\sim}{b}}\left(\underset{\sim}{e}^T\underset{\sim}{e}\right) = -2X^T\underset{\sim}{y} + 2X^T X\underset{\sim}{b}.\text{[3]}$$
c. The formula for $\underset{\sim}{b}$ is found by setting this equal to zero, which yields $X^T X\underset{\sim}{b} = X^T\underset{\sim}{y}$. Premultiplying both sides by $(X^T X)^{-1}$, we get $(X^T X)^{-1}(X^T X)\underset{\sim}{b} = (X^T X)^{-1}X^T\underset{\sim}{y}$. And thus the OLS solution for $\underset{\sim}{b}$ is …
$$\underset{\sim}{b} = \left(X^T X\right)^{-1} X^T\underset{\sim}{y}.$$

[3] The same solution is obtained if $-2\underset{\sim}{y}^T X\underset{\sim}{b}$ is used instead of $-2\underset{\sim}{b}^T X^T\underset{\sim}{y}$ in the formula for $\underset{\sim}{e}^T\underset{\sim}{e}$.
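The OLS formula just derived can be applied directly with NumPy (a minimal sketch; the sample size, slopes, and data below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design matrix with a constant
beta = np.array([1.0, 2.0, -0.5])                             # assumed "true" slopes
y = X @ beta + rng.normal(size=n)                             # y = X*beta + e

# b = (X'X)^{-1} X'y
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(b)                                      # close to [1.0, 2.0, -0.5]

# np.linalg.lstsq solves the same least-squares problem and should agree.
print(np.linalg.lstsq(X, y, rcond=None)[0])
```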
4. Finding $\underset{\sim}{b}$ in the bivariate case
a. Let
$$X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \qquad \underset{\sim}{y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \text{ and thus …}$$
$$X^T\underset{\sim}{y} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} * \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} Y_i \\[1mm] \sum_{i=1}^{n} X_i Y_i \end{bmatrix}$$
b. Then …
$$X^T X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} * \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^{n} X_i \\[1mm] \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^2 \end{bmatrix}$$
$$\det(X^T X) = n\sum X_i^2 - \left(\sum X_i\right)^2 = n\left(\sum X_i^2 - n\bar{X}^2\right) = n\,SS_X$$
$$\left(X^T X\right)^{-1} = \begin{bmatrix} \dfrac{1}{n} + \dfrac{\bar{X}^2}{SS_X} & \dfrac{-\bar{X}}{SS_X} \\[2mm] \dfrac{-\bar{X}}{SS_X} & \dfrac{1}{SS_X} \end{bmatrix}\text{[4]}$$
$$\underset{\sim}{b} = \left(X^T X\right)^{-1}X^T\underset{\sim}{y} = \begin{bmatrix} \dfrac{1}{n} + \dfrac{\bar{X}^2}{SS_X} & \dfrac{-\bar{X}}{SS_X} \\[2mm] \dfrac{-\bar{X}}{SS_X} & \dfrac{1}{SS_X} \end{bmatrix} * \begin{bmatrix} \sum_{i=1}^{n} Y_i \\[1mm] \sum_{i=1}^{n} X_i Y_i \end{bmatrix} = \begin{bmatrix} \bar{Y} - \bar{X}\,\dfrac{\sum X_i Y_i - n\bar{X}\bar{Y}}{SS_X} \\[2mm] \dfrac{\sum X_i Y_i - n\bar{Y}\bar{X}}{SS_X} \end{bmatrix} = \begin{bmatrix} \bar{Y} - \hat{b}\bar{X} \\[2mm] \hat{b} \end{bmatrix} = \begin{bmatrix} \hat{a} \\ \hat{b} \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}$$
c. Thus we end with the familiar formulae for the slope and constant in the bivariate case.
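As a check on the bivariate algebra, the matrix formula and the familiar scalar formulae can be compared numerically (a sketch, assuming NumPy; the five data points are arbitrary illustrative values):

```python
import numpy as np

x = np.array([1., 2., 4., 5., 7.])
Y = np.array([2., 3., 5., 4., 8.])
n = len(x)

X = np.column_stack([np.ones(n), x])
b = np.linalg.inv(X.T @ X) @ X.T @ Y                     # [b0, b1] from the matrix formula

SS_X = np.sum((x - x.mean()) ** 2)
b1 = (np.sum(x * Y) - n * x.mean() * Y.mean()) / SS_X    # familiar slope formula
b0 = Y.mean() - b1 * x.mean()                            # familiar intercept formula

print(b)          # matrix result
print(b0, b1)     # identical values from the scalar formulae
```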
5. Noting that solving for $\underset{\sim}{b}$ requires that $\det(X^T X)$ not equal zero brings us back to the assumption that X is full column rank.
[4] The 1,1 cell in this matrix can be derived as follows:
$$\frac{\sum X^2}{n\,SS_X} = \frac{\sum X^2 - n\bar{X}^2 + n\bar{X}^2}{n\,SS_X} = \frac{SS_X + n\bar{X}^2}{n\,SS_X} = \frac{1}{n} + \frac{\bar{X}^2}{SS_X}$$
a. Recall that when det(A) = 0, $A^{-1}$ does not exist. Thus to show that X must be full column rank for $\underset{\sim}{b}$ to be estimated, it must be demonstrated that $\det(X^T X) = 0$ whenever one of the columns of X is a linear combination of the other columns of X.
b. This is easily illustrated using a small matrix. Let
$$X = \begin{bmatrix} X_{11} & kX_{11} \\ X_{21} & kX_{21} \\ X_{31} & kX_{31} \end{bmatrix},$$
which is NOT full column rank because its second column is k-times its first.
$$X^T X = \begin{bmatrix} X_{11} & X_{21} & X_{31} \\ kX_{11} & kX_{21} & kX_{31} \end{bmatrix} * \begin{bmatrix} X_{11} & kX_{11} \\ X_{21} & kX_{21} \\ X_{31} & kX_{31} \end{bmatrix} = \begin{bmatrix} X_{11}^2 + X_{21}^2 + X_{31}^2 & kX_{11}^2 + kX_{21}^2 + kX_{31}^2 \\ kX_{11}^2 + kX_{21}^2 + kX_{31}^2 & k^2X_{11}^2 + k^2X_{21}^2 + k^2X_{31}^2 \end{bmatrix} = \begin{bmatrix} s & ks \\ ks & k^2 s \end{bmatrix}, \text{ where } s = X_{11}^2 + X_{21}^2 + X_{31}^2$$
$$\det(X^T X) = s * k^2 s - ks * ks = 0$$
Thus, since
$$\left(X^T X\right)^{-1} = \begin{bmatrix} \dfrac{k^2 s}{\det(X^T X)} & \dfrac{-ks}{\det(X^T X)} \\[2mm] \dfrac{-ks}{\det(X^T X)} & \dfrac{s}{\det(X^T X)} \end{bmatrix},$$
the inverse of $X^T X$ does not exist.
c. Accordingly (at least in the case when X has two columns), X must be full column rank in order that $\underset{\sim}{b}$ can be computed. (You'll have to take my word for it that this holds when X has more than two columns, and that an inverse can always be calculated when $\det(X^T X) \neq 0$.)
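The singularity of $X^T X$ under perfect collinearity can be seen numerically as well (a sketch, assuming NumPy; k and the X values are invented for the demonstration):

```python
import numpy as np

k = 2.0
x1 = np.array([1.0, 3.0, 5.0])
X = np.column_stack([x1, k * x1])      # second column is k times the first

XtX = X.T @ X
print(np.linalg.det(XtX))              # approximately 0: X'X is singular
print(np.linalg.matrix_rank(X))        # 1, so X is not of full column rank (2)
# np.linalg.inv(XtX) would raise numpy.linalg.LinAlgError here.
```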
G. The UNBIASEDNESS of $\underset{\sim}{b}$ in estimating the population parameters, $\underset{\sim}{\beta}$.
1. We begin with the assumption that the true relation between the Xs and Ys in our sample is $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$.
2. Applying the formula for $\underset{\sim}{b}$ to the entire population, we obtain the following:
$$\underset{\sim}{b} = \left(X^T X\right)^{-1}X^T\underset{\sim}{y} = \left(X^T X\right)^{-1}X^T\left(X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}\right) = \left(X^T X\right)^{-1}\left(X^T X\right)\underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T\underset{\sim}{\varepsilon} = \underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T\underset{\sim}{\varepsilon}$$
3. Yet what we have in practice is not $\underset{\sim}{\beta} + (X^T X)^{-1}X^T\underset{\sim}{\varepsilon}$ but $\underset{\sim}{\beta} + (X^T X)^{-1}X^T\underset{\sim}{e}$, since we are only dealing with a random sample of "n" units of analysis from our population.
4. Now we get to the IMPORTANT part: What unbiasedness means is that the expected value of a sample estimator is the parameter it is supposed to estimate. That is, $\underset{\sim}{b}$ is an unbiased estimator of $\underset{\sim}{\beta}$ if and only if $E(\underset{\sim}{b}) = \underset{\sim}{\beta}$. We can show this to be the case by taking expected values of both sides of the above equality after exchanging $\underset{\sim}{e}$ for $\underset{\sim}{\varepsilon}$:
$$E(\underset{\sim}{b}) = E\!\left(\underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T\underset{\sim}{e}\right)$$
$$= \underset{\sim}{\beta} + E\!\left(\left(X^T X\right)^{-1}X^T\underset{\sim}{e}\right) \qquad \text{[since } E(X + Y) = E(X) + E(Y)\text{]}$$
$$= \underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T E(\underset{\sim}{e}) \qquad \text{[since } E(X^T\underset{\sim}{e}) = E(X^T) * E(\underset{\sim}{e}) \text{ and } E(X) = X\text{]}$$
$$= \underset{\sim}{\beta} \qquad \text{[since } E(\underset{\sim}{e}) = \underset{\sim}{0}\text{]}$$
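The unbiasedness result can be illustrated by simulation (a sketch, assuming NumPy; the design matrix, slopes, and error variance are invented for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
beta = np.array([2.0, 0.5])                                      # assumed "true" parameters
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])    # fixed design matrix

estimates = []
for _ in range(5000):
    e = rng.normal(scale=2.0, size=n)          # fresh disturbances in each replication
    y = X @ beta + e
    estimates.append(np.linalg.inv(X.T @ X) @ X.T @ y)

# Averaging b across many samples approximates E(b), which should equal beta.
print(np.mean(estimates, axis=0))              # close to [2.0, 0.5]
```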
H. If we add the assumptions here that the e's are statistically independent (randomness) and have common variance (homoscedasticity), that is, if we add the IID assumption that $E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n$, then in addition to being an unbiased estimator of $\underset{\sim}{\beta}$, $\underset{\sim}{b}$ can be shown to be BLUE (i.e., the Best Linear Unbiased Estimator of $\underset{\sim}{\beta}$). "Best" here means that $\underset{\sim}{b}$ has the smallest variance among all linear unbiased estimators. And, of course, "Linear" refers to the premise that $\underset{\sim}{y}$ is a linear function of X.[5]
[5] This is the Gauss-Markov Theorem. See Johnston (1984, pp. 173-4) for a proof.

I. We now have enough information to find a concise expression for Var($\underset{\sim}{b}$) (i.e., for the variance/covariance matrix of the unstandardized slope estimates) that results given the randomness and homoscedasticity assumptions (a.k.a. the IID assumption).
1. Recall the following:
a. $\mathrm{Var}(X) = E\big[(X - E(X))^2\big]$
b. $E(\underset{\sim}{b}) = \underset{\sim}{\beta}$
c. $\underset{\sim}{b} - \underset{\sim}{\beta} = \left(X^T X\right)^{-1}X^T\underset{\sim}{e}$
2. The formula for deriving Var($\underset{\sim}{b}$) is obtained as follows:
$$\mathrm{Var}(\underset{\sim}{b}) = E\!\left[\left(\underset{\sim}{b} - \underset{\sim}{\beta}\right)\left(\underset{\sim}{b} - \underset{\sim}{\beta}\right)^T\right]$$
$$= E\!\left[\left(X^T X\right)^{-1}X^T\underset{\sim}{e}\,\underset{\sim}{e}^T X\left(X^T X\right)^{-1}\right]\text{[6]}$$
$$= \left(X^T X\right)^{-1}X^T E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) X\left(X^T X\right)^{-1} \qquad \text{[since } E(X) = X\text{]}$$
$$= \left(X^T X\right)^{-1}X^T \sigma^2 I_n X\left(X^T X\right)^{-1} \qquad \text{[since } E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n\text{]}$$
$$= \sigma^2 \left(X^T X\right)^{-1}X^T X\left(X^T X\right)^{-1} = \sigma^2 \left(X^T X\right)^{-1}$$

[6] In case you are puzzled about how the last term in this expression was derived, note that since $(X^T X)^{-1}$ is square and symmetric, $\left[(X^T X)^{-1}\right]^T = (X^T X)^{-1}$.
3. What this implies is that if $x^{jj}$ is the j+1, j+1 element from the $(X^T X)^{-1}$ matrix, then the variance of the slope, $\hat{b}_j$, is
$$\hat{\sigma}^2_{\hat{b}_j} = \hat{\sigma}^2 x^{jj}.$$
Referring back to our review of multiple regression, you will note that
$$x^{jj} = \frac{1}{SS_{X_j}\left(1 - R^2_{X_j \cdot X_1, X_2, \dots, X_{j-1}, X_{j+1}, \dots, X_k}\right)}.\text{[7]}$$

[7] Instead of $x^{jj}$, Pedhazur (1997, p. 151) uses $x_{jj}$. Also be aware that when j = 0, $\hat{b}_j = \hat{b}_0 = \hat{a}$ (i.e., the constant in the regression equation).

J. The unbiasedness of the Mean Square Error (MSE, or $\hat{\sigma}^2$)[8] as an estimate of $\sigma^2 \equiv \frac{1}{N}\underset{\sim}{\varepsilon}^T\underset{\sim}{\varepsilon}$.

[8] The proof below is based on Johnston (1984, pp. 180-1).
1. Because the design matrix, X, is fixed (or measured without error), $\mathrm{Var}(\underset{\sim}{b}) = \sigma^2\left(X^T X\right)^{-1}$ is only estimated without bias if $\sigma^2$ is estimated without bias.
2. Demonstrating this requires first noting that $\underset{\sim}{e} = \underset{\sim}{y} - X\underset{\sim}{b}$.
3. Since $\underset{\sim}{b} = \left(X^T X\right)^{-1}X^T\underset{\sim}{y}$, it follows that
$$\underset{\sim}{e} = \underset{\sim}{y} - X\left(X^T X\right)^{-1}X^T\underset{\sim}{y} = \left(I_n - X\left(X^T X\right)^{-1}X^T\right)\underset{\sim}{y}.$$
4. The expression in parentheses that is multiplied by the y-vector has the curious quality of being idempotent (i.e., multiplying it by itself yields the original matrix). Thus,
$$\left(I_n - X\left(X^T X\right)^{-1}X^T\right) * \left(I_n - X\left(X^T X\right)^{-1}X^T\right) = I_n - 2X\left(X^T X\right)^{-1}X^T + X\left(X^T X\right)^{-1}X^T X\left(X^T X\right)^{-1}X^T = I_n - X\left(X^T X\right)^{-1}X^T.$$
Let's call this matrix, M, about which we have just shown MM = M. And since M is symmetric, $M^T = M$ and thus $M^T M = M$.
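The idempotency and symmetry of M can be confirmed numerically (a sketch, assuming NumPy; X is an arbitrary full-column-rank design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(M @ M, M))      # True: M is idempotent
print(np.allclose(M.T, M))        # True: M is symmetric
print(round(np.trace(M)))         # n - k - 1 = 7, anticipating item 8 below
```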
5. Recalling $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$, it follows that
$$\underset{\sim}{e} = M\underset{\sim}{y} = M\left(X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}\right) = MX\underset{\sim}{\beta} + M\underset{\sim}{\varepsilon} = \left(I_n - X\left(X^T X\right)^{-1}X^T\right)X\underset{\sim}{\beta} + M\underset{\sim}{\varepsilon} = \left(X - X\left(X^T X\right)^{-1}X^T X\right)\underset{\sim}{\beta} + M\underset{\sim}{\varepsilon} = \underset{\sim}{0} + M\underset{\sim}{\varepsilon} = M\underset{\sim}{\varepsilon}.$$
6. Now let's consider $SS_{ERROR} = \underset{\sim}{e}^T\underset{\sim}{e} = \underset{\sim}{\varepsilon}^T M^T M\underset{\sim}{\varepsilon}$. Since M is symmetric and idempotent, $M^T M = M$, and thus $SS_{ERROR} = \underset{\sim}{\varepsilon}^T M\underset{\sim}{\varepsilon}$.
7. Taking expected values, $E(\underset{\sim}{e}^T\underset{\sim}{e}) = E(\underset{\sim}{\varepsilon}^T M\underset{\sim}{\varepsilon})$. Since an expectation is being taken of a 1x1 matrix, we seek the expectation of a single number. Yet this number is also the trace of this matrix. Thus, …
$$E\!\left(\underset{\sim}{e}^T\underset{\sim}{e}\right) = E\!\left[\mathrm{tr}\!\left(\underset{\sim}{\varepsilon}^T M\underset{\sim}{\varepsilon}\right)\right] = E\!\left[\mathrm{tr}\!\left(M\underset{\sim}{\varepsilon}\,\underset{\sim}{\varepsilon}^T\right)\right] \qquad \text{[the trace of a product of matrices remains the same no matter what (conformable) sequence in which they are multiplied]}$$
$$= \mathrm{tr}\!\left(M\,E\!\left(\underset{\sim}{\varepsilon}\,\underset{\sim}{\varepsilon}^T\right)\right) = \mathrm{tr}\!\left(M\,\sigma^2 I_n\right) = \sigma^2\,\mathrm{tr}(M) \qquad \text{[since } E(\underset{\sim}{\varepsilon}\,\underset{\sim}{\varepsilon}^T) = \sigma^2 I_n \text{ and } E(X) = X\text{]}$$
8. The trace of M is obtained as follows:
$$\mathrm{tr}(M) = \mathrm{tr}\!\left(I_n - X\left(X^T X\right)^{-1}X^T\right) = \mathrm{tr}(I_n) - \mathrm{tr}\!\left(X\left(X^T X\right)^{-1}X^T\right) = n - \mathrm{tr}\!\left(\left(X^T X\right)^{-1}X^T X\right) = n - \mathrm{tr}(I_{k+1}) = n - k - 1$$
Of course, the trace (i.e., the sum of the diagonal elements) of an identity matrix equals the dimension of that matrix. The only other "trick" in this proof is changing the sequence in which the X matrices are multiplied when the second of the two traces is obtained.
9. What we have shown at this point is that $E(\underset{\sim}{e}^T\underset{\sim}{e}) = E(SS_{ERROR}) = (n - k - 1)\sigma^2$.
a. Since the Mean Square Error is $MSE = \dfrac{SS_{ERROR}}{n - k - 1}$,
$$E(MSE) = E\!\left(\frac{SS_{ERROR}}{n - k - 1}\right) = \frac{E(SS_{ERROR})}{n - k - 1} = \frac{(n - k - 1)\sigma^2}{n - k - 1} = \sigma^2.$$
Thus the MSE is an unbiased estimator of $\sigma^2$.
b. However, it should be noted that this proof only works if all assumptions hold, with the exception of those aspects of the normality assumption beyond $E(\underset{\sim}{e}) = \underset{\sim}{0}$.
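A short simulation illustrates the unbiasedness of the MSE (a sketch, assuming NumPy; the true error variance of 4 and the other settings are invented for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 2
sigma2 = 4.0                                                   # assumed "true" error variance
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, -1.0, 0.5])

mses = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e = y - X @ (np.linalg.inv(X.T @ X) @ X.T @ y)             # residual vector
    mses.append((e @ e) / (n - k - 1))                         # SS_ERROR / (n - k - 1)

print(np.mean(mses))     # close to 4.0, i.e., E(MSE) = sigma^2
```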
K. In fact, at this point all assumptions except the normality assumption have been required.
1. Note that the normality assumption was not needed to obtain unbiased estimates of slopes
or variances.
2. The normality assumption does allow one to use one's variance estimates in testing for the statistical significance of these slopes, however. In particular, if the $\varepsilon_i \sim N(0, \sigma^2)$, then $\underset{\sim}{b} \sim N\!\left(\underset{\sim}{\beta},\; \sigma^2\left(X^T X\right)^{-1}\right)$. That is, the normality assumption allows one to test the hypotheses, …
$$H_0\colon \beta_j = 0$$
$$H_A\colon \beta_j \neq 0$$
with the test statistic,
$$t_{\frac{\alpha}{2},\, n-k-1} = \frac{\hat{b}_j}{\sqrt{MSE * x^{jj}}},$$
where $x^{jj}$ is the j+1, j+1 element from $\left(X^T X\right)^{-1}$.
3. Note that OLS slope and slope-variance estimates are very robust to departures from the
normality assumption. This is why we shall not spend time discussing tests of normality.
Inspections of scatter plots should be sufficient to identify large departures from
normality.
4. The final page of this section is a “Handout” on which the assumptions of linear
regression are listed along with indications of the beneficial consequences that result as
long as they are met.
Stat 404
Assumptions of Linear Regression: Consequences When They Are Met

| Assumption(s) | Consequence(s) if met |
| --- | --- |
| $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$ | The estimates, $\hat{Y}_i$, are relatively close to the observations, $Y_i$, because OLS estimates are appropriate for estimating linear associations. Also $\underset{\sim}{b}$ and $\hat{\sigma}^2$ are meaningful estimates respectively of $\underset{\sim}{\beta}$ and $\sigma^2$. |
| The previous assumption plus $E(\underset{\sim}{e}) = \underset{\sim}{0}$ and $E(X^T\underset{\sim}{e}) = E(X^T)E(\underset{\sim}{e})$. | $\underset{\sim}{b}$ is unbiased as an estimator of $\underset{\sim}{\beta}$. |
| The previous two assumptions plus $E(X) = X$ and $E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n$. | MSE (i.e., $\hat{\sigma}^2$) and thus $\hat{\sigma}^2\left(X^T X\right)^{-1}$ are unbiased as respective estimators of $\sigma^2$ and $\mathrm{Var}(\underset{\sim}{b})$. |
| X is full column rank. | $\left(X^T X\right)^{-1}$ and thus $\underset{\sim}{b}$ can be calculated. |
| The previous assumptions plus the $\varepsilon_i$ are normally distributed. | $\underset{\sim}{b} \sim N\!\left(\underset{\sim}{\beta},\; \sigma^2\left(X^T X\right)^{-1}\right)$, and hypothesis tests of the $\beta_j$ (as described in section K) are justified. |