Stat 404
A Matrix Approach to Linear Regression
A. Elements of matrix algebra (or linear algebra)
1. A matrix is an array of numbers. (Note: Pedhazur uses bold letters to denote matrices. I
shall use capital letters.)
1
2
B
3

4
1 4 
A

 2 3
5
6
8

7
2. Numbers within a matrix are its elements and are assigned two subscripts: the first for
row, the second for column.
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \qquad\qquad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \\ b_{41} & b_{42} \end{bmatrix}$$
3. The dimension of a matrix refers to the number of its rows and columns. For example, A
is of dimension 2x2; B is of dimension 4x2.
4. The transpose of a matrix is obtained by interchanging its rows and columns. That is, if $b_{ij}$ is the element in the ith row and the jth column of B, the element in the jth row and the ith column of $B^T$ will take the value of $b_{ij}$. (Note: Pedhazur uses the notation, B', instead of $B^T$.) For example, …
$$B^T = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 8 & 7 \end{bmatrix}$$
5. A square matrix has the same number of columns as rows. The matrix, A, is a square
matrix.
6. A symmetric matrix is a square matrix within which, for each element, $a_{ij} = a_{ji}$. An example of a symmetric matrix is …
$$S = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}$$
7. The diagonal elements of a square matrix are those elements having the same row and column number. The elements $s_{11} = 1$, $s_{22} = 4$, and $s_{33} = 6$ are the diagonal elements of the matrix, S.
8. A diagonal matrix is a square matrix that has all off-diagonal elements equal to zero. For
example, …
1 0 0
D  0 2 0
0 0 3
9. The identity matrix is a diagonal matrix with all diagonal elements equal to one (1). For
example, in a 3x3 matrix …
1 0 0
I 3  0 1 0
0 0 1
2
10. A vector is a matrix with only one column. Often vectors are assigned lower case letters
with a tilde (~) underneath. (Note: Pedhazur uses bold lower case letters to denote
vectors. I shall use lower case letters with tildes underneath to denote vectors.)
 2
 4
w 
~
6 
 
8 
1
v  3
~
5
11. ADDITION and SUBTRACTION
a. Only matrices of the same dimension can be added. Matrices of the same dimension
are said to be conformable in addition. For example, let …
$$\underset{2 \times 3}{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \qquad\qquad \underset{2 \times 3}{B} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix}$$
b. Their sum is …
$$A + B = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} & a_{13}+b_{13} \\ a_{21}+b_{21} & a_{22}+b_{22} & a_{23}+b_{23} \end{bmatrix}$$
c. Their difference is …
$$A - B = \begin{bmatrix} a_{11}-b_{11} & a_{12}-b_{12} & a_{13}-b_{13} \\ a_{21}-b_{21} & a_{22}-b_{22} & a_{23}-b_{23} \end{bmatrix}$$
d. Matrix addition is …
i. commutative: A + B = B + A
ii. associative: (A + B) + C = A + (B + C)
12. MULTIPLICATION
a. To multiply a matrix by a scalar, s, all elements of the matrix are multiplied by s.
Thus …
$$sA = As = \begin{bmatrix} sa_{11} & sa_{12} & sa_{13} \\ sa_{21} & sa_{22} & sa_{23} \end{bmatrix}$$
b. Matrix multiplication is NOT commutative. That is, in general A * B ≠ B * A. Thus the sequence in which matrices are multiplied is important.
c. Two matrices are conformable in multiplication if the number of columns in the first-sequenced matrix equals the number of rows in the second-sequenced matrix. For example, …
1
2
* A 
B
3
4x2 2x2

4
 b11a11  b12 a 21
b a  b a
  21 11 22 21
b31a11  b32 a 21

b41a11  b42 a 21
5
6 1 4
*
8 2 3

7
b11a12  b12 a 22   1 *1  5 * 2 
b21a12  b22 a 22  2 *1  6 * 2 

b31a12  b32 a 22   3 *1  8 * 2 
 
b41a12  b42 a 22  4 *1  7 * 2 
11
14

19

18
4
19 
26

36

37
1* 4  5 * 3 
2 * 4  6 * 3
3 * 4   8 * 3
4 * 4  7 * 3
d. Note that the dimension of the matrix produced has the number of rows of the first
matrix and the number of columns of the second. In this case, multiplying a 4x2
matrix by a 2x2 matrix yields a 4x2 matrix. If the second matrix had been 2x3, the
resulting matrix would have been 4x3.
e. Although not commutative, matrix multiplication does support …
i. associativity: (A * B) * C = A * (B * C)
and
ii. distributivity: A(B + C) = A * B + A * C .
f. Also note that $(A * B)^T = B^T * A^T$.
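As a quick numerical check of the arithmetic above, the product B * A can be reproduced with NumPy (a minimal sketch, assuming NumPy is available; A and B are the example matrices from item 1):

```python
import numpy as np

# Example matrices from item 1 above.
A = np.array([[1, 4],
              [2, 3]])          # 2x2
B = np.array([[1, 5],
              [2, 6],
              [3, 8],
              [4, 7]])          # 4x2

# B (4x2) times A (2x2) is conformable and yields a 4x2 matrix.
print(B @ A)
# [[11 19]
#  [14 26]
#  [19 36]
#  [18 37]]

# Property 12.f: the transpose of a product reverses the order of the factors.
print(np.allclose((B @ A).T, A.T @ B.T))   # True
```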
13. A determinant can be calculated for any square matrix. We shall only discuss how to
calculate the determinant of a 2x2 matrix. Let …
$$\underset{2 \times 2}{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}.$$
The determinant of A is $\det(A) = (a_{11} * a_{22}) - (a_{12} * a_{21})$.
Thus if $A = \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix}$, then det(A) = 3 − 8 = −5.
14. Whenever perfect collinearity exists among the rows or columns of a square matrix, its determinant equals zero. A matrix with perfect collinearity can be generated by setting one column of a matrix equal to a multiple of another. For example, note that multiplying the first column of the matrix A by four yields its second column:
$$A = \begin{bmatrix} 1 & 4 \\ 2 & 8 \end{bmatrix} \qquad\qquad \det(A) = 8 - 8 = 0$$
Perfect collinearity need not involve only two columns of a matrix. If every element in one of the matrix's columns equals exactly a weighted sum of two or more of the matrix's other columns, you have perfect multicollinearity.
a. When the determinant of a matrix equals zero, the matrix is said to be singular. (In
multiple regression, singularity results whenever two or more independent variables
are perfectly collinear—an extreme case of multicollinearity. As we shall soon see,
within the context of multiple regression, singularity is "undesirable.")
b. Nonsingular matrices are said to be full rank.
15. For every nonsingular square matrix, A, its inverse matrix is a matrix, $A^{-1}$, such that $A * A^{-1} = A^{-1} * A = I$, where I is the identity matrix.
a. If $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$, then …
$$A^{-1} = \begin{bmatrix} \dfrac{a_{22}}{\det(A)} & \dfrac{-a_{12}}{\det(A)} \\[2mm] \dfrac{-a_{21}}{\det(A)} & \dfrac{a_{11}}{\det(A)} \end{bmatrix}.$$
Thus if $A = \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix}$, then det(A) = 3 − 8 = −5 and
$$A^{-1} = \begin{bmatrix} \dfrac{3}{-5} & \dfrac{-4}{-5} \\[1mm] \dfrac{-2}{-5} & \dfrac{1}{-5} \end{bmatrix} = \begin{bmatrix} -\dfrac{3}{5} & \dfrac{4}{5} \\[1mm] \dfrac{2}{5} & -\dfrac{1}{5} \end{bmatrix}.$$
Now note that
$$A^{-1} * A = \begin{bmatrix} -\dfrac{3}{5} & \dfrac{4}{5} \\[1mm] \dfrac{2}{5} & -\dfrac{1}{5} \end{bmatrix} * \begin{bmatrix} 1 & 4 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},$$
as promised.
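The determinant and inverse arithmetic above can be verified numerically (a sketch, assuming NumPy; A is the example matrix just used, and the perfectly collinear matrix from item 14 is included for contrast):

```python
import numpy as np

A = np.array([[1., 4.],
              [2., 3.]])
print(np.linalg.det(A))        # approximately -5, as computed above
A_inv = np.linalg.inv(A)
print(A_inv)                   # [[-0.6  0.8]
                               #  [ 0.4 -0.2]], i.e., [[-3/5, 4/5], [2/5, -1/5]]
print(A_inv @ A)               # the 2x2 identity matrix (within rounding error)

# A matrix with perfect collinearity: its second column is 4 times its first.
C = np.array([[1., 4.],
              [2., 8.]])
print(np.linalg.det(C))        # approximately 0; np.linalg.inv(C) would fail
```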
b. Moreover, the inverse of a matrix does not exist whenever its determinant equals zero. For example, notice that in the 2x2 case every element of $A^{-1}$ has det(A) in its denominator, so when det(A) = 0 the elements of $A^{-1}$ are undefined.
16. Finally, the trace of a square matrix is the sum of its diagonal elements. Two aspects of
the trace will be used at the very end of these lecture notes:
a. The trace of an identity matrix equals its dimension. For example, the identity matrix, $I_3$, shown in item 9 above is of dimension 3, the sum of its diagonal elements is 3, and thus its trace equals 3.
b. As long as their sequence remains the same, a series of matrices that produce a square
matrix can be multiplied starting at any point in the sequence, and the traces of their
products will be equivalent. Thus, for example, the trace of AB equals the trace of
BA, as long as the number of columns in A equals the number of rows in B and the
number of rows in A equals the number of columns in B.1
1
Let the dimension of A be m-by-n and the dimension of B be n-by-m. The diagonal elements
of AB will be (a11*b11 + a12*b21 + … + a2n*bn2), (a21*b12 + a22*b22 + … + a2n*bn2), ... (am1*b1m
+ am2*b2m + … + amn*bnm). The diagonal elements of BA will be (b11*a11 + b12*a21 + … +
b2m*am2), (b21*a12 + b22*a22 + … + b2m*am2), … (bn1*a1n + bn2*a2n + … + bnm*amn). In each case,
when a trace is obtained, the same n-times-m products are being summed. The diagonal
elements of the AB matrix are simply the sums of different subsets of these products than are the
diagonal elements of the BA matrix. For example, a11*b11 is added into the 1,1 cell of both AB
and BA, whereas a12*b21 is added into AB’s 1,1 cell but into BA’s 2,2 cell. In brief, the trace is
the sum of the products between the matrices of all pairs of cells that share either the same row
or the same column.
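The equal-traces property in item 16.b is easy to confirm numerically (a sketch, assuming NumPy; A and B here are arbitrary conformable matrices, not the earlier examples):

```python
import numpy as np

rng = np.random.default_rng(404)
A = rng.normal(size=(3, 5))    # m-by-n
B = rng.normal(size=(5, 3))    # n-by-m

# AB is 3x3 and BA is 5x5, yet the two traces are identical.
print(np.trace(A @ B))
print(np.trace(B @ A))         # same value, within rounding error
```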
B. We now return to the assumptions of linear regression, except that they will now be
expressed in matrix form (Part 1).
1. Linearity: X and Y (or $\underset{\sim}{y}$) are related as $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$, which is shorthand for …
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{N1} & X_{N2} & \cdots & X_{Nk} \end{bmatrix} * \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}, \text{ where}$$
the $\beta_i$ are unknown population parameters (i.e., true slopes in the population),
each $\varepsilon_i$ is an unknown disturbance (i.e., the true deviation of $Y_i$ from the linear equation, $\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik}$), and
X and Y list the knowable values for a set of variables regarding all units of analysis in a population of size, N.
A few comments:
a. In multiple regression analysis, we estimate the unknown $\underset{\sim}{\beta}$ and $\underset{\sim}{\varepsilon}$ based on a sample of size, n.
b. Note that when we stop talking about true characteristics of a population and begin speaking of estimates of these parameters, we shift from using Greek to Roman letters (a shift away from these letters' earlier use to distinguish standardized from unstandardized slopes). Greek letters represent "unknowns" that are estimated in an analysis, whereas lower-case Roman letters correspond to estimates of these unknowns that are based on a sample. Thus, $b_i$ estimates the parameter, $\beta_i$, and $e_i$ estimates the disturbance, $\varepsilon_i$. (Note that unlike our previous lecture notes, the "hat" symbol, ^, is no longer used to distinguish estimates from parameters. More generally, "hats" are never placed above symbols for matrices.)
c. The linearity assumption presupposes that all variables are interval- or ratio-level
measures (i.e., that all variables have units like number of children, degrees
Centigrade, etc.).
2. Normality: The εi are normally distributed about the (true) regression line.
a. One consequence of this assumption is that the expected value of $\underset{\sim}{e}$ is $\underset{\sim}{0}$, where $\underset{\sim}{0}$ is a vector of zeros.
b. In expectation notation this is written as $E(\underset{\sim}{e}) = \underset{\sim}{0}$. Of course, now we need to become acquainted with "expectation notation."
C. Mathematical expectation
1. You will find that expectation notation and summation notation are very similar. In fact,
the two notations are identical, except that expectations are based on probability theory
and summations describe concrete manipulations of one’s data.
2. The fundamental principle behind the idea of expectation is the concept of a random
variable, Y, that assumes its value according to the outcome of a chance event. For
example, one might consider the chance event of the number of children born to whoever is the first person to be drawn at random from a specific population. In this
case, a random variable, Y, might be defined as …
Y = the number of children born to person #1 .
More precisely, if we let S be the sample space for the chance outcomes from a random
sampling of persons, a random variable is a function (i.e., a rule of correspondence) that
associates with each element of S exactly 1 real number.
3. In this example, S consists of all possible numbers of offspring. The random variable Y
can be thought of as the rule that defines how each element of S can be paired with
exactly 1 real number. For example, this rule might be, “If Person 4’s questionnaire
contains a ‘2’ on the blank line at the end of Questionnaire Item #5, then assign the
integer, 2, to the fourth element of S.”
a. If the distribution of Y is discrete, then the expectation (or expected value) of Y is defined (≡) as follows:
$$E(Y) \equiv \sum_{i=1}^{m} Y_i \Pr(Y = Y_i),$$
where the $Y_i$ comprise the set of all of the possible, m, distinct values of a discrete random variable (in the case of number of offspring, these values can only be nonnegative integers),[2]
where $\Pr(Y = Y_i) = \dfrac{N_i}{N}$ (here, $N_i$ is the number of people from the population who are contained in the ith group [e.g., those with 2 children] and N is the population size), and
where $\sum_i \Pr(Y = Y_i) = 1$.

[2] This discussion is limited to discrete random variables, which assume at most a finite number (although, if one considers Genghis Khan, a potentially large one for the number of children born to males) of possible values.
b. Thus an expectation is a kind of weighted sum of values. To illustrate, consider a
small population consisting only of 70 people: 15 with no children, 20 with one child,
20 with two children, and 15 with three children. The expected number of children
from any person drawn at random from this population would be calculated as
follows:
 15   20   20   15  105
Y    Yi Pr Yi   0   1   2   3  
 1.5   ,
 70   70   70   70  70
which is, of course, what you would expect. Right?
c. Yet our population sizes are usually much larger than 70, and the probabilities for
different outcomes are generally unknown. Such summations are usually concrete
calculations performed on all cases in one’s sample. Expectation notation allows us
to speak hypothetically of such calculations, “as if they were to be done.”
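The weighted sum in item 3.b can be reproduced directly in Python (a minimal sketch; the counts are the ones from the 70-person example above):

```python
# Counts of people with 0, 1, 2, and 3 children in the example population (N = 70).
values = [0, 1, 2, 3]
counts = [15, 20, 20, 15]
N = sum(counts)

# E(Y) = sum of Y_i * Pr(Y = Y_i), with Pr(Y = Y_i) = N_i / N.
expected_Y = sum(y * (n_i / N) for y, n_i in zip(values, counts))
print(expected_Y)    # 1.5
```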
4. You should also be aware of additional rules of expectation mathematics. In the
following rules (which are given without proof) assume that “a” is a constant real number
and that X, Y, and all Xi are random variables with respective expectations E(X), E(Y),
and E(Xi):
$$E(a) = a$$
$$E(aX) = aE(X)$$
$$E(X + a) = E(X) + a$$
$$E(X + Y) = E(X) + E(Y)$$
$$\mathrm{Var}(X) = \sigma_X^2 = E\big[(X - E(X))^2\big] = E(X^2) - [E(X)]^2$$
$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y), \quad \text{IFF (if and only if) X and Y are statistically independent}$$
$$E(XY) = E(X) * E(Y), \quad \text{IFF X and Y are statistically independent}$$
$$\mathrm{Cov}(XY) = E(XY) - E(X)E(Y)$$
Note that, given the next-to-last rule, it follows that when X and Y are statistically independent, the covariance, Cov(XY), equals zero. (The converse does not hold in general: a zero covariance indicates only that X and Y are uncorrelated, not necessarily that they are independent.)
5. Now we can return to our discussion of the assumptions of linear regression.
D. The assumptions of linear regression in matrix form (Part 2).
2. Normality (continued)
c. Recall that within the linear model (i.e., given that one is assuming a linear relation
between Y and X), the normality assumption implies (in part) that …
$$E(\underset{\sim}{e}) = \underset{\sim}{0}.$$
d. Differently put, if the εi are normally (actually only symmetrically, at this point)
distributed and if Y and X are linearly related (such that the normal distributions of
the εi are centered on the regression line for all values of the Xs), then for any
combination of X-values the εi would have zero mean and consequently the expected
value of $\underset{\sim}{e}$ would be $\underset{\sim}{0}$.
3. Randomness: The Y’s (and thus the e’s) are statistically independent.
a. Consider what ramifications this assumption has for …
$$E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) = E\begin{bmatrix} e_1^2 & e_1 e_2 & \cdots & e_1 e_n \\ e_2 e_1 & e_2^2 & \cdots & e_2 e_n \\ \vdots & \vdots & \ddots & \vdots \\ e_n e_1 & e_n e_2 & \cdots & e_n^2 \end{bmatrix},$$
where $\underset{\sim}{e} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$ is the vector of sampled deviations of Y from the true linear relation between X and Y.
b. In particular, note that since $E(\underset{\sim}{e}) = \underset{\sim}{0}$, it follows that for each $e_i$ (with its particular combination of values among the variables in X), $\mathrm{Var}(e_i) = E\big[(e_i - E(e_i))^2\big] = E(e_i^2)$.
c. Also note that since the randomness assumption implies that the e's are statistically independent, we know that for $i \neq j$, $E(e_i e_j) = E(e_i) * E(e_j) = 0 * 0 = 0$.
d. Thus with the assumption that $E(\underset{\sim}{e}) = \underset{\sim}{0}$, the randomness assumption can be stated as follows:
$$E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) = \begin{bmatrix} \sigma_{e_1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{e_2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{e_n}^2 \end{bmatrix}$$
4. Homoscedasticity: The e's have a constant variance, $\sigma^2$. With this assumption we can further specify that …
$$E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) = \sigma^2 I_n,$$
where $I_n$ is an identity matrix of dimension, n-by-n.
This expression is commonly referred to with the phrase, "The $e_i$ are independently and identically distributed (or IID)."
5. It must also be assumed that the Xs are fixed, or at least that they have been measured without error. In matrix and expectation notations, this implies that …
$$E(X) = X$$
6. One must also assume that the errors, $\underset{\sim}{e}$, are not correlated with the Xs. That is, …
$$E(X^T \underset{\sim}{e}) = E(X^T) * E(\underset{\sim}{e})$$
7. Finally, we must add the assumption regarding the design matrix, X, that …
X is full column rank.
That is to say, no column of X is a linear combination of the other columns of X.
8. The next page is a “Handout” on which these assumptions are summarized, and where
indications are given of how to verify whether or not the assumptions are met. Thereafter
we shall turn to a discussion of how these assumptions are used in justifying how we
estimate slopes in multiple regression analysis.
Stat 404
Assumptions of Linear Regression: Verifying that They Are Met

| Assumption | Verification that assumption is met |
| --- | --- |
| $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$ | Plot Y by each $X_i$ (i = 1…k) and examine for linearity. |
| $E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n$ (this assumption implies both randomness and homoscedasticity in asserting that the $\varepsilon_i$ are IID, i.e., independently and identically distributed). | a. Examine plots for tracking when the X-variable is time-related. b. Examine plots for heteroscedasticity. c. Critically inspect the sampling design for evidence that two or more values of Y may not have been the result of independent random selections. |
| The $\varepsilon_i$ are normally distributed, and thus $E(\underset{\sim}{e}) = \underset{\sim}{0}$. | Examine plots for serious outliers. |
| $E(X) = X$ | Determine whether X was fixed experimentally, or whether the Xs were defensibly measured without error. |
| $E(X^T\underset{\sim}{e}) = E(X^T)E(\underset{\sim}{e})$ | Ensure that X and $\underset{\sim}{\varepsilon}$ are not related to some causally prior factor. |
| X is full column rank. | Verify that the computer can calculate $\underset{\sim}{b}$. |
E. Solving linear equations
1. In regression analysis we begin with a set of "n" linear equations of the form, …
$$Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_k X_{ik} + e_i.$$
2. When you have the same number of distinct equations as unknowns, you can find a
unique solution for the unknowns. (By distinct equations is implied that no two
equations are multiples of each other.) Note that in the set of equations described by
$\underset{\sim}{y} = X\underset{\sim}{b} + \underset{\sim}{e}$, there are (at most) "n" distinct equations with k+1 unknowns.
3. Consider the following two equations with two unknowns:
$$3 = 2b_1 + 3b_2$$
$$7 = 4b_1 + 5b_2$$
4. To solve for $b_1$ and $b_2$, we begin by expressing the equations in matrix form:
$$\underset{\sim}{y} = \begin{bmatrix} 3 \\ 7 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix} * \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = X\underset{\sim}{b}$$
The next step is to find $X^{-1}$ and then, premultiplying both sides of the equation by $X^{-1}$, we get $X^{-1}\underset{\sim}{y} = X^{-1}X\underset{\sim}{b} = \underset{\sim}{b}$.
5. Here are the calculations for solving these two equations:
a. det(X) = 2*5 − 4*3 = −2
b. $X^{-1} = \begin{bmatrix} \dfrac{5}{-2} & \dfrac{-3}{-2} \\[1mm] \dfrac{-4}{-2} & \dfrac{2}{-2} \end{bmatrix} = \begin{bmatrix} -\dfrac{5}{2} & \dfrac{3}{2} \\[1mm] 2 & -1 \end{bmatrix}$
c. $\underset{\sim}{b} = X^{-1}\underset{\sim}{y} = \begin{bmatrix} -\dfrac{5}{2} & \dfrac{3}{2} \\[1mm] 2 & -1 \end{bmatrix} * \begin{bmatrix} 3 \\ 7 \end{bmatrix} = \begin{bmatrix} -\dfrac{15}{2} + \dfrac{21}{2} \\[1mm] 6 - 7 \end{bmatrix} = \begin{bmatrix} 3 \\ -1 \end{bmatrix}$
Thus $b_1 = 3$ and $b_2 = -1$ are the solution to the equations.
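The same solution can be obtained numerically (a sketch, assuming NumPy; X and y are the two-equation example above):

```python
import numpy as np

X = np.array([[2., 3.],
              [4., 5.]])
y = np.array([3., 7.])

# Premultiply y by the inverse of X, as in step 4 above.
print(np.linalg.inv(X) @ y)      # [ 3. -1.]

# In practice, np.linalg.solve gives the same answer without
# explicitly forming the inverse.
print(np.linalg.solve(X, y))     # [ 3. -1.]
```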
6. Note that whenever you have more unknowns than equations, there is NO SINGLE solution for the unknowns. For example, $2b_3 + 3b_4 = 3$ has an infinite number of solutions. (You can pick any number for $b_3$ and the value of $b_4$ follows.) However, when you have more equations than unknowns, there are as many solutions as there are subsets of "as many equations as unknowns." The problem now becomes one of selecting from among the many possible solutions.
F. Ordinary Least Squares (OLS)
1. When solving "n" equations with "k" unknowns (and when n > k), MANY solutions for $\underset{\sim}{b}$ are possible. Statisticians have developed numerous criteria for choosing from among these many possible solutions. The two most commonly-used of these criteria are maximum likelihood (which we shall not cover in this course) and ordinary least squares (which we shall cover).
2. In brief, when using OLS one chooses that matrix, $\underset{\sim}{b}$, that minimizes $\underset{\sim}{e}^T\underset{\sim}{e}$. (Note that whereas $\underset{\sim}{e}\,\underset{\sim}{e}^T$ is an n-by-n matrix, $\underset{\sim}{e}^T\underset{\sim}{e}$ is a single number.)
3. We must use calculus to find these values for the elements of $\underset{\sim}{b}$.
a. We begin by expressing $\underset{\sim}{e}^T\underset{\sim}{e}$ in terms of $\underset{\sim}{b}$. Noting that $\underset{\sim}{e} = \underset{\sim}{y} - X\underset{\sim}{b}$, …
$$\underset{\sim}{e}^T\underset{\sim}{e} = \left(\underset{\sim}{y} - X\underset{\sim}{b}\right)^T\left(\underset{\sim}{y} - X\underset{\sim}{b}\right) = \underset{\sim}{y}^T\underset{\sim}{y} - \underset{\sim}{y}^T X\underset{\sim}{b} - \underset{\sim}{b}^T X^T\underset{\sim}{y} + \underset{\sim}{b}^T X^T X\underset{\sim}{b} = \underset{\sim}{y}^T\underset{\sim}{y} - 2\underset{\sim}{b}^T X^T\underset{\sim}{y} + \underset{\sim}{b}^T X^T X\underset{\sim}{b}$$
(Given that $\underset{\sim}{y}^T X\underset{\sim}{b} = \underset{\sim}{b}^T X^T\underset{\sim}{y}$.)
b. Taking first derivatives with respect to $\underset{\sim}{b}$, we get
$$\frac{\partial}{\partial \underset{\sim}{b}}\left(\underset{\sim}{e}^T\underset{\sim}{e}\right) = -2X^T\underset{\sim}{y} + 2X^T X\underset{\sim}{b}.\text{[3]}$$
c. The formula for $\underset{\sim}{b}$ is found by setting this equal to zero, which yields $X^T X\underset{\sim}{b} = X^T\underset{\sim}{y}$. Premultiplying both sides by $(X^T X)^{-1}$, we get $(X^T X)^{-1}(X^T X)\underset{\sim}{b} = (X^T X)^{-1}X^T\underset{\sim}{y}$. And thus the OLS solution for $\underset{\sim}{b}$ is …
$$\underset{\sim}{b} = \left(X^T X\right)^{-1} X^T\underset{\sim}{y}.$$

[3] The same solution is obtained if $-2\underset{\sim}{y}^T X\underset{\sim}{b}$ is used instead of $-2\underset{\sim}{b}^T X^T\underset{\sim}{y}$ in the formula for $\underset{\sim}{e}^T\underset{\sim}{e}$.
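The OLS formula just derived can be applied directly with NumPy (a minimal sketch; the sample size, slopes, and data below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design matrix with a constant
beta = np.array([1.0, 2.0, -0.5])                             # assumed "true" slopes
y = X @ beta + rng.normal(size=n)                             # y = X*beta + e

# b = (X'X)^{-1} X'y
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(b)                                      # close to [1.0, 2.0, -0.5]

# np.linalg.lstsq solves the same least-squares problem and should agree.
print(np.linalg.lstsq(X, y, rcond=None)[0])
```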
4. Finding $\underset{\sim}{b}$ in the bivariate case
a. Let
$$X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \qquad \underset{\sim}{y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \text{ and thus …}$$
$$X^T\underset{\sim}{y} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} * \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} Y_i \\[1mm] \sum_{i=1}^{n} X_i Y_i \end{bmatrix}$$
b. Then …
$$X^T X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} * \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^{n} X_i \\[1mm] \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^2 \end{bmatrix}$$
$$\det(X^T X) = n\sum X_i^2 - \left(\sum X_i\right)^2 = n\left(\sum X_i^2 - n\bar{X}^2\right) = n\,SS_X$$
$$\left(X^T X\right)^{-1} = \begin{bmatrix} \dfrac{1}{n} + \dfrac{\bar{X}^2}{SS_X} & \dfrac{-\bar{X}}{SS_X} \\[2mm] \dfrac{-\bar{X}}{SS_X} & \dfrac{1}{SS_X} \end{bmatrix}\text{[4]}$$
$$\underset{\sim}{b} = \left(X^T X\right)^{-1}X^T\underset{\sim}{y} = \begin{bmatrix} \dfrac{1}{n} + \dfrac{\bar{X}^2}{SS_X} & \dfrac{-\bar{X}}{SS_X} \\[2mm] \dfrac{-\bar{X}}{SS_X} & \dfrac{1}{SS_X} \end{bmatrix} * \begin{bmatrix} \sum_{i=1}^{n} Y_i \\[1mm] \sum_{i=1}^{n} X_i Y_i \end{bmatrix} = \begin{bmatrix} \bar{Y} - \bar{X}\,\dfrac{\sum X_i Y_i - n\bar{X}\bar{Y}}{SS_X} \\[2mm] \dfrac{\sum X_i Y_i - n\bar{Y}\bar{X}}{SS_X} \end{bmatrix} = \begin{bmatrix} \bar{Y} - \hat{b}\bar{X} \\[2mm] \hat{b} \end{bmatrix} = \begin{bmatrix} \hat{a} \\ \hat{b} \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}$$
c. Thus we end with the familiar formulae for the slope and constant in the bivariate case.
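As a check on the bivariate algebra, the matrix formula and the familiar scalar formulae can be compared numerically (a sketch, assuming NumPy; the five data points are arbitrary illustrative values):

```python
import numpy as np

x = np.array([1., 2., 4., 5., 7.])
Y = np.array([2., 3., 5., 4., 8.])
n = len(x)

X = np.column_stack([np.ones(n), x])
b = np.linalg.inv(X.T @ X) @ X.T @ Y                     # [b0, b1] from the matrix formula

SS_X = np.sum((x - x.mean()) ** 2)
b1 = (np.sum(x * Y) - n * x.mean() * Y.mean()) / SS_X    # familiar slope formula
b0 = Y.mean() - b1 * x.mean()                            # familiar intercept formula

print(b)          # matrix result
print(b0, b1)     # identical values from the scalar formulae
```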
5. Noting that solving for $\underset{\sim}{b}$ requires that $\det(X^T X)$ not equal zero brings us back to the assumption that X is full column rank.
[4] The 1,1 cell in this matrix can be derived as follows:
$$\frac{\sum X^2}{n\,SS_X} = \frac{\sum X^2 - n\bar{X}^2 + n\bar{X}^2}{n\,SS_X} = \frac{SS_X + n\bar{X}^2}{n\,SS_X} = \frac{1}{n} + \frac{\bar{X}^2}{SS_X}$$
a. Recall that when det(A) = 0, $A^{-1}$ does not exist. Thus to show that X must be full column rank for $\underset{\sim}{b}$ to be estimated, it must be demonstrated that $\det(X^T X) = 0$ whenever one of the columns of X is a linear combination of the other columns of X.
b. This is easily illustrated using a small matrix. Let
$$X = \begin{bmatrix} X_{11} & kX_{11} \\ X_{21} & kX_{21} \\ X_{31} & kX_{31} \end{bmatrix},$$
which is NOT full column rank because its second column is k-times its first.
$$X^T X = \begin{bmatrix} X_{11} & X_{21} & X_{31} \\ kX_{11} & kX_{21} & kX_{31} \end{bmatrix} * \begin{bmatrix} X_{11} & kX_{11} \\ X_{21} & kX_{21} \\ X_{31} & kX_{31} \end{bmatrix} = \begin{bmatrix} X_{11}^2 + X_{21}^2 + X_{31}^2 & kX_{11}^2 + kX_{21}^2 + kX_{31}^2 \\ kX_{11}^2 + kX_{21}^2 + kX_{31}^2 & k^2X_{11}^2 + k^2X_{21}^2 + k^2X_{31}^2 \end{bmatrix} = \begin{bmatrix} s & ks \\ ks & k^2 s \end{bmatrix}, \text{ where } s = X_{11}^2 + X_{21}^2 + X_{31}^2$$
$$\det(X^T X) = s * k^2 s - ks * ks = 0$$
Thus, since
$$\left(X^T X\right)^{-1} = \begin{bmatrix} \dfrac{k^2 s}{\det(X^T X)} & \dfrac{-ks}{\det(X^T X)} \\[2mm] \dfrac{-ks}{\det(X^T X)} & \dfrac{s}{\det(X^T X)} \end{bmatrix},$$
the inverse of $X^T X$ does not exist.
c. Accordingly (at least in the case when X has two columns), X must be full column rank in order that $\underset{\sim}{b}$ can be computed. (You'll have to take my word for it that this holds when X has more than two columns, and that an inverse can always be calculated when $\det(X^T X) \neq 0$.)
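The singularity of $X^T X$ under perfect collinearity can be seen numerically as well (a sketch, assuming NumPy; k and the X values are invented for the demonstration):

```python
import numpy as np

k = 2.0
x1 = np.array([1.0, 3.0, 5.0])
X = np.column_stack([x1, k * x1])      # second column is k times the first

XtX = X.T @ X
print(np.linalg.det(XtX))              # approximately 0: X'X is singular
print(np.linalg.matrix_rank(X))        # 1, so X is not of full column rank (2)
# np.linalg.inv(XtX) would raise numpy.linalg.LinAlgError here.
```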
G. The UNBIASEDNESS of $\underset{\sim}{b}$ in estimating the population parameters, $\underset{\sim}{\beta}$.
1. We begin with the assumption that the true relation between the Xs and Ys in our sample is $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$.
2. Applying the formula for $\underset{\sim}{b}$ to the entire population, we obtain the following:
$$\underset{\sim}{b} = \left(X^T X\right)^{-1}X^T\underset{\sim}{y} = \left(X^T X\right)^{-1}X^T\left(X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}\right) = \left(X^T X\right)^{-1}\left(X^T X\right)\underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T\underset{\sim}{\varepsilon} = \underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T\underset{\sim}{\varepsilon}$$
3. Yet what we have in practice is not $\underset{\sim}{\beta} + (X^T X)^{-1}X^T\underset{\sim}{\varepsilon}$ but $\underset{\sim}{\beta} + (X^T X)^{-1}X^T\underset{\sim}{e}$, since we are only dealing with a random sample of "n" units of analysis from our population.
4. Now we get to the IMPORTANT part: What unbiasedness means is that the expected value of a sample estimator is the parameter it is supposed to estimate. That is, $\underset{\sim}{b}$ is an unbiased estimator of $\underset{\sim}{\beta}$ if and only if $E(\underset{\sim}{b}) = \underset{\sim}{\beta}$. We can show this to be the case by taking expected values of both sides of the above equality after exchanging $\underset{\sim}{e}$ for $\underset{\sim}{\varepsilon}$:
$$E(\underset{\sim}{b}) = E\!\left(\underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T\underset{\sim}{e}\right)$$
$$= \underset{\sim}{\beta} + E\!\left(\left(X^T X\right)^{-1}X^T\underset{\sim}{e}\right) \qquad \text{[since } E(X + Y) = E(X) + E(Y)\text{]}$$
$$= \underset{\sim}{\beta} + \left(X^T X\right)^{-1}X^T E(\underset{\sim}{e}) \qquad \text{[since } E(X^T\underset{\sim}{e}) = E(X^T) * E(\underset{\sim}{e}) \text{ and } E(X) = X\text{]}$$
$$= \underset{\sim}{\beta} \qquad \text{[since } E(\underset{\sim}{e}) = \underset{\sim}{0}\text{]}$$
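The unbiasedness result can be illustrated by simulation (a sketch, assuming NumPy; the design matrix, slopes, and error variance are invented for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
beta = np.array([2.0, 0.5])                                      # assumed "true" parameters
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])    # fixed design matrix

estimates = []
for _ in range(5000):
    e = rng.normal(scale=2.0, size=n)          # fresh disturbances in each replication
    y = X @ beta + e
    estimates.append(np.linalg.inv(X.T @ X) @ X.T @ y)

# Averaging b across many samples approximates E(b), which should equal beta.
print(np.mean(estimates, axis=0))              # close to [2.0, 0.5]
```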
H. If we add the assumptions here that the e's are statistically independent (randomness) and have common variance (homoscedasticity), that is, if we add the IID assumption that $E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n$, then in addition to being an unbiased estimator of $\underset{\sim}{\beta}$, $\underset{\sim}{b}$ can be shown to be BLUE (i.e., the Best Linear Unbiased Estimator of $\underset{\sim}{\beta}$). "Best" here means that $\underset{\sim}{b}$ has the smallest variance among all linear unbiased estimators. And, of course, "Linear" refers to the premise that $\underset{\sim}{y}$ is a linear function of X.[5]
[5] This is the Gauss-Markov Theorem. See Johnston (1984, pp. 173-4) for a proof.

I. We now have enough information to find a concise expression for Var($\underset{\sim}{b}$) (i.e., for the variance/covariance matrix of the unstandardized slope estimates) that results given the randomness and homoscedasticity assumptions (a.k.a. the IID assumption).
1. Recall the following:
a. $\mathrm{Var}(X) = E\big[(X - E(X))^2\big]$
b. $E(\underset{\sim}{b}) = \underset{\sim}{\beta}$
c. $\underset{\sim}{b} - \underset{\sim}{\beta} = \left(X^T X\right)^{-1}X^T\underset{\sim}{e}$
2. The formula for deriving Var($\underset{\sim}{b}$) is obtained as follows:
$$\mathrm{Var}(\underset{\sim}{b}) = E\!\left[\left(\underset{\sim}{b} - \underset{\sim}{\beta}\right)\left(\underset{\sim}{b} - \underset{\sim}{\beta}\right)^T\right]$$
$$= E\!\left[\left(X^T X\right)^{-1}X^T\underset{\sim}{e}\,\underset{\sim}{e}^T X\left(X^T X\right)^{-1}\right]\text{[6]}$$
$$= \left(X^T X\right)^{-1}X^T E\!\left(\underset{\sim}{e}\,\underset{\sim}{e}^T\right) X\left(X^T X\right)^{-1} \qquad \text{[since } E(X) = X\text{]}$$
$$= \left(X^T X\right)^{-1}X^T \sigma^2 I_n X\left(X^T X\right)^{-1} \qquad \text{[since } E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n\text{]}$$
$$= \sigma^2 \left(X^T X\right)^{-1}X^T X\left(X^T X\right)^{-1} = \sigma^2 \left(X^T X\right)^{-1}$$

[6] In case you are puzzled about how the last term in this expression was derived, note that since $(X^T X)^{-1}$ is square and symmetric, $\left[(X^T X)^{-1}\right]^T = (X^T X)^{-1}$.
3. What this implies is that if $x^{jj}$ is the j+1, j+1 element from the $(X^T X)^{-1}$ matrix, then the variance of the slope, $\hat{b}_j$, is
$$\hat{\sigma}^2_{\hat{b}_j} = \hat{\sigma}^2 x^{jj}.$$
Referring back to our review of multiple regression, you will note that
$$x^{jj} = \frac{1}{SS_{X_j}\left(1 - R^2_{X_j \cdot X_1, X_2, \dots, X_{j-1}, X_{j+1}, \dots, X_k}\right)}.\text{[7]}$$

[7] Instead of $x^{jj}$, Pedhazur (1997, p. 151) uses $x_{jj}$. Also be aware that when j = 0, $\hat{b}_j = \hat{b}_0 = \hat{a}$ (i.e., the constant in the regression equation).

J. The unbiasedness of the Mean Square Error (MSE, or $\hat{\sigma}^2$)[8] as an estimate of $\sigma^2 \equiv \frac{1}{N}\underset{\sim}{\varepsilon}^T\underset{\sim}{\varepsilon}$.

[8] The proof below is based on Johnston (1984, pp. 180-1).
1. Because the design matrix, X, is fixed (or measured without error), $\mathrm{Var}(\underset{\sim}{b}) = \sigma^2\left(X^T X\right)^{-1}$ is only estimated without bias if $\sigma^2$ is estimated without bias.
2. Demonstrating this requires first noting that $\underset{\sim}{e} = \underset{\sim}{y} - X\underset{\sim}{b}$.
3. Since $\underset{\sim}{b} = \left(X^T X\right)^{-1}X^T\underset{\sim}{y}$, it follows that
$$\underset{\sim}{e} = \underset{\sim}{y} - X\left(X^T X\right)^{-1}X^T\underset{\sim}{y} = \left(I_n - X\left(X^T X\right)^{-1}X^T\right)\underset{\sim}{y}.$$
4. The expression in parentheses that is multiplied by the y-vector has the curious quality of being idempotent (i.e., multiplying it by itself yields the original matrix). Thus,
$$\left(I_n - X\left(X^T X\right)^{-1}X^T\right) * \left(I_n - X\left(X^T X\right)^{-1}X^T\right) = I_n - 2X\left(X^T X\right)^{-1}X^T + X\left(X^T X\right)^{-1}X^T X\left(X^T X\right)^{-1}X^T = I_n - X\left(X^T X\right)^{-1}X^T.$$
Let's call this matrix, M, about which we have just shown MM = M. And since M is symmetric, $M^T = M$ and thus $M^T M = M$.
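The idempotency and symmetry of M can be confirmed numerically (a sketch, assuming NumPy; X is an arbitrary full-column-rank design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(M @ M, M))      # True: M is idempotent
print(np.allclose(M.T, M))        # True: M is symmetric
print(round(np.trace(M)))         # n - k - 1 = 7, anticipating item 8 below
```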
5. Recalling $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$, it follows that
$$\underset{\sim}{e} = M\underset{\sim}{y} = M\left(X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}\right) = MX\underset{\sim}{\beta} + M\underset{\sim}{\varepsilon} = \left(I_n - X\left(X^T X\right)^{-1}X^T\right)X\underset{\sim}{\beta} + M\underset{\sim}{\varepsilon} = \left(X - X\left(X^T X\right)^{-1}X^T X\right)\underset{\sim}{\beta} + M\underset{\sim}{\varepsilon} = \underset{\sim}{0} + M\underset{\sim}{\varepsilon} = M\underset{\sim}{\varepsilon}.$$
6. Now let's consider $SS_{ERROR} = \underset{\sim}{e}^T\underset{\sim}{e} = \underset{\sim}{\varepsilon}^T M^T M\underset{\sim}{\varepsilon}$. Since M is symmetric and idempotent, $M^T M = M$, and thus $SS_{ERROR} = \underset{\sim}{\varepsilon}^T M\underset{\sim}{\varepsilon}$.
7. Taking expected values, $E(\underset{\sim}{e}^T\underset{\sim}{e}) = E(\underset{\sim}{\varepsilon}^T M\underset{\sim}{\varepsilon})$. Since an expectation is being taken of a 1x1 matrix, we seek the expectation of a single number. Yet this number is also the trace of this matrix. Thus, …
$$E\!\left(\underset{\sim}{e}^T\underset{\sim}{e}\right) = E\!\left[\mathrm{tr}\!\left(\underset{\sim}{\varepsilon}^T M\underset{\sim}{\varepsilon}\right)\right] = E\!\left[\mathrm{tr}\!\left(M\underset{\sim}{\varepsilon}\,\underset{\sim}{\varepsilon}^T\right)\right] \qquad \text{[the trace of a product of matrices remains the same no matter what (conformable) sequence in which they are multiplied]}$$
$$= \mathrm{tr}\!\left(M\,E\!\left(\underset{\sim}{\varepsilon}\,\underset{\sim}{\varepsilon}^T\right)\right) = \mathrm{tr}\!\left(M\,\sigma^2 I_n\right) = \sigma^2\,\mathrm{tr}(M) \qquad \text{[since } E(\underset{\sim}{\varepsilon}\,\underset{\sim}{\varepsilon}^T) = \sigma^2 I_n \text{ and } E(X) = X\text{]}$$
8. The trace of M is obtained as follows:
$$\mathrm{tr}(M) = \mathrm{tr}\!\left(I_n - X\left(X^T X\right)^{-1}X^T\right) = \mathrm{tr}(I_n) - \mathrm{tr}\!\left(X\left(X^T X\right)^{-1}X^T\right) = n - \mathrm{tr}\!\left(\left(X^T X\right)^{-1}X^T X\right) = n - \mathrm{tr}(I_{k+1}) = n - k - 1$$
Of course, the trace (i.e., the sum of the diagonal elements) of an identity matrix equals the dimension of that matrix. The only other "trick" in this proof is changing the sequence in which the X matrices are multiplied when the second of the two traces is obtained.
9. What we have shown at this point is that $E(\underset{\sim}{e}^T\underset{\sim}{e}) = E(SS_{ERROR}) = (n - k - 1)\sigma^2$.
a. Since the Mean Square Error is $MSE = \dfrac{SS_{ERROR}}{n - k - 1}$,
$$E(MSE) = E\!\left(\frac{SS_{ERROR}}{n - k - 1}\right) = \frac{E(SS_{ERROR})}{n - k - 1} = \frac{(n - k - 1)\sigma^2}{n - k - 1} = \sigma^2.$$
Thus the MSE is an unbiased estimator of $\sigma^2$.
b. However, it should be noted that this proof only works if all assumptions hold, with the exception of those aspects of the normality assumption beyond $E(\underset{\sim}{e}) = \underset{\sim}{0}$.
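A short simulation illustrates the unbiasedness of the MSE (a sketch, assuming NumPy; the true error variance of 4 and the other settings are invented for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 2
sigma2 = 4.0                                                   # assumed "true" error variance
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, -1.0, 0.5])

mses = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e = y - X @ (np.linalg.inv(X.T @ X) @ X.T @ y)             # residual vector
    mses.append((e @ e) / (n - k - 1))                         # SS_ERROR / (n - k - 1)

print(np.mean(mses))     # close to 4.0, i.e., E(MSE) = sigma^2
```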
K. In fact, at this point all assumptions except the normality assumption have been required.
1. Note that the normality assumption was not needed to obtain unbiased estimates of slopes
or variances.
2. The normality assumption does allow one to use one's variance estimates in testing for the statistical significance of these slopes, however. In particular, if the $\varepsilon_i \sim N(0, \sigma^2)$, then $\underset{\sim}{b} \sim N\!\left(\underset{\sim}{\beta},\; \sigma^2\left(X^T X\right)^{-1}\right)$. That is, the normality assumption allows one to test the hypotheses, …
$$H_0\colon \beta_j = 0$$
$$H_A\colon \beta_j \neq 0$$
with the test statistic,
$$t_{\frac{\alpha}{2},\, n-k-1} = \frac{\hat{b}_j}{\sqrt{MSE * x^{jj}}},$$
where $x^{jj}$ is the j+1, j+1 element from $\left(X^T X\right)^{-1}$.
3. Note that OLS slope and slope-variance estimates are very robust to departures from the
normality assumption. This is why we shall not spend time discussing tests of normality.
Inspections of scatter plots should be sufficient to identify large departures from
normality.
4. The final page of this section is a “Handout” on which the assumptions of linear
regression are listed along with indications of the beneficial consequences that result as
long as they are met.
Stat 404
Assumptions of Linear Regression: Consequences When They Are Met

| Assumption(s) | Consequence(s) if met |
| --- | --- |
| $\underset{\sim}{y} = X\underset{\sim}{\beta} + \underset{\sim}{\varepsilon}$ | The estimates, $\hat{Y}_i$, are relatively close to the observations, $Y_i$, because OLS estimates are appropriate for estimating linear associations. Also $\underset{\sim}{b}$ and $\hat{\sigma}^2$ are meaningful estimates respectively of $\underset{\sim}{\beta}$ and $\sigma^2$. |
| The previous assumption plus $E(\underset{\sim}{e}) = \underset{\sim}{0}$ and $E(X^T\underset{\sim}{e}) = E(X^T)E(\underset{\sim}{e})$. | $\underset{\sim}{b}$ is unbiased as an estimator of $\underset{\sim}{\beta}$. |
| The previous two assumptions plus $E(X) = X$ and $E(\underset{\sim}{e}\,\underset{\sim}{e}^T) = \sigma^2 I_n$. | MSE (i.e., $\hat{\sigma}^2$) and thus $\hat{\sigma}^2\left(X^T X\right)^{-1}$ are unbiased as respective estimators of $\sigma^2$ and $\mathrm{Var}(\underset{\sim}{b})$. |
| X is full column rank. | $\left(X^T X\right)^{-1}$ and thus $\underset{\sim}{b}$ can be calculated. |
| The previous assumptions plus the $\varepsilon_i$ are normally distributed. | $\underset{\sim}{b} \sim N\!\left(\underset{\sim}{\beta},\; \sigma^2\left(X^T X\right)^{-1}\right)$, and hypothesis tests of the $\beta_j$ (as described in section K) are justified. |