THE REGRESSION MODEL IN MATRIX FORM

We will consider the linear regression model in matrix form.

For simple linear regression, meaning one predictor, the model is

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad \text{for } i = 1, 2, 3, \ldots, n$$

This model includes the assumption that the ε_i's are a sample from a population with mean zero and standard deviation σ. In most cases we also assume that this population is normally distributed.

The multiple linear regression model is

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_K x_{iK} + \varepsilon_i \quad \text{for } i = 1, 2, 3, \ldots, n$$

This model includes the assumption about the ε_i's stated just above.

This requires building up our symbols into vectors. Thus

$$\underset{n \times 1}{\mathbf{Y}} = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix}$$

captures the entire dependent variable in a single symbol. The "n × 1" part of the notation is just a shape reminder. These get dropped once the context is clear.

For simple linear regression, we will capture the independent variable through this n × 2 matrix:

$$\underset{n \times 2}{\mathbf{X}} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$

The coefficient vector will be

$$\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$

and the noise vector will be

$$\underset{n \times 1}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

The simple linear regression model is then written as

$$\underset{n \times 1}{\mathbf{Y}} \;=\; \underset{n \times 2}{\mathbf{X}} \; \underset{2 \times 1}{\boldsymbol{\beta}} \;+\; \underset{n \times 1}{\boldsymbol{\varepsilon}}.$$

The product part, meaning Xβ, is found through the usual rule for matrix multiplication as

$$\underset{n \times 2}{\mathbf{X}} \; \underset{2 \times 1}{\boldsymbol{\beta}} =
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
=
\begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{pmatrix}$$

We usually write the model without the shape reminders as Y = Xβ + ε. This is a shorthand notation for

$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
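To make the matrix notation concrete, here is a minimal numpy sketch; the x values, the choice of β, and σ = 1 are invented purely for illustration. It builds the n × 2 matrix X and forms Y = Xβ + ε exactly as above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustration: n = 5 observations of one predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = x.size

# Design matrix: a column of ones next to the predictor column (n x 2).
X = np.column_stack([np.ones(n), x])

beta = np.array([10.0, 2.5])                   # (beta_0, beta_1), chosen arbitrarily
eps = rng.normal(loc=0.0, scale=1.0, size=n)   # noise with mean 0 and sd sigma = 1

# The matrix form Y = X beta + eps reproduces Y_i = beta_0 + beta_1 x_i + eps_i.
Y = X @ beta + eps

print(X)
print(Y)
```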

It is helpful that the multiple regression story with K ≥ 2 predictors leads to the same model expression Y = Xβ + ε (just with different shapes). As a notational convenience, let p = K + 1. In the multiple regression case, we have

$$\underset{n \times p}{\mathbf{X}} =
\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1K} \\
1 & x_{21} & x_{22} & \cdots & x_{2K} \\
1 & x_{31} & x_{32} & \cdots & x_{3K} \\
1 & x_{41} & x_{42} & \cdots & x_{4K} \\
1 & x_{51} & x_{52} & \cdots & x_{5K} \\
1 & x_{61} & x_{62} & \cdots & x_{6K} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nK}
\end{pmatrix}
\quad \text{and} \quad
\underset{p \times 1}{\boldsymbol{\beta}} =
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{pmatrix}$$

The detail shown here is to suggest that X is a tall, skinny matrix. We formally require n ≥ p.

In most applications, n is much, much larger than p. The ratio n/p is often in the hundreds.

If it happens that n/p is as small as 5, we will worry that we don't have enough data (reflected in n) to estimate the number of parameters in β (reflected in p).

The multiple regression model is now

$$\underset{n \times 1}{\mathbf{Y}} \;=\; \underset{n \times p}{\mathbf{X}} \; \underset{p \times 1}{\boldsymbol{\beta}} \;+\; \underset{n \times 1}{\boldsymbol{\varepsilon}},$$

and this is a shorthand for

$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix}
\beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \beta_3 x_{13} + \cdots + \beta_K x_{1K} \\
\beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \beta_3 x_{23} + \cdots + \beta_K x_{2K} \\
\beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \beta_3 x_{33} + \cdots + \beta_K x_{3K} \\
\vdots \\
\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \beta_3 x_{n3} + \cdots + \beta_K x_{nK}
\end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

The model form Y = Xβ + ε is thus completely general.
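A similar sketch for the general case, again with invented numbers (n = 50, K = 3, and an arbitrary β): it builds the n × p design matrix with a leading column of ones and confirms that X is tall and of full rank p.

```python
import numpy as np

rng = np.random.default_rng(1)

n, K = 50, 3           # invented sizes for illustration
p = K + 1              # p = K + 1 columns: intercept plus K predictors

# n x K matrix of predictor values, then prepend the column of ones.
predictors = rng.normal(size=(n, K))
X = np.column_stack([np.ones(n), predictors])   # shape (n, p): tall and skinny

beta = np.array([4.0, 1.5, -2.0, 0.5])          # (beta_0, ..., beta_K), arbitrary
eps = rng.normal(scale=2.0, size=n)             # noise with sigma = 2

Y = X @ beta + eps

print(X.shape, np.linalg.matrix_rank(X))        # (50, 4) and full rank p = 4
```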

The assumptions on the noise terms can be written as E ε = 0 and Var ε = σ²I. The I here is the n × n identity matrix. That is,

$$\mathbf{I} = \begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}$$

The variance assumption can be written as

$$\operatorname{Var} \boldsymbol{\varepsilon} =
\begin{pmatrix}
\sigma^2 & 0 & 0 & \cdots & 0 \\
0 & \sigma^2 & 0 & \cdots & 0 \\
0 & 0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \sigma^2
\end{pmatrix}.$$

You may see this expressed as Cov(ε_i, ε_j) = σ² δ_ij, where

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}.$$
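The assumption Var ε = σ²I can be checked by simulation; in this sketch the values n = 4, σ = 3, and the replication count are arbitrary. The sample covariance matrix of many independent noise draws should be close to σ²I.

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma, reps = 4, 3.0, 200_000   # small n and arbitrary sigma, for illustration

# Each row is one draw of the noise vector (eps_1, ..., eps_n).
eps_draws = rng.normal(loc=0.0, scale=sigma, size=(reps, n))

# Sample covariance matrix across draws; rowvar=False treats columns as variables.
S = np.cov(eps_draws, rowvar=False)

print(np.round(S, 2))              # approximately sigma^2 * I = 9 * I
```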

We will call b the estimate of the unknown parameter vector β. You will also find the notation β̂ used for the estimate.

Once we get b, we can compute the fitted vector Ŷ = Xb. This fitted vector represents an ex-post guess at the expected value of Y.

The estimate b is found so that the fitted vector Ŷ is close to the actual data vector Y. Closeness is defined in the least squares sense, meaning that we want to minimize the criterion Q, where

$$Q = \sum_{i=1}^{n} \Bigl( Y_i - (\mathbf{X}\mathbf{b})_{i\text{th entry}} \Bigr)^2$$

This can be done by differentiating this quantity p = K + 1 times, once with respect to b_0, once with respect to b_1, ..., and once with respect to b_K. This is routine in simple regression (K = 1), and it's possible with a lot of messy work in general.
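Instead of carrying out the p partial derivatives by hand, Q can also be minimized numerically. The sketch below (data and true coefficients invented) feeds the criterion Q directly to scipy.optimize.minimize; this is only a numerical illustration, not the approach developed in the rest of this section.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Invented data: n = 40 observations, K = 2 predictors, p = 3 coefficients.
n, K = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def Q(b):
    # Least squares criterion: sum of squared entries of Y - Xb.
    resid = Y - X @ b
    return np.sum(resid ** 2)

result = minimize(Q, x0=np.zeros(X.shape[1]))
print(result.x)      # numerical minimizer b = (b_0, b_1, b_2)
```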

It happens that Q is the squared length of the vector difference Y − Xb. This means that we can write

$$Q = \underset{1 \times n}{(\mathbf{Y} - \mathbf{X}\mathbf{b})'} \; \underset{n \times 1}{(\mathbf{Y} - \mathbf{X}\mathbf{b})}$$

This represents Q as a 1 × 1 matrix, and so we can think of Q as an ordinary number.
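A quick numeric confirmation, with made-up data and an arbitrary trial b, that the summation form of Q and the matrix form (Y − Xb)′(Y − Xb) give the same number:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 10
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.normal(size=n)
b = np.array([0.5, 1.0])             # any trial coefficient vector

resid = Y - X @ b

Q_sum    = np.sum(resid ** 2)        # summation form of Q
Q_matrix = resid @ resid             # (Y - Xb)'(Y - Xb) as an inner product

print(np.isclose(Q_sum, Q_matrix))   # True
```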

There are several ways to find the b that minimizes Q. The simple solution we'll show here (alas) requires knowing the answer and working backward.

Define the matrix

$$\underset{n \times n}{\mathbf{H}} = \underset{n \times p}{\mathbf{X}} \, \Bigl( \underset{p \times n}{\mathbf{X}'} \, \underset{n \times p}{\mathbf{X}} \Bigr)^{-1} \underset{p \times n}{\mathbf{X}'}.$$

We will call H the "hat matrix," and it has some important uses. There are several technical comments about H (a brief numerical sketch follows them):

(1) Finding H requires the ability to get (X′X)⁻¹. This matrix inversion is possible if and only if X has full rank p. Things get very interesting when X almost has full rank p; that's a longer story for another time.

(2) The matrix H is idempotent. The defining condition for idempotence is this: the matrix C is idempotent ⟺ CC = C. Only square matrices can be idempotent. Since H is square (it's n × n), it can be checked for idempotence. You will indeed find that HH = H.

(3) The ith diagonal entry, that in position (i, i), will be identified for later use as the ith leverage value. The notation is usually h_i, but you'll also see h_ii.
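Here is the numerical sketch promised above; the design matrix is invented, with n = 12 and p = 3. It computes H from X and checks symmetry, idempotence, and the leverage values on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up tall, skinny design matrix with n = 12 rows and p = 3 columns.
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))      # H is symmetric
print(np.allclose(H @ H, H))    # H is idempotent: HH = H
print(np.diag(H))               # the leverage values h_1, ..., h_n
```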

Now write Y in the form HY + (I − H)Y.

Now let's develop Q. This will require using the fact that H is symmetric, meaning H′ = H. It will also require the rule for the transpose of a matrix product; specifically, (Xb)′ = b′X′.

$$\begin{aligned}
Q &= (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) \\
&= \bigl[ (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b}) + (\mathbf{I} - \mathbf{H})\mathbf{Y} \bigr]' \bigl[ (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b}) + (\mathbf{I} - \mathbf{H})\mathbf{Y} \bigr] \\
&= (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})
 + (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{I} - \mathbf{H})\mathbf{Y} \\
&\quad + \bigl[(\mathbf{I} - \mathbf{H})\mathbf{Y}\bigr]'(\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})
 + \bigl[(\mathbf{I} - \mathbf{H})\mathbf{Y}\bigr]'(\mathbf{I} - \mathbf{H})\mathbf{Y}
\end{aligned}$$

The second and third summands above are zero, as a consequence of

$$(\mathbf{I} - \mathbf{H})\mathbf{X} = \mathbf{X} - \mathbf{H}\mathbf{X} = \mathbf{X} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{X} - \mathbf{X} = \mathbf{0}$$

(note that HY − Xb = X[(X′X)⁻¹X′Y − b] lies in the column space of X). This leaves

$$Q = (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b}) + \bigl[(\mathbf{I} - \mathbf{H})\mathbf{Y}\bigr]'(\mathbf{I} - \mathbf{H})\mathbf{Y}.$$

If this is to be minimized over choices of b, then the minimization can only be done with regard to the first summand (HY − Xb)′(HY − Xb), since the second summand does not involve b. It is possible to make the vector HY − Xb equal to 0 by selecting b = (X′X)⁻¹X′Y. This is very easy to see, as HY = X(X′X)⁻¹X′Y, which is exactly Xb for this choice of b.

This b = (X′X)⁻¹X′Y is known as the least squares estimate of β.
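A minimal sketch of the least squares formula on invented data. Rather than forming the inverse explicitly, it solves the normal equations (X′X)b = X′Y and cross-checks the answer against numpy's built-in least squares routine.

```python
import numpy as np

rng = np.random.default_rng(6)

n, K = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
Y = X @ np.array([3.0, 1.0, -0.5]) + rng.normal(size=n)

# b = (X'X)^{-1} X'Y, computed by solving the normal equations (X'X) b = X'Y.
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against numpy's least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b)
print(np.allclose(b, b_lstsq))    # True
```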

For the simple linear regression case K = 1, the estimate b = (b_0, b_1)′ can be found with relative ease. The slope estimate is

$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad \text{where } S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} x_i Y_i - n \bar{x}\bar{Y} \quad \text{and} \quad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2.$$
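A small check, with made-up x and Y values, that b_1 = S_xy / S_xx agrees with the matrix formula. The intercept line b_0 = Ȳ − b_1 x̄ is the standard companion formula; it is not stated in the text above and is included only to complete the fit.

```python
import numpy as np

rng = np.random.default_rng(7)

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
Y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=x.size)

xbar, Ybar = x.mean(), Y.mean()
Sxy = np.sum((x - xbar) * (Y - Ybar))
Sxx = np.sum((x - xbar) ** 2)

b1 = Sxy / Sxx
b0 = Ybar - b1 * xbar            # standard companion formula for the intercept

# Matrix version with the n x 2 design matrix.
X = np.column_stack([np.ones(x.size), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ Y)

print(b0, b1)
print(np.allclose(b_matrix, [b0, b1]))   # True
```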

For the multiple regression case K ≥ 2, the calculation involves the inversion of the p × p matrix X′X. This task is best left to computer software.

There is a computational trick, called “mean-centering,” that converts the problem to a simpler one of inverting a K × K matrix.
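One common way to implement the mean-centering trick is sketched below on invented data (the text does not spell out the details): center each predictor column and Y at their means, solve the K × K system for the slopes, and recover the intercept from the means.

```python
import numpy as np

rng = np.random.default_rng(8)

n, K = 60, 3
Z = rng.normal(size=(n, K))                        # the K predictor columns
Y = 2.0 + Z @ np.array([1.0, -1.5, 0.5]) + rng.normal(size=n)

# Mean-center the predictors and the response.
Zc = Z - Z.mean(axis=0)
Yc = Y - Y.mean()

# Only a K x K system is needed for the slopes ...
slopes = np.linalg.solve(Zc.T @ Zc, Zc.T @ Yc)
# ... and the intercept comes back from the means.
b0 = Y.mean() - Z.mean(axis=0) @ slopes

# Compare with the full p x p calculation on the uncentered design matrix.
X = np.column_stack([np.ones(n), Z])
b_full = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(b_full, np.concatenate([[b0], slopes])))   # True
```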

The matrix notation will allow the proof of two very helpful facts:

* E b = β. This means that b is an unbiased estimate of β. This is a good thing, but there are circumstances in which biased estimates will work a little bit better.

* Var b = σ²(X′X)⁻¹. This identifies the variances and covariances of the estimated coefficients. It's critical to note that the separate entries of b are not statistically independent; see the simulation sketch after this list.
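A simulation sketch of both facts, using an invented fixed design, a known β, and a known σ: across many replications of the noise, the average of b should be close to β, and the empirical covariance matrix of b should be close to σ²(X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(9)

n, K, sigma = 30, 2, 2.0
beta = np.array([1.0, 0.5, -1.0])                             # true coefficients (invented)
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])    # fixed design matrix
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
b_draws = np.empty((reps, beta.size))
for r in range(reps):
    # New noise each replication, same X and beta.
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    b_draws[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.round(b_draws.mean(axis=0), 3))            # approximately beta (unbiasedness)
print(np.round(np.cov(b_draws, rowvar=False), 3))   # empirical covariance of b ...
print(np.round(sigma**2 * XtX_inv, 3))              # ... versus sigma^2 (X'X)^{-1}
```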
