THE REGRESSION MODEL IN MATRIX FORM

We will consider the linear regression model in matrix form.

For simple linear regression, meaning one predictor, the model is

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad \text{for } i = 1, 2, 3, \ldots, n$$

This model includes the assumption that the ε_i's are a sample from a population with mean zero and standard deviation σ. In most cases we also assume that this population is normally distributed.

The multiple linear regression model is

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_K x_{iK} + \varepsilon_i \quad \text{for } i = 1, 2, 3, \ldots, n$$

This model includes the assumption about the ε_i's stated just above.

This requires building up our symbols into vectors. Thus

$$\underset{n \times 1}{\mathbf{Y}} = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix}$$

captures the entire dependent variable in a single symbol. The "n × 1" part of the notation is just a shape reminder. These get dropped once the context is clear.

For simple linear regression, we will capture the independent variable through this n × 2 matrix:

$$\underset{n \times 2}{\mathbf{X}} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$

The coefficient vector will be

$$\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$

and the noise vector will be

$$\underset{n \times 1}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

The simple linear regression model is then written as

$$\underset{n \times 1}{\mathbf{Y}} \;=\; \underset{n \times 2}{\mathbf{X}} \; \underset{2 \times 1}{\boldsymbol{\beta}} \;+\; \underset{n \times 1}{\boldsymbol{\varepsilon}}.$$

The product part, meaning Xβ, is found through the usual rule for matrix multiplication as

$$\underset{n \times 2}{\mathbf{X}} \; \underset{2 \times 1}{\boldsymbol{\beta}} =
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
=
\begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{pmatrix}$$

We usually write the model without the shape reminders as Y = Xβ + ε. This is a shorthand notation for

$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
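To make the matrix notation concrete, here is a minimal numpy sketch; the x values, the choice of β, and σ = 1 are invented purely for illustration. It builds the n × 2 matrix X and forms Y = Xβ + ε exactly as above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustration: n = 5 observations of one predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = x.size

# Design matrix: a column of ones next to the predictor column (n x 2).
X = np.column_stack([np.ones(n), x])

beta = np.array([10.0, 2.5])                   # (beta_0, beta_1), chosen arbitrarily
eps = rng.normal(loc=0.0, scale=1.0, size=n)   # noise with mean 0 and sd sigma = 1

# The matrix form Y = X beta + eps reproduces Y_i = beta_0 + beta_1 x_i + eps_i.
Y = X @ beta + eps

print(X)
print(Y)
```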

It is helpful that the multiple regression story with K ≥ 2 predictors leads to the same model expression Y = Xβ + ε (just with different shapes). As a notational convenience, let p = K + 1. In the multiple regression case, we have

$$\underset{n \times p}{\mathbf{X}} =
\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1K} \\
1 & x_{21} & x_{22} & \cdots & x_{2K} \\
1 & x_{31} & x_{32} & \cdots & x_{3K} \\
1 & x_{41} & x_{42} & \cdots & x_{4K} \\
1 & x_{51} & x_{52} & \cdots & x_{5K} \\
1 & x_{61} & x_{62} & \cdots & x_{6K} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nK}
\end{pmatrix}
\quad \text{and} \quad
\underset{p \times 1}{\boldsymbol{\beta}} =
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{pmatrix}$$

The detail shown here is to suggest that X is a tall, skinny matrix. We formally require n ≥ p.

In most applications, n is much, much larger than p. The ratio n/p is often in the hundreds.

If it happens that n/p is as small as 5, we will worry that we don't have enough data (reflected in n) to estimate the number of parameters in β (reflected in p).

The multiple regression model is now

$$\underset{n \times 1}{\mathbf{Y}} \;=\; \underset{n \times p}{\mathbf{X}} \; \underset{p \times 1}{\boldsymbol{\beta}} \;+\; \underset{n \times 1}{\boldsymbol{\varepsilon}},$$

and this is a shorthand for

$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix}
\beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \beta_3 x_{13} + \cdots + \beta_K x_{1K} \\
\beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \beta_3 x_{23} + \cdots + \beta_K x_{2K} \\
\beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \beta_3 x_{33} + \cdots + \beta_K x_{3K} \\
\vdots \\
\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \beta_3 x_{n3} + \cdots + \beta_K x_{nK}
\end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

The model form Y = Xβ + ε is thus completely general.
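A similar sketch for the general case, again with invented numbers (n = 50, K = 3, and an arbitrary β): it builds the n × p design matrix with a leading column of ones and confirms that X is tall and of full rank p.

```python
import numpy as np

rng = np.random.default_rng(1)

n, K = 50, 3           # invented sizes for illustration
p = K + 1              # p = K + 1 columns: intercept plus K predictors

# n x K matrix of predictor values, then prepend the column of ones.
predictors = rng.normal(size=(n, K))
X = np.column_stack([np.ones(n), predictors])   # shape (n, p): tall and skinny

beta = np.array([4.0, 1.5, -2.0, 0.5])          # (beta_0, ..., beta_K), arbitrary
eps = rng.normal(scale=2.0, size=n)             # noise with sigma = 2

Y = X @ beta + eps

print(X.shape, np.linalg.matrix_rank(X))        # (50, 4) and full rank p = 4
```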

The assumptions on the noise terms can be written as E ε = 0 and Var ε = σ²I. The I here is the n × n identity matrix. That is,

$$\mathbf{I} = \begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}$$

The variance assumption can be written as

$$\operatorname{Var} \boldsymbol{\varepsilon} =
\begin{pmatrix}
\sigma^2 & 0 & 0 & \cdots & 0 \\
0 & \sigma^2 & 0 & \cdots & 0 \\
0 & 0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \sigma^2
\end{pmatrix}.$$

You may see this expressed as Cov(ε_i, ε_j) = σ² δ_ij, where

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}.$$
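The assumption Var ε = σ²I can be checked by simulation; in this sketch the values n = 4, σ = 3, and the replication count are arbitrary. The sample covariance matrix of many independent noise draws should be close to σ²I.

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma, reps = 4, 3.0, 200_000   # small n and arbitrary sigma, for illustration

# Each row is one draw of the noise vector (eps_1, ..., eps_n).
eps_draws = rng.normal(loc=0.0, scale=sigma, size=(reps, n))

# Sample covariance matrix across draws; rowvar=False treats columns as variables.
S = np.cov(eps_draws, rowvar=False)

print(np.round(S, 2))              # approximately sigma^2 * I = 9 * I
```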

We will call b the estimate of the unknown parameter vector β. You will also find the notation β̂ used for the estimate.

Once we get b, we can compute the fitted vector Ŷ = Xb. This fitted vector represents an ex-post guess at the expected value of Y.

The estimate b is found so that the fitted vector Ŷ is close to the actual data vector Y. Closeness is defined in the least squares sense, meaning that we want to minimize the criterion Q, where

$$Q = \sum_{i=1}^{n} \Bigl( Y_i - (\mathbf{X}\mathbf{b})_{i\text{th entry}} \Bigr)^2$$

This can be done by differentiating this quantity p = K + 1 times, once with respect to b_0, once with respect to b_1, ..., and once with respect to b_K. This is routine in simple regression (K = 1), and it's possible with a lot of messy work in general.
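Instead of carrying out the p partial derivatives by hand, Q can also be minimized numerically. The sketch below (data and true coefficients invented) feeds the criterion Q directly to scipy.optimize.minimize; this is only a numerical illustration, not the approach developed in the rest of this section.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Invented data: n = 40 observations, K = 2 predictors, p = 3 coefficients.
n, K = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def Q(b):
    # Least squares criterion: sum of squared entries of Y - Xb.
    resid = Y - X @ b
    return np.sum(resid ** 2)

result = minimize(Q, x0=np.zeros(X.shape[1]))
print(result.x)      # numerical minimizer b = (b_0, b_1, b_2)
```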

It happens that Q is the squared length of the vector difference Y − Xb. This means that we can write

$$Q = \underset{1 \times n}{(\mathbf{Y} - \mathbf{X}\mathbf{b})'} \; \underset{n \times 1}{(\mathbf{Y} - \mathbf{X}\mathbf{b})}$$

This represents Q as a 1 × 1 matrix, and so we can think of Q as an ordinary number.
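A quick numeric confirmation, with made-up data and an arbitrary trial b, that the summation form of Q and the matrix form (Y − Xb)′(Y − Xb) give the same number:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 10
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.normal(size=n)
b = np.array([0.5, 1.0])             # any trial coefficient vector

resid = Y - X @ b

Q_sum    = np.sum(resid ** 2)        # summation form of Q
Q_matrix = resid @ resid             # (Y - Xb)'(Y - Xb) as an inner product

print(np.isclose(Q_sum, Q_matrix))   # True
```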

There are several ways to find the b that minimizes Q. The simple solution we'll show here (alas) requires knowing the answer and working backward.

Define the matrix

$$\underset{n \times n}{\mathbf{H}} = \underset{n \times p}{\mathbf{X}} \, \Bigl( \underset{p \times n}{\mathbf{X}'} \, \underset{n \times p}{\mathbf{X}} \Bigr)^{-1} \underset{p \times n}{\mathbf{X}'}.$$

We will call H the "hat matrix," and it has some important uses. There are several technical comments about H (a brief numerical sketch follows them):

(1) Finding H requires the ability to get (X′X)⁻¹. This matrix inversion is possible if and only if X has full rank p. Things get very interesting when X almost has full rank p; that's a longer story for another time.

(2) The matrix H is idempotent. The defining condition for idempotence is this: the matrix C is idempotent ⟺ CC = C. Only square matrices can be idempotent. Since H is square (it's n × n), it can be checked for idempotence. You will indeed find that HH = H.

(3) The ith diagonal entry, that in position (i, i), will be identified for later use as the ith leverage value. The notation is usually h_i, but you'll also see h_ii.
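Here is the numerical sketch promised above; the design matrix is invented, with n = 12 and p = 3. It computes H from X and checks symmetry, idempotence, and the leverage values on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up tall, skinny design matrix with n = 12 rows and p = 3 columns.
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))      # H is symmetric
print(np.allclose(H @ H, H))    # H is idempotent: HH = H
print(np.diag(H))               # the leverage values h_1, ..., h_n
```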

Now write Y in the form HY + (I − H)Y.

Now let's develop Q. This will require using the fact that H is symmetric, meaning H′ = H. It will also require the rule for the transpose of a matrix product; specifically, (Xb)′ = b′X′.

$$\begin{aligned}
Q &= (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) \\
&= \bigl[ (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b}) + (\mathbf{I} - \mathbf{H})\mathbf{Y} \bigr]' \bigl[ (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b}) + (\mathbf{I} - \mathbf{H})\mathbf{Y} \bigr] \\
&= (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})
 + (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{I} - \mathbf{H})\mathbf{Y} \\
&\quad + \bigl[(\mathbf{I} - \mathbf{H})\mathbf{Y}\bigr]'(\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})
 + \bigl[(\mathbf{I} - \mathbf{H})\mathbf{Y}\bigr]'(\mathbf{I} - \mathbf{H})\mathbf{Y}
\end{aligned}$$

The second and third summands above are zero, as a consequence of

$$(\mathbf{I} - \mathbf{H})\mathbf{X} = \mathbf{X} - \mathbf{H}\mathbf{X} = \mathbf{X} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{X} - \mathbf{X} = \mathbf{0}$$

(note that HY − Xb = X[(X′X)⁻¹X′Y − b] lies in the column space of X). This leaves

$$Q = (\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{H}\mathbf{Y} - \mathbf{X}\mathbf{b}) + \bigl[(\mathbf{I} - \mathbf{H})\mathbf{Y}\bigr]'(\mathbf{I} - \mathbf{H})\mathbf{Y}.$$

If this is to be minimized over choices of b, then the minimization can only be done with regard to the first summand (HY − Xb)′(HY − Xb), since the second summand does not involve b. It is possible to make the vector HY − Xb equal to 0 by selecting b = (X′X)⁻¹X′Y. This is very easy to see, as HY = X(X′X)⁻¹X′Y, which is exactly Xb for this choice of b.

This b = (X′X)⁻¹X′Y is known as the least squares estimate of β.
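A minimal sketch of the least squares formula on invented data. Rather than forming the inverse explicitly, it solves the normal equations (X′X)b = X′Y and cross-checks the answer against numpy's built-in least squares routine.

```python
import numpy as np

rng = np.random.default_rng(6)

n, K = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
Y = X @ np.array([3.0, 1.0, -0.5]) + rng.normal(size=n)

# b = (X'X)^{-1} X'Y, computed by solving the normal equations (X'X) b = X'Y.
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against numpy's least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b)
print(np.allclose(b, b_lstsq))    # True
```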

For the simple linear regression case K = 1, the estimate b = (b_0, b_1)′ can be found with relative ease. The slope estimate is

$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad \text{where } S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} x_i Y_i - n \bar{x}\bar{Y} \quad \text{and} \quad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2.$$
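A small check, with made-up x and Y values, that b_1 = S_xy / S_xx agrees with the matrix formula. The intercept line b_0 = Ȳ − b_1 x̄ is the standard companion formula; it is not stated in the text above and is included only to complete the fit.

```python
import numpy as np

rng = np.random.default_rng(7)

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
Y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=x.size)

xbar, Ybar = x.mean(), Y.mean()
Sxy = np.sum((x - xbar) * (Y - Ybar))
Sxx = np.sum((x - xbar) ** 2)

b1 = Sxy / Sxx
b0 = Ybar - b1 * xbar            # standard companion formula for the intercept

# Matrix version with the n x 2 design matrix.
X = np.column_stack([np.ones(x.size), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ Y)

print(b0, b1)
print(np.allclose(b_matrix, [b0, b1]))   # True
```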

For the multiple regression case K ≥ 2, the calculation involves the inversion of the p × p matrix X′X. This task is best left to computer software.

There is a computational trick, called “mean-centering,” that converts the problem to a simpler one of inverting a K × K matrix.
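One common way to implement the mean-centering trick is sketched below on invented data (the text does not spell out the details): center each predictor column and Y at their means, solve the K × K system for the slopes, and recover the intercept from the means.

```python
import numpy as np

rng = np.random.default_rng(8)

n, K = 60, 3
Z = rng.normal(size=(n, K))                        # the K predictor columns
Y = 2.0 + Z @ np.array([1.0, -1.5, 0.5]) + rng.normal(size=n)

# Mean-center the predictors and the response.
Zc = Z - Z.mean(axis=0)
Yc = Y - Y.mean()

# Only a K x K system is needed for the slopes ...
slopes = np.linalg.solve(Zc.T @ Zc, Zc.T @ Yc)
# ... and the intercept comes back from the means.
b0 = Y.mean() - Z.mean(axis=0) @ slopes

# Compare with the full p x p calculation on the uncentered design matrix.
X = np.column_stack([np.ones(n), Z])
b_full = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(b_full, np.concatenate([[b0], slopes])))   # True
```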

The matrix notation will allow the proof of two very helpful facts:

* E b = β. This means that b is an unbiased estimate of β. This is a good thing, but there are circumstances in which biased estimates will work a little bit better.

* Var b = σ²(X′X)⁻¹. This identifies the variances and covariances of the estimated coefficients. It's critical to note that the separate entries of b are not statistically independent; see the simulation sketch after this list.
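A simulation sketch of both facts, using an invented fixed design, a known β, and a known σ: across many replications of the noise, the average of b should be close to β, and the empirical covariance matrix of b should be close to σ²(X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(9)

n, K, sigma = 30, 2, 2.0
beta = np.array([1.0, 0.5, -1.0])                             # true coefficients (invented)
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])    # fixed design matrix
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
b_draws = np.empty((reps, beta.size))
for r in range(reps):
    # New noise each replication, same X and beta.
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    b_draws[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.round(b_draws.mean(axis=0), 3))            # approximately beta (unbiasedness)
print(np.round(np.cov(b_draws, rowvar=False), 3))   # empirical covariance of b ...
print(np.round(sigma**2 * XtX_inv, 3))              # ... versus sigma^2 (X'X)^{-1}
```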
