THE REGRESSION MODEL IN MATRIX FORM
We will consider the linear regression model in matrix form.
For simple linear regression, meaning one predictor, the model is
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad \text{for } i = 1, 2, 3, \ldots, n$$
This model includes the assumption that the $\varepsilon_i$'s are a sample from a population with mean zero and standard deviation $\sigma$. In most cases we also assume that this population is normally distributed.
The multiple linear regression model is
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \ldots + \beta_K x_{iK} + \varepsilon_i \qquad \text{for } i = 1, 2, 3, \ldots, n$$
This model includes the assumption about the $\varepsilon_i$'s stated just above.
This requires building up our symbols into vectors. Thus

$$\underset{n \times 1}{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}$$

captures the entire dependent variable in a single symbol. The "$n \times 1$" part of the notation is just a shape reminder. These get dropped once the context is clear.
For simple linear regression, we will capture the independent variable through this $n \times 2$ matrix:

$$\underset{n \times 2}{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
The coefficient vector will be

$$\underset{2 \times 1}{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$

and the noise vector will be

$$\underset{n \times 1}{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
The simple linear regression model is then written as

$$\underset{n \times 1}{Y} = \underset{n \times 2}{X}\,\underset{2 \times 1}{\beta} + \underset{n \times 1}{\varepsilon}.$$
The product part, meaning $X\beta$, is found through the usual rule for matrix multiplication as

$$\underset{n \times 1}{X\beta} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{bmatrix}$$
We usually write the model without the shape reminders as $Y = X\beta + \varepsilon$. This is a shorthand notation for

$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 x_1 + \varepsilon_1 \\ \beta_0 + \beta_1 x_2 + \varepsilon_2 \\ \beta_0 + \beta_1 x_3 + \varepsilon_3 \\ \vdots \\ \beta_0 + \beta_1 x_n + \varepsilon_n \end{bmatrix}$$
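To make this concrete, here is a minimal numpy sketch of the simple-regression setup; the values of $x$, $\beta$, and $\sigma$ are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # the single predictor
beta = np.array([2.0, 0.5])               # (beta_0, beta_1), made up

# Design matrix: a column of ones next to the predictor column (n x 2).
X = np.column_stack([np.ones(n), x])

eps = rng.normal(loc=0.0, scale=1.0, size=n)   # noise with sigma = 1
Y = X @ beta + eps                             # the model Y = X beta + eps

print(X @ beta)   # equals 2.0 + 0.5 * x entry by entry
```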
It is helpful that the multiple regression story with $K \geq 2$ predictors leads to the same model expression $Y = X\beta + \varepsilon$ (just with different shapes). As a notational convenience, let $p = K + 1$. In the multiple regression case, we have
$$\underset{n \times p}{X} = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1K} \\
1 & x_{21} & x_{22} & \cdots & x_{2K} \\
1 & x_{31} & x_{32} & \cdots & x_{3K} \\
1 & x_{41} & x_{42} & \cdots & x_{4K} \\
1 & x_{51} & x_{52} & \cdots & x_{5K} \\
1 & x_{61} & x_{62} & \cdots & x_{6K} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nK}
\end{bmatrix}
\qquad \text{and} \qquad
\underset{p \times 1}{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_K \end{bmatrix}$$
The detail shown here is to suggest that $X$ is a tall, skinny matrix. We formally require $n \geq p$. In most applications, $n$ is much, much larger than $p$. The ratio $n/p$ is often in the hundreds.
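As a quick illustration, here is a numpy sketch (with hypothetical simulated predictor values) that assembles an $n \times p$ design matrix and confirms its shape and rank.

```python
import numpy as np

rng = np.random.default_rng(1)

n, K = 100, 3      # n observations, K predictors
p = K + 1          # the intercept column brings the count to p

predictors = rng.normal(size=(n, K))            # hypothetical predictor values
X = np.column_stack([np.ones(n), predictors])   # the n x p design matrix

print(X.shape)                    # (100, 4): tall and skinny, n >> p
print(np.linalg.matrix_rank(X))   # 4, i.e., full column rank p
```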
If it happens that $n/p$ is as small as 5, we will worry that we don't have enough data (reflected in $n$) to estimate the number of parameters in $\beta$ (reflected in $p$).
The multiple regression model is now

$$\underset{n \times 1}{Y} = \underset{n \times p}{X}\,\underset{p \times 1}{\beta} + \underset{n \times 1}{\varepsilon},$$

and this is a shorthand for
$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix}
\beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \beta_3 x_{13} + \cdots + \beta_K x_{1K} + \varepsilon_1 \\
\beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \beta_3 x_{23} + \cdots + \beta_K x_{2K} + \varepsilon_2 \\
\beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \beta_3 x_{33} + \cdots + \beta_K x_{3K} + \varepsilon_3 \\
\vdots \\
\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \beta_3 x_{n3} + \cdots + \beta_K x_{nK} + \varepsilon_n
\end{bmatrix}$$
The model form $Y = X\beta + \varepsilon$ is thus completely general.
The assumptions on the noise terms can be written as $\mathrm{E}\,\varepsilon = \mathbf{0}$ and $\mathrm{Var}\,\varepsilon = \sigma^2 I$. The $I$ here is the $n \times n$ identity matrix. That is,

$$I = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}$$
The variance assumption can be written as

$$\mathrm{Var}\,\varepsilon = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}.$$

You may see this expressed as $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \sigma^2 \delta_{ij}$, where

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
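A small simulation makes these assumptions tangible. The sketch below (arbitrary choices of $n$ and $\sigma$) draws many noise vectors and checks that their empirical mean and covariance are near $\mathbf{0}$ and $\sigma^2 I$.

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma = 4, 2.0
reps = 100_000

# Draw many independent noise vectors eps ~ N(0, sigma^2 I).
eps = rng.normal(loc=0.0, scale=sigma, size=(reps, n))

print(eps.mean(axis=0))   # each entry near 0, matching E eps = 0
print(np.cov(eps.T))      # near sigma^2 I: about 4 on the diagonal, 0 off it
```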
We will call $b$ the estimate of the unknown parameter vector $\beta$. You will also find the notation $\hat{\beta}$ for the estimate.
Once we get $b$, we can compute the fitted vector $\hat{Y} = Xb$. This fitted value represents an ex-post guess at the expected value of $Y$.
The estimate $b$ is found so that the fitted vector $\hat{Y}$ is close to the actual data vector $Y$. Closeness is defined in the least squares sense, meaning that we want to minimize the criterion $Q$, where

$$Q = \sum_{i=1}^{n} \big( Y_i - (Xb)_i \big)^2$$

and $(Xb)_i$ denotes the $i$th entry of the vector $Xb$.
This can be done by differentiating this quantity $p = K + 1$ times, once with respect to $b_0$, once with respect to $b_1$, ..., and once with respect to $b_K$. This is routine in simple regression ($K = 1$), and it's possible with a lot of messy work in general.
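As a numerical sanity check on this calculus story, the sketch below (simulated data, made-up coefficients) verifies that all $p$ partial derivatives of $Q$ vanish at the minimizing $b$; it jumps ahead and uses the least squares solution derived later in this section.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

def Q(b):
    """The least squares criterion: the squared length of Y - Xb."""
    resid = Y - X @ b
    return resid @ resid

# The minimizer (derived later): b = (X'X)^{-1} X'Y.
b_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Finite-difference partial derivatives of Q with respect to b_0, ..., b_K:
h = 1e-6
for j in range(p):
    e = np.zeros(p)
    e[j] = h
    print((Q(b_ls + e) - Q(b_ls - e)) / (2 * h))   # each is near 0
```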
It happens that $Q$ is the squared length of the vector difference $Y - Xb$. This means that we can write

$$Q = \underset{1 \times n}{(Y - Xb)'}\ \underset{n \times 1}{(Y - Xb)}$$

This represents $Q$ as a $1 \times 1$ matrix, and so we can think of $Q$ as an ordinary number.
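A quick numpy check that the two expressions for $Q$ agree; the data and the candidate $b$ here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.normal(size=n)
b = np.array([0.3, 1.1])           # any candidate coefficient vector

resid = Y - X @ b
Q_sum = np.sum(resid ** 2)         # entry-by-entry sum of squares
Q_quad = resid @ resid             # (Y - Xb)'(Y - Xb)

print(np.isclose(Q_sum, Q_quad))   # True: the same ordinary number
```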
There are several ways to find the $b$ that minimizes $Q$. The simple solution we'll show here (alas) requires knowing the answer and working backward.
Define the matrix

$$\underset{n \times n}{H} = \underset{n \times p}{X} \Big( \underset{p \times p}{X'X} \Big)^{-1} \underset{p \times n}{X'}.$$

We will call $H$ the "hat matrix," and it has some important uses. There are several technical comments about $H$:
(1) Finding $H$ requires the ability to get $(X'X)^{-1}$. This matrix inversion is possible if and only if $X$ has full rank $p$. Things get very interesting when $X$ almost has full rank $p$; that's a longer story for another time.
(2) The matrix $H$ is idempotent. The defining condition for idempotence is this: the matrix $C$ is idempotent $\Leftrightarrow$ $CC = C$. Only square matrices can be idempotent. Since $H$ is square (it's $n \times n$), it can be checked for idempotence. You will indeed find that $HH = H$; see the numerical check after this list.
(3) The $i$th diagonal entry, that in position $(i, i)$, will be identified for later use as the $i$th leverage value. The notation is usually $h_i$, but you'll also see $h_{ii}$.
Now write $Y$ in the form $HY + (I - H)Y$.
Now let's develop $Q$. This will require using the fact that $H$ is symmetric, meaning $H' = H$. This will also require using the transpose of a matrix product. Specifically, the property will be $(Xb)' = b'X'$.
$$\begin{aligned}
Q &= (Y - Xb)'(Y - Xb) \\
  &= \big[ HY + (I - H)Y - Xb \big]' \big[ HY + (I - H)Y - Xb \big] \\
  &= (HY - Xb)'(HY - Xb) + (HY - Xb)'(I - H)Y \\
  &\qquad + \big[ (I - H)Y \big]'(HY - Xb) + Y'(I - H)'(I - H)Y
\end{aligned}$$
The second and third summands above are zero, as a consequence of

$$(I - H)X = X - HX = X - X(X'X)^{-1}X'X = X - X = 0,$$

together with the symmetry and idempotence of $H$, which give $H(I - H) = 0$. This leaves

$$Q = (HY - Xb)'(HY - Xb) + Y'(I - H)'(I - H)Y.$$
If this is to be minimized over choices of $b$, then the minimization can only be done with regard to the first summand $(HY - Xb)'(HY - Xb)$. It is possible to make the vector $HY - Xb$ equal to $\mathbf{0}$ by selecting $b = (X'X)^{-1}X'Y$. This is very easy to see, as $HY = X(X'X)^{-1}X'Y = Xb$.

This $b = (X'X)^{-1}X'Y$ is known as the least squares estimate of $\beta$.
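A minimal numpy sketch of this estimate, on simulated data with made-up coefficients; np.linalg.lstsq, which minimizes the same criterion $Q$, serves as an independent cross-check.

```python
import numpy as np

rng = np.random.default_rng(6)

n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([4.0, 1.5, -2.0]) + rng.normal(size=n)

# b = (X'X)^{-1} X'Y, computed via solve() rather than an explicit inverse.
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Library cross-check: lstsq minimizes the same least squares criterion.
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b)
print(np.allclose(b, b_lstsq))   # True
```

Using solve() rather than forming $(X'X)^{-1}$ explicitly is the usual numerical practice; it is faster and more stable.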
For the simple linear regression case $K = 1$, the estimate $b = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}$ can be found with relative ease. The slope estimate is

$$b_1 = \frac{S_{xy}}{S_{xx}},$$

where

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y}$$

and where

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2.$$
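A quick simulated check that the $S_{xy}/S_{xx}$ slope matches the matrix solution:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 50
x = rng.normal(size=n)
Y = 1.0 + 0.8 * x + rng.normal(size=n)

S_xy = np.sum((x - x.mean()) * (Y - Y.mean()))
S_xx = np.sum((x - x.mean()) ** 2)
b1 = S_xy / S_xx                         # slope from the S-quantities

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ Y)    # the matrix solution (b_0, b_1)

print(np.isclose(b1, b[1]))              # True: the two slopes agree
```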
For the multiple regression case $K \geq 2$, the calculation involves the inversion of the $p \times p$ matrix $X'X$. This task is best left to computer software.
There is a computational trick, called “mean-centering,” that converts the problem to a simpler one of inverting a K × K matrix.
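Here is a sketch of how that trick can play out numerically; the centering algebra is standard, but the data and coefficients are made up. The slopes come from a $K \times K$ system built on centered data, and the intercept is recovered afterward from the means.

```python
import numpy as np

rng = np.random.default_rng(8)

n, K = 100, 2
Z = rng.normal(size=(n, K))                  # raw predictors, no ones column
Y = 3.0 + Z @ np.array([1.0, -0.5]) + rng.normal(size=n)

# Mean-center the predictors and the response; only a K x K system remains.
Zc = Z - Z.mean(axis=0)
Yc = Y - Y.mean()
slopes = np.linalg.solve(Zc.T @ Zc, Zc.T @ Yc)
intercept = Y.mean() - Z.mean(axis=0) @ slopes   # recover b_0 afterward

# Cross-check against the full p x p solution with the intercept column.
X = np.column_stack([np.ones(n), Z])
b = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(b, np.concatenate(([intercept], slopes))))   # True
```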
The matrix notation will allow the proof of two very helpful facts:

* $\mathrm{E}\,b = \beta$. This means that $b$ is an unbiased estimate of $\beta$. This is a good thing, but there are circumstances in which biased estimates will work a little bit better.

* $\mathrm{Var}\,b = \sigma^2 (X'X)^{-1}$. This identifies the variances and covariances of the estimated coefficients. It's critical to note that the separate entries of $b$ are not statistically independent.
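Both facts can be checked by simulation. The sketch below (arbitrary design and $\sigma$) refits the model on many simulated datasets sharing the same $X$ and compares the empirical mean and covariance of $b$ with the theory.

```python
import numpy as np

rng = np.random.default_rng(9)

n, sigma = 40, 1.5
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([2.0, 0.7])

# Refit the model on many simulated datasets sharing the same X.
reps = 20_000
B = np.empty((reps, 2))
for r in range(reps):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    B[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(B.mean(axis=0))                      # near beta: E b = beta
print(np.cov(B.T))                         # near sigma^2 (X'X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))   # the theoretical Var b
```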