Regression Review

Matrix Essentials – review:
 5  3 2

 0 4 3
(1) Matrix: Rectangular array of numbers. A  
 5 0


(2) Transpose: Rows become columns and vice-versa A    3 4 
 2 3


T
(3) A single row or column is called a (row or column) “Vector” R  5  3 2
Addition and subtraction: element by element, and dimensions must match.
1 0 0


 5  3 2
 2 5  2
1 0
 B  
 I 2  
 I 3   0 1 0 
Examples: A  
 0 4 3
3 0 1 
0 1
0 0 1


 7 2 0
  B  A
A  B  
  3 4 4
3  8 4

A  B  
3 4 2
(4) Multiplication:
(4A) Row times column (must have same number of entries). Multiply corresponding elements together
then add up these “cross products”
 2 
 
Example: C   5 
  2
 
 2 
 
RC  5  3 2 5   (5)( 2)  (3)(5)  (2)( 2)  9
  2
 
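In PROC IML this row-times-column product is just the * operator (again a supplementary sketch, assuming SAS/IML):

PROC IML;
  R = {5 -3 2};      /* row vector */
  C = {2, 5, -2};    /* column vector */
  RC = R * C;        /* (5)(2) + (-3)(5) + (2)(-2) = -9 */
  PRINT RC;
QUIT;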
(4B) Matrix times matrix: Number of columns in first matrix must equal the number of rows in the
second. In other words each row of the first matrix must have the same number of entries as each
column of the second. We cannot compute AA or AB. We can compute A(B^T) and A^T A. The entry in row
i column j of the product matrix is row i of the first matrix times column j of the second.
 2  3
  (5)( 2)  (3)(5)  (2)( 2)  15  0  2    9  13 
 5  3 2 
 5
  

A( B )  
0   
0
4
3
0

20

6
0

0

3
14
3

  2 1  
 



T
 2  3
 10  18  5 

 5  3 2  

   25  15 10 
B A 5
0 
  2 1  0 4 3    10 10  1 




Exercise:

$$A = \begin{pmatrix} 5 & -3 \\ 0 & 4 \end{pmatrix} \qquad
B = \begin{pmatrix} 2 & 5 \\ -3 & 0 \end{pmatrix}$$

$$AB = \begin{pmatrix} \_\_ & \_\_ \\ \_\_ & \_\_ \end{pmatrix} \qquad
BA = \begin{pmatrix} 10 & 14 \\ -15 & 9 \end{pmatrix}$$

Does AB = BA?
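To check matrix products like A(B^T) and B^T A above (or the exercise) by machine, the * operator in PROC IML does the full matrix product and stops with an error if the dimensions do not conform. A supplementary sketch (assuming SAS/IML):

PROC IML;
  A = {5 -3 2, 0 4 3};
  B = {2 5 -2, -3 0 1};
  ABt = A * B`;      /* 2x3 times 3x2: should reproduce the A(B^T) result above */
  BtA = B` * A;      /* 3x2 times 2x3: should reproduce the B^T A result above */
  PRINT ABt BtA;
QUIT;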
1 0 0
  5  0  0 0  3  0 0  0  2
 5  3 2 
More examples: A( I 3 )  
 0 1 0   
  A
 0 4 3  0 0 1   0  0  0 0  4  0 0  0  3 


 1 0  5  3 2   5  3 2 

  
  A
I 2 A  
 0 1  0 4 3   0 4 3 
I2B  B
B( I 3 )  B
… therefore I is called an “identity” matrix.
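The function I(n) in PROC IML builds an n-by-n identity matrix, so these facts are easy to verify by machine (supplementary sketch, assuming SAS/IML):

PROC IML;
  A = {5 -3 2, 0 4 3};
  right = A * I(3);   /* equals A */
  left  = I(2) * A;   /* also equals A */
  PRINT right left;
QUIT;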
(5) “Division”
With numbers, to divide 7 by 0.25 we notice that 4(0.25) is 1. Now 7/0.25 is the solution X to 0.25X = 7, so 4(0.25)X = 4(7), or (1)X = 28. The secret of dividing 7 by 0.25 is to multiply 7 by the inverse of 0.25, because then our equation becomes 1X = 28. We have seen that the matrix I acts like the number 1, and both are called the “identity” (for numbers in one case and for matrices in the other) because any number times 1 is that same number and any matrix times I is that matrix. Furthermore, IA = AI (for I of the appropriate dimension), whereas in general AB is not the same as BA. The inverse of a number is another number which, when multiplied by the original number, gives the identity; 4 is the inverse of 0.25 because 4(0.25) = 1. To have an inverse, a matrix must at least be square (same number of rows as columns). The inverse of a square matrix A is another square matrix A^-1 which, when multiplied by A, produces an identity matrix as the product.
Examples: Find the inverse of
$$\begin{pmatrix} 2 & 4 & 5 \\ 1 & -3 & -5 \\ -3 & 7 & 12 \end{pmatrix}$$
Answer: Not possible.


No inverse exists for this matrix because one of the columns is an exact linear combination of the other
two. Whenever this happens, the matrix cannot be inverted. Such a matrix is said to not be of “full
rank” and is called a “singular” matrix. Letting the columns be C1, C2 and C3 we have
 2 
 4
 5 
 1   6 
 
 
 

 

C1   1  C2    3  C3    5   0.5C1  1.5C2    0.5     4.5 
  3
 7 
 12 
 1.5   10.5 
 
 
 

 

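You can confirm the dependence numerically: -0.5C1 + 1.5C2 reproduces C3 exactly, and the determinant of a singular matrix is 0. A supplementary PROC IML sketch (my addition, assuming SAS/IML):

PROC IML;
  M  = {2  4  5,
        1 -3 -5,
       -3  7 12};                /* the matrix with no inverse */
  C1 = M[,1];  C2 = M[,2];  C3 = M[,3];
  combo = -0.5*C1 + 1.5*C2;      /* equals C3 */
  d = DET(M);                    /* 0 for a singular matrix */
  PRINT combo C3 d;
QUIT;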
Examples: Find the inverse of
$$D = \begin{pmatrix} 2 & 4 & 5 \\ 1 & -3 & -5 \\ -3 & 7 & 11 \end{pmatrix}$$
Answer:
$$D^{-1} = \begin{pmatrix} 0.2 & -0.9 & -0.5 \\ 0.4 & 3.7 & 1.5 \\ -0.2 & -2.6 & -1 \end{pmatrix}$$
Note the notation for inverse. Because it has no “dependent columns,” that is, no columns that can be
written as linear combinations of others, this matrix D is said to be of full rank or equivalently it is said to
be “nonsingular.” I have not shown you how to compute the inverse; SAS will do that. You should, however, be able to show that the claimed inverse really is the inverse. Remember, the idea of a
number’s inverse is another number whose product with the original number is the identity. Because
(0.5)(2) = 1 we see that 2 is the inverse of 0.5 and 0.5 is the inverse of 2. We see that 4 is the inverse of
0.25 etc. Multiply to verify that the identity is the product.
$$\begin{pmatrix} 0.2 & -0.9 & -0.5 \\ 0.4 & 3.7 & 1.5 \\ -0.2 & -2.6 & -1 \end{pmatrix}
\begin{pmatrix} 2 & 4 & 5 \\ 1 & -3 & -5 \\ -3 & 7 & 11 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
For example, (0.2)(2)+(-0.9)(1)+(-0.5)(-3) = 0.4-0.9+1.5 = 1 and (0.2)(4)+(-0.9)(-3)+(-0.5)(7) = 0.8+2.7-3.5 = 0 are the first two entries in the first row of the product matrix.
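PROC IML computes the inverse directly with the INV function, and multiplying it by D should return the identity (supplementary sketch, assuming SAS/IML):

PROC IML;
  D = {2  4  5,
       1 -3 -5,
      -3  7 11};
  Dinv  = INV(D);      /* the inverse claimed above */
  check = Dinv * D;    /* should print the 3x3 identity (up to rounding) */
  PRINT Dinv check;
QUIT;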
LEAST SQUARES:
We want to fit a line to a set of points (X,Y) = (1,10), (5,15), (12,22), and (8,20). That is, we want to find b0 and b1 such that the column of residuals R, each of whose elements is Y - b0 - b1X, has the smallest sum of squares, where the column of Y values is the transpose of (10, 15, 22, 20) and the column of X values is the transpose of (1, 5, 12, 8). For example, if b0=5 and b1=2 we have residuals 10-(5+2(1)) = 3, 15-(5+2(5)) = 0, 22-(5+2(12)) = -7, and 20-(5+2(8)) = -1; that is, the column of residuals is
 3 
 3 
 
 
 0 
 0 
R    with sum of squares RT R  3 0  7  1   59 . We can probably do better.
7
7
 
 
 1 
 1
 
 
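Those residuals and their sum of squares for the guess b0 = 5, b1 = 2 can be reproduced by machine (a supplementary PROC IML sketch, assuming SAS/IML):

PROC IML;
  Y = {10, 15, 22, 20};    /* column of Y values */
  X = {1, 5, 12, 8};       /* column of X values */
  R = Y - (5 + 2*X);       /* residuals 3, 0, -7, -1 for b0=5, b1=2 */
  SSE = R` * R;            /* 59 */
  PRINT R SSE;
QUIT;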
1 1 
 10 
 10  1 1 
 3 


 
  

 
1 5 
 15 
 15  1 5  5   0 
 
Notice that if X  
and Y    then R  Y  Xb     
is our
1 12 
22
22
1 12  2    7 


 
  

 
1 8 
 20 
 20  1 8 
 1


 
  

 
 10  1 1 
  

 15  1 5  b0 
  is the column of residuals for
column of residuals and in general, R  Y  Xb     
22
1 12  b1 
  

 20  1 8 
  

any (b0, b1) pair we pick so that finding the b0 and b1 that minimize RTR will give an intercept and slope
that cannot be beaten (in terms of minimizing the error sum of squares). We thus compute
R^T R = (Y - Xb)^T (Y - Xb) and set its derivatives with respect to b0 and b1 equal to 0. This results in a matrix equation whose solution is the vector b = (b0, b1)^T of desired estimates

*********************** X^T X b = X^T Y ************************
which is the reason for all the matrix algebra we have seen thus far. Notice that if we can invert X^T X then we can solve for b, and if we have a program that can invert matrices accurately then it does not matter how many observations or explanatory variables we have: we can get the least squares estimates!
The important equation

*********************** X^T X b = X^T Y ************************

is called the normal equations (plural because the matrix equation has several rows) and again, if we can invert the X^T X matrix, the vector of least squares solutions b is given by

*********************** b = (X^T X)^-1 X^T Y ************************
Let’s try to get the unbeatable (in terms of least squares) intercept and slope for our 4 points:
1 1 


1 5 
X 
1 12 


1 8 


1 1 


1 1 1 1 1 5   4 26 
T


X X  
  
1 5 12 8 1 12   26 234 
1 8 


1

1 1  



1
 4 26 
 0.9  0.1
 1 1 1 1 1 5  
T
1





( X X )   


 26 234 
1 5 12 8  1 12  
 0.1 1/65 







1 8  




 10 
 
1
1
1
1


 15   67 
   

X T Y  
1 5 12 8  22   509 
 20 
 
(you can check this)
 0.9  0.1 67   9.4 

  

  0.1 1/65  509  1.13077 
so our solution is b  
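The same solution can be obtained by machine. Here is a supplementary PROC IML sketch (assuming SAS/IML is licensed); SOLVE(XtX, XtY) gives the same answer as INV(XtX)*XtY:

PROC IML;
  X = {1  1,
       1  5,
       1 12,
       1  8};
  Y = {10, 15, 22, 20};
  XtX = X` * X;                   /* {4 26, 26 234} */
  XtY = X` * Y;                   /* {67, 509} */
  b   = SOLVE(XtX, XtY);          /* solves X'X b = X'Y: 9.4 and 1.13077 */
  SSE = (Y - X*b)` * (Y - X*b);   /* 3.6385 */
  PRINT b SSE;
QUIT;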
Our predicting equation is Ypredicted = 9.4 + 1.13077X, whose sum of squared residuals, 3.6385, is the smallest possible and way better than the 59 we got using Ypredicted = 5 + 2X. Suppose we try computing the residual sum of squares across a grid of values and plotting the result:
DATA OLS;
  /* search a grid of intercepts and slopes centered at the least squares solution */
  DO D0 = -2 TO 2 BY .02;
    DO D1 = -.2 TO .2 BY .002;
      B0 = 9.4 + D0;  B1 = 1.13077 + D1;
      /* error sum of squares for this (B0,B1); MIN truncates it at 6 so the plots are not dominated by huge values */
      SSE = MIN( (10-B0-B1*1)**2 + (15-B0-B1*5)**2
               + (22-B0-B1*12)**2 + (20-B0-B1*8)**2, 6 );
      OUTPUT;
    END;
  END;
RUN;
PROC G3D;      PLOT B0*B1=SSE / ROTATE=10; RUN;
PROC GCONTOUR; PLOT B0*B1=SSE; RUN;
Why do we want to use least squares in the first place? What is so good about that method? The
answer is that if the errors are independent and normally distributed with constant variance then the
least squares estimated intercept and slope will vary around the true values in repeated samples, be
normally distributed in repeated samples, and will have the smallest possible variation in repeated
samples. In practice, we use a computer program to do the computations above. For example,
ods html close; ods listing;
ods listing gpath="%sysfunc(pathname(work))";
OPTIONS LS=76;
DATA OLS;
  INPUT X Y @@;
CARDS;
1 10 5 15 12 22 8 20
;
PROC REG DATA=OLS;
  /* XPX prints X'X, I prints its inverse, COVB prints the covariance matrix of the estimates */
  MODEL Y=X / XPX I COVB;
  OUTPUT OUT=OUT1 PREDICTED=P RESIDUAL=R;
RUN;
PROC SGPLOT DATA=OUT1;
  SCATTER X=X Y=Y;
  SERIES X=X Y=P;
RUN;
(1) The 2 by 2 matrix X’X is in the top left, bordered by X’Y to the right. Y’Y is in the lower right corner.
The REG Procedure
Model: MODEL1

Model Crossproducts X'X X'Y Y'Y

Variable      Intercept          X          Y
Intercept             4         26         67
X                    26        234        509
Y                    67        509       1209
(2) The matrix X’X is inverted and bordered by b to the right. SS(error) is in the lower right corner.
X'X Inverse, Parameter Estimates, and SSE

Variable        Intercept               X               Y
Intercept             0.9            -0.1             9.4
X                    -0.1    0.0153846154    1.1307692308
Y                     9.4    1.1307692308    3.6384615385
Analysis of Variance

                            Sum of        Mean
Source             DF      Squares      Square    F Value    Pr > F
Model               1     83.11154    83.11154      45.68    0.0212
Error               2      3.63846     1.81923
Corrected Total     3     86.75000
(3) Parameter estimates are given with standard errors. Each t statistic is the ratio of the estimate to its standard error. The slope is significantly different from 0 (p=0.0212 is less than 0.05).
Parameter Estimates

                  Parameter    Standard
Variable    DF     Estimate       Error    t Value    Pr > |t|
Intercept    1      9.40000     1.27957       7.35      0.0180
X            1      1.13077     0.16730       6.76      0.0212
Where did those standard errors come from?
First, multiply each entry in (X'X)^-1 by MSE, the mean squared error (1.81923), which is an estimate of the variance σ² of the errors e. The resulting matrix is called the “covariance matrix” of the parameter
estimates and is shown below as a result of the COVB option. The negative number -0.1819 is the
covariance between the slope and intercept in repeated samples. The elements on the diagonal (from
upper left to lower right) are estimated variances of the intercept and slopes – as many slopes as you
have predictor variables (just 1 in this example). The square roots of these numbers are the standard
errors. For example 1.27957 was the standard error of the intercept. This 1.27957 is the square root of
1.6373.
Covariance of Estimates

Variable         Intercept               X
Intercept     1.6373076923    -0.181923077
X             -0.181923077    0.0279881657
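To reproduce the COVB table and the standard errors yourself, multiply (X'X)^-1 by MSE and take square roots of the diagonal entries. A supplementary PROC IML sketch (assuming SAS/IML is licensed):

PROC IML;
  X   = {1 1, 1 5, 1 12, 1 8};
  Y   = {10, 15, 22, 20};
  XtX = X` * X;
  b   = SOLVE(XtX, X`*Y);
  R   = Y - X*b;
  MSE  = (R`*R) / (NROW(X) - NCOL(X));   /* 3.63846 / 2 = 1.81923 */
  COVB = MSE * INV(XtX);                 /* covariance matrix of the estimates */
  SE   = SQRT(VECDIAG(COVB));            /* 1.27957 and 0.16730 */
  PRINT COVB SE;
QUIT;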