Matrix Essentials – review:

(1) Matrix: a rectangular array of numbers, for example $A = \begin{bmatrix} 5 & 3 & 2 \\ 0 & 4 & 3 \end{bmatrix}$ (2 rows and 3 columns).

(2) Transpose: rows become columns and vice versa: $A^T = \begin{bmatrix} 5 & 0 \\ 3 & 4 \\ 2 & 3 \end{bmatrix}$.

(3) A single row or column is called a (row or column) "vector," for example the row vector $R = \begin{bmatrix} 5 & 3 & 2 \end{bmatrix}$.

Addition and subtraction are element by element, and the dimensions must match. With $B = \begin{bmatrix} -2 & 5 & 2 \\ 3 & 0 & 1 \end{bmatrix}$, and with $I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and $I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ denoting identity matrices, examples are

$A + B = \begin{bmatrix} 3 & 8 & 4 \\ 3 & 4 & 4 \end{bmatrix}$ and $A - B = \begin{bmatrix} 7 & -2 & 0 \\ -3 & 4 & 2 \end{bmatrix}$.

(4) Multiplication:

(4A) Row times column (they must have the same number of entries): multiply corresponding elements together, then add up these "cross products."

Example: with $C = \begin{bmatrix} -2 \\ 5 \\ 2 \end{bmatrix}$, $RC = \begin{bmatrix} 5 & 3 & 2 \end{bmatrix} \begin{bmatrix} -2 \\ 5 \\ 2 \end{bmatrix} = (5)(-2) + (3)(5) + (2)(2) = 9$.

(4B) Matrix times matrix: the number of columns in the first matrix must equal the number of rows in the second. In other words, each row of the first matrix must have the same number of entries as each column of the second. We cannot compute AA or AB, but we can compute $AB^T$ and $A^T A$. The entry in row i, column j of the product matrix is row i of the first matrix times column j of the second.

$AB^T = \begin{bmatrix} 5 & 3 & 2 \\ 0 & 4 & 3 \end{bmatrix} \begin{bmatrix} -2 & 3 \\ 5 & 0 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} (5)(-2)+(3)(5)+(2)(2) & (5)(3)+(3)(0)+(2)(1) \\ (0)(-2)+(4)(5)+(3)(2) & (0)(3)+(4)(0)+(3)(1) \end{bmatrix} = \begin{bmatrix} 9 & 17 \\ 26 & 3 \end{bmatrix}$

$B^T A = \begin{bmatrix} -2 & 3 \\ 5 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 5 & 3 & 2 \\ 0 & 4 & 3 \end{bmatrix} = \begin{bmatrix} -10 & 6 & 5 \\ 25 & 15 & 10 \\ 10 & 10 & 7 \end{bmatrix}$

Exercise: with $A = \begin{bmatrix} 5 & 3 \\ 0 & 4 \end{bmatrix}$ and $B = \begin{bmatrix} 2 & -5 \\ 3 & 0 \end{bmatrix}$, fill in $AB = \begin{bmatrix} \_\_ & \_\_ \\ \_\_ & \_\_ \end{bmatrix}$ and compare to $BA = \begin{bmatrix} 10 & -14 \\ 15 & 9 \end{bmatrix}$. Does AB = BA?

More examples: $A(I_3) = \begin{bmatrix} 5 & 3 & 2 \\ 0 & 4 & 3 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = A$, and likewise $I_2 A = A$, $I_2 B = B$, and $B(I_3) = B$; therefore I is called an "identity" matrix.

(5) "Division": With numbers, to divide 7 by 0.25 we notice that 4(0.25) is 1. Now 7/0.25 is the solution X to 0.25X = 7, so 4(0.25)X = 4(7), or (1)X = 28. The secret of dividing 7 by 0.25 is to multiply 7 by the inverse of 0.25, because then our equation becomes 1X = 28. We have seen that the matrix I acts like the number 1; both are called the "identity," for numbers in one case and for matrices in the other, because any number times 1 is that same number and any matrix times I is that matrix. Furthermore, IA = AI (for I of the appropriate dimension), whereas in general AB is not the same as BA. The inverse of a number is another number which, when multiplied by the original number, gives the identity, so 4 is the inverse of 0.25 because 4(0.25) = 1. To have an inverse, a matrix must at least be square (same number of rows as columns). The inverse of a square matrix A is another square matrix $A^{-1}$ which, when multiplied by A, produces an identity matrix as the product.

Example: find the inverse of $\begin{bmatrix} 2 & 4 & -5 \\ 1 & -3 & 5 \\ -3 & 7 & -12 \end{bmatrix}$. Answer: not possible.

No inverse exists for this matrix because one of its columns is an exact linear combination of the other two. Whenever this happens, the matrix cannot be inverted. Such a matrix is said not to be of "full rank" and is called a "singular" matrix. Letting the columns be $C_1$, $C_2$, and $C_3$, we have

$C_3 = \begin{bmatrix} -5 \\ 5 \\ -12 \end{bmatrix} = 0.5C_1 - 1.5C_2 = \begin{bmatrix} 1 \\ 0.5 \\ -1.5 \end{bmatrix} + \begin{bmatrix} -6 \\ 4.5 \\ -10.5 \end{bmatrix}$

Example: find the inverse of $D = \begin{bmatrix} 2 & 4 & -5 \\ 1 & -3 & 5 \\ -3 & 7 & -11 \end{bmatrix}$. Answer: $D^{-1} = \begin{bmatrix} 0.2 & -0.9 & -0.5 \\ 0.4 & 3.7 & 1.5 \\ 0.2 & 2.6 & 1 \end{bmatrix}$

Note the notation for the inverse. Because it has no "dependent columns," that is, no columns that can be written as linear combinations of others, this matrix D is said to be of full rank, or equivalently it is said to be "nonsingular." I have not shown you how to compute the inverse; SAS will do that. You should, however, be able to show that the claimed inverse really is the inverse. Remember, the idea of a number's inverse is another number whose product with the original number is the identity. Because (0.5)(2) = 1 we see that 2 is the inverse of 0.5 and 0.5 is the inverse of 2. We see that 4 is the inverse of 0.25, etc.
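All of the arithmetic in this review can be checked in SAS. Here is a minimal PROC IML sketch (assuming SAS/IML is available; the names A, B, A2, and B2 are just our labels for the matrices above) that reproduces the transpose, the sums and differences, and the products:

PROC IML;
  A = {5 3 2, 0 4 3};                   /* the 2x3 matrix from (1) */
  B = {-2 5 2, 3 0 1};                  /* same dimensions as A */
  PRINT (A`)[LABEL="A transpose"];      /* the backquote is the transpose operator */
  PRINT (A+B)[LABEL="A+B"] (A-B)[LABEL="A-B"];
  PRINT (A*B`)[LABEL="A times B-transpose"] (B`*A)[LABEL="B-transpose times A"];
  /* the exercise: compute AB by hand first, then check your answer here */
  A2 = {5 3, 0 4};
  B2 = {2 -5, 3 0};
  PRINT (A2*B2)[LABEL="AB"] (B2*A2)[LABEL="BA"];
QUIT;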
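The inverse examples can be checked the same way. In this sketch (again assuming SAS/IML; S and D are our names for the singular and nonsingular matrices above), DET confirms that the first matrix has determinant 0 and so cannot be inverted, and INV reproduces the claimed inverse:

PROC IML;
  S = {2 4 -5, 1 -3 5, -3 7 -12};   /* third column = 0.5*C1 - 1.5*C2, so singular */
  D = {2 4 -5, 1 -3 5, -3 7 -11};   /* full rank */
  PRINT (DET(S))[LABEL="det of singular matrix (zero)"];
  PRINT (DET(D))[LABEL="det of D (nonzero)"];
  Dinv = INV(D);
  PRINT Dinv (Dinv*D)[LABEL="Dinv*D = identity"];
QUIT;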
Multiply to verify that the identity is the product:

$D^{-1} D = \begin{bmatrix} 0.2 & -0.9 & -0.5 \\ 0.4 & 3.7 & 1.5 \\ 0.2 & 2.6 & 1 \end{bmatrix} \begin{bmatrix} 2 & 4 & -5 \\ 1 & -3 & 5 \\ -3 & 7 & -11 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

For example, (0.2)(2)+(-0.9)(1)+(-0.5)(-3) = 0.4-0.9+1.5 = 1 and (0.2)(4)+(-0.9)(-3)+(-0.5)(7) = 0.8+2.7-3.5 = 0 are the first two entries in the first row of the product matrix.

LEAST SQUARES: We want to fit a line to a set of points (X,Y) = (1,10), (5,15), (12,22), and (8,20). That is, we want to find $b_0$ and $b_1$ such that the column of residuals R, each of whose elements is $Y - b_0 - b_1 X$, has the smallest sum of squares, where the column of Y values is the transpose of (10, 15, 22, 20) and the column of X values is the transpose of (1, 5, 12, 8). For example, if $b_0 = 5$ and $b_1 = 2$ we have residuals 10-(5+2(1)) = 3, 15-(5+2(5)) = 0, 22-(5+2(12)) = -7, and 20-(5+2(8)) = -1; that is, the column of residuals is $R = (3, 0, -7, -1)^T$ with sum of squares $R^T R = (3)(3)+(0)(0)+(-7)(-7)+(-1)(-1) = 59$. We can probably do better.

Notice that if

$X = \begin{bmatrix} 1 & 1 \\ 1 & 5 \\ 1 & 12 \\ 1 & 8 \end{bmatrix}$ and $Y = \begin{bmatrix} 10 \\ 15 \\ 22 \\ 20 \end{bmatrix}$, then $R = Y - Xb = \begin{bmatrix} 10 \\ 15 \\ 22 \\ 20 \end{bmatrix} - \begin{bmatrix} 1 & 1 \\ 1 & 5 \\ 1 & 12 \\ 1 & 8 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}$

is our column of residuals for any $(b_0, b_1)$ pair we pick (with $b = (5, 2)^T$ it reproduces the column above), so that finding the $b_0$ and $b_1$ that minimize $R^T R$ will give an intercept and slope that cannot be beaten (in terms of minimizing the error sum of squares). We thus compute $R^T R = (Y - Xb)^T (Y - Xb)$ and set its derivatives with respect to $b_0$ and $b_1$ equal to 0. This results in a matrix equation whose solution is the vector $b = (b_0, b_1)^T$ of desired estimates

*********************** $X^T X b = X^T Y$ ************************

which is the reason for all the matrix algebra we have seen thus far. Notice that if we can invert $X^T X$ then we can solve for b, and if we have a program that can invert matrices accurately then it does not matter how many observations or explanatory variables we have: we can get the least squares estimates! The important equation

*********************** $X^T X b = X^T Y$ ************************

is called the normal equations (plural because the matrix equation has several rows), and again, if we can invert the $X^T X$ matrix, the vector of least squares solutions b is given by

*********************** $b = (X^T X)^{-1} X^T Y$ ************************

Let's try to get the unbeatable (in terms of least squares) intercept and slope for our 4 points:

$X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 5 & 12 & 8 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 5 \\ 1 & 12 \\ 1 & 8 \end{bmatrix} = \begin{bmatrix} 4 & 26 \\ 26 & 234 \end{bmatrix}$

$(X^T X)^{-1} = \begin{bmatrix} 4 & 26 \\ 26 & 234 \end{bmatrix}^{-1} = \begin{bmatrix} 0.9 & -0.1 \\ -0.1 & 1/65 \end{bmatrix}$

$X^T Y = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 5 & 12 & 8 \end{bmatrix} \begin{bmatrix} 10 \\ 15 \\ 22 \\ 20 \end{bmatrix} = \begin{bmatrix} 67 \\ 509 \end{bmatrix}$ (you can check this)

so our solution is $b = \begin{bmatrix} 0.9 & -0.1 \\ -0.1 & 1/65 \end{bmatrix} \begin{bmatrix} 67 \\ 509 \end{bmatrix} = \begin{bmatrix} 9.4 \\ 1.13077 \end{bmatrix}$

Our predicting equation is Ypredicted = 9.4 + 1.13077X, whose sum of squared residuals, 3.6385, is the smallest possible and way better than the 59 we got using Ypredicted = 5 + 2X.

Suppose we try computing the residual sum of squares across a grid of values and plotting the result:

DATA OLS;
 DO D0 = -2 TO 2 BY .02;
  DO D1 = -.2 TO .2 BY .002;
   B0 = 9.4 + D0;
   B1 = 1.13077 + D1;
   SSE = MIN( (10-B0-B1*1)**2 + (15-B0-B1*5)**2
            + (22-B0-B1*12)**2 + (20-B0-B1*8)**2, 6 );
   OUTPUT;
  END;
 END;
PROC G3D;      PLOT B0*B1=SSE / ROTATE=10; RUN;
PROC GCONTOUR; PLOT B0*B1=SSE; RUN;

Why do we want to use least squares in the first place? What is so good about that method? The answer is that if the errors are independent and normally distributed with constant variance, then the least squares estimated intercept and slope will vary around the true values in repeated samples, be normally distributed in repeated samples, and will have the smallest possible variation in repeated samples.
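In IML, the normal equations take one line apiece. Here is a sketch under the same SAS/IML assumption; SOLVE(XpX, XpY) is used because solving the linear system directly is numerically safer than explicitly inverting X'X, although INV(XpX)*XpY gives the same answer here:

PROC IML;
  X = {1  1,
       1  5,
       1 12,
       1  8};                 /* column of 1s for the intercept, column of X values */
  Y = {10, 15, 22, 20};
  XpX = X`*X;                 /* the 4, 26, 234 computed above */
  XpY = X`*Y;                 /* the 67 and 509 */
  b   = SOLVE(XpX, XpY);      /* solves (X'X)b = X'Y; gives 9.4 and 1.13077 */
  R   = Y - X*b;              /* residuals at the least squares solution */
  SSE = R`*R;                 /* 3.6385 */
  PRINT XpX XpY, b SSE;
QUIT;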
In practice, we use a computer program to do the computations above. For example,

ods html close; ods listing;
ods listing gpath="%sysfunc(pathname(work))";
OPTIONS LS=76;
DATA OLS;
 INPUT X Y @@;
CARDS;
1 10 5 15 12 22 8 20
;
PROC REG DATA=OLS;
 MODEL Y=X / XPX I COVB;
 OUTPUT OUT=OUT1 PREDICTED=P RESIDUAL=R;
RUN;
PROC SGPLOT DATA=OUT1;
 SCATTER X=X Y=Y;
 SERIES X=X Y=P;
RUN;

(1) The 2 by 2 matrix X'X is in the top left, bordered by X'Y to the right. Y'Y is in the lower right corner.

The REG Procedure
Model: MODEL1

Model Crossproducts X'X X'Y Y'Y

Variable     Intercept         X         Y
Intercept            4        26        67
X                   26       234       509
Y                   67       509      1209

(2) The matrix X'X is inverted and bordered by b to the right. SS(error) is in the lower right corner.

X'X Inverse, Parameter Estimates, and SSE

Variable        Intercept                X                Y
Intercept             0.9             -0.1              9.4
X                    -0.1     0.0153846154     1.1307692308
Y                     9.4     1.1307692308     3.6384615385

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          83.11154       83.11154      45.68    0.0212
Error               2           3.63846        1.81923
Corrected Total     3          86.75000

(3) Parameter estimates are given with standard errors. Each t statistic is the ratio of the estimate to its standard error. The slope is significantly different from 0 (p = 0.0212 is less than 0.05).

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1               9.40000           1.27957       7.35      0.0180
X            1               1.13077           0.16730       6.76      0.0212

Where did those standard errors come from? First, multiply each entry in (X'X)^{-1} by MSE, the mean squared error (1.81923), which is an estimate of the variance $\sigma^2$ of the errors e. The resulting matrix is called the "covariance matrix" of the parameter estimates and is shown below as a result of the COVB option. The negative number -0.1819 is the covariance between the slope and intercept in repeated samples. The elements on the diagonal (from upper left to lower right) are estimated variances of the intercept and slopes (as many slopes as you have predictor variables; just 1 in this example). The square roots of these numbers are the standard errors. For example, 1.27957 was the standard error of the intercept; this 1.27957 is the square root of 1.6373.

Covariance of Estimates

Variable         Intercept               X
Intercept     1.6373076923    -0.181923077
X             -0.181923077    0.0279881657
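Those last two tables can also be reproduced from scratch. Here is a hedged IML sketch (the names MSE, COVB, SE, and the p-value line are ours, not PROC REG's) that scales (X'X)^{-1} by the mean squared error and takes square roots of the diagonal:

PROC IML;
  X = {1 1, 1 5, 1 12, 1 8};
  Y = {10, 15, 22, 20};
  XpXi = INV(X`*X);
  b    = XpXi*(X`*Y);
  R    = Y - X*b;
  DF   = NROW(X) - NCOL(X);             /* 4 observations - 2 parameters = 2 */
  MSE  = (R`*R)/DF;                     /* 1.81923 */
  COVB = MSE # XpXi;                    /* covariance matrix of the estimates */
  SE   = SQRT(VECDIAG(COVB));           /* 1.27957 and 0.16730 */
  T    = b/SE;                          /* elementwise: t ratios 7.35 and 6.76 */
  P    = 2*(1 - CDF("T", ABS(T), DF));  /* two-sided p-values 0.0180 and 0.0212 */
  PRINT COVB, b SE T P;
QUIT;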