Contents

A Linear algebra basics
  A.1 Basic operations
  A.2 Fundamental theorem of linear algebra
    A.2.1 Part one
    A.2.2 Part two
  A.3 Nature of the solution
    A.3.1 Exactly-identified
    A.3.2 Under-identified
    A.3.3 Over-identified
  A.4 Matrix decomposition and inverse operations
    A.4.1 LU factorization
    A.4.2 Cholesky decomposition
    A.4.3 Singular value decomposition
    A.4.4 Spectral decomposition
    A.4.5 Quadratic forms, eigenvalues, and positive definiteness
    A.4.6 Similar matrices and Jordan form
  A.5 Gram-Schmidt construction of an orthogonal matrix
    A.5.1 QR decomposition
    A.5.2 Gram-Schmidt QR algorithm
    A.5.3 Accounting example
    A.5.4 The Householder QR algorithm
    A.5.5 Accounting example
  A.6 Some determinant identities
    A.6.1 Determinant of a square matrix
    A.6.2 Identities

B Iterated expectations
  B.1 Decomposition of variance

C Multivariate normal theory
  C.1 Conditional distribution
  C.2 Special case of precision

D Generalized least squares (GLS)

E Two stage least squares IV (2SLS-IV)
  E.1 General case
  E.2 Special case

F Seemingly unrelated regression (SUR)
  F.1 Classical
  F.2 Bayesian
  F.3 Bayesian treatment effect application

G Maximum likelihood estimation of discrete choice models

H Optimization
  H.1 Linear programming
    H.1.1 Basic solutions or extreme points
    H.1.2 Fundamental theorem of linear programming
    H.1.3 Duality theorems
    H.1.4 Example
    H.1.5 Complementary slackness
  H.2 Nonlinear programming
    H.2.1 Unconstrained
    H.2.2 Convexity and global minima
    H.2.3 Example
    H.2.4 Constrained: the Lagrangian
    H.2.5 Karush-Kuhn-Tucker conditions
    H.2.6 Example

I Quantum information
  I.1 Quantum information axioms
    I.1.1 The superposition axiom
    I.1.2 The transformation axiom
    I.1.3 The measurement axiom
    I.1.4 The combination axiom
    I.1.5 Summary of quantum "rules"
    I.1.6 Observables and expected payoffs
J Common distributions

Appendix A
Linear algebra basics

A.1 Basic operations

We frequently envision or frame problems as linear systems of equations. (G. Strang, Linear Algebra and Its Applications, Harcourt Brace Jovanovich College Publishers, or Introduction to Linear Algebra, Wellesley-Cambridge Press, offers a mesmerizing discourse on linear algebra.) It is useful to write such a system compactly in matrix notation, say
\[ Ay = x \]
where $A$ is an $m \times n$ (rows $\times$ columns) matrix (a rectangular array of elements), $y$ is an $n$-element vector, and $x$ is an $m$-element vector. This statement compares the result on the left with that on the right, element by element. The operation on the left is matrix multiplication: each element is recovered as the vector inner product of the corresponding row of $A$ with the vector $y$. That is, the first element of the product vector $Ay$ is the inner product of the first row of $A$ with $y$, the second element is the inner product of the second row of $A$ with $y$, and so on. A vector inner product multiplies the same-position elements of the leading row and trailing column and sums the products. Of course, this means the operation is only well defined if the number of columns of the leading matrix, $A$, equals the number of rows of the trailing one, $y$. Further, the product matrix has the same number of rows as the leading matrix and the same number of columns as the trailing one. For example, let
\[
A=\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34}\end{bmatrix},\qquad
y=\begin{bmatrix} y_1\\ y_2\\ y_3\\ y_4\end{bmatrix},\qquad
x=\begin{bmatrix} x_1\\ x_2\\ x_3\end{bmatrix}
\]
then
\[
Ay=\begin{bmatrix} a_{11}y_1+a_{12}y_2+a_{13}y_3+a_{14}y_4\\ a_{21}y_1+a_{22}y_2+a_{23}y_3+a_{24}y_4\\ a_{31}y_1+a_{32}y_2+a_{33}y_3+a_{34}y_4\end{bmatrix}
\]
The system of equations also covers matrix addition and scalar multiplication of a matrix in the sense that we can rewrite the equations as
\[ Ay - x = 0 \]
First, multiplication by a scalar or constant simply multiplies each element of the matrix by the scalar. In this instance, we multiply the elements of $x$ by $-1$:
\[
\begin{bmatrix} a_{11}y_1+a_{12}y_2+a_{13}y_3+a_{14}y_4\\ a_{21}y_1+a_{22}y_2+a_{23}y_3+a_{24}y_4\\ a_{31}y_1+a_{32}y_2+a_{33}y_3+a_{34}y_4\end{bmatrix}
+\begin{bmatrix} -x_1\\ -x_2\\ -x_3\end{bmatrix}
=\begin{bmatrix} 0\\ 0\\ 0\end{bmatrix}
\]
Then, we add the $m$-element vector $-x$ to the $m$-element vector $Ay$, where same-position elements are summed. Again, the operation is only well defined for equal-size matrices and, unlike matrix multiplication, matrix addition always commutes. Of course, the $m$-element vector on the right (the additive identity) has all zero elements.
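These basic operations are easy to check numerically. The following minimal Python (numpy) sketch uses illustrative values for $A$, $y$, and $x$ (the numbers are not taken from the text); it simply verifies the shapes and the element-by-element comparison described above.

```python
# Basic matrix operations: multiplication, scalar multiplication, addition.
import numpy as np

A = np.array([[1., 2., 0., 1.],
              [0., 1., 3., 2.],
              [2., 0., 1., 1.]])      # m x n = 3 x 4 (illustrative values)
y = np.array([1., 2., 3., 4.])        # n-element vector
x = A @ y                             # each element is a row-of-A inner product with y

print(A @ y - x)                      # Ay - x = 0, element by element
print((A @ y) + (-1) * x)             # scalar multiplication by -1, then matrix addition
print(y @ y, np.inner(y, y))          # inner product of y with itself (a scalar)
```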
By convention, vectors are represented as columns. So how do we represent the inner product of a vector with itself? We create a row vector by transposing the original. Transposition simply puts the columns of the original into the same-position rows of the transpose. For example, $y^T y$ represents the vector inner product (the product is a scalar) of $y$ with itself, where the superscript $T$ denotes transposition:
\[
y^T y=\begin{bmatrix} y_1 & y_2 & y_3 & y_4\end{bmatrix}\begin{bmatrix} y_1\\ y_2\\ y_3\\ y_4\end{bmatrix}=y_1^2+y_2^2+y_3^2+y_4^2
\]
Similarly, we might be interested in $A^T A$:
\[
A^T A=\begin{bmatrix}
a_{11}^2+a_{21}^2+a_{31}^2 & a_{11}a_{12}+a_{21}a_{22}+a_{31}a_{32} & a_{11}a_{13}+a_{21}a_{23}+a_{31}a_{33} & a_{11}a_{14}+a_{21}a_{24}+a_{31}a_{34}\\
a_{11}a_{12}+a_{21}a_{22}+a_{31}a_{32} & a_{12}^2+a_{22}^2+a_{32}^2 & a_{12}a_{13}+a_{22}a_{23}+a_{32}a_{33} & a_{12}a_{14}+a_{22}a_{24}+a_{32}a_{34}\\
a_{11}a_{13}+a_{21}a_{23}+a_{31}a_{33} & a_{12}a_{13}+a_{22}a_{23}+a_{32}a_{33} & a_{13}^2+a_{23}^2+a_{33}^2 & a_{13}a_{14}+a_{23}a_{24}+a_{33}a_{34}\\
a_{11}a_{14}+a_{21}a_{24}+a_{31}a_{34} & a_{12}a_{14}+a_{22}a_{24}+a_{32}a_{34} & a_{13}a_{14}+a_{23}a_{24}+a_{33}a_{34} & a_{14}^2+a_{24}^2+a_{34}^2
\end{bmatrix}
\]
This yields an $n \times n$ symmetric product matrix. A matrix is symmetric if the matrix equals its transpose, $A=A^T$. Alternatively, $AA^T$ yields an $m \times m$ symmetric product matrix:
\[
AA^T=\begin{bmatrix}
a_{11}^2+a_{12}^2+a_{13}^2+a_{14}^2 & a_{11}a_{21}+a_{12}a_{22}+a_{13}a_{23}+a_{14}a_{24} & a_{11}a_{31}+a_{12}a_{32}+a_{13}a_{33}+a_{14}a_{34}\\
a_{11}a_{21}+a_{12}a_{22}+a_{13}a_{23}+a_{14}a_{24} & a_{21}^2+a_{22}^2+a_{23}^2+a_{24}^2 & a_{21}a_{31}+a_{22}a_{32}+a_{23}a_{33}+a_{24}a_{34}\\
a_{11}a_{31}+a_{12}a_{32}+a_{13}a_{33}+a_{14}a_{34} & a_{21}a_{31}+a_{22}a_{32}+a_{23}a_{33}+a_{24}a_{34} & a_{31}^2+a_{32}^2+a_{33}^2+a_{34}^2
\end{bmatrix}
\]

A.2 Fundamental theorem of linear algebra

With these basic operations in hand, return to $Ay = x$. When is there a unique solution, $y$? The answer lies in the fundamental theorem of linear algebra. The theorem has two parts.

A.2.1 Part one

First, the theorem says that for every matrix the number of linearly independent rows equals the number of linearly independent columns. Linearly independent vectors are a set of vectors such that no one of them can be duplicated by a linear combination of the other vectors in the set. A linear combination of vectors is a sum of scalar-vector products where each vector may have a different scalar multiplier. For example, $Ay$ is a linear combination of the columns of $A$ with the scalars in $y$. Therefore, if there exists some $(n-1)$-element vector $w$ such that $Bw$, where $B$ is the $m \times (n-1)$ submatrix of $A$ with one column dropped, reproduces the dropped column of $A$, then the dropped column is not linearly independent of the other columns.

To reiterate, if the matrix $A$ has $r$ linearly independent columns it also has $r$ linearly independent rows. $r$ is referred to as the rank of the matrix and is the dimension of the rowspace and of the columnspace (the spaces spanned by all possible linear combinations of the rows and columns, respectively). Further, any $r$ linearly independent rows of $A$ form a basis for its rowspace and any $r$ linearly independent columns of $A$ form a basis for its columnspace.

Accounting example

Consider an incidence matrix describing the journal entry properties of accounting in its columns (each column has one $+1$ and one $-1$ with the remaining elements equal to zero) and T-accounts in its rows. The rows capture changes in account balances when multiplied by a transaction amounts vector $y$. By convention, we assign $+1$ for a debit entry and $-1$ for a credit entry. Suppose
\[
A=\begin{bmatrix}
 1 &  1 & -1 &  1 &  0 &  0\\
-1 &  0 &  0 &  0 &  1 &  0\\
 0 & -1 &  0 &  0 &  0 &  1\\
 0 &  0 &  1 & -1 & -1 & -1
\end{bmatrix}
\]
where the rows represent cash, noncash assets, liabilities, and owners' equity, for instance. Notice, $-1$ times the sum of any three rows produces the remaining row. Since no row can be produced from only two of the remaining rows, the number of linearly independent rows is 3.
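A quick numerical check of part one of the theorem, using the incidence matrix as written above (a minimal numpy sketch):

```python
# Row rank equals column rank: both are 3 for the accounting incidence matrix.
import numpy as np

A = np.array([[ 1,  1, -1,  1,  0,  0],
              [-1,  0,  0,  0,  1,  0],
              [ 0, -1,  0,  0,  0,  1],
              [ 0,  0,  1, -1, -1, -1]], dtype=float)

print(np.linalg.matrix_rank(A))     # 3
print(np.linalg.matrix_rank(A.T))   # 3, the same by the fundamental theorem
```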
By the fundamental theorem, the number of linearly independent columns must also be 3. Let's check. Suppose the first three columns form a basis for the columnspace. Column 4 is the negative of column 3, column 5 is the negative of the sum of columns 1 and 3, and column 6 is the negative of the sum of columns 2 and 3. Can any of columns 1, 2, and 3 be produced as a linear combination of the remaining two? No, the zeroes in rows 2 through 4 rule it out. For this matrix, we've confirmed the number of linearly independent rows and columns is the same.

A.2.2 Part two

The second part of the fundamental theorem describes the orthogonal complements to the rowspace and columnspace. Two vectors are orthogonal if they are perpendicular to one another. As their vector inner product is proportional to the cosine of the angle between them, the vectors are orthogonal if their inner product is zero. (The inner product of a vector with itself is the squared length, or squared norm, of the vector.) $n$-space is spanned by the rowspace (with dimension $r$) plus its $(n-r)$-dimensional orthogonal complement, the nullspace, where
\[ A N^T = 0 \]
$N$ is an $(n-r) \times n$ matrix whose rows are orthogonal to the rows of $A$ and $0$ is an $m \times (n-r)$ matrix of zeroes.

Accounting example continued

For the $A$ matrix above, a basis for the nullspace is
\[
N=\begin{bmatrix}
-1 & 1 & 0 & 0 & -1 & 1\\
 0 & 1 & 1 & 0 &  0 & 1\\
 0 & 0 & 1 & 1 &  0 & 0
\end{bmatrix}
\]
and $AN^T$ is a $4 \times 3$ matrix of zeroes.

Similarly, $m$-space is spanned by the columnspace (with dimension $r$) plus its $(m-r)$-dimensional orthogonal complement, the left nullspace, where
\[ (LN)^T A = 0 \]
$LN$ is an $m \times (m-r)$ matrix whose columns are orthogonal to the columns of $A$ and $0$ is an $(m-r) \times n$ matrix of zeroes. The origin is the only point common to the four subspaces: rowspace, columnspace, nullspace, and left nullspace. For the accounting example,
\[
LN=\begin{bmatrix} 1\\ 1\\ 1\\ 1\end{bmatrix}
\qquad\text{and}\qquad
(LN)^T A=\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0\end{bmatrix}
\]

A.3 Nature of the solution

A.3.1 Exactly-identified

If $r = m = n$, then there is a unique solution, $y$, to $By = x$, and the problem is said to be exactly-identified (here both $y$ and $x$ are $r$-element vectors). Since $B$ is square and has a full set of linearly independent rows and columns, the rows and columns of $B$ span $r$-space (including $x$) and the two nullspaces have dimension zero. Consequently, there exists a matrix $B^{-1}$, the inverse of $B$, which when multiplied by $B$ produces the identity matrix, $I$. The identity matrix, when multiplied (on the left or on the right) by any conformable vector or matrix, leaves that vector or matrix unchanged; it is a square matrix with ones along the principal diagonal and zeroes on the off-diagonals,
\[
I=\begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & 1\end{bmatrix}
\]
Hence,
\[ B^{-1}By = B^{-1}x,\qquad Iy = B^{-1}x,\qquad y = B^{-1}x \]
Suppose
\[
B=\begin{bmatrix} 3 & 1\\ 2 & 4\end{bmatrix}
\qquad\text{and}\qquad
x=\begin{bmatrix} 6\\ 5\end{bmatrix}
\]
then
\[
y=B^{-1}x=\frac{1}{10}\begin{bmatrix} 4 & -1\\ -2 & 3\end{bmatrix}\begin{bmatrix} 6\\ 5\end{bmatrix}
=\begin{bmatrix} \frac{24-5}{10}\\[2pt] \frac{-12+15}{10}\end{bmatrix}
=\begin{bmatrix} \frac{19}{10}\\[2pt] \frac{3}{10}\end{bmatrix}
\]
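A short numpy sketch of the exactly-identified case above; np.linalg.solve carries out the Gaussian-elimination solve directly, without forming the inverse explicitly.

```python
# Exactly-identified system: y = B^{-1} x.
import numpy as np

B = np.array([[3., 1.],
              [2., 4.]])
x = np.array([6., 5.])

print(np.linalg.inv(B))        # (1/10) [[4, -1], [-2, 3]]
print(np.linalg.solve(B, x))   # [1.9, 0.3] = (19/10, 3/10)
```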
A.3.2 Under-identified

However, it is more common that $r \le m, n$ with at least one inequality strict. In this case, spanning $m$-space draws upon both the columnspace and the left nullspace, and spanning $n$-space draws upon both the rowspace and the nullspace. If the dimension of the nullspace is greater than zero, then typically there are many solutions $y$ that satisfy $Ay = x$. On the other hand, if the dimension of the left nullspace is positive and the dimension of the nullspace is zero, then typically there is no exact solution $y$ to $Ay = x$.

When $r < n$, the problem is said to be under-identified (there are more unknown parameters than equations) and a complete set of solutions can be described by the solution that lies entirely in the rows of $A$ (often called the row component, as it is a linear combination of the rows) plus arbitrary weights on the nullspace of $A$. The row component, $y^{RS(A)}$, can be found by projecting any consistent solution, $y^p$, onto a basis for the rows (any linearly independent set of $r$ rows) of $A$. Let $A^r$ be a submatrix of $A$ with $r$ linearly independent rows. Then,
\[
y^{RS(A)}=(A^r)^T\left[A^r (A^r)^T\right]^{-1} A^r\, y^p = P_{A^r}\, y^p
\qquad\text{and}\qquad
y = y^{RS(A)} + N^T k
\]
where $P_{A^r}=(A^r)^T\left[A^r(A^r)^T\right]^{-1}A^r$ is the projection matrix onto the rows of $A$ and $k$ is an $(n-r)$-element vector of arbitrary weights.

Return to our accounting example above. Suppose the changes in account balances are
\[ x=\begin{bmatrix} 2\\ 1\\ -1\\ -2\end{bmatrix} \]
Then a particular solution can be found by setting, for example, the last three elements of $y$ equal to zero and solving for the remaining elements,
\[ y^p=\begin{bmatrix} -1\\ 1\\ -2\\ 0\\ 0\\ 0\end{bmatrix} \]
so that $A y^p = x$. Let $A^r$ be the first three rows of $A$,
\[
A^r=\begin{bmatrix}
 1 &  1 & -1 & 1 & 0 & 0\\
-1 &  0 &  0 & 0 & 1 & 0\\
 0 & -1 &  0 & 0 & 0 & 1
\end{bmatrix},
\qquad
P_{A^r}=\frac{1}{12}\begin{bmatrix}
 7 &  1 & -2 &  2 & -5 &  1\\
 1 &  7 & -2 &  2 &  1 & -5\\
-2 & -2 &  4 & -4 & -2 & -2\\
 2 &  2 & -4 &  4 &  2 &  2\\
-5 &  1 & -2 &  2 &  7 &  1\\
 1 & -5 & -2 &  2 &  1 &  7
\end{bmatrix}
\]
and
\[
y^{RS(A)}=P_{A^r}\,y^p=\frac{1}{6}\begin{bmatrix} -1\\ 5\\ -4\\ 4\\ 5\\ -1\end{bmatrix}
\]
The complete solution, with arbitrary weights $k=(k_1,k_2,k_3)^T$, is
\[
y = y^{RS(A)} + N^T k
= \frac{1}{6}\begin{bmatrix} -1\\ 5\\ -4\\ 4\\ 5\\ -1\end{bmatrix}
+ \begin{bmatrix} -k_1\\ k_1+k_2\\ k_2+k_3\\ k_3\\ -k_1\\ k_1+k_2\end{bmatrix}
\]
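The under-identified calculation can be reproduced numerically. In the numpy sketch below, both the pseudo-inverse and lstsq return the minimum-norm (row-space) component for the values used above, and scipy's null_space supplies an orthonormal basis for the nullspace (a different, but equivalent, basis than $N$).

```python
# Row-space component and nullspace of the under-identified accounting system.
import numpy as np
from scipy.linalg import null_space

A = np.array([[ 1,  1, -1,  1,  0,  0],
              [-1,  0,  0,  0,  1,  0],
              [ 0, -1,  0,  0,  0,  1],
              [ 0,  0,  1, -1, -1, -1]], dtype=float)
x = np.array([2., 1., -1., -2.])

y_rs = np.linalg.pinv(A) @ x                       # A^+ x, the row-space component
print(np.round(6 * y_rs, 6))                       # [-1, 5, -4, 4, 5, -1]
print(np.round(6 * np.linalg.lstsq(A, x, rcond=None)[0], 6))   # same minimum-norm solution

Nbasis = null_space(A)                             # 6 x 3 orthonormal nullspace basis
print(np.allclose(A @ Nbasis, 0))                  # True: any N^T k can be added to y_rs
```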
A.3.3 Over-identified

In the case where there is no exact solution, $m > r = n$, the vector that lies entirely in the columns of $A$ and is nearest to $x$ is frequently identified as the best approximation. This case is said to be over-identified (there are more equations than unknown parameters) and this best approximation is the column component, $y^{CS(A)}$, found by projecting $x$ onto the columns of $A$.

A common variation on this theme is described by
\[ Y = X\beta \]
where $Y$ is an $n$-element vector and $X$ is an $n \times p$ matrix. Typically, no exact solution for $\beta$ (a $p$-element vector) exists, $p = r$ ($X$ is composed of linearly independent columns), and
\[ b=\beta^{CS(X)}=\left(X^T X\right)^{-1}X^T Y \]
is known as the ordinary least squares (OLS) estimator of $\beta$, and the estimated conditional expectation function is the projection of $Y$ onto the columns of $X$,
\[ Xb = X\left(X^T X\right)^{-1}X^T Y = P_X Y \]
For example, let $X=(A^r)^T$ and $Y=y^p$. Then
\[
b=\frac{1}{6}\begin{bmatrix} 4\\ 5\\ -1\end{bmatrix}
\qquad\text{and}\qquad
Xb = P_X Y = y^{RS(A)} = \frac{1}{6}\begin{bmatrix} -1\\ 5\\ -4\\ 4\\ 5\\ -1\end{bmatrix}
\]
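A hedged numpy sketch of the OLS calculation just described, reusing $X=(A^r)^T$ and $Y=y^p$ with the values as reconstructed above:

```python
# OLS coefficients and the projection P_X Y for the accounting example.
import numpy as np

X = np.array([[ 1, -1,  0],
              [ 1,  0, -1],
              [-1,  0,  0],
              [ 1,  0,  0],
              [ 0,  1,  0],
              [ 0,  0,  1]], dtype=float)      # (A^r)^T
Y = np.array([-1., 1., -2., 0., 0., 0.])       # particular solution y^p

b = np.linalg.solve(X.T @ X, X.T @ Y)          # b = (X'X)^{-1} X'Y
print(np.round(6 * b, 6))                      # [4, 5, -1]
print(np.round(6 * (X @ b), 6))                # P_X Y = y^{RS(A)} = [-1, 5, -4, 4, 5, -1]/6 * 6
```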
A.4 Matrix decomposition and inverse operations

Inverse operations are inherently related to the fundamental theorem and to matrix decomposition. There are a number of important decompositions; we'll focus on four: LU factorization, Cholesky decomposition, singular value decomposition, and spectral decomposition.

A.4.1 LU factorization

Gaussian elimination is the key to solving systems of linear equations and gives us LU decomposition.

Nonsingular case

Any square, nonsingular matrix $A$ (one with linearly independent rows and columns) can be written as the product of a lower triangular matrix, $L$, times an upper triangular matrix, $U$ (the general case of an $m \times n$ matrix is discussed below):
\[ A = LU \]
where $L$ is lower triangular, meaning it has all zero elements above the main diagonal, and $U$ is upper triangular, meaning it has all zero elements below the main diagonal. Gaussian elimination says we can write any system of linear equations in triangular form so that by backward recursion we solve a series of one-equation, one-variable problems. This is accomplished by row operations: row eliminations and row exchanges. A row elimination adds a scalar multiple of one row to a target row to produce a revised target row, repeated until a triangular matrix, $L$ or $U$, is generated. As the same operation is applied to both sides (the same rows of $A$ and $x$), equality is maintained. Row exchanges simply revise the order of both sides (rows of $A$ and elements of $x$) to preserve equality; of course, the order in which equations are written is flexible. In principle, then, Gaussian elimination on $Ay=x$ involves, for instance, multiplication of both sides by the inverse of $L$, provided the inverse exists ($m=r$):
\[ L^{-1}Ay = L^{-1}x,\qquad L^{-1}LUy = L^{-1}x,\qquad Uy = L^{-1}x \]
As Gaussian elimination is straightforward, we have a simple approach for finding whether the inverse of the lower triangular matrix exists and, if so, its elements. Of course, we can identify $L$ in similar fashion, $L = AU^{-1} = LUU^{-1}$.

Let $A = A^r (A^r)^T$, the $3\times 3$ full-rank matrix from the accounting example above:
\[
A=\begin{bmatrix} 4 & -1 & -1\\ -1 & 2 & 0\\ -1 & 0 & 2\end{bmatrix}
\]
We find $U$ by Gaussian elimination. Multiply row 1 by $1/4$ and add to rows 2 and 3 to revise rows 2 and 3 as follows:
\[
\begin{bmatrix} 4 & -1 & -1\\ 0 & 7/4 & -1/4\\ 0 & -1/4 & 7/4\end{bmatrix}
\]
Now multiply row 2 by $1/7$ and add this result to row 3 to identify $U$:
\[
U=\begin{bmatrix} 4 & -1 & -1\\ 0 & 7/4 & -1/4\\ 0 & 0 & 12/7\end{bmatrix}
\]
Notice we have constructed $L^{-1}$ in the process,
\[
L^{-1}=\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1/7 & 1\end{bmatrix}
\begin{bmatrix} 1 & 0 & 0\\ 1/4 & 1 & 0\\ 1/4 & 0 & 1\end{bmatrix}
=\begin{bmatrix} 1 & 0 & 0\\ 1/4 & 1 & 0\\ 2/7 & 1/7 & 1\end{bmatrix}
\]
so that $L^{-1}A = U$. We also have $L$ in hand by solving $L^{-1}L = I$ element by element. From the first row-first column, $\ell_{11}=1$. From the second row-first column, $\tfrac14\ell_{11}+\ell_{21}=0$, or $\ell_{21}=-\tfrac14$. From the third row-first column, $\tfrac27\ell_{11}+\tfrac17\ell_{21}+\ell_{31}=0$, or $\ell_{31}=-\tfrac27+\tfrac1{28}=-\tfrac14$. From the second row-second column, $\ell_{22}=1$. From the third row-second column, $\tfrac17\ell_{22}+\ell_{32}=0$, or $\ell_{32}=-\tfrac17$. And from the third row-third column, $\ell_{33}=1$. Hence,
\[
L=\begin{bmatrix} 1 & 0 & 0\\ -1/4 & 1 & 0\\ -1/4 & -1/7 & 1\end{bmatrix}
\qquad\text{and}\qquad LU = A
\]
For $x=\begin{bmatrix} -1 & 3 & 5\end{bmatrix}^T$ the solution to $Ay=x$ follows from
\[
Uy = L^{-1}x=\begin{bmatrix} -1\\ 11/4\\ 36/7\end{bmatrix}
\]
Backward recursive substitution solves for $y$. From row three, $\tfrac{12}{7}y_3=\tfrac{36}{7}$, or $y_3=\tfrac{7}{12}\cdot\tfrac{36}{7}=3$. From row two, $\tfrac74 y_2-\tfrac14 y_3=\tfrac{11}{4}$, or $y_2=\tfrac47\left(\tfrac{11}{4}+\tfrac34\right)=2$. And from row one, $4y_1-y_2-y_3=-1$, or $y_1=\tfrac14(-1+2+3)=1$. Hence $y=\begin{bmatrix} 1 & 2 & 3\end{bmatrix}^T$.

General case

If the inverse of $A$ doesn't exist (the matrix is singular), we find some equations after elimination read $0=0$ and, possibly, some elements of $y$ are not uniquely determinable, as discussed above for the under-identified case. For an $m \times n$ matrix $A$, the general form of LU factorization may involve row exchanges via a permutation matrix $P$:
\[ PA = LU \]
where $L$ is lower triangular with ones on the diagonal and $U$ is an $m \times n$ upper echelon matrix with the pivots along its main diagonal. LU decomposition can also be written as the LDU factorization where, as before, $L$ and $U$ are lower and upper triangular matrices but now both have ones along their diagonals and $D$ is a diagonal matrix with the pivots of $A$ along its diagonal.

Returning to the accounting example, we utilize LU factorization to solve $Ay=x$ for $y$, a set of transaction amounts consistent with the financial statements. For this $A$ matrix, $P=I_4$ (no row exchanges are called for); row one is added to row two, the revised row two is added to row three, and the revised row three is added to row four, which gives
\[
\begin{bmatrix}
1 & 1 & -1 & 1 & 0 & 0\\
0 & 1 & -1 & 1 & 1 & 0\\
0 & 0 & -1 & 1 & 1 & 1\\
0 & 0 &  0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} y_1\\ y_2\\ y_3\\ y_4\\ y_5\\ y_6\end{bmatrix}
=\begin{bmatrix} 2\\ 3\\ 2\\ 0\end{bmatrix}
\]
The last row conveys no information and the third row indicates we have three free variables. Recall, for our particular solution $y^p$ above, we set $y_4=y_5=y_6=0$ and solved: from row three, $y_3=-2$; from row two, $y_2=3+y_3=1$; and from row one, $y_1=2-y_2+y_3=-1$. Hence $y^p=\begin{bmatrix} -1 & 1 & -2 & 0 & 0 & 0\end{bmatrix}^T$.
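The nonsingular LU example can be verified with scipy, which returns the permutation, the unit lower triangular factor, and the upper triangular factor directly (a minimal sketch):

```python
# LU factorization of A = A^r (A^r)^T and the triangular solve for y.
import numpy as np
from scipy.linalg import lu

A = np.array([[ 4., -1., -1.],
              [-1.,  2.,  0.],
              [-1.,  0.,  2.]])

P, L, U = lu(A)                # A = P L U; here no row exchanges occur, so P = I
print(L)                       # multipliers -1/4, -1/4, -1/7 below the unit diagonal
print(U)                       # pivots 4, 7/4, 12/7 on the diagonal

x = np.array([-1., 3., 5.])
print(np.linalg.solve(A, x))   # [1, 2, 3]
```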
A.4.2 Cholesky decomposition

If the matrix $A$ is symmetric and positive definite (a matrix $A$ is positive definite if its quadratic form is strictly positive, $x^T A x > 0$, for all nonzero $x$) as well as nonsingular, then we have
\[ A = LDL^T \]
since the upper triangular factor in the LDU form equals $L^T$. In this symmetric case we identify another useful factorization, the Cholesky decomposition. The Cholesky decomposition writes
\[ A = \Gamma\Gamma^T \]
where $\Gamma = LD^{1/2}$ and $D^{1/2}$ has the square roots of the pivots on its diagonal. Since $A$ is positive definite, all of its pivots are positive and their square roots are real, so $\Gamma$, in turn, is real. Of course, we now have
\[ A^{-1}=\left(\Gamma\Gamma^T\right)^{-1}=\Gamma^{-T}\Gamma^{-1}
\qquad\text{or}\qquad \Gamma^{-1}A\,\Gamma^{-T}=I \]
For the example above, $A = A^r(A^r)^T$ is symmetric positive definite and we found $A = LU$. Factoring the pivots from $U$ gives $D$ and $LDL^T$:
\[
A = LDL^T=
\begin{bmatrix} 1 & 0 & 0\\ -1/4 & 1 & 0\\ -1/4 & -1/7 & 1\end{bmatrix}
\begin{bmatrix} 4 & 0 & 0\\ 0 & 7/4 & 0\\ 0 & 0 & 12/7\end{bmatrix}
\begin{bmatrix} 1 & -1/4 & -1/4\\ 0 & 1 & -1/7\\ 0 & 0 & 1\end{bmatrix}
\]
And the Cholesky factor is
\[
\Gamma = LD^{1/2}=
\begin{bmatrix} 1 & 0 & 0\\ -1/4 & 1 & 0\\ -1/4 & -1/7 & 1\end{bmatrix}
\begin{bmatrix} 2 & 0 & 0\\ 0 & \frac{\sqrt 7}{2} & 0\\ 0 & 0 & 2\sqrt{\tfrac37}\end{bmatrix}
=\begin{bmatrix} 2 & 0 & 0\\ -\tfrac12 & \frac{\sqrt 7}{2} & 0\\ -\tfrac12 & -\frac{1}{2\sqrt 7} & 2\sqrt{\tfrac37}\end{bmatrix}
\]
so that $A = \Gamma\Gamma^T$.
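numpy's Cholesky routine returns the same lower triangular factor $\Gamma$ (a short check):

```python
# Cholesky factor of the symmetric positive definite matrix A above.
import numpy as np

A = np.array([[ 4., -1., -1.],
              [-1.,  2.,  0.],
              [-1.,  0.,  2.]])

G = np.linalg.cholesky(A)          # lower triangular Gamma with A = Gamma Gamma^T
print(np.round(G, 4))              # [[2, 0, 0], [-0.5, 1.3229, 0], [-0.5, -0.189, 1.3093]]
print(np.allclose(G @ G.T, A))     # True
```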
A.4.3 Singular value decomposition

Now we introduce a matrix factorization that exists for every matrix. Singular value decomposition (SVD) says every $m \times n$ matrix $A$ can be written as the product of an $m \times m$ orthogonal matrix $U$, a diagonal $m \times n$ matrix $\Sigma$, and the transpose of an $n \times n$ orthogonal matrix $V$. (An orthogonal, or unitary, matrix is comprised of orthonormal vectors, that is, mutually orthogonal, unit-length vectors, so that $U^{-1}=U^T$ and $UU^T=U^TU=I$.) $U$ is composed of the eigenvectors of $AA^T$, $V$ is composed of the eigenvectors of $A^TA$, and $\Sigma$ contains the singular values (the square roots of the eigenvalues of $AA^T$ or $A^TA$) along its diagonal:
\[ A = U\Sigma V^T \]
Further, singular value decomposition allows us to define a general inverse or pseudo-inverse, $A^+$:
\[ A^+ = V\Sigma^+U^T \]
where $\Sigma^+$ is an $n \times m$ diagonal matrix whose nonzero elements are the reciprocals of those of $\Sigma$. This implies
\[ AA^+A=A,\qquad A^+AA^+=A^+,\qquad \left(A^+A\right)^T=A^+A,\qquad \left(AA^+\right)^T=AA^+ \]
Also, for the system of equations $Ay=x$ the least squares solution is
\[ y^{CS(A)}=A^+x \]
and $AA^+$ is always the projection onto the columns of $A$. Hence,
\[ AA^+=P_A=A\left(A^TA\right)^{-1}A^T \]
if $A$ has linearly independent columns. Or,
\[ A^T\left(A^T\right)^+ = A^+A = P_{A^T}=A^T\left(AA^T\right)^{-1}A \]
if $A$ has linearly independent rows (that is, if $A^T$ has linearly independent columns).

For the accounting example, recall the row component is the consistent solution to $Ay=x$ that is a linear combination of the rows of $A$ alone; that is, it is orthogonal to the nullspace. Utilizing the pseudo-inverse we have
\[ y^{RS(A)}=P_{A^T}\,y^p=(A^r)^T\left[A^r(A^r)^T\right]^{-1}A^r\,y^p=A^+Ay^p \]
or simply, since $Ay^p=x$,
\[
y^{RS(A)}=A^+x=\frac{1}{24}\begin{bmatrix}
 5 & -9 &  3 &  1\\
 5 &  3 & -9 &  1\\
-4 &  0 &  0 &  4\\
 4 &  0 &  0 & -4\\
-1 &  9 & -3 & -5\\
-1 & -3 &  9 & -5
\end{bmatrix}
\begin{bmatrix} 2\\ 1\\ -1\\ -2\end{bmatrix}
=\frac{1}{6}\begin{bmatrix} -1\\ 5\\ -4\\ 4\\ 5\\ -1\end{bmatrix}
\]
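The SVD and the pseudo-inverse solution can be checked with numpy (a minimal sketch using the accounting matrix as given above):

```python
# SVD of the incidence matrix and the pseudo-inverse solution A^+ x.
import numpy as np

A = np.array([[ 1,  1, -1,  1,  0,  0],
              [-1,  0,  0,  0,  1,  0],
              [ 0, -1,  0,  0,  0,  1],
              [ 0,  0,  1, -1, -1, -1]], dtype=float)
x = np.array([2., 1., -1., -2.])

U, s, Vt = np.linalg.svd(A)
print(np.round(s**2, 8))                              # squared singular values: 6, 4, 2, 0
print(np.allclose(U @ np.diag(s) @ Vt[:4, :], A))     # A = U Sigma V^T

A_plus = np.linalg.pinv(A)
print(np.round(24 * A_plus, 6))                       # the (1/24) matrix displayed above
print(np.round(6 * (A_plus @ x), 6))                  # [-1, 5, -4, 4, 5, -1]
```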
The beauty of singular value decomposition is that any $m \times n$ matrix $A$ can be factored as $AV=U\Sigma$, since $AVV^T=A=U\Sigma V^T$, where $U$ and $V$ are $m \times m$ and $n \times n$ orthogonal matrices (of eigenvectors), respectively, and $\Sigma$ is an $m \times n$ matrix with the singular values along its main diagonal. Eigenvalues are the characteristic values of a square matrix and eigenvectors are its characteristic vectors, such that
\[ AA^Tu=\lambda u \qquad\text{or}\qquad A^TAv=\lambda v \]
where $u$ is an $m$-element unit-length ($u^Tu=1$) eigenvector (a column of $U$), $v$ is an $n$-element unit-length ($v^Tv=1$) eigenvector (a column of $V$), and $\lambda$ is an eigenvalue of $AA^T$ or $A^TA$. We can write $AA^Tu=\lambda u$ as
\[ \left(AA^T-\lambda I\right)u=0 \]
or $A^TAv=\lambda v$ as $\left(A^TA-\lambda I\right)v=0$, then solve for unit-length vectors $u$ and $v$ and roots $\lambda$. For instance, once we have $\lambda_i$ and $u_i$ in hand, we find $v_i$ from $u_i^TA=\sigma_i v_i^T$, with $\sigma_i=\sqrt{\lambda_i}$, such that $v_i$ is unit length, $v_i^Tv_i=1$.

The sum of the eigenvalues equals the trace of the matrix (the sum of its main diagonal elements) and the product of the eigenvalues equals the determinant of the matrix. A singular matrix has some zero eigenvalues and zero pivots ($\det(A)$ = product of the pivots), hence the determinant of a singular matrix is zero. (The determinant is a value associated with a square matrix with many, some useful, properties. For instance, the determinant provides a test of invertibility, that is, of linear independence: if $\det(A)=0$ the matrix is singular and the inverse doesn't exist; otherwise $\det(A)\neq 0$, the matrix is nonsingular and the inverse exists. The determinant is the volume of a parallelepiped in $n$ dimensions whose edges come from the rows of $A$. The determinant of a triangular matrix is the product of its main diagonal elements. Determinants are unchanged by row eliminations and their sign is changed by row exchanges. The determinant of the transpose of a matrix equals the determinant of the matrix, $\det(A)=\det(A^T)$. The determinant of a product of matrices is the product of their determinants, $\det(AB)=\det(A)\det(B)$. Some useful determinant identities are reported in section A.6.) The eigenvalues can be found by solving $\det\left(AA^T-\lambda I\right)=0$. Since this is an $m$-order polynomial, there are $m$ eigenvalues associated with an $m \times m$ matrix.

Accounting example

Return to the accounting example for an illustration. The singular value decomposition of $A$ proceeds as follows. We'll work with the square, symmetric matrix $AA^T$. Notice, by SVD,
\[ AA^T=U\Sigma V^T\left(U\Sigma V^T\right)^T=U\Sigma V^TV\Sigma^TU^T=U\Sigma\Sigma^TU^T \]
so that the eigenvalues of $AA^T$ are the squared singular values of $A$, $\Lambda=\Sigma\Sigma^T$. The eigenvalues are found by solving for the roots of
\[
\det\left(AA^T-\lambda I_4\right)=\det\begin{bmatrix}
4-\lambda & -1 & -1 & -2\\
-1 & 2-\lambda & 0 & -1\\
-1 & 0 & 2-\lambda & -1\\
-2 & -1 & -1 & 4-\lambda
\end{bmatrix}
=\lambda^4-12\lambda^3+44\lambda^2-48\lambda=0
\]
(In section A.6.1 below we show how to find the determinant of a square matrix and illustrate with this example.) Immediately we see that one of the roots is zero, and factoring,
\[ \lambda(\lambda-2)(\lambda-4)(\lambda-6)=0 \]
or $\lambda=\{6,4,2,0\}$ for $AA^T$. (For $\det\left(A^TA-\lambda I_6\right)=0$ we have $-48\lambda^3+44\lambda^4-12\lambda^5+\lambda^6=0$; hence there are at least three zero roots, and otherwise the roots are the same as for $AA^T$: clearly $\lambda=\{6,4,2,0,0,0\}$ for $A^TA$.)

The eigenvectors of $AA^T$ are found by solving (employing Gaussian elimination and back substitution)
\[ \left(AA^T-\lambda_i I_4\right)u_i=0 \]
Since there is freedom in the solution, we can make the vectors orthonormal (see the Gram-Schmidt discussion below). For instance, $\left(AA^T-6I_4\right)u_1=0$ leads to $u_1^T=\begin{bmatrix} a & 0 & 0 & -a\end{bmatrix}$, so we make $a=\tfrac{1}{\sqrt 2}$ and $u_1^T=\begin{bmatrix} \tfrac{1}{\sqrt 2} & 0 & 0 & -\tfrac{1}{\sqrt 2}\end{bmatrix}$. Now the complementary right-hand-side singular vector, $v_1$, is found from $u_1^TA=\sqrt 6\, v_1^T$:
\[
v_1=\frac{1}{\sqrt 6}A^Tu_1
=\begin{bmatrix} \tfrac{1}{2\sqrt 3} & \tfrac{1}{2\sqrt 3} & -\tfrac{1}{\sqrt 3} & \tfrac{1}{\sqrt 3} & \tfrac{1}{2\sqrt 3} & \tfrac{1}{2\sqrt 3}\end{bmatrix}^T
\]
Repeating these steps for the remaining eigenvalues (in descending order; remember it is important to match eigenvectors with eigenvalues) leads to
\[
U=\begin{bmatrix}
\tfrac{1}{\sqrt 2} & \tfrac12 & 0 & \tfrac12\\
0 & -\tfrac12 & \tfrac{1}{\sqrt 2} & \tfrac12\\
0 & -\tfrac12 & -\tfrac{1}{\sqrt 2} & \tfrac12\\
-\tfrac{1}{\sqrt 2} & \tfrac12 & 0 & \tfrac12
\end{bmatrix}
\]
and $V=\begin{bmatrix} v_1 & v_2 & v_3 & V_0\end{bmatrix}$ with
\[
v_2=\tfrac12\begin{bmatrix} 1 & 1 & 0 & 0 & -1 & -1\end{bmatrix}^T,\qquad
v_3=\tfrac12\begin{bmatrix} -1 & 1 & 0 & 0 & 1 & -1\end{bmatrix}^T,
\]
and $V_0$ containing three orthonormal columns spanning the nullspace of $A$ (there are many choices for the eigenvectors associated with the zero eigenvalues; we select them so that they are orthonormal and, as with the other eigenvectors, the choice is not unique). Then $UU^T=U^TU=I_4$ and $VV^T=V^TV=I_6$, and, remarkably,
\[ A=U\Sigma V^T \]
where $\Sigma$ is $m \times n$ ($4 \times 6$) with the square roots of the eigenvalues (in descending order), $\sqrt 6,\,2,\,\sqrt 2,\,0$, on its main diagonal and zeroes elsewhere.

A.4.4 Spectral decomposition

When $A$ is a square, symmetric matrix, singular value decomposition can be expressed as spectral decomposition,
\[ A=U\Lambda U^T \]
where $U$ is an orthogonal matrix. Notice, the matrix on the right is the transpose of the matrix on the left. This follows as $AA^T=A^TA$ when $A=A^T$. We illustrated this above when we decomposed $AA^T$, a square symmetric matrix:
\[ AA^T=U\Lambda U^T,\qquad \Lambda=\mathrm{diag}(6,4,2,0), \]
with $U$ as above.

A.4.5 Quadratic forms, eigenvalues, and positive definiteness

A symmetric matrix $A$ is positive definite if the quadratic form $x^TAx$ is positive for every nonzero $x$. Positive semi-definiteness follows if the quadratic form is non-negative, $x^TAx\ge 0$, for every nonzero $x$. Negative definite and negative semi-definite symmetric matrices follow in analogous fashion where the quadratic form is negative or non-positive, respectively.

A positive (semi-) definite matrix has positive (non-negative) eigenvalues. This result follows immediately from spectral decomposition. Let $y=Qx$ ($y$ is arbitrary since $x$ is) and write the spectral decomposition of $A$ as $Q^T\Lambda Q$, where $Q$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix composed of the eigenvalues of $A$. Then the quadratic form $x^TAx>0$ can be written as $x^TQ^T\Lambda Qx>0$ or $y^T\Lambda y>0$. Clearly, this holds for all nonzero $y$ only if the eigenvalues in $\Lambda$ are all positive.
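For symmetric matrices, numpy's eigh routine delivers the spectral decomposition directly; the sketch below applies it to $AA^T$ and checks orthogonality of the eigenvector matrix and non-negativity of the quadratic form.

```python
# Spectral decomposition of the symmetric matrix AA^T.
import numpy as np

A = np.array([[ 1,  1, -1,  1,  0,  0],
              [-1,  0,  0,  0,  1,  0],
              [ 0, -1,  0,  0,  0,  1],
              [ 0,  0,  1, -1, -1, -1]], dtype=float)
S = A @ A.T                                      # symmetric, positive semi-definite

lam, Q = np.linalg.eigh(S)                       # eigenvalues in ascending order
print(np.round(lam, 8))                          # [0, 2, 4, 6]
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))    # S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))           # Q is orthogonal

z = np.random.default_rng(0).normal(size=4)      # arbitrary nonzero vector
print(z @ S @ z >= -1e-12)                       # quadratic form is non-negative
```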
A.4.6 Similar matrices and Jordan form

Now we provide some support for the properties associated with eigenvalues claimed above: for any square matrix, the sum of the eigenvalues equals the trace of the matrix and the product of the eigenvalues equals its determinant. To aid the discussion we first develop the idea of similar matrices and the Jordan form of a matrix.

Two matrices, $A$ and $B$, are similar if there exists an invertible $M$ such that $B=M^{-1}AM$. Similar matrices have the same eigenvalues, as seen from $Ax=\lambda x$ where $x$ is an eigenvector of $A$ associated with $\lambda$. Since $MB=AM$, we have
\[ MBM^{-1}x=\lambda x,\qquad M^{-1}MBM^{-1}x=\lambda M^{-1}x,\qquad B\left(M^{-1}x\right)=\lambda\left(M^{-1}x\right) \]
Hence $A$ and $B$ have the same eigenvalues, where $x$ is an eigenvector of $A$ and $M^{-1}x$ is the corresponding eigenvector of $B$. From here we can see $A$ and $B$ have the same trace and determinant.

First, we demonstrate, via example, that the trace of a matrix equals the sum of its eigenvalues, $\sum_i\lambda_i=\mathrm{tr}(A)$, for any square matrix $A$. Consider $A=\begin{bmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\end{bmatrix}$ where $\mathrm{tr}(A)=a_{11}+a_{22}$. The eigenvalues of $A$ are determined by solving $\det(A-\lambda I)=0$:
\[ (a_{11}-\lambda)(a_{22}-\lambda)-a_{12}a_{21}=\lambda^2-(a_{11}+a_{22})\lambda+a_{11}a_{22}-a_{12}a_{21}=0 \]
The two roots or eigenvalues are
\[ \lambda=\frac{(a_{11}+a_{22})\pm\sqrt{(a_{11}+a_{22})^2-4(a_{11}a_{22}-a_{12}a_{21})}}{2} \]
and their sum is $\lambda_1+\lambda_2=a_{11}+a_{22}=\mathrm{tr}(A)$. The idea extends to any square matrix $A$: in $\det(A-\lambda I)$, the coefficient on the $\lambda^{n-1}$ term equals minus the coefficient on $\lambda^n$ times $\sum_i\lambda_i$, as in the $2\times 2$ example above. (For $n$ even the coefficient on $\lambda^n$ is $1$ and for $n$ odd it is $-1$, with the coefficient on $\lambda^{n-1}$ of opposite sign.)

We demonstrate the determinant result in two parts: one for diagonalizable matrices and one for non-diagonalizable matrices using their Jordan form. Any diagonalizable matrix can be written as $A=S\Lambda S^{-1}$. The determinant of $A$ is then $|A|=\left|S\Lambda S^{-1}\right|=|S||\Lambda|\left|S^{-1}\right|=|\Lambda|$, since $\left|S^{-1}\right|=\tfrac{1}{|S|}$, which follows from $\left|SS^{-1}\right|=|S|\left|S^{-1}\right|=|I|=1$. Now we have $|\Lambda|=\prod_i\lambda_i$.

The second part follows from similar matrices and the Jordan form. When a matrix is not diagonalizable because it doesn't have a complete set of linearly independent eigenvectors, we say it is nearly diagonalizable when it is in Jordan form. Jordan form means the matrix is nearly diagonal except perhaps for ones immediately above the diagonal. For example, the identity matrix $\begin{bmatrix} 1 & 0\\ 0 & 1\end{bmatrix}$ is in Jordan form as well as being diagonalizable, while $\begin{bmatrix} 1 & 1\\ 0 & 1\end{bmatrix}$ is in Jordan form but not diagonalizable. Even though both matrices have the same eigenvalues they are not similar matrices, as there exists no $M$ such that $\begin{bmatrix} 1 & 1\\ 0 & 1\end{bmatrix}$ equals $M^{-1}IM$. Nevertheless, the Jordan form is the characteristic form for a family of similar matrices, as there exists $P$ such that $P^{-1}AP=J$ where $J$ is the Jordan form for the family. For instance, $A=\tfrac13\begin{bmatrix} 5 & 4\\ -1 & 1\end{bmatrix}$ has Jordan form $\begin{bmatrix} 1 & 1\\ 0 & 1\end{bmatrix}$ with $P=\begin{bmatrix} 2 & 1\\ -1 & 1\end{bmatrix}$. As another example, $A=\tfrac15\begin{bmatrix} 13 & 9\\ -1 & 7\end{bmatrix}$ has Jordan form $\begin{bmatrix} 2 & 1\\ 0 & 2\end{bmatrix}$ with $P=\begin{bmatrix} 3 & 2\\ -1 & 1\end{bmatrix}$. Since they are similar matrices, $A$ and $J$ have the same eigenvalues. Plus, as in the examples above, the eigenvalues lie on the diagonal of $J$ in general. The determinant of $A=PJP^{-1}$ is $|A|=|P||J|\left|P^{-1}\right|=|J|=\prod_i\lambda_i$. This completes the argument.

To summarize, for any $n \times n$ matrix $A$: (1) $|A|=\prod_i\lambda_i$ and (2) $\mathrm{tr}(A)=\sum_i\lambda_i$.
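A quick numerical check of the two summary facts, using the first (non-diagonalizable) example matrix as reconstructed above; the particular numbers are illustrative of the Jordan-form discussion, not essential to it.

```python
# Eigenvalue sum = trace and eigenvalue product = determinant.
import numpy as np

A = np.array([[ 5., 4.],
              [-1., 1.]]) / 3.0
P = np.array([[ 2., 1.],
              [-1., 1.]])
J = np.linalg.inv(P) @ A @ P                   # Jordan form [[1, 1], [0, 1]]
print(np.round(J, 8))

lam = np.linalg.eigvals(A)
print(np.isclose(lam.sum(), np.trace(A)))          # True
print(np.isclose(lam.prod(), np.linalg.det(A)))    # True
```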
A.5 Gram-Schmidt construction of an orthogonal matrix

Before we put this section to bed, we'll undertake one more task: construction of an orthogonal matrix (that is, a matrix with orthogonal, unit-length columns so that $QQ^T=Q^TQ=I$). Suppose we have a square, symmetric matrix
\[
A=\begin{bmatrix} 2 & 1 & 1\\ 1 & 3 & 2\\ 1 & 2 & 3\end{bmatrix}
\]
with eigenvalues $\tfrac12\left(7+\sqrt{17}\right)$, $\tfrac12\left(7-\sqrt{17}\right)$, $1$ and eigenvectors (in the columns)
\[
\begin{bmatrix} v_1 & v_2 & v_3\end{bmatrix}=
\begin{bmatrix}
\frac{-3+\sqrt{17}}{2} & \frac{-3-\sqrt{17}}{2} & 0\\
1 & 1 & -1\\
1 & 1 & 1
\end{bmatrix}
\]
None of the columns is unit length and, in general, a set of basis vectors supplied this way need not be orthonormal; the Gram-Schmidt procedure constructs an orthonormal set from them. First, it normalizes the length of the first vector,
\[
q_1=\frac{v_1}{\sqrt{v_1^Tv_1}}=\begin{bmatrix} \frac{-3+\sqrt{17}}{\sqrt{34-6\sqrt{17}}}\\ \frac{2}{\sqrt{34-6\sqrt{17}}}\\ \frac{2}{\sqrt{34-6\sqrt{17}}}\end{bmatrix}
\approx\begin{bmatrix} 0.369\\ 0.657\\ 0.657\end{bmatrix}
\]
Then it finds the residual (null component) of the second vector projected onto $q_1$ (since $q_1^Tq_1$ is unity, we omit it in the development),
\[ r_2=v_2-q_1\left(q_1^Tv_2\right) \]
and normalizes $r_2$,
\[ q_2=\frac{r_2}{\sqrt{r_2^Tr_2}}\approx\begin{bmatrix} -0.929\\ 0.261\\ 0.261\end{bmatrix} \]
so that $q_1$ and $q_2$ are orthonormal. (In this example $v_1$ and $v_2$ are eigenvectors of a symmetric matrix for distinct eigenvalues, so they are already orthogonal and $r_2=v_2$; the procedure simply rescales it.) Let $Q_{12}=\begin{bmatrix} q_1 & q_2\end{bmatrix}$. Finally, compute the residual of $v_3$ projected onto $Q_{12}$,
\[ r_3=v_3-Q_{12}Q_{12}^Tv_3=\begin{bmatrix} 0\\ -1\\ 1\end{bmatrix} \]
($v_3$ is orthogonal to $Q_{12}$, as well as to $v_1$ and $v_2$, so it is unaltered) and normalize its length, $q_3=\frac{r_3}{\sqrt{r_3^Tr_3}}$. Then
\[
Q=\begin{bmatrix} q_1 & q_2 & q_3\end{bmatrix}\approx
\begin{bmatrix}
0.369 & -0.929 & 0\\
0.657 & 0.261 & -0.707\\
0.657 & 0.261 & 0.707
\end{bmatrix}
\]
and $QQ^T=Q^TQ=I$. If there are more vectors, we continue along the same lines with the fourth vector made orthogonal to the first three (by finding its residual from the projection onto the first three columns) and then normalized to unit length, and so on.

A.5.1 QR decomposition

QR is another important (especially for computation) matrix decomposition. QR combines Gram-Schmidt orthogonalization and Gaussian elimination to factor an $m \times n$ matrix $A$ with linearly independent columns into a matrix $Q$ with orthonormal columns, $Q^TQ=I$, multiplied by a square, invertible, upper triangular matrix $R$. This provides distinct advantages when dealing with projections into the column space of $A$. Recall, this problem takes the form $Ay=b$ where the objective is to find $y$ that minimizes the distance to $b$. Since $A=QR$, we have $QRy=b$ and $R^{-1}Q^TQRy=y=R^{-1}Q^Tb$. Next, we summarize the steps for two QR algorithms: the Gram-Schmidt approach and the Householder approach.

A.5.2 Gram-Schmidt QR algorithm

The Gram-Schmidt algorithm proceeds as described above to form $Q$. Let $a$ denote the first column of $A$ and construct $a_1=\frac{a}{\sqrt{a^Ta}}$ to normalize the first column. Construct the projection matrix for this column, $P_1=a_1a_1^T$ (since $a_1$ is normalized, the inverse of $a_1^Ta_1$ is unity, so it is dropped from the expression). Now repeat with the second column: let $a$ denote the second column of $A$ and make it orthogonal to $a_1$ by redefining it as $a=(I-P_1)a$; then normalize via $a_2=\frac{a}{\sqrt{a^Ta}}$ and construct $P_2=a_2a_2^T$. The third column is made orthonormal in similar fashion: let $a$ denote the third column of $A$, make it orthogonal to $a_1$ and $a_2$ by redefining it as $a=(I-P_1-P_2)a$, normalize via $a_3=\frac{a}{\sqrt{a^Ta}}$, and construct $P_3=a_3a_3^T$. Repeat this for all $n$ columns of $A$. $Q$ is constructed by combining the columns, $Q=\begin{bmatrix} a_1 & a_2 & \cdots & a_n\end{bmatrix}$, so that $Q^TQ=I$, and $R$ is constructed as $R=Q^TA$. To see that $R$ is upper triangular, let the columns of $A$ be denoted $A_1, A_2, \ldots, A_n$. Then
\[
Q^TA=\begin{bmatrix}
a_1^TA_1 & a_1^TA_2 & \cdots & a_1^TA_n\\
a_2^TA_1 & a_2^TA_2 & \cdots & a_2^TA_n\\
\vdots & \vdots & \ddots & \vdots\\
a_n^TA_1 & a_n^TA_2 & \cdots & a_n^TA_n
\end{bmatrix}
\]
The terms below the main diagonal are zero since $a_j$ for $j=2,\ldots,n$ are constructed to be orthogonal to $A_1$, $a_j$ for $j=3,\ldots,n$ are constructed to be orthogonal to $A_2$ (which lies in the span of $a_1$ and $a_2$), and so on. Notice how straightforward it is to solve $Ay=b$ for $y$:
\[ Ay=b,\qquad QRy=b,\qquad R^{-1}Q^TQRy=y=R^{-1}Q^Tb \]
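A small Python sketch of classical Gram-Schmidt applied to the eigenvector columns above, compared with numpy's built-in QR routine (which produces an equivalent orthonormal basis, possibly with different column signs):

```python
# Classical Gram-Schmidt orthonormalization versus np.linalg.qr.
import numpy as np

s = np.sqrt(17.0)
V = np.array([[(-3 + s) / 2, (-3 - s) / 2,  0.],
              [1.,            1.,          -1.],
              [1.,            1.,           1.]])

Q = np.zeros_like(V)
for j in range(V.shape[1]):
    r = V[:, j] - Q[:, :j] @ (Q[:, :j].T @ V[:, j])   # remove components along earlier q's
    Q[:, j] = r / np.linalg.norm(r)                   # normalize to unit length

print(np.round(Q, 3))                     # columns ~ (0.369, 0.657, 0.657), etc.
print(np.allclose(Q.T @ Q, np.eye(3)))    # True

Qnp, Rnp = np.linalg.qr(V)                # same subspaces, columns unique up to sign
print(np.round(Qnp, 3))
```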
A.5.3 Accounting example

Return to the 4-accounts-by-6-journal-entries $A$ matrix. This matrix clearly does not have linearly independent columns (or, for that matter, rows), but we'll drop a redundant row (the last row) and denote the resulting matrix $A_0$. Now we'll find the QR decomposition of the $6 \times 3$ matrix $A_0^T$, $A_0^T=QR$, by the Gram-Schmidt process:
\[
A_0^T=\begin{bmatrix}
 1 & -1 &  0\\
 1 &  0 & -1\\
-1 &  0 &  0\\
 1 &  0 &  0\\
 0 &  1 &  0\\
 0 &  0 &  1
\end{bmatrix}
\]
The steps give
\[
a_1=\tfrac12\begin{bmatrix} 1\\ 1\\ -1\\ 1\\ 0\\ 0\end{bmatrix},\qquad
a_2=\tfrac{1}{2\sqrt 7}\begin{bmatrix} -3\\ 1\\ -1\\ 1\\ 4\\ 0\end{bmatrix}\approx
\begin{bmatrix} -0.567\\ 0.189\\ -0.189\\ 0.189\\ 0.756\\ 0\end{bmatrix},\qquad
a_3\approx\begin{bmatrix} 0.109\\ -0.546\\ -0.218\\ 0.218\\ 0.109\\ 0.764\end{bmatrix}
\]
with $P_1=a_1a_1^T$ and $P_2=a_2a_2^T$, so that
\[
Q=\begin{bmatrix} a_1 & a_2 & a_3\end{bmatrix}\approx
\begin{bmatrix}
 0.5 & -0.567 &  0.109\\
 0.5 &  0.189 & -0.546\\
-0.5 & -0.189 & -0.218\\
 0.5 &  0.189 &  0.218\\
 0   &  0.756 &  0.109\\
 0   &  0     &  0.764
\end{bmatrix}
\qquad\text{and}\qquad
R=Q^TA_0^T=\begin{bmatrix}
2 & -0.5 & -0.5\\
0 & 1.323 & -0.189\\
0 & 0 & 1.309
\end{bmatrix}
\]
The projection solution to $Ay=x$, or $A_0y=x_0$ where $x=\begin{bmatrix} 2 & 1 & -1 & -2\end{bmatrix}^T$ and $x_0=\begin{bmatrix} 2 & 1 & -1\end{bmatrix}^T$, is
\[ y_{row}=Q\left(R^T\right)^{-1}x_0=\frac{1}{6}\begin{bmatrix} -1\\ 5\\ -4\\ 4\\ 5\\ -1\end{bmatrix} \]
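The same projection solution can be obtained with numpy's QR routine; because $A_0^T=QR$ regardless of sign convention, $y_{row}=Q\left(R^T\right)^{-1}x_0$ is unchanged (a minimal sketch):

```python
# QR solution of the accounting projection problem A0 y = x0.
import numpy as np

A0T = np.array([[ 1, -1,  0],
                [ 1,  0, -1],
                [-1,  0,  0],
                [ 1,  0,  0],
                [ 0,  1,  0],
                [ 0,  0,  1]], dtype=float)
x0 = np.array([2., 1., -1.])

Q, R = np.linalg.qr(A0T)                   # economy-size QR: Q is 6x3, R is 3x3
y_row = Q @ np.linalg.solve(R.T, x0)       # Q (R^T)^{-1} x0
print(np.round(6 * y_row, 6))              # [-1, 5, -4, 4, 5, -1]
```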
A.5.4 The Householder QR algorithm

The Householder algorithm is not as intuitive as the Gram-Schmidt algorithm but is computationally more stable. Let $a$ denote the first column of $A$ and let $z$ be a vector of zeros except that its first element is one. Define $v=a-\sqrt{a^Ta}\,z$ and $H_1=I-\frac{2vv^T}{v^Tv}$. Then $H_1A$ puts the first column of $A$ in upper triangular form. Now repeat the process where $a$ is defined to be the second column of $H_1A$ with its first element set to zero, and $z$ is a vector of zeros except that its second element is one. Utilize these components to create $v$ in the same form as before and to construct $H_2$ in the same form as $H_1$. Then $H_2H_1A$ puts the first two columns of $A$ in upper triangular form. Next we work with the third column of $H_2H_1A$, where the first two elements of $a$ are set to zero, and repeat for all $n$ columns. When complete, $R$ is constructed from the first $n$ rows of $H_n\cdots H_2H_1A$ and $Q^T$ is constructed from the first $n$ rows of $H_n\cdots H_2H_1$.

A.5.5 Accounting example

Again, return to the 4-accounts-by-6-journal-entries $A$ matrix and work with $A_0$. Now we'll find the QR decomposition of the $6 \times 3$ matrix $A_0^T$, $A_0^T=QR$, by Householder transformations.

In the construction of $H_1$, $a=\begin{bmatrix} 1 & 1 & -1 & 1 & 0 & 0\end{bmatrix}^T$, $z=e_1$, and $v=a-2z=\begin{bmatrix} -1 & 1 & -1 & 1 & 0 & 0\end{bmatrix}^T$, so
\[
H_1=\tfrac12\begin{bmatrix}
 1 & 1 & -1 & 1 & 0 & 0\\
 1 & 1 & 1 & -1 & 0 & 0\\
-1 & 1 & 1 & 1 & 0 & 0\\
 1 & -1 & 1 & 1 & 0 & 0\\
 0 & 0 & 0 & 0 & 2 & 0\\
 0 & 0 & 0 & 0 & 0 & 2
\end{bmatrix}
\qquad\text{and}\qquad
H_1A_0^T=\begin{bmatrix}
2 & -\tfrac12 & -\tfrac12\\
0 & -\tfrac12 & -\tfrac12\\
0 & \tfrac12 & -\tfrac12\\
0 & -\tfrac12 & \tfrac12\\
0 & 1 & 0\\
0 & 0 & 1
\end{bmatrix}
\]
For the construction of $H_2$, $a=\tfrac12\begin{bmatrix} 0 & -1 & 1 & -1 & 2 & 0\end{bmatrix}^T$, $z=e_2$, and $v=a-\tfrac{\sqrt 7}{2}z=\begin{bmatrix} 0 & -\tfrac{1+\sqrt 7}{2} & \tfrac12 & -\tfrac12 & 1 & 0\end{bmatrix}^T\approx\begin{bmatrix} 0 & -1.823 & 0.5 & -0.5 & 1 & 0\end{bmatrix}^T$. Then
\[
H_2H_1A_0^T=\begin{bmatrix}
2 & -0.5 & -0.5\\
0 & 1.323 & -0.189\\
0 & 0 & -0.585\\
0 & 0 & 0.585\\
0 & 0 & -0.171\\
0 & 0 & 1
\end{bmatrix}
\]
For the construction of $H_3$, $a=\begin{bmatrix} 0 & 0 & -0.585 & 0.585 & -0.171 & 1\end{bmatrix}^T$, $z=e_3$, and $v=a-1.309z\approx\begin{bmatrix} 0 & 0 & -1.894 & 0.585 & -0.171 & 1\end{bmatrix}^T$. Then
\[
H_3H_2H_1A_0^T=\begin{bmatrix}
2 & -0.5 & -0.5\\
0 & 1.323 & -0.189\\
0 & 0 & 1.309\\
0 & 0 & 0\\
0 & 0 & 0\\
0 & 0 & 0
\end{bmatrix}
\]
This leads to
\[
R=\begin{bmatrix}
2 & -0.5 & -0.5\\
0 & 1.323 & -0.189\\
0 & 0 & 1.309
\end{bmatrix}
\]
and $Q$ (the transpose of the first three rows of $H_3H_2H_1$) identical to the $Q$ reported for the Gram-Schmidt algorithm above. Finally, the projection solution to $A_0y=x_0$ is $y_{row}=Q\left(R^T\right)^{-1}x_0=\frac{1}{6}\begin{bmatrix} -1 & 5 & -4 & 4 & 5 & -1\end{bmatrix}^T$.

A.6 Some determinant identities

A.6.1 Determinant of a square matrix

We utilize the fact that $\det(A)=\det(LU)=\det(L)\det(U)$ and that the determinant of a triangular matrix is the product of its diagonal elements. Since $L$ has ones along its diagonal, $\det(A)=\det(U)$. Return to the example above,
\[
\det\left(AA^T-\lambda I_4\right)=\det\begin{bmatrix}
4-\lambda & -1 & -1 & -2\\
-1 & 2-\lambda & 0 & -1\\
-1 & 0 & 2-\lambda & -1\\
-2 & -1 & -1 & 4-\lambda
\end{bmatrix}
\]
Factor $AA^T-\lambda I_4$ into its lower and upper triangular components via Gaussian elimination (this step can be computationally intensive). $L$ has ones on its diagonal and the upper triangular factor is
\[
U=\begin{bmatrix}
4-\lambda & -1 & -1 & -2\\[2pt]
0 & \dfrac{7-6\lambda+\lambda^2}{4-\lambda} & \dfrac{-1}{4-\lambda} & \dfrac{-(6-\lambda)}{4-\lambda}\\[6pt]
0 & 0 & \dfrac{12-18\lambda+8\lambda^2-\lambda^3}{7-6\lambda+\lambda^2} & \dfrac{-(12-8\lambda+\lambda^2)}{7-6\lambda+\lambda^2}\\[6pt]
0 & 0 & 0 & \dfrac{\lambda(\lambda-2)\left(24-10\lambda+\lambda^2\right)}{12-18\lambda+8\lambda^2-\lambda^3}
\end{bmatrix}
\]
The determinant of $AA^T-\lambda I_4$ equals the determinant of $U$, which is the product of the diagonal elements,
\[
\det\left(AA^T-\lambda I_4\right)=\det(U)
=(4-\lambda)\cdot\frac{7-6\lambda+\lambda^2}{4-\lambda}\cdot\frac{12-18\lambda+8\lambda^2-\lambda^3}{7-6\lambda+\lambda^2}\cdot\frac{\lambda(\lambda-2)\left(24-10\lambda+\lambda^2\right)}{12-18\lambda+8\lambda^2-\lambda^3}
\]
which simplifies to
\[ \det\left(AA^T-\lambda I_4\right)=\lambda^4-12\lambda^3+44\lambda^2-48\lambda \]
Of course, the roots of this polynomial are the eigenvalues of $AA^T$.
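A numerical cross-check of section A.6.1: at any test value of $\lambda$, the determinant of $AA^T-\lambda I_4$ equals the product of the pivots from its LU factorization (adjusting the sign for any row exchanges) and equals the characteristic polynomial above. A minimal sketch:

```python
# Determinant as product of pivots, compared with the characteristic polynomial.
import numpy as np
from scipy.linalg import lu

A = np.array([[ 1,  1, -1,  1,  0,  0],
              [-1,  0,  0,  0,  1,  0],
              [ 0, -1,  0,  0,  0,  1],
              [ 0,  0,  1, -1, -1, -1]], dtype=float)
S = A @ A.T

for lam in [0.5, 1.0, 3.0, 5.0]:                        # arbitrary test values of lambda
    M = S - lam * np.eye(4)
    P, L, U = lu(M)                                     # M = P L U
    piv_prod = np.prod(np.diag(U)) * np.linalg.det(P)   # det(P) = +/- 1 for row exchanges
    poly = lam**4 - 12 * lam**3 + 44 * lam**2 - 48 * lam
    print(np.isclose(piv_prod, np.linalg.det(M)), np.isclose(np.linalg.det(M), poly))
```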
A.6.2 Identities

Below, the notation $|A|$ refers to the determinant of the matrix $A$.

Theorem 1. $\begin{vmatrix} A & B\\ C & D\end{vmatrix}=|A|\left|D-CA^{-1}B\right|=|D|\left|A-BD^{-1}C\right|$ where $A$ is $m\times m$, $B$ is $m\times n$, $C$ is $n\times m$, $D$ is $n\times n$, and $A^{-1}$ and $D^{-1}$ exist.

Proof.
\[
\begin{bmatrix} A & B\\ C & D\end{bmatrix}
=\begin{bmatrix} A & 0\\ C & I\end{bmatrix}\begin{bmatrix} I & A^{-1}B\\ 0 & D-CA^{-1}B\end{bmatrix}
=\begin{bmatrix} I & B\\ 0 & D\end{bmatrix}\begin{bmatrix} A-BD^{-1}C & 0\\ D^{-1}C & I\end{bmatrix}
\]
Since the determinant of a block triangular matrix is the product of the determinants of its diagonal blocks and the determinant of a product of matrices is the product of their determinants,
\[
\begin{vmatrix} A & B\\ C & D\end{vmatrix}
=|A|\,|I|\,|I|\left|D-CA^{-1}B\right|=|A|\left|D-CA^{-1}B\right|
=|I|\,|D|\left|A-BD^{-1}C\right||I|=|D|\left|A-BD^{-1}C\right|
\]

Theorem 2. For $A$ and $B$ $m\times n$ matrices, $\left|I_n+A^TB\right|=\left|I_m+BA^T\right|=\left|I_n+B^TA\right|=\left|I_m+AB^T\right|$.

Proof. Apply theorem 1 to the block matrix $\begin{bmatrix} I_m & -B\\ A^T & I_n\end{bmatrix}$, evaluating both ways:
\[
\begin{vmatrix} I_m & -B\\ A^T & I_n\end{vmatrix}=|I_m|\left|I_n+A^TB\right|=|I_n|\left|I_m+BA^T\right|
\]
Hence $\left|I_n+A^TB\right|=\left|I_m+BA^T\right|$. Since the determinant of the transpose of a matrix equals the determinant of the matrix, $\left|I_n+A^TB\right|=\left|I_n+B^TA\right|$ and $\left|I_m+BA^T\right|=\left|I_m+AB^T\right|$.

Theorem 3. For vectors $x$ and $y$, $\left|I+xy^T\right|=1+y^Tx$.

Proof. From theorem 2, $\left|I+xy^T\right|=\left|1+y^Tx\right|=1+y^Tx$.

Theorem 4. $\left|A_{n\times n}+xy^T\right|=|A|\left(1+y^TA^{-1}x\right)$ where $A^{-1}$ exists.

Proof. Apply theorem 1 to the block matrix $\begin{bmatrix} A & -x\\ y^T & 1\end{bmatrix}$, evaluating both ways:
\[
\begin{vmatrix} A & -x\\ y^T & 1\end{vmatrix}
=|A|\left|1+y^TA^{-1}x\right|
=|1|\left|A+x\cdot 1\cdot y^T\right|=\left|A+xy^T\right|
\]
Hence $\left|A+xy^T\right|=|A|\left(1+y^TA^{-1}x\right)$.
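These identities are easy to verify numerically on arbitrary values (the matrices below are randomly generated and purely illustrative):

```python
# Numerical check of theorems 2 and 4 (the matrix determinant lemma).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)    # nonsingular with high probability
x = rng.normal(size=(4, 1))
y = rng.normal(size=(4, 1))

lhs = np.linalg.det(A + x @ y.T)
rhs = np.linalg.det(A) * (1 + (y.T @ np.linalg.inv(A) @ x).item())
print(np.isclose(lhs, rhs))                    # Theorem 4: True

B = rng.normal(size=(4, 2))
C = rng.normal(size=(2, 4))
print(np.isclose(np.linalg.det(np.eye(4) + B @ C),
                 np.linalg.det(np.eye(2) + C @ B)))   # Theorem 2, in |I + MN| = |I + NM| form
```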
Appendix B
Iterated expectations

Along with Bayes' theorem (the glue holding consistent probability assessment together), iterated expectations is extensively employed for connecting conditional expectation (regression) results with causal effects of interest.

Theorem 5 (Law of iterated expectations). $E[Y]=E_X\left[E[Y\mid X]\right]$.

Proof.
\[
E_X\left[E[Y\mid X]\right]=\int_x E[Y\mid X]\,f(x)\,dx=\int_x\left[\int_y y\,f(y\mid x)\,dy\right]f(x)\,dx
\]
By Fubini's theorem, we can change the order of integration,
\[
\int_x\left[\int_y y\,f(y\mid x)\,dy\right]f(x)\,dx=\int_y y\left[\int_x f(y\mid x)f(x)\,dx\right]dy
\]
The product rule of Bayes' theorem, $f(y\mid x)f(x)=f(y,x)$, implies iterated expectations can be rewritten as
\[
E_X\left[E[Y\mid X]\right]=\int_y y\left[\int_x f(y,x)\,dx\right]dy
\]
Finally, the summation rule integrates out $X$, $\int_x f(y,x)\,dx=f(y)$, and produces the result,
\[ E_X\left[E[Y\mid X]\right]=\int_y y\,f(y)\,dy=E[Y] \]

B.1 Decomposition of variance

Corollary 1 (Decomposition of variance). $Var[Y]=E_X\left[Var[Y\mid X]\right]+Var_X\left[E[Y\mid X]\right]$.

Proof.
\[
\begin{aligned}
Var[Y] &= E\left[Y^2\right]-E[Y]^2\\
&= E_X\left[E\left[Y^2\mid X\right]\right]-E_X\left[E[Y\mid X]\right]^2\\
&= E_X\left[Var[Y\mid X]+E[Y\mid X]^2\right]-E_X\left[E[Y\mid X]\right]^2\\
&= E_X\left[Var[Y\mid X]\right]+\left(E_X\left[E[Y\mid X]^2\right]-E_X\left[E[Y\mid X]\right]^2\right)\\
&= E_X\left[Var[Y\mid X]\right]+Var_X\left[E[Y\mid X]\right]
\end{aligned}
\]
The second line draws on iterated expectations while the third line is the decomposition of the second moment. In analysis of variance language, the first term is the residual variation (variation unexplained) and the second term is the regression variation (variation explained).

Appendix C
Multivariate normal theory

The Gaussian or normal probability distribution is ubiquitous when the data have continuous support, as seen in the central limit theorems and the resilience of Gaussian random variables to transformation. When confronted with a vector of variables (multivariate), it is often sensible to think of a joint normal probability assignment to describe their stochastic properties. Let $W=\begin{bmatrix} X\\ Z\end{bmatrix}$ be an $m$-element vector ($X$ has $m_1$ elements and $Z$ has $m_2$ elements such that $m_1+m_2=m$) with joint normal probability; then the density function is
\[
f_W(w)=\frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\!\left[-\tfrac12(w-\mu)^T\Sigma^{-1}(w-\mu)\right]
\]
where $\mu=\begin{bmatrix} \mu_X\\ \mu_Z\end{bmatrix}$ is the $m$-element vector of means for $W$ and
\[
\Sigma=\begin{bmatrix} \Sigma_{XX} & \Sigma_{XZ}\\ \Sigma_{ZX} & \Sigma_{ZZ}\end{bmatrix}
\]
is the $m\times m$ variance-covariance matrix for $W$ with $m$ linearly independent rows and columns. Of course, the density integrates to unity, $\int_X\int_Z f_W(w)\,dz\,dx=1$, and the marginal densities are found by integrating out the other variables; for example, $f_X(x)=\int_Z f_W(w)\,dz$ and $f_Z(z)=\int_X f_W(w)\,dx$.

Importantly, as it unifies linear regression, the conditional distributions are also Gaussian:
\[
f_X(x\mid Z=z)=\frac{f_W(w)}{f_Z(z)}\sim N\left(E[X\mid Z=z],\,Var[X\mid Z]\right)
\]
where
\[
E[X\mid Z=z]=\mu_X+\Sigma_{XZ}\Sigma_{ZZ}^{-1}(z-\mu_Z)
\qquad\text{and}\qquad
Var[X\mid Z]=\Sigma_{XX}-\Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}
\]
Also, $f_Z(z\mid X=x)\sim N\left(E[Z\mid X=x],\,Var[Z\mid X]\right)$ where
\[
E[Z\mid X=x]=\mu_Z+\Sigma_{ZX}\Sigma_{XX}^{-1}(x-\mu_X)
\qquad\text{and}\qquad
Var[Z\mid X]=\Sigma_{ZZ}-\Sigma_{ZX}\Sigma_{XX}^{-1}\Sigma_{XZ}
\]
From the conditional expectation (or regression) function we see that when the data are Gaussian, linearity imposes no restriction:
\[ E[Z\mid X=x]=\mu_Z+\Sigma_{ZX}\Sigma_{XX}^{-1}(x-\mu_X) \]
is often written
\[ E[Z\mid X=x]=\alpha+\beta^Tx \]
where $\alpha=\mu_Z-\Sigma_{ZX}\Sigma_{XX}^{-1}\mu_X$ corresponds to an intercept (or vector of intercepts) and $\beta^Tx=\Sigma_{ZX}\Sigma_{XX}^{-1}x$ corresponds to the weighted regressors. Applied linear regression estimates the sample analogs of the parameters $\alpha$ and $\beta$.
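The conditional-normal formulas translate directly into code; the numpy sketch below uses arbitrary illustrative values for the mean vector and (positive definite) covariance blocks, not values from the text.

```python
# Conditional mean and variance of X given Z = z for a joint normal vector.
import numpy as np

mu_x = np.array([1.0, 2.0])
mu_z = np.array([0.0])
Sxx = np.array([[2.0, 0.5],
                [0.5, 1.0]])
Sxz = np.array([[0.8],
                [0.3]])
Szz = np.array([[1.5]])
z = np.array([1.2])

Szz_inv = np.linalg.inv(Szz)
cond_mean = mu_x + Sxz @ Szz_inv @ (z - mu_z)     # E[X | Z = z]
cond_var = Sxx - Sxz @ Szz_inv @ Sxz.T            # Var[X | Z]
print(cond_mean)
print(cond_var)
```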
C.1 Conditional distribution

Next, we develop more carefully the result for $f_X(x\mid Z=z)$; the result for $f_Z(z\mid X=x)$ follows in analogous fashion.
\[
f_X(x\mid Z=z)=\frac{f_W(w)}{f_Z(z)}
=\frac{(2\pi)^{-m/2}|\Sigma|^{-1/2}\exp\!\left[-\tfrac12(w-\mu)^T\Sigma^{-1}(w-\mu)\right]}
{(2\pi)^{-m_2/2}|\Sigma_{ZZ}|^{-1/2}\exp\!\left[-\tfrac12(z-\mu_Z)^T\Sigma_{ZZ}^{-1}(z-\mu_Z)\right]}
\]
and we claim this equals
\[
(2\pi)^{-m_1/2}\left|Var[X\mid Z]\right|^{-1/2}\exp\!\left[-\tfrac12\left(x-E[X\mid Z]\right)^TVar[X\mid Z]^{-1}\left(x-E[X\mid Z]\right)\right]
\]
The normalizing constants are identified almost immediately, since $\frac{(2\pi)^{-(m_1+m_2)/2}}{(2\pi)^{-m_2/2}}=(2\pi)^{-m_1/2}$ for the leading term, and by theorem 1 in section A.6.2 we have
\[ |\Sigma|=|\Sigma_{ZZ}|\left|\Sigma_{XX}-\Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}\right| \]
Since $\Sigma_{XX}$, $\Sigma_{ZZ}$, and $\Sigma$ are positive definite, their determinants are positive and their square roots are real. Hence,
\[
\frac{|\Sigma|^{-1/2}}{|\Sigma_{ZZ}|^{-1/2}}=\left|\Sigma_{XX}-\Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}\right|^{-1/2}=\left|Var[X\mid Z]\right|^{-1/2}
\]
This leaves the exponential terms
\[
\exp\!\left[-\tfrac12(w-\mu)^T\Sigma^{-1}(w-\mu)+\tfrac12(z-\mu_Z)^T\Sigma_{ZZ}^{-1}(z-\mu_Z)\right]
\]
which require a bit more foundation. We begin with a lemma for the inverse of a partitioned matrix.

Lemma 1. Let a symmetric, positive definite matrix $H$ be partitioned as $H=\begin{bmatrix} A & B^T\\ B & C\end{bmatrix}$ where $A$ and $C$ are square $n_1\times n_1$ and $n_2\times n_2$ positive definite matrices (their inverses exist). Then
\[
H^{-1}=\begin{bmatrix}
\left(A-B^TC^{-1}B\right)^{-1} & -A^{-1}B^T\left(C-BA^{-1}B^T\right)^{-1}\\
-\left(C-BA^{-1}B^T\right)^{-1}BA^{-1} & \left(C-BA^{-1}B^T\right)^{-1}
\end{bmatrix}
\]
with the equivalent expression $\left(A-B^TC^{-1}B\right)^{-1}=A^{-1}+A^{-1}B^T\left(C-BA^{-1}B^T\right)^{-1}BA^{-1}$ for the upper-left block.

Proof. $H$ is symmetric and the inverse of a symmetric matrix is also symmetric. Since $H$ is symmetric positive definite, it factors as
\[
H=\begin{bmatrix} I & 0\\ BA^{-1} & I\end{bmatrix}
\begin{bmatrix} A & 0\\ 0 & C-BA^{-1}B^T\end{bmatrix}
\begin{bmatrix} I & A^{-1}B^T\\ 0 & I\end{bmatrix}
\]
so
\[
H^{-1}=\begin{bmatrix} I & -A^{-1}B^T\\ 0 & I\end{bmatrix}
\begin{bmatrix} A^{-1} & 0\\ 0 & \left(C-BA^{-1}B^T\right)^{-1}\end{bmatrix}
\begin{bmatrix} I & 0\\ -BA^{-1} & I\end{bmatrix}
\]
Expanding gives the displayed blocks with upper-left block $X=A^{-1}+A^{-1}B^T\left(C-BA^{-1}B^T\right)^{-1}BA^{-1}$. It remains to show $X=\left(A-B^TC^{-1}B\right)^{-1}$. Multiply $X$ by $A-B^TC^{-1}B$:
\[
\begin{aligned}
X\left(A-B^TC^{-1}B\right)
&=I-A^{-1}B^TC^{-1}B+A^{-1}B^T\left(C-BA^{-1}B^T\right)^{-1}\left(B-BA^{-1}B^TC^{-1}B\right)\\
&=I-A^{-1}B^TC^{-1}B+A^{-1}B^T\left(C-BA^{-1}B^T\right)^{-1}\left(C-BA^{-1}B^T\right)C^{-1}B\\
&=I-A^{-1}B^TC^{-1}B+A^{-1}B^TC^{-1}B=I
\end{aligned}
\]
This completes the lemma.

Now write out the exponential terms and utilize the lemma (with $A=\Sigma_{XX}$, $B=\Sigma_{ZX}$, $C=\Sigma_{ZZ}$) to simplify. Write $\Sigma_{XX\cdot Z}=\Sigma_{XX}-\Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}$ and $\Sigma_{ZZ\cdot X}=\Sigma_{ZZ}-\Sigma_{ZX}\Sigma_{XX}^{-1}\Sigma_{XZ}$. The exponent becomes
\[
-\tfrac12\left[
\begin{aligned}
&(x-\mu_X)^T\Sigma_{XX\cdot Z}^{-1}(x-\mu_X)
-(x-\mu_X)^T\Sigma_{XX\cdot Z}^{-1}\Sigma_{XZ}\Sigma_{ZZ}^{-1}(z-\mu_Z)\\
&-(z-\mu_Z)^T\Sigma_{ZZ}^{-1}\Sigma_{ZX}\Sigma_{XX\cdot Z}^{-1}(x-\mu_X)
+(z-\mu_Z)^T\Sigma_{ZZ\cdot X}^{-1}(z-\mu_Z)
\end{aligned}\right]
+\tfrac12(z-\mu_Z)^T\Sigma_{ZZ}^{-1}(z-\mu_Z)
\]
From the last quadratic term in the bracket, write
\[
\Sigma_{ZZ\cdot X}^{-1}=\Sigma_{ZZ}^{-1}+\Sigma_{ZZ}^{-1}\Sigma_{ZX}\Sigma_{XX\cdot Z}^{-1}\Sigma_{XZ}\Sigma_{ZZ}^{-1}
\]
(to see this, utilize the lemma with the roles of the two diagonal blocks interchanged). Substituting, the $\tfrac12(z-\mu_Z)^T\Sigma_{ZZ}^{-1}(z-\mu_Z)$ term cancels, leaving four terms that are exactly the expansion of a single quadratic form,
\[
\exp\!\left[-\tfrac12\left(x-E[x\mid Z=z]\right)^T\Sigma_{XX\cdot Z}^{-1}\left(x-E[x\mid Z=z]\right)\right]
\]
where $E[x\mid Z=z]=\mu_X+\Sigma_{XZ}\Sigma_{ZZ}^{-1}(z-\mu_Z)$. Therefore the result matches the claim: the conditional distribution of $X$ given $Z=z$ is normal with mean $E[X\mid Z=z]$ and variance $Var[X\mid Z]=\Sigma_{XX\cdot Z}$.

C.2 Special case of precision

Now we consider a special case of Bayesian normal updating expressed in terms of the precision of variables (inverse variance) along with the variance representation above. Suppose a variable of interest $x$ is observed with error,
\[ Y=x+\varepsilon \]
where $x\sim N\!\left(\mu_x,\sigma_x^2=\tfrac{1}{\rho_x}\right)$, $\varepsilon\sim N\!\left(0,\sigma_\varepsilon^2=\tfrac{1}{\rho_\varepsilon}\right)$ independent of $x$, $\sigma_j^2$ refers to the variance, and $\rho_j$ refers to the precision of variable $j$. This implies
\[
Var\begin{bmatrix} x\\ Y\end{bmatrix}
=\begin{bmatrix}
E\left[(x-\mu_x)^2\right] & E\left[(x-\mu_x)(Y-\mu_x)\right]\\
E\left[(Y-\mu_x)(x-\mu_x)\right] & E\left[(Y-\mu_x)^2\right]
\end{bmatrix}
=\begin{bmatrix} \sigma_x^2 & \sigma_x^2\\ \sigma_x^2 & \sigma_x^2+\sigma_\varepsilon^2\end{bmatrix}
\]
Then the posterior or updated distribution for $x$ given $Y=y$ is normal,
\[ (x\mid Y=y)\sim N\left(E[x\mid Y=y],\,Var[x\mid Y]\right) \]
where
\[
E[x\mid Y=y]=\mu_x+\frac{\sigma_x^2}{\sigma_x^2+\sigma_\varepsilon^2}(y-\mu_x)
=\frac{\sigma_\varepsilon^2\mu_x+\sigma_x^2 y}{\sigma_x^2+\sigma_\varepsilon^2}
=\frac{\rho_x\mu_x+\rho_\varepsilon y}{\rho_x+\rho_\varepsilon}
\]
and
\[
Var[x\mid Y]=\sigma_x^2-\frac{\sigma_x^4}{\sigma_x^2+\sigma_\varepsilon^2}
=\frac{\sigma_x^2\sigma_\varepsilon^2}{\sigma_x^2+\sigma_\varepsilon^2}
=\frac{1}{\rho_x+\rho_\varepsilon}
\]
For both the conditional expectation and the variance, the penultimate expression is in terms of variances and the last expression is in terms of precisions. The precision of $x$ given $Y$ is $\rho_{x\mid Y}=\rho_x+\rho_\varepsilon$.
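The precision form of the update is a precision-weighted average of the prior mean and the observation. The short sketch below checks that the variance and precision representations agree; the parameter values are illustrative.

```python
# Normal updating: variance form versus precision form.
mu_x, rho_x = 1.0, 4.0            # prior mean and precision of x
rho_eps = 1.0                     # precision of the measurement error
y = 2.5                           # observed Y = x + eps

post_mean = (rho_x * mu_x + rho_eps * y) / (rho_x + rho_eps)
post_var = 1.0 / (rho_x + rho_eps)
print(post_mean, post_var)        # 1.3, 0.2

s2x, s2e = 1.0 / rho_x, 1.0 / rho_eps          # equivalent variance form
print(mu_x + s2x / (s2x + s2e) * (y - mu_x),   # same posterior mean
      s2x * s2e / (s2x + s2e))                 # same posterior variance
```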
Appendix D Generalized least squares (GLS)

For the DGP
\[ Y = X\beta + \varepsilon, \]
where $\varepsilon \sim N(0, \Sigma)$ and $\Sigma$ is some general $n \times n$ variance-covariance matrix, the linear least squares estimator, or generalized least squares (GLS) estimator, is
\[ b_{GLS} = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}X^T\Sigma^{-1}Y \]
with
\[ Var\bigl[b_{GLS} \mid X\bigr] = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}. \]
If $\Sigma$ is known, this can be computed by ordinary least squares (OLS) following transformation of the variables $Y$ and $X$ via $\Gamma^{-1}$, where $\Gamma$ is a triangular matrix such that $\Gamma\Gamma^T = \Sigma$, say via Cholesky decomposition. Then, the transformed DGP is
\[ \Gamma^{-1}Y = \Gamma^{-1}X\beta + \Gamma^{-1}\varepsilon \quad\text{or}\quad y = x\beta + \epsilon, \]
where $y = \Gamma^{-1}Y$, $x = \Gamma^{-1}X$, and $\epsilon = \Gamma^{-1}\varepsilon \sim N(0, I)$. To see where the identity variance matrix comes from, consider
\[ Var\bigl[\Gamma^{-1}\varepsilon\bigr] = \Gamma^{-1}Var[\varepsilon]\bigl(\Gamma^{-1}\bigr)^T = \Gamma^{-1}\Gamma\Gamma^T\bigl(\Gamma^T\bigr)^{-1} = I. \]
Hence, estimation involves projection of $y$ onto $x$, $E[y \mid x] = xb$, where
\[ b = \bigl(x^Tx\bigr)^{-1}x^Ty. \]
Since $x^Tx = X^T\bigl(\Gamma^{-1}\bigr)^T\Gamma^{-1}X = X^T\bigl(\Gamma\Gamma^T\bigr)^{-1}X = X^T\Sigma^{-1}X$ and, likewise, $x^Ty = X^T\Sigma^{-1}Y$, we can rewrite
\[ b = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}X^T\Sigma^{-1}Y, \]
which is the GLS estimator for $\beta$. Further, $Var[b \mid X] = E\bigl[(b - \beta)(b - \beta)^T \mid X\bigr]$. Since
\[ b = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}X^T\Sigma^{-1}(X\beta + \varepsilon) = \beta + \bigl(X^T\Sigma^{-1}X\bigr)^{-1}X^T\Sigma^{-1}\varepsilon, \]
we have
\[ Var[b \mid X] = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}X^T\Sigma^{-1}E\bigl[\varepsilon\varepsilon^T \mid X\bigr]\Sigma^{-1}X\bigl(X^T\Sigma^{-1}X\bigr)^{-1} = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}X^T\Sigma^{-1}\Sigma\Sigma^{-1}X\bigl(X^T\Sigma^{-1}X\bigr)^{-1} = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}, \]
as indicated above.

Appendix E Two stage least squares IV (2SLS-IV)

In this appendix we develop two stage least squares instrumental variable (2SLS-IV) estimation more generally. Instrumental variables are attractive strategies whenever the fundamental condition for regression, $E[\varepsilon \mid X] = 0$, is violated but we can identify instruments $Z$ such that $E[\varepsilon \mid Z] = 0$.

E.1 General case

For the data generating process
\[ Y = X\beta + \varepsilon, \]
the IV estimator projects $X$ onto $Z$ (if $X$ includes an intercept so does $Z$), $\hat{X} = Z\bigl(Z^TZ\bigr)^{-1}Z^TX$, where $X$ is a matrix of regressors and $Z$ is a matrix of instruments. Then, the IV estimator for $\beta$ is
\[ b_{IV} = \bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^TY. \]
Provided $E[\varepsilon \mid Z] = 0$ and $Var[\varepsilon] = \sigma^2 I$, $b_{IV}$ is unbiased, $E[b_{IV}] = \beta$, and has variance $Var[b_{IV} \mid X, Z] = \sigma^2\bigl(\hat{X}^T\hat{X}\bigr)^{-1}$.

Unbiasedness is demonstrated as follows. Write
\[ b_{IV} = \bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^T(X\beta + \varepsilon). \]
Since $\hat{X} = P_ZX$ with $P_Z = Z\bigl(Z^TZ\bigr)^{-1}Z^T$ symmetric and idempotent,
\[ \hat{X}^T\hat{X} = X^TZ\bigl(Z^TZ\bigr)^{-1}Z^TZ\bigl(Z^TZ\bigr)^{-1}Z^TX = X^TZ\bigl(Z^TZ\bigr)^{-1}Z^TX = \hat{X}^TX, \]
so
\[ b_{IV} = \beta + \bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^T\varepsilon. \]
Then
\[ E[b_{IV} \mid X, Z] = \beta + \bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^TE[\varepsilon \mid X, Z] = \beta + 0 = \beta. \]
By iterated expectations, $E[b_{IV}] = E_{X,Z}\bigl[E[b_{IV} \mid X, Z]\bigr] = \beta$.

Variance of the IV estimator follows similarly,
\[ Var[b_{IV} \mid X, Z] = E\bigl[(b_{IV} - \beta)(b_{IV} - \beta)^T \mid X, Z\bigr]. \]
From the above development, $b_{IV} - \beta = \bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^T\varepsilon$. Hence,
\[ Var[b_{IV} \mid X, Z] = \bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^TE\bigl[\varepsilon\varepsilon^T \mid X, Z\bigr]\hat{X}\bigl(\hat{X}^T\hat{X}\bigr)^{-1} = \sigma^2\bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^T\hat{X}\bigl(\hat{X}^T\hat{X}\bigr)^{-1} = \sigma^2\bigl(\hat{X}^T\hat{X}\bigr)^{-1}. \]

E.2 Special case

In the special case where both $X$ and $Z$ have the same number of columns, the estimator can be further simplified,
\[ b_{IV} = \bigl(Z^TX\bigr)^{-1}Z^TY. \]
To see this, write out the estimator
\[ b_{IV} = \bigl(\hat{X}^T\hat{X}\bigr)^{-1}\hat{X}^TY = \Bigl[X^TZ\bigl(Z^TZ\bigr)^{-1}Z^TX\Bigr]^{-1}X^TZ\bigl(Z^TZ\bigr)^{-1}Z^TY. \]
Since $X^TZ$ and $Z^TX$ are square with linearly independent columns, we can invert each term,
\[ b_{IV} = \bigl(Z^TX\bigr)^{-1}\bigl(Z^TZ\bigr)\bigl(X^TZ\bigr)^{-1}X^TZ\bigl(Z^TZ\bigr)^{-1}Z^TY = \bigl(Z^TX\bigr)^{-1}Z^TY. \]
Of course, the estimator is unbiased, $E[b_{IV}] = \beta$, and the variance of the estimator is $Var[b_{IV} \mid X, Z] = \sigma^2\bigl(\hat{X}^T\hat{X}\bigr)^{-1}$, which can be written
\[ Var[b_{IV} \mid X, Z] = \sigma^2\bigl(Z^TX\bigr)^{-1}Z^TZ\bigl(X^TZ\bigr)^{-1} \]
in this special case where $X$ and $Z$ have the same number of columns.
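A small simulation helps illustrate why projecting onto instruments repairs the violated condition $E[\varepsilon \mid X] = 0$. The sketch below is ours; the data generating process and parameter values are invented for illustration. It computes $b_{IV}$ both ways described above and contrasts it with OLS, which is biased in this design.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Endogenous regressor: x is correlated with eps through u; z is a valid instrument
z = rng.normal(size=(n, 1))
u = rng.normal(size=(n, 1))
eps = rng.normal(size=(n, 1)) + u          # error term
x = z + u                                  # regressor contaminated by u
X = np.hstack([np.ones((n, 1)), x])        # regressors with intercept
Z = np.hstack([np.ones((n, 1)), z])        # instruments (same number of columns)
beta = np.array([[1.0], [2.0]])
Y = X @ beta + eps

# First stage: project X onto Z, then regress Y on the fitted values
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_iv = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)

# Special case (same column count): (Z'X)^{-1} Z'Y
b_iv_special = np.linalg.solve(Z.T @ X, Z.T @ Y)

b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(b_iv.ravel(), b_iv_special.ravel(), b_ols.ravel())
# b_iv and b_iv_special agree and are close to (1, 2); b_ols is biased upward
```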
Appendix F Seemingly unrelated regression (SUR)

First, we describe the seemingly unrelated regression (SUR) model. Second, we remind ourselves that Bayesian regression works as if we have two samples: one representative of our priors and another from the new evidence. Then, we connect to seemingly unrelated regression; both classical and Bayesian strategies are summarized.

We describe the SUR model in terms of a stacked regression, as if the latent variable in a binary selection setting were observable:
\[ r = X\beta + \varepsilon, \]
where
\[ r = \begin{bmatrix} U \\ Y_1 \\ Y_0 \end{bmatrix}, \qquad X = \begin{bmatrix} W & 0 & 0 \\ 0 & X_1 & 0 \\ 0 & 0 & X_2 \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} V_D \\ V_1 \\ V_0 \end{bmatrix}, \]
$\beta$ stacks the corresponding coefficient vectors, and $\varepsilon \sim N(0, V = \Sigma \otimes I_n)$. Let
\[ \Sigma = \begin{bmatrix} 1 & \sigma_{D1} & \sigma_{D0} \\ \sigma_{D1} & \sigma_{11} & \sigma_{10} \\ \sigma_{D0} & \sigma_{10} & \sigma_{00} \end{bmatrix}, \]
a $3 \times 3$ matrix (the selection-equation variance is normalized to one). Then the Kronecker product
\[ \Sigma \otimes I_n = \begin{bmatrix} I_n & \sigma_{D1}I_n & \sigma_{D0}I_n \\ \sigma_{D1}I_n & \sigma_{11}I_n & \sigma_{10}I_n \\ \sigma_{D0}I_n & \sigma_{10}I_n & \sigma_{00}I_n \end{bmatrix} \]
is a $3n \times 3n$ matrix, and $V^{-1} = \Sigma^{-1} \otimes I_n$.

F.1 Classical

Classical estimation of SUR follows generalized least squares (GLS):
\[ b = \Bigl[X^T\bigl(\Sigma^{-1} \otimes I_n\bigr)X\Bigr]^{-1}X^T\bigl(\Sigma^{-1} \otimes I_n\bigr)r \]
and
\[ Var[b] = \Bigl[X^T\bigl(\Sigma^{-1} \otimes I_n\bigr)X\Bigr]^{-1}. \]

F.2 Bayesian

On the other hand, Bayesian analysis employs a Gibbs sampler based on the conditional posteriors since, apparently, the SUR error structure prevents identification of conjugate priors. Recall, from the discussion of Bayesian linear regression with a general error structure, that the conditional posterior for $\beta$ is
\[ p\bigl(\beta \mid \Sigma, y, \beta_0, \Sigma_0\bigr) \sim N\bigl(\bar{\beta}, \bar{V}_\beta\bigr), \]
where
\[ \bar{\beta} = \bigl(X_0^T\Sigma_0^{-1}X_0 + X^T\Sigma^{-1}X\bigr)^{-1}\bigl(X_0^T\Sigma_0^{-1}X_0\,\beta_0 + X^T\Sigma^{-1}X\,b\bigr), \]
\[ b = \bigl(X^T\Sigma^{-1}X\bigr)^{-1}X^T\Sigma^{-1}y, \qquad \bar{V}_\beta = \bigl(X_0^T\Sigma_0^{-1}X_0 + X^T\Sigma^{-1}X\bigr)^{-1}, \]
and the prior is represented by the pseudo-sample $X_0$ with prior mean $\beta_0$ and prior variance structure $\Sigma_0$.

F.3 Bayesian treatment effect application

For the application of SUR to the treatment effect setting, we replace $y$ with the $3n \times 1$ vector $r = (U, y_1, y_0)$ and $\Sigma^{-1}$ with $\Sigma^{-1} \otimes I_n$, yielding the conditional posterior for $\beta$,
\[ p\bigl(\beta \mid \Sigma, r, \beta_0, \Sigma_0\bigr) \sim N\bigl(\bar{\beta}_{SUR}, \bar{V}_{SUR}\bigr), \]
where
\[ \bar{\beta}_{SUR} = \Bigl(X_0^T\Sigma_0^{-1}X_0 + X^T\bigl(\Sigma^{-1}\otimes I_n\bigr)X\Bigr)^{-1}\Bigl(X_0^T\Sigma_0^{-1}X_0\,\beta_0 + X^T\bigl(\Sigma^{-1}\otimes I_n\bigr)X\,b_{SUR}\Bigr), \]
\[ \bar{V}_{SUR} = \Bigl(X_0^T\Sigma_0^{-1}X_0 + X^T\bigl(\Sigma^{-1}\otimes I_n\bigr)X\Bigr)^{-1}, \qquad b_{SUR} = \Bigl[X^T\bigl(\Sigma^{-1}\otimes I_n\bigr)X\Bigr]^{-1}X^T\bigl(\Sigma^{-1}\otimes I_n\bigr)r. \]

Appendix G Maximum likelihood estimation of discrete choice models

The most common method for estimating the parameters of discrete choice models is maximum likelihood. The likelihood is the joint density of the observed choices, viewed as a function of the parameters of interest, conditional on the data $X_t$. For binary choice models, when $D_t = 1$ the contribution to the likelihood is $F(X_t\beta)$, and when $D_t = 0$ the contribution is $1 - F(X_t\beta)$, where these are combined as binomial draws and $F(\cdot)$ is the cumulative distribution function evaluated at $X_t\beta$. Hence, the likelihood is
\[ L(\beta \mid X) = \prod_{t=1}^n F(X_t\beta)^{D_t}\bigl[1 - F(X_t\beta)\bigr]^{1-D_t}. \]
The log-likelihood is
\[ \ell(\beta \mid X) \equiv \log L(\beta \mid X) = \sum_{t=1}^n D_t\log F(X_t\beta) + (1 - D_t)\log\bigl(1 - F(X_t\beta)\bigr). \]
Since this function is globally concave for binary response models like probit and logit, numerical maximization is straightforward. The first order conditions for a maximum, $\max_\beta \ell(\beta \mid X)$, are
\[ \sum_{t=1}^n \left[\frac{D_t\,f(X_t\beta)X_{ti}}{F(X_t\beta)} - \frac{(1 - D_t)\,f(X_t\beta)X_{ti}}{1 - F(X_t\beta)}\right] = 0, \qquad i = 1, \ldots, k, \]
where $f(\cdot)$ is the density function. Simplifying yields
\[ \sum_{t=1}^n \frac{\bigl[D_t - F(X_t\beta)\bigr]f(X_t\beta)X_{ti}}{F(X_t\beta)\bigl[1 - F(X_t\beta)\bigr]} = 0, \qquad i = 1, \ldots, k. \]
Estimates of $\beta$ are found by solving these first order conditions iteratively or, in other words, numerically. A common estimator for the variance of $\hat{\beta}_{MLE}$ is the negative inverse of the Hessian matrix evaluated at $\hat{\beta}_{MLE}$, $-H\bigl(D; \hat{\beta}\bigr)^{-1}$, where $H(D; \beta)$ is the Hessian of the log-likelihood with typical element $H_{ij}(D; \beta) = \partial^2\ell(D; \beta)/\partial\beta_i\partial\beta_j$. (Details can be found in numerous econometrics references and chapter 4 of Accounting and Causal Effects: Econometric Challenges.)
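As a concrete illustration of the likelihood above, the following sketch (ours; simulated data with a probit link) maximizes the binary-choice log-likelihood numerically with scipy. The reported standard errors come from the optimizer's numerical inverse-Hessian approximation, in the spirit of the variance estimator just described; treat the specifics as illustrative rather than as the text's procedure.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n, k = 2000, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k - 1))])
beta_true = np.array([0.5, -1.0])
D = (X @ beta_true + rng.normal(size=n) > 0).astype(float)   # probit DGP

def neg_loglik(beta):
    # l(beta|X) = sum D log F(X beta) + (1 - D) log(1 - F(X beta)), F = standard normal cdf
    p = norm.cdf(X @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)       # guard against log(0)
    return -np.sum(D * np.log(p) + (1 - D) * np.log(1 - p))

res = minimize(neg_loglik, x0=np.zeros(k), method="BFGS")
print(res.x)                                # ML estimates, near (0.5, -1.0)
print(np.sqrt(np.diag(res.hess_inv)))       # approximate standard errors from the inverse Hessian
```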
Appendix H Optimization

In this appendix, we briefly review optimization. First, we take up linear programming; then we review nonlinear programming. (For additional details, consult, for instance, Luenberger and Ye, 2010, Linear and Nonlinear Programming, Springer, or Luenberger, 1997, Optimization by Vector Space Methods, Wiley.)

H.1 Linear programming

A linear program (LP) is any optimization framework which can be described by a linear objective function and linear constraints. Linear refers to choice variables, say $x$, of no greater than first degree (affine transformations, which allow for parallel lines, are included). Prototypical examples are
\[ \max_{x \ge 0}\; c^Tx \quad s.t. \quad Ax \le r \]
or
\[ \min_{x \ge 0}\; c^Tx \quad s.t. \quad Ax \ge r. \]

H.1.1 basic solutions or extreme points

Basic solutions are typically determined from the standard form for an LP. Standard form involves equality constraints along with non-negative choice variables, $x \ge 0$. That is, $Ax \le r$ is rewritten in terms of slack variables, $s \ge 0$, such that $Ax + s = r$. The solution to this program is the same as the solution to the inequality program. A basic solution or extreme point is determined from an $m \times m$ submatrix composed of $m$ linearly independent columns of the standard form constraint matrix. The set of basic feasible solutions is then the collection of all basic solutions whose elements are non-negative.

Consider an example. Suppose $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, $r = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix}$, $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, and $s = \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}$. Then $Ax + s = r$ can be written $Bx_s = r$ where $B = \begin{bmatrix} A & I_2 \end{bmatrix}$, $I_2$ is a $2 \times 2$ identity matrix, and $x_s = \begin{bmatrix} x \\ s \end{bmatrix}$. Each basic solution works with two linearly independent columns of $B$, say $B_{ij}$, and the elements of $x_s$ other than $i$ and $j$ are set to zero. For instance, $B_{12}$ leads to the basic solution $x_1 = \frac{2r_1 - r_2}{3}$ and $x_2 = \frac{2r_2 - r_1}{3}$. The basic solutions are tabulated below (see also the enumeration sketch after H.1.2).

B_ij    x_1             x_2             s_1             s_2
B_12    (2r_1 - r_2)/3  (2r_2 - r_1)/3  0               0
B_13    r_2             0               r_1 - 2r_2      0
B_14    r_1/2           0               0               (2r_2 - r_1)/2
B_23    0               r_2/2           (2r_1 - r_2)/2  0
B_24    0               r_1             0               r_2 - 2r_1

To test feasibility, consider specific values for $r$. Suppose $r_1 = r_2 = 10$. The table with a feasibility indicator, $1(x_s \ge 0)$, becomes

B_ij    x_1     x_2     s_1     s_2     feasible
B_12    10/3    10/3    0       0       yes
B_13    10      0       -10     0       no
B_14    5       0       0       5       yes
B_23    0       5       5       0       yes
B_24    0       10      0       -10     no

Notice, when $x_2 = 0$ there is slack in the second constraint ($s_2 > 0$), and similarly when $x_1 = 0$ there is slack in the first constraint ($s_1 > 0$). Basic feasible solutions, an algebraic concept, are also referred to by their geometric counterpart, extreme points. Identification of basic feasible solutions or extreme points, combined with the fundamental theorem of linear programming, substantially reduces the search for an optimal solution.

H.1.2 fundamental theorem of linear programming

For a linear program in standard form where $A$ is an $m \times n$ matrix of rank $m$: (i) if there is a feasible solution, there is a basic feasible solution; (ii) if there is an optimal solution, there is a basic feasible optimal solution. Further, if more than one basic feasible solution is optimal, the edge between the basic feasible optimal solutions is also optimal. The theorem means the search for the optimal solution can be restricted to basic feasible solutions, a finite number of points.
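The enumeration of basic solutions is mechanical and easy to automate. The sketch below (ours) reproduces the tables above for $r_1 = r_2 = 10$ by solving each $2 \times 2$ basis of $[A\;\; I]$; it also reports the all-slack basis at the origin, which the tables above omit.

```python
import numpy as np
from itertools import combinations

# Standard form for the example: [A I] x_s = r, x_s >= 0
A = np.array([[2.0, 1.0], [1.0, 2.0]])
r = np.array([10.0, 10.0])
B = np.hstack([A, np.eye(2)])          # columns: x1, x2, s1, s2
names = ["x1", "x2", "s1", "s2"]

for cols in combinations(range(4), 2):
    sub = B[:, cols]
    if abs(np.linalg.det(sub)) < 1e-12:
        continue                        # columns not linearly independent
    basic = np.linalg.solve(sub, r)     # basic variables; the rest are zero
    xs = np.zeros(4)
    xs[list(cols)] = basic
    feasible = bool(np.all(xs >= -1e-12))
    print([names[c] for c in cols], np.round(xs, 3),
          "feasible" if feasible else "infeasible")
```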
H.1.3 duality theorems

Optimality programs come in pairs. That is, there is a complementary or dual program to the primary (primal) program. For instance, the dual to a maximization program is a minimization program, and vice versa:
\[ \text{primal:}\quad \max_{x \ge 0}\; c^Tx \quad s.t. \quad Ax \le r \qquad\qquad \text{dual:}\quad \min_{\lambda \ge 0}\; r^T\lambda \quad s.t. \quad A^T\lambda \ge c \]
or
\[ \text{primal:}\quad \min_{x \ge 0}\; c^Tx \quad s.t. \quad Ax \ge r \qquad\qquad \text{dual:}\quad \max_{\lambda \ge 0}\; r^T\lambda \quad s.t. \quad A^T\lambda \le c, \]
where $\lambda$ is a vector of shadow prices or dual variable values. The dual of the dual program is the primal program.

strong duality theorem. If either the primal or dual has an optimal solution, so does the other, and their optimal objective function values are equal. If one of the programs is unbounded, the other has no feasible solution.

weak duality theorem. For feasible solutions, the objective function value of the minimization program (say, the dual) is greater than or equal to that of the maximization program (say, the primal).

The intuition for the duality theorems is straightforward. Begin with the constraints
\[ Ax \le r, \qquad A^T\lambda \ge c. \]
Transposing both sides of the first constraint leaves the inequality unchanged: $x^TA^T \le r^T$. Now, post-multiply both sides of the transposed first constraint by $\lambda$ and pre-multiply both sides of the second constraint by $x^T$; since both $\lambda$ and $x$ are non-negative, the inequalities are preserved:
\[ x^TA^T\lambda \le r^T\lambda, \qquad x^TA^T\lambda \ge x^Tc. \]
Since $x^Tc$ is a scalar, $x^Tc = c^Tx$. Combining the results gives the relation we were after,
\[ c^Tx \le x^TA^T\lambda \le r^T\lambda. \]
The solution to the dual lies above that for the primal except when they both reside at the optimal solution, in which case their objective function values are equal.

H.1.4 example

Suppose we wish to solve
\[ \max_{x, y \ge 0}\; 10x + 12y \quad s.t. \quad \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \le \begin{bmatrix} 10 \\ 10 \end{bmatrix}. \]
We only need to evaluate the objective function at each of the basic feasible solutions we earlier identified: $10\left(\frac{10}{3}\right) + 12\left(\frac{10}{3}\right) = \frac{220}{3}$, $10(5) + 12(0) = 50 = \frac{150}{3}$, and $10(0) + 12(5) = 60 = \frac{180}{3}$. The optimal solution is $x = y = \frac{10}{3}$.

H.1.5 complementary slackness

Suppose $x \ge 0$ is an $n$ element vector containing a feasible primal solution, $\lambda \ge 0$ is an $m$ element vector containing a feasible dual solution, $s \ge 0$ is an $m$ element vector containing primal slack variables, and $t \ge 0$ is an $n$ element vector containing dual slack variables. Then, $x$ and $\lambda$ are optimal if and only if (element-by-element)
\[ x \cdot t = 0 \quad\text{and}\quad \lambda \cdot s = 0. \]
These conditions are economically sensible: either the scarce resource is exhausted ($s = 0$) or, if the resource is plentiful, it has no value ($\lambda = 0$).
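The example and the duality and complementary slackness conditions can be verified with an off-the-shelf solver. The sketch below (ours) solves the primal and dual of the example in H.1.4 with scipy.optimize.linprog and checks that the optimal objective values coincide and that the complementary slackness products vanish.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[2.0, 1.0], [1.0, 2.0]])
r = np.array([10.0, 10.0])
c = np.array([10.0, 12.0])

# Primal: max c'x s.t. Ax <= r, x >= 0  (linprog minimizes, so negate c)
primal = linprog(-c, A_ub=A, b_ub=r, bounds=[(0, None)] * 2, method="highs")

# Dual: min r'lam s.t. A'lam >= c, lam >= 0  (rewrite as -A'lam <= -c)
dual = linprog(r, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 2, method="highs")

x, lam = primal.x, dual.x
print(x, -primal.fun)          # x = y = 10/3, objective 220/3
print(lam, dual.fun)           # dual optimum also equals 220/3 (strong duality)

# Complementary slackness: primal slack s and dual surplus t
s = r - A @ x
t = A.T @ lam - c
print(np.round(lam * s, 9), np.round(x * t, 9))   # both vectors are (numerically) zero
```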
H.2 Nonlinear programming

H.2.1 unconstrained

Nonlinear programs involve nonlinear objective functions. For instance,
\[ \max_x\; f(x). \]
If the function is continuously differentiable, then a local optimum can be found by the first order approach. That is, set the gradient, the vector of partial derivatives $\partial f(x)/\partial x_i$, $i = 1, \ldots, n$, where there are $n$ choice variables $x$, equal to zero:
\[ \nabla f(x^*) = \begin{bmatrix} \partial f(x)/\partial x_1 \\ \vdots \\ \partial f(x)/\partial x_n \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}. \]
Second order (sufficient) conditions involve the Hessian, the matrix of second partial derivatives,
\[ H(x^*) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1\partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_1\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n\partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n\partial x_n} \end{bmatrix}. \]
For a local minimum, the Hessian is positive definite (the eigenvalues of $H$ are positive), while for a local maximum, the Hessian is negative definite (the eigenvalues of $H$ are negative).

H.2.2 convexity and global minima

If $f$ is a convex function (defined below), then the set where $f$ achieves its local minimum is convex and any local minimum is a global minimum. A function $f$ is convex if for every $x_1$, $x_2$, and $\alpha$, $0 \le \alpha \le 1$,
\[ f\bigl(\alpha x_1 + (1-\alpha)x_2\bigr) \le \alpha f(x_1) + (1-\alpha)f(x_2). \]
If, for $x_1 \ne x_2$ and $0 < \alpha < 1$,
\[ f\bigl(\alpha x_1 + (1-\alpha)x_2\bigr) < \alpha f(x_1) + (1-\alpha)f(x_2), \]
then $f$ is strictly convex. If $g = -f$ and $f$ is (strictly) convex, then $g$ is (strictly) concave.

H.2.3 example

Suppose we face the problem
\[ \min_{x,y}\; f(x, y) = x^2 - 10x + y^2 - 10y + xy. \]
The first order conditions are $\nabla f(x, y) = 0$, or
\[ \frac{\partial f}{\partial x} = 2x - 10 + y = 0, \qquad \frac{\partial f}{\partial y} = 2y - 10 + x = 0. \]
Since the problem is quadratic and the gradient is composed of linearly independent equations, a unique solution is immediately identifiable,
\[ \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 10 \\ 10 \end{bmatrix}, \]
or $x = y = \frac{10}{3}$ with objective function value $-\frac{100}{3}$. As the Hessian
\[ H = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \]
is positive definite, this solution is a minimum. Positive definiteness of the Hessian follows as the eigenvalues of $H$ are positive. To see this, recall that the sum of the eigenvalues equals the trace of the matrix and the product of the eigenvalues equals the determinant of the matrix. The eigenvalues of $H$ are 1 and 3, both positive.
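A direct numerical check of this example (ours, with NumPy) follows: solve the linear first order conditions and inspect the Hessian eigenvalues.

```python
import numpy as np

# Gradient of f(x, y) = x^2 - 10x + y^2 - 10y + xy set to zero:
#   2x + y = 10,  x + 2y = 10
Hess = np.array([[2.0, 1.0], [1.0, 2.0]])
rhs = np.array([10.0, 10.0])
xy = np.linalg.solve(Hess, rhs)
print(xy)                               # [10/3, 10/3]

x, y = xy
print(x**2 - 10*x + y**2 - 10*y + x*y)  # -100/3 = -33.333...

# Second order check: Hessian eigenvalues are 1 and 3, both positive, so a minimum
print(np.linalg.eigvalsh(Hess))
```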
H.2.4 constrained: the Lagrangian

Nonlinear programs involve either nonlinear objective functions, nonlinear constraints, or both. For instance,
\[ \max_{x \ge 0}\; f(x) \quad s.t. \quad G(x) \le r. \]
Suppose the objective function and constraints are continuously differentiable and concave and an optimal solution exists; then the optimal solution can be found via the Lagrangian. The Lagrangian writes the objective function less a Lagrange multiplier times each of the constraints. As either the multiplier is zero or the constraint is binding, each constraint term equals zero:
\[ \mathcal{L} = f(x) - \lambda_1\bigl[g_1(x) - r_1\bigr] - \cdots - \lambda_n\bigl[g_n(x) - r_n\bigr], \]
where $G(x)$ involves $n$ functions $g_i(x)$, $i = 1, \ldots, n$. Suppose $x$ involves $m$ choice variables. Then there are $m$ Lagrange equations plus the $n$ constraint conditions that determine the optimal solution:
\[ \frac{\partial\mathcal{L}}{\partial x_1} = 0, \;\ldots,\; \frac{\partial\mathcal{L}}{\partial x_m} = 0, \qquad \lambda_1\bigl[g_1(x) - r_1\bigr] = 0, \;\ldots,\; \lambda_n\bigl[g_n(x) - r_n\bigr] = 0. \]
The Lagrange multipliers (shadow prices or dual variable values) represent the rate of change in the optimal objective function value with respect to each of the constraints,
\[ \lambda_i = \frac{\partial f\bigl(x^*(r)\bigr)}{\partial r_i}, \]
where $x^*(r)$ rewrites the optimal solution in terms of the constraint values $r$. If a constraint is not binding, its multiplier is zero as it has no impact on the optimal objective function value. (Of course, these ideas regarding the multipliers apply to the shadow prices of linear programs as well.)

H.2.5 Karush-Kuhn-Tucker conditions

Originally, the Lagrangian only allowed for equality constraints. This was generalized to include inequality constraints by Karush and, separately, Kuhn and Tucker. Of course, some regularity conditions are needed to ensure optimality, and various necessary and sufficient conditions have been proposed to deal with the most general settings. The Karush-Kuhn-Tucker theorem supplies first order necessary conditions for a local optimum: the gradient of the Lagrangian equals zero and each Lagrange multiplier times its inequality constraint equals zero, with the Lagrange multipliers on the inequality constraints non-negative, all evaluated at $x^*$. Second order necessary (positive semi-definite Hessian of the Lagrangian at $x^*$) and sufficient (positive definite Hessian of the Lagrangian at $x^*$) conditions are roughly akin to those for unconstrained local minima.

H.2.6 example

Continue with the unconstrained example above with an added constraint,
\[ \min_{x,y}\; f(x, y) = x^2 - 10x + y^2 - 10y + xy \quad s.t. \quad xy \ge 10. \]
Since the unconstrained solution satisfies this constraint, $xy = \frac{100}{9} > 10$, the solution remains $x = y = \frac{10}{3}$. However, suppose the problem is
\[ \min_{x,y}\; f(x, y) = x^2 - 10x + y^2 - 10y + xy \quad s.t. \quad xy \ge 20. \]
The constraint is now active. The Lagrangian is
\[ \mathcal{L} = x^2 - 10x + y^2 - 10y + xy - \lambda(xy - 20) \]
and the first order conditions are $\nabla\mathcal{L} = 0$, or
\[ \frac{\partial\mathcal{L}}{\partial x} = 2x - 10 + y - \lambda y = 0, \qquad \frac{\partial\mathcal{L}}{\partial y} = 2y - 10 + x - \lambda x = 0, \]
along with the constraint equation
\[ \lambda(xy - 20) = 0. \]
A solution to these nonlinear equations is
\[ x = y = 2\sqrt{5}, \qquad \lambda = 3 - \sqrt{5}. \]
The objective function value is $60 - 40\sqrt{5} \approx -29.4427$ which, of course, is greater than the objective function value for the unconstrained problem, $-\frac{100}{3} \approx -33.3333$. The Hessian of the Lagrangian is
\[ H = \begin{bmatrix} 2 & 1 - \lambda \\ 1 - \lambda & 2 \end{bmatrix}, \]
which, evaluated at the solution ($1 - \lambda = \sqrt{5} - 2$), has eigenvalues $4 - \sqrt{5}$ and $\sqrt{5}$, both positive. Hence, the solution is a minimum.
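The active-constraint case can also be checked with a numerical constrained optimizer. The sketch below (ours) solves the $xy \ge 20$ version of the example with scipy's SLSQP method, starting from the feasible point (5, 5), and compares the result with the closed-form solution above.

```python
import numpy as np
from scipy.optimize import minimize

def f(v):
    x, y = v
    return x**2 - 10*x + y**2 - 10*y + x*y

# Inequality constraint in scipy's convention: fun(v) >= 0  <=>  xy >= 20
cons = [{"type": "ineq", "fun": lambda v: v[0]*v[1] - 20.0}]

res = minimize(f, x0=np.array([5.0, 5.0]), method="SLSQP", constraints=cons)
print(res.x, res.fun)                      # approx (4.472, 4.472), approx -29.443

print(2*np.sqrt(5), 60 - 40*np.sqrt(5))    # closed-form solution and objective value
```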
Appendix I Quantum information

Quantum information follows from physical theory and experiments. Unlike classical information, which is framed in the language and algebra of set theory, quantum information is framed in the language and algebra of vector spaces. To begin to appreciate the difference, consider a set versus a tuple (or vector). For example, the tuple $(1, 1)$ is different than the tuple $(1, 1, 1)$, but the sets $\{1, 1\}$ and $\{1, 1, 1\}$ are the same as the set $\{1\}$. Quantum information is axiomatic. Its richness and elegance are demonstrated in that only four axioms make it complete. (Some argue that the measurement axiom is contained in the transformation axiom, hence requiring only three axioms. Even though we appreciate the merits of the argument, we proceed with four axioms. Feel free to count them as only three if you prefer.)

I.1 Quantum information axioms

I.1.1 The superposition axiom

A quantum unit (qubit) is specified by a two element vector, say $\begin{bmatrix} \alpha \\ \beta \end{bmatrix}$, with $|\alpha|^2 + |\beta|^2 = 1$. Let |ψ⟩ = α|0⟩ + β|1⟩ and ⟨ψ| = (|ψ⟩)†, where † is the adjoint (conjugate transpose) operation. (Dirac notation is a useful descriptor, as $|0\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $|1\rangle = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.)

I.1.2 The transformation axiom

A transformation of a quantum unit is accomplished by unitary (length-preserving) matrix multiplication. The Pauli matrices provide a basis of unitary operators,
\[ I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}, \quad Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}, \]
where $i = \sqrt{-1}$. The operations work as follows:
\[ I\begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad X\begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \beta \\ \alpha \end{bmatrix}, \quad Y\begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} -i\beta \\ i\alpha \end{bmatrix}, \quad Z\begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \alpha \\ -\beta \end{bmatrix}. \]
Other useful single qubit transformations are the Hadamard $H = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$ and the phase gate (we denote it $\Phi$) $\Phi = \begin{bmatrix} e^{i\phi} & 0 \\ 0 & 1 \end{bmatrix}$. Examples of these transformations in Dirac notation are
\[ H|0\rangle = \frac{|0\rangle + |1\rangle}{\sqrt{2}}, \quad H|1\rangle = \frac{|0\rangle - |1\rangle}{\sqrt{2}}, \quad \Phi|0\rangle = e^{i\phi}|0\rangle, \quad \Phi|1\rangle = |1\rangle. \]
(A summary table for common quantum operators expressed in Dirac notation is provided in section I.1.5.)

I.1.3 The measurement axiom

Measurement occurs via interaction with the quantum state as if a linear projection is applied to the quantum state. (This is a compact way of describing measurement. However, conservation of information, a principle of quantum information, demands that operations be reversible. In other words, all transformations, including interactions to measure, must be unitary, and projections are not unitary. However, there always exist unitary operators that produce the same post-measurement state as that indicated via projection. Hence, we treat projections as an expedient for describing measurement.) The set of projection matrices is complete in that they add to the identity matrix,
\[ \sum_m M_m^\dagger M_m = I, \]
where $M_m^\dagger$ is the adjoint (conjugate transpose) of projection matrix $M_m$. The probability of a particular measurement occurring is the squared absolute value of the projection. (An implication of the axiom not explicitly used here is that the post-measurement state is the projection appropriately normalized; this effectively rules out multiple measurement.)

For example, let the projection matrices be $M_0 = |0\rangle\langle 0| = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$ and $M_1 = |1\rangle\langle 1| = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}$. Note that $M_0$ projects onto the $|0\rangle$ vector and $M_1$ projects onto the $|1\rangle$ vector. Also note that $M_0^\dagger M_0 + M_1^\dagger M_1 = M_0 + M_1 = I$. For |ψ⟩ = α|0⟩ + β|1⟩, the projection of |ψ⟩ onto |0⟩ is $M_0|\psi\rangle = \alpha|0\rangle$. The probability of |0⟩ being the result of the measurement is $\langle\psi|M_0|\psi\rangle = |\alpha|^2$.

I.1.4 The combination axiom

Qubits are combined by tensor multiplication. For example, two |0⟩ qubits are combined as $|0\rangle \otimes |0\rangle = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}$, denoted |00⟩. It is often useful to transform one qubit in a combination and leave another unchanged; this can also be accomplished by tensor multiplication. Let $H_1$ denote a Hadamard transformation on the first qubit. Then, applied to a two qubit system,
\[ H_1 = H \otimes I = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \quad\text{and}\quad H_1|00\rangle = \frac{|00\rangle + |10\rangle}{\sqrt{2}}. \]
Another important two qubit transformation is the controlled not operator,
\[ Cnot = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}. \]
The controlled not operator flips the target (second) qubit if the control (first) qubit equals |1⟩ and otherwise leaves the target unchanged: Cnot|00⟩ = |00⟩, Cnot|01⟩ = |01⟩, Cnot|10⟩ = |11⟩, and Cnot|11⟩ = |10⟩.

Entangled two qubit states, or Bell states, are defined as follows:
\[ |\beta_{00}\rangle = Cnot\,H_1|00\rangle = \frac{|00\rangle + |11\rangle}{\sqrt{2}} \]
and, more generally, $|\beta_{ij}\rangle = Cnot\,H_1|ij\rangle$ for $i, j = 0, 1$. The four two qubit Bell states form an orthonormal basis.

I.1.5 Summary of quantum "rules"

Below we tabulate two qubit quantum operator rules. The column heading indicates the initial state while the row value corresponding to the operator identifies the transformed state. Of course, the same rules apply to one qubit (except Cnot) or many qubits; we simply have to exercise care to identify which qubit is transformed by the operator (we continue to denote the target qubit via the subscript on the operator).

operator     |00⟩               |01⟩               |10⟩               |11⟩
I_1 or I_2   |00⟩               |01⟩               |10⟩               |11⟩
X_1          |10⟩               |11⟩               |00⟩               |01⟩
X_2          |01⟩               |00⟩               |11⟩               |10⟩
Z_1          |00⟩               |01⟩               -|10⟩              -|11⟩
Z_2          |00⟩               -|01⟩              |10⟩               -|11⟩
Y_1          i|10⟩              i|11⟩              -i|00⟩             -i|01⟩
Y_2          i|01⟩              -i|00⟩             i|11⟩              -i|10⟩
H_1          (|00⟩+|10⟩)/√2     (|01⟩+|11⟩)/√2     (|00⟩-|10⟩)/√2     (|01⟩-|11⟩)/√2
H_2          (|00⟩+|01⟩)/√2     (|00⟩-|01⟩)/√2     (|10⟩+|11⟩)/√2     (|10⟩-|11⟩)/√2
Φ_1          e^{iφ}|00⟩         e^{iφ}|01⟩         |10⟩               |11⟩
Φ_2          e^{iφ}|00⟩         |01⟩               e^{iφ}|10⟩         |11⟩
Cnot         |00⟩               |01⟩               |11⟩               |10⟩

common quantum operator rules

I.1.6 Observables and expected payoffs

Measurements involve real values drawn from observables. To ensure measurements lead to real values, observables are also represented by Hermitian matrices. A Hermitian matrix is one in which the conjugate transpose of the matrix equals the original matrix, $M^\dagger = M$. Hermitian matrices have real eigenvalues, and these eigenvalues are the values realized via measurement. Suppose we are working with observable $M$ in state |ψ⟩, where
\[ M = \begin{bmatrix} \lambda_1 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 \\ 0 & 0 & 0 & \lambda_4 \end{bmatrix} = \lambda_1|00\rangle\langle 00| + \lambda_2|01\rangle\langle 01| + \lambda_3|10\rangle\langle 10| + \lambda_4|11\rangle\langle 11|. \]
The expected payoff is
\[ \langle M\rangle = \langle\psi|M|\psi\rangle = \lambda_1\langle\psi|00\rangle\langle 00|\psi\rangle + \lambda_2\langle\psi|01\rangle\langle 01|\psi\rangle + \lambda_3\langle\psi|10\rangle\langle 10|\psi\rangle + \lambda_4\langle\psi|11\rangle\langle 11|\psi\rangle. \]
In other words, $\lambda_1$ is observed with probability $\langle\psi|00\rangle\langle 00|\psi\rangle$, $\lambda_2$ with probability $\langle\psi|01\rangle\langle 01|\psi\rangle$, $\lambda_3$ with probability $\langle\psi|10\rangle\langle 10|\psi\rangle$, and $\lambda_4$ with probability $\langle\psi|11\rangle\langle 11|\psi\rangle$.
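The axioms translate directly into small linear algebra computations. The sketch below (ours) builds the Bell state $|\beta_{00}\rangle = Cnot\,H_1|00\rangle$ with NumPy, recovers the measurement probabilities as squared magnitudes, and evaluates the expected payoff $\langle\psi|M|\psi\rangle$ for an observable with assumed payoffs $\lambda = (1, 2, 3, 4)$ on the computational basis; the payoff values are arbitrary.

```python
import numpy as np

# Single qubit basis states and gates
ket0 = np.array([1, 0], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
I2 = np.eye(2, dtype=complex)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# Combination axiom: |00> via the tensor (Kronecker) product; H_1 = H (x) I
ket00 = np.kron(ket0, ket0)
H1 = np.kron(H, I2)

# Bell state |beta_00> = Cnot H_1 |00> = (|00> + |11>)/sqrt(2)
bell = CNOT @ (H1 @ ket00)
print(np.round(bell, 3))

# Measurement axiom: probabilities are squared magnitudes of the projections
probs = np.abs(bell) ** 2           # onto |00>, |01>, |10>, |11>
print(np.round(probs, 3))           # [0.5, 0, 0, 0.5]

# Observable with payoffs lambda on the computational basis; expected payoff <psi|M|psi>
lam = np.array([1.0, 2.0, 3.0, 4.0])
M = np.diag(lam).astype(complex)
expected = np.real(bell.conj() @ M @ bell)
print(expected)                     # 0.5 * 1 + 0.5 * 4 = 2.5
```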
Appendix J Common distributions

To try to avoid confusion, we list our descriptions of common multivariate and univariate distributions and their kernels. Others may employ variations on these definitions.

Multivariate distributions (support; density $f(\cdot)$ and kernel; conjugacy):

Gaussian (normal): $x \in R^k$, $\mu \in R^k$, $\Sigma$ a $k \times k$ positive definite matrix.
\[ f(x; \mu, \Sigma) = (2\pi)^{-k/2}|\Sigma|^{-1/2}\exp\!\left[-\tfrac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right] \propto \exp\!\left[-\tfrac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right]. \]
Conjugate prior for the mean of a multivariate normal distribution.

Student t: $x \in R^k$, $\mu \in R^k$, $\Sigma$ a $k \times k$ positive definite matrix, $\nu > 0$.
\[ f(x; \mu, \Sigma, \nu) = \frac{\Gamma[(\nu+k)/2]}{\Gamma(\nu/2)(\nu\pi)^{k/2}|\Sigma|^{1/2}}\left[1 + \tfrac1\nu(x-\mu)^T\Sigma^{-1}(x-\mu)\right]^{-(\nu+k)/2} \propto \left[1 + \tfrac1\nu(x-\mu)^T\Sigma^{-1}(x-\mu)\right]^{-(\nu+k)/2}. \]
Marginal posterior for a multivariate normal with unknown mean and variance.

Wishart: $W$ and $S$ $k \times k$ positive definite, $\nu > k - 1$.
\[ f(W; \nu, S) = \frac{|W|^{(\nu-k-1)/2}\exp\!\left[-\tfrac12\mathrm{Tr}\bigl(S^{-1}W\bigr)\right]}{2^{\nu k/2}|S|^{\nu/2}\Gamma_k(\nu/2)} \propto |W|^{(\nu-k-1)/2}\exp\!\left[-\tfrac12\mathrm{Tr}\bigl(S^{-1}W\bigr)\right]. \]
See inverse Wishart.

Inverse Wishart: $W$ and $S$ $k \times k$ positive definite, $\nu > k - 1$.
\[ f(W; \nu, S) = \frac{|S|^{\nu/2}|W|^{-(\nu+k+1)/2}\exp\!\left[-\tfrac12\mathrm{Tr}\bigl(SW^{-1}\bigr)\right]}{2^{\nu k/2}\Gamma_k(\nu/2)} \propto |W|^{-(\nu+k+1)/2}\exp\!\left[-\tfrac12\mathrm{Tr}\bigl(SW^{-1}\bigr)\right]. \]
Conjugate prior for the variance of a multivariate normal distribution.

Here $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t}\,dt$, $\Gamma(n) = (n-1)!$ for $n$ a positive integer, and $\Gamma_k(\nu/2) = \pi^{k(k-1)/4}\prod_{j=1}^k \Gamma\!\left(\frac{\nu+1-j}{2}\right)$.

Univariate distributions (support; density $f(\cdot)$ and kernel; conjugacy):

Beta: $x \in (0,1)$, $\alpha, \beta > 0$. $f(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} \propto x^{\alpha-1}(1-x)^{\beta-1}$. Beta is the conjugate prior to the binomial.

Binomial: $s = 0, 1, \ldots, n$, $\theta \in (0,1)$. $f(s; n, \theta) = \binom{n}{s}\theta^s(1-\theta)^{n-s} \propto \theta^s(1-\theta)^{n-s}$. Beta is the conjugate prior to the binomial.

Chi-square: $x \in [0,\infty)$, $\nu > 0$. $f(x; \nu) = \frac{x^{\nu/2-1}\exp[-x/2]}{2^{\nu/2}\Gamma(\nu/2)} \propto x^{\nu/2-1}e^{-x/2}$. See scaled, inverse chi-square.

Inverse chi-square: $x \in (0,\infty)$, $\nu > 0$. $f(x; \nu) = \frac{2^{-\nu/2}}{\Gamma(\nu/2)}x^{-(\nu/2+1)}\exp[-1/(2x)] \propto x^{-(\nu/2+1)}e^{-1/(2x)}$. See scaled, inverse chi-square.

Scaled, inverse chi-square: $x \in (0,\infty)$, $\nu, \sigma^2 > 0$. $f(x; \nu, \sigma^2) = \frac{(\nu\sigma^2/2)^{\nu/2}}{\Gamma(\nu/2)}x^{-(\nu/2+1)}\exp\!\left[-\nu\sigma^2/(2x)\right] \propto x^{-(\nu/2+1)}e^{-\nu\sigma^2/(2x)}$. Conjugate prior for the variance of a normal distribution.

Exponential: $x \in [0,\infty)$, $\lambda > 0$. $f(x; \lambda) = \lambda\exp[-\lambda x] \propto \exp[-\lambda x]$. Gamma is the conjugate prior to the exponential.

Extreme value (logistic): $x \in (-\infty,\infty)$, $-\infty < \mu < \infty$, $s > 0$. $f(x; \mu, s) = \frac{\exp[-(x-\mu)/s]}{s\bigl(1+\exp[-(x-\mu)/s]\bigr)^2} \propto \frac{\exp[-(x-\mu)/s]}{\bigl(1+\exp[-(x-\mu)/s]\bigr)^2}$. Posterior for a Bernoulli prior and normal likelihood.

Gamma: $x \in [0,\infty)$, $\alpha, \beta > 0$. $f(x; \alpha, \beta) = \frac{x^{\alpha-1}\exp[-x/\beta]}{\Gamma(\alpha)\beta^{\alpha}} \propto x^{\alpha-1}\exp[-x/\beta]$. Gamma is the conjugate prior to the exponential and others.

Inverse gamma: $x \in [0,\infty)$, $\alpha, \beta > 0$. $f(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{-(\alpha+1)}\exp[-\beta/x] \propto x^{-(\alpha+1)}\exp[-\beta/x]$. Conjugate prior for the variance of a normal distribution.

Gaussian (normal): $x \in (-\infty,\infty)$, $-\infty < \mu < \infty$, $\sigma > 0$. $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \propto \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$. Conjugate prior for the mean of a normal distribution.

Student t: $x \in (-\infty,\infty)$, $\mu \in (-\infty,\infty)$, $\sigma, \nu > 0$. $f(x; \mu, \sigma^2, \nu) = \frac{\Gamma[(\nu+1)/2]}{\Gamma(\nu/2)\sqrt{\nu\pi}\,\sigma}\left[1 + \frac1\nu\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-(\nu+1)/2} \propto \left[1 + \frac1\nu\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-(\nu+1)/2}$. Marginal posterior for a normal with unknown mean and variance.

Pareto: $x \ge x_0 > 0$, $\alpha > 0$. $f(x; \alpha, x_0) = \frac{\alpha x_0^{\alpha}}{x^{\alpha+1}} \propto \frac{1}{x^{\alpha+1}}$. Conjugate prior for the unknown bound of a uniform.
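The conjugacy claims are easy to check numerically. The following is a minimal sketch (ours, with arbitrary parameter values) for the first entry: a beta prior combined with a binomial likelihood yields a beta posterior, so a grid-normalized product of the prior and the likelihood kernel should reproduce the Beta$(\alpha+s, \beta+n-s)$ posterior mean.

```python
import numpy as np
from scipy import stats

# Beta(a, b) prior on theta; binomial evidence: s successes in n trials
a, b = 2.0, 3.0
n, s = 10, 7

# Conjugate update: the posterior is Beta(a + s, b + n - s)
post = stats.beta(a + s, b + (n - s))
print(post.mean())                       # (a + s) / (a + b + n) = 0.6

# Numerical check: normalize prior times likelihood kernel on a grid
theta = np.linspace(1e-6, 1 - 1e-6, 20001)
kernel = stats.beta(a, b).pdf(theta) * theta**s * (1 - theta)**(n - s)
d = theta[1] - theta[0]
kernel /= kernel.sum() * d               # normalize the posterior density
print((theta * kernel).sum() * d)        # matches the conjugate posterior mean
```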