Contents

advertisement
This is page i
Printer: Opaque this
Contents
A Linear algebra basics
A.1 Basic operations . . . . . . . . . . . . . . . . . . . . . . . .
A.2 Fundamental theorem of linear algebra . . . . . . . . . . . .
A.2.1 Part one . . . . . . . . . . . . . . . . . . . . . . . . .
A.2.2 Part two . . . . . . . . . . . . . . . . . . . . . . . . .
A.3 Nature of the solution . . . . . . . . . . . . . . . . . . . . .
A.3.1 Exactly-identi…ed . . . . . . . . . . . . . . . . . . . .
A.3.2 Under-identi…ed . . . . . . . . . . . . . . . . . . . .
A.3.3 Over-identi…ed . . . . . . . . . . . . . . . . . . . . .
A.4 Matrix decomposition and inverse operations . . . . . . . .
A.4.1 LU factorization . . . . . . . . . . . . . . . . . . . .
A.4.2 Cholesky decomposition . . . . . . . . . . . . . . . .
A.4.3 Singular value decomposition . . . . . . . . . . . . .
A.4.4 Spectral decomposition . . . . . . . . . . . . . . . .
A.4.5 quadratic forms, eigenvalues, and positive de…niteness
A.4.6 similar matrices and Jordan form . . . . . . . . . . .
A.5 Gram-Schmidt construction of an orthogonal matrix . . . .
A.5.1 QR decomposition . . . . . . . . . . . . . . . . . . .
A.5.2 Gram-Schmidt QR algorithm . . . . . . . . . . . . .
A.5.3 Accounting example . . . . . . . . . . . . . . . . . .
A.5.4 The Householder QR algorithm . . . . . . . . . . . .
A.5.5 Accounting example . . . . . . . . . . . . . . . . . .
A.6 Some determinant identities . . . . . . . . . . . . . . . . . .
A.6.1 Determinant of a square matrix . . . . . . . . . . . .
1
1
3
4
5
6
6
7
9
10
10
14
15
21
21
21
23
25
26
26
28
28
30
30
ii
Contents
A.6.2 Identities . . . . . . . . . . . . . . . . . . . . . . . .
31
B Iterated expectations
B.1 Decomposition of variance . . . . . . . . . . . . . . . . . . .
33
34
C Multivariate normal theory
C.1 Conditional distribution . . . . . . . . . . . . . . . . . . . .
C.2 Special case of precision . . . . . . . . . . . . . . . . . . . .
35
36
40
D Generalized least squares (GLS)
43
E Two stage least squares IV (2SLS-IV)
E.1 General case . . . . . . . . . . . . . . . . . . . . . . . . . . .
E.2 Special case . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
45
46
F Seemingly unrelated regression (SUR)
F.1 Classical . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
F.2 Bayesian . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
F.3 Bayesian treatment e¤ect application . . . . . . . . . . . . .
49
50
50
51
G Maximum likelihood estimation of discrete choice models 53
H Optimization
H.1 Linear programming . . . . . . . . . . . . . . . . . .
H.1.1 basic solutions or extreme points . . . . . . .
H.1.2 fundamental theorem of linear programming .
H.1.3 duality theorems . . . . . . . . . . . . . . . .
H.1.4 example . . . . . . . . . . . . . . . . . . . . .
H.1.5 complementary slackness . . . . . . . . . . .
H.2 Nonlinear programming . . . . . . . . . . . . . . . .
H.2.1 unconstrained . . . . . . . . . . . . . . . . . .
H.2.2 convexity and global minima . . . . . . . . .
H.2.3 example . . . . . . . . . . . . . . . . . . . . .
H.2.4 constrained — the Lagrangian . . . . . . . .
H.2.5 Karash-Kuhn-Tucker conditions . . . . . . . .
H.2.6 example . . . . . . . . . . . . . . . . . . . . .
I
Quantum information
I.1 Quantum information axioms . . . . . .
I.1.1 The superposition axiom . . . . .
I.1.2 The transformation axiom . . . .
I.1.3 The measurement axiom . . . . .
I.1.4 The combination axiom . . . . .
I.1.5 Summary of quantum "rules" . .
I.1.6 Observables and expected payo¤s
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
55
55
56
57
57
58
58
59
59
59
60
60
61
62
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
63
63
63
64
64
65
66
66
Contents
J Common distributions
iii
69
iv
Contents
This is page 1
Printer: Opaque this
Appendix A
Linear algebra basics
A.1 Basic operations
We frequently envision or frame problems as linear systems of equations.1
It is useful to write this compactly in matrix notation, say
Ay = x
where A is an m n (rows
columns) matrix (a rectangular array of
elements), y is an n-element vector, and x is an m-element vector. This
statement compares the result on the left with that on the right, elementby-element. The operation on the left is matrix multiplication or each element is recovered by a vector inner product of the corresponding row from
A with the vector y. That is, the …rst element of the product vector Ay is
the vector inner product of the …rst row A with y, the second element of
the product vector is the inner product of the second row A with y, and
so on. A vector inner product multiplies the same position element of the
leading row and trailing column and sums over the products. Of course, this
means that the operation is only well-de…ned if the number of columns in
the leading matrix, A, equals the number of rows of the trailing, y. Further,
the product matrix has the same number of rows as the leading matrix and
1 G. Strang, Linear Algebra and its Applications, Harcourt Brace Jovanovich College Publishers, or Introduction to Linear Algebra, Wellesley-Cambridge Press o¤ers a
mesmerizing discourse on linear algebra.
2
Appendix A. Linear algebra basics
columns of the trailing. For example, let
2
3
a11 a12 a13 a14
A = 4 a21 a22 a23 a24 5 ;
a31 a32 a33 a34
2
3
y1
6 y2 7
7
y=6
4 y3 5 ;
y4
and
then
2
2
3
x1
x = 4 x2 5
x3
3
a11 y1 + a12 y2 + a13 y3 + a14 y4
Ay = 4 a21 y1 + a22 y2 + a23 y3 + a24 y4 5
a31 y1 + a32 y2 + a33 y3 + a34 y4
The system of equations also covers matrix addition and scalar multiplication by a matrix in the sense that we can rewrite the equations as
Ay
x=0
First, multiplication by a scalar or constant simply multiplies each element
of the matrix by the scalar. In this instance, we multiple the elements of x
by 1.
2
3 2
3
2 3
a11 y1 + a12 y2 + a13 y3 + a14 y4
x1
0
4 a21 y1 + a22 y2 + a23 y3 + a24 y4 5 4 x2 5 = 4 0 5
a31 y1 + a32 y2 + a33 y3 + a34 y4
0
x3
2
3 2
3
2 3
a11 y1 + a12 y2 + a13 y3 + a14 y4
x1
0
4 a21 y1 + a22 y2 + a23 y3 + a24 y4 5 + 4 x2 5 = 4 0 5
a31 y1 + a32 y2 + a33 y3 + a34 y4
0
x3
Then, we add the m-element vector x to the
same position elements are summed.
2
a11 y1 + a12 y2 + a13 y3 + a14 y4
4 a21 y1 + a22 y2 + a23 y3 + a24 y4
a31 y1 + a32 y2 + a33 y3 + a34 y4
m-element vector Ay where
3 2 3
x1
0
x2 5 = 4 0 5
x3
0
Again, the operation is only well-de…ned for equal size matrices and, unlike
matrix multiplication, matrix addition always commutes. Of course, the
m-element vector (the additive identity) on the right has all zero elements.
By convention, vectors are represented in columns. So how do we represent an inner product of a vector with itself? We create a row vector by
A.2 Fundamental theorem of linear algebra
3
transposing the original. Transposition simply puts columns of the original
into same position rows of the transposed. For example, y T y represents the
vector inner product (the product is a scalar) of y with itself where the
superscript T represents transposition.
2
3
y1
6 y2 7
7
y1 y 2 y3 y 4 6
yT y =
4 y3 5
y4
= y12 + y22 + y32 + y42
Similarly, we might be interested in AT A.
2
3
a11 a21 a31 2
6 a12 a22 a32 7 a11 a12 a13
74
AT A = 6
4 a13 a23 a33 5 a21 a22 a23
a31 a32 a33
a14 a24 a34
!
2
a2 + a2
a a
+a a
a a
+a a
6
6
6
=6
6
4
11 13
21 23
+a31 a33
a11 a14 + a21 a24
+a31 a34
a12 a14 + a22 a24
+a32 a34
a11 a12 + a21 a22
+a31 a32
11 12
21 22
+a31 a32
!
2
a12 + a2
22
+a2
32
a11 a13 + a21 a23
+a31 a33
a12 a13 + a22 a23
+a32 a33
a12 a13 + a22 a23
+a32 a33
!
2
a2
13 + a23
+a2
33
a11 a14 + a21 a24
+a31 a34
a12 a14 + a22 a24
+a32 a34
a13 a14 + a23 a24
+a33 a34
11
21
+a2
31
3
a14
a24 5
a34
a13 a14 + a23 a24
+a33 a34
!
2
a2
14 + a24
+a2
34
This yields an n n symmetric product matrix. A matrix is symmetric if
the matrix equals its transpose, A = AT . Or, AAT which yields an m m
product matrix.
2
3
2
3 a11 a21 a31
a11 a12 a13 a14 6
a12 a22 a32 7
7
AAT = 4 a21 a22 a23 a24 5 6
4 a13 a23 a33 5
a31 a32 a33 a34
a14 a24 a34
2
3
2
2
6
6
6
=6
6
6
4
a11 + a12
+a213 + a214
a11 a21 + a12 a22
+a13 a23 + a14 a24
a11 a31 + a12 a22
+a13 a33 + a14 a34
a11 a21 + a12 a22
+a13 a23 + a14 a24
a221 + a222
+a223 + a224
a21 a31 + a22 a32
+a23 a33 + a24 a34
a11 a31 + a12 a22
+a13 a33 + a14 a34
a21 a31 + a22 a32
+a23 a33 + a24 a34
a231 + a232
+a233 + a234
A.2 Fundamental theorem of linear algebra
With these basic operations in hand, return to
Ay = x
When is there a unique solution, y? The answer lies in the fundamental
theorem of linear algebra. The theorem has two parts.
7
7
7
7
7
7
5
3
7
7
7
7
7
5
4
Appendix A. Linear algebra basics
A.2.1 Part one
First, the theorem says that for every matrix the number of linearly independent rows equals the number of linearly independent columns. Linearly
independent vectors are the set of vectors such that no one of them can be
duplicated by a linear combination of the other vectors in the set. A linear combination of vectors is the sum of scalar-vector products where each
vector may have a di¤erent scalar multiplier. For example, Ay is a linear
combination of the columns of A with the scalars in y. Therefore, if there exists some (n 1)-element vector, w, when multiplied by an (m (n 1))
submatrix of A, call it B, such that Bw produces the dropped column
from A then the dropped column is not linearly independent of the other
columns. To reiterate, if the matrix A has r linearly independent columns it
also has r linearly independent rows. r is referred to as the rank of the matrix and dimension of the rowspace and columnspace (the spaces spanned
by all possible linear combination of the rows and columns, respectively).
Further, r linearly independent rows of A form a basis for its rowspace and
r linearly independent columns of A form a basis for its columnspace.
Accounting example
Consider an incidence matrix describing the journal entry properties of
accounting in its columns (each column has one +1 and 1 in it with
the remaining elements equal to zero) and T accounts in its rows. The
rows capture changes in account balances when multiplied by a transaction
amounts vector y. By convention, we assign +1 for a debit entry and 1
for a credit entry. Suppose
2
1
6 1
A=6
4 0
0
1
0
1
0
1
0
0
1
1
0
0
1
0
1
0
1
3
0
0 7
7
1 5
1
where the rows represent cash, noncash assets, liabilities, and owners’equity, for instance. Notice, 1 times the sum of any three rows produces the
remaining row. Since we cannot produce another row from the remaining
two rows, the number of linearly independent rows is 3. By the fundamental theorem, the number of linearly independent columns must also be 3.
Let’s check. Suppose the …rst three columns is a basis for the columnspace.
Column 4 is the negative of column 3, column 5 is the negative of the sum
of columns 1 and 3, and column 6 is the negative of the sum of columns 2
and 3. Can any of columns 1, 2 and 3 be produced as a linearly combination
of the remaining two columns? No, the zeroes in rows 2 through 4 rule it
out. For this matrix, we’ve con…rmed the number of linearly independent
rows and columns is the same.
A.2 Fundamental theorem of linear algebra
5
A.2.2 Part two
The second part of the fundamental theorem describes the orthogonal complements to the rowspace and columnspace. Two vectors are orthogonal if
they are perpendicular to one another. As their vector inner product is
proportional to the cosine of the angle between them, if their vector inner
product is zero they are orthogonal.2 n-space is spanned by the rowspace
(with dimension r) plus the n r dimension orthogonal complement, the
nullspace where
AN T = 0
N is an (n r) n matrix whose rows are orthogonal to the rows of A and
0 is an m (n r) matrix of zeroes.
Accounting example continued
For the A matrix above, a basis for the nullspace is
2
1
N =4 0
0
1
1
0
0 0 1
1 0 0
1 1 0
3
1
1 5
0
and
AN T
2
1
6 1
= 6
4 0
0
2
0
6 0
= 6
4 0
0
1
0
1
0
0
0
0
0
1
0
0
1
1
0
0
1
0
1
0
1
3
0
0 7
7
0 5
0
3
2
6
0
6
6
0 7
76
1 56
6
4
1
1 0
1 1
0 1
0 0
1 0
1 1
0
0
1
1
0
0
3
7
7
7
7
7
7
5
Similarly, m-space is spanned by the columnspace (with dimension r) plus
the m r dimension orthogonal complement, the left nullspace where
T
(LN ) A = 0
LN is an m (m r) matrix whose rows are orthogonal to the columns of
A and 0 is an (m r) n matrix of zeroes. The origin is the only point in
common to the four subspaces: rowspace, columnspace, nullspace, and left
2 The inner product of a vector with itself is the squared length (or squared norm) of
the vector.
6
Appendix A. Linear algebra basics
nullspace.
2
3
1
6 1 7
7
LN = 6
4 1 5
1
and
(LN ) A =
1
1
1
1
2
=
0
0
0
0
0
T
1
6 1
6
4 0
0
1
0
1
0
0
1
0
0
1
1
0
0
1
0
1
0
1
3
0
0 7
7
1 5
1
A.3 Nature of the solution
A.3.1 Exactly-identi…ed
If r = m = n, then there is a unique solution, y, to By = x, and the
problem is said to be exactly-identi…ed.3 Since B is square and has a full
set of linearly independent rows and columns, the rows and columns of
B span r space (including x) and the two nullspaces have dimension zero.
Consequently, there exists a matrix B 1 , the inverse of B, when multiplied
by B produces the identity matrix, I. The identity matrix is a matrix when
multiplied (on the left or on the right) by any other vector or matrix leaves
that vector or matrix unchanged. The identity matrix is a square matrix
with ones along the principal diagonal and zeroes on the o¤-diagonals.
3
2
1 0
0
6
.. 7
6 0 1
. 7
7
I=6
7
6 .
.
..
..
4 ..
. 0 5
0
0 1
Hence,
B
1
By
Iy
y
1
= B
= B
= B
x
x
1
x
3
2
;
1
Suppose
B=
3 Both
y and x are r-element vectors.
1
4
A.3 Nature of the solution
7
and
x=
6
5
then
y
4
10
2
10
1
10
3
10
3
2
1
4
y1
y2
1
0
0
1
y1
y2
= B
"
=
=
y1
y2
=
"
"
1
x
4
10
2
10
24 5
10
12+15
10
#
19
10
3
10
1
10
3
10
#
#
6
5
A.3.2 Under-identi…ed
However, it is more common for r
m; n with one inequality strict. In
this case, spanning m-space draws upon both the columnspace and left
nullspace and spanning n-space draws from both the rowspace and the
nullspace. If the dimension of the nullspace is greater than zero, then it is
likely that there are many solutions, y, that satisfy Ay = x. On the other
hand, if the dimension of the left nullspace is positive and the dimension
of the nullspace is zero, then typically there is no exact solution, y, for
Ay = x. When r < n, the problem is said to be under-identi…ed (there are
more unknown parameters than equations) and a complete set of solutions
can be described by the solution that lies entirely in the rows of A (this is
often called the row component as it is a linear combination of the rows)
plus arbitrary weights on the nullspace of A. The row component, y RS(A) ,
can be found by projecting any consistent solution, y p , onto a basis for the
rows (any linearly independent set of r rows) of A. Let Ar be a submatrix
derived from A with r linearly independent rows. Then,
y RS(A)
T
=
(Ar )
=
(PAr ) y p
T
Ar (Ar )
1
Ar y p
and
y = y RS(A) + N T k
T
T
1
where PAr is the projection matrix, (Ar ) Ar (Ar )
Ar , onto the rows
of A and k is any n-element vector of arbitrary weights.
8
Appendix A. Linear algebra basics
Return to our accounting example above. Suppose the changes in account
balances are
2
3
2
6 1 7
7
x=6
4 1 5
2
Then, a particular solution can be found by setting, for example, the last
three elements of y equal to zero and solving for the remaining elements.
2
1
1
2
0
0
0
6
6
6
yp = 6
6
6
4
3
7
7
7
7
7
7
5
so that
2
1
6 1
6
4 0
0
1
0
1
0
1
0
0
1
1
0
0
1
0
1
0
1
2
3
6
0
6
7
0 76
6
1 56
6
4
1
Ay p = x
3
1
3
2
2
1 7
7
7
6
2 7
7 = 6 1 7
7
4
1 5
0 7
5
2
0
0
Let Ar be the …rst three rows.
2
1
A =4 1
0
r
2
PAr
6
6
1 6
6
=
12 6
6
4
7
1
2
2
5
1
1 1
0 0
1 0
1
7
2
2
1
5
1
0
0
2
2
4
4
2
2
3
0
0 5
1
0
1
0
2
2
4
4
2
2
5
1
2
2
7
1
1
5
2
2
1
7
3
7
7
7
7
7
7
5
A.3 Nature of the solution
9
and
y RS(A)
(PAr ) y p
2
7
6 1
6
1 6
6 2
=
12 6
6 2
4 5
1
3
2
1
6 5 7
7
6
7
16
6 4 7
=
6
66 4 7
7
4 5 5
1
=
1
7
2
2
1
5
2
2
4
4
2
2
2
2
4
4
2
2
5
1
2
2
7
1
1
5
2
2
1
7
32
76
76
76
76
76
76
54
1
1
2
0
0
0
3
7
7
7
7
7
7
5
The complete solution, with arbitrary weights k, is
y
y
2
6
6
6
6
6
6
4
y1
y2
y3
y4
y5
y6
3
7
7
7
7
7
7
5
= y RS(A) + N T k
3 2
2
1
6 5 7 6
7 6
6
7 6
16
6 4 7+6
=
6
6
66 4 7
7 6
4 5 5 4
1
3 2
2
1
6 5 7 6
7 6
6
7 6
16
6 4 7+6
=
6
6
66 4 7
7 6
4 5 5 4
1
1
1
0
0
1
1
3
0 0
2
3
1 0 7
7 k1
7
1 1 74
k2 5
0 1 7
7 k3
0 0 5
1 0
3
k1
k1 + k2
k2 + k3
k3
k1
k1 + k2
7
7
7
7
7
7
5
A.3.3 Over-identi…ed
In the case where there is no exact solution, m > r = n, the vector that
lies entirely in the columns of A which is nearest x is frequently identi…ed
as the best approximation. This case is said to be over-identi…ed (there are
more equations than unknown parameters) and this best approximation
is the column component, y CS(A) , and is found via projecting x onto the
columns of A. A common variation on this theme is described by
Y =X
where Y is an n-element vector and X is an n p matrix. Typically, no
exact solution for (a p-element vector) exists, p = r (X is composed of
10
Appendix A. Linear algebra basics
linearly independent columns), and
b=
CS(A)
= XT X
1
XT Y
is known as the ordinary least squares (OLS ) estimator of and the estimated conditional expectation function is the projection of Y into the
columns of X
1
Xb = X X T X
X T Y = PX Y
T
For example, let X = (Ar ) and Y = y P . then
2
3
4
1
b= 4 5 5
6
1
and Xb = PX Y = y RS(A) .
2
6
6
16
Xb = PX Y = 6
66
6
4
1
5
4
4
5
1
3
7
7
7
7
7
7
5
A.4 Matrix decomposition and inverse operations
Inverse operations are inherently related to the fundamental theorem and
matrix decomposition. There are a number of important decompositions,
we’ll focus on four: LU factorization, Cholesky decomposition, singular
value decomposition, and spectral decomposition.
A.4.1 LU factorization
Gaussian elimination is the key to solving systems of linear equations and
gives us LU decomposition.
Nonsingular case
Any square, nonsingular matrix A (has linearly independent rows and
columns) can be written as the product of a lower triangular matrix, L,
times an upper triangular matrix, U .4
A = LU
4 The
general case, A is a m
n matrix, is discussed below.
A.4 Matrix decomposition and inverse operations
11
where L is lower triangular meaning that it has all zero elements above
the main diagonal and U is upper triangular meaning that it has all zero
elements below the main diagonal. Gaussian elimination says we can write
any system of linear equations in triangular form so that by backward
recursion we solve a series of one equation, one variable problems. This is
accomplished by row operations: row eliminations and row exchanges. Row
eliminations involve a series of operations where a scalar multiple of one
row is added to a target row so that a revised target row is produced until
a triangular matrix, L or U , is generated. As the same operation is applied
to both sides (the same row(s) of A and x) equality is maintained. Row
exchanges simply revise the order of both sides (rows of A and elements
of x) to preserve the equality. Of course, the order in which equations are
written is ‡exible.
In principle then, Gaussian elimination on
Ay = x
involves, for instance, multiplication of both sides by the inverse of L,
provided the inverse exists (m = r),
L
1
Ay
1
L LU y
Uy
1
= L
= L
= L
x
x
1
x
1
As Gaussian elimination is straightforward, we have a simple approach for
…nding whether the inverse of the lower triangular matrix exists and, if so,
its elements. Of course, we can identify L in similar fashion
L = AU 1
= LU U
1
T
Let A = Ar (Ar ) the 3 3 full rank matrix from the accounting example
above.
2
3
4
1
1
0 5
A=4 1 2
1 0
2
We …nd U by Gaussian elimination. Multiply row 1 by 1=4 and add to rows
2 and 3 to revise rows 2 and 3 as follows.
2
3
4
1
1
4 0 7=4
1=4 5
0
1=4 7=4
Now, multiply row 2 by 1/7 and add this result to row 3 to identify U .
2
3
4
1
1
1=4 5
U = 4 0 7=4
0
0
12=7
12
Appendix A. Linear algebra basics
Notice we have constructed L 1 in the process.
2
32
3
1
0
0
1
0 0
1
0 5 4 1=4 1 0 5
L 1 = 4 0
0 1=7 1
1=4 0 1
2
3
1
0
0
1
0 5
= 4 1=4
2=7 1=7 1
so that
2
1
4 1=4
2=7
0
1
1=7
L
32
0
4
0 54 1
1
1
1
A
3 = U
2
1
4
1
0 5 = 4 0 7=4
2
0
0
1
2
0
Also, we have L in hand.
L= L
1
3
1
1=4 5
12=7
1
and
2
1
4 1=4
2=7
0
1
1=7
32
0
`11
0 5 4 `21
1
`31
0
`22
`32
L 1L
3 = I2
0
1
0 5 = 4 0
`33
0
0
1
0
3
0
0 5
1
From the …rst row-…rst column, `11 = 1. From the second row-…rst column,
1=4`11 +1`21 = 0, or `21 = 1=4. From the third row-…rst column, 2=7`11 +
1=7`21 + 1`31 = 0, or `31 = 2=7 + 1=28 = 1=4. From the second rowsecond column, `22 = 1. From the third row-second column, 1=7`22 +1`32 =
0, or `32 = 1=7. And, from the third row-third column, `33 = 1. Hence,
3
2
1
0
0
1
0 5
L = 4 1=4
1=4
1=7 1
and
2
4
For
1
1=4
1=4
32
0
0
4
1
1
0 5 4 0 7=4
1=7 1
0
0
2
LU
3 = A
2
1
4
1=4 5 = 4 1
12=7
1
3
1
x=4 3 5
5
1
2
0
3
1
0 5
2
A.4 Matrix decomposition and inverse operations
13
the solution to Ay = x is
2
4
4 0
0
1
7=4
0
Ay
LU y
Uy
32
3
1
y1
1=4 5 4 y2 5
12=7
y3
= x
= x
= L
2
= 4
1
x
5
3 2
3
1
1
5 = 4 11=4 5
3 1=4
1=4 + 11=28
36=7
Backward recursive substitution solve for y. From row three, 12=7y3 =
36=7 or y3 = 7=12 36=7 = 3. From row two, 7=4y2 1=4y3 = 11=4, or
y2 = 4=7 (11=4 + 3=4) = 2. And, from row one, 4y1 y2 y3 = 1, or
y1 = 1=4 ( 1 + 2 + 3) = 1. Hence,
2
3
1
y=4 2 5
3
General case
If the inverse of A doesn’t exist (the matrix is singular), we …nd some
equations after elimination are 0 = 0, and possibly, some elements of y are
not uniquely determinable as discussed above for the under-identi…ed case.
For an m n matrix A, the general form of LU factorization may involve
row exchanges via a permutation matrix, P .
P A = LU
where L is lower triangular with ones on the diagonal and U is an m n
upper echelon matrix with the pivots along the main diagonal.
LU decomposition can also be written as LDU factorization where, as
before, L and U are lower and upper triangular matrices but now have ones
along their diagonals and D is a diagonal matrix with the pivots of A along
its diagonal.
Returning to the accounting example, we utilize LU factorization to solve
for y, a set of transactions amounts that are consistent with the …nancial
statements.
2
1
6 1
6
4 0
0
1
0
1
0
1
0
0
1
1
0
0
1
0
1
0
1
2
3
6
0
6
7
0 76
6
1 56
6
4
1
Ay = x
3
y1
2
3
y2 7
2
7
6
7
y3 7
7 = 6 1 7
7
4
y4 7
1 5
5
y5
2
y6
14
Appendix A. Linear algebra basics
For this A matrix, P = I4 (no row exchanges are called for), and row one
is added to row two, the revised row two is added to row three, and the
revised row three is added to row four, which gives
2
3
2 3
2
3 y1
6 y2 7
2
1
1 1
1 0
0
6
7
6 0
7 6 y3 7 6 3 7
1
1
1
1
0
6
76
7=6 7
7 4 2 5
4 0
0 1
1
1
1 56
6 y4 7
4
0
y5 5
0
0 0 0
0
0
y6
The last row conveys no information and the third row indicates we have
three free variables. Recall, for our solution, y p above, we set y4 = y5 = y6 =
0 and solved. From row three, y3 = 2. From row two, y2 = (3 2) = 1.
And, from row one, y1 = (2 1 2) = 1. Hence,
3
2
1
6 1 7
7
6
6 2 7
p
7
6
y =6
7
6 0 7
4 0 5
0
A.4.2 Cholesky decomposition
If the matrix A is symmetric, positive de…nite5 as well as nonsingular, then
we have A = LDLT as U = LT . In this symmetric case, we identify another useful factorization, Cholesky decomposition. Cholesky decomposition
writes
T
A=
1
1
where = LD 2 and D 2 has the square root of the pivots on the diagonal.
Since A is positive de…nite, all of its pivots are positive and their square
root is real so, in turn, is real. Of course, we now have
1
A =
=
1
T
T
T
T
or
A
5A
T
1
=
=
1
matrix, A, is positive de…nite if its quadratic form is strictly positive
xT Ax > 0
for all nonzero x.
A.4 Matrix decomposition and inverse operations
15
T
For the example above, A = Ar (Ar ) , A is symmetric, positive de…nite
and we found
2
4
4 1
1
1
2
0
A
3 = LU
2
1
1
0 5 = 4 1=4
2
1=4
32
0
0
4
1
1
0 5 4 0 7=4
1=7 1
0
0
Factoring the pivots from U gives D and LDLT .
T
A = LDL
2
1
= 4 1=4
1=4
32
0
0
4
1
0 54 0
1=7 1
0
0
7=4
0
And, the Cholesky decomposition is
32
0
1
0 54 0
12=7
0
3
1
1=4 5
12=7
1=4
1
0
3
1=4
1=7 5
1
1
= LD 2
2
= 4
2
6
= 6
4
32
3
2 p0
0
0
0
5
1
0 54 0
7=4 p 0
1=7 1
0
0
12=7
3
0
0
p
7
7
0 7
2
q 5
1
p
2 37
2 7
1
1=4
1=4
2
1=2
1=2
so that
2
4
4 1
1
1
2
0
A =
2
3
1
6
0 5 = 6
4
2
T
2
1=2
1=2
0
p
7
2
1
p
2 7
0
32
2
76
0 76 0
q 54
2 37
0
1=2
1=2
p
7
2
0
1
p
2 7
2
q
3
7
3
7
7
5
A.4.3 Singular value decomposition
Now, we introduce a matrix factorization that exists for every matrix. Singular value decomposition says every m n matrix, A, can be written as the
product of a m m orthogonal matrix, U , multiplied by a diagonal m n
matrix, , and …nally multiplied by the transpose of a n n orthogonal
matrix, V .6 U is composed of the eigenvectors of AAT , V is composed of
6 An orthogonal (or unitary ) matrix is comprised of orthonormal vectors. That is,
mutually orthogonal, unit length vectors so that U 1 = U T and U U T = U T U = I.
16
Appendix A. Linear algebra basics
the eigenvectors of AT A, and
contains the singular values (the square
root of the eigenvalues of AAT or AT A) along the diagonal.
A=U VT
Further, singular value decomposition allows us to de…ne a general inverse
or pseudo-inverse, A+ .
A+ = V + U T
where + is an n m diagonal matrix with nonzero elements equal to the
reciprocal of those for . This implies
AA+ A = A
A+ AA+ = A+
A+ A
T
= A+ A
AA+
T
= AA+
and
Also, for the system of equations
Ay = x
the least squares solution is
y CS(A) = A+ x
and AA+ is always the projection onto the columns of A. Hence,
AA+ = PA = A AT A
1
AT
if A has linearly independent columns. Or,
AT AT
+
=
A+ A
=
V
+
T
UT U V T
T
T
= V T UT U + V T
= A+ A
1
= PAT = AT AAT
A
if A has linearly independent rows (if AT has linearly independent columns).
For the accounting example, recall the row component is the consistent
solution to Ay = x that is only a linearly combination of the rows of A;
A.4 Matrix decomposition and inverse operations
17
that is, it is orthogonal to the nullspace. Utilizing the pseudo-inverse we
have
y RS(A)
+
= AT AT
= PAT y p
T
1
T
(Ar )
=
yp
Ar (Ar )
Ar y p
= A+ Ay p
or simply, since Ay p = x
y RS(A)
= A+ x
2
=
6
6
1 6
6
24 6
6
4
2
=
6
6
16
6
66
6
4
5
5
4
4
1
1
3
1
5 7
7
4 7
7
4 7
7
5 5
1
9
3
0
0
9
3
3
9
0
0
3
9
1
1
4
4
5
5
3
2
3
7
2
7
76 1 7
76
7
74 1 5
7
5
2
The beauty of singular value decomposition is that any m
A, can be factored as
AV = U
n matrix,
since
AV V T = A = U V T
where U and V are m m and n n orthogonal matrices (of eigenvectors),
respectively, and is a m n matrix with singular values along its main
diagonal.
Eigenvalues are characteristic values or singular values of a square matrix
and eigenvectors are characteristic vectors or singular vectors of the matrix
such that
AAT u = u
or we can work with
AT Av = v
where u is an m-element unitary (uT u = 1) eigenvector (component of Q1 ),
v is an n-element unitary (v T v = 1) eigenvector (component of Q2 ), and
is an eigenvalue of AAT or AT A. We can write AAT u = u as
T
AA
AAT u
I u
=
=
Iu
0
18
Appendix A. Linear algebra basics
or write AT Av = v as
AT Av
A A
I v
T
=
=
Iv
0
then solve for unitary vectors u, v, and and roots . For instance, once we
have i and ui in hand. We …nd vi by
uTi A =
i vi
such that vi is unit length, viT vi = 1.
The sum of the eigenvalues equals the trace of the matrix (sum of the
main diagonal elements) and the product of the eigenvalues equals the
determinant of the matrix. A singular matrix has some zero eigenvalues
and pivots (the det (A) = [product of the pivots]), hence the determinant
of a singular matrix, det (A), is zero.7 The eigenvalues can be found by
solving det AAT
I = 0. Since this is an m order polynomial, there are
m eigenvalues associated with an m m matrix.
Accounting example
Return to the accounting example for an illustration. The singular value
decomposition (SVD) of A proceeds as follows. We’ll work with the square,
symmetric matrix AAT . Notice, by SVD,
AAT
= U VT U VT
= U V T V T UT
T T
= U
U
7 The determinant is a value associated with a square matrix with many (some useful)
properties. For instance, the determinant provides a test of invertibility (linear independence). If det (A) = 0, then the matrix is singular and the inverse doesn’t exist; otherwise
det (A) 6= 0, the matrix is nonsingular and the inverse exists. The determinant is the
volume of a parallelpiped in n-dimensions where the edges come from the rows of A.
The determinant of a triangular matrix is the product of the main diagonal elements.
Determinants are unchanged by row eliminations and their sign is changed by row exchanges. The determinant of the transpose of a matrix equals the determinant of the
matrix, det (A) = det AT . The determinant of the product of matrices is the product
of their determinants, det (AB) = det (A) det (B). Some useful determinant identities
are reported in section …ve of the appendix.
A.4 Matrix decomposition and inverse operations
19
so that the eigenvalues of AAT are the squared singular values of A,
The eigenvalues are found by solving for the roots of8
2
6
det 6
4
det AAT
4
1
1
1
2
1
0
2
0
1
2
1
1
2
1
48 + 44
2
Im
=
3
4
12
3
+
7
7 =
5
4
=
T
.
0
0
0
Immediately, we see that one of the roots is zero,9 and
48 + 44 2
(
2) (
12 3 + 4 = 0
4) (
6) = 0
or
= f6; 4; 2; 0g
for AAT .10 The eigenvectors for AAT are found by solving (employ Gaussian
elimination and back substitution)
AAT
i I4
ui = 0
Since there is freedom in the solution, we can make the vectors orthonormal
(see Gram-Schmidt discussion below). For instance, AAT h6I4 u1 = 0
p1
0
a 0 0 a , so we make a = p12 and uT1 =
leads to uT1 =
2
Now, the complementary right hand side eigenvector, v1 , is found by
uT1 A =
v1
=
p
1 v1
2
6
6
6
6
6
1
p uT1 A = 6
6
6
6
6
6
4
1
p
2 3
1
p
2 3
p1
3
p1
3
1
p
2 3
1
p
2 3
3
7
7
7
7
7
7
7
7
7
7
5
8 Below we show how to …nd the determinant of a square matrix and illustrate with
this example.
9 For det AT A
I6 = 0, we have 48 3 + 44 4 12 5 + 6 = 0. Hence, there are
at least three zero roots. Otherwise, the roots are the same as for AAT .
1 0 Clearly,
= f6; 4; 2; 0; 0; 0g for AT A.
0
p1
2
i
.
20
Appendix A. Linear algebra basics
Repeating these steps for the remaining eigenvalues (in descending order;
remember its important to match eigenvectors with eigenvalues) leads to
2
p1
2
6
6
U =6
6
4
and
2
0
0
p1
2
1
p
1
2
1
2
2 3
1
p
2 3
p1
3
6
6
6
6
6
6
V =6
6
6
6
6
4
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
0
p1
2
p1
2
0
p1
3
p1
3
p
p3
2 2
1
p
2 6
p1
6
0
0
0
0
p1
3
0
0
0
1
p
2 3
1
p
2 3
1
2
1
2
1
2
1
2
0
3
7
7
7
7
5
1
p
2 6
1
p
2 6
1
6
q
p
p3
2 2
1
p
2 6
p1
3
2
3
1
p
2 6
1
p
2 6
3
7
7
7
7
7
7
7
7
7
7
7
5
where U U T = U T U = I4 and V V T = V T V = I6 .11 Remarkably,
A = U VT
2 p1
6
6
= 6
6
4
2
1
2
1
2
1
2
1
2
2
0
0
p1
2
2
6
6
6
6
6
6
6
6
6
6
6
4
1
p
2 3
1
p
2 3
p1
3
0
p1
2
p1
2
0
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
3
2 p
6
7
76 0
76
74 0
5
0
p1
3
p1
3
p
p3
2 2
1
p
2 6
p1
6
0
0
0
0
p1
3
0
0
0
1
p
2 3
1
p
2 3
1
2
1
2
1
2
1
2
0
1
6 1
= 6
4 0
0
1
0
1
0
1
0
0
1
1
0
0
1
p1
3
0
1
0
1
p
p3
2 2
1
p
2 6
3
0
0 7
7
1 5
1
3
0 0 0 0 0
2 p0 0 0 0 7
7
0
2 0 0 0 5
0 0 0 0 0
1
p
2 6
1
p
2 6
1
6
q
2
3
1
p
2 6
1
p
2 6
3T
7
7
7
7
7
7
7
7
7
7
7
5
1 1 There are many choices for the eigenvectors associated with zero eigenvalues. We
select them so that they orthonormal. As with the other eigenvectors, this is not unique.
A.4 Matrix decomposition and inverse operations
21
where is m n (4 6) with the square root of the eigenvalues (in descending order) on the main diagonal.
A.4.4 Spectral decomposition
When A is a square, symmetric matrix, singular value decomposition can
be expressed as spectral decomposition.
A = U UT
where U is an orthogonal matrix. Notice, the matrix on the right is the
transpose of the matrix on the left. This follows as AAT = AT A when
A = AT . We’ve illustrated this above if when we decomposed AAT , a
square symmetric matrix.
AAT
= U UT
2 p1
6
6
= 6
4
2
6
= 6
4
2
0
0
1
p
2
4
1
1
2
1
2
1
2
1
2
1
2
1
2
0
1
0
p1
2
1
p
2
0
1
0
2
1
1
2
1
2
1
2
1
2
3
2
1 7
7
1 5
4
3
2
6
7
76 0
74 0
5
0
0
4
0
0
0
0
2
0
3
2
0
6
0 76
6
5
0 4
0
1
p
2
0
0
p1
2
1
2
1
2
1
2
1
2
0
1
p
2
1
p
2
0
A.4.5 quadratic forms, eigenvalues, and positive de…niteness
A symmetric matrix A is positive de…nite if the quadratic form xT Ax is positive for every nonzero x. Positive semi-de…niteness follows if the quadratic
form is non-negative, xT Ax
0 for every nonzero x. Negative de…nite
and negative semi-de…nite symmetric matrices follow in analogous fashion
where the quadratic form is negative or non-positive, respectively. A positive (semi-) de…nite matrix has positive (non-negative) eigenvalues. This
result follows immediately from spectral decomposition. Let y = Qx (y is
arbitrary since x is) and write the spectral decomposition of A as QT Q
where Q is an orthogonal matrix and is a diagonal matrix composed of
the eigenvalues of A. Then the quadratic form xT Ax > 0 can be written as
xT QT Qx > 0 or y T y > 0. Clearly, this is only true if , the eigenvalues,
are all positive.
A.4.6 similar matrices and Jordan form
Now, we provide some support for properties associated with eigenvalues.
Namely, for any square matrix the sum of the eigenvalues equals the trace
1
2
1
2
1
2
1
2
3T
7
7
7
5
22
Appendix A. Linear algebra basics
of the matrix and the product of the eigenvalues equals the determinant of
the matrix. To aid with this discussion we …rst develop the idea of similar
matrices and the Jordan form of a matrix.
Two matrices, A and B, are similar if there exists M and M 1 such that
B = M 1 AM . Similar matrices have the same eigenvalues as seen from
Ax = x where x is an eigenvector of A associated with .
AM M
Ax =
1
x =
x
x
Since M B = AM , we have
M BM 1 x =
M 1 M BM 1 x =
B M 1x =
x
M 1x
M 1x
Hence, A and B have the same eigenvalues where x is the eigenvector of A
and M 1 x is the eigenvector of B.
From here we can see A and B have the same trace and determinant.
First, we’ll demonstrate,
Xvia example, the trace of a matrix equals the
sum of its eigenvalues,
i = tr (A) for any square matrix A. Consider
a11 a12
A =
where tr (A) = a11 + a22 . The eigenvalues of A are
a21 a22
determined from solving det (A
I) = 0.
2
(a11
) (a22
)
(a11 + a22 ) + a11 a22
The two roots or eigenvalues are
q
2
(a11 + a22 )
a11 + a22
=
2
a12 a21
a12 a21
4 (a11 a22
= 0
= 0
a12 a21 )
and their sum is 1 + 2 =
a11 + a22 = tr (A). The idea extends to any
X
square matrix A such that
I) for
i = tr (A). This follows as det (A
any n n matrix A has
coe¢ cient on the n 1 term equal to minus the
X
coe¢ cient on n times
2 example above.12
i , as in the 2
We’ll demonstrate the determinant result in two parts: one for diagonalizable matrices and one for non-diagonalizable matrices using their Jordan
form. Any diagonalizable matrix can be written as A = S S 1 . The deter1
minant of A is then jAj = S S 1 = jSj j j S 1 = j j since S 1 = jSj
Y
which follows from SS 1 = S 1 jSj = jIj = 1. Now, we have j j =
i.
1 2 For n even the coe¢ cient on n is 1 and for n odd the coe¢ cient on
the coe¢ cient on n 1 of opposite sign.
n
is
1 with
A.5 Gram-Schmidt construction of an orthogonal matrix
23
The second part follows from similar matrices and the Jordan form.
When a matrix is not diagonalizable because it doesn’t have a complete
set of linearly independent eigenvectors, we say it is nearly diagonalizable
when it’s in Jordan form. Jordan form means the matrix is nearly diagonal
except for perhaps ones immediately above the diagonal.
1 0
For example, the identity matrix,
, is in Jordan form as well as
0 1
1 1
being diagonalizable while
is in Jordan form but not diagonaliz0 1
able. Even though both matrices have the same eigenvalues they are not
1 1
similar matrices as there exists no M such that
equals M 1 IM .
0 1
Nevertheless, the Jordan form is the characteristic form for a family
of similar matrices as there exists P such that P 1 AP = J where J
1 4
has
is the Jordan form for the family. For instance, A = 13
1 5
1 1
2 1
Jordan form
with P =
. Consider another example,
0 1
1 2
13 6
3 1
3 1
A = 15
has Jordan form
with P =
. Since
1 12
0 2
1 2
they are similar matrices, A and J have the same eigenvalues. Plus, as in
the above examples, the eigenvalues lie on the diagonal of J Y
in general.
The determinant of A = P JP 1 = jP j jJj P 1 = jJj =
i . This
completes the argument.
To summarize, for any n n matrix A:
(1) jAj =
Y
and
(2) tr (A) =
X
i;
i:
A.5 Gram-Schmidt construction of an orthogonal
matrix
Before we put this section to bed, we’ll undertake one more task. Construction of an orthogonal matrix (that is, a matrix with orthogonal, unit length
vectors so that QQT = QT Q = I). Suppose we have a square, symmetric
matrix
2
3
2 1 1
A=4 1 3 2 5
1 2 3
24
Appendix A. Linear algebra basics
1
2
with eigenvalues
columns)
v1
v2
v3
7+
p
p
17 ; 12 7
2
1
2
=4
3+
1
1
p
17 ; 1
1
2
17
and eigenvectors (in the
p
3
17
1
1
3
0
1 5
1
The …rst two columns are not orthogonal to one another and none of the
columns are unit length.
First, the Gram-Schmidt procedure normalizes the length of the …rst
vector
v1
q1 = p
v1T v1
p
3
2
17
p 3+ p
6 q 34 6 17 7
6
7
2p
7
= 6
6
17 3 17 7
4 q
5
2p
17 3 17
2
3
0:369
4 0:657 5
0:657
Then, …nds the residuals (null component) of the second vector projected
onto q1 .13
r2
= 4
Now, normalize r2
q1T v2
3
p
3
17
5
1
1
= v2 1
2 1
2
q2 = p
r2
r2T r2
so that q1 and q2 are orthonormal vectors. Let
Q12
=
2
q1
q2
p
3+
17
p
p
34 6 17
6 q
6
= 6
6
17
4 q
2
2p
17 3 17
0:369
4 0:657
0:657
1 3 Since
the v1T v1
1
2p
3 17
p
3
p3+ 17
p
34+6 17 7
q
7
2p
7
17+3 17 7
5
q
2p
17+3 17
3
0:929
0:261 5
0:261
term is the identity, we omit it in the development.
A.5 Gram-Schmidt construction of an orthogonal matrix
25
Finally, compute the residuals of v3 projected onto Q12
r3
= v3
2
= 4
and normalize its length.14
Q12 QT12 v3
3
0
1 5
1
r3
q3 = p
r3T r3
Then,
Q =
2
q1
q2 q3
p
3+
17
p
p
34 6 17
6 q
6
= 6
6
17
4 q
2
2p
3 17
2p
17 3 17
0:369
4 0:657
0:657
p
p3+ 17
p
34+6 17
q
q
0:929
0:261
0:261
2p
17+3 17
2p
17+3 17
3
0
0:707 5
0:707
0
p1
2
p1
2
3
7
7
7
7
5
and QQT = QT Q = I. If there are more vectors then we continue along the
same lines with the fourth vector made orthogonal to the …rst three vectors
(by …nding its residual from the projection onto the …rst three columns)
and then normalized to unit length, and so on.
A.5.1 QR decomposition
QR is another important (especially for computation) matrix decomposition. QR combines Gram-Schmidt orthogonalization and Gaussian elimination to factor an m n matrix A with linearly independent columns
into a matrix composed of orthonormal columns, Q such that QT Q = I,
multiplied by a square, invertible upper triangular matrix R. This provides
distinct advantages when dealing with projections into the column space of
A. Recall, this problem takes the form Ay = b where the objective is to …nd
y that minimizes the distance to b. Since A = QR, we have QRy = b and
R 1 QT QRy = y = R 1 QT b. Next, we summarize the steps for two QR
algorithms: the Gram-Schmidt approach and the Householder approach.
1
1 4 Again, QT Q
= I so it is omitted in the expression. In this example, v3 is
12 12
orthogonal to Q12 (as well as v1 and v2 ) so it is unaltered.
26
Appendix A. Linear algebra basics
A.5.2 Gram-Schmidt QR algorithm
The Gram-Schmidt algorithm proceeds as described above to form Q. Let
a denote the …rst column of A and construct a1 = p aT to normalize the
a a
…rst column. Construct the projection matrix for this column, P1 = aT1 a1
(since a1 is normalized the inverse of aT1 a1 is unity so it’s dropped from the
expression). Now, repeat with the second column. Let a denote the second
column of A and make it orthogonal to a1 by rede…ning it as a = (I P1 ) a.
Then normalize via a2 = p aT . Construct the projection matrix for this
a a
column, P2 = aT2 a2 . The third column is made orthonormal in similar
fashion. Let a denote the third column of A and make it orthogonal to
a1 and a2 by rede…ning it as a = (I P1 P2 ) a. Then normalize via
a3 = p aT . Construct the projection matrix for this column, P3 = aT3 a3 .
a a
Repeat this for all n columns of A. Q is constructed by combining the
an such that QT Q = I. R is constructed
columns Q = a1 a2
T
as R = Q A. To see that this is upper triangular let the columns of A be
denoted A1 , A2 , . . . ,An . Then,
2 T
3
a1 A1 aT1 A2
aT1 An
6 aT2 A1 aT2 A2
aT2 An 7
6
7
QT A = 6
7
..
..
..
.
.
4
5
.
.
.
.
T
T
T
an An
an A1 an A2
The terms below the main diagonal are zero since aj for j = 2; : : : ; n are
constructed to be orthogonal to A1 , aj for j = 3; : : : ; n are constructed to
be orthogonal to A2 = a2 + P1 A2 , and so on.
Notice, how straightforward it is to solve Ay = b for y.
Ay
QRy
1 T
R Q QRy
= b
= b
= y=R
1
QT b
A.5.3 Accounting example
Return to the 4 accounts by 6 journal entries A matrix. This matrix clearly
does not have linearly independent columns (or for that matter rows) but
we’ll drop a redundant row (the last row) and denote the resultant matrix
A0 . Now, we’ll …nd the QR decomposition of the 6 3 AT0 , AT0 = QR by
the Gram-Schmidt process.
2
3
1 1
0
6 1 0
1 7
6
7
6
1
0
0 7
7
AT0 = 6
6 1 0
0 7
6
7
4 0
1 0 5
0
0
1
A.5 Gram-Schmidt construction of an orthogonal matrix
2
6
6
16
a1 = 6
26
6
4
2
and
6
6
16
a2 = 6
26
6
4
1
1
1
1
0
0
3
2
7
6
7
6
7
6
7 ; P1 = 1 6
7
46
7
6
5
4
0:567
0:189
0:189
0:189
0:756
0
3
1
1
1
1
0
0
2
7
6
7
6
7
6
7 ; P2 = 1 6
7
46
7
6
5
4
2
so that
6
6
16
a3 = 6
26
6
4
2
6
6
6
Q=6
6
6
4
and
0:5
0:5
0:5
0:5
0
0
1
1
1
1
0
0
1
1
1
1
0
0
1
1
1
1
0
0
0:109
0:546
0:218
0:218
0:109
0:764
0:567
0:189
0:189
0:189
0:756
0
2
2
R = QT A = 4 0
0
0:5
1:323
0
The projection solution to Ay = x or A0 y
2
3
2
x0 = 4 1 5 is yrow = Q RT
1
1
1
1
1
0
0
2
6
6
6
1
x0 = 61 6
6
6
4
1 0
1 0
1 0
1 0
0 0
0 0
0
0
0
0
0
0
27
3
7
7
7
7;
7
7
5
3
1 0 0
1 0 0 7
7
1 0 0 7
7;
1 0 0 7
7
0 0 0 5
0 0 0
1
1
1
1
0
0
3
7
7
7
7
7
7
5
0:109
0:546
0:218
0:218
0:109
0:764
3
7
7
7
7
7
7
5
3
0:5
0:189 5
1:309
2
3
2
6 1 7
7
= x0 where x = 6
4 1 5 and
2
3
1
5 7
7
4 7
7.
4 7
7
5 5
1
28
Appendix A. Linear algebra basics
A.5.4 The Householder QR algorithm
The Householder algorithm is not as intuitive as the Gram-Schmidt algorithm but is computationally more stable. Let a denote the …rst column
of A and
except the …rst element is one. De…ne
p z be a vector of zeros
vv T
T
v = a + a az and H1 = I 2 vT v . Then, H1 A puts the …rst column of A
in upper triangular form. Now, repeat the process where a is now de…ned
to be the second column of H1 A whose …rst element is set to zero and z
is de…ned to be a vector of zeros except the second element is one. Utilize
these components to create v in the same form as before and to construct
H2 in the same form as H1 . Then, H2 H1 A puts the …rst two columns of A
in upper triangular form. Next, we work with the third column of H2 H1 A
where the …rst two elements of a are set to zero and repeat for all n columns.
When complete, R is constructed from the …rst n rows of Hn
H2 H1 A.
and QT is constructed from the …rst n rows of Hn
H2 H 1 .
A.5.5 Accounting example
Again, return to the 4 accounts by 6 journal entries A matrix and work
with A0 . Now, we’ll …nd the QR decomposition of the 6 3 AT0 , AT0 = QR
by Householder transformation.
2
6
6
6
AT0 = 6
6
6
4
1
1
1
1
0
0
2
6
6
6
In the construction of H1 , a = 6
6
6
4
2
6
6
6
H1 = 12 6
6
6
4
1
1
1
1
0
0
1
1
1
1
0
0
1
1
1
1
0
0
1 0 0
1 0 0
1 0 0
1 0 0
0 2 0
0 0 2
1
0
0
0
1
0
1
1
1
1
0
0
3
3
0
1
0
0
0
1
3
7
7
7
7
7
7
5
2
1
0
0
0
0
0
3
2
1
2
3
7
6 1 7
7
7
6
7
7
6
7, v = 6 1 7,
7
6 1 7
7
7
6
5
4 0 5
0
2
4 1
7
6 0
1
6
7
7
6 0
1
1
T
7. Then, H1 A0 = 6
2 6 0
7
1
7
6
5
4 0
2
0
0
7
6
7
6
7
6
7, z = 6
7
6
7
6
5
4
and
1
1
1
1
0
2
3
7
7
7
7.
7
7
5
A.5 Gram-Schmidt construction of an orthogonal matrix
2
6
6
6
1 6
For the construction of H2 , a = 2 6
6
4
2
6
6
6
and H2 = 6
6
6
4
2
6
6
6
6
6
6
4
2
0
0
0
0
0
0
0:378
0:378
0:378
0:756
0
0:5
0:189
0:585
0:585
0:171
1
0:5
1:323
0
0
0
0
6
6
6
and H3 = 6
6
6
4
2
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0:5
1:323
0
0
0
0
2
0:5
6 0:5
6
6 0:5
and Q = 6
6 0:5
6
4 0
0
6
6
6
6
6
6
4
0
0:378
0:896
0:104
0:207
0
3
7
7
7
7.
7
7
5
0
0:378
0:104
0:896
0:207
0
2
6
6
6
For the construction of H3 , a = 6
6
6
4
2
2
1
0
0
0
0
0
0
0
0:447
0:447
0:130
0:764
3
0:5
0:189 7
7
1:309 7
7.
7
0
7
5
0
0
0:567
0:189
0:189
0:189
0:756
0
0
1
1
1
2
0
0
0
0:447
0:862
0:040
0:236
2
3
6
7
6
7
6
7
7, z = 6
6
7
6
7
4
5
0
0:756
0:207
0:207
0:585
0
3
2
0
0
7
6 0
0
7
6
6
0:585 7
7, z = 6 1
7
6 0
0:585 7
6
4 0
0:171 5
0
1
0
0
0
0
0:130 0:764
0:040
0:236
0:988
0:069
0:069
0:597
2
2
This leads to R = 4 0
0
0:109
0:546
0:218
0:218
0:109
0:764
0
1
0
0
0
0
0
0
0
0
0
1
3
7
7
7
7.
7
7
5
3
2
6
7
6
7
6
7
7, v = 6
6
7
6
7
4
5
3
0
1:823
0:5
0:5
1
0
29
3
7
7
7
7,
7
7
5
7
7
7
7. Then, H2 H1 AT0 =
7
7
5
3
2
7
6
7
6
7
6
7, v = 6
7
6
7
6
5
4
3
0
0
1:894
0:585
0:171
1
3
7
7
7
7,
7
7
5
7
7
7
7. Then, H3 H2 H1 AT0 =
7
7
5
0:5
1:323
0
3
0:5
0:189 5
1:309
30
Appendix A. Linear algebra basics
2 Finally, the projection solution
3
0:5
0:567 0:109
02
6 0:5
0:189
0:546 7
6
7
6 0:5
0:189
0:218 7
6
7B
4
6 0:5
7@
0:189
0:218
6
7
4 0
0:756
0:109 5
0
0
0:764
2
3
1
6 5 7
6
7
6
7
1 6 4 7
6 6
7.
4
6
7
4 5 5
1
to A0 y = x0 is yrow = Q RT
2
0
0
3T 1
0:5
C
0:189 5 A
1:309
0:5
1:323
0
1
1
x0 =
2
3
2
4 1 5=
1
A.6 Some determinant identities
A.6.1 Determinant of a square matrix
We utilize the fact that
det (A)
= det (LU )
= det (L) det (U )
and the determinant of a triangular matrix is the product of the diagonal
elements. Since L has ones along its diagonal, det (A) = det (U ). Return to
the example above
2
6
I4 = det 6
4
det AAT
4
1
1
1
2
1
0
2
0
1
2
1
1
2
1
4
3
7
7
5
Factor AAT
I4 into its upper and lower triangular components via
Gaussian elimination (this step can be computationally intensive).
2
6
6
L=6
6
4
1
0
0
1
4+
1
4+
2
4+
1
0
1
7 6 + 2
6+
7 6 + 2
0
1
6+
6 6 +
2
3
7
0 7
7
0 7
5
1
A.6 Some determinant identities
31
and
2
4
6
6
6
U =6
6
4
1
1
7 6 +
4
0
0
0
0
0
2
3
2
1
4+
12 18 +8
7 6 +
2
6
4+
12 8 +
7 6 +
3
2
2
2
(24
10 +
6 6 + 2
0
7
7
7
7
7
5
2
)
The determinant of A equals the determinant of U which is the product of
the diagonal elements.
det AAT
I4
=
det (U )
=
(4
7
)
2
6 +
4
24
10 +
6
12
2
2
6 +
18 + 8
7 6 +
!
2
3
2
which simpli…es as
det AAT
I4 =
48 + 44
2
12
3
4
+
Of course, the roots of this equation are the eigenvalues of A.
A.6.2 Identities
Below the notation jAj refers to the determinant of matrix A.
Am m Bm
Cn m Dn
and D 1 exist.
Theorem 1
where A
1
n
= jAj D
n
CA
1
B = jDj A
BD
1
C
Proof.
A
C
B
D
=
A
C
0
I
I
0
A 1B
D CA 1 B
=
I
0
B
D
A
BD 1 C
D 1C
0
I
Since the determinant of a block triangular matrix is the product of the
determinants of the diagonal blocks and the determinant of the product of
matrices is the product of their determinants,
Am
Cn
m
m
Bm
Dn
n
n
= jAj jIj jIj D
= jAj D
CA
CA
1
1
B = jDj jIj A
B = jDj A
BD
1
BD
C
1
C jIj
32
Appendix A. Linear algebra basics
Theorem 2 For A and B m
n matrices,
In + AT B = Im + BAT = In + B T A = Im + AB T
Proof. Since the determinant of the transpose of a matrix equals the determinant of the matrix,
In + AT B =
Im
AT
From theorem 1,
In + AT B
B
In
I + AT B = I + BAT =
T
= In + B T A
= jIj I + AT IB = jIj I + BIAT . Hence,
I + BAT
T
= I + AB T
Theorem 3 For vectors x and y, I + xy T = 1 + y T x.
Proof. From theorem 2, I + xy T = I + y T x = 1 + y T x.
Theorem 4 An
Proof.
A
yT
A
yT
x
1
0
1
I
0
n
+ xy T = jAj 1 + y T A
=
A
yT
0
1
A 1x
1 + yT A 1 x
I
0
1
x where A
A 1x
1 + yT A 1 x
= jAj 1 + y T A
=
=
I
0
x
1
1 A + xy T
=
1
1
exists.
I
0
A + x1y T
1y T
x
1
x
A + x1y T
1y T
0
1
0
1
.
This is page 33
Printer: Opaque this
Appendix B
Iterated expectations
Along with Bayes’ theorem (the glue holding consistent probability assessment together), iterated expectations is extensively employed for connecting conditional expectation (regression) results with causal e¤ects of
interest.
Theorem 5 Law of iterated expectations
E [Y ] = EX [E [Y j X]]
Proof.
EX [E [Y j X]] =
=
Z
x
E [Y j X] f (x) dx
x
Z
x
"Z
yf (y j x) dy f (x) dx
y
x
#
y
By Fubini’s theorem, we can change the order of integration
#
#
Z x "Z y
Z y "Z x
yf (y j x) dy f (x) dx =
y
f (y j x) f (x) dx dy
x
y
y
x
The product rule of Bayes’theorem, f (y j x) f (x) = f (y; x), implies iterated expectations can be rewritten as
"Z
#
Z
y
EX [E [Y j X]] =
x
y
y
f (y; x) dx dy
x
34
Appendix B. Iterated expectations
Finally, the summation rule integrates out X,
produces the result.
EX [E [Y j X]] =
Z
Rx
x
f (y; x) dx = f (y), and
y
yf (y) dy = E [Y ]
y
B.1 Decomposition of variance
Corollary 1 Decomposition of variance.
V ar [Y ] = EX [V ar [Y j X]] + V arX [E [Y j X]]
Proof.
V ar [Y ]
= E Y2
= EX E Y 2 j X
V ar [Y ]
V ar [Y ]
2
E [Y ]
2
EX [E [Y j X]]
2
= EX E Y 2 j X
E [Y ]
h
i
2
2
= EX V ar [Y j X] + E [Y j X]
EX [E [Y j X]]
h
i
2
2
= EX [V ar [Y j X]] + EX E [Y j X]
EX [E [Y j X]]
= EX [V ar [Y j X]] + V arX [E [Y j X]]
The second line draws from iterated expectations while the fourth line is
the decomposition of the second moment.
In analysis of variance language, the …rst term is the residual variation
(or variation unexplained)and the second term is the regression variation
(or variation explained).
This is page 35
Printer: Opaque this
Appendix C
Multivariate normal theory
The Gaussian or normal probability distribution is ubiquitous when the
data have continuous support as seen in the Central Limit theorems and
the resilience to transformation of Gaussian random variables. When confronted with a vector of variables (multivariate), it often is sensible to think
of a joint normal probability assignment to describe their stochastic propX
erties. Let W =
be an m-element vector (X has m1 elements and
Z
Z has m2 variables such that m1 + m2 = m) with joint normal probability,
then the density function is
fW (w) =
1
m=2
(2 )
1=2
j j
1
(w
2
exp
T
)
1
(w
)
where
X
=
Z
is the m-element vector of means for W and
=
XX
XZ
ZX
ZZ
is the m m variance-covariance matrix for W with m linearly independent
rows and columns.
R R
Of course, the density integrates to unity, X Z fW (w) dzdx = 1. And,
the marginal densities
R are found by integratingR out the other variables, for
example, fX (x) = Z fW (w) dz and fZ (z) = X fW (w) dx.
36
Appendix C. Multivariate normal theory
Importantly, as it uni…es linear regression, the conditional distributions
are also Gaussian.
fW (w)
fX (x j Z = z) =
fZ (z)
N (E [X j Z = z] ; V ar [X j Z])
where
E [X j Z = z] =
and
V ar [X j Z] =
Also,
where
fZ (z j X = x)
1
ZZ
XZ
XX
XZ
(z
1
ZZ
Z)
ZX
N (E [Z j X = x] ; V ar [Z j X])
E [Z j X = x] =
and
+
X
Z
V ar [Z j X] =
+
ZX
ZZ
1
XX
ZX
(x
1
XX
X)
XZ
From the conditional expectation (or regression) function we see when the
data are Gaussian, linearity imposes no restriction.
E [Z j X = x] =
Z
+
ZX
1
XX
(x
X)
is often written
E [Z j X = x]
=
I
=
1
XX
ZX
T
+
+
Z
1
XX x
ZX
x
where
corresponds to an intercept (or vector of intercepts) and T x
corresponds to weighted regressors. Applied linear regression estimates the
sample analogs to the above parameters, and .
C.1 Conditional distribution
Next, we develop more carefully the result for fX (x j Z = z); the result for
fZ (z j X = x) follows in analogous fashion.
fX (x j Z)
=
fW (w)
fZ (z)
1
=
=
(2 )
m=2
1
(2 )m2 =2 j
m1 =2
(2 )
exp
j j
1=2
exp
h
h
1
2
(w
T
1
)
T
1
ZZ
(z
jV ar [X j Z]j
1
T
(x E [X j Z]) V ar [X j Z]
2
1
ZZ j
1=2
1
exp
1
2
(z
Z)
i
)
(w
i
)
z
1=2
(x
E [X j Z])
C.1 Conditional distribution
37
The normalizing constants are identi…ed almost immediately since
m=2
(m1 +m2 )=2
(2 )
=
m2 =2
(2 )
(2 )
m1 =2
= (2 )
m2 =2
(2 )
for the leading term and by theorem 1 in section A.6.2 we have
j j=j
ZZ j
XX
1
ZZ
XZ
ZX
since XX , ZZ , and are positive de…nite, their determinants are positive
and their square roots are real. Hence,
1
j
j j2
ZZ j
1
2
j
=
ZZ j
1
2
XX
j
=
XX
XZ
XZ
ZZ j
1
ZZ
1
ZZ
ZX
1
2
1
2
ZX
1
2
1
= jV ar [X j Z]j 2
This leaves the exponential terms
h
T
1
1
)
(w
exp
2 (w
h
T
1
1
exp
Z)
ZZ (z
2 (z
=
exp
1
(w
2
T
)
1
(w
i
)
i
)
z
)+
1
(z
2
T
Z)
1
ZZ
(z
z)
which require a bit more foundation. We begin with a lemma for the inverse
of a partitioned matrix.
Lemma 1 Let a symmetric, positive de…nite matrix H be partitioned as
A BT
where A and C are square n1 n1 and n2 n2 positive de…nite
B C
matrices (their inverses exist). Then,
"
#
1
1
A BT C 1B
A 1 B T C BA 1 B T
1
H
=
1
1
C 1B A BT C 1B
C BA 1 B T
#
"
1
1 T
A BT C 1B
A BT C 1B
B C 1
=
1
1
C 1B A BT C 1B
C BA 1 B T
"
#
1
1
A BT C 1B
A 1 B T C BA 1 B T
=
1
1
C BA 1 B T
BA 1
C BA 1 B T
Proof. H is symmetric and the inverse of a symmetric matrix is also symmetric. Hence, the second and third lines follow from symmetry and the
38
Appendix C. Multivariate normal theory
…rst line. Since H is symmetric, positive de…nite,
= LDLT
I
=
BA
H
A
0 C
0
I
1
0
BA
1
B
I
0
T
1
A
BT
I
and
H
1
=
LT
=
I
0
1
H
=
1
A
1
L
A 1
0
BT
I
Expanding gives
"
1
1
D
X
BA 1 B T
C
0
BA 1 B T
C
1
A
1
1
BA
I
BA
1
0
I
1
1
B T C BA 1 B T
1
C BA 1 B T
#
where
X=A
1
1
+A
BT C
1
BA
1
BT
1
BA
BT C
= A
1
1
B
The latter equality follows from some linear algebra. Suppose it’s true
BT C
A
1
B
1
1
=A
1
+A
BT C
1
BA
1
BT
BA
1
pre- and post-multiply both sides by A
BT C
A A
1
1
B
A = A + BT C
1
post multiply both sides by A
A =
+B
0
=
BT C
A
T
C
BT C
1
1
1
BT C
1
1
BT
B
B
B
1
BA
1
BT C
A
BA
1
BT
B + BT C
1
BA
1
BA
A
1
BT
BA
1
B
BT C
A
1
B
Expanding the right hand side gives
0
=
BT C
B
T
1
B + BT C
C
1
BA
B
BA
1
T
BA
1
1
BT
1
B
T
1
B C
B
Collecting terms gives
BT C
1
Rewrite I as CC
1
0=
0=
BT C
1
B + BT C
BA
1
1
BT
I
BA
1
BT C
1
B
and substitute
B + BT C
BA
1
BT
1
CC
1
BA
1
BT C
1
B
C.1 Conditional distribution
39
Factor
0
=
BT C
1
B + BT C
0
=
BT C
1
B + BT C
1
BA
1
1
BT
C
1
BA
BT C
1
B
B=0
This completes the lemma.
Now, we write out the exponential terms and utilize the lemma to simplify.
exp
=
=
=
exp
exp
8
>
>
>
<
>
>
>
:
8
>
>
>
>
>
<
1
(w
2
T
1
)
(w
h
1
2
1
2
6
6
6
4
>
>
>
>
>
:
8
2
>
>
>
< 16
6
exp
6
>
2
4
>
>
:
1
(z
2
T
X)
(x
T
Z)
(z
1
XX Z
1
ZZ
T
Z)
1
XZ ZZ
(z
i
z)
1
x
XX Z
1
1
z
ZX XX Z
ZZ X
T
1
)
(z
)
+ 21 (z
Z
z
ZZ
3 9
T
1
>
(x
(x
)
X)
X
>
XX Z
7 >
T
>
1
1
>
(x
)
(z
)
7
=
X
Z
XX Z XZ ZZ
7
T
1
1
5
(z
)
(x
)
Z
X
ZZ ZX XX Z
>
T
>
1
>
+ (z
(z
)
>
Z)
Z
ZZ X
>
;
T
1
1
+ 2 (z
)
(z
)
Z
z
ZZ
9
3
T
1
>
(x
>
X)
X)
XX Z (x
=
7>
T
1
1
(x
)
(z
)
7
XZ
X
Z
XX Z
ZZ
7
T
1
1
(z
>
Z)
X ) 5>
ZZ ZX XX Z (x
>
;
T
1
1
+ (z
Z)
Z)
ZZ X
ZZ (z
1
ZZ
2
)+
where XX Z = XX
XZ
From the last term, write
1
ZZ X
1
ZZ
1
ZZ
ZX
and
ZZ X
=
1
ZZ
ZX
1
XX Z
=
ZZ
XZ
X
Z
1
XX
ZX
9
>
>
>
=
>
>
>
;
XZ .
1
ZZ
To see this utilize the lemma for the inverse of a partitioned matrix. By
symmetry
1
1
1
1
ZZ ZX XX Z = ZZ X ZX XX
Post multiply both sides by
1
ZZ
ZX
1
XX Z
XZ
XZ
1
ZZ
1
ZZ
=
=
=
and simplify
1
ZZ X
1
ZZ X
1
ZZ X
ZX
(
1
XX
ZZ
XZ
ZZ X )
1
ZZ
1
ZZ
1
ZZ
Now, subtitute this into the exponential component
8
2
T
1
>
(x
>
X)
X)
XX Z (x
>
< 16
T
1
1
(x
)
(z
6
XZ
X
Z)
XX Z
ZZ
exp
6
T
1
1
>
2
4
(z
)
(x
>
Z
X)
ZZ ZX XX Z
>
:
T
1
1
1
+ (z
Z)
ZZ ZX XX Z XZ ZZ (z
Z)
39
>
>
7>
=
7
7
5>
>
>
;
40
Appendix C. Multivariate normal theory
Combining the …rst and second terms and combine the third and fourth
terms gives
(
"
#)
T
1
1
1
(x
x
(z
)
XZ
X)
X
Z
XX Z
ZZ
exp
T
1
1
1
2
(z
XZ ZZ (z
Z)
X
Z)
ZZ ZX XX Z x
Then, since
(x
T
X)
(z
T
Z)
1
ZZ
= x
ZX
X
XZ
1
ZZ
(z
Z)
T
combining these two terms simpli…es as
exp
1
(x
2
T
1
XX Z
E [x j Z = z])
(x
E [x j Z = z])
1
where E [x j Z = z] = X
XZ ZZ (z
Z ). Therefore, the result matches
the claim for fX (x j Z = z), the conditional distribution of X given Z = z
is normally distributed with mean E [x j Z = z] and variance V ar [x j Z].
C.2 Special case of precision
Now, we consider a special case of Bayesian normal updating expressed
in terms of precision of variables (inverse variance) along with variance
representation above. Suppose a variable of interest x is observed with
error
Y =x+"
where
x
N
2
x
x;
=
x
and
"
N
0;
2
"
1
=
1
"
2
j
" independent of x,
refers to variance, and j refers to precision of
variable j. This implies
2
h
i
3
2
E
(Y
)
E
[(Y
)
(x
)]
x
x
x
Y
h
i
5
V ar
= 4
2
x
E [(x
) (Y
)]
E (x
)
x
=
2
x
+
2
x
2
"
x
2
x
2
x
x
:
Then, the posterior or updated distribution for x given Y = y is normal.
(x j Y = y)
N (E [x j Y = y] ; V ar [x j Y ])
C.2 Special case of precision
41
where
E [x j Y = y]
2
x
=
x
=
2
2
" x + xy
2 + 2
x
"
=
+
2
x
2
"
+
+
x+
(y
x)
"y
x x
"
and
V ar [x j Y ]
=
=
=
=
2 2
x
2
x
2
x
2
x
2
x
2
"
+
+
2
"
2
x
+
2 2
x
2
"
2 2
x "
2
x
+
1
x+
2
"
"
For both the conditional expectation and variance, the penultimate line
expresses the quantity in terms of variance and the last line expresses the
same quantity in terms of precision. The precision of x given Y is xjY =
x + ".
42
Appendix C. Multivariate normal theory
This is page 43
Printer: Opaque this
Appendix D
Generalized least squares (GLS)
For the DGP,
Y =X +"
where " N (0; ) and is some general n n variance-covariance matrix,
then the linear least squares estimator or generalized least squares (GLS )
estimator is
bGLS = X T
1
X
1
XT
1
Y
with
V ar [bGLS j X] = X T
1
1
X
If $\Sigma$ is known, this can be computed by ordinary least squares (OLS) following transformation of the variables $Y$ and $X$ via $\Gamma^{-1}$ where $\Gamma$ is a triangular matrix such that $\Sigma=\Gamma\Gamma^{T}$, say via Cholesky decomposition. Then, the transformed DGP is
\[
\Gamma^{-1}Y=\Gamma^{-1}X\beta+\Gamma^{-1}\varepsilon
\]
or
\[
y=x\beta+\nu
\]
where $y=\Gamma^{-1}Y$, $x=\Gamma^{-1}X$, and $\nu=\Gamma^{-1}\varepsilon\sim N\left(0,I\right)$. To see where the identity variance matrix comes from, consider
\begin{align*}
Var\left[\Gamma^{-1}\varepsilon\right] & =\Gamma^{-1}Var\left[\varepsilon\right]\left(\Gamma^{-1}\right)^{T}\\
 & =\Gamma^{-1}\Sigma\left(\Gamma^{T}\right)^{-1}\\
 & =\Gamma^{-1}\Gamma\Gamma^{T}\left(\Gamma^{T}\right)^{-1}\\
 & =I
\end{align*}
Hence, estimation involves projection of $y$ onto $x$
\[
E\left[y\mid x\right]=xb
\]
where
\[
b=\left(x^{T}x\right)^{-1}x^{T}y=\left(X^{T}\left(\Gamma^{-1}\right)^{T}\Gamma^{-1}X\right)^{-1}X^{T}\left(\Gamma^{-1}\right)^{T}\Gamma^{-1}Y
\]
Since $\left(\Gamma^{-1}\right)^{T}\Gamma^{-1}=\left(\Gamma\Gamma^{T}\right)^{-1}=\Sigma^{-1}$, we can rewrite
\[
b=\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}Y
\]
which is the GLS estimator for $\beta$. Further, $Var\left[b\mid X\right]=E\left[\left(b-\beta\right)\left(b-\beta\right)^{T}\mid X\right]$. Since
\begin{align*}
b & =\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}Y\\
 & =\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}\left(X\beta+\varepsilon\right)\\
 & =\beta+\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}\varepsilon
\end{align*}
we have $b-\beta=\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}\varepsilon$ and
\begin{align*}
Var\left[b\mid X\right] & =E\left[\left(b-\beta\right)\left(b-\beta\right)^{T}\mid X\right]\\
 & =E\left[\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}\varepsilon\varepsilon^{T}\Sigma^{-1}X\left(X^{T}\Sigma^{-1}X\right)^{-1}\mid X\right]\\
 & =\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}E\left[\varepsilon\varepsilon^{T}\mid X\right]\Sigma^{-1}X\left(X^{T}\Sigma^{-1}X\right)^{-1}\\
 & =\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}\Sigma\Sigma^{-1}X\left(X^{T}\Sigma^{-1}X\right)^{-1}\\
 & =\left(X^{T}\Sigma^{-1}X\right)^{-1}
\end{align*}
as indicated above.
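The following sketch (illustrative simulated data, not from the text) checks numerically that GLS computed directly agrees with OLS on the Cholesky-transformed variables:

```python
# GLS two ways: the direct formula and OLS after a Cholesky transformation
# (simulated data for illustration only).
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])

# heteroskedastic errors give a non-scalar covariance matrix Sigma
Sigma = np.diag(rng.uniform(0.5, 3.0, size=n))
eps = rng.multivariate_normal(np.zeros(n), Sigma)
Y = X @ beta + eps

Sinv = np.linalg.inv(Sigma)
b_gls = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ Y)

# transform by Gamma^{-1}, where Sigma = Gamma Gamma^T (Cholesky), then run OLS
Gamma = np.linalg.cholesky(Sigma)
y_t = np.linalg.solve(Gamma, Y)
x_t = np.linalg.solve(Gamma, X)
b_ols_transformed = np.linalg.lstsq(x_t, y_t, rcond=None)[0]

print(np.allclose(b_gls, b_ols_transformed))   # True
```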
Appendix E
Two stage least squares IV (2SLS-IV)
In this appendix we develop two stage least squares instrumental variable
(2SLS-IV ) estimation more generally. Instrumental variables are attractive
strategies whenever the fundamental condition for regression, $E\left[\varepsilon\mid X\right]=0$, is violated but we can identify instruments such that $E\left[\varepsilon\mid Z\right]=0$.
E.1 General case
For the data generating process
\[
Y=X\beta+\varepsilon
\]
the IV estimator projects $X$ onto $Z$ (if $X$ includes an intercept so does $Z$), $\hat{X}=Z\left(Z^{T}Z\right)^{-1}Z^{T}X$, where $X$ is a matrix of regressors and $Z$ is a matrix of instruments. Then, the IV estimator for $\beta$ is
\[
b_{IV}=\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}Y
\]
Provided $E\left[\varepsilon\mid Z\right]=0$ and $Var\left[\varepsilon\right]=\sigma^{2}I$, $b_{IV}$ is unbiased, $E\left[b_{IV}\right]=\beta$, and has variance $Var\left[b_{IV}\mid X,Z\right]=\sigma^{2}\left(\hat{X}^{T}\hat{X}\right)^{-1}$.
Unbiasedness is demonstrated as follows.
\[
b_{IV}=\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}Y=\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}\left(X\beta+\varepsilon\right)
\]
Since
\[
\hat{X}^{T}\hat{X}=X^{T}Z\left(Z^{T}Z\right)^{-1}Z^{T}Z\left(Z^{T}Z\right)^{-1}Z^{T}X=X^{T}Z\left(Z^{T}Z\right)^{-1}Z^{T}X=\hat{X}^{T}X
\]
we have
\[
b_{IV}=\beta+\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}\varepsilon
\]
Then,
\begin{align*}
E\left[b_{IV}\mid X,Z\right] & =E\left[\beta+\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}\varepsilon\mid X,Z\right]\\
 & =\beta+\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}E\left[\varepsilon\mid X,Z\right]\\
 & =\beta+0=\beta
\end{align*}
By iterated expectations,
\[
E\left[b_{IV}\right]=E_{X,Z}\left[E\left[b_{IV}\mid X,Z\right]\right]=\beta
\]
Variance of the IV estimator follows similarly.
\[
Var\left[b_{IV}\mid X,Z\right]=E\left[\left(b_{IV}-\beta\right)\left(b_{IV}-\beta\right)^{T}\mid X,Z\right]
\]
From the above development,
\[
b_{IV}-\beta=\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}\varepsilon
\]
Hence,
\begin{align*}
Var\left[b_{IV}\mid X,Z\right] & =E\left[\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}\varepsilon\varepsilon^{T}\hat{X}\left(\hat{X}^{T}\hat{X}\right)^{-1}\mid X,Z\right]\\
 & =\sigma^{2}\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}\hat{X}\left(\hat{X}^{T}\hat{X}\right)^{-1}\\
 & =\sigma^{2}\left(\hat{X}^{T}\hat{X}\right)^{-1}
\end{align*}
E.2 Special case
In the special case where both X and Z have the same number of columns
the estimator can be further simplified,
\[
b_{IV}=\left(Z^{T}X\right)^{-1}Z^{T}Y
\]
To see this, write out the estimator
\begin{align*}
b_{IV} & =\left(\hat{X}^{T}\hat{X}\right)^{-1}\hat{X}^{T}Y\\
 & =\left[X^{T}Z\left(Z^{T}Z\right)^{-1}Z^{T}X\right]^{-1}X^{T}Z\left(Z^{T}Z\right)^{-1}Z^{T}Y
\end{align*}
Since $X^{T}Z$ and $Z^{T}X$ have linearly independent columns, we can invert each square term
\begin{align*}
b_{IV} & =\left(Z^{T}X\right)^{-1}\left(Z^{T}Z\right)\left(X^{T}Z\right)^{-1}X^{T}Z\left(Z^{T}Z\right)^{-1}Z^{T}Y\\
 & =\left(Z^{T}X\right)^{-1}Z^{T}Y
\end{align*}
Of course, the estimator is unbiased, $E\left[b_{IV}\right]=\beta$, and the variance of the estimator is
\[
Var\left[b_{IV}\mid X,Z\right]=\sigma^{2}\left(\hat{X}^{T}\hat{X}\right)^{-1}
\]
which can be written
\[
Var\left[b_{IV}\mid X,Z\right]=\sigma^{2}\left(Z^{T}X\right)^{-1}Z^{T}Z\left(X^{T}Z\right)^{-1}
\]
in this special case where $X$ and $Z$ have the same number of columns.
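A small simulation sketch (hypothetical data generating process, not from the text) illustrates the two equivalent computations of the just-identified IV estimator:

```python
# 2SLS-IV on simulated data: the projection form and the (Z'X)^{-1} Z'Y form
# agree in the just-identified case (illustrative DGP only).
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # confounder
x = 0.8 * z + u + rng.normal(size=n)        # endogenous regressor
y = 1.0 + 2.0 * x + u + rng.normal(size=n)  # outcome; E[eps | x] != 0

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# first stage projection X_hat = Z (Z'Z)^{-1} Z'X, then b_IV = (X_hat'X_hat)^{-1} X_hat'Y
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_proj = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# just-identified shortcut: b_IV = (Z'X)^{-1} Z'Y
b_simple = np.linalg.solve(Z.T @ X, Z.T @ y)

print(np.allclose(b_proj, b_simple))   # True
print(b_simple)                        # close to [1, 2]; OLS would be biased upward
```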
Appendix F
Seemingly unrelated regression (SUR)
First, we describe the seemingly unrelated regression (SUR) model. Second,
we remind ourselves Bayesian regression works as if we have two samples:
one representative of our priors and another from the new evidence. Then,
we connect to seemingly unrelated regression (SUR) — both classical and
Bayesian strategies are summarized.
We describe the SUR model in terms of a stacked regression as if the
latent variables in a binary selection setting are observable.
\[
r=X\beta+\varepsilon
\]
where
\[
r=\left[\begin{array}{c}
U\\
Y_{1}\\
Y_{0}
\end{array}\right],\quad X=\left[\begin{array}{ccc}
W & 0 & 0\\
0 & X_{1} & 0\\
0 & 0 & X_{2}
\end{array}\right],\quad\varepsilon=\left[\begin{array}{c}
V_{D}\\
V_{1}\\
V_{0}
\end{array}\right],
\]
$\beta$ stacks the coefficient vectors for the three equations, and
\[
\varepsilon\sim N\left(0,V=\Sigma\otimes I_{n}\right)
\]
Let
\[
\Sigma=\left[\begin{array}{ccc}
1 & \sigma_{D1} & \sigma_{D0}\\
\sigma_{D1} & \sigma_{11} & \sigma_{10}\\
\sigma_{D0} & \sigma_{10} & \sigma_{00}
\end{array}\right],
\]
a $3\times3$ matrix; then the Kronecker product, denoted by $\Sigma\otimes I_{n}$, is
\[
\Sigma\otimes I_{n}=\left[\begin{array}{ccc}
I_{n} & \sigma_{D1}I_{n} & \sigma_{D0}I_{n}\\
\sigma_{D1}I_{n} & \sigma_{11}I_{n} & \sigma_{10}I_{n}\\
\sigma_{D0}I_{n} & \sigma_{10}I_{n} & \sigma_{00}I_{n}
\end{array}\right],
\]
a $3n\times3n$ matrix, and $V^{-1}=\Sigma^{-1}\otimes I_{n}$.
F.1 Classical
Classical estimation of SUR follows generalized least squares (GLS).
\[
b=\left[X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)X\right]^{-1}X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)r
\]
and
\[
Var\left[b\right]=\left[X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)X\right]^{-1}
\]
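As a numerical sketch (a simulated two-equation system with hypothetical dimensions; the three-equation treatment-effect layout above works the same way), the SUR/GLS estimator can be computed with Kronecker products as follows:

```python
# SUR as GLS with a Kronecker covariance (simulated two-equation example).
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
beta1, beta2 = np.array([1.0, 0.5]), np.array([-1.0, 2.0])

Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])                 # cross-equation error covariance
E = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
y1 = X1 @ beta1 + E[:, 0]
y2 = X2 @ beta2 + E[:, 1]

# stacked system: r = X beta + eps with Var[eps] = Sigma kron I_n
r = np.concatenate([y1, y2])
X = block_diag(X1, X2)
V_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))

b = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ r)
var_b = np.linalg.inv(X.T @ V_inv @ X)
print(b)                              # close to [1, 0.5, -1, 2]
print(np.sqrt(np.diag(var_b)))        # classical standard errors
```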
F.2 Bayesian
On the other hand, Bayesian analysis employs a Gibbs sampler based on the conditional posteriors as, apparently, the SUR error structure prevents identification of conjugate priors. Recall, from the discussion of Bayesian linear regression with general error structure, the conditional posterior for $\beta$ is $p\left(\beta\mid\Sigma,y,\beta_{0},\Sigma_{0}\right)\sim N\left(\overline{\beta},\overline{V}_{\beta}\right)$ where
\[
\overline{\beta}=\left(X_{0}^{T}\Sigma_{0}^{-1}X_{0}+X^{T}\Sigma^{-1}X\right)^{-1}\left(X_{0}^{T}\Sigma_{0}^{-1}X_{0}\beta_{0}+X^{T}\Sigma^{-1}Xb\right),
\]
\[
b=\left(X^{T}\Sigma^{-1}X\right)^{-1}X^{T}\Sigma^{-1}y,
\]
and
\[
\overline{V}_{\beta}=\left(X_{0}^{T}\Sigma_{0}^{-1}X_{0}+X^{T}\Sigma^{-1}X\right)^{-1}
\]
F.3 Bayesian treatment effect application
For the application of SUR to the treatment effect setting, we replace $y$ with $r=\left[\begin{array}{ccc}U^{T} & y_{1}^{T} & y_{0}^{T}\end{array}\right]^{T}$ and $\Sigma$ with $\Sigma\otimes I_{n}$ yielding the conditional posterior for $\beta$, $p\left(\beta\mid\Sigma,r,\beta_{0},\Sigma_{0}\right)\sim N\left(\overline{\beta}_{SUR},\overline{V}_{SUR}\right)$ where
\[
\overline{\beta}_{SUR}=\left[X_{0}^{T}\Sigma_{0}^{-1}X_{0}+X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)X\right]^{-1}\left[X_{0}^{T}\Sigma_{0}^{-1}X_{0}\beta_{0}+X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)Xb_{SUR}\right],
\]
\[
b_{SUR}=\left[X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)X\right]^{-1}X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)r,
\]
and
\[
\overline{V}_{SUR}=\left[X_{0}^{T}\Sigma_{0}^{-1}X_{0}+X^{T}\left(\Sigma^{-1}\otimes I_{n}\right)X\right]^{-1}
\]
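For intuition, a minimal sketch of one Gibbs step for beta given Sigma (hypothetical prior quantities standing in for X0, Sigma0, beta0; simulated data as in the classical sketch above) follows the formulas directly:

```python
# One Gibbs-sampler draw of beta from its conditional posterior in a SUR system
# (hypothetical prior inputs; simulated data for illustration only).
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(3)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X = block_diag(X1, X2)
Sigma = np.array([[1.0, 0.3], [0.3, 1.5]])
E = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
r = np.concatenate([X1 @ np.array([1.0, 0.5]) + E[:, 0],
                    X2 @ np.array([-1.0, 2.0]) + E[:, 1]])

# prior expressed as a pseudo-sample: beta0 with precision X0' Sigma0^{-1} X0
k = X.shape[1]
beta0 = np.zeros(k)
prior_precision = 0.01 * np.eye(k)            # stands in for X0' Sigma0^{-1} X0

V_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))
data_precision = X.T @ V_inv @ X
b_sur = np.linalg.solve(data_precision, X.T @ V_inv @ r)

post_var = np.linalg.inv(prior_precision + data_precision)
post_mean = post_var @ (prior_precision @ beta0 + data_precision @ b_sur)

beta_draw = rng.multivariate_normal(post_mean, post_var)   # one Gibbs draw
print(post_mean, beta_draw)
```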
Appendix G
Maximum likelihood estimation of
discrete choice models
The most common method for estimating the parameters of discrete choice models is maximum likelihood. The likelihood is defined as the joint density for the parameters of interest $\beta$ conditional on the data $X_{t}$. For binary choice models and $D_{t}=1$ the contribution to the likelihood is $F\left(X_{t}\beta\right)$, and for $D_{t}=0$ the contribution to the likelihood is $1-F\left(X_{t}\beta\right)$, where these are combined as binomial draws and $F\left(X_{t}\beta\right)$ is the cumulative distribution function evaluated at $X_{t}\beta$. Hence, the likelihood is
\[
L\left(\beta\mid X\right)=\prod_{t=1}^{n}F\left(X_{t}\beta\right)^{D_{t}}\left[1-F\left(X_{t}\beta\right)\right]^{1-D_{t}}
\]
The log-likelihood is
\[
\ell\left(\beta\mid X\right)\equiv\log L\left(\beta\mid X\right)=\sum_{t=1}^{n}D_{t}\log\left(F\left(X_{t}\beta\right)\right)+\left(1-D_{t}\right)\log\left(1-F\left(X_{t}\beta\right)\right)
\]
Since this function for binary response models like probit and logit is globally concave, numerical maximization is straightforward. The first order conditions for a maximum, $\max_{\beta}\ell\left(\beta\mid X\right)$, are
\[
\sum_{t=1}^{n}\left[\frac{D_{t}f\left(X_{t}\beta\right)X_{ti}}{F\left(X_{t}\beta\right)}-\frac{\left(1-D_{t}\right)f\left(X_{t}\beta\right)X_{ti}}{1-F\left(X_{t}\beta\right)}\right]=0\qquad i=1,\ldots,k
\]
where $f\left(\cdot\right)$ is the density function. Simplifying yields
\[
\sum_{t=1}^{n}\frac{\left[D_{t}-F\left(X_{t}\beta\right)\right]f\left(X_{t}\beta\right)X_{ti}}{F\left(X_{t}\beta\right)\left[1-F\left(X_{t}\beta\right)\right]}=0\qquad i=1,\ldots,k
\]
Estimates of $\beta$ are found by solving these first order conditions iteratively or, in other words, numerically.
A common estimator for the variance of $\hat{\beta}_{MLE}$ is the negative inverse of the Hessian matrix evaluated at $\hat{\beta}_{MLE}$, $-\left[H\left(D,\hat{\beta}\right)\right]^{-1}$. Let $H\left(D,\beta\right)$ be the Hessian matrix for the log-likelihood with typical element $H_{ij}\left(D,\beta\right)=\frac{\partial^{2}\ell\left(D,\beta\right)}{\partial\beta_{i}\partial\beta_{j}}$.¹
1 Details can be found in numerous econometrics references and chapter 4 of Accounting and Causal Effects: Econometric Challenges.
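A short sketch (simulated logit data with a hypothetical sample size) of numerical maximization of this log-likelihood, with the Hessian-based variance estimate, might look as follows:

```python
# Maximum likelihood for a binary logit by direct numerical optimization
# (simulated data for illustration only).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
p = 1 / (1 + np.exp(-X @ beta_true))          # logistic cdf F(X beta)
D = rng.binomial(1, p)

def neg_loglik(beta):
    xb = X @ beta
    # log L = sum D log F + (1 - D) log(1 - F), written in a stable form
    return np.sum(np.log1p(np.exp(xb)) - D * xb)

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
beta_hat = res.x

# variance estimate: negative inverse Hessian of the log-likelihood,
# which for the logit equals the inverse of sum_t F(1-F) x_t x_t'
F = 1 / (1 + np.exp(-X @ beta_hat))
H = -(X * (F * (1 - F))[:, None]).T @ X
var_hat = np.linalg.inv(-H)
print(beta_hat, np.sqrt(np.diag(var_hat)))
```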
Appendix H
Optimization
In this appendix, we briefly review optimization. First, we'll take up linear programming, then we'll review nonlinear programming.¹
H.1 Linear programming
A linear program (LP) is any optimization framework which can be described by a linear objective function and linear constraints. Linear refers to choice variables, say $x$, of no greater than first degree (affine transformations which allow for parallel lines are included). Prototypical examples are
\[
\max_{x\geq0}\;c^{T}x\quad s.t.\;Ax\leq r
\]
or
\[
\min_{x\geq0}\;c^{T}x\quad s.t.\;Ax\geq r
\]
1 For additional details, consult, for instance, Luenberger and Ye, 2010, Linear and
Nonlinear Programming, Springer, or Luenberger, 1997 Optimization by Vector Space
Methods, Wiley.
H.1.1 basic solutions or extreme points
Basic solutions are typically determined from the standard form for an
LP. Standard form involves equality constraints except non-negative choice
variables, $x\geq0$. That is, $Ax\leq r$ is rewritten in terms of slack variables, $s$, such that $Ax+s=r$. The solution to this program is the same as the solution to the inequality program.
A basic solution or extreme point is determined from an $m\times m$ submatrix of $A$ composed of $m$ linearly independent columns of $A$. The set of basic feasible solutions then is the collection of all basic solutions involving $x\geq0$.
Consider an example. Suppose $A=\left[\begin{array}{cc}2 & 1\\1 & 2\end{array}\right]$, $r=\left[\begin{array}{c}r_{1}\\r_{2}\end{array}\right]$, $x=\left[\begin{array}{c}x_{1}\\x_{2}\end{array}\right]$, and $s=\left[\begin{array}{c}s_{1}\\s_{2}\end{array}\right]$. Then, $Ax+s=r$ can be written $Bx_{s}=r$ where $B=\left[\begin{array}{cc}A & I_{2}\end{array}\right]$, $I_{2}$ is a $2\times2$ identity matrix, and $x_{s}=\left[\begin{array}{c}x\\s\end{array}\right]$. The matrix $B$ has rank two, so each basic solution works with two linearly independent columns of $B$, say $B_{ij}$, and the elements of $x_{s}$ other than $i$ and $j$ are set to zero. For instance, $B_{12}$ leads to basic solution $x_{1}=\frac{2r_{1}-r_{2}}{3}$ and $x_{2}=\frac{2r_{2}-r_{1}}{3}$. The basic solutions are tabulated below.
B_ij   x1              x2              s1              s2
B12    (2r1 - r2)/3    (2r2 - r1)/3    0               0
B13    r2              0               r1 - 2r2        0
B14    r1/2            0               0               (2r2 - r1)/2
B23    0               r2/2            (2r1 - r2)/2    0
B24    0               r1              0               r2 - 2r1
To test feasibility consider specific values for $r$. Suppose $r_{1}=r_{2}=10$. The table with a feasibility indicator, $1\left(x_{s}\geq0\right)$, becomes
B_ij   x1      x2      s1      s2      feasible
B12    10/3    10/3    0       0       yes
B13    10      0       -10     0       no
B14    5       0       0       5       yes
B23    0       5       5       0       yes
B24    0       10      0       -10     no
Notice, when $x_{2}=0$ there is slack in the second constraint ($s_{2}>0$) and similarly when $x_{1}=0$ there is slack in the first constraint ($s_{1}>0$). Basic feasible solutions, an algebraic concept, are also referred to by their geometric counterpart, extreme points.
Identification of basic feasible solutions or extreme points combined with the fundamental theorem of linear programming substantially reduces the search for an optimal solution.
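A brief sketch (using the example's A and r = (10, 10)) enumerates all basic solutions of the standard-form system and flags the feasible ones:

```python
# Enumerate basic solutions of Ax + s = r for the 2x2 example and flag feasibility.
import itertools
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
r = np.array([10.0, 10.0])
B = np.hstack([A, np.eye(2)])           # standard form matrix [A I]

for i, j in itertools.combinations(range(4), 2):
    Bij = B[:, [i, j]]
    if abs(np.linalg.det(Bij)) < 1e-12:
        continue                         # columns not linearly independent
    xs = np.zeros(4)
    xs[[i, j]] = np.linalg.solve(Bij, r)
    feasible = bool(np.all(xs >= -1e-12))
    print(f"B{i+1}{j+1}: x={xs[:2]}, s={xs[2:]}, feasible={feasible}")
```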
H.1.2 fundamental theorem of linear programming
For a linear program in standard form where $A$ is an $m\times n$ matrix of rank $m$,
i) if there is a feasible solution, there is a basic feasible solution;
ii) if there is an optimal solution, there is a basic feasible optimal solution.
Further, if more than one basic feasible solution is optimal, the edge between the basic feasible optimal solutions is also optimal. The theorem means the search for the optimal solution can be restricted to basic feasible solutions, a finite number of points.
H.1.3 duality theorems
Optimization programs come in pairs. That is, there is a complementary or dual program to the primary (primal) program. For instance, the dual to the maximization program is a minimization program, and vice versa.

primal program
\[
\max_{x\geq0}\;c^{T}x\quad s.t.\;Ax\leq r
\]
dual program
\[
\min_{\lambda\geq0}\;r^{T}\lambda\quad s.t.\;A^{T}\lambda\geq c
\]
or

primal program
\[
\min_{x\geq0}\;c^{T}x\quad s.t.\;Ax\geq r
\]
dual program
\[
\max_{\lambda\geq0}\;r^{T}\lambda\quad s.t.\;A^{T}\lambda\leq c
\]
where $\lambda$ is a vector of shadow prices or dual variable values. The dual of the dual program is the primal program.
strong duality theorem
If either the primal or dual has an optimal solution so does the other and
their optimal objective function values are equal. If one of the programs is
unbounded the other has no feasible solution.
weak duality theorem
For feasible solutions, the objective function value of the minimization program (say, the dual) is greater than or equal to that of the maximization program (say, the primal).
The intuition for the duality theorems is straightforward. Begin with the constraints
\[
Ax\leq r\qquad A^{T}\lambda\geq c
\]
Transposing both sides of the first constraint leaves the inequality unchanged.
\[
x^{T}A^{T}\leq r^{T}
\]
Now, post-multiply both sides of the first constraint by $\lambda$ and pre-multiply both sides of the second constraint by $x^{T}$; since both $\lambda$ and $x$ are non-negative the inequality is preserved.
\[
x^{T}A^{T}\lambda\leq r^{T}\lambda\qquad x^{T}A^{T}\lambda\geq x^{T}c
\]
Since $x^{T}c$ is a scalar, $x^{T}c=c^{T}x$. Now, combine the results and we have the relation we were after.
\[
c^{T}x\leq x^{T}A^{T}\lambda\leq r^{T}\lambda
\]
The solution to the dual lies above that for the primal except when they both reside at the optimum solution, in which case their objective function values are equal.
H.1.4 example
Suppose we wish to solve
\[
\max_{x\geq0}\;10x+12y\quad s.t.\;\left[\begin{array}{cc}2 & 1\\1 & 2\end{array}\right]\left[\begin{array}{c}x\\y\end{array}\right]\leq\left[\begin{array}{c}10\\10\end{array}\right]
\]
We only need to evaluate the objective function at each of the basic feasible solutions we earlier identified: $10\left(\frac{10}{3}\right)+12\left(\frac{10}{3}\right)=\frac{220}{3}$, $10\left(5\right)+12\left(0\right)=50=\frac{150}{3}$, and $10\left(0\right)+12\left(5\right)=60=\frac{180}{3}$. The optimal solution is $x=y=\frac{10}{3}$.
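The same answer can be checked with an off-the-shelf LP solver; the sketch below (objective and constraints from the example) also solves the dual program, previewing the duality discussion above:

```python
# Solve the example LP and its dual with scipy and confirm the objective
# values coincide (strong duality).
import numpy as np
from scipy.optimize import linprog

c = np.array([10.0, 12.0])
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
r = np.array([10.0, 10.0])

# primal: max c'x s.t. Ax <= r, x >= 0  (linprog minimizes, so negate c)
primal = linprog(-c, A_ub=A, b_ub=r, bounds=[(0, None)] * 2, method="highs")

# dual: min r'lam s.t. A'lam >= c, lam >= 0  (flip A'lam >= c to -A'lam <= -c)
dual = linprog(r, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 2, method="highs")

print(primal.x, -primal.fun)      # x = y = 10/3, objective 220/3
print(dual.x, dual.fun)           # shadow prices; objective also 220/3
```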
H.1.5 complementary slackness
Suppose $x\geq0$ is an $n$ element vector containing a feasible primal solution, $\lambda\geq0$ is an $m$ element vector containing a feasible dual solution, $s\geq0$ is an $m$ element vector containing primal slack variables, and $t\geq0$ is an $n$ element vector containing dual slack variables. Then, $x$ and $\lambda$ are optimal if and only if (element-by-element)
\[
xt=0
\]
and
\[
\lambda s=0
\]
These conditions are economically sensible as either the scarce resource is exhausted ($s=0$) or, if the resource is plentiful, it has no value ($\lambda=0$).
H.2 Nonlinear programming
H.2.1 unconstrained
Nonlinear programs involve nonlinear objective functions. For instance,
\[
\max_{x}\;f\left(x\right)
\]
If the function is continuously differentiable, then a local optimum can be found by the first order approach. That is, equate the gradient (a vector of partial derivatives composed of terms $\frac{\partial f\left(x\right)}{\partial x_{i}}$, $i=1,\ldots,n$, where there are $n$ choice variables, $x$) to zero.
\[
\nabla f\left(x^{*}\right)=0
\]
\[
\left[\begin{array}{c}
\frac{\partial f\left(x\right)}{\partial x_{1}}\\
\vdots\\
\frac{\partial f\left(x\right)}{\partial x_{n}}
\end{array}\right]=\left[\begin{array}{c}
0\\
\vdots\\
0
\end{array}\right]
\]
Second order (sufficient) conditions involve the Hessian, a matrix of second partial derivatives.
\[
H\left(x^{*}\right)=\left[\begin{array}{ccc}
\frac{\partial^{2}f\left(x\right)}{\partial x_{1}\partial x_{1}} & \cdots & \frac{\partial^{2}f\left(x\right)}{\partial x_{1}\partial x_{n}}\\
\vdots & \ddots & \vdots\\
\frac{\partial^{2}f\left(x\right)}{\partial x_{n}\partial x_{1}} & \cdots & \frac{\partial^{2}f\left(x\right)}{\partial x_{n}\partial x_{n}}
\end{array}\right]
\]
For a local minimum, the Hessian is positive definite (the eigenvalues of $H$ are positive), while for a local maximum, the Hessian is negative definite (the eigenvalues of $H$ are negative).
H.2.2 convexity and global minima
If $f$ is a convex function (defined below), then the set where $f$ achieves its local minimum is convex and any local minimum is a global minimum. A function $f$ is convex if for every $x_{1}$, $x_{2}$, and $\alpha$, $0\leq\alpha\leq1$,
\[
f\left(\alpha x_{1}+\left(1-\alpha\right)x_{2}\right)\leq\alpha f\left(x_{1}\right)+\left(1-\alpha\right)f\left(x_{2}\right)
\]
If $x_{1}\neq x_{2}$ and $0<\alpha<1$,
\[
f\left(\alpha x_{1}+\left(1-\alpha\right)x_{2}\right)<\alpha f\left(x_{1}\right)+\left(1-\alpha\right)f\left(x_{2}\right)
\]
then $f$ is strictly convex. If $g=-f$ and $f$ is (strictly) convex, then $g$ is (strictly) concave.
H.2.3 example
Suppose we face the problem
\[
\min_{x,y}\;f\left(x,y\right)=x^{2}-10x+y^{2}-10y+xy
\]
The first order conditions are
\[
\nabla f\left(x,y\right)=0
\]
or
\begin{align*}
\frac{\partial f}{\partial x} & =2x-10+y=0\\
\frac{\partial f}{\partial y} & =2y-10+x=0
\end{align*}
Since the problem is quadratic and the gradient is composed of linearly independent equations, a unique solution is immediately identifiable.
\[
\left[\begin{array}{cc}2 & 1\\1 & 2\end{array}\right]\left[\begin{array}{c}x\\y\end{array}\right]=\left[\begin{array}{c}10\\10\end{array}\right]
\]
or $x=y=\frac{10}{3}$ with objective function value $-\frac{100}{3}$. As the Hessian is positive definite, this solution is a minimum.
\[
H=\left[\begin{array}{cc}2 & 1\\1 & 2\end{array}\right]
\]
Positive definiteness of the Hessian follows as the eigenvalues of $H$ are positive. To see this, recall the sum of the eigenvalues equals the trace of the matrix and the product of the eigenvalues equals the determinant of the matrix. The eigenvalues of $H$ are 1 and 3, both positive.
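A quick numerical sketch confirms the stationary point, the objective value, and the Hessian's eigenvalues for this example:

```python
# Verify the unconstrained example: solve the first order conditions, evaluate
# the objective, and check the Hessian's eigenvalues.
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # Hessian of f(x, y)
sol = np.linalg.solve(H, np.array([10.0, 10.0]))
x, y = sol
f_val = x**2 - 10 * x + y**2 - 10 * y + x * y

print(sol)                              # [10/3, 10/3]
print(f_val)                            # -100/3 = -33.333...
print(np.linalg.eigvalsh(H))            # [1, 3] -> positive definite
```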
H.2.4 constrained — the Lagrangian
Nonlinear programs involve either nonlinear objective functions, constraints, or both. For instance,
\[
\max_{x\geq0}\;f\left(x\right)\quad s.t.\;G\left(x\right)\leq r
\]
Suppose the objective function and constraints are continuously differentiable and concave and an optimal solution exists; then the optimal solution can be found via the Lagrangian. The Lagrangian writes the objective function less a Lagrange multiplier times each of the constraints. As either the multiplier is zero or the constraint is binding, each constraint term equals zero.
\[
L=f\left(x\right)-\lambda_{1}\left[g_{1}\left(x\right)-r_{1}\right]-\cdots-\lambda_{n}\left[g_{n}\left(x\right)-r_{n}\right]
\]
where $G\left(x\right)$ involves $n$ functions, $g_{i}\left(x\right)$, $i=1,\ldots,n$. Suppose $x$ involves $m$ choice variables. Then, there are $m$ Lagrange equations plus the $n$ constraints that determine the optimal solution.
\begin{align*}
\frac{\partial L}{\partial x_{1}} & =0\\
 & \;\vdots\\
\frac{\partial L}{\partial x_{m}} & =0\\
\lambda_{1}\left[g_{1}\left(x\right)-r_{1}\right] & =0\\
 & \;\vdots\\
\lambda_{n}\left[g_{n}\left(x\right)-r_{n}\right] & =0
\end{align*}
The Lagrange multipliers (shadow prices or dual variable values) represent the rate of change in the optimal objective function for each of the constraints,
\[
\lambda_{i}=\frac{\partial f\left(r^{*}\right)}{\partial r_{i}}
\]
where $r^{*}$ refers to rewriting the optimal solution $x^{*}$ in terms of the constraint values, $r$. If a constraint is not binding, its multiplier is zero as it has no impact on the optimal objective function value.²
H.2.5 Karush-Kuhn-Tucker conditions
Originally, the Lagrangian only allowed for equality constraints. This was generalized to include inequality constraints by Karush and, separately, Kuhn and Tucker. Of course, some regularity conditions are needed to ensure optimality. Various necessary and sufficient conditions have been proposed to deal with the most general settings. The Karush-Kuhn-Tucker theorem supplies first order necessary conditions for a local optimum (the gradient of the Lagrangian and the Lagrange multiplier times each inequality constraint equal zero, with the Lagrange multipliers on the inequality constraints non-negative, evaluated at $x^{*}$). Second order necessary (positive semi-definite Hessian for the Lagrangian at $x^{*}$) and sufficient (positive definite Hessian for the Lagrangian at $x^{*}$) conditions are roughly akin to those for unconstrained local minima.
2 Of course, these ideas regarding the multipliers apply to the shadow prices of linear
programs as well.
H.2.6 example
Continue with the unconstrained example above with an added constraint,
\[
\min_{x,y}\;f\left(x,y\right)=x^{2}-10x+y^{2}-10y+xy\quad s.t.\;xy\geq10
\]
Since the unconstrained solution satisfies this constraint, $xy=\frac{100}{9}>10$, the solution remains $x=y=\frac{10}{3}$.
However, suppose the problem is
\[
\min_{x,y}\;f\left(x,y\right)=x^{2}-10x+y^{2}-10y+xy\quad s.t.\;xy\geq20
\]
The constraint is now active. The Lagrangian is
\[
L=x^{2}-10x+y^{2}-10y+xy-\lambda\left(xy-20\right)
\]
and the first order conditions are
\[
\nabla L=0
\]
or
\begin{align*}
\frac{\partial L}{\partial x} & =2x-10+y-\lambda y=0\\
\frac{\partial L}{\partial y} & =2y-10+x-\lambda x=0
\end{align*}
and constraint equation
\[
\lambda\left(xy-20\right)=0
\]
A solution to these nonlinear equations is
\[
x=y=2\sqrt{5},\qquad\lambda=3-\sqrt{5}
\]
The objective function value is $-29.4427$ which, of course, is greater than the objective function value for the unconstrained problem, $-33.3333$. The Hessian is
\[
H=\left[\begin{array}{cc}2 & 1-\lambda\\1-\lambda & 2\end{array}\right]
\]
with eigenvalues evaluated at the solution, $3-\lambda=\sqrt{5}$ and $1+\lambda=4-\sqrt{5}$, both positive. Hence, the solution is a minimum.
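The constrained example can be replicated numerically; the sketch below solves it with scipy's SLSQP routine and compares against the closed-form answer:

```python
# Constrained example: minimize f subject to xy >= 20, solved numerically and
# compared with the closed-form solution x = y = 2*sqrt(5).
import numpy as np
from scipy.optimize import minimize

def f(v):
    x, y = v
    return x**2 - 10 * x + y**2 - 10 * y + x * y

cons = [{"type": "ineq", "fun": lambda v: v[0] * v[1] - 20}]   # xy - 20 >= 0
res = minimize(f, x0=np.array([5.0, 5.0]), method="SLSQP", constraints=cons)

print(res.x, f(res.x))                        # approx [4.4721, 4.4721], -29.4427
print(2 * np.sqrt(5), 60 - 40 * np.sqrt(5))   # closed form for comparison
```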
Appendix I
Quantum information
Quantum information follows from physical theory and experiments. Unlike classical information, which is framed in the language and algebra of set theory, quantum information is framed in the language and algebra of vector spaces. To begin to appreciate the difference, consider a set versus a tuple (or vector). For example, the tuple $\left(1,1\right)$ is different than the tuple $\left(1,1,1\right)$, but the sets $\left\{1,1\right\}$ and $\left\{1,1,1\right\}$ are the same as the set $\left\{1\right\}$. Quantum information is axiomatic. Its richness and elegance are demonstrated in that only four axioms make it complete.¹
I.1 Quantum information axioms
I.1.1 The superposition axiom
A quantum unit (qubit) is specified by a two element vector, say $\left[\begin{array}{c}\alpha\\\beta\end{array}\right]$, with $\left|\alpha\right|^{2}+\left|\beta\right|^{2}=1$.
1 Some argue that the measurement axiom is contained in the transformation axiom, hence requiring only three axioms. Even though we appreciate the merits of the
argument, we’ll proceed with four axioms. Feel free to count them as only three if you
prefer.
Let $\left|\psi\right\rangle =\alpha\left|0\right\rangle +\beta\left|1\right\rangle$,² and $\left\langle \psi\right|=\left(\left|\psi\right\rangle \right)^{\dagger}$ where $\dagger$ is the adjoint (conjugate transpose) operation.
I.1.2 The transformation axiom
A transformation of a quantum unit is accomplished by unitary (length-preserving) matrix multiplication. The Pauli matrices provide a basis of unitary operators,
\[
I=\left[\begin{array}{cc}1 & 0\\0 & 1\end{array}\right]\quad X=\left[\begin{array}{cc}0 & 1\\1 & 0\end{array}\right]\quad Y=\left[\begin{array}{cc}0 & -i\\i & 0\end{array}\right]\quad Z=\left[\begin{array}{cc}1 & 0\\0 & -1\end{array}\right]
\]
where $i=\sqrt{-1}$. The operations work as follows: $I\left[\begin{array}{c}\alpha\\\beta\end{array}\right]=\left[\begin{array}{c}\alpha\\\beta\end{array}\right]$, $X\left[\begin{array}{c}\alpha\\\beta\end{array}\right]=\left[\begin{array}{c}\beta\\\alpha\end{array}\right]$, $Y\left[\begin{array}{c}\alpha\\\beta\end{array}\right]=\left[\begin{array}{c}-i\beta\\i\alpha\end{array}\right]$, and $Z\left[\begin{array}{c}\alpha\\\beta\end{array}\right]=\left[\begin{array}{c}\alpha\\-\beta\end{array}\right]$. Other useful single qubit transformations are
\[
H=\frac{1}{\sqrt{2}}\left[\begin{array}{cc}1 & 1\\1 & -1\end{array}\right]\quad\text{and}\quad\Phi=\left[\begin{array}{cc}e^{i\phi} & 0\\0 & 1\end{array}\right]
\]
Examples of these transformations in Dirac notation are³
\[
H\left|0\right\rangle =\frac{\left|0\right\rangle +\left|1\right\rangle }{\sqrt{2}};\quad H\left|1\right\rangle =\frac{\left|0\right\rangle -\left|1\right\rangle }{\sqrt{2}};\quad\Phi\left|0\right\rangle =e^{i\phi}\left|0\right\rangle ;\quad\Phi\left|1\right\rangle =\left|1\right\rangle
\]
I.1.3 The measurement axiom
Measurement occurs via interaction with the quantum state as if a linear projection is applied to the quantum state.⁴ The set of projection matrices is complete as they add to the identity matrix,
\[
\sum_{m}M_{m}^{\dagger}M_{m}=I
\]
where $M_{m}^{\dagger}$ is the adjoint (conjugate transpose) of projection matrix $M_{m}$. The probability of a particular measurement occurring is the squared absolute value of the projection. (An implication of the axiom not explicitly used here is that the post-measurement state is the projection appropriately normalized; this effectively rules out multiple measurement.)
For example, let the projection matrices be $M_{0}=\left|0\right\rangle \left\langle 0\right|=\left[\begin{array}{cc}1 & 0\\0 & 0\end{array}\right]$ and $M_{1}=\left|1\right\rangle \left\langle 1\right|=\left[\begin{array}{cc}0 & 0\\0 & 1\end{array}\right]$. Note that $M_{0}$ projects onto the $\left|0\right\rangle$ vector and $M_{1}$ projects onto the $\left|1\right\rangle$ vector. Also note that $M_{0}^{\dagger}M_{0}+M_{1}^{\dagger}M_{1}=M_{0}+M_{1}=I$. For $\left|\psi\right\rangle =\alpha\left|0\right\rangle +\beta\left|1\right\rangle$, the projection of $\left|\psi\right\rangle$ onto $\left|0\right\rangle$ is $M_{0}\left|\psi\right\rangle$. The probability of $\left|0\right\rangle$ being the result of the measurement is $\left\langle \psi\right|M_{0}\left|\psi\right\rangle =\left|\alpha\right|^{2}$.

2 Dirac notation is a useful descriptor, as $\left|0\right\rangle =\left[\begin{array}{c}1\\0\end{array}\right]$ and $\left|1\right\rangle =\left[\begin{array}{c}0\\1\end{array}\right]$.
3 A summary table for common quantum operators expressed in Dirac notation is provided in section I.1.5.
4 This is a compact way of describing measurement. However, conservation of information (a principle of quantum information) demands that operations be reversible. In other words, all transformations (including interactions to measure) must be unitary; projections are not unitary. However, there always exist unitary operators that produce the same post-measurement state as that indicated via projection. Hence, we treat projections as an expedient for describing measurement.
I.1.4 The combination axiom
Qubits are combined by tensor multiplication. For example, two $\left|0\right\rangle$ qubits are combined as $\left|0\right\rangle \otimes\left|0\right\rangle =\left[\begin{array}{c}1\\0\\0\\0\end{array}\right]$, denoted $\left|00\right\rangle$. It is often useful to transform one qubit in a combination and leave another unchanged; this can also be accomplished by tensor multiplication. Let $H_{1}$ denote a Hadamard transformation on the first qubit. Then, applied to a two qubit system,
\[
H_{1}=H\otimes I=\frac{1}{\sqrt{2}}\left[\begin{array}{cccc}
1 & 0 & 1 & 0\\
0 & 1 & 0 & 1\\
1 & 0 & -1 & 0\\
0 & 1 & 0 & -1
\end{array}\right]\quad\text{and}\quad H_{1}\left|00\right\rangle =\frac{\left|00\right\rangle +\left|10\right\rangle }{\sqrt{2}}
\]
Another important two qubit transformation is the controlled not operator,
\[
C_{not}=\left[\begin{array}{cccc}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 1 & 0
\end{array}\right]
\]
The controlled not operator flips the target, second qubit, if the control, first qubit, equals $\left|1\right\rangle$ and otherwise leaves the target unchanged: $C_{not}\left|00\right\rangle =\left|00\right\rangle$, $C_{not}\left|01\right\rangle =\left|01\right\rangle$, $C_{not}\left|10\right\rangle =\left|11\right\rangle$, and $C_{not}\left|11\right\rangle =\left|10\right\rangle$.
Entangled two qubit states or Bell states are defined as follows,
\[
\left|\beta_{00}\right\rangle =C_{not}H_{1}\left|00\right\rangle =\frac{\left|00\right\rangle +\left|11\right\rangle }{\sqrt{2}}
\]
and more generally,
\[
\left|\beta_{ij}\right\rangle =C_{not}H_{1}\left|ij\right\rangle \quad\text{for }i,j=0,1
\]
The four two qubit Bell states form an orthonormal basis.
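A small numpy sketch builds these operators with Kronecker products and reproduces the Bell state above:

```python
# Build two-qubit operators by Kronecker products and construct the Bell state
# Cnot H1 |00> = (|00> + |11>)/sqrt(2).
import numpy as np

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])
H = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)
I2 = np.eye(2)
Cnot = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

ket00 = np.kron(ket0, ket0)          # |00>
H1 = np.kron(H, I2)                  # Hadamard on the first qubit only
bell_00 = Cnot @ H1 @ ket00

print(bell_00)                                       # [1/sqrt(2), 0, 0, 1/sqrt(2)]
target = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)
print(np.allclose(bell_00, target))                  # True
```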
I.1.5 Summary of quantum "rules"
Below we tabulate two qubit quantum operator rules. The column heading indicates the initial state while the row value corresponding to the operator identifies the transformed state. Of course, the same rules apply to one qubit (except $C_{not}$) or many qubits; we simply have to exercise care to identify which qubit is transformed by the operator (we continue to denote the target qubit via the subscript on the operator).
operator      |00⟩                |01⟩                |10⟩                |11⟩
I_1 or I_2    |00⟩                |01⟩                |10⟩                |11⟩
X_1           |10⟩                |11⟩                |00⟩                |01⟩
X_2           |01⟩                |00⟩                |11⟩                |10⟩
Z_1           |00⟩                |01⟩                -|10⟩               -|11⟩
Z_2           |00⟩                -|01⟩               |10⟩                -|11⟩
Y_1           i|10⟩               i|11⟩               -i|00⟩              -i|01⟩
Y_2           i|01⟩               -i|00⟩              i|11⟩               -i|10⟩
H_1           (|00⟩+|10⟩)/√2      (|01⟩+|11⟩)/√2      (|00⟩-|10⟩)/√2      (|01⟩-|11⟩)/√2
H_2           (|00⟩+|01⟩)/√2      (|00⟩-|01⟩)/√2      (|10⟩+|11⟩)/√2      (|10⟩-|11⟩)/√2
Φ_1           e^{iφ}|00⟩          e^{iφ}|01⟩          |10⟩                |11⟩
Φ_2           e^{iφ}|00⟩          |01⟩                e^{iφ}|10⟩          |11⟩
C_not         |00⟩                |01⟩                |11⟩                |10⟩
common quantum operator rules
I.1.6 Observables and expected payoffs
Measurements involve real values drawn from observables. To ensure measurements lead to real values, observables are also represented by Hermitian matrices. A Hermitian matrix is one in which the conjugate transpose of the matrix equals the original matrix, $M^{\dagger}=M$. Hermitian matrices have real eigenvalues and the eigenvalues are the values realized via measurement. Suppose we are working with observable $M$ in state $\left|\psi\right\rangle$ where
\[
M=\left[\begin{array}{cccc}
\lambda_{1} & 0 & 0 & 0\\
0 & \lambda_{2} & 0 & 0\\
0 & 0 & \lambda_{3} & 0\\
0 & 0 & 0 & \lambda_{4}
\end{array}\right]=\lambda_{1}\left|00\right\rangle \left\langle 00\right|+\lambda_{2}\left|01\right\rangle \left\langle 01\right|+\lambda_{3}\left|10\right\rangle \left\langle 10\right|+\lambda_{4}\left|11\right\rangle \left\langle 11\right|
\]
The expected payoff is
\begin{align*}
\left\langle M\right\rangle =\left\langle \psi\right|M\left|\psi\right\rangle  & =\lambda_{1}\left\langle \psi\mid00\right\rangle \left\langle 00\mid\psi\right\rangle +\lambda_{2}\left\langle \psi\mid01\right\rangle \left\langle 01\mid\psi\right\rangle \\
 & \quad+\lambda_{3}\left\langle \psi\mid10\right\rangle \left\langle 10\mid\psi\right\rangle +\lambda_{4}\left\langle \psi\mid11\right\rangle \left\langle 11\mid\psi\right\rangle
\end{align*}
In other words, $\lambda_{1}$ is observed with probability $\left\langle \psi\mid00\right\rangle \left\langle 00\mid\psi\right\rangle$, $\lambda_{2}$ is observed with probability $\left\langle \psi\mid01\right\rangle \left\langle 01\mid\psi\right\rangle$, $\lambda_{3}$ is observed with probability $\left\langle \psi\mid10\right\rangle \left\langle 10\mid\psi\right\rangle$, and $\lambda_{4}$ is observed with probability $\left\langle \psi\mid11\right\rangle \left\langle 11\mid\psi\right\rangle$.
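A brief sketch (with hypothetical eigenvalues and the Bell state from above as the system state) computes the expected payoff and the corresponding outcome probabilities:

```python
# Expected payoff of a diagonal observable in a two-qubit state
# (hypothetical eigenvalues; the state is the Bell state (|00> + |11>)/sqrt(2)).
import numpy as np

lam = np.array([1.0, 2.0, 3.0, 4.0])       # eigenvalues lambda_1 ... lambda_4
M = np.diag(lam)                           # observable in the |00>,|01>,|10>,|11> basis

psi = np.zeros(4)
psi[0] = psi[3] = 1 / np.sqrt(2)           # (|00> + |11>)/sqrt(2)

probs = np.abs(psi)**2                     # <psi|b><b|psi> for each basis state b
expected = psi.conj() @ M @ psi

print(probs)                               # [0.5, 0, 0, 0.5]
print(expected, probs @ lam)               # both 2.5
```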
Appendix J
Common distributions
To try to avoid confusion, we list our descriptions of common multivariate
and univariate distributions and their kernels. Others may employ variations on these definitions.
Multivariate distributions, their support, density $f\left(\cdot\right)$ functions and kernels, and conjugacy:

Gaussian (normal): $x\in R^{k}$, $\mu\in R^{k}$, $\Sigma\in R^{k\times k}$ positive definite.
\[
f\left(x;\mu,\Sigma\right)=\frac{1}{\left(2\pi\right)^{k/2}\left|\Sigma\right|^{1/2}}e^{-\frac{1}{2}\left(x-\mu\right)^{T}\Sigma^{-1}\left(x-\mu\right)}\propto e^{-\frac{1}{2}\left(x-\mu\right)^{T}\Sigma^{-1}\left(x-\mu\right)}
\]
Conjugacy: conjugate prior for mean of multivariate normal distribution.

Student t: $x\in R^{k}$, $\mu\in R^{k}$, $\Sigma\in R^{k\times k}$ positive definite.
\[
f\left(x;\mu,\Sigma,\nu\right)=\frac{\Gamma\left[\left(\nu+k\right)/2\right]}{\Gamma\left(\frac{\nu}{2}\right)\left(\nu\pi\right)^{k/2}\left|\Sigma\right|^{1/2}}\left[1+\frac{1}{\nu}\left(x-\mu\right)^{T}\Sigma^{-1}\left(x-\mu\right)\right]^{-\frac{\nu+k}{2}}\propto\left[1+\frac{1}{\nu}\left(x-\mu\right)^{T}\Sigma^{-1}\left(x-\mu\right)\right]^{-\frac{\nu+k}{2}}
\]
Conjugacy: marginal posterior for multi-normal with unknown mean and variance.

Wishart: $W\in R^{k\times k}$ positive definite, $S\in R^{k\times k}$ positive definite, $\nu>k-1$.
\[
f\left(W;\nu,S\right)=\frac{1}{2^{\nu k/2}\left|S\right|^{\nu/2}\Gamma_{k}\left(\frac{\nu}{2}\right)}\left|W\right|^{\frac{\nu-k-1}{2}}e^{-\frac{1}{2}Tr\left(S^{-1}W\right)}\propto\left|W\right|^{\frac{\nu-k-1}{2}}e^{-\frac{1}{2}Tr\left(S^{-1}W\right)}
\]
Conjugacy: see inverse Wishart.

Inverse Wishart: $W\in R^{k\times k}$ positive definite, $S\in R^{k\times k}$ positive definite, $\nu>k-1$.
\[
f\left(W;\nu,S\right)=\frac{\left|S\right|^{\nu/2}}{2^{\nu k/2}\Gamma_{k}\left(\frac{\nu}{2}\right)}\left|W\right|^{-\frac{\nu+k+1}{2}}e^{-\frac{1}{2}Tr\left(SW^{-1}\right)}\propto\left|W\right|^{-\frac{\nu+k+1}{2}}e^{-\frac{1}{2}Tr\left(SW^{-1}\right)}
\]
Conjugacy: conjugate prior for variance of multivariate normal distribution.

Here $\Gamma\left(n\right)=\left(n-1\right)!$ for $n$ a positive integer, $\Gamma\left(z\right)=\int_{0}^{\infty}t^{z-1}e^{-t}dt$, and $\Gamma_{k}\left(\frac{\nu}{2}\right)=\pi^{k\left(k-1\right)/4}\prod_{j=1}^{k}\Gamma\left(\frac{\nu+1-j}{2}\right)$.
Univariate distributions, their support, density $f\left(\cdot\right)$ functions and kernels, and conjugacy:

Beta: $x\in\left(0,1\right)$; $\alpha,\beta>0$.
\[
f\left(x;\alpha,\beta\right)=\frac{\Gamma\left(\alpha+\beta\right)}{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)}x^{\alpha-1}\left(1-x\right)^{\beta-1}\propto x^{\alpha-1}\left(1-x\right)^{\beta-1}
\]
Conjugacy: beta is conjugate prior to binomial.

Binomial: $s=1,2,\ldots$; $\pi\in\left(0,1\right)$.
\[
F\left(s;\pi\right)=\binom{n}{s}\pi^{s}\left(1-\pi\right)^{n-s}\propto\pi^{s}\left(1-\pi\right)^{n-s}
\]
Conjugacy: beta is conjugate prior to binomial.

Chi-square: $x\in[0,\infty)$; $\nu>0$.
\[
f\left(x;\nu\right)=\frac{x^{\nu/2-1}}{2^{\nu/2}\Gamma\left(\nu/2\right)}e^{-x/2}\propto x^{\nu/2-1}e^{-x/2}
\]
Conjugacy: see scaled, inverse chi-square.

Inverse chi-square: $x\in\left(0,\infty\right)$; $\nu>0$.
\[
f\left(x;\nu\right)=\frac{2^{-\nu/2}}{\Gamma\left(\nu/2\right)}x^{-\nu/2-1}e^{-1/\left(2x\right)}\propto x^{-\nu/2-1}e^{-1/\left(2x\right)}
\]
Conjugacy: see scaled, inverse chi-square.

Scaled, inverse chi-square: $x\in\left(0,\infty\right)$; $\nu,\sigma^{2}>0$.
\[
f\left(x;\nu,\sigma^{2}\right)=\frac{\left(\nu\sigma^{2}/2\right)^{\nu/2}}{\Gamma\left(\nu/2\right)}x^{-\nu/2-1}e^{-\nu\sigma^{2}/\left(2x\right)}\propto x^{-\nu/2-1}e^{-\nu\sigma^{2}/\left(2x\right)}
\]
Conjugacy: conjugate prior for variance of a normal distribution.

Exponential: $x\in[0,\infty)$; $\lambda>0$.
\[
f\left(x;\lambda\right)=\lambda\exp\left[-\lambda x\right]\propto\exp\left[-\lambda x\right]
\]
Conjugacy: gamma is conjugate prior to exponential.

Extreme value (logistic): $x\in\left(-\infty,\infty\right)$; $-\infty<\mu<\infty$, $s>0$.
\[
f\left(x;\mu,s\right)=\frac{\exp\left[-\left(x-\mu\right)/s\right]}{s\left(1+\exp\left[-\left(x-\mu\right)/s\right]\right)^{2}}\propto\frac{\exp\left[-\left(x-\mu\right)/s\right]}{\left(1+\exp\left[-\left(x-\mu\right)/s\right]\right)^{2}}
\]
Conjugacy: posterior for Bernoulli prior and normal likelihood.

Gamma: $x\in[0,\infty)$; $\alpha,\beta>0$.
\[
f\left(x;\alpha,\beta\right)=\frac{x^{\alpha-1}\exp\left[-x/\beta\right]}{\Gamma\left(\alpha\right)\beta^{\alpha}}\propto x^{\alpha-1}\exp\left[-x/\beta\right]
\]
Conjugacy: gamma is conjugate prior to exponential and others.

Inverse gamma: $x\in[0,\infty)$; $\alpha,\beta>0$.
\[
f\left(x;\alpha,\beta\right)=\frac{\beta^{\alpha}}{\Gamma\left(\alpha\right)}x^{-\alpha-1}\exp\left[-\beta/x\right]\propto x^{-\alpha-1}\exp\left[-\beta/x\right]
\]
Conjugacy: conjugate prior for variance of a normal distribution.

Gaussian (normal): $x\in\left(-\infty,\infty\right)$; $-\infty<\mu<\infty$, $\sigma>0$.
\[
f\left(x;\mu,\sigma^{2}\right)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{\left(x-\mu\right)^{2}}{2\sigma^{2}}\right]\propto\exp\left[-\frac{\left(x-\mu\right)^{2}}{2\sigma^{2}}\right]
\]
Conjugacy: conjugate prior for mean of a normal distribution.

Student t: $x\in\left(-\infty,\infty\right)$; $\mu\in\left(-\infty,\infty\right)$; $\nu,\sigma>0$.
\[
f\left(x;\mu,\sigma,\nu\right)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}\,\sigma}\left[1+\frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]^{-\frac{\nu+1}{2}}\propto\left[1+\frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]^{-\frac{\nu+1}{2}}
\]
Conjugacy: marginal posterior for a normal with unknown mean and variance.

Pareto: $x\geq x_{0}$; $\beta,x_{0}>0$.
\[
f\left(x;\beta,x_{0}\right)=\frac{\beta x_{0}^{\beta}}{x^{\beta+1}}\propto\frac{1}{x^{\beta+1}}
\]
Conjugacy: conjugate prior for unknown bound of a uniform.