University of Toronto
Department of Economics
ECO2010H1F
Mathematics and Statistics for PhD Students
Instructor: Prof. Martin Burda
Summer 2022
Lecture Notes
Contents

Part I  Topology and Real Analysis

1 Methods of Proofs
  1.1 Preliminaries
  1.2 Proof Techniques

2 Set Theory
  2.1 Sets and Subsets
  2.2 Functions
  2.3 Sequences

3 Metric Spaces
  3.1 Metric
  3.2 Examples of the Metric Function
  3.3 Compactness
  3.4 Completeness

4 Analysis in Metric Spaces
  4.1 Continuity
  4.2 Fixed Points
  4.3 Convergence

5 Vector Spaces
  5.1 Normed Space
  5.2 Linear Combination
  5.3 Linear Independence
  5.4 Basis and Dimension
  5.5 Vector Spaces Associated with Matrices

6 Linear Algebra in Vector Spaces
  6.1 Linear Transformations
  6.2 Inner Product Spaces
  6.3 Orthogonal Bases
  6.4 Projections
  6.5 Euclidean Spaces
  6.6 Separating Hyperplane Theorem

7 Correspondences
  7.1 Continuity of Correspondences
  7.2 Kakutani's Fixed Point Theorem

8 Continuity
  8.1 Convexity
  8.2 The Maximum Theorem
  8.3 Semicontinuity

Part II  Optimization

9 Constrained Optimization
  9.1 Equality Constraints
  9.2 Extreme Value Theorem
  9.3 Inequality Constraints
  9.4 Application: The Consumer Problem
  9.5 Envelope Theorem for Unconstrained Optimization
  9.6 Envelope Theorem for Constrained Optimization

10 Dynamic Optimization
  10.1 Motivation: Labor Supply
  10.2 Approaches to Dynamic Optimization
  10.3 2 Periods
  10.4 T Periods
  10.5 General Problem
  10.6 Lagrangean approach
  10.7 Maximum Principle
  10.8 Infinite Horizon

11 Dynamic Programming
  11.1 Finite Horizon: Key Concepts
  11.2 Infinite Horizon

12 DP Application - Economic Growth
  12.1 Euler Equation
  12.2 Hamiltonian
  12.3 Dynamic Programming
  12.4 Infinite Horizon

13 DP Application - Labor Supply
  13.1 Intertemporal Labor Supply
  13.2 Finite Horizon with No Uncertainty
  13.3 Finite Horizon with Uncertainty
  13.4 Infinite Horizon without Uncertainty
  13.5 Infinite Horizon with Uncertainty
  13.6 Example: Intertemporal Labor Supply Problem with Finite T

14 Dynamic Optimization in Continuous Time
  14.1 Discounting in Continuous Time
  14.2 Continuous Time Optimization
  14.3 Parallels with Discrete Time
  14.4 Application - Optimal Growth

15 Numerical Optimization
  15.1 Golden Search
  15.2 Nelder-Mead
  15.3 Newton-Raphson Method
  15.4 Quasi-Newton Methods

Part III  Statistical Analysis

16 Introduction to Probability
  16.1 Randomness and Probability
  16.2 Inference

17 Measure-Theoretic Probability
  17.1 Elements of Measure Theory
  17.2 Random Variable
  17.3 Conditional Probability and Independence
  17.4 Bayes Rule

18 Random Variables and Distributions
  18.1 Moments
  18.2 Transformations of Random Variables
  18.3 Parametric Distributions

19 Statistical Properties of Estimators
  19.1 Sampling Distribution
  19.2 Finite Sample Properties
  19.3 Convergence in Probability and Consistency
  19.4 Convergence in Distribution and Asymptotic Normality

20 Stochastic Orders and Delta Method
  20.1 A-notation
  20.2 O-notation
  20.3 o-notation
  20.4 Op-notation
  20.5 Delta Method

21 Regression with Matrix Algebra
  21.1 Multiple Regression in Matrix Notation
  21.2 MLR Assumptions in Matrix Notation
  21.3 Projection Matrices
  21.4 Unbiasedness of OLS
  21.5 Variance-Covariance Matrix of the OLS Estimator
  21.6 Gauss-Markov in Matrix Notation

22 Maximum Likelihood
  22.1 Method of Maximum Likelihood
  22.2 Examples
  22.3 Information Matrix Equality
  22.4 Asymptotic Properties of MLE

23 Generalized Method of Moments
  23.1 Motivation
  23.2 GMM Principle
  23.3 Example: Euler Equation Asset Pricing Model

24 Testing of Nonlinear Hypotheses
  24.1 Wald Test
  24.2 Likelihood Ratio Test
  24.3 Lagrange Multiplier Test
  24.4 LM Test in Auxiliary Regressions
  24.5 Test Comparison

25 Bootstrap Approximation
  25.1 The Empirical Distribution Function
  25.2 Bootstrap

26 Elements of Bayesian Analysis
  26.1 The Normal Linear Regression Model
  26.2 Bernstein-Von Mises Theorem
  26.3 Posterior Sampling

27 Markov Chain Monte Carlo
  27.1 Acceptance Sampling
  27.2 Metropolis-Hastings Algorithm

28 Neural Networks and Machine Learning
  28.1 Artificial Neural Networks
  28.2 Feed-Forward Network
  28.3 Machine Learning for NNs
  28.4 Cross-Validation
  28.5 Model Fit
  28.6 Recurrent Neural Networks

Part IV  References
Part I
Topology and Real Analysis
1 Methods of Proofs
1.1 Preliminaries
Consider the statements A and B. We often construct statements using the two quantifiers: ∃ (read as "there exists") and ∀ (read as "for all"). More specifically, (∃x ∈ A)[B(x)] means "there exists an x in the set A such that B(x)" and (∀x ∈ A)[B(x)] means "for all x in the set A, B(x)".
We can construct new statements out of existing statements by again using the connectives "and" (∧), "or" (∨), "not" (¬). A ∧ B means "A and B", A ∨ B means "A or B", ¬A means "not A".
Terminology:
Implication: A ⟹ B, i.e. "A implies B".
Inverse: ¬A ⟹ ¬B.
Converse: B ⟹ A.
Contrapositive: ¬B ⟹ ¬A.
Equivalence: A ⟺ B, which means "A is equivalent to B", i.e. {A ⟹ B} ∧ {B ⟹ A}.
Statements:
A Theorem or Proposition is a statement that we prove to be true.
A Conjecture is a statement that does not have a proof.
A Lemma is a theorem we use to prove another theorem.
A Corollary is a theorem whose proof follows directly from a Theorem.
A Definition is a statement that is true by interpreting one of its terms in such a way as to make the statement true.
An Axiom or Assumption is a statement that is taken to be true without proof.
A Tautology is a statement which is true without assumptions (for example, x = x).
A Contradiction is a statement that cannot be true (for example, "A is true and A
is false").
1.2 Proof Techniques

1.2.1 Direct Proof
A direct proof establishes the conclusion by a logical combination of the hypothesis with axioms, definitions, and previously proved theorems. Example:
Definition 1. An integer x is called even (respectively odd) if there is another integer k for which x = 2k (respectively x = 2k + 1).
Theorem 2. The sum of two even integers is always even.
Proof. Let x and y be any two even integers, so there exist integers a and b such that x = 2a
and y = 2b. Then, x + y = 2a + 2b = 2(a + b), which is even.
1.2.2 Constructive Proof
A constructive proof demonstrates the existence of an object by creating or providing a
method for creating such an object. Example:
Definition 3. A rational number r is any number that can be expressed as the quotient p/q of two integers p and q, with q ≠ 0.
Theorem 4. For every positive rational number x, there exists a rational number y such
that 0 < y < x.
Proof. Let x be any positive rational number. Then there exist positive integers a and b such that x = a/b. Let y = a/(2b). Thus, y > 0 is also rational and

y = a/(2b) < a/b = x

1.2.3 Proof by Contradiction
A proof by contradiction shows that if the conclusion were false, a logical contradiction would occur. We start by assuming the hypothesis is true and the conclusion is false, and then we derive the contradiction. To prove a logical implication A ⟹ B, assume A and ¬B and derive a contradiction from this assumption. Example:
Theorem 5. There is no greatest even integer.
Note that here, for simplicity, the Theorem states only B, while A is a vacuous statement (anything).
Proof. Suppose ¬B, i.e. there is a greatest even integer N. Then for every even integer n, N ≥ n. Now let M = N + 2. Then M is an even integer by definition, and also M > N by definition. This contradicts the supposition that N ≥ n for every even integer n. Hence, the supposition is false and the statement in the Theorem is true.
1.2.4 Proof by Contrapositive
An implication and its contrapositive are equivalent statements:

{A ⟹ B} ⟺ {¬B ⟹ ¬A}
Proving the contrapositive is often more convenient or intuitive than proving the original
statement. Example:
Definition 6. Two integers are said to have the same parity if they are both odd or both even.
Theorem 7. If x and y are two integers for which x + y is even, then x and y have the
same parity.
The contrapositive version of the Theorem is:
"If x and y are two integers with opposite parity, then their sum must be odd."
Proof. 1. Assume x and y have opposite parity.
2. Suppose (without loss of generality) that x is even and y is odd. Thus, there are integers
k and m for which x = 2k and y = 2m + 1.
3. Compute the sum x + y = 2k + 2m + 1 = 2(k + m) + 1, which is an odd integer by definition.
1.2.5 Proof by Induction
The induction technique is useful for proving theorems involving an infinite number of cases based on some discrete method of categorizing them. Examples are theorems of the form "For every integer n ≥ n₀, a property P(n) holds", where P(n) depends on n. There are two types of induction, weak induction and strong induction. Both involve proving a base case, for example proving the statement P(n₀), and then an inductive step in which we build up the proof for the other cases. In the inductive step, we assume the result holds for one or more cases and then prove it for a new case. The difference between weak and strong induction comes in the form of this assumption, which is also called the inductive hypothesis.

In weak induction, the inductive hypothesis assumes the result is true for one case. For example, in the inductive step of a proof on the positive integers, we assume P(n) is true for some n and prove P(n + 1) is also true. Example:
Theorem 8. For any positive integer n,

1 + 2 + … + n = n(n + 1)/2     (1)
Proof. For the base case, take n = 1. Note that 1 = 2/2 = 1(1 + 1)/2. For the inductive step, suppose the result (1) holds for some positive integer n. Then, by adding (n + 1) to both sides we obtain

1 + 2 + … + n + (n + 1) = n(n + 1)/2 + (n + 1)
                        = n(n + 1)/2 + 2(n + 1)/2
                        = (n + 1)(n + 2)/2     (2)

Notice the final equation (2) in the proof is just the equation (1) of the theorem with n replaced by n + 1. Thus, we have shown the result holds for n + 1 if it holds for n. The base case tells us it holds for n = 1. Then the inductive step says it also holds for n = 2. Once it holds for n = 2, the inductive step says it also holds for n = 3. This chain of consequences continues for all positive integers, and thus we proved the theorem.
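As a quick numerical sanity check of (1) (not a substitute for the proof), one can verify the formula over a large range of n, e.g. in Python:

# Check 1 + 2 + ... + n == n(n + 1)/2 for the first few thousand cases.
for n in range(1, 5001):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
print("Formula (1) verified for n = 1, ..., 5000")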
Strong induction employs a stronger inductive hypothesis that assumes the result is true for multiple cases. For example, we might assume P(k) is true for all integers 1 ≤ k ≤ n and prove P(n + 1) is also true. Example:
Theorem 9 (The Fundamental Theorem of Arithmetic). Every integer n ≥ 2 can be factored into a product of primes

n = p₁p₂ ⋯ pᵣ

in exactly one way.
This theorem actually contains two assertions: (1) the number n can be factored into a
product of primes in some way, and (2) there is only one such way, aside from rearranging
the factors. We will only prove the first assertion.
Recall that a prime is a natural number greater than 1 that has no positive divisors
other than 1 and itself. A natural number greater than 1 that is not a prime number is
called a composite number.
Proof. For the base cases, consider n = 2, 3, 4. Certainly, 2 = 2, 3 = 3, and 4 = 2², so each can be factored into primes. For the inductive step, let us suppose all positive integers 2 ≤ k ≤ n can be factored into primes. Then either n + 1 is a prime number, in which case it is its own factorization into primes, or it is composite. If n + 1 is composite, it is the product n + 1 = n₁n₂ of two integers with 2 ≤ n₁, n₂ ≤ n. By the inductive hypothesis, we know n₁ and n₂ can be factored into primes:

n₁ = p₁p₂ ⋯ pᵣ     n₂ = q₁q₂ ⋯ qₛ

Thus,

n₁n₂ = p₁p₂ ⋯ pᵣ q₁q₂ ⋯ qₛ

and n + 1 can be factored into a product of primes.
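The inductive step has a direct computational analogue: a recursive routine factors n by splitting off a divisor and recursing on the smaller factors, just as the proof invokes the inductive hypothesis on n₁ and n₂. A minimal Python sketch:

def prime_factors(n):
    # Mirrors the inductive step: if n has no divisor d with 2 <= d <= sqrt(n),
    # it is prime and is its own factorization; otherwise split n = d * (n // d)
    # and recurse on the smaller factors.
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return prime_factors(d) + prime_factors(n // d)
    return [n]

print(prime_factors(360))  # [2, 2, 2, 3, 3, 5]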
Mathematica demonstration: Proof by Induction
http://demonstrations.wolfram.com/ProofByInduction/
Reference:
Daniel Solow. How to Read and Do Proofs. John Wiley & Sons, Inc., New York,
fourth edition, 2005. ISBN 0-471-68058-3.
William J. Turner. A Brief Introduction to Proofs. 2010. http://persweb.wabash.edu/facstaff/turnerw/Writing/proofs.pdf
2 Set Theory
2.1 Sets and Subsets
A set is a collection of distinct objects, considered as an object in its own right. We express the notion of membership by ∈, so that x ∈ A means "x is an element of the set A" and y ∉ A means "y is not an element of the set A."
B is a subset of A, written B ⊆ A, if every x ∈ B satisfies x ∈ A. If B ⊆ A and there exists an element in A which is not in B, we say that B is a proper subset of A and write B ⊂ A.
Two sets are equal if they contain the same elements: we write A = B if A ⊆ B and B ⊆ A, and A ≠ B otherwise.
We usually define a set explicitly by saying "A is the set of all elements x such that P(x)", written as

A = {x : P(x)}

where P(x) denotes a meaningful property for each x. Sometimes we also write

A = {x ∈ S : P(x)}

for a set S to whose elements the property P applies in a given case.
Example:

R₊ = {x ∈ R : x ≥ 0}

defines the set of all non-negative real numbers.
The power set of A, denoted P(A), is the set of all subsets of A. A collection is a subset of P(A), i.e. a set of sets. A family is a set of collections. We have the set / collection / family hierarchy in place for cases in which we need to distinguish between several levels of set composition in the same context. A class is a collection of sets that can be unambiguously defined by a property that all its members share.
2.1.1 Sets of Numbers
The following are some of the most important sets we will encounter:
The set of natural numbers N = {1, 2, 3, …}
The set of integers Z = {…, −2, −1, 0, 1, 2, …} and the set of non-negative integers Z₊ = {0, 1, 2, …}
The set of rational numbers (or quotients) Q = {m/n : m, n ∈ Z, n ≠ 0}
The set of real numbers R, which can be constructed as a completion of Q in such a way that a sequence defined by a decimal or binary expansion, e.g. {3, 3.1, 3.14, 3.141, 3.1415, …}, converges to a unique real number. R = Q ∪ I where I is the set of irrational numbers. Hence, Q ⊂ R. (There are several different ways to define the set of real numbers; if interested in the details, see the reference literature for this section.)
2.1.2 Set Operations
For sets A and B we define:
1. A ∩ B, the intersection of A and B, by A ∩ B = {x : [x ∈ A] ∧ [x ∈ B]}
2. A ∪ B, the union of A and B, by A ∪ B = {x : [x ∈ A] ∨ [x ∈ B]}
3. A ∖ B, the difference between A and B, by A ∖ B = {x ∈ A : x ∉ B}
4. A △ B, the symmetric difference between A and B, by A △ B = (A ∖ B) ∪ (B ∖ A)
5. Aᶜ, the complement of A, by Aᶜ = {x ∈ S : x ∉ A}
6. ∅, the empty set, by ∅ = Sᶜ
7. A and B to be disjoint if A ∩ B = ∅.
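For finite sets these operations have direct counterparts in Python, which can be a convenient way to build intuition (a small illustration, with S playing the role of the universal set):

S = set(range(10))            # universal set for complements
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print(A & B)                  # intersection: {3, 4}
print(A | B)                  # union: {1, 2, 3, 4, 5, 6}
print(A - B)                  # difference A \ B: {1, 2}
print(A ^ B)                  # symmetric difference: {1, 2, 5, 6}
print(S - A)                  # complement of A relative to S
print(A.isdisjoint({7, 8}))   # True: the intersection is empty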
2.1.3 Ordered Pair and Cartesian Product
An ordered pair is an ordered list (a, b) consisting of two objects a and b, whereby for any two ordered pairs (a, b) and (a′, b′) it holds that (a, b) = (a′, b′) if and only if a = a′ and b = b′. If A and B are nonempty sets, then the Cartesian product, denoted A × B, is the set of all ordered pairs {(a, b) : a ∈ A and b ∈ B}. While in the set {a, b} there is no preference given to a over b in terms of order, i.e. {a, b} = {b, a}, the order structure allows us to distinguish between, e.g., the first and second elements.
Mathematica demonstration: Graph Products
http://demonstrations.wolfram.com/GraphProducts/
2.1.4 Relation
In this section we assume that X and Y are non-empty sets. A subset R of X × Y is called a relation from X to Y. If X = Y, then we say that R is a relation on X (i.e. R ⊆ X × X). If (x, y) ∈ R, then we think of R associating x with y, and express it as xRy.
A relation R on X is:
reflexive if xRx for each x ∈ X;
complete if either xRy or yRx holds for each x, y ∈ X;
symmetric if, for any x, y ∈ X, xRy implies yRx;
antisymmetric if, for any x, y ∈ X, xRy and yRx imply x = y;
transitive if xRy and yRz imply xRz for any x, y, z ∈ X.
Let R be a reflexive relation on X. The asymmetric part of R is defined by the relation P_R on X as xP_R y if and only if xRy but not yRx. The relation I_R ≡ R ∖ P_R on X is called the symmetric part of R.
2.1.5 Equivalence Relation
A relation ∼ on X is called an equivalence relation if it is reflexive, symmetric and transitive. For any x ∈ X, the equivalence class of x relative to ∼ is defined as the set

[x]∼ ≡ {y ∈ X : y ∼ x}

A quotient set of X relative to ∼, denoted X/∼, is the class of all equivalence classes relative to ∼, that is

X/∼ ≡ {[x]∼ : x ∈ X}
One typically uses an equivalence relation to simplify a situation in a way that all things that are indistinguishable from a particular perspective are put together in a set and treated as a single entity. Example: Let X be the set of all the people in the world. "Being a sibling of" is an equivalence relation on X (under the convention that any person is also a sibling of oneself). Furthermore, "having the same star sign" is an equivalence relation on X, and the set of all Capricorns (or of any other particular sign) is an equivalence class on X.
2.1.6 Partition
An equivalence relation can be used to decompose a set into subsets such that the members of each subset share the same equivalence property while members of different subsets are "distinct". A partition of X is a class of pairwise disjoint subsets of X whose union is X.
Theorem 10 (Partition Theorem). For any equivalence relation ∼ on X, the quotient set X/∼ is a partition of X.
2.1.7 Order Relations
A relation ≿ on X is called a preorder on X if it is transitive and reflexive. A relation ≿ on X is called a partial order on X if it is an antisymmetric preorder on X.
Example 1:
In individual choice theory, a preference relation ≿ on a set X of choice alternatives is defined as a preorder on X. Reflexivity is assumed and transitivity follows from agent rationality. The strict preference relation ≻ is defined as the asymmetric part of ≿ (it is transitive but not reflexive). The indifference relation ∼ is defined as the symmetric part of ≿ (it is an equivalence relation on X). For any x ∈ X, the equivalence class [x]∼ is called the indifference class of x. This is a generalization of an indifference curve that passes through x. An implication of the Partition Theorem is that no two distinct indifference sets can have a point in common. This is a generalization of the result that two indifference curves cannot cross.
Example 2:
In social choice theory, we often work with multiple preference relations on a given choice alternative set X. Suppose there are n individuals in the population, and let ≿ᵢ denote the preference relation of the i-th individual. The Pareto dominance relation ≿ on X is then defined as x ≿ y iff x ≿ᵢ y for each i = 1, …, n. Here ≿ is a preorder on X in general, and a partial order on X if each ≿ᵢ is antisymmetric.
2.2 Functions
Let X and Y be non-empty sets. A function f that maps X into Y, denoted f : X → Y, is a relation f ⊆ X × Y such that:
(1) ∀x ∈ X, ∃y ∈ Y s.t. xfy;
(2) ∀y, z ∈ Y with xfy and xfz, we have y = z.
Such a function is also often called a map. This is a set-theoretic formulation of the concept of a function used in calculus.
The set X is called the domain of f. The set Y is called the co-domain of f. The range of f is defined as

f(X) = {y ∈ Y : xfy for some x ∈ X}

Example: for a function f : R → R defined by f(x) = x², the co-domain of f is R but the range is R₊, i.e. the interval [0, ∞), since f does not map to any negative number.
The set of all functions that map X into Y is denoted by Y^X. For example, R^[0,1] is the set of all real-valued functions on [0, 1]. A function whose domain and co-domain are identical, i.e. f ∈ X^X, is called a self-map on X. The notation y = f(x) refers to y as the image (or value) of x under f. An image of a set A ⊆ X under f ∈ Y^X, denoted f(A), is defined as the collection of all elements y in Y with y = f(x) for some x ∈ A, that is

f(A) ≡ {f(x) : x ∈ A}

The range of f is thus the image of the domain.

The inverse image of a set B ⊆ Y under f, denoted f⁻¹(B), is defined as the set of all x in X whose images under f belong to B, that is

f⁻¹(B) ≡ {x ∈ X : f(x) ∈ B}

Note that f⁻¹ may or may not satisfy the definition of a function. If f⁻¹ is a function, we say that f is invertible and f⁻¹ is the inverse of f. Example: f(x) = x² is not invertible since y = 1 does not have a unique image under f⁻¹ (both x = 1 and x = −1 map to 1).
If f(X) = Y, that is, the range of f equals its co-domain, then we say that f maps X onto Y, and refer to it as a surjection (or as a surjective function/map). If f maps distinct points in its domain to distinct points in its co-domain, that is, if x ≠ y implies f(x) ≠ f(y) for all x, y ∈ X, then we say that f is an injection (or a one-to-one, or an injective function/map). Finally, if f is both injective and surjective, then it is called a bijection (or a bijective function/map).
Let X, Z, and Y be non-empty sets. Let f : X → Z and g : Z → Y be functions. We define the composition function g ∘ f : X → Y by (g ∘ f)(x) = g(f(x)) for every x ∈ X. Example: suppose that the functions f : R → R and g : R → R are defined by f(x) = x² and g(x) = x − 1. Then

(g ∘ f)(x) = x² − 1     (f ∘ g)(x) = (x − 1)²

Note that (g ∘ f)(x) = (f ∘ g)(x) does not hold in general; in this example it holds only if x = 1.
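The same example in Python, illustrating that composition is generally not commutative (a small sketch):

f = lambda x: x ** 2         # f(x) = x^2
g = lambda x: x - 1          # g(x) = x - 1

g_of_f = lambda x: g(f(x))   # (g o f)(x) = x^2 - 1
f_of_g = lambda x: f(g(x))   # (f o g)(x) = (x - 1)^2

print(g_of_f(3), f_of_g(3))  # 8 4 -> the two compositions differ
print(g_of_f(1), f_of_g(1))  # 0 0 -> they agree only at x = 1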
Sequences
A sequence in a set X is a function f : N ! X: We usually represent this function as
(x1 ; x2 ; : : :); denoted by (xm ); where xi
f (i) for each i 2 N: The set of all sequences in
X is X N ; commonly denoted as X 1 : A subsequence (xmk ) of a sequence (xm ) 2 X 1 is
a sequence composed of terms of (xm ) that appear in (xmk ) in the same order as in (xm ):
For example, (xmk ) (1; 31 ; 51 ; : : :) is a subsequence of (xm ) (1; 21 ; 13 ; : : :):
Reference:
Ok, E. (2007): Chapter A, Sections 1.1 - 1.5
3 Metric Spaces
The defining characteristic of a set is embodied in the concept of membership. Every set partitions the population of all existing things into two groups: those things that are members (i.e. elements) of the set and those things that are not. Sets do not intrinsically involve any concept of distance between (i.e. nearness or farness of) their elements. We can define such a concept using further mathematical structure and call it a metric. In this context, a set that is endowed with a metric (or distance) function is often referred to as a space, and each of its elements is referred to as a point in that space.
3.1 Metric
Let X be a non-empty set. A metric on X is a function d : X × X → R₊ with the following properties:
(i) d(x, y) = 0 iff x = y;
(ii) d(x, y) = d(y, x) for all x, y ∈ X (Symmetry);
(iii) d(x, y) + d(y, z) ≥ d(x, z) for all x, y, z ∈ X (Triangle Inequality).
If X is a set and d is a metric on X, then (X, d) is a metric space. When the existence of d is apparent from the context, we denote a metric space simply by X.
3.2 Examples of the Metric Function

3.2.1 Discrete Spaces
For an arbitrary set X, with x, y ∈ X, let

d(x, y) = 0 if x = y, and d(x, y) = 1 if x ≠ y

This metric is called a discrete metric.
3.2.2 Spaces of Sequences
Let ℓ_p denote the set of all real-valued infinite sequences (xₘ) such that ∑ᵢ |xᵢ|^p < ∞, for 1 ≤ p < ∞. Then, for (xₘ), (yₘ) ∈ ℓ_p,

d((xₘ), (yₘ)) = ( ∑_{i=1}^∞ |xᵢ − yᵢ|^p )^{1/p}

is a metric on ℓ_p.
Let ℓ_∞ denote the set of all bounded real sequences, with (xₘ), (yₘ) ∈ ℓ_∞, such that sup{|xₘ| : m ∈ N} < ∞. Then

d((xₘ), (yₘ)) = sup{|xₘ − yₘ| : m ∈ N}

is a metric on ℓ_∞.
3.2.3 Spaces of Functions
Let T be a non-empty set and let B(T) denote the set of all bounded real-valued functions defined on T, that is

B(T) ≡ {f : T → R s.t. sup{|f(x)| : x ∈ T} < ∞}

Let f, g ∈ B(T). Then

d_∞(f, g) = sup{|f(x) − g(x)| : x ∈ T}

is a metric on B(T), called the sup-metric.
Let C[a, b] denote the set of all continuous real-valued functions defined over [a, b] ⊆ R. Then

d(f, g) = sup{|f(x) − g(x)| : x ∈ [a, b]}

is a metric on C[a, b], and

d_p(f, g) = ( ∫_a^b |f(x) − g(x)|^p dx )^{1/p}

is also a metric on C[a, b].
3.2.4 Euclidean Space

Let Rⁿ be the Euclidean space and let x, y ∈ Rⁿ. Then
d₁(x, y) = ∑_{i=1}^n |xᵢ − yᵢ|

is a metric on Rⁿ (often called the Manhattan metric, or the taxicab metric) and

d₂(x, y) = ( ∑_{i=1}^n (xᵢ − yᵢ)² )^{1/2}

is also a metric on Rⁿ.
3.2.5 Message Spaces
Assume that we want to send messages in a language of N symbols (letters, numbers, punctuation marks, space, etc.). We assume that all messages have the same length K (if they are too short or too long, we either fill them out or break them into pieces). We let X be the set of all messages, i.e. all sequences of symbols from the language of length K.
If x = (x₁, x₂, …, x_K) and y = (y₁, y₂, …, y_K) are two messages, we define

d(x, y) = the number of indices n such that xₙ ≠ yₙ

Then d is a metric on the given message space. It is usually referred to as the Hamming metric, and is used in coding theory where it serves as a measure of how much a message gets distorted during transmission.
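The metrics in this section are straightforward to compute. A short Python sketch of the discrete, Manhattan, Euclidean, and Hamming metrics (illustration only):

import math

def d_discrete(x, y):
    return 0 if x == y else 1

def d_manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def d_euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_hamming(x, y):
    # number of indices at which the two messages differ
    return sum(a != b for a, b in zip(x, y))

x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(d_manhattan(x, y), d_euclidean(x, y))  # 5.0 and ~3.6056
print(d_hamming("economy", "ecology"))       # 2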
3.2.6 Further Examples
Mathematica demonstration 1: Metric Examples
http://demonstrations.wolfram.com/DistanceFunctions/
Mathematica demonstration 2: Radial Metric
http://demonstrations.wolfram.com/RadialMetric/
3.3 Compactness

3.3.1 Open and Closed Sets
Let X be a metric space. For any x ∈ X and ε > 0, define the ε-neighborhood of x in X as the set

N_{ε,X}(x) ≡ {y ∈ X : d(x, y) < ε}

A neighborhood of x in X is any subset of X that contains at least one ε-neighborhood of x in X. A subset S of X is said to be open in X (or an open subset of X) if, for each x ∈ S, there exists an ε > 0 such that N_{ε,X}(x) ⊆ S. A subset S of X is said to be closed in X (or a closed subset of X) if X ∖ S is open in X. Note that changing a metric on a given set X would in general yield different classes of open (and hence closed) sets.
3.3.2 Interior, Closure, Boundary

Let X be a metric space and S ⊆ X. The largest open set in X that is contained in S is called the interior of S (relative to X) and is denoted int_X(S). The smallest closed set in X that contains S is called the closure of S (relative to X) and is denoted cl_X(S). The boundary of S (relative to X) is defined as

bd_X(S) ≡ cl_X(S) ∖ int_X(S)

3.3.3 Convergent Sequences
The property of closedness and openness of a set in a metric space can also be characterized by means of sequences in that space. Let X be a metric space, x ∈ X, and let (xₘ) ∈ X^∞ be a sequence in X. We say that (xₘ) converges to x if, for each ε > 0, there exists an M such that d(xₘ, x) < ε for all m ≥ M. We refer to x as the limit of (xₘ). Equivalent notation: d(xₘ, x) → 0, or xₘ → x, or lim xₘ = x. A sequence can converge to at most one limit.
3.3.4 Sequential Characterization of Closed Sets
Proposition (C.1). A set S in a metric space X is closed iff every sequence in S that converges in X converges to a point in S.
Note that closedness of S depends on X. For example, S = (0, 1) ⊆ R is closed in X = (0, 1) ⊆ R but is not closed in Y = [0, 1] ⊆ R.

3.3.5 Bounded and Connected Sets
A subset S of a metric space X is called bounded (in X) if there exists an ε > 0 such that S ⊆ N_{ε,X}(s) for some s ∈ S. If S is not bounded, then it is said to be unbounded. Let X be a metric space. We say that S ⊆ X is connected if there do not exist two nonempty and disjoint open subsets O and U of S such that O ∪ U = S.
3.3.6 Compact Sets

Let X be a metric space and S ⊆ X. A class 𝒪 of subsets of X is said to cover S if S ⊆ ∪𝒪. If all members of such a class 𝒪 are open in X, then we say that 𝒪 is an open cover of S. A metric space X is said to be compact if every open cover of X has a finite subset that also covers X. A subset S of X is said to be compact in X (or a compact subset of X) if every open cover of S has a finite subset that also covers S.
Example 1:
For more intuition, consider an example of a space that is not compact: the open interval (0, 1) ⊆ R. Let 𝒪 be the collection

𝒪 ≡ { (1/i, 1) : i = 1, 2, … }

and observe that

(0, 1) = (1/2, 1) ∪ (1/3, 1) ∪ ⋯

that is, 𝒪 is an open cover of (0, 1). However, 𝒪 does not have a finite subset that covers (0, 1) because the greatest lower bound of any finite subset of 𝒪 is bounded away from 0. We then conclude that (0, 1) is not a compact metric space.
Example 2:
Now consider the closed interval [0, 1] ⊆ R and prove its compactness by contradiction. Assume there exists an open cover 𝒪 of [0, 1] no finite subset of which covers [0, 1].
Then either [0, 1/2] or [1/2, 1] is not covered by any finite subset of 𝒪; pick the interval with this property and call it [a₁, b₁].
Then, either [a₁, (a₁ + b₁)/2] or [(a₁ + b₁)/2, b₁] is not covered by any finite subset of 𝒪; pick the interval with this property and call it [a₂, b₂].
Continuing in this way inductively, we obtain two sequences (aₘ) and (bₘ) in [0, 1] such that, for each m = 1, 2, …,
(i) aₘ ≤ aₘ₊₁ < bₘ₊₁ ≤ bₘ;
(ii) bₘ − aₘ = 1/2ᵐ;
(iii) [aₘ, bₘ] is not covered by any finite subset of 𝒪.
Using (i) and (ii), there exists c ∈ R such that {c} = ∩_{i=1}^∞ [aᵢ, bᵢ], for which c = lim aₘ = lim bₘ. Take any O ∈ 𝒪 with c ∈ O. Then [aₘ, bₘ] ⊆ O for m large enough. This contradicts property (iii) and we conclude that [0, 1] is compact.
3.3.7 The Heine-Borel Theorem

The example above generalizes to any Euclidean space in the following Theorem:

Theorem 11 (The Heine-Borel Theorem). For any −∞ < a < b < ∞, the n-dimensional cube [a, b]ⁿ is compact.
The proof of the Theorem follows the logic of the previous example, using n-dimensional
cubes instead of intervals.
Proposition (C.4). Any closed subset of a compact metric space X is compact.
To prove the Proposition, let S be a closed subset of X. If 𝒪 is an open cover of S (with sets open in S), then 𝒪 ∪ {X∖S} is an open cover of X. Since X is compact, there exists a finite subset of 𝒪 ∪ {X∖S}, say 𝒪′, that covers X. Then 𝒪′ ∖ {X∖S} is a finite subset of 𝒪 that covers S. QED.
An implication of the Heine-Borel Theorem and Proposition C.4 is that any n-dimensional prism [a₁, b₁] × ⋯ × [aₙ, bₙ] is a compact subset of Rⁿ.
3.4 Completeness

3.4.1 Cauchy Sequences
Intuitively speaking, by a Cauchy sequence we mean a sequence the terms of which eventually get arbitrarily close to one another. Formally, a sequence (xₘ) in a metric space X is called a Cauchy sequence if, for any ε > 0, there exists an M ∈ N such that d(x_k, x_l) < ε for all k, l ≥ M. For instance, (1, 1/2, 1/3, 1/4, …) is a Cauchy sequence in R. As another example, (−1, 1, −1, 1, …) is not a Cauchy sequence in R. Note that for any sequence (xₘ) in a metric space X, the condition that consecutive terms of the sequence get closer and closer, that is, d(xₘ, xₘ₊₁) → 0, is a necessary but not sufficient condition for (xₘ) to be Cauchy. For example, (ln(1), ln(2), ln(3), …) is a divergent real sequence (not Cauchy) but ln(m + 1) − ln(m) = ln(1 + 1/m) → 0.
Note that Cauchy sequences are bounded. Moreover, any convergent sequence (xₘ) in X is Cauchy. On the other hand, a Cauchy sequence need not be convergent. For example, consider (1, 1/2, 1/3, 1/4, …) as a sequence in the metric space (0, 1] (not in R). This sequence does not converge in (0, 1] because its limit 0 does not belong to the space. Nonetheless, the sequence converges in [0, 1] (or in R). Yet if we know that (xₘ) is Cauchy, and that it has a convergent subsequence, say (x_{m_k}), then we can conclude that (xₘ) converges. We summarize these properties of Cauchy sequences in the following Proposition.
Proposition 1. Let (xₘ) be a sequence in a metric space X.
(a) If (xₘ) is convergent, then it is Cauchy.
(b) If (xₘ) is Cauchy, then {x₁, x₂, …} is bounded, but (xₘ) need not converge in X.
(c) If (xₘ) is Cauchy and has a subsequence that converges in X, then it converges in X as well.
3.4.2 Complete Metric Spaces
Suppose that we are given a sequence (xₘ) in some metric space, and we need to check if this sequence converges. Doing this directly requires us to "guess" a candidate limit x for the sequence, and then to show that we actually have xₘ → x (or that this never holds for any choice of x), which may not be feasible or efficient. A better alternative is to check whether or not the sequence at hand is Cauchy. If it is not Cauchy, then it cannot be convergent by the Proposition above. If it is Cauchy, and we know that in the given metric space all Cauchy sequences converge, then we have our result. In such a space a sequence is convergent iff it is Cauchy, and hence convergence can always be tested by using the "Cauchyness" condition. This property is called completeness. A metric space X is said to be complete if every Cauchy sequence in X converges to a point in X.
We have seen that (0, 1] ⊆ R is not complete. Another example of an incomplete metric space is the set of rational numbers Q ⊆ R. Indeed, for any x ∈ R∖Q we can find an (xₘ) ∈ Q^∞ with lim xₘ = x. Then, (xₘ) is Cauchy, but it does not converge in Q. Metric spaces that are complete include R, Rⁿ, ℓ_p, and B(T).
There is a tight connection between the closedness of a set and its completeness as a
metric subspace.
Proposition 2. Let X be a metric space, and Y a metric subspace of X. If Y is complete,
then it is closed in X. Conversely, if Y is closed in X and X is complete, then Y is
complete.
This is a useful observation that lets us obtain other complete metric spaces from the ones we already know to be complete. One example is the space C[0, 1], which is a metric subspace of B[0, 1]. Since B[0, 1] is complete and C[0, 1] is closed in B[0, 1], C[0, 1] is also complete.
Reference:
Ok, Chapter C, Sections 1, 3, 5
4 Analysis in Metric Spaces
The theory behind many applications in economics is developed in general metric spaces.
Examples include:
Existence of competitive equilibrium in general equilibrium theory
Existence of a solution of a dynamic optimization problem
Existence of a Nash equilibrium in game theory
4.1 Continuity
Let (X, d_X) and (Y, d_Y) be two metric spaces. The map f : X → Y is continuous at x ∈ X if, for any ε > 0, there exists a δ > 0 s.t.

d_X(x, y) < δ implies d_Y(f(x), f(y)) < ε

The map f is continuous if it is continuous at every x ∈ X.
Theorem 12. Let X and Y be two metric spaces, and f : X → Y a continuous function. If S is a compact subset of X, then f(S) is a compact subset of Y.
The following result is foundational for optimization theory.
Theorem 13 (Weierstrass Extreme Value Theorem). If X is a compact metric space and φ ∈ R^X is a continuous function, then there exist x, y ∈ X with φ(x) = sup φ(X) and φ(y) = inf φ(X).
A metric space X is connected if there do not exist two non-empty and disjoint open subsets O and U such that O ∪ U = X.
Example: any interval I ⊆ R is connected in R.
Theorem 14 (Intermediate Value Theorem). Let X be a connected metric space and φ : X → R a continuous function. If φ(x) ≤ α ≤ φ(y) for some x, y ∈ X, then there exists z ∈ X such that φ(z) = α.
4.2 Fixed Points
Definition 15 (Fixed point). Let (X, d) be a metric space and f : X → X be a function (or a self-map). We call s ∈ X a fixed point of the function f if

s = f(s)
Let X be a metric space. A self-map Φ on X is said to be a contraction (or a contractive self-map) if there exists a real number 0 < K < 1 such that

d(Φ(x), Φ(y)) ≤ K d(x, y)  ∀x, y ∈ X

The infimum of the set of all such K is called the contraction coefficient of Φ.
Example: let f ∈ R^R with f(t) = α + θt. Then f is a contraction iff |θ| < 1.

4.2.1 Blackwell's Contraction Lemma
The contraction property for self-maps in metric spaces can be checked using the following
Lemma.
Lemma 16 (Blackwell's Contraction). Let T be a non-empty set, and X ⊆ B(T) a set that is closed under addition of positive constant functions, that is, f ∈ X implies f + α ∈ X for any α > 0. Assume that Φ is an increasing self-map on X, that is, Φ(f) ≤ Φ(g) for any f, g ∈ X with f ≤ g. If there exists 0 < δ < 1 such that

Φ(f + α) ≤ Φ(f) + δα  ∀(f, α) ∈ X × R₊

then Φ is a contraction.
Example: Blackwell's Contraction Lemma is typically used to show the existence of a solution to the Bellman Equation, a mapping on a space of functions in dynamic programming, which we will encounter later in the course.
4.2.2 Banach Fixed Point Theorem
Theorem 17 (The Banach Fixed Point Theorem). Let X be a complete metric space. If Φ ∈ X^X is a contraction, then there exists a unique x* ∈ X such that Φ(x*) = x*.
By the Banach Fixed Point Theorem, the Bellman Equation in the Example above has a unique solution. Note that the exact definition of the domain and co-domain of Φ matters: a metric space is complete if and only if every contraction on it has a fixed point.
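The proof of the Banach theorem is constructive: starting from any x₀, the iterates xₘ₊₁ = Φ(xₘ) form a Cauchy sequence converging to the unique fixed point. A minimal Python sketch of this successive-approximation scheme, using Φ(x) = cos(x), which is a contraction on [−1, 1] since |Φ′| ≤ sin(1) < 1 there:

import math

def banach_iterate(phi, x0, tol=1e-12, max_iter=10_000):
    # Iterate x <- phi(x) until the step size falls below tol.
    x = x0
    for _ in range(max_iter):
        x_next = phi(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("no convergence within max_iter")

x_star = banach_iterate(math.cos, x0=0.5)
print(x_star, math.cos(x_star))  # both ~0.739085, the unique fixed point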
4.2.3 Brouwer Fixed Point Theorem
Existence of a fixed point can also be shown in settings where we cannot prove the contractive property of a self-map. We can guarantee a fixed point of a continuous self-map, as long as the domain of this map is sufficiently well-behaved.
Theorem 18 (The Brouwer Fixed Point Theorem). For any given n ∈ N, let S be a non-empty, closed, bounded, and convex subset of Rⁿ. If Φ is a continuous self-map on S, then there exists s ∈ S such that Φ(s) = s.
Example: an important result due to Nash (1950) says that every finite strategic form game has a mixed strategy equilibrium. The proof is based on Kakutani's generalization of the Brouwer Fixed Point Theorem.
4.3 Convergence

4.3.1 Uniform Convergence
Let T be a metric space. Let B(T) denote the set of all bounded real-valued functions defined on T, endowed with the sup-metric d_∞. Denote by CB(T) the space of all continuous and bounded real-valued functions defined on T, whereby CB(T) ⊆ B(T). Recall that C(T) denotes the space of all continuous real-valued functions defined on T.
Let (φₘ) be a sequence in B(T) and let

lim_{m→∞} sup{|φₘ(x) − φ(x)| : x ∈ T} = 0

that is, φₘ converges to some φ with respect to d_∞. We refer to φ as the uniform limit of (φₘ), and say that (φₘ) converges to φ uniformly.
Uniform convergence is a global notion of convergence. For sequences of functions it involves convergence over the entire domain of the functions, not only at a particular point x. The following theorem says that uniform convergence preserves continuity in the limit.

Theorem 19. If φₘ ∈ C(T) and φₘ → φ uniformly, then φ ∈ C(T).
Furthermore, uniform convergence allows us to interchange the order of taking limits. Let (x_k) be a sequence in T such that x_k → x, and let (φₘ) be a sequence in C(T) such that φₘ → φ uniformly. Then,

lim_{k→∞} lim_{m→∞} φₘ(x_k) = lim_{k→∞} φ(x_k) = φ(x) = lim_{m→∞} φₘ(x) = lim_{m→∞} lim_{k→∞} φₘ(x_k)
4.3.2 Pointwise Convergence

In contrast, pointwise convergence requires only that

lim_{m→∞} |φₘ(x) − φ(x)| = 0 for each x ∈ T.

Here we say that φₘ → φ pointwise. This mode of convergence is local, at a particular point x.
Uniform convergence implies pointwise convergence, while the converse is not true. As an example, consider the sequence (fₘ) in C[0, 1] defined by fₘ(t) = tᵐ. Then (fₘ) converges to 1_{1} pointwise but not uniformly, where 1_{1} is the indicator function of 1 in [0, 1]: if d_∞(fₘ, 1_{1}) → 0 were the case, then 1_{1} would have to be continuous, which is not the case.
Pointwise convergence does not necessarily allow us to interchange the order of taking limits. As an example, let (t_k) = (0, 1/2, …, 1 − 1/k, …). Then

lim_{k→∞} lim_{m→∞} fₘ(t_k) = 0 ≠ 1 = lim_{m→∞} lim_{k→∞} fₘ(t_k)
Mathematica demonstration: convergence that is pointwise but not uniform
http://demonstrations.wolfram.com/ANoncontinuousLimitOfASequenceOfContinuousFunctions/
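The failure of uniform convergence for fₘ(t) = tᵐ can also be seen numerically: at each fixed t < 1 the values tᵐ vanish as m grows, yet at the drifting points t_m = 0.5^{1/m} we have fₘ(t_m) = 0.5 for every m, so the sup-distance to the pointwise limit never falls below 0.5. A small Python illustration:

# f_m(t) = t^m on [0, 1]: pointwise limit is the indicator of {1}.
for m in (1, 10, 100, 1000):
    t_fixed = 0.9
    t_m = 0.5 ** (1 / m)              # a point drifting toward 1 with m
    print(m, t_fixed ** m, t_m ** m)  # first value -> 0, second stays 0.5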
Reference:
Ok, Chapter C, Section 6.1
Aliprantis, C. D. and K. C. Border (2006)
5 Vector Spaces
We will now endow metric spaces with further structure consisting of additional properties
and algebraic operations.
Definition 20 (Vector space). A vector space (also called linear space) over R is a set V of arbitrary elements (called vectors) on which two binary operations are defined:
(i) vector addition: if u, v ∈ V, then u + v ∈ V
(ii) scalar multiplication: if a ∈ R and v ∈ V, then av ∈ V
which satisfy the following axioms:
C1: u + v = v + u, ∀u, v ∈ V
C2: (u + v) + w = u + (v + w), ∀u, v, w ∈ V
C3: ∃0 ∈ V such that v + 0 = v = 0 + v, ∀v ∈ V
C4: ∀v ∈ V, ∃(−v) ∈ V such that v + (−v) = 0 = (−v) + v
C5: 1v = v, ∀v ∈ V
C6: a(bv) = (ab)v, ∀a, b ∈ R and ∀v ∈ V
C7: a(u + v) = au + av, ∀a ∈ R and ∀u, v ∈ V
C8: (a + b)v = av + bv, ∀a, b ∈ R and ∀v ∈ V
5.1 Normed Space
The notion of distance in a vector space is introduced through a function called a norm.
Definition 21 (Norm). If V is a vector space, then a norm on V is a function from V to R, denoted v ↦ ‖v‖, which satisfies the following properties ∀u, v ∈ V and ∀a ∈ R:
(i) ‖v‖ ≥ 0
(ii) ‖v‖ = 0 iff v = 0
(iii) ‖av‖ = |a| ‖v‖
(iv) ‖u + v‖ ≤ ‖u‖ + ‖v‖
Definition 22 (Normed vector space). A vector space in which a norm has been defined is called a normed vector space.
Note that the algebraic operations of vector addition and scalar multiplication are used in defining a norm. Thus, a norm cannot be defined in a general metric space where these operations may not be defined. In the next theorem, we establish formally the relationship between a metric space and a normed vector space.
Theorem 23. Let V be a vector space. Then
(i) If (V, d) is a metric space, then (V, ‖·‖) is a normed vector space with the norm ‖·‖ : V → R defined as ‖x‖ = d(x, 0), ∀x ∈ V.
(ii) If (V, ‖·‖) is a normed vector space, then (V, ρ) is a metric space with the metric ρ : V × V → R defined as ρ(x, y) = ‖x − y‖, ∀x, y ∈ V.
Recall that any metric space having the property that all Cauchy sequences are also
convergent sequences is said to be a complete metric space (i.e. limits are preserved within
such spaces).
Definition 24 (Banach space). A complete normed vector space is called a Banach space.
Examples of Banach spaces include C[a, b], the set of all real-valued functions continuous on the closed interval [a, b], and Rⁿ, the Euclidean space. Note that in a vector space of functions, such as C[a, b], each "vector" (i.e. element of the space) is a function.
We will now introduce the concept of a subspace.
Definition 25. Suppose that V is a vector space over R, and that W is a subset of V. Then we say that W is a subspace of V if W forms a vector space over R under the vector addition and scalar multiplication defined in V.
Example: Consider the vector space R² of all points (x, y), where x, y ∈ R. Let L be a line through the origin 0 = (0, 0). Suppose that L is represented by the equation αx + βy = 0; in other words,

L = {(x, y) ∈ R² : αx + βy = 0}

Note that if (x, y), (u, v) ∈ L, then αx + βy = 0 and αu + βv = 0, so that α(x + u) + β(y + v) = 0, whence (x, y) + (u, v) = (x + u, y + v) ∈ L. Similarly, if (x, y) ∈ L and c ∈ R, then α(cx) + β(cy) = c(αx + βy) = 0, so c(x, y) ∈ L. Thus, every line in R² through the origin is a vector space over R.
5.2 Linear Combination
Our ultimate goal in this Section is to be able to determine a subset B of vectors in V and describe every element of V in terms of elements of B in a unique way. The first step in this direction is summarized below.
Definition 26 (Linear combination). Suppose that v₁, …, vᵣ are vectors in a vector space V over R. By a linear combination of the vectors v₁, …, vᵣ we mean an expression of the type

c₁v₁ + … + cᵣvᵣ

where c₁, …, cᵣ ∈ R.
Example: In R², every vector (x, y) is a linear combination of the two vectors i = (1, 0) and j = (0, 1), as any (x, y) = xi + yj.
Example: In R⁴, the vector (2, 6, 0, 9) is not a linear combination of the two vectors (1, 2, 0, 4) and (1, 1, 1, 3).
Mathematica demonstration: Linear combinations of curves in complex plane
http://demonstrations.wolfram.com/LinearCombinationsOfCurvesInComplexPlane/
Let us investigate the collection of all vectors in a vector space that can be represented
as linear combinations of a given set of vectors in V .
Definition 27 (Span). Suppose that v₁, …, vᵣ are vectors in a vector space V over R. The set

span{v₁, …, vᵣ} = {c₁v₁ + … + cᵣvᵣ : c₁, …, cᵣ ∈ R}

is called the span of the vectors v₁, …, vᵣ.
Example: The three vectors i = (1, 0, 0), j = (0, 1, 0), and k = (0, 0, 1) span R³.
Example: The two vectors (1, 2, 0, 4) and (1, 1, 1, 3) do not span R⁴.
5.3 Linear Independence
Definition 28. Suppose that v₁, …, vᵣ are vectors in a vector space V over R.
(LD) We say that v₁, …, vᵣ are linearly dependent if there exist c₁, …, cᵣ ∈ R, not all zero, such that c₁v₁ + … + cᵣvᵣ = 0.
(LI) We say that v₁, …, vᵣ are linearly independent if they are not linearly dependent; in other words, if the only solution of c₁v₁ + … + cᵣvᵣ = 0 in c₁, …, cᵣ ∈ R is given by c₁ = … = cᵣ = 0.
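For finitely many vectors in Rⁿ, linear independence can be checked numerically: stack the vectors as rows of a matrix and compare its rank to the number of vectors. A sketch using numpy:

import numpy as np

def linearly_independent(vectors):
    # Vectors in R^n are linearly independent iff the matrix having
    # them as rows has rank equal to the number of vectors.
    A = np.array(vectors, dtype=float)
    return np.linalg.matrix_rank(A) == len(vectors)

print(linearly_independent([(1, 0, 0), (0, 1, 0), (0, 0, 1)]))  # True
print(linearly_independent([(1, 2, 0), (2, 4, 0)]))             # False: second = 2 * first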
5.4 Basis and Dimension
In this section, we complete the task of uniquely describing every element of a vector space
V in terms of the elements of a suitable subset B.
Definition 29. Suppose that v₁, …, vᵣ are vectors in a vector space V over R. We say that {v₁, …, vᵣ} is a basis for V if the following two conditions are satisfied:
(B1) span{v₁, …, vᵣ} = V
(B2) The vectors v₁, …, vᵣ are linearly independent.
Proposition 3. Suppose that {v₁, …, vᵣ} is a basis for a vector space V over R. Then every element u ∈ V can be expressed uniquely in the form

u = c₁v₁ + … + cᵣvᵣ

where c₁, …, cᵣ ∈ R.
Example: In Rⁿ, the vectors e₁, …, eₙ, where eⱼ = (0, …, 0, 1, 0, …, 0) with the 1 in the j-th position (preceded by j − 1 zeros and followed by n − j zeros) for every j = 1, …, n, are linearly independent and span Rⁿ. Hence {e₁, …, eₙ} is a basis for Rⁿ. This is known as the standard basis for Rⁿ.
Example: In the vector space M₂,₂(R) of all 2 × 2 matrices with entries in R, the set

{ [1 0; 0 0], [0 1; 0 0], [0 0; 1 0], [0 0; 0 1] }

(where [a b; c d] denotes the matrix with first row (a, b) and second row (c, d)) is a basis.
Example: In the vector space P_k of polynomials of degree at most k and with coefficients in R, the set {1, x, x², …, x^k} is a basis.
Mathematica demonstration: B-spline basis functions
http://demonstrations.wolfram.com/CalculatingAndPlottingBSplineBasisFunctions/
We have shown earlier that a vector space can have many bases. For example, any collection of three vectors not on the same plane is a basis for R³. In the following discussion, we attempt to find out some properties of bases. However, we will restrict our discussion to the following simple case.
Definition 30. A vector space V over R is said to be finite-dimensional if it has a basis containing only finitely many elements.
Example: The vector spaces Rⁿ, M₂,₂(R) and P_k that we have discussed earlier are all finite-dimensional.
Proposition 4. Suppose that V is a finite-dimensional vector space over R. Then any two bases for V have the same number of elements.
Definition 31. Suppose that V is a finite-dimensional vector space over R. Then we say that V is of dimension n if a basis for V contains exactly n elements.
Example: The vector space Rn has dimension n.
Example: The vector space M2;2 (R) has dimension 4.
Example: The vector space Pk has dimension (k + 1).
Proposition 5. Suppose that V is an n-dimensional vector space over R. Then any set of n linearly independent vectors in V is a basis for V.
5.5 Vector Spaces Associated with Matrices
Consider the m × n matrix A with rows r₁, …, rₘ and columns c₁, …, cₙ. Then

A = [r₁; …; rₘ]  (rows stacked on top of each other)  or  A = [c₁ ⋯ cₙ]  (columns side by side)
We also consider the system of homogeneous equations Ax = 0. We will now introduce
three vector spaces that arise from the matrix A.
Definition 32. Suppose that A is an m × n matrix with entries in R.
(RS) The subspace span{r₁, …, rₘ} of Rⁿ, where r₁, …, rₘ are the rows of the matrix A, is called the row space of A.
(CS) The subspace span{c₁, …, cₙ} of Rᵐ, where c₁, …, cₙ are the columns of the matrix A, is called the column space of A.
(NS) The solution space of the system of homogeneous linear equations Ax = 0 is called the nullspace of A.
Our aim now is to find a basis for the row space of a given matrix A with entries in R. This task is made considerably easier by the following result.
Proposition 6. Suppose that the matrix B can be obtained from the matrix A by elementary
row operations. Then the row space of B is identical to the row space of A.
To find a basis for the row space of A, we can now reduce A to row echelon form, and consider the non-zero rows that result from this reduction. It is easily seen that these non-zero rows are linearly independent.
Example: Let

A = [ 1  3  −5  1   5
      1  4  −7  3  −2
      1  5  −9  5  −9
      0  3  −6  2  −1 ]

The matrix A can be reduced to row echelon form as

[ 1  3  −5  1   5
  0  1  −2  2  −7
  0  0   0  1  −5
  0  0   0  0   0 ]

It follows that a basis for the row space of A is {v₁, v₂, v₃}, where v₁ = (1, 3, −5, 1, 5), v₂ = (0, 1, −2, 2, −7), v₃ = (0, 0, 0, 1, −5).
A basis for the column space of A can be found as a basis for the row space of Aᵀ.
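The reduction can be reproduced symbolically, e.g. with sympy; its rref method returns the reduced row echelon form (whose non-zero rows differ from the echelon form above but span the same row space). A short sketch using the matrix from the example:

from sympy import Matrix

A = Matrix([[1, 3, -5, 1,  5],
            [1, 4, -7, 3, -2],
            [1, 5, -9, 5, -9],
            [0, 3, -6, 2, -1]])

R, pivots = A.rref()   # reduced row echelon form and pivot columns
print(R)               # the non-zero rows of R span the row space of A
print(A.rank())        # 3 = dim(row space) = dim(column space)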
Proposition 7. For any matrix A with entries in R, the dimension of the row space is the
same as the dimension of the column space.
Definition 33. The rank of a matrix A, denoted by rank(A), is equal to the common value of the dimension of its row space and the dimension of its column space.
Reference:
Harrison and Waldron (2011), chapter 5
6 Linear Algebra in Vector Spaces
6.1 Linear Transformations
Definition 34. Let V and W be vector spaces over R. The function T : V → W is called a linear transformation if the following two conditions are satisfied:
(LT1) For every u, v ∈ V, we have T(u + v) = T(u) + T(v).
(LT2) For every u ∈ V and c ∈ R, we have T(cu) = cT(u).
Definition 35. A linear transformation T : Rⁿ → Rᵐ is said to be a linear operator if n = m. In this case, we say that T is a linear operator on Rⁿ.
6.2 Inner Product Spaces

Definition 36. Suppose that V is a vector space over R. By a real inner product on V, we mean a function ⟨·,·⟩ : V × V → R which satisfies the following conditions:
(IP1) For every u, v ∈ V, we have ⟨u, v⟩ = ⟨v, u⟩.
(IP2) For every u, v, w ∈ V, we have ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩.
(IP3) For every u, v ∈ V and c ∈ R, we have c⟨u, v⟩ = ⟨cu, v⟩.
(IP4) For every u ∈ V, we have ⟨u, u⟩ ≥ 0, and ⟨u, u⟩ = 0 if and only if u = 0.

The properties (IP1)-(IP4) describe, respectively, symmetry, additivity, homogeneity and positivity. The inner product is thus a whole class of operations which satisfy these properties. An inner product ⟨·,·⟩ always defines a norm by the formula

‖u‖ = √⟨u, u⟩

We can check that in this case all the conditions of a norm are satisfied. However, the converse is not true: not every norm gives rise to an inner product, as the requirements for defining a norm are weaker than those for defining an inner product. While norms are used to evaluate the "size" of individual elements or the "distance" between pairs of elements in vector spaces, inner products are used for transformations of pairs of elements.

Definition 37. A vector space over R with an inner product defined on it is called a real inner product space.

A real inner product space is a normed vector space, but the converse does not hold.
Definition 38. A complete inner product space is called a Hilbert space.

Important special cases of a Hilbert space include:
- the n-dimensional Euclidean space R^n,
- L^2, the set of all functions f : R → R with finite integral of f^2 over R,
- the set of all k × k matrices with real entries,
- the set of all polynomials of degree at most k.
Example: Consider the vector space M_{2,2}(R) of all 2 × 2 matrices with entries in R. For matrices

U = [ u11  u21 ]      V = [ v11  v21 ]
    [ u12  u22 ]          [ v12  v22 ]

in M_{2,2}(R), let

⟨U, V⟩ = u11 v11 + u21 v21 + u12 v12 + u22 v22

It is easy to check that conditions (IP1)-(IP4) are satisfied.
Example: Consider the vector space P_2 of all polynomials with real coefficients and of degree at most 2. For polynomials p = p(x) = p_0 + p_1 x + p_2 x^2 and q = q(x) = q_0 + q_1 x + q_2 x^2 in P_2, let

⟨p, q⟩ = p_0 q_0 + p_1 q_1 + p_2 q_2

It can be checked that conditions (IP1)-(IP4) are satisfied.
Example: An important special case of a vector space over R is C[a, b], the collection of all real-valued functions continuous on the closed interval [a, b]. For f, g ∈ C[a, b], let

⟨f, g⟩ = ∫_a^b f(x) g(x) dx

It can be checked that conditions (IP1)-(IP4) are satisfied.
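These axioms are easy to spot-check numerically. The following is a minimal Python sketch, with an inner() helper of our own, that approximates the C[a, b] inner product by quadrature and verifies (IP1), (IP3) and (IP4) for two sample functions.

import numpy as np
from scipy.integrate import quad

def inner(f, g, a=0.0, b=1.0):
    """<f, g> = integral of f(x) g(x) over [a, b], by numerical quadrature."""
    val, _ = quad(lambda x: f(x) * g(x), a, b)
    return val

f, g = np.sin, np.cos
print(inner(f, g), inner(g, f))   # symmetry (IP1): the two values agree
print(np.isclose(3 * inner(f, g), inner(lambda x: 3 * f(x), g)))  # homogeneity (IP3)
print(inner(f, f) >= 0)           # positivity (IP4)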
6.3 Orthogonal Bases

Using the inner product, we now add the extra ingredient of orthogonality to the above discussion of bases. Suppose that V is a finite-dimensional real inner product space. A basis {v_1, …, v_r} of V is said to be an orthogonal basis of V if ⟨v_i, v_j⟩ = 0 for every i, j = 1, …, r satisfying i ≠ j. It is said to be an orthonormal basis if it satisfies the extra condition that ‖v_i‖ = 1 for every i = 1, …, r.

Example: The basis {e_1, …, e_n}, where

e_j = (0, …, 0, 1, 0, …, 0), with j − 1 zeros before the 1 and n − j zeros after it, for every j = 1, …, n,

is an orthonormal basis of R^n with respect to the Euclidean inner product.
Mathematica demonstration: Orthogonal functions
http://demonstrations.wolfram.com/OrthogonalityOfTwoFunctionsWithWeightedInnerProducts/
6.4 Projections

We can use the machinery of orthogonal vectors to solve a very practical problem.

The Projection Problem: Given a finite-dimensional subspace V of a real inner product space W, together with a vector b ∈ W, find the vector v ∈ V which is closest to b in the sense that ‖b − v‖^2 is minimized. Note that ‖b − v‖^2 is minimized exactly when ‖b − v‖ is minimized; the squared term has the virtue of avoiding square roots in the calculation.
To solve the projection problem, we need the following key concept:

Definition 39. Let v_1, …, v_n be an orthogonal basis for the subspace V of the inner product space W. For any b ∈ W, the (parallel) projection of b into the subspace V is the vector

proj_V b = (⟨v_1, b⟩ / ⟨v_1, v_1⟩) v_1 + (⟨v_2, b⟩ / ⟨v_2, v_2⟩) v_2 + … + (⟨v_n, b⟩ / ⟨v_n, v_n⟩) v_n

It may appear that proj_V b depends on the basis vectors v_1, …, v_n, but this is not the case.

Theorem 40. Let v_1, …, v_n be an orthogonal basis for the subspace V of the inner product space W. For any b ∈ W, the vector v = proj_V b is the unique vector in V that minimizes ‖b − v‖^2.
As a special case, we obtain the least-squares projection ŷ = Py = proj_V y.
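The projection formula of Definition 39 is short enough to compute directly. Below is a minimal Python sketch, with a proj() helper of our own, for an orthogonal (not necessarily orthonormal) basis in R^3; the residual b − proj_V b is orthogonal to the subspace, consistent with Theorem 40.

import numpy as np

def proj(b, basis):
    """Project b onto span(basis), where basis holds mutually orthogonal vectors."""
    return sum((v @ b) / (v @ v) * v for v in basis)

v1 = np.array([1.0,  1.0, 0.0])
v2 = np.array([1.0, -1.0, 0.0])       # orthogonal to v1
b  = np.array([2.0,  3.0, 4.0])

p = proj(b, [v1, v2])
print(p)                              # closest point to b in span{v1, v2}
print((b - p) @ v1, (b - p) @ v2)     # residual is orthogonal to the subspace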
Mathematica demonstration: Projection
http://demonstrations.wolfram.com/ThreeOrthogonalProjectionsOfPolyhedra/
6.5 Euclidean Spaces

An important special case of a real inner product space is the n-dimensional Euclidean space R^n.

Definition 41. Suppose that u = (u_1, …, u_n) and v = (v_1, …, v_n) are vectors in R^n. The Euclidean dot product of u and v is defined by

u · v = u_1 v_1 + … + u_n v_n

The Euclidean norm of u is defined by

‖u‖ = (u · u)^{1/2} = (u_1^2 + … + u_n^2)^{1/2}

and the Euclidean distance between u and v is defined by

‖u − v‖ = ((u_1 − v_1)^2 + … + (u_n − v_n)^2)^{1/2}
The dot product is a special case of the inner product. We can define an inner product of two vectors in R^n in a number of ways. Examples include

⟨u, v⟩ = u · v   or   ⟨u, v⟩ = Au · Av

In R^2 and R^3, we say that two non-zero vectors are perpendicular if their dot product is zero. We now generalize this idea to vectors in R^n.

Definition 42. Two vectors u, v ∈ R^n are said to be orthogonal if u · v = 0.

Example: Suppose that u, v ∈ R^n are orthogonal. Then

‖u + v‖^2 = (u + v) · (u + v) = u · u + 2u · v + v · v = ‖u‖^2 + ‖v‖^2

This is an extension of Pythagoras' theorem.
6.6 Separating Hyperplane Theorem

The Separating Hyperplane Theorem is a result concerning disjoint convex sets in n-dimensional Euclidean space. In economic analysis, the Theorem is used in proving, e.g., the Second Fundamental Theorem of Welfare Economics and the existence of a competitive equilibrium for aggregate excess demand correspondences.
Definition 43. A set S ⊆ R^n is convex if the following condition holds: whenever A, B ∈ S are points in S, the line segment [A, B] is contained entirely in S.

Example in R^2: [Figure: a convex set and a non-convex set.]

Definition 44. A hyperplane of R^n is its subspace of dimension n − 1.
Theorem 45 (Separating Hyperplane Theorem). Let A and B be two disjoint nonempty convex subsets of R^n. Then there exist a nonzero vector v and a real number c such that

⟨x, v⟩ ≥ c and ⟨y, v⟩ ≤ c

for all x ∈ A and y ∈ B.

Example in R^2 of convexity (left) and violation of convexity (right): [Figure.]

Reference:
Harrison and Waldron (2011), chapter 6 and section 7.6
7 Correspondences
Recall the definition of a function from a set A to R^K.

Definition 46. Given a set A ⊆ R^N, a function g : A → R^K is a rule that assigns an element g(x) ∈ R^K to every x ∈ A.

A correspondence generalizes the concept of a function by mapping into sets.

Definition 47 (M.H.1). Given a set A ⊆ R^N, a correspondence f : A → R^K is a rule that assigns a set f(x) ⊆ R^K to every x ∈ A.

Example: [Figure.]
7.1 Continuity of Correspondences

Intuitively, a correspondence is continuous if small changes in x produce small changes in the set f(x). The formal definition of continuity for correspondences is more involved; it encompasses two different concepts:
- upper hemicontinuity,
- lower hemicontinuity.

Definition 48. Given A ⊆ R^N and Y ⊆ R^K, the graph of the correspondence f : A → Y is the set

{(x, y) ∈ A × Y : y ∈ f(x)}

Definition 49 (M.H.2). Given A ⊆ R^N and the closed set Y ⊆ R^K, the correspondence f : A → Y has a closed graph if for any two sequences x^m → x ∈ A and y^m → y, with x^m ∈ A and y^m ∈ f(x^m) for every m, we have y ∈ f(x).

Definition 50 (M.H.3). Given A ⊆ R^N and the closed set Y ⊆ R^K, the correspondence f : A → Y is upper hemicontinuous (uhc) if it has a closed graph and the images of compact sets are bounded, that is, for every compact set B ⊆ A the set

f(B) = {y ∈ Y : y ∈ f(x) for some x ∈ B}

is bounded.
Heuristically, a correspondence is upper hemicontinuous when the following holds: whenever a convergent sequence of points in the domain maps to a sequence of sets in the range that contains a convergent sequence of points, the image of the limit point in the domain must contain the limit of that sequence in the range.

Example of an upper hemicontinuous correspondence that is not lower hemicontinuous: [Figure.]

Definition 51 (M.H.4). Given A ⊆ R^N and a compact set Y ⊆ R^K, the correspondence f : A → Y is lower hemicontinuous (lhc) if for every sequence x^m → x ∈ A with x^m ∈ A for all m, and every y ∈ f(x), we can find a sequence y^m → y and an integer M such that y^m ∈ f(x^m) for m > M.

Lower hemicontinuity captures the idea that if a sequence in the domain converges, then any given point in the image of the limit can be approached by a sequence of points lying in the images along the sequence.

Example of a lower hemicontinuous correspondence that is not upper hemicontinuous: [Figure.]
7.2 Kakutani's Fixed Point Theorem

Theorem 52 (M.I.2). Suppose that A ⊆ R^N is a non-empty, compact, convex set, and that f : A → A is an upper hemicontinuous correspondence from A into itself with the property that the set f(x) ⊆ A is nonempty and convex for every x ∈ A. Then f(·) has a fixed point; that is, there is an x ∈ A such that x ∈ f(x).
Kakutani's Fixed Point Theorem generalizes the Brouwer Fixed Point Theorem. Areas of application include:
- proof of existence of Nash equilibrium in strategic games,
- proof of existence of Walrasian competitive equilibrium.

Illustration of Kakutani's Fixed Point Theorem: [Figure.]

Reference:
Mas-Colell, Whinston and Green (1995), Appendix M.H.
8 Continuity

8.1 Convexity
Theorem 53. Let A = (a_1, …, a_n) and B = (b_1, …, b_n) be any two points in S ⊆ R^n, and let a, b be the corresponding column vectors. Then a function f : R^n → R is:
(i) convex iff f((1 − t)a + tb) ≤ (1 − t)f(a) + tf(b) for all a, b ∈ S and all t ∈ [0, 1];
(ii) concave iff f((1 − t)a + tb) ≥ (1 − t)f(a) + tf(b) for all a, b ∈ S and all t ∈ [0, 1].
When ≤ or ≥ is replaced by < or >, respectively, these become strict convexity or strict concavity.

Example of a convex function of two variables: f(x, y) = x^2 + y^2. [Figure.]

A function f : R^n → R is quasiconcave iff f((1 − t)a + tb) ≥ min{f(a), f(b)} for all a, b ∈ S and all t ∈ [0, 1]. A function f : R^n → R is quasiconvex iff −f is quasiconcave. Every concave function on R^n is quasiconcave, but the converse does not hold.
8.2 The Maximum Theorem

Let T be any metric space. Recall that we denote by C(T) the set of all continuous functions defined on T, by B(T) the set of all bounded functions defined on T, and by CB(T) the set of all continuous and bounded functions on T. When T is compact, CB(T) = C(T), and we will consider C(T) as metrized by the sup-metric d_∞. In line with the reference source for this section (Ok, 2007), we will denote a correspondence by the symbol Γ.

Let X denote a metric space of "decision variables" and let Θ denote a metric space of "parameters". Our goal is to determine how the maximum value and the maximizing vectors in a constrained maximization problem involving X depend on the exogenous parameters θ. Let φ denote the "objective function", defined as a map on X × Θ, and let Γ(θ) be a non-empty subset of X, representing a constraint on x. A general optimization problem is then defined as

max_{x ∈ Γ(θ)} φ(x, θ)

whereby the constraint Γ(θ) depends on the given parameter θ.

For each θ ∈ Θ, let φ*(θ) denote the maximized value of the objective function, and let Γ*(θ) denote the set of maximizing choices, as functions of the parameter θ. The Maximum Theorem tells us that when solving a constrained optimization problem, if the objective function is continuous, and the correspondence defining the constraint set is continuous, compact-valued, and
non-empty, then the problem has a solution and:
(a) the correspondence Γ*(θ) defining the optimal choice set is upper hemicontinuous (if it is a function, it is a continuous function),
(b) the optimized function φ*(θ) is continuous in the parameters.

The Theorem also appears in the literature under the names "Theorem of the Maximum" and "Berge's Theorem".

Theorem 54 (The Maximum Theorem). Let Θ and X be two metric spaces, Γ : Θ → X a compact-valued correspondence, and φ ∈ C(X × Θ). Define

Γ*(θ) ≡ arg max{φ(x, θ) : x ∈ Γ(θ)} for all θ ∈ Θ

and

φ*(θ) ≡ max{φ(x, θ) : x ∈ Γ(θ)} for all θ ∈ Θ,

and assume that Γ is continuous at θ ∈ Θ. Then:
(a) Γ* : Θ → X is compact-valued, upper hemicontinuous, and closed at θ;
(b) φ* : Θ → R is continuous at θ.
8.2.1 Application 1: The Demand Correspondence

Consider an agent whose income is ι > 0 and whose utility function over n-vectors of commodity bundles is u : R^n_+ → R. The standard choice problem of this consumer is

max u(x) s.t. x ∈ B(p, ι)

where p ∈ R^n_+ denotes the price vector in the economy, and B : R^{n+1}_+ → R^n is the budget correspondence of the consumer:

B(p, ι) ≡ {x ∈ R^n_+ : p · x ≤ ι}.

The optimal choice of the individual is conditional on the parameter vector (p, ι), and is given by the demand correspondence d : R^{n+1}_+ → R^n defined by

d(p, ι) ≡ arg max{u(x) : x ∈ B(p, ι)}.

Since B is continuous, it follows from the Maximum Theorem that d is compact-valued, closed, and upper hemicontinuous. Moreover, the indirect utility function u* : R^{n+1}_+ → R defined by u*(p, ι) ≡ max{u(x) : x ∈ B(p, ι)} is continuous.
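As a concrete illustration, the demand correspondence can be computed numerically. The Python sketch below uses a Cobb-Douglas utility, for which the closed-form demand x_k = α_k ι / p_k is standard; the parameter values and helper names are our own choices.

import numpy as np
from scipy.optimize import minimize

alpha = np.array([0.3, 0.7])              # Cobb-Douglas exponents, summing to 1
iota, p = 1.0, np.array([1.0, 2.0])       # income and prices

u = lambda x: -(alpha * np.log(np.maximum(x, 1e-12))).sum()      # minimize -u
budget = {'type': 'ineq', 'fun': lambda x: iota - p @ x}         # p.x <= iota

res = minimize(u, x0=np.array([0.25, 0.25]), constraints=[budget],
               bounds=[(0, None), (0, None)])
print(res.x)                  # numerical demand
print(alpha * iota / p)       # closed form [0.3, 0.35], for comparison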
8.2.2 Application 2: Strategic Games
A strategic game models the strategic interaction of a group of m players, m ≥ 2. Denote the set of players by {1, …, m}. Denote by X_i the action space of player i, defined as a nonempty set which contains all actions that are available to player i = 1, …, m. The outcome of the game is obtained when all players choose an action. Hence, an outcome is any m-vector (x_1, …, x_m) ∈ X_1 × X_2 × … × X_m = X, where X is called the outcome space of the game.

In a strategic game, the payoffs of a player depend not only on her own actions but also on the actions taken by other players. The payoff function of player i, denoted π_i, is thus defined on the entire outcome space: π_i : X → R. Each player i is assumed to have a complete preference relation on X which is represented by π_i. Formally, an m-person strategic game is then defined as a set

G ≡ {(X_1, π_1), …, (X_m, π_m)}.
We say that a strategic game G is finite if each action space X_i is finite.

A well-known example of a finite strategic game is the prisoners' dilemma. With each player choosing between defection (δ) and cooperation (γ), the game is described by means of a payoff matrix of the type

        δ       γ
δ     1, 1    6, 0
γ     0, 6    5, 5

Both players have strong incentives to play the game non-cooperatively by choosing δ, so decentralized behavior dictates the outcome (δ, δ). However, at the cooperative outcome (γ, γ) both players are strictly better off. Thus the prisoners' dilemma illustrates the fact that decentralized individualistic behavior need not lead to a socially optimal outcome in strategic situations.
For any strategic game G, let

X_{-i} ≡ {(ω_1, …, ω_{m−1}) : ω_j ∈ X_j for j < i and ω_{j−1} ∈ X_j for j > i}

be the collection of all action profiles of all the players except player i. Denote by (a, x_{-i}) the outcome x ∈ X where the action taken by player i is a and the actions taken by the other players are x_{-i} ∈ X_{-i}.
Definition 55 (Nash Equilibrium). Let G ≡ {(X_i, π_i)_{i=1,…,m}} be a strategic game. We say that an outcome x ∈ X is a Nash equilibrium if

x_i ∈ arg max{π_i(a_i, x_{-i}) : a_i ∈ X_i} for all i = 1, …, m.

At a NE there is no incentive for any player to change her action, given the actions of the other players at this outcome. A Nash equilibrium x is said to be symmetric if x_1 = … = x_m. We denote the set of all NE and symmetric NE of a game G by NE(G) and NE_sym(G), respectively.
If each X_i is a nonempty compact subset of a Euclidean space, then we say that the strategic game G is a compact Euclidean game. If, in addition, π_i ∈ C(X) for each i = 1, …, m, then we say that G is a continuous and compact Euclidean game. If, instead, each X_i is convex and compact, and each π_i(·, x_{-i}) is quasiconcave for any given x_{-i} ∈ X_{-i}, then G is called a convex and compact Euclidean game. Finally, a compact Euclidean game which is both convex and continuous is called a regular Euclidean game.

Theorem 56 (Nash's Existence Theorem). If G ≡ {(X_i, π_i)_{i=1,…,m}} is a regular Euclidean game, then NE(G) ≠ ∅.
To prove the Theorem, for each i = 1, …, m define the best response correspondence b_i : X_{-i} → X_i by

b_i(x_{-i}) ≡ arg max{π_i(a_i, x_{-i}) : a_i ∈ X_i}

and the correspondence b : X → X by

b(x) ≡ b_1(x_{-1}) × … × b_m(x_{-m}).

The proof proceeds by showing that Kakutani's Fixed Point Theorem applies to b. Since each X_i is compact and convex, so is X. To show that b is convex-valued, fix an arbitrary x ∈ X and 0 ≤ λ ≤ 1, and note that, for any y, z ∈ b(x), we have

π_i(y_i, x_{-i}) = π_i(z_i, x_{-i}).

Thus, by using the quasiconcavity of π_i(·, x_{-i}) on X_i, we find

π_i(λy_i + (1 − λ)z_i, x_{-i}) ≥ π_i(y_i, x_{-i}) ≥ π_i(w_i, x_{-i})

for any w_i ∈ X_i, that is, λy_i + (1 − λ)z_i ∈ b_i(x_{-i}). Since this holds for each i, we conclude that λy + (1 − λ)z ∈ b(x), and hence b(x) is convex-valued. Furthermore, by the Maximum Theorem, b is upper hemicontinuous. QED.
8.3 Semicontinuity

Continuity is a sufficient but not necessary condition for a function to attain a maximum (or minimum) over a compact set. Other sufficient conditions rely on weaker continuity concepts; we will now introduce some of them.

Let X be any metric space, and φ ∈ R^X. We say that φ is upper semicontinuous at x ∈ X if, for any ε > 0, there exists δ > 0 such that

d(x, y) < δ implies φ(y) ≤ φ(x) + ε for all y ∈ X,

and φ is lower semicontinuous at x ∈ X if, for any ε > 0, there exists δ > 0 such that

d(x, y) < δ implies φ(y) ≥ φ(x) − ε for all y ∈ X.

The function φ is said to be upper (lower) semicontinuous if it is upper (lower) semicontinuous at each x ∈ X.
8.3.1 Baire Theorem

Semicontinuity leads to the following generalization of the Weierstrass Theorem.

Theorem 57 (Baire). Let X be a compact metric space, and φ ∈ R^X. If φ is upper semicontinuous, then there exists x ∈ X with φ(x) = sup φ(X). If φ is lower semicontinuous, then there exists x ∈ X with φ(x) = inf φ(X).

Example: Consider the function f(t) + log(1 + t), where f ∈ R^{[0,2]} is defined as

f(t) = t^2 − 2t for 0 ≤ t < 1,   f(t) = 2t − t^2 for 1 ≤ t ≤ 2.

Does f attain a maximum or minimum? Since f is discontinuous at t = 1, the Weierstrass Theorem is not applicable. Instead, we observe that f is upper semicontinuous on [0, 2], and hence by the Baire Theorem the maximum of f exists.
[Figure: graph of the function on [0, 2].]
8.3.2 Caristi's Fixed Point Theorem

The following Theorem generalizes the Banach fixed point theorem by allowing for discontinuous maps.

Theorem 58 (Caristi). Let Φ be a self-map on a complete metric space X. If

d(x, Φ(x)) ≤ φ(x) − φ(Φ(x)) for all x ∈ X

for some lower semicontinuous φ ∈ R^X that is bounded from below, then Φ has a fixed point in X.

Reference:
Ok, p. 229, 234, 238
Part II
Optimization

9 Constrained Optimization

9.1 Equality Constraints
Let f(x) be a function of n variables, and let

g_1(x) = b_1, g_2(x) = b_2, …, g_m(x) = b_m     (3)

be m equality constraints, given by functions g_1, …, g_m and constants b_1, …, b_m. The problem of finding the maximum or minimum of f(x) when x satisfies the constraints (3) is called the Lagrange problem with objective function f and constraint functions g_1, …, g_m. We will describe a general method for solving Lagrange problems.

Definition 59. The Lagrangian or Lagrange function is the function

L(x, λ) = f(x) − λ_1(g_1(x) − b_1) − λ_2(g_2(x) − b_2) − … − λ_m(g_m(x) − b_m)

in the n + m variables x_1, …, x_n, λ_1, …, λ_m. The variables λ_1, …, λ_m are called Lagrange multipliers.
To solve the Lagrange problem, we solve the system of equations consisting of the n first order conditions

∂L(x, λ)/∂x_1 = 0, ∂L(x, λ)/∂x_2 = 0, …, ∂L(x, λ)/∂x_n = 0

together with the m equality constraints (3). The solutions give us candidates for optimality.
Example: Find the candidates for optimality for the following Lagrange problem: maximize/minimize the function f(x_1, x_2) = x_1 + 3x_2 subject to the constraint x_1^2 + x_2^2 = 10.

Solution: To formulate this problem as a standard Lagrange problem, let g(x_1, x_2) = x_1^2 + x_2^2 be the constraint function and let b = 10. We form the Lagrangian

L(x, λ) = f(x) − λ(g(x) − b) = x_1 + 3x_2 − λ(x_1^2 + x_2^2 − 10)

The FOCs become

∂L(x, λ)/∂x_1 = 1 − 2λx_1 = 0 ⟹ x_1 = 1/(2λ)
∂L(x, λ)/∂x_2 = 3 − 2λx_2 = 0 ⟹ x_2 = 3/(2λ)
∂L(x, λ)/∂λ = −(x_1^2 + x_2^2 − 10) = 0 ⟹ x_1^2 + x_2^2 = 10

Solving this system of equations, we obtain the candidates for optimality (x_1, x_2, λ) = (1, 3, 1/2) and (x_1, x_2, λ) = (−1, −3, −1/2). Since f(1, 3) = 10 and f(−1, −3) = −10, if there is a maximum point, then it must be (1, 3), and if there is a minimum point, then it must be (−1, −3).
Definition 60 (NDCQ). The non-degenerate constraint qualification (or NDCQ) condition is satisfied if

rank [ ∂g_1(x)/∂x_1  ∂g_1(x)/∂x_2  …  ∂g_1(x)/∂x_n
       ∂g_2(x)/∂x_1  ∂g_2(x)/∂x_2  …  ∂g_2(x)/∂x_n
            ⋮              ⋮                 ⋮
       ∂g_m(x)/∂x_1  ∂g_m(x)/∂x_2  …  ∂g_m(x)/∂x_n ] = m

The NDCQ says that the constraints are independent at x.

Theorem 61. Let the functions f, g_1, …, g_m in a Lagrange problem be defined on a subset S ⊆ R^n. If x* is a point in the interior of S that solves the Lagrange problem and satisfies the NDCQ condition, then there exist unique numbers λ_1, …, λ_m such that x*, λ_1, …, λ_m satisfy the first order conditions.

This theorem implies that any solution of the Lagrange problem can be extended to a solution of the first order conditions. This means that if we solve the system of equations that includes the first order conditions and the constraints, we obtain candidates for optimality.
Mathematica demonstration: Lagrange multipliers in 2D
http://demonstrations.wolfram.com/LagrangeMultipliersInTwoDimensions/
9.2 Extreme Value Theorem

Theorem 62 (Extreme Value Theorem). Let f(x) be a continuous function defined on a closed and bounded subset S ⊆ R^n. Then f has a maximum point and a minimum point in S.
Theorem 63 (Sufficient Conditions). Let x* be a point satisfying the constraints, and assume that there exist λ_1, …, λ_m such that (x_1*, …, x_n*, λ_1, …, λ_m) satisfy the first order conditions. Then:
(i) If L(x, λ) is convex in x (with λ_i = λ_i* fixed), then x* is a solution to the minimum Lagrange problem.
(ii) If L(x, λ) is concave in x (with λ_i = λ_i* fixed), then x* is a solution to the maximum Lagrange problem.

The Lagrangian, an example (cont.): Returning to our earlier example: maximize/minimize the function f(x_1, x_2) = x_1 + 3x_2 subject to the constraint x_1^2 + x_2^2 = 10.

Solution: The Lagrangian for this problem is

L(x, λ) = x_1 + 3x_2 − λ(x_1^2 + x_2^2 − 10)

We found one solution to the FOCs as (x_1, x_2, λ) = (1, 3, 1/2), and when we fix λ = 1/2 we obtain the Lagrangian

L(x, 1/2) = x_1 + 3x_2 − (1/2)(x_1^2 + x_2^2) + 5

This function is concave, and hence (x_1, x_2) = (1, 3) is a maximum point. Similarly, (x_1, x_2) = (−1, −3) is a minimum point, since L(x, λ) is convex in x when we fix λ = −1/2.
9.3 Inequality Constraints

Consider the following optimization problem with inequality constraints:

max f(x) subject to g_1(x) ≤ b_1, g_2(x) ≤ b_2, …, g_m(x) ≤ b_m

To solve this problem, we can use a variation of the Lagrange method.

Optimization with Inequality Constraints: Just as in the case of equality constraints, the Lagrangian is given by

L(x, λ) = f(x) − λ_1(g_1(x) − b_1) − … − λ_m(g_m(x) − b_m)

In the case of inequality constraints, we solve the Kuhn-Tucker conditions, which consist of the first order conditions

∂L(x, λ)/∂x_1 = 0, ∂L(x, λ)/∂x_2 = 0, …, ∂L(x, λ)/∂x_n = 0

and the complementary slackness conditions given by

λ_j ≥ 0 for j = 1, 2, …, m;   λ_j = 0 whenever g_j(x) < b_j.

When we solve the Kuhn-Tucker conditions together with the inequality constraints, we obtain candidates for a maximum (consider −f(x) for a minimum).
Necessary condition:

Theorem 64. Assume that x* solves the optimization problem with inequality constraints. If the NDCQ condition holds at x*, then there are unique Lagrange multipliers such that (x*, λ*) satisfy the Kuhn-Tucker conditions.

Given a point x* satisfying the constraints, the NDCQ condition holds if the rows of the matrix

[ ∂g_1(x*)/∂x_1  ∂g_1(x*)/∂x_2  …  ∂g_1(x*)/∂x_n
  ∂g_2(x*)/∂x_1  ∂g_2(x*)/∂x_2  …  ∂g_2(x*)/∂x_n
        ⋮               ⋮                  ⋮
  ∂g_m(x*)/∂x_1  ∂g_m(x*)/∂x_2  …  ∂g_m(x*)/∂x_n ]

corresponding to the binding constraints, those with g_j(x*) = b_j, are linearly independent.
An Example: Maximize the function f(x_1, x_2) = x_1^2 + x_2^2 + x_2 − 1 subject to g(x_1, x_2) = x_1^2 + x_2^2 ≤ 1.

Solution: The Lagrangian is

L = x_1^2 + x_2^2 + x_2 − 1 − λ(x_1^2 + x_2^2 − 1)

The FOCs are

2x_1 − 2λx_1 = 0 ⟹ 2x_1(1 − λ) = 0     (4)
2x_2 + 1 − 2λx_2 = 0 ⟹ 2x_2(1 − λ) = −1     (5)

From (4) we get x_1 = 0 or λ = 1; but λ = 1 is not possible by the second equation, and hence x_1 = 0. From (5) we get x_2 = −1/(2(1 − λ)), since λ ≠ 1.

The complementary slackness conditions are λ ≥ 0 and λ = 0 if x_1^2 + x_2^2 < 1. There are two cases:

Case 1: x_1^2 + x_2^2 < 1 and λ = 0. In this case, x_2 = −1/2 from (5), and this satisfies the inequality. Hence the point (x_1, x_2, λ) = (0, −1/2, 0) is a candidate for a maximum.

Case 2: x_1^2 + x_2^2 = 1 and λ ≥ 0. Since x_1 = 0 from (4), we obtain x_2 = ±1. We solve for λ in each case and check that λ ≥ 0.

Thus,

f(0, −1/2) = −1.25,   f(0, 1) = 1,   f(0, −1) = −1

By the extreme value theorem, the function f has a maximum on the closed and bounded set given by x_1^2 + x_2^2 ≤ 1 (a circular disk with radius one), and therefore (x_1, x_2) = (0, 1) is a maximum point.
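A numerical solver should land on the same maximum point. Below is a hedged Python check with scipy.optimize.minimize; the starting point and helper names are our own choices.

import numpy as np
from scipy.optimize import minimize

f = lambda x: -(x[0]**2 + x[1]**2 + x[1] - 1)                     # minimize -f
disk = {'type': 'ineq', 'fun': lambda x: 1 - x[0]**2 - x[1]**2}   # g(x) <= 1

res = minimize(f, x0=np.array([0.5, 0.5]), constraints=[disk])
print(res.x, -res.fun)   # approximately (0, 1) with value 1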
9.4 Application: The Consumer Problem

Define the consumer problem (CP) as:

max_{x ∈ R^n_+} U(x) s.t. p′x ≤ w

where p is a vector of prices and w is the consumer's total wealth. Given p and w, specify the budget set as

B(p, w) = {x ∈ R^n_+ : p′x ≤ w}

and note that it is compact.
Theorem 65. If U(·) is continuous and p ≫ 0, then CP has a solution.

Proof: Continuous functions on compact sets achieve their maximum.

Form the Lagrangian

L(x, λ, μ; p, w) = U(x) + λ(w − p′x) + Σ_{i=1}^n μ_i x_i

The Kuhn-Tucker conditions are:

∂U(x)/∂x_k = λp_k − μ_k     (6)
λ(w − p′x) = 0
λ ≥ 0
μ_i ≥ 0 for all i

plus the original constraints.
We can use the Kuhn-Tucker conditions to characterize (so-called Marshallian) demand. Using the FOCs (6) we can write:

∂U(x)/∂x_k ≤ λp_k, with equality if x_k > 0

Hence, for all goods x_j and x_k consumed in positive quantity, their Marginal Rate of Substitution equals the ratio of their prices:

MRS_{jk} = (∂U(x)/∂x_j) / (∂U(x)/∂x_k) = p_j / p_k

Define the indirect utility function as

v(p, w) = max U(x) s.t. p′x ≤ w

so that v(p, w) is the value of the consumer problem at the optimum. Assume that U(x) is continuous and concave, and that an interior solution to the CP exists. Then v(p, w) is differentiable at (p, w) and, by the Envelope Theorem,

∂v(p, w)/∂w = λ ≥ 0

The Lagrange multiplier λ gives the value (in terms of utility) of having an additional unit of wealth. Because of this, λ is sometimes called the shadow price of wealth or the marginal utility of wealth (or income).
9.5 Envelope Theorem for Unconstrained Optimization
In an optimization problem we often want to know how the value of the objective function will change if one or more of the parameter values changes. Let's consider a simple example: a price-taking firm choosing the profit-maximizing amount to produce of a single product. If the market price of its product changes, how much will the firm's profit increase or decrease? The optimization problem is

max_{q ∈ R_+} π(q, p) = pq − C(q)     (7)

where π(q, p) is the firm's profit, q is the number of units of the firm's product (a decision variable), p is the market price (a parameter), and C(q) is the firm's cost function, assumed convex and differentiable. The F.O.C. for (7) is

∂π(q)/∂q = p − C′(q) = 0     (8)

yielding C′(q) = p. The firm chooses the output level q at which its marginal cost is equal to the market price, which the firm takes as a given parameter. Denote the solution to (7) by q*(p).

The value function for (7) is

v(p) = π(q*(p), p) = pq*(p) − C(q*(p))

Then,

v′(p) = ∂π/∂p + (∂π/∂q)(∂q*(p)/∂p) = q* + (∂π/∂q)(∂q*(p)/∂p)     (9)

which takes into account the firm's output response to a price change, ∂q*(p)/∂p. However, this term is multiplied by ∂π/∂q which, from (8), is 0 at the optimum. (9) thus becomes

v′(p) = q*

which implies that at the optimal solution the derivative of the value function v(p) with respect to a parameter equals only the partial derivative of the objective function π(q, p) with respect to the parameter. This result is summarized in the following theorem:
Theorem 66 (Envelope Theorem for Unconstrained Optimization). For the maximization problem max_{x ∈ R} f(x; θ), if f : R^2 → R is continuously differentiable and if the solution function x* : R → R is differentiable on an open set U ⊆ R, then the derivative of the value function satisfies

v′(θ) = ∂f/∂θ evaluated at x*(θ).
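Theorem 66 is easy to verify numerically for the profit example above. The Python sketch below assumes, purely for illustration, the cost function C(q) = q^2, so that the FOC p = C′(q) = 2q gives q*(p) = p/2; it compares a finite-difference derivative of v(p) with q*(p).

def v(p):
    q = p / 2.0                 # optimal output from the FOC with C(q) = q**2
    return p * q - q**2         # maximized profit

p, h = 3.0, 1e-6
dv = (v(p + h) - v(p - h)) / (2 * h)   # numerical derivative of the value function
print(dv, p / 2.0)                     # both approximately 1.5 = q*(p)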
Here is the Envelope Theorem for n variables and m parameters:

Theorem 67 (Multivariate Envelope Theorem for Unconstrained Optimization). For the maximization problem max_{x ∈ R^n} f(x; θ), if f : R^n × R^m → R is continuously differentiable and if the solution function x* : R^m → R^n is differentiable on an open set U ⊆ R^m, then the partial derivatives of the value function satisfy

∂v(θ)/∂θ_i = ∂f/∂θ_i evaluated at x*(θ), for i = 1, …, m.

Proof. For each i = 1, …, m we have

∂v(θ)/∂θ_i = ∂f(x*(θ), θ)/∂θ_i = ∂f/∂θ_i + Σ_{j=1}^n (∂f/∂x_j)(∂x_j*(θ)/∂θ_i) = ∂f/∂θ_i

because ∂f/∂x_j = 0 at x*(θ) according to the optimization problem's F.O.C.
9.6 Envelope Theorem for Constrained Optimization

Now let's add a constraint to the multivariate maximization problem:

max_{x ∈ R^n} f(x; θ) subject to g(x; θ) ≤ 0     (10)

where both f and g are continuously differentiable and we assume that the solution function x*(θ) exists and is differentiable on an open set U ⊆ R^m. The F.O.C. for (10) is

∇f(x; θ) = λ∇g(x; θ), i.e. ∂f(x; θ)/∂x_j = λ ∂g(x; θ)/∂x_j for j = 1, …, n.     (11)

Let's assume that the constraint is binding and λ > 0. Then the solution function x*(θ) satisfies the constraint with equality, g(x*(θ), θ) = 0, for all θ such that (x*(θ), θ) ∈ U. We therefore have the following proposition, which will be instrumental in the proof of the Envelope Theorem.

Proposition 8. If g : R^n × R → R is continuously differentiable and x*(θ) : R → R^n satisfies g(x*(θ), θ) = 0 for all (x, θ) ∈ U, then

Σ_{j=1}^n (∂g/∂x_j)(∂x_j*(θ)/∂θ) = −∂g/∂θ
Proof. The first-degree Taylor polynomial equation for g(Δx, Δθ) = 0 is

(∂g/∂x_1)Δx_1 + … + (∂g/∂x_n)Δx_n + (∂g/∂θ)Δθ + R(Δx, Δθ) = 0

where for the remainder function R it holds that lim_{(Δx,Δθ)→0} R(Δx, Δθ)/‖(Δx, Δθ)‖ = 0. Dividing by Δθ and taking the limit,

lim_{Δθ→0} [ Σ_{j=1}^n (∂g/∂x_j)(Δx_j/Δθ) + ∂g/∂θ ] = 0

so that

Σ_{j=1}^n (∂g/∂x_j)(∂x_j*(θ)/∂θ) = −∂g/∂θ
Now the proof of the Envelope Theorem becomes straightforward.
Theorem 68 (Multivariate Envelope Theorem with One Constraint). For the maximization problem max_{x ∈ R^n} f(x; θ) subject to g(x; θ) ≤ 0, if f : R^n × R^m → R and g : R^n × R^m → R are continuously differentiable, if the solution function x* : R^m → R^n is differentiable on an open set U ⊆ R^m, and if λ is the value of the Lagrange multiplier in the F.O.C., then the partial derivatives of the value function satisfy

∂v(θ)/∂θ_i = ∂f/∂θ_i − λ ∂g/∂θ_i, both evaluated at x*(θ), for i = 1, …, m.

Proof. For i = 1, …, m, evaluating all partial derivatives at (x*(θ), θ), we have

∂v(θ)/∂θ_i = ∂f/∂θ_i + Σ_{j=1}^n (∂f/∂x_j)(∂x_j*(θ)/∂θ_i)
           = ∂f/∂θ_i + λ Σ_{j=1}^n (∂g/∂x_j)(∂x_j*(θ)/∂θ_i)   (from the F.O.C. (11))
           = ∂f/∂θ_i − λ ∂g/∂θ_i   (by the Proposition above).
Now we can extend the proof to maximization problems with multiple constraints, obtaining the general version of the Envelope Theorem.

Theorem 69 (Multivariate Envelope Theorem with Many Constraints). For the maximization problem max_{x ∈ R^n} f(x; θ) subject to g(x; θ) ≤ 0, if f : R^n × R^m → R and g : R^n × R^m → R^ℓ are continuously differentiable, if the solution function x* : R^m → R^n is differentiable on an open set U ⊆ R^m, and if λ_1, …, λ_ℓ are the values of the Lagrange multipliers in the F.O.C.s, then the partial derivatives of the value function satisfy

∂v(θ)/∂θ_i = ∂f/∂θ_i − Σ_{k=1}^ℓ λ_k ∂g_k/∂θ_i, both evaluated at x*(θ), for i = 1, …, m.

Applications of the Envelope Theorem include Hotelling's Lemma, Shephard's Lemma, Roy's Identity, and the solution of the Bellman equation in Dynamic Programming.
10 Dynamic Optimization

10.1 Motivation: Labor Supply
Labor supply is part of a lifetime decision-making process. There are some stylized facts about the pattern of wages and employment over the lifecycle:
1. Earnings and wages rise with age after schooling, then decline before retirement.
2. Hours worked generally rise with age, then fall before retirement, then go to zero at retirement.
3. Individuals have been retiring earlier, on average, over the last 20 years.

Variations in health status, family composition and real wages, anticipated or otherwise, provide incentives for individuals to vary the timing of their labor market earnings for income-smoothing and taste purposes. But these responses are not adequately captured in the static model.

In the static labor supply model, the household solves the problem:

max_{C,l} U(C, l) s.t. C ≤ wh + Y, h = H − l, C, l ≥ 0

where C, l, h are consumption, leisure and hours of work, respectively, w is the wage rate, Y is non-labor income, and H is the time endowment (e.g. 24 hours per day).

At an interior solution (denoted by *), the marginal rate of substitution between leisure and consumption equals the wage rate:

w U_C(C*, l*) = U_l(C*, l*)

In the static model, people make a one-shot decision about work and consumption. Time does not play a role. Thus the model is not suited for the study of the dynamics of labor supply over the life-cycle.
In the intertemporal labor supply model:
1. Utility does not only depend on present labor supply and consumption, but on the whole sequence of labor supply and consumption over the life-cycle:

U(C_0, C_1, …, C_T, l_0, l_1, …, l_T) = Σ_{t=0}^T β^t U(C_t, l_t)

where β ∈ (0, 1) is a discount factor.
2. Individuals need to form expectations about the future: hence, they maximize expected rather than deterministic utility.
3. The budget constraint is dynamic: individuals can shift resources between periods by using savings. Consequently, today's decisions influence future choice sets.
10.2 Approaches to Dynamic Optimization

Solving the intertemporal labor supply model, and similar dynamic models in economics, requires the use of dynamic optimization methods. There are three distinct approaches to dynamic optimization:
1. Calculus of variations, a centuries-old extension of calculus to infinite-dimensional spaces. It was generalized, under the stimulus of the space race in the late 1950s, into optimal control theory, the most common technique for dealing with models in continuous time.
2. Dynamic programming, which was developed also in the mid-20th century, primarily to deal with optimization in discrete time.
3. The Lagrangean approach, which extends the Lagrangean technique of static optimization to dynamic problems.

We will start with the Lagrangean approach, building on the concepts familiar from static optimization, and then move on to dynamic programming.
10.3 2 Periods

Suppose you embark on a two-day hiking trip with w units of food. Your problem is to decide how much food to consume on the first day, and how much to save for the second day. It is conventional to label the first period with 0. Let c_0 denote consumption on the first day and c_1 denote consumption on the second day. The optimization problem is

max_{c_0, c_1} U(c_0, c_1) s.t. c_0 + c_1 = w

Optimality requires that daily consumption be arranged so as to equalize the marginal utility of consumption on the two days, that is

∂U(c_0, c_1)/∂c_0 = ∂U(c_0, c_1)/∂c_1

Otherwise, the intertemporal allocation of food could be rearranged so as to increase total utility. Under optimality, marginal benefit equals marginal cost, where the marginal cost of consumption in period 0 is the consumption foregone in period 1. This is the fundamental intuition of dynamic optimization: optimality requires that resources be allocated over time in such a way that there are no favorable opportunities for intertemporal trade.

Typically, the utility function is assumed to be separable,

U(c_0, c_1) = u_0(c_0) + u_1(c_1),

and stationary,

U(c_0, c_1) = u(c_0) + βu(c_1),

where β represents the discount rate of future consumption. Then the optimality condition is

u′(c_0) = βu′(c_1)

Assuming that u is concave, we can deduce that

c_0 > c_1 ⟺ β < 1

Consumption is higher in the first period if future consumption is discounted.
This model can be extended in various ways. For example, if it is possible to borrow and lend at interest rate r, the two-period optimization problem is

max_{c_0, c_1} U(c_0, c_1) s.t. c_1 = (1 + r)(w − c_0)

assuming separability and stationarity. Forming the Lagrangean

L = u(c_0) + βu(c_1) − λ(c_1 − (1 + r)(w − c_0))

the first-order conditions for optimality are

∂L/∂c_0 = u′(c_0) − λ(1 + r) = 0
∂L/∂c_1 = βu′(c_1) − λ = 0

Eliminating λ, we conclude that optimality requires that

u′(c_0) = βu′(c_1)(1 + r)     (12)

The left-hand side is the marginal benefit of consumption today. For an optimal allocation, this must be equal to the marginal cost of consumption today, which is the interest foregone (1 + r) times the marginal benefit of consumption tomorrow, discounted at the rate β. Assuming u is concave, we can deduce from (12) that

c_0 > c_1 ⟺ β(1 + r) < 1

The balance of consumption between the two periods depends upon the interaction of the rate of time preference β and the interest rate r.

Alternatively, the optimality condition can be expressed as

u′(c_0) / (βu′(c_1)) = 1 + r     (13)

The quantity on the left-hand side is the intertemporal marginal rate of substitution. The quantity on the right can be thought of as the marginal rate of transformation, the rate at which savings in the first period can be transformed into consumption in the second period.
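For a concrete illustration, take log utility, u(c) = log c, in which case (12) and the budget constraint can be solved in closed form. A minimal Python sketch (the parameter values and the worked closed form are our own choices):

# With u(c) = log(c), (12) gives 1/c0 = beta*(1+r)/c1, i.e. c1 = beta*(1+r)*c0,
# and the budget c1 = (1+r)*(w - c0) then yields c0 = w/(1+beta).
beta, r, w = 0.95, 0.05, 10.0

c0 = w / (1 + beta)
c1 = (1 + r) * (w - c0)
print(c0, c1)
# Check the Euler equation u'(c0) = beta*(1+r)*u'(c1):
print(abs(1/c0 - beta * (1 + r) / c1) < 1e-12)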
10.4 T Periods

To extend the model to T periods, let c_t denote consumption in period t and w_t the remaining wealth at the beginning of period t. Then

w_1 = (1 + r)(w_0 − c_0)
w_2 = (1 + r)(w_1 − c_1)
⋮
w_T = (1 + r)(w_{T−1} − c_{T−1})

where w_0 denotes the initial wealth. The optimal pattern of consumption through time solves the problem

max_{c_t, w_{t+1}} Σ_{t=0}^{T−1} β^t u(c_t) s.t. w_t = (1 + r)(w_{t−1} − c_{t−1}), t = 1, 2, …, T

This is a standard equality constrained optimization problem with T constraints.
Assigning multipliers (λ_1, λ_2, …, λ_T) to the T constraints, the Lagrangean is

L = Σ_{t=0}^{T−1} β^t u(c_t) − Σ_{t=1}^T λ_t (w_t − (1 + r)(w_{t−1} − c_{t−1}))

This can be rewritten as

L = u(c_0) + λ_1(1 + r)(w_0 − c_0)
  + Σ_{t=1}^{T−1} [β^t u(c_t) − λ_{t+1}(1 + r)c_t + (λ_{t+1}(1 + r) − λ_t)w_t]
  − λ_T w_T

The first order conditions for optimality are

∂L/∂c_0 = u′(c_0) − λ_1(1 + r) = 0
∂L/∂c_t = β^t u′(c_t) − λ_{t+1}(1 + r) = 0, t = 1, 2, …, T − 1
∂L/∂w_t = (1 + r)λ_{t+1} − λ_t = 0, t = 1, 2, …, T − 1
∂L/∂w_T = −λ_T = 0

Together, these equations imply

β^t u′(c_t) = λ_{t+1}(1 + r) = λ_t     (14)

in every period t = 0, 1, …, T − 1, and therefore

β^{t+1} u′(c_{t+1}) = λ_{t+1}     (15)
Substituting (15) into (14), we get

β^t u′(c_t) = β^{t+1} u′(c_{t+1})(1 + r), i.e.

u′(c_t) = βu′(c_{t+1})(1 + r), t = 0, 1, …, T − 1     (16)

which is identical to (12), the optimality condition for the two-period problem. An optimal consumption plan requires that consumption be allocated through time so that the marginal benefit of consumption in period t, u′(c_t), is equal to its marginal cost, which is the interest foregone (1 + r) times the marginal benefit of consumption tomorrow, discounted at the rate β.

Assuming u is concave, (16) implies that

c_t > c_{t+1} ⟺ β(1 + r) < 1

The choice of c_t determines the level of wealth remaining, w_{t+1}, to provide for consumption in future periods. The Lagrange multiplier λ_t associated with the constraint

w_t = (1 + r)(w_{t−1} − c_{t−1})

measures the shadow price of this wealth w_t at the beginning of period t. (14) implies that this shadow price is equal to β^t u′(c_t). Additional wealth in period t can either be consumed or saved, and its value in these two uses must be equal. Consequently, its value must be equal to the discounted marginal utility of consumption in period t. Note that the final first-order condition at T is λ_T = 0, since any wealth left over is assumed to be worthless.
10.5 General Problem

The general finite horizon dynamic optimization problem can be depicted as follows. Starting from an initial state s_0, the decision maker chooses some action a_0 ∈ A_0 in the first period. This generates a contemporaneous return or benefit f_0(a_0, s_0) and leads to a new state s_1. The scheme proceeds similarly until the terminal time T. In each period t, the transition to the new state is determined by the transition equation

s_{t+1} = g_t(a_t, s_t)

Assuming separability, the objective of the decision maker is to choose that sequence of actions a_0, a_1, …, a_{T−1} which maximizes the discounted sum of the contemporaneous returns f_t(a_t, s_t) plus the value of the terminal state v(s_T). Therefore, the general dynamic optimization problem is

max_{a_t ∈ A_t} Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T)
s.t. s_{t+1} = g_t(a_t, s_t), t = 0, …, T − 1, given s_0     (17)

The variables in this optimization problem are of two types:
1. a_t ∈ A_t is known as the control variable. It is immediately under the control of the decision-maker.
2. s_t is known as the state variable. It is determined indirectly, through the transition equation.
10.6 Lagrangean Approach

The dynamic optimization problem (17) falls into the regular framework of constrained optimization problems solved by Lagrangean methods that we have encountered. In forming the Lagrangean, it is useful to multiply each constraint (transition equation) by β^{t+1}, giving the equivalent problem

max_{a_t ∈ A_t} Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T)
s.t. β^{t+1}(s_{t+1} − g_t(a_t, s_t)) = 0, t = 0, …, T − 1, given s_0

Assigning multipliers λ_1, …, λ_T to the T constraints (transition equations), the Lagrangean is

L = Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T) − Σ_{t=0}^{T−1} β^{t+1} λ_{t+1}(s_{t+1} − g_t(a_t, s_t))

To facilitate derivation of the first-order conditions, it is convenient to rewrite the Lagrangean as

L = Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + Σ_{t=0}^{T−1} β^{t+1} λ_{t+1} g_t(a_t, s_t) − Σ_{t=1}^T β^t λ_t s_t + β^T v(s_T)

  = f_0(a_0, s_0) + βλ_1 g_0(a_0, s_0)
  + Σ_{t=1}^{T−1} β^t [f_t(a_t, s_t) + βλ_{t+1} g_t(a_t, s_t) − λ_t s_t]
  − β^T λ_T s_T + β^T v(s_T)     (18)

Note that the gradients of the constraints are linearly independent, since each period's a_t appears in only one transition equation.

A necessary condition for optimality is that a_t must be chosen in each period t = 0, 1, …, T − 1 such that

∂L/∂a_t = β^t ∂f_t(a_t, s_t)/∂a_t + β^{t+1} λ_{t+1} ∂g_t(a_t, s_t)/∂a_t = 0
Similarly, in periods t = 1, …, T − 1 the resulting s_t must satisfy

∂L/∂s_t = β^t [∂f_t(a_t, s_t)/∂s_t + βλ_{t+1} ∂g_t(a_t, s_t)/∂s_t − λ_t] = 0

while the terminal state s_T must satisfy

∂L/∂s_T = β^T [−λ_T + v′(s_T)] = 0

The sequence of actions a_0, a_1, …, a_{T−1} and states s_1, s_2, …, s_T must also satisfy the transition equations

s_{t+1} = g_t(a_t, s_t), t = 0, 1, …, T − 1

These necessary conditions can be rewritten as

∂f_t(a_t, s_t)/∂a_t + βλ_{t+1} ∂g_t(a_t, s_t)/∂a_t = 0, t = 0, 1, …, T − 1     (19)
∂f_t(a_t, s_t)/∂s_t + βλ_{t+1} ∂g_t(a_t, s_t)/∂s_t = λ_t, t = 1, …, T − 1     (20)
s_{t+1} = g_t(a_t, s_t), t = 0, …, T − 1     (21)
λ_T = v′(s_T)     (22)
If, in addition, f_t and g_t are increasing in s_t, and f_t, g_t and v are all concave, then λ_t ≥ 0 for every t and the Lagrangean will be concave. To interpret these conditions, observe that a marginal change in a_t has two effects:
1. It changes the instantaneous return in period t by ∂f_t(a_t, s_t)/∂a_t.
2. It changes the state in the next period, s_{t+1}, by ∂g_t(a_t, s_t)/∂a_t, the value of which is measured by βλ_{t+1}.

In terms of interpreting the evolution of the state, observe that a marginal change in s_t has two effects:
1. It changes the instantaneous return in period t by ∂f_t(a_t, s_t)/∂s_t.
2. It alters the attainable state in the next period, s_{t+1}, by ∂g_t(a_t, s_t)/∂s_t, the value of which is βλ_{t+1} ∂g_t(a_t, s_t)/∂s_t.

The value of the total effect discounted to the current period t is expressed by the shadow price λ_t of s_t. Thus the shadow price in each period t measures the present and future consequences of a marginal change in s_t. The terminal condition (22), λ_T = v′(s_T), is known as a transversality condition. The necessary and sufficient conditions (19)-(22) constitute a simultaneous system of 3T equations in 3T unknowns, which in principle can be solved for the optimal solution.
10.7 Maximum Principle

Some economy of notation and additional economic interpretation can be gained by defining the Hamiltonian by

H_t(a_t, s_t, λ_{t+1}) = f_t(a_t, s_t) + βλ_{t+1} g_t(a_t, s_t)

which measures the total discounted return in period t. The Hamiltonian augments the single-period return f_t(a_t, s_t) to account for the future consequences of current decisions, aggregating the direct and indirect effects of the choice of a_t in period t.

Then the Lagrangean (18) becomes

L = H_0(a_0, s_0, λ_1) + Σ_{t=1}^{T−1} β^t [H_t(a_t, s_t, λ_{t+1}) − λ_t s_t] − β^T λ_T s_T + β^T v(s_T)
Optimality requires

∂L/∂a_t = ∂H_t(a_t, s_t, λ_{t+1})/∂a_t = 0, t = 0, 1, …, T − 1
∂L/∂s_t = β^t [∂H_t(a_t, s_t, λ_{t+1})/∂s_t − λ_t] = 0, t = 1, 2, …, T − 1
∂L/∂s_T = β^T [−λ_T + v′(s_T)] = 0

which can be rewritten as

∂H_t(a_t, s_t, λ_{t+1})/∂a_t = 0, t = 0, 1, …, T − 1
∂H_t(a_t, s_t, λ_{t+1})/∂s_t = λ_t, t = 1, 2, …, T − 1
λ_T = v′(s_T)

The Maximum Principle characterizes the solution to the above constrained optimization problem expressed in terms of the Hamiltonian. Along the optimal path, a_t should be chosen in such a way as to maximize the total benefits in each period. In a limited sense, the Maximum Principle transforms a dynamic optimization problem into a sequence of static optimization problems. These static problems are related by two intertemporal equations: the transition equation and the corresponding equation determining the evolution of the shadow price λ_t.
10.8 Infinite Horizon

Many economic problems have no fixed terminal date and are more appropriately or conveniently modeled as having an infinite horizon. In such cases, the dynamic optimization problem becomes

max_{a_t ∈ A_t} Σ_{t=0}^∞ β^t f_t(a_t, s_t) s.t. s_{t+1} = g_t(a_t, s_t), t = 0, 1, …     (23)

given s_0. To ensure that the total discounted return is finite, we assume that f_t is bounded for every t and β < 1.

An optimal solution to (23) must also be optimal over any finite period, provided the future consequences are correctly taken into account. That is, (23) is equivalent to

max_{a_t ∈ A_t} Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v_T(s_T) s.t. s_{t+1} = g_t(a_t, s_t), t = 0, 1, …, T − 1

where

v_T(s_T) = max_{a_t ∈ A_t} Σ_{t=T}^∞ β^{t−T} f_t(a_t, s_t) s.t. s_{t+1} = g_t(a_t, s_t), t = T, T + 1, …

It follows that the infinite horizon problem (23) must satisfy the same intertemporal optimality conditions as its finite horizon counterpart.
11 Dynamic Programming

11.1 Finite Horizon: Key Concepts
Dynamic programming (DP) is an approach to dynamic optimization which facilitates the incorporation of uncertainty and lends itself to computation. Again consider the general dynamic optimization problem

max_{a_t ∈ A_t} Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T) s.t. s_{t+1} = g_t(a_t, s_t), t = 0, …, T − 1

The (maximum) value function for this problem is

v_0(s_0) = max_{a_t ∈ A_t} { Σ_{t=0}^{T−1} β^t f_t(a_t, s_t) + β^T v(s_T) : s_{t+1} = g_t(a_t, s_t), t = 0, …, T − 1 }

By analogy, we define the value function for intermediate periods

v_t(s_t) = max_{a_τ ∈ A_τ} { Σ_{τ=t}^{T−1} β^{τ−t} f_τ(a_τ, s_τ) + β^{T−t} v(s_T) : s_{τ+1} = g_τ(a_τ, s_τ), τ = t, t + 1, …, T − 1 }

The value function measures the best that can be done given the current state and the remaining time. Note that

v_t(s_t) = max_{a_t ∈ A_t} {f_t(a_t, s_t) + βv_{t+1}(s_{t+1}) : s_{t+1} = g_t(a_t, s_t)}
        = max_{a_t ∈ A_t} {f_t(a_t, s_t) + βv_{t+1}(g_t(a_t, s_t))}     (24)

This fundamental recurrence relation, known as the Bellman equation, makes explicit the trade-off between present and future values. It embodies the principle of optimality: "An optimal policy has the property that, whatever the initial state and decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision" (Bellman, 1957).

Assuming v_{t+1} is differentiable and letting

λ_{t+1} = v′_{t+1}(s_{t+1})

the first-order condition for the maximization in the Bellman equation (24) is

∂f_t(a_t, s_t)/∂a_t + βλ_{t+1} ∂g_t(a_t, s_t)/∂a_t = 0, t = 0, 1, …, T − 1     (25)

which is the same F.O.C. derived using the Lagrangean approach. This optimality condition is called the Euler equation.
The derivative of the value function,

λ_t = v′_t(s_t),

follows an analogous recursion, which can be shown as follows. Let the policy function define the solution of the Euler equation (25), denoted by

a_t = h_t(s_t)     (26)

Substituting (26) into (24) yields

v_t(s_t) = f_t(h_t(s_t), s_t) + βv_{t+1}(g_t(h_t(s_t), s_t))

Assuming h and v are differentiable,

λ_t = v′_t(s_t) = (∂f_t/∂a_t)(∂h_t/∂s_t) + ∂f_t/∂s_t + βλ_{t+1} [(∂g_t/∂a_t)(∂h_t/∂s_t) + ∂g_t/∂s_t]
    = ∂f_t/∂s_t + βλ_{t+1} ∂g_t/∂s_t + [∂f_t/∂a_t + βλ_{t+1} ∂g_t/∂a_t] (∂h_t/∂s_t)     (27)

Using the Euler equation (25), the term in the brackets of (27) equals zero, and therefore

λ_t = ∂f_t/∂s_t + βλ_{t+1} ∂g_t/∂s_t, t = 1, 2, …, T − 1

This is equivalent to the recursion we previously derived using the Lagrangean approach.
Coupled with the transition equation and the boundary conditions, the optimal policy is characterized by

∂f_t(a_t, s_t)/∂a_t + βλ_{t+1} ∂g_t(a_t, s_t)/∂a_t = 0, t = 0, 1, …, T − 1
∂f_t(a_t, s_t)/∂s_t + βλ_{t+1} ∂g_t(a_t, s_t)/∂s_t = λ_t, t = 1, …, T − 1
s_{t+1} = g_t(a_t, s_t), t = 0, …, T − 1
λ_T = v′(s_T)

which is the same set of dynamic optimality conditions as obtained from the Lagrangean approach.

It is reassuring that the characterization of an optimal solution is the same regardless of the solution method adopted. The DP approach provides a more elegant derivation of the basic Euler equation characterizing the optimal solution than does the Lagrangean approach, although the latter is more easily and directly related to the static optimization with which we are familiar. The main attraction of DP is that it offers an alternative method of solution, backward induction, which is particularly amenable to computation. Using backward induction, we solve the problem by computing the value function starting from the terminal state and proceeding iteratively back to the initial state, in the process of which we compute the optimal solution. We will see backward induction derived in detail in the application of DP to a labor supply problem.
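As a minimal illustration of backward induction, consider a discretized "cake-eating" special case of the T-period consumption problem, with r = 0 and u(c) = log c; the grid, horizon and parameter values in the Python sketch below are our own choices.

import numpy as np

beta, T, n = 0.95, 5, 201
grid = np.linspace(1e-3, 1.0, n)            # wealth grid
V = np.zeros((T + 1, n))                    # terminal value v(w_T) = 0

for t in range(T - 1, -1, -1):              # proceed backward from T-1 to 0
    for i, w in enumerate(grid):
        c = grid[grid <= w]                 # feasible consumption levels
        wnext = w - c                       # transition with r = 0
        j = np.clip(np.searchsorted(grid, wnext), 0, n - 1)  # crude grid rounding
        V[t, i] = np.max(np.log(c) + beta * V[t + 1, j])     # Bellman equation (24)

print(V[0, -1])   # value of starting with w_0 = 1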
11.2 Infinite Horizon

In the stationary infinite horizon problem

max Σ_{t=0}^∞ β^t f(a_t, s_t) s.t. s_{t+1} = g(a_t, s_t), t = 0, 1, …

the value function is

v(s_0) = max { Σ_{t=0}^∞ β^t f(a_t, s_t) : s_{t+1} = g(a_t, s_t), t = 0, 1, … }

Note that v, f, and g are assumed to have the same functional form in all time periods. The Bellman equation

v(s) = max_a {f(a, s) + βv(g(a, s))}     (28)

must hold in all periods and all states, so we can dispense with t. The first-order conditions for (28) can be used to derive the Euler equation to characterize the optimal solution.
In many economic models, it is possible to dispense with the separate transition equation by identifying the control variable in period t with the state variable in the subsequent period. For example, in an economic growth model, we can view the choice in each period as: given capital stock today, select capital stock tomorrow, with consumption being determined as the residual. Letting x_t denote the decision variable, the optimization problem becomes

max_{x_1, x_2, …} Σ_{t=0}^∞ β^t f(x_t, x_{t+1}) s.t. x_{t+1} ∈ G(x_t), t = 0, 1, …, given x_0.

The Bellman equation for this problem is

v(x) = max_y {f(x, y) + βv(y)}     (29)

For a stationary infinite horizon problem, the Bellman equation (28) or (29) defines a functional equation, an equation in which the unknown is the function v. From another perspective, the Bellman equation defines an operator v → v on the space of value functions. Under appropriate conditions, this operator has a unique fixed point, which is the unique solution of the functional equation. On this basis, we can guarantee the existence and uniqueness of the optimal solution to an infinite horizon problem, and also deduce some of the properties of the optimal solution.
In those cases in which we want to go beyond the Euler equation and these deducible properties to obtain an explicit solution, we need to find the solution v of the functional equation (29). In the infinite horizon problem, we cannot use backward induction, since there is no terminal time T. There are at least three practical approaches to solving the Bellman equation in infinite horizon problems:
1. informed guess and verify;
2. value function iteration;
3. policy function iteration.

In simple cases, it may be possible to guess the functional form of the value function, and then verify that it satisfies the Bellman equation. Given that the Bellman equation has a unique solution, we can be confident that our verified guess is the only possible solution. In other cases, we can proceed by successive approximation.

For the value function iteration solution, given a particular value function v^1, (28) defines another value function v^2 by

v^2(s) = max_a {f(a, s) + βv^1(g(a, s))}     (30)

and so on. Eventually, this iteration converges to the unique solution of (30).

Policy function iteration starts with a feasible policy h^1(s) and computes the value function assuming that policy is applied consistently:

v^1(s) = Σ_{t=0}^∞ β^t f(h^1(s_t), s_t), where s_0 = s and s_{t+1} = g(h^1(s_t), s_t), t = 0, 1, …

Given this approximation to the value function, we compute a new policy function h^2 which solves the Bellman equation assuming this value function, that is

h^2(s) = arg max_a {f(a, s) + βv^1(g(a, s))}

and then use this policy function to define a new value function v^2. Under appropriate conditions, this iteration will converge to the optimal policy function and corresponding value function. In many cases, convergence is faster than under value function iteration (Ljungqvist and Sargent 2000: 33).
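The following is a minimal Python sketch of value function iteration for the optimal growth model introduced in the next section, under the standard parameterization u(c) = log c and F(k) = k^α, for which the exact policy k′ = αβ k^α is known and serves as a check; the grid and parameter values are our own choices.

import numpy as np

alpha, beta, n = 0.3, 0.95, 301
k = np.linspace(1e-3, 0.5, n)                  # capital grid
F = k**alpha                                   # resources available
c = F[:, None] - k[None, :]                    # consumption for each (k, k') pair
util = np.where(c > 0, np.log(np.maximum(c, 1e-12)), -np.inf)

v = np.zeros(n)
for _ in range(1000):                          # iterate the Bellman operator (30)
    v_new = np.max(util + beta * v[None, :], axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

policy = k[np.argmax(util + beta * v[None, :], axis=1)]   # k'(k)
print(policy[n // 2], alpha * beta * F[n // 2])  # VFI vs exact policy alpha*beta*k^alpha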
12 DP Application - Economic Growth
A finite horizon version of the optimal economic growth model solves the following problem:

max_{c_t} Σ_{t=0}^{T−1} β^t u(c_t) + β^T v(k_T) s.t. k_{t+1} = F(k_t) − c_t

where c denotes consumption, k denotes capital, F(k_t) is the total supply of goods available at the end of period t, comprising current output plus undepreciated capital, and v(k_T) is the value of the remaining capital at the end of the planning horizon. Setting

a_t = c_t, s_t = k_t

we obtain

f(a_t, s_t) = u(c_t), g(a_t, s_t) = F(k_t) − c_t

It is economically reasonable to assume that u is concave, and that F and v are concave and increasing. Then, using the optimality principle, an optimal plan satisfies the equations

u′(c_t) = βλ_{t+1}, t = 0, 1, …, T − 1     (31)
λ_t = βλ_{t+1} F′(k_t), t = 1, …, T − 1     (32)
k_{t+1} = F(k_t) − c_t, t = 0, …, T − 1     (33)
λ_T = v′(k_T)     (34)
From (33), in any period t output can be either consumed or saved. The marginal benefit of additional consumption in period t is u′(c_t). The marginal cost of additional consumption is a reduction in capital available for the subsequent period, valued at βλ_{t+1}. (31) requires that consumption in each period t be chosen so that the marginal benefit of additional consumption equals its marginal cost.

In period t + 1, (31) and (32) require

u′(c_{t+1}) = βλ_{t+2}     (35)
λ_{t+1} = βλ_{t+2} F′(k_{t+1})     (36)

The impact of additional capital in period t + 1 is increased production F′(k_{t+1}). This additional production could be saved for the subsequent period, in which case it would be worth βλ_{t+2}F′(k_{t+1}). Alternatively, the additional production could be consumed, in which case it would be worth u′(c_{t+1})F′(k_{t+1}). (35) and (36) imply that

λ_{t+1} = u′(c_{t+1}) F′(k_{t+1})
12. DP APPLICATION - ECONOMIC GROWTH
12.1
ECO2010, Summer 2022
Euler Equation
(31) implies that this is equal to the marginal benefit of consumption in period t, that is

u'(c_t) = β u'(c_{t+1}) F'(k_{t+1})   (37)

which is the Euler equation for this problem, determining relative consumption between successive periods. The left-hand side is the marginal benefit of consumption in period t, while the right-hand side is the marginal cost, where the marginal cost is measured by the marginal utility of potential consumption foregone, u'(c_{t+1}) F'(k_{t+1}), discounted one period. The actual level of the optimal consumption path c_0, c_1, ..., c_{T−1} is determined by the initial capital k_0 and by the requirement (34) that the shadow price of capital in the final period, λ_T, be equal to the marginal value of the terminal stock, v'(k_T).
The Euler equation (37) can be rearranged to yield

u'(c_t) / (β u'(c_{t+1})) = F'(k_{t+1})

The left-hand side of this equation is the intertemporal marginal rate of substitution in consumption, while the right-hand side is the marginal rate of transformation in production, the rate at which additional capital can be transformed into additional output.
Subtracting u'(c_{t+1}) from both sides, the Euler equation (37) can be expressed as

u'(c_t) − u'(c_{t+1}) = (β F'(k_{t+1}) − 1) u'(c_{t+1})

Assuming that u is concave, this implies

c_{t+1} ⋛ c_t  ⟺  β F'(k_{t+1}) ⋛ 1

Whether consumption is increasing or decreasing under the optimal plan depends on the balance between technology and the rate of time preference.
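As a concrete implication (a standard observation, spelled out here for clarity rather than taken from the original text), a stationary consumption path pins down the marginal product of capital:

c_{t+1} = c_t  ⟺  β F'(k_{t+1}) = 1  ⟺  F'(k_{t+1}) = 1/β

For example, with β = 0.95 consumption is constant exactly when the marginal product of capital equals 1/0.95 ≈ 1.053.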
12.2 Hamiltonian
The Hamiltonian for this problem is

H(c_t, k_t, λ_{t+1}) = u(c_t) + λ_{t+1} (F(k_t) − c_t)

which immediately yields the optimality conditions

∂H/∂c = u'(c_t) − λ_{t+1} = 0,   t = 0, 1, ..., T − 1
β ∂H/∂k = β λ_{t+1} F'(k_t) = λ_t,   t = 1, 2, ..., T − 1
∂H/∂λ_{t+1} = F(k_t) − c_t = k_{t+1},   t = 0, 1, ..., T − 1
λ_T = v'(k_T)
12.3 Dynamic Programming
Recall the economic growth dynamic optimization problem

max_{c_t} Σ_{t=0}^{T−1} β^t u(c_t) + β^T v(k_T)   s.t. k_{t+1} = F(k_t) − c_t

The Bellman equation is

v_t(k_t) = max_{c_t} { u(c_t) + β v_{t+1}(k_{t+1}) }
         = max_{c_t} { u(c_t) + β v_{t+1}(F(k_t) − c_t) }
The F.O.C. for this problem is

u'(c_t) − β v'_{t+1}(F(k_t) − c_t) = 0   (38)

Note that the value function shifted by one time period becomes

v_{t+1}(k_{t+1}) = max_{c_{t+1}} { u(c_{t+1}) + β v_{t+2}(F(k_{t+1}) − c_{t+1}) }   (39)

By the Envelope Theorem,

v'_{t+1}(k_{t+1}) = β v'_{t+2}(F(k_{t+1}) − c_{t+1}) F'(k_{t+1})   (40)

The F.O.C. for (39) is

u'(c_{t+1}) − β v'_{t+2}(F(k_{t+1}) − c_{t+1}) = 0

Substituting in (40) yields

v'_{t+1}(k_{t+1}) = u'(c_{t+1}) F'(k_{t+1})   (41)

Substituting in (38) gives the Euler equation (37).
12.4 Infinite Horizon
The infinite horizon version of the economic growth problem is

max_{c_t} Σ_{t=0}^∞ β^t u(c_t)   s.t. k_{t+1} = F(k_t) − c_t

Substituting for c_t using the transition equation, the problem becomes

max_{{k_{t+1}}} Σ_{t=0}^∞ β^t u(F(k_t) − k_{t+1})

The Bellman equation is

v(k_t) = max_{k_{t+1}} { u(F(k_t) − k_{t+1}) + β v(k_{t+1}) }
The F.O.C. for a maximum is

−u'(c_t) + β v'(k_{t+1}) = 0   (42)

where

c_t = F(k_t) − k_{t+1}

Applying the Envelope Theorem,

v'(k_t) = u'(c_t) F'(k_t)

and therefore

v'(k_{t+1}) = u'(c_{t+1}) F'(k_{t+1})

Substituting into (42) we derive the Euler equation

u'(c_t) = β u'(c_{t+1}) F'(k_{t+1})
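This infinite horizon problem also illustrates the "informed guess and verify" approach listed earlier. As an illustrative special case (the functional forms are assumed here, not taken from the notes), let u(c) = ln c and F(k) = k^α, and guess v(k) = A + B ln k. The first-order condition of the Bellman equation then gives k_{t+1} = (βB/(1 + βB)) k_t^α, and matching the coefficient on ln k_t on both sides of the Bellman equation requires B = α(1 + βB), i.e. B = α/(1 − αβ). The verified policy function is therefore

k_{t+1} = αβ k_t^α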
13 DP Application - Labor Supply

13.1 Intertemporal Labor Supply
Consider the intertemporal labor supply problem that we have used to motivate the need for dynamic optimization. Here, C, l, h are consumption, leisure and hours of work, respectively, w is the wage rate, Y is non-labor income, and H is the time endowment (e.g. 24 hours per day).
1. Utility depends on the whole sequence of labor supply and consumption over the life-cycle:

U(C_0, C_1, ..., C_T; l_0, l_1, ..., l_T) = Σ_{t=0}^T β^t U(C_t, l_t)

where β ∈ (0, 1) is a discount factor.
2. Individuals need to form expectations about the future: hence, they maximize expected rather than deterministic utility.
3. The budget constraint is dynamic: individuals can shift resources between periods by using savings. Consequently, today's decisions influence future choice sets.
The model assumes that the worker can freely borrow or save at a constant risk-free interest rate r. Denote the savings (or debt) carried into period t by A_t. The intertemporal budget constraint is then given by

A_{t+1} + C_t = w_t h_t + (1 + r)A_t

The budget constraint equates period t expenditures (savings carried into period t + 1 and contemporaneous consumption) with period t income (labor earnings plus principal and interest on the current asset holdings).

The model further assumes that wages w_t follow a stochastic process described by a conditional probability distribution Φ_t(w_t), specified as

w_{t+1} ∼ Φ_t(w_t)
Individuals have to solve the following optimization problem:

max_{{C_t, l_t}_{t=0}^T} E[ Σ_{t=0}^T β^t U(C_t, l_t) ]   (43)

subject to

A_{t+1} + C_t = w_t h_t + (1 + r)A_t
A_0 = 0
l_t = H − h_t
w_{t+1} ∼ Φ_t(w_t)
C_t, l_t, h_t ≥ 0
We will solve this problem using Dynamic Programming (DP).
13.2 Finite Horizon with No Uncertainty

T = 2 Case for the Consumer Problem
Assume for simplicity for now that:
1. T = 2
2. there is no uncertainty
3. utility comes only from consumption and is logarithmic
4. leisure is constant and equal to one
5. period t income is given deterministically by wt
Then the decision problem is described by:
max_{C_0, C_1} ln(C_0) + β ln(C_1)

subject to

A_1 + C_0 = w_0 + (1 + r)A_0
A_2 + C_1 = w_1 + (1 + r)A_1
A_0 = 0
The problem is dynamic since consumption in period t = 0 determines the savings
brought into next period, A1 , which in turn determines how much can be consumed in the
second period. DP solves the maximization problem by backward induction.
The problem in the last period is:

max_{C_1} ln(C_1)

subject to

A_2 + C_1 = w_1 + (1 + r)A_1

The optimal decision is to consume all resources left:

A_2 = 0
C_1(A_1) = w_1 + (1 + r)A_1

Given savings A_1, the highest utility that can be reached in period t = 1 is given by the value function

v_1(A_1) = ln[C_1(A_1)] = ln[w_1 + (1 + r)A_1]
Going backward in time to t = 0, the problem is still given by (43):

max_{C_0, C_1} ln(C_0) + β ln(C_1)

but the DP solution at t = 1 replaces β ln(C_1) by

β ln(C_1(A_1)) = β v_1(A_1)

which is the optimal value that can be reached next period. This substitution is based on the "Principle of Optimality" proven by Richard Bellman.

The maximization problem in period t = 0 is thus given by

max_{C_0} ln(C_0) + β v_1(A_1)

subject to

A_1 + C_0 = w_0

or simply by

max_{C_0} ln(C_0) + β v_1(w_0 − C_0)   (44)

Note that the intertemporal trade-off between consumption at t = 0 and t = 1 is entirely captured by the value function v_1(·). We can now solve (44) with a Lagrangian.
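For completeness, (44) can also be solved directly (the algebra below is a standard computation added here; it is not part of the original text). The first-order condition is

1/C_0 = β v_1'(w_0 − C_0) = β(1 + r) / (w_1 + (1 + r)(w_0 − C_0))

which solves to

C_0 = (w_0 + w_1/(1 + r)) / (1 + β)

so the consumer spends a fixed share of the present value of lifetime income in period 0.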
In summary: DP solves the finite-horizon optimization problem by backward induction, yielding the functions:
- C_1(A_1), the period t = 1 policy function;
- v_1(A_1), the period t = 1 value function.
The argument A_1 is the state variable. Without uncertainty there is no gain from waiting before deciding how to spend savings in period 1. With uncertainty, there will be gains from the DP approach since it becomes valuable to wait.
General Case for the Consumer Problem

The general finite horizon maximization problem without uncertainty to be solved is described by:

1. An objective function

Σ_{t=0}^T β^t U(X_t)

that depends on a sequence of vectors of control variables X_t ∈ Ω_t (in our example X_t = C_t);

2. A law of motion

S_{t+1} = F_t(S_t, X_t)

that describes the endogenous evolution of the state variable S_t ∈ 𝒮 (in our example S_t = A_t, and F_t is the period t budget constraint);

3. An initial value of the state, S_0;

4. A continuation value beyond period T, v_{T+1}(S_{T+1}), which we assume to be equal to zero.
General Case with Generic Notation

In general, the DP algorithm works as follows:

1. Solve the maximization problem in the last period T for any S_T ∈ 𝒮:

max_{X_T} U(X_T)   subject to   S_{T+1} = F_T(S_T, X_T)

The solution is a policy function X_T(S_T), often setting S_{T+1} = 0.

2. Obtain the value function

v_T(S_T) = U(X_T(S_T))

3. Going backward, solve the following maximization problem at each value of the state variables S_t, replacing the discounted sum of future utilities by v_{t+1}(S_{t+1}) as obtained in the previous induction step:

max_{X_t ∈ Ω_t} U(X_t) + β v_{t+1}(F_t(S_t, X_t))   (45)

4. Repeat step 3 until we reach t = 0.
The equation (45) is the Bellman Equation.

The outcome of the DP algorithm is:
- A sequence of policy functions {X_t(S_t)}_{t=0}^T
- A sequence of value functions {v_t(S_t)}_{t=0}^T

The policy functions are rules that give the optimal value of the controls at any value of the state variables. In the simple consumption-savings model considered above, the policy functions are consumption functions that depend on the current value of available resources; a minimal numerical sketch of the algorithm follows below.
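The following Python sketch implements steps 1-4 for the consumption-savings example; the wage path, grid, and parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal backward-induction sketch for the T-period consumption-savings
# problem with log utility and deterministic wages (illustrative values).
T, beta, r = 2, 0.95, 0.05
w = np.array([1.0, 1.0, 1.0])                  # wage income in t = 0,...,T
a = np.linspace(0.0, 3.0, 301)                 # asset grid (state S_t = A_t)

V = np.zeros((T + 2, len(a)))                  # V[T+1] = 0 continuation value
A_next = np.zeros((T + 1, len(a)))             # policy: optimal A_{t+1}(A_t)

for t in range(T, -1, -1):                     # backward: t = T down to 0
    cash = w[t] + (1 + r) * a                  # resources at each state
    c = cash[:, None] - a[None, :]             # C_t for each choice A_{t+1}
    u = np.where(c > 0, np.log(np.maximum(c, 1e-12)), -np.inf)
    total = u + beta * V[t + 1][None, :]       # Bellman equation (45)
    V[t] = total.max(axis=1)
    A_next[t] = a[total.argmax(axis=1)]
```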
13.3 Finite Horizon with Uncertainty
In most applications the uncertainty originates in the stochastic evolution of factor prices, which enter via the dynamic resource constraint. Introduce a stochastic variable ε_t and assume it exhibits the Markov property

P(ε_{t+1} | ε_t, ε_{t−1}, ε_{t−2}, ..., ε_0) = P(ε_{t+1} | ε_t)

This amounts to assuming an AR(1) process for ε_t. More specifically, assume

ε_{t+1} ∼ Φ_t(ε_t)

The decision problem is then given by:

max_{{X_t}_{t=0}^T} E[ Σ_{t=0}^T β^t U(X_t) ]   (46)

subject to

S_{t+1} = F_t(S_t, X_t, ε_t)
ε_{t+1} ∼ Φ_t(ε_t)
S_0, ε_0 are given
X_t ∈ Ω_t,  S_t ∈ 𝒮,  ε_t ∈ ℰ
Note that in period t expectations are formed about ε_{t+1}, while ε_t has been observed (i.e. is treated as a fixed state variable).

The Bellman equation is now given by

v_t(S_t, ε_t) = max_{X_t ∈ Ω_t} { U(X_t) + β E[ v_{t+1}(S_{t+1}, ε_{t+1}) | ε_t ] }

subject to

S_{t+1} = F_t(S_t, X_t, ε_t)
ε_{t+1} ∼ Φ_t(ε_t)

which can be re-written as

v_t(S_t, ε_t) = max_{X_t ∈ Ω_t} { U(X_t) + β ∫ v_{t+1}(S_{t+1}, ε_{t+1}) f(ε_{t+1} | ε_t) dε_{t+1} }
Solution: Since T is finite, the solution proceeds again by backward induction.

Complications:
1. An additional state variable ε_t (so-called state-space augmentation)
2. The need to evaluate an integral

A popular approach: assume that ε_t is discrete, or discretize (approximate) a continuous stochastic process:

E[ v_{t+1}(S_{t+1}, ε_{t+1}) | ε_t ] = Σ_{ε_{t+1} ∈ ℰ} v_{t+1}(S_{t+1}, ε_{t+1}) P(ε_{t+1} | ε_t)
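On a discretized shock process this conditional expectation is simply a matrix product. A minimal Python sketch (the transition matrix and value array are illustrative placeholders):

```python
import numpy as np

# Sketch of the discretized expectation: P[i, j] = P(eps' = e_j | eps = e_i)
# is an assumed Markov transition matrix; v_next[k, j] = v_{t+1}(S_k, e_j).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                 # illustrative 2-state shock chain
v_next = np.ones((5, 2))                   # placeholder values on a 5x2 grid
Ev = v_next @ P.T                          # Ev[k, i] = E[v_{t+1}(S_k, eps') | e_i]
```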
13.4 Infinite Horizon without Uncertainty
In many dynamic models in economics, T → ∞. This can happen as the discrete time periods of decision become very small. In this case we cannot apply backward induction to solve the DP problem.

The starting point is to focus on "stationary economies". For this purpose define stationarity as follows: conditional on the state variables, the value and policy functions do not depend on time. This implies that, conditional on the state, the problem always looks identical irrespective of calendar time. If in fact the data are nonstationary, we need to perform a suitable normalization of the problem. As an example, recall the neo-classical growth model in Macroeconomics, where each endogenous variable is normalized by the population size and the current value of total factor productivity.

In the following we assume stationarity and hence drop the time indices. The policy and value functions are now written as X(S) and v(S). The "next period" state and control variables are denoted by S' and X'. The law of motion becomes S' = F(S, X).
The Bellman equation (BE) becomes:

v(S) = max_{X ∈ Ω} { U(X) + β v(S') }   s.t. S' = F(S, X)

This implies

v(S) = max_{X ∈ Ω} { U(X) + β v(F(S, X)) }

Note that the same value function enters both sides of the BE, albeit evaluated at different points in the state space. In contrast, in the finite horizon problem we had v_t vs. v_{t+1}, with the latter obtained from backward induction. Hence, when T → ∞, the BE is a functional equation because the unknown it solves for is a function (v(·)) rather than a number.
We need to solve for v(·) along with the policy function X(S). Does the BE have a unique solution in v(·) and X(·)?

Define the mapping T(v) by

T(v) = max_{X ∈ Ω} { U(X) + β v(F(S, X)) }

A solution of the BE is a function v(·) such that

v = T(v)

i.e. a fixed point of T(v). At the fixed point, performing the maximization of U(X) + β v(F(S, X)) at any point in the state space gives back v evaluated at that point.

Theorem 70 (Existence). Let the state space and control space be compact, β ∈ (0, 1), the law of motion F(S, X) be continuous, and the utility function U(X) be continuous and bounded. Then the Bellman equation has a unique fixed point.
The Existence Theorem is an application of the Contraction Mapping Theorem (in mathematics also called the Banach Fixed Point Theorem), which suggests an algorithm for solution:

Theorem 71 (Fixed Point Iteration). Let v^0(S) be some continuous and bounded function defined on the compact state space 𝒮. Let

v^1 = T(v^0) = max_{X ∈ Ω} { U(X) + β v^0(F(S, X)) }

Given this initialization, perform the following "value function iteration": v^n = T(v^{n−1}). Then this iteration converges to the unique fixed point v = T(v).

As a measure of convergence of the Fixed Point Iteration algorithm, define the metric δ = sup_{S ∈ 𝒮} |v^n(S) − v^{n−1}(S)| and declare the algorithm converged once δ falls below a certain tolerance level.
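A minimal Python sketch of this stopping rule, for a generic Bellman operator supplied by the user (the operator itself is assumed given):

```python
import numpy as np

# Sketch of the stopping rule: iterate v <- T(v) on a grid and stop once the
# sup-norm distance between successive iterates falls below a tolerance.
def fixed_point_iteration(T_op, v0, tol=1e-8, max_iter=10_000):
    v = v0
    for _ in range(max_iter):
        v_new = T_op(v)                          # one Bellman-operator step
        if np.max(np.abs(v_new - v)) < tol:      # delta = sup |v_n - v_{n-1}|
            return v_new
        v = v_new
    return v
```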
13.5 Infinite Horizon with Uncertainty

Additional condition:

ε' ∼ Φ(ε)

The stochastic process in ε is not allowed to diverge. One possible restriction ensuring compliance would be assuming an AR(1) process with |ρ| < 1 for the evolution of ε.

The Bellman equation now becomes:

v(S, ε) = max_{X ∈ Ω} { U(X) + β E[ v(S', ε') | ε ] }
s.t. S' = F(S, X, ε)
ε' ∼ Φ(ε)

Note that we need to perform the integration over ε' conditional on ε to obtain E[v(S', ε')|ε] in each value function iteration.
13.6 Example: Intertemporal Labor Supply Problem with Finite T
Recall the intertemporal labor supply problem (43):

max_{{C_t, l_t}_{t=0}^T} E[ Σ_{t=0}^T β^t U(C_t, l_t) ]

subject to

A_{t+1} + C_t = w_t h_t + (1 + r)A_t
A_0 = 0
l_t = H − h_t
w_{t+1} ∼ Φ_t(w_t)
C_t, l_t, h_t ≥ 0
Here the state variables in period t are (A_t, w_t), and the controls are (C_t, l_t, h_t). The period t Bellman equation is given by

v_t(A_t, w_t) = max_{C_t, l_t, h_t} { U(C_t, l_t) + β E[ v_{t+1}(A_{t+1}, w_{t+1}) | w_t ] }

s.t. A_{t+1} + C_t = w_t h_t + (1 + r)A_t
A_0 = 0
l_t = H − h_t
w_{t+1} ∼ Φ_t(w_t)
C_t, l_t, h_t ≥ 0
First assume that at the optimum the constraints C_t, l_t, h_t ≥ 0 are not binding. Then the first-order condition for the maximization problem on the RHS of the Bellman equation with respect to C_t is given by

∂U(C_t, l_t)/∂C_t = β E[ ∂v_{t+1}(A_{t+1}, w_{t+1})/∂A_{t+1} | w_t ]   (47)

and the Envelope Theorem yields

∂v_t(A_t, w_t)/∂A_t = (1 + r) ∂U(C_t, l_t)/∂C_t   (48)

Combining (47) and (48) gives us

∂U(C_t, l_t)/∂C_t = β(1 + r) E[ ∂U(C_{t+1}, l_{t+1})/∂C_{t+1} | w_t ]   (49)
The equation (49) is called the Consumption Euler Equation. It equates the expected marginal rate of substitution between consumption in two subsequent periods with the discounted gross return on savings (1 + r). The Euler Equation describes the optimal evolution of consumption over time and imposes a restriction on the growth rate of expected marginal utilities, reflecting the consumption-smoothing motive. Individuals use savings to smooth life-cycle trajectories of consumption.
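As an illustrative special case (the CRRA functional form is an assumption, not part of the notes), with additively separable utility U(C, l) = C^{1−γ}/(1 − γ) + ψ(l), γ > 0, the Euler equation (49) becomes

E[ (C_{t+1}/C_t)^{−γ} | w_t ] = 1 / (β(1 + r))

so expected consumption growth is increasing in both the discount factor β and the interest rate r.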
Next consider the optimal choice of leisure in period t. The first-order condition is given by

∂U(C_t, l_t)/∂l_t = β w_t E[ ∂v_{t+1}(A_{t+1}, w_{t+1})/∂A_{t+1} | w_t ]   (50)

and combining (50) with (47) we obtain

∂U(C_t, l_t)/∂l_t = w_t ∂U(C_t, l_t)/∂C_t   (51)
Note that (51) is the same condition for the optimal choice of leisure and consumption
as in the static model. Within each period the static marginal rate of substitution between
consumption and leisure is always equal to the static wage. Individuals adjust their labor
supply period-by-period to smooth consumption.
14 Dynamic Optimization in Continuous Time
In a continuous time dynamic optimization problem the agent optimizes over an integral of a function of the control variable, and the evolution of the state variable is given by a differential equation. In contrast, in the discrete time case the agent optimizes over a sum of functions of the control variable, and the evolution of the state variable is given by a difference equation.
14.1 Discounting in Continuous Time

The discount rate β in the discrete time model can be thought of as the present value of $1 invested at the interest rate ρ. That is, to produce a future return of $1 when the interest rate is ρ,

β = 1/(1 + ρ)

needs to be invested, since this amount will accrue to

β(1 + ρ) = 1

after one period. However, suppose interest is accumulated n times during the period, with ρ/n earned each sub-period and the balance compounded. Then, the present value is

β = 1/(1 + ρ/n)^n

since after the full period this amount will accrue to

β(1 + ρ/n)^n = 1

Since

lim_{n→∞} (1 + ρ/n)^n = exp(ρ)

the present value of $1 with continuous compounding over one period is

β = exp(−ρ)

and similarly over t periods

β^t = exp(−ρt)
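A quick numeric check of this limit (the rate value is illustrative):

```python
import numpy as np

# Numeric check: (1 + rho/n)**(-n) approaches exp(-rho) as n grows.
rho = 0.05
for n in (1, 12, 365, 100_000):
    print(n, (1 + rho / n) ** (-n))
print("exp(-rho):", np.exp(-rho))
```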
14.2 Continuous Time Optimization

A continuous time maximization problem typically takes the form

max_{a_t} ∫_0^T exp(−ρt) f(a_t, s_t) dt   (52)

subject to

ṡ_t = g(a_t, s_t)   (53)

with the notational convention ṡ_t = ∂s_t/∂t. Here:
- a_t is the control (or choice) variable
- s_t is the state variable
- f and g are functions that describe the objective function and the evolution of the state variable, respectively
- ρ is the discount rate.
In order to find the solution to the continuous time optimization problem, we first define the Hamiltonian function

H_t = f(a_t, s_t) + λ_t ṡ_t

where λ_t is a Lagrange multiplier term (known as a co-state variable).

The Hamiltonian function has an economic interpretation: H_t describes the total impact on the inter-temporal objective function of choosing a particular level of the control variable a_t at time t. The direct impact is captured by the f(a_t, s_t) term. In addition, a_t also has an indirect effect by changing the level of the state variable in the amount ṡ_t. Since λ_t tells us how much the objective function can be increased by one more unit of the state variable at the margin, it follows that λ_t ṡ_t summarizes the total indirect impact on the objective function. Thus, H_t summarizes the total impact of picking a_t on the inter-temporal objective function.
The conditions for an optimum come next. Since H_t summarizes the total impact of choosing a given level of a_t on the inter-temporal utility function, it follows that the optimal solution is characterized by a first-order condition of the form

∂H_t/∂a_t = 0

As in the Bellman equation, there is another, less intuitive condition involving the state variable. In the continuous time case this equation is

∂H_t/∂s_t = ρλ_t − λ̇_t

called the co-state equation. These two conditions form part of the so-called Pontryagin's Maximum Principle characterizing the solution to a dynamic optimization problem.
The final two constraints are the initial condition

s_0 = s̄_0   (54)

and the terminal condition

s_T = s̄_T   (55)

If the time horizon is infinite, the terminal condition becomes the so-called transversality condition

lim_{T→∞} s_T λ_T = 0   (56)

This states that in the limit either there must be no units of the state variable left over (s_T = 0) or, if there are units left over, their value in terms of maximizing utility has to be zero (λ_T = 0).
To summarize, given the optimization problem (52) and (53)

max_{a_t} ∫_0^T exp(−ρt) f(a_t, s_t) dt   s.t. ṡ_t = g(a_t, s_t)

we define the Hamiltonian

H_t = f(a_t, s_t) + λ_t ṡ_t

and the conditions for the solution

∂H_t/∂a_t = 0   ⟹   f_a(a_t, s_t) + λ_t g_a(a_t, s_t) = 0   (57)
∂H_t/∂s_t = ρλ_t − λ̇_t   ⟹   f_s(a_t, s_t) + λ_t g_s(a_t, s_t) = ρλ_t − λ̇_t   (58)

This is combined with the initial condition (54) and either the terminal condition (55) or the transversality condition (56) to form a system of differential equations that characterizes the solution to the problem.
14.3 Parallels with Discrete Time

We can set up an approximate version of the above problem in discrete time as

max_{a_t} Σ_{t=0}^T (1/(1 + ρ))^t f(a_t, s_t)

subject to

s_{t+1} = s_t + g(a_t, s_t)

The transition equation is slightly different here from what we worked with previously, simply so that we can draw a parallel between the evolution of the state variable in the case of discrete time,

s_{t+1} − s_t = g(a_t, s_t)

and the evolution of the state variable in the case of continuous time,

ṡ_t = g(a_t, s_t)
The value function in discrete time is

v_t(s_t) = max_{a_t} { f(a_t, s_t) + (1/(1 + ρ)) v_{t+1}(s_{t+1}) }   s.t. s_{t+1} = s_t + g(a_t, s_t)

The first-order condition and the envelope condition are

0 = f_a(a_t, s_t) + (1/(1 + ρ)) v'_{t+1}(s_{t+1}) g_a(a_t, s_t)   (59)
v'_t(s_t) = f_s(a_t, s_t) + (1/(1 + ρ)) v'_{t+1}(s_{t+1}) [1 + g_s(a_t, s_t)]   (60)
Rearranging the envelope condition (60),

v'_t(s_t) − (1/(1 + ρ)) v'_{t+1}(s_{t+1}) = f_s(a_t, s_t) + (1/(1 + ρ)) v'_{t+1}(s_{t+1}) g_s(a_t, s_t)

[(1 + ρ) v'_t(s_t) − v'_{t+1}(s_{t+1})] / (1 + ρ) = f_s(a_t, s_t) + (1/(1 + ρ)) v'_{t+1}(s_{t+1}) g_s(a_t, s_t)

[v'_t(s_t) − v'_{t+1}(s_{t+1})] / (1 + ρ) + ρ v'_t(s_t) / (1 + ρ) = f_s(a_t, s_t) + (1/(1 + ρ)) v'_{t+1}(s_{t+1}) g_s(a_t, s_t)   (61)

The above rearrangement of the envelope condition is unnecessary for the solution in the discrete time case. However, the solution technique for the continuous time case generates two equations that are identical to the first-order condition (59) and the rearranged envelope condition (61).
Comparing the first-order condition (FOC) for the Hamiltonian case (57) given above with the FOC for the Bellman case (59), we can see that they are equivalent if

λ_t = (1/(1 + ρ)) v'_{t+1}(s_{t+1})   (62)

This makes sense since we defined v_t(s_t) as the optimized value of the lifetime objective function, and hence (1/(1 + ρ)) v'_{t+1}(s_{t+1}) is the change in the (discounted) maximized value of the objective function when we add one more unit of the state for next period, i.e. the equivalent of λ_t.
Using the equivalence (62), we can rewrite the discrete-time envelope condition (61) as

(λ_{t−1} − λ_t) + ρ λ_{t−1} = f_s(a_t, s_t) + λ_t g_s(a_t, s_t)   (63)

If we move from discrete time to continuous time, where the difference between two time periods becomes infinitesimally small, the envelope condition (63) can be thought of as being analogous to (58), i.e.

−λ̇_t + ρ λ_t = f_s(a_t, s_t) + λ_t g_s(a_t, s_t)

which is Pontryagin's co-state equation, showing that it has parallels in the Bellman equation.
14.4 Application - Optimal Growth
One of the most famous macroeconomic models is the Ramsey/Cass/Koopmans model of an economy. The following is a simplified version of that model describing the behavior of a consumer/producer in the economy. The individual faces the following maximization decision

max_{C_t} ∫_0^T exp(−ρt) U(C_t) dt   s.t. K̇_t = F(K_t) − C_t − δK_t

where C_t is consumption, K_t is the capital stock, and 0 < δ < 1 is the rate of capital depreciation.

The utility function is

U(C_t) = (C_t^{1−1/σ} − 1) / (1 − 1/σ)

where σ > 0.

The production function is

F(K_t) = K_t^α

where 0 < α < 1.
We can define the Hamiltonian as

H_t = U(C_t) + λ_t K̇_t = U(C_t) + λ_t [F(K_t) − C_t − δK_t]

The initial level of capital K_0 is given, and we also assume that the transversality condition holds, so that

lim_{t→∞} K_t λ_t = 0
The first order condition and the co-state equation are

∂H_t/∂C_t = 0   ⟹   U'(C_t) − λ_t = 0
∂H_t/∂K_t = ρλ_t − λ̇_t   ⟹   λ_t [F'(K_t) − δ] = ρλ_t − λ̇_t

Combining these (differentiating U'(C_t) = λ_t with respect to time gives λ̇_t = U''(C_t)Ċ_t) yields the Euler equation for consumption in continuous time:

U''(C_t) Ċ_t = (ρ + δ − F'(K_t)) U'(C_t)   (64)
Given the utility function (for a specific σ)

U(C_t) = 2√C_t

and the production function

F(K_t) = K_t^α

we have

U'(C_t) = C_t^{−1/2}
U''(C_t) = −(1/2) C_t^{−3/2}
F'(K_t) = α K_t^{α−1}

Using these in the Euler equation (64) yields

−(1/2) C_t^{−3/2} Ċ_t = (ρ + δ − α K_t^{α−1}) C_t^{−1/2}   (65)

Rearranging (65) yields

Ċ_t / C_t = 2 (α K_t^{α−1} − δ − ρ)   (66)
The growth rate of consumption is related to the difference between the net rate of return on capital (F'(K_t) − δ) and the discount rate ρ. If the rate of return on capital and the rate at which we discount the future are identical, i.e. F'(K_t) − δ = ρ, then consumption will remain constant over time. If F'(K_t) − δ > ρ then consumption will be rising over time. If F'(K_t) − δ < ρ then consumption will be falling over time.

The Euler equation (66), along with the flow budget constraint K̇_t = F(K_t) − C_t − δK_t, the initial condition K_0 = K̄ and the transversality condition lim_{t→∞} K_t U'(C_t) = 0, constitutes a system of non-linear differential equations that makes up the solution to the problem. It can be shown that the system is saddle-path stable and that the steady state of the model is at

K* = (α / (ρ + δ))^{1/(1−α)}
C* = (K*)^α − δ K*

The differential equation system and its solution are depicted in a phase diagram in the (K, C) plane (figure omitted).
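A minimal Python sketch of the (K, C) system for this example; the parameter values are illustrative assumptions, and the initial consumption level is a guess, as one would use in a shooting method:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch: integrate the (K, C) system of the growth example
# (illustrative parameters; C(0) below is a guessed starting value).
alpha, rho, delta = 0.3, 0.05, 0.05

def rhs(t, y):
    K, C = y
    dK = K**alpha - C - delta * K                         # budget constraint
    dC = 2 * C * (alpha * K**(alpha - 1) - delta - rho)   # Euler eq. (66)
    return [dK, dC]

K_star = (alpha / (rho + delta)) ** (1 / (1 - alpha))     # steady state
C_star = K_star**alpha - delta * K_star
sol = solve_ivp(rhs, (0.0, 40.0), [0.5 * K_star, 0.8 * C_star])
```

Trajectories started off the saddle path diverge from (K*, C*), which is precisely what the phase diagram illustrates.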
15 Numerical Optimization
There are two broad classes of optimization algorithms:
1. Derivative-free methods
- very useful if the objective function is not smooth, or if its derivatives are expensive
to compute
2. Derivative-based methods (Newton-type methods)
- very useful if objective function derivatives or derivative estimates are readily available
15.1 Golden Search

A widely used derivative-free method is the Golden search method. Its principle is similar to the Bisection method for root-finding. Consider the problem max f(x) over [a, b], where f is continuous, concave and unimodal. Pick x_1, x_2 ∈ (a, b) with x_1 < x_2. Evaluate f(x_1), f(x_2) and replace [a, b] with [a, x_2] if f(x_1) > f(x_2), or with [x_1, b] if f(x_1) < f(x_2). A local maximum must be contained in the new interval. Repeat this procedure until the length of the interval is shorter than some desired tolerance level.

A key issue is how to pick the interior evaluation points. Golden search uses an optimal reduction factor for the search interval, minimizing the number of function evaluations by re-using interior points from previous iterations. The choice of x_1, x_2 makes use of the properties of the golden ratio φ = (√5 + 1)/2 named by the ancient Greeks. Note that 1/φ = φ − 1 and 1/φ² = 1 − 1/φ. The points x_1, x_2 are optimally selected as

x_1 = b − (b − a)/φ
x_2 = a + (b − a)/φ
After one iteration of the search, it is possible that we will discard a and replace it with a' = x_1. Then the new value to use as x_1 will be

x'_1 = b − (b − a')/φ
     = b − (b − x_1)/φ
     = b − (b − b + (b − a)/φ)/φ
     = b − (b − a)/φ²
     = b − (b − a)(1 − 1/φ)
     = a + (b − a)/φ
     = x_2

This implies that we can re-use a point that we already have: we don't need a new evaluation of f(x'_1) and can use f(x_2) instead. Similarly, if we update to b' = x_2, then x'_2 = x_1 and we can re-use that point.
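A minimal Python sketch of golden-section search implementing the update rule above (the test function is an illustrative assumption):

```python
import numpy as np

# Minimal golden-section search for max f(x) on [a, b], f concave/unimodal.
def golden_search(f, a, b, tol=1e-8):
    phi = (np.sqrt(5) + 1) / 2
    x1, x2 = b - (b - a) / phi, a + (b - a) / phi
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 > f2:                  # maximum lies in [a, x2]
            b, x2, f2 = x2, x1, f1   # re-use x1 as the new x2
            x1 = b - (b - a) / phi
            f1 = f(x1)
        else:                        # maximum lies in [x1, b]
            a, x1, f1 = x1, x2, f2   # re-use x2 as the new x1
            x2 = a + (b - a) / phi
            f2 = f(x2)
    return (a + b) / 2

x_max = golden_search(lambda x: -(x - 1.0)**2, -3.0, 4.0)   # approx. 1.0
```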
15.2 Nelder-Mead
For multivariate functions, a widely-used derivative-free optimization method is the Nelder-Mead algorithm. The algorithm begins by evaluating the objective function at n + 1 points, forming a so-called simplex in the n-dimensional Euclidean space.
At each iteration, the algorithm:
- Determines the point on the simplex with the lowest function value and alters that point by reflecting it through the opposite face of the simplex (Reflection).
- If the reflection finds a new point that is higher than all the others on the simplex, the algorithm checks expanding the simplex further in this direction (Expansion).
- If the reflection fails to produce a point that is at least as good as the second worst point, the algorithm contracts the simplex by halving the distance between the original point and its opposite face (Contraction).
- If even the contracted point fails to improve on the second worst point, the algorithm shrinks the entire simplex toward the best point (Shrinkage).
Two-dimensional illustration (figure omitted): the original triangle is in light grey, the new triangle in dark grey.

Animated demonstration:
https://www.youtube.com/watch?v=HUqLxHfxWqU
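In practice one rarely codes Nelder-Mead by hand; the following sketch uses SciPy's implementation on an illustrative two-dimensional objective:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative use of SciPy's Nelder-Mead on a smooth 2-D objective
# (the function and starting point are arbitrary choices).
f = lambda x: (x[0] - 1.0)**2 + 3.0 * (x[1] + 2.0)**2
res = minimize(f, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(res.x)   # approx. [1.0, -2.0]
```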
15.3 Newton-Raphson Method

The Newton-Raphson method is a popular derivative-based optimization method, identical to applying Newton's method to finding the root of the gradient of the objective function. The Newton-Raphson method uses successive quadratic approximations to the objective function. Given x^(k), the subsequent iterate is computed by maximizing the second order Taylor approximation to f about x^(k):

f(x) ≈ f(x^(k)) + f'(x^(k))(x − x^(k)) + (1/2)(x − x^(k))ᵀ f''(x^(k))(x − x^(k))

Solving the first-order condition

f'(x^(k)) + f''(x^(k))(x − x^(k)) = 0

yields the iteration rule

x^(k+1) = x^(k) − [f''(x^(k))]^{−1} f'(x^(k))

The Newton-Raphson method can be very sensitive to starting values.

Animated demonstration:
http://bl.ocks.org/dannyko/raw/ffe9653768cb80dfc0da/
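A minimal Python sketch of the iteration rule in one dimension, with an illustrative concave objective supplied analytically:

```python
# Minimal Newton-Raphson sketch for a univariate maximization problem,
# using analytic first and second derivatives (illustrative function).
def newton(f1, f2, x0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        step = f1(x) / f2(x)          # [f''(x)]^{-1} f'(x) in one dimension
        x -= step
        if abs(step) < tol:
            break
    return x

# maximize f(x) = log(x) - x: f'(x) = 1/x - 1, f''(x) = -1/x**2, argmax = 1
x_star = newton(lambda x: 1/x - 1, lambda x: -1/x**2, x0=0.5)
```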
15.4 Quasi-Newton Methods

There are a number of modifications to the Newton-Raphson method, called quasi-Newton methods. Some employ an approximation B(x^(k)) to the inverse Hessian [f''(x^(k))]^{−1}, using a search direction

d^(k+1) = −B(x^(k)) f'(x^(k))

called the Newton step. Robust quasi-Newton methods shorten or lengthen the Newton step in order to obtain an improvement in the objective function. These include the Davidon-Fletcher-Powell (DFP), Broyden-Fletcher-Goldfarb-Shanno (BFGS), and Berndt-Hall-Hall-Hausman (BHHH) updating methods.

Mathematica demonstration: Minimizing the Rosenbrock function
http://demonstrations.wolfram.com/MinimizingTheRosenbrockFunction/
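An illustrative quasi-Newton run on the Rosenbrock function via SciPy (shown here as a usage sketch, not as part of the original notes):

```python
import numpy as np
from scipy.optimize import minimize, rosen

# BFGS builds up its inverse-Hessian approximation from gradient differences.
res = minimize(rosen, x0=np.array([-1.2, 1.0]), method="BFGS")
print(res.x)   # approx. [1.0, 1.0], the global minimizer
```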
Part III
Statistical Analysis

16 Introduction to Probability
Statistical analysis involves the fundamental concept of "probability". There are several different interpretations of probability. Two key categories are:
1. Frequentist probability: limit of the relative frequency of an event’s occurrence in a
large number of trials.
2. Bayesian probability: quantity that is assigned to represent a state of knowledge (also
called belief).
For the purpose of illustration of the concepts involved, in this section we will use the well-known simple regression model

y = β_0 + β_1 x + ε

We will consider the hypothesis H_0: β_1 = 0 against H_1: β_1 ≠ 0. The concepts discussed generalize to more elaborate models.

16.1 Randomness and Probability
A population is defined as the entire collection of elements about which information is desired. A random process (or experiment) is defined as a procedure, involving a given population, that can conceptually be repeated many times, leading to certain outcomes (qualitative or abstract). The sample space is defined as the set of all possible outcomes of the random process. An event is defined as a subset of the sample space.

For any given event, only one of two possibilities may hold: it occurs or it does not. The relative frequency of occurrence of an event, observed in a number of repetitions of the experiment, is a measure of the frequentist probability of that event. Denote by n_t the total number of random experiments conducted and by n_x the number of these trials in which the event x occurred. The frequentist probability P(x) of the event x occurring is

P(x) = lim_{n_t→∞} n_x / n_t
In the Bayesian approach, "randomness" = uncertainty. The reason something is random is not because it is generated by a "random process" but because it is unknown. Hence, unknown parameters are treated as random variables. Typically, Bayesians call random variables "observables" and parameters "unobservables".

In frequentist statistics: data (events) are a repeatable random sample, the underlying parameters remain constant during this repeatable process, and parameters are fixed.

In Bayesian statistics: data are observed from the realized sample, parameters are unknown and therefore described probabilistically, and data are fixed.
16.1.1 Frequentist Framework

Frequentist random process (diagram omitted):
(Note: 1 arrow = 1 piece of "evidence", i.e. in general 1 data set consisting of many data points.) What does it mean to have 95% confidence?

The frequentist confidence interval traps the truth in 95% of experiments. To define anything frequentist, you have to imagine repeated experiments.

Frequentist hypothesis testing (diagram omitted):
For frequentist inference and hypothesis testing, we need to imagine running the random experiment again and again. We always need to think about many other "potential" datasets, not just the one we actually have to analyze. How does Bayesian inference differ?
16.1.2 Bayesian Framework
The Bayes Theorem tells us:

p(θ|Y) ∝ f(Y|θ) × π(θ)
posterior ∝ likelihood × prior

Prior distribution π(θ): distributional model of what we know about the parameter θ excluding the information in the data.

Likelihood f(Y|θ): based on modeling assumptions, how (relatively) likely the data Y are if the truth is θ.

Keep adding data, and updating knowledge, as data become available, and knowledge will (in general) concentrate around the true θ.
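A minimal numeric illustration of this updating (the Beta-Binomial model and its numbers are assumptions chosen for simplicity, not part of the notes):

```python
from scipy.stats import beta

# Illustrative Bayesian updating: Beta prior on a Bernoulli success
# probability theta, updated with hypothetical binomial data.
a0, b0 = 2, 2                      # Beta(2, 2) prior, centered at 0.5
successes, trials = 34, 50         # hypothetical observed data
a_post, b_post = a0 + successes, b0 + trials - successes
print("posterior mean:", a_post / (a_post + b_post))
print("95% credible interval:", beta.ppf([0.025, 0.975], a_post, b_post))
```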
Likelihood function (left) and posterior probability (right) (figures omitted).

Example: here is exactly the same idea, in practice. During the search for Air France 447, from 2009-2011, knowledge about the black box location was described via probability, using Bayesian inference. Eventually, the black box was found in the highest-probability (red) area of the posterior map (figure omitted).
No replications of data are involved (no replicate plane crashes).
16.2 Inference
The frequentist approach is based on pre-data considerations: "If the hypothesis being tested is in fact true, what is the probability that we shall get data indicating that it is true?"

Bayesians frame their argument in terms of post-data considerations: "What is the probability, conditional on the data, that the hypothesis is true?"

For frequentists the probability of a hypothesis is not meaningfully defined (it is a priori fixed to be equal to 0 or 1).
A frequentist 95% Confidence Interval (CI):
– Contains θ_0 with probability 0.95 across hypothetical repetitions of the random experiment, but only before one has seen the data.
– After the actual data has been observed, the probability is either 0 or 1 (θ_0 is either inside the CI, or outside of the CI).
– With a large number of repeated samples, 95% of calculated confidence intervals would include the true value of the parameter.
– (In practice, CIs are sometimes misinterpreted as guides to post-sample uncertainty.)
A Bayesian 95% Credible Set (CS):
– Has 95% coverage probability after one has seen the data.
– θ is a random variable with a proper distribution quantifying the uncertainty inherent in θ_0.
– Expresses the posterior probability that θ_0 is inside the CS.
Where do priors come from? Recall from above: prior = model based on previous experience of the analyst. Priors come from all data external to the current study. Priors give the analyst the option of incorporating additional information beyond what is contained in the current data set.

Priors can be informative (e.g. θ ∼ N(θ_0, σ_0²)), or diffuse (e.g. π(θ) ∝ c for some constant c ∈ R).

Incorrect inference resulting from a misleadingly restrictive prior is equivalent to a frequentist model misspecification. Remedies: qualitative priors (e.g. monotonicity or smoothness restrictions), or non-parametric priors, which provide a flexible modeling framework quickly dominated by the data evidence.
References: Greenberg (2012), Geweke et al. (2011)
17 Measure-Theoretic Probability

17.1 Elements of Measure Theory
A rigorous measure-theoretic definition of probability is based on the concept of a probability measure, which is a special case of a measure of sets. Intuitively, we can think of a measure μ of a set A as an integral over the set's "region":

μ(A) = ∫_A dx   or   μ(A) = ∫_A p(x) dx

for some function p(x).
Examples:
- If A is a geometric object, such as a cone, then μ(A) can be the volume of A.
- If A is a physical object, such as a body, then μ(A) can be the physical mass of A.
- If A is a random event, then μ(A) can be the probability mass of A (the probability of A occurring).

Let μ(A) = ∫_A dx. Integrals have a number of decomposition properties, including:
- μ(∅) = 0 (the integral over an empty set is zero).
- For pairwise disjoint sets A_i, μ(A_1 ∪ A_2) = μ(A_1) + μ(A_2).
- If B ⊆ A, then μ(B) ≤ μ(A) and μ(A∖B) = μ(A) − μ(B).

For a non-empty set S, it is not always possible to integrate on P(S), i.e. the power set of S. Non-measurable sets are typically difficult to construct and play no role in applications, but they do exist (e.g. a Vitali set). We need to restrict attention to a subset A of P(S) (the "measurable" sets).
17.1.1 Measure Space

Definition 72. Let A be a non-empty class of subsets of S. A is an algebra if
1) A^c ∈ A whenever A ∈ A
2) A_1 ∪ A_2 ∈ A whenever A_1, A_2 ∈ A
Definition 73. A is a σ-algebra if A is an algebra and
2') ∪_{n=1}^∞ A_n ∈ A whenever A_n ∈ A for n = 1, 2, ...

Since A is non-empty, (1) and (2) imply ∅ ∈ A and S ∈ A, because

A ∈ A ⟹ A^c ∈ A ⟹ A ∪ A^c = S ∈ A ⟹ S^c = ∅ ∈ A

We can generate an algebra or σ-algebra from any collection of subsets by adding to the collection the complements and unions of all its elements. The smallest σ-algebra is {S, ∅}.

The Borel σ-algebra B(S) is "generated" as follows:
1. Start with T, the collection of all open sets in S
2. Keep adding sets according to the definition of a σ-algebra
3. The resulting σ-algebra with the fewest sets is B(S).

As a result, B(S) contains all open and closed sets. This is the usual σ-algebra we work with when S = R.
Let S be a non-empty set and A a σ-algebra of subsets of S. The pair (S, A) is called a measurable space. A mapping μ : A → [0, ∞] is called a measure if:
1. μ(∅) = 0
2. μ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ μ(A_i) if A_i ∈ A are pairwise disjoint.

The tuple (S, A, μ) forms a measure space.
17.1.2 Density

Intuitively, a density f is a function that transforms a measure μ_1 into a measure μ_2 by pointwise re-weighting on S. Let

μ_2(A) = ∫_A dμ_2(x) = ∫_A f(x) dμ_1(x)

Derivative notation:

dμ_2(x) = f(x) dμ_1(x)   or   (dμ_2/dμ_1)(x) = f(x)

Here f is a function that is integrated to obtain μ_2 and is hence a "derivative" of μ_2. For given μ_1, μ_2 there may not always exist a corresponding density f: "re-weighting" by f cannot work if μ_1(A) = 0 and μ_2(A) ≠ 0. If such a case is ruled out, i.e. if μ_1(A) = 0 implies μ_2(A) = 0 for all A ∈ A, then μ_2 is absolutely continuous with respect to μ_1, denoted μ_2 ≪ μ_1. The Radon-Nikodym theorem tells us that μ_2 has a density w.r.t. μ_1 if and only if μ_2 ≪ μ_1.

17.1.3 Important Measures

Let A ⊆ S and x ∈ S.

Dirac measure δ_x
– Defined as the set function
δ_x(A) = 1 if x ∈ A, and δ_x(A) = 0 if x ∉ A
– δ_x is also called a point mass at x, or an atom on x
Counting measure
– Defined as the set function
μ(A) = #A if A is finite, and μ(A) = ∞ if A is infinite,
where #A is the number of elements in A.

Lebesgue measure λ
– More difficult to define, so we provide an intuitive geometric interpretation in R^n
– λ([a, b]) = λ((a, b)) = b − a, the length of an interval
– λ of a surface segment in R² is its area
– λ of a body in R³ is its volume.
Probability measure P
– Mathematically, P(A) = μ(A) in a measurable space (S, A) with the property that μ(S) = 1
– P includes δ_x and the counting and Lebesgue measures as special cases following the normalization μ(S) = 1
– The probability measure P is thus a large family of measures.
17.1.4 Probability Space

The tuple (S, A, P) forms a probability space. In this context, S is interpreted as the set of all possible outcomes of a random process, and P(A) is interpreted as the probability of event A occurring.

When S is discrete, we usually take A = P(S). When S = R, or some subinterval thereof, we usually take A = B, the Borel σ-algebra.

In cases when S is discrete with a finite number of elements, P is often given by a normalized counting measure. In continuous cases, P is often given by a normalized Lebesgue measure.
17.2 Random Variable

The environment of (S, A, P) is useful for proofs, but not easy to work with for modeling. It is more convenient to work with a transformation of S using the notion of a random variable. Formally, a random variable X is a mapping

X : S → R

with the required property that X is measurable, i.e. the preimage of every Borel set lies in A:

A_X = { X^{−1}(B) : B ∈ B } ⊆ A

The distribution of the r.v. X is the probability measure P_X defined by

P_X(B) = P(X^{−1}(B)),   B ∈ B

Using the random variable X we have transferred (S, A, P) → (R, B, P_X).

Example: We toss two dice, in which case the sample space is S = {(1,1), (1,2), ..., (6,6)}. We can define two random variables, the Sum and the Product:

X_1(S) = {2, 3, 4, 5, ..., 12}
X_2(S) = {1, 2, 3, 4, 5, 6, 8, 9, ..., 36}
17.3 Conditional Probability and Independence

In many cases we have some information available (call this event B) and wish to predict some other event A given that we have observed B.

Definition 74. The probability of an event A given an event B, denoted P(A|B), is obtained as

P(A|B) = P(A ∩ B) / P(B)

when P(B) > 0.

Note that P(·|B) is a probability measure. In particular:
1. P(A|B) ≥ 0
2. P(S|B) = 1
3. P(∪_{i=1}^∞ A_i | B) = Σ_{i=1}^∞ P(A_i|B) for any pairwise disjoint events {A_i}_{i=1}^∞.

When A and B are mutually exclusive events, P(A|B) = 0. When A ⊆ B, then P(A|B) = P(A)/P(B) ≥ P(A), with strict inequality unless P(B) = 1. When B ⊆ A, then P(A|B) = 1.
Definition 75. The law of total probability is stated as

P(A) = P(A ∩ B) + P(A ∩ B^c)
Example: say we have information on 100 stocks ordered into the following table:

                        Today
                    up    down
Yesterday   up      53     25      78
            down    15      7      22
                    68     32     100

Letting A = {up Yesterday} and B = {up Today}, the information here is effectively P(A ∩ B), P(A ∩ B^c), P(A^c ∩ B), P(A^c ∩ B^c). To convert the information into marginal probabilities, we can use the law of total probability. Conditional probabilities can be calculated using the previous definition. E.g.

P(up Yesterday) = 78/100
P(down Yesterday) = 22/100
P(up Today | up Yesterday) = (53/100) / (78/100) = 53/78
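The same computations in a minimal Python sketch, using the joint counts from the table:

```python
import numpy as np

# 2x2 joint-count table from the stock example:
# rows = Yesterday (up, down), columns = Today (up, down).
counts = np.array([[53, 25],
                   [15,  7]])
joint = counts / counts.sum()                 # joint probabilities
p_yesterday = joint.sum(axis=1)               # marginals: [0.78, 0.22]
p_up_today_given_up_yest = joint[0, 0] / p_yesterday[0]   # 53/78
```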
Definition 76 (Independence). Suppose P(A), P(B) > 0. Then A and B are independent events if
1. P(A ∩ B) = P(A)P(B) [symmetric in A and B; does not require P(A), P(B) > 0]
2. P(A|B) = P(A)
3. P(B|A) = P(B)
17.4 Bayes Rule

The Bayes Rule is a formula for updating conditional probabilities, and as such is used in both frequentist and Bayesian inference.

Definition 77 (Bayes Rule). For two sets A and B,

P(B|A) = P(A|B)P(B) / [ P(A|B)P(B) + P(A|B^c)P(B^c) ]
This follows from the definition of conditional probability

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

and the law of total probability

P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A|B)P(B) + P(A|B^c)P(B^c)

Equivalently, Bayes Rule is often stated as

P(B|A) = P(A|B)P(B) / P(A) ∝ P(A|B)P(B)

P(B) is often called the "prior" probability, P(A|B) the "likelihood", and P(B|A) the "posterior" probability.
Mathematica demonstration: Probability of being sick after having tested positive for a
disease
http://demonstrations.wolfram.com/ProbabilityOfBeingSickAfterHavingTestedPositiveForADiseaseBa/
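A minimal numeric sketch in the spirit of the linked demonstration (the prevalence and test accuracy values below are hypothetical):

```python
# Illustrative Bayes Rule: probability of being sick given a positive test.
prior = 0.01            # P(sick)
sensitivity = 0.99      # P(positive | sick)
false_pos = 0.05        # P(positive | not sick)

p_positive = sensitivity * prior + false_pos * (1 - prior)   # total prob.
posterior = sensitivity * prior / p_positive                 # Bayes Rule
print(posterior)        # approx. 0.167: most positives are false positives
```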
18 Random Variables and Distributions
Associated with each random variable is the (cumulative) distribution function

F_X(x) = P_X(X ≤ x)

defined for all x ∈ R. This function effectively replaces P_X.

A discrete random variable can take on a discrete set of values (e.g. gender, age, number of years of schooling). The cdf of X is a step function. The difference between the cdfs of two adjacent elements in its domain is the probability mass function (pmf):

f_X(x_j) = P_X(X = x_j)
Example: x ∈ {0, 1, 2, 3, 4, 5}

x_j         0      1      2      3      4      5
f_X(x_j)    0.70   0.10   0.08   0.06   0.04   0.02
F_X(x_j)    0.70   0.80   0.88   0.94   0.98   1.00
A continuous random variable can take on a continuum of values (e.g. the GDP growth rate, stock returns). The cdf of X is a continuous function. The (Radon-Nikodym) derivative of the cdf is the probability density function (pdf). Note: P_X(X = x) = 0. The set {X = x} is an example of a set of measure zero.

Example: commuting time (figure omitted).
Joint density f(X, Y) of two continuous random variables (figure omitted):

Consider two random variables, X, Y, with a joint distribution F(X, Y). The marginal distribution of Y is just another name for its probability distribution F(Y), irrespective of X. The term is used to distinguish F(Y) alone from the joint distribution of Y and another r.v. Similarly for the marginal density:

P(Y = y_i) = Σ_{j=1}^k P(X = x_j, Y = y_i)
f(Y = y) = ∫ f(X = x, Y = y) dx
The distribution of a random variable Y conditional on another random variable X taking on a specified value is called the conditional distribution of Y given X. In general, for the continuous case

f_{Y|X}(y|x) = f_{Y,X}(y, x) / f_X(x)

and for the discrete case

P(Y = y | X = x) = P(Y = y, X = x) / P(X = x)

Mathematica demonstration: Bivariate joint and conditional Gaussian
http://demonstrations.wolfram.com/TheBivariateNormalAndConditionalDistributions/
18.1 Moments

The k-th moment (or non-central moment) is the quantity

μ'_k = E[X^k]

The k-th moment about the mean (or central moment) is the quantity

μ_k = E[(X − E[X])^k]

Orders:
- the 0th moment is one
- the 1st (non-central) moment is the mean
- the 2nd central moment is the variance
- the 3rd central moment (standardized) gives skewness
- the 4th central moment (standardized) gives (excess) kurtosis

Moments of order k > 2 are typically called "higher-order moments". Moments of order higher than a certain k may not exist for certain distributions.
18.1.1 Moment Generating Function

For a random variable X with distribution function F, its moment generating function (MGF) is

m_X(t) = E[exp(tX)] = ∫ exp(tX) dF(X)   (67)

The MGF is also known as the Laplace transformation of the density of X.

The r-th derivative evaluated at zero is the r-th uncentered moment of X:

m_X^{(r)}(t) = E[ (d^r/dt^r) exp(tX) ] = E[ X^r exp(tX) ]

and thus

m_X^{(r)}(0) = E[X^r].

A limitation of the MGF is that it does not exist (i.e. is not finite) for many random variables. Finiteness of the integral (67) requires the tail of the density of X to decline exponentially. This excludes thick-tailed distributions such as the Pareto.
18.1.2 Characteristic Function

This limitation is removed if we consider the characteristic function (CF) of X, which is defined as

φ_X(t) = E[exp(itX)] = ∫ exp(itX) dF(X)

where i = √(−1). The CF is also known as the Fourier transformation of the density of X. The CF exists for all random variables and all values of t, since

exp(itX) = cos(tX) + i sin(tX)

is bounded. Similarly to the MGF, the r-th derivative of the characteristic function evaluated at zero takes the simple form

φ_X^{(r)}(0) = i^r E[X^r].

φ_X(t) is named the characteristic function since it completely and uniquely characterizes the distribution F(X) of X.

Theorem 78 (Power Series Expansion of CF). If φ_X(t) converges on some open interval containing t = 0, then X has moments of any order and

φ_X(t) = Σ_{k=0}^∞ ((it)^k / k!) E[X^k]

The Theorem implies that φ_X(t), and hence F(X), are uniquely characterized by the collection of their moments.

Example 1: if X ∼ N(μ, σ²) then φ_X(t) = exp(iμt − σ²t²/2).
Example 2: if X ∼ Poisson(λ) then φ_X(t) = exp(λ(exp(it) − 1)).
Theorem 79 (Convolution Theorem). Let X_1, X_2, ..., X_n be independent random variables and let S = Σ_{i=1}^n X_i. Then

φ_S(t) = Π_{i=1}^n φ_{X_i}(t).

Example: Let X ∼ Poisson(λ_X), Y ∼ Poisson(λ_Y), and Z = X + Y. Then

φ_Z(t) = φ_X(t) φ_Y(t)
       = exp(λ_X (exp(it) − 1)) exp(λ_Y (exp(it) − 1))
       = exp((λ_X + λ_Y)(exp(it) − 1))

and hence Z ∼ Poisson(λ_X + λ_Y).
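A quick simulation check of this convolution result (parameter values and seed are illustrative):

```python
import numpy as np

# X ~ Poisson(2), Y ~ Poisson(3) independent, so X + Y ~ Poisson(5):
# mean and variance of the sum should both be approximately 5.
rng = np.random.default_rng(0)
z = rng.poisson(2.0, 100_000) + rng.poisson(3.0, 100_000)
print(z.mean(), z.var())
```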
Transformations of Random Variables
Goal: given a distribution FX (X) of a random variable X; …nd a distribution FY (Y ) of the
r.v.
Y = g(X)
where g is a deterministic function. Denote by fX and fY the respective density functions.
18.2.1 Monotonic Transformations

Let g(X) be a monotonic (i.e. increasing or decreasing) transformation.

1. Case g^{−1}'(y) > 0:

F_Y(y) = P(Y ≤ y) = P(g^{−1}(Y) ≤ g^{−1}(y)) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y))

Then,

f_Y(y) = F'_Y(y) = F'_X(g^{−1}(y)) g^{−1}'(y) = f_X(g^{−1}(y)) g^{−1}'(y)

2. Case g^{−1}'(y) < 0:

F_Y(y) = P(Y ≤ y) = P(g^{−1}(Y) ≥ g^{−1}(y)) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y))

Then,

f_Y(y) = F'_Y(y) = −F'_X(g^{−1}(y)) g^{−1}'(y) = −f_X(g^{−1}(y)) g^{−1}'(y)

Note that here −g^{−1}'(y) > 0. Hence, in both cases,

f_Y(y) = f_X(g^{−1}(y)) |g^{−1}'(y)|
Example: When X has a standard normal density function, X ∼ N(0, 1), obtain the density of Y = μ + σX. Since

X = g^{−1}(Y) = (Y − μ)/σ

then

g^{−1}'(Y) = 1/σ

Therefore,

f_Y(y) = f_X(g^{−1}(y)) |g^{−1}'(y)| = (1/√(2πσ²)) exp( −(y − μ)² / (2σ²) )
18. RANDOM VARIABLES AND DISTRIBUTIONS
18.2.2
ECO2010, Summer 2022
Non-Monotonic Transformation
For illustration we will consider a special case where g(X) a quadratic transformation. More
general cases follow by direct extension. Let
Y = g(X) = X 2
Then,
FY (y) = P (Y
y) = P (X 2
y) = P
1
1
y2 < X
y2
1
= FX y 2
FX
1
y2
and hence
1
y
2
1
fY (y) = FY0 (y) = FX0 y 2
1
y
2
=
Example: Suppose X
1
2
h
1
fX y 2 + fX
1
2
2
FX
1
y2
1
y2
i
1
y
2
1
2
N (0; 1); and obtain the density of Y = X 2 : Then,
1
1 2
fX (x) = p exp
x
2
2
1 1
1
p exp
fY (y) =
y 2
2
2
y
=p
1
2
exp
1
1
y2
2
2
1
+ p exp
2
1
1
y2
2
2
1
y
2
Note that we used f (x) = f ( x) that holds for N (0; 1) by symmetry around zero. The
resulting fY (y) is the density of the 2 distribution with 1 degree of freedom.
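A quick simulation check that squared standard normal draws follow χ²₁ (sample size and seed are illustrative):

```python
import numpy as np
from scipy import stats

# Squaring N(0,1) draws gives chi-square(1); compare an empirical
# probability with the chi2(1) cdf. Both should be approx. 0.6827.
rng = np.random.default_rng(1)
y = rng.standard_normal(200_000) ** 2
print((y <= 1.0).mean(), stats.chi2.cdf(1.0, df=1))
```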
18.3 Parametric Distributions

Families of distributions (or densities) are often parametrized by θ ∈ Θ ⊆ R^n and expressed in the form {F_X(x|θ) : θ ∈ Θ}. We will take a closer look at several continuous distributions.

Chi-square distribution (pdf and cdf plots omitted): If Z_i ∼ N(0, 1) and X = Σ_{i=1}^k Z_i², then X ∼ χ²_k. This result follows from a combination of the Convolution Theorem and the transformation of random variables.

t-distribution (pdf and cdf plots omitted): If Z ∼ N(0, 1) and X ∼ χ²_n, then T = Z / √(X/n) ∼ t_n.

F distribution (pdf and cdf plots omitted): If X ∼ χ²_{d1} and Y ∼ χ²_{d2}, then Q = (X/d_1)/(Y/d_2) ∼ F_{d1,d2}.
Other distributions:
Benford, Beta, Boltzmann, categorical, Conway-Maxwell-Poisson, compound Poisson, discrete phase-type, extended negative binomial, Gauss-Kuzmin, geometric, hypergeometric, Irwin-Hall, Kumaraswamy, logarithmic, negative binomial, parabolic fractal, Rademacher, raised cosine, Skellam, triangular, U-quadratic, Wigner semicircle, Yule-Simon, Zipf, Zipf-Mandelbrot, zeta, Beta prime, Bose-Einstein, Burr, chi, Coxian, Erlang, exponential, Fermi-Dirac, folded normal, Fréchet, Gamma, generalized extreme value, generalized inverse Gaussian, half-logistic, half-normal, Hotelling's T-square, hyper-exponential, hypo-exponential, inverse chi-square (scaled inverse chi-square), inverse Gaussian, inverse gamma, Lévy, log-normal, log-logistic, Maxwell-Boltzmann, Maxwell speed, Nakagami, noncentral chi-square, Pareto, phase-type, Rayleigh, relativistic Breit-Wigner, Rice, Rosin-Rammler, shifted Gompertz, truncated normal, type-2 Gumbel, Weibull, Wilks' lambda, Cauchy, extreme value, exponential power, Fisher's z, generalized normal, generalized hyperbolic, Gumbel, hyperbolic secant, Landau, Laplace, logistic, normal inverse Gaussian, skew normal, slash, type-1 Gumbel, Variance-Gamma, Voigt, ...

Reference: Balakrishnan and Nevzorov (2003)
19 Statistical Properties of Estimators
A generic econometric model can be written in the form

{ f(X; θ) : θ ∈ Θ }

where X is a matrix of random variables, θ is a vector of parameters, and Θ is a parameter space.

Consider a data sample x = (x_1, ..., x_n) treated as realizations of the random variables X = (X_1, ..., X_n). An estimator is a function θ̂(X) of (X_1, ..., X_n) intended as a basis for learning about the unknown θ. An estimate is the realized value θ̂(x) for a particular data sample x. Desirable properties of estimators: unbiasedness, consistency, efficiency, and a tractable asymptotic distribution (e.g. Normality).
Examples of parameters and their estimators:

1. Analysis without an econometric model. Population quantities and their estimators:
Mean: μ_X = E[X];  sample mean: X̄ = n^{−1} Σ_{i=1}^n X_i
Variance: σ²_X = E[(X − μ_X)²];  sample variance: S²_X = (n − 1)^{−1} Σ_{i=1}^n (X_i − X̄)²
Covariance: σ_{XY} = E[(X − μ_X)(Y − μ_Y)];  sample covariance: S_{XY} = (n − 1)^{−1} Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ)

2. Analysis with an econometric model. Population quantity: the simple regression slope β. Its estimator:
β̂ = Σ_{i=1}^n X_i Y_i / Σ_{i=1}^n X_i²

19.1 Sampling Distribution
Recall that an estimator θ̂(X) is a random variable. For example, the sample mean X̄ is a combination (i.e. a function) of other random variables X_i, which makes X̄ a (composite) random variable. The probability density function (pdf) of the estimator θ̂(X) is called the sampling distribution. In most cases the sampling distribution of θ̂(X) is unknown, but there are ways to estimate or approximate it (using asymptotics or bootstrap procedures). The sampling distribution is used for hypothesis testing and for evaluating the properties of estimators.
We can broadly distinguish between two types of properties:
1. Small sample or finite sample properties (for fixed n < ∞):
(a) Bias
(b) Efficiency
(c) Mean squared error
2. Large sample or asymptotic properties (as n → ∞):
(a) Consistency
(b) Asymptotic distribution
19.2 Finite Sample Properties

The bias of θ̂ is given by E[θ̂] − θ. An estimator is unbiased if the mean of its sampling distribution satisfies E[θ̂] = θ. Example (figure omitted): θ̂ is a biased estimator and θ̃ is an unbiased estimator of θ.

If θ̂ and θ̃ are two unbiased estimators of θ, then θ̃ is efficient relative to θ̂ if Var(θ̃) ≤ Var(θ̂).

The mean square error (MSE) allows us to compare estimators that are not necessarily unbiased:

MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + bias²(θ̂)

We prefer estimators with smaller MSE.
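A minimal simulation sketch of these concepts, comparing the biased (1/n) and unbiased (1/(n−1)) variance estimators; all values are illustrative:

```python
import numpy as np

# The 1/n variance estimator is biased downward; the 1/(n-1) version is
# unbiased. For each, MSE = Var + bias^2.
rng = np.random.default_rng(2)
n, reps, sigma2 = 10, 50_000, 1.0
x = rng.standard_normal((reps, n))              # true variance sigma2 = 1
for ddof, label in ((0, "1/n"), (1, "1/(n-1)")):
    s2 = x.var(axis=1, ddof=ddof)
    bias = s2.mean() - sigma2
    print(label, "bias:", round(bias, 4), "MSE:", round(s2.var() + bias**2, 4))
```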
19.3 Convergence in Probability and Consistency

A sequence of deterministic (i.e. nonstochastic) real numbers {a_n} converges to a limit a if, for any ε > 0, there exists n* = n*(ε) such that, for all n > n*,

|a_n − a| < ε   (68)

which implies

P(|a_N − a| < ε) = 1 for all N > n*

For example, if a_n = 2 + 3/n, then the limit is a = 2, since |a_n − a| = |2 + 3/n − 2| = |3/n| < ε for all n > n* = 3/ε. For a sequence of random (i.e. stochastic) variables {X_n}, we can never be completely certain that the condition (68) will always be satisfied, even for large n. Instead, we require that the probability that (68) is satisfied approach 1 as n approaches infinity:

P(|X_n − c| < ε) → 1 as n → ∞
A formal definition of this requirement is as follows:

Definition 80 (Convergence in Probability). A sequence of random variables {X_n} converges in probability to a limit c if, for any ε > 0 and δ > 0, there exists n*(ε, δ) such that, for all n > n*,

P(|X_n − c| < ε) > 1 − δ

We write

plim(X_n) = c

or, equivalently,

X_n →^p c
Recall that an estimator θ̂_n is a random variable. Note that θ̂_n can be viewed as a sequence indexed by the sample size n.

Definition 81. If θ̂_n converges in probability to θ_0, then θ̂_n is a consistent estimator of θ_0.

Thus, for a consistent estimator θ̂_n of θ_0, it holds by definition that

θ̂_n →^p θ_0   or   plim(θ̂_n) = θ_0
Note that it is possible for an estimator to be biased in small samples, but consistent (i.e., it is biased in small samples, but this bias goes away as n gets large). For example, for b ∈ R with b ≠ 0 we could have

E[θ̂_n] = θ_0 + b/n
lim_{n→∞} E[θ̂_n] = θ_0

In general, as long as the variance of an estimator collapses to zero and any small-sample bias disappears as the sample size gets larger, the estimator will be consistent.
Convergence in probability is preserved under deterministic transformations. Formally, this is expressed by the following Theorem:

Theorem 82 (Slutsky's Theorem). Let X_n be a finite-dimensional vector of random variables, and g(·) a real-valued function continuous at a vector of constants c. Then

X_n →^p c   implies   g(X_n) →^p g(c)
Laws of large numbers (LLN) are theorems on convergence in probability in the special case of sample averages.

Theorem 83 (Weak Law of Large Numbers). Let {X_1, ..., X_n} be a sequence of n independent and identically distributed (iid) random variables with E[X_i] = μ < ∞. Then

X̄_n ≡ (1/n) Σ_{i=1}^n X_i →^p μ

There are a number of different versions of LLNs. These include the Kolmogorov and Markov LLNs, which imply convergence in probability.

Mathematica demonstration: Law of Large Numbers
http://demonstrations.wolfram.com/IllustratingTheLawOfLargeNumbers/
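A minimal simulation sketch of the LLN (the Uniform(0, 1) choice and seed are illustrative):

```python
import numpy as np

# Running averages of iid Uniform(0,1) draws approach the mean 0.5.
rng = np.random.default_rng(3)
x = rng.uniform(size=1_000_000)
for n in (10, 1_000, 1_000_000):
    print(n, x[:n].mean())
```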
19.4 Convergence in Distribution and Asymptotic Normality

Definition 84 (Convergence in Distribution). Let {X_n} be a sequence of random variables and assume that X_n has cdf F_n(X_n). {X_n} is said to converge in distribution to a random variable X with cdf F(X) if

lim_{n→∞} F_n(X_n) = F(X).

We write

X_n →^d X

and we call F the limit distribution of {X_n}.
p
Convergence in probability implies convergence in distribution; that is Xn ! c implies
d
Xn ! c where the constant c can be though of as a degenerate random variable with a
probability mass of 1 placed at c. In general the converse does not hold.
Theorem 85 (Continuous Mapping Theorem). Let $X_n$ be a finite-dimensional vector of random variables, and $g(\cdot)$ be a continuous real-valued function. Then
$$X_n \xrightarrow{d} X$$
implies
$$g(X_n) \xrightarrow{d} g(X)$$

Theorem 86 (Transformation Theorem). If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ where $X$ is a random variable and $c$ is a constant, then

(i) $X_n + Y_n \xrightarrow{d} X + c$

(ii) $X_n Y_n \xrightarrow{d} cX$

(iii) $X_n / Y_n \xrightarrow{d} c^{-1} X$ if $c \neq 0$.
Central limit theorems (CLT) are theorems on convergence in distribution of sample averages. Intuition: Take a (small) sample of independent measurements of a random variable and compute its average (Examples: commuting time, website clicks). Take another such sample and compute its average, another sample and average, etc. Keep repeating. The distribution (histogram) of such averages becomes Normal (Gaussian) as $n \to \infty$.

Theorem 87 (Classical Central Limit Theorem). Let $\{X_1, \ldots, X_n\}$ be a sequence of $n$ independent and identically distributed (iid) random variables, with $E[X_i] = \mu < \infty$ and $0 < Var(X_i) = \sigma^2 < \infty$. Then as $n \to \infty$, the distribution of $\bar{X}_n \equiv n^{-1} \sum_{i=1}^n X_i$ converges to the Normal distribution with mean $\mu$ and variance $\sigma^2/n$, i.e.
$$\bar{X}_n \xrightarrow{d} N(\mu, \sigma^2/n)$$
irrespective of the shape of the original distribution of the $X_i$.
There are a number of different versions of CLTs. These include the Lindeberg–Levy and Liapounov CLT.

From a LLN, the sample average has a degenerate limit distribution, as it converges to the constant $\mu$. We scale $(\bar{X}_n - \mu)$ by $n^{1/2}\sigma^{-1}$, the inverse of the standard deviation of $\bar{X}_n$, to construct a random variable with unit variance. Hence, the CLT implies that
$$n^{1/2}(\bar{X}_n - \mu)/\sigma = \sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$$
The CLT is used to derive the asymptotic distribution of a number of commonly used estimators (e.g. GMM, ML).

Demonstration: CLT
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
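A minimal simulation sketch of Theorem 87 (illustrative choices, not from the notes: exponential parent distribution with $\mu = \sigma = 1$). The standardized sample mean is approximately $N(0,1)$ even though the parent distribution is strongly skewed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 100_000
# Exponential(1) parent distribution: mean 1, variance 1, heavily skewed
xbars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

z = np.sqrt(n) * (xbars - 1.0) / 1.0   # sqrt(n)(Xbar_n - mu)/sigma
print(z.mean(), z.var())               # both approximately 0 and 1
# A histogram of z is close to the N(0,1) density despite the skewed parent.
```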
20 Stochastic Orders and Delta Method

20.1 A-notation
Consider the so-called "A-notation", which means "absolutely at most." For example, $A(2)$ stands for a quantity whose absolute value is less than or equal to 2. This notation has a natural connection with decimal numbers: saying that $\pi$ is approximately 3.14 is equivalent to saying that $\pi = 3.14 + A(0.005)$.

Another example:
$$10^{A(2)} = A(100)$$
We can use the A-notation in general operations. For example:
$$(3.14 + A(0.005))(1 + A(0.01)) = 3.14 + A(0.005) + A(0.0314) + A(0.00005)$$
The A-notation applies to variable quantities as well as to constant ones. Examples:
$$\sin(x) = A(1)$$
$$A(x) = x \cdot A(1)$$
$$A(x) + A(y) = A(x + y) \text{ if } x \geq 0 \text{ and } y \geq 0$$
$$(1 + A(t))^2 = 1 + 3A(t) \text{ if } t = A(1)$$
The equality sign "=" is not symmetric with respect to such notation. We have $3 = A(5)$ and $4 = A(5)$ but not $A(5) = 4$. In this context, the equality sign is used as the word "is" in English: Aristotle is a man, but a man isn't necessarily Aristotle.
20.2 O-notation

The O-notation ("big oh" notation) is conceptually similar to the A-notation, except that the former is less specific. In its simplest form, $O(x)$ stands for something that is $cA(x)$ for some constant $c$, but we do not specify what $c$ is. For example,
$$-O(n) = O(n)$$
Similarly, for $n > 0$,
$$\left(n + O(\sqrt{n})\right)\left(\ln n + \gamma + O(1/n)\right) = n \ln n + \gamma n + O(1) + O(\sqrt{n}\ln n) + O(\sqrt{n}) + O(1/\sqrt{n})$$
The O-notation is most useful for comparing the relative values of two functions, $f(x)$ and $g(x)$, as $x$ approaches $\infty$ or $0$, depending on which of these two cases is being considered. We will suppose that $g$ is positive-valued and that $x > 0$.

1. The case $x \to \infty$: $f(x) = O(g(x))$ if there exist constants $A$ and $x_0 > 0$ such that $\frac{|f(x)|}{g(x)} < A$ for all $x > x_0$.

2. The case $x \to 0$: $f(x) = O(g(x))$ if there exist constants $A$ and $x_0 > 0$ such that $\frac{|f(x)|}{g(x)} < A$ for all $x < x_0$.
Example 1: The function $f(x) = 3x^3 + 4x^2$ is $O(x^3)$ as $x \to \infty$. Here, $g(x) = x^3$. We have that
$$\frac{|f(x)|}{g(x)} = \frac{3x^3 + 4x^2}{x^3} = 3 + \frac{4}{x}$$
There exist infinitely many pairs of $A$ and $x_0$ to show that $f$ is $O(x^3)$; for example
$$\frac{|f(x)|}{g(x)} < 4.1 \text{ for all } x > 40.$$
Example 2: The function $f(x) = 3x^3 + 4x^2$ is $O(x^2)$ as $x \to 0$. Here
$$\frac{|f(x)|}{g(x)} = \frac{3x^3 + 4x^2}{x^2} = 3x + 4$$
and, for example,
$$\frac{|f(x)|}{g(x)} < 4.3 \text{ for all } x < 0.1.$$
20.3 o-notation

The o-notation ("little oh") is equivalent to the taking of limits, as opposed to specifying bounds.

1. The case $x \to \infty$: $f(x) = o(g(x))$ if
$$\lim_{x \to \infty} \frac{|f(x)|}{g(x)} = 0$$

2. The case $x \to 0$: $f(x) = o(g(x))$ if
$$\lim_{x \to 0} \frac{|f(x)|}{g(x)} = 0$$

Example 3: The function $f(x) = 3x^3 + 4x^2$ is $o(x^4)$ as $x \to \infty$, because
$$\lim_{x \to \infty} \frac{|f(x)|}{g(x)} = \lim_{x \to \infty} \frac{3x^3 + 4x^2}{x^4} = \lim_{x \to \infty} \left( \frac{3}{x} + \frac{4}{x^2} \right) = 0$$
Example 4: The function $f(x) = 3x^3 + 4x^2$ is $o(x)$ as $x \to 0$, because
$$\lim_{x \to 0} \frac{|f(x)|}{g(x)} = \lim_{x \to 0} \frac{3x^3 + 4x^2}{x} = \lim_{x \to 0} (3x^2 + 4x) = 0$$
Example 5: The function $f(x) = 3x^3 + 4x^2$ is $o(1)$ as $x \to 0$, because
$$\lim_{x \to 0} \frac{|f(x)|}{g(x)} = \lim_{x \to 0} (3x^3 + 4x^2) = 0$$
20.4 Op-notation

Definition 88. Let $\{Y_n\}$ be a sequence of random variables and $g(n)$ a real-valued function of the positive integer argument $n$. Then $Y_n$ is $O_p(g(n))$ if for all $\delta > 0$, there exist a constant $K$ and a positive integer $N$ such that
$$P\left( \left| \frac{Y_n}{g(n)} \right| > K \right) < \delta \text{ for all } n > N.$$

Definition 89. Let $\{Y_n\}$ be a sequence of random variables and $g(n)$ a real-valued function of the positive integer argument $n$. Then $Y_n$ is $o_p(g(n))$ if for all $\varepsilon, \delta > 0$, there exists a positive integer $N$ such that
$$P\left( \left| \frac{Y_n}{g(n)} \right| > \varepsilon \right) < \delta \text{ for all } n > N$$
or equivalently, for all $\varepsilon > 0$,
$$\lim_{n \to \infty} P\left( \left| \frac{Y_n}{g(n)} \right| > \varepsilon \right) = 0$$
or equivalently
$$\mathrm{plim}\, \frac{Y_n}{g(n)} = 0.$$

Example: Let $X_1, X_2, \ldots, X_n$ be iid random variables distributed according to $F$, with mean $\mu < \infty$ and variance $\sigma^2 < \infty$. Then we know that for the average $\bar{X}_n$ it holds that
$$\bar{X}_n - \mu = o_p(1) \text{ (LLN)}$$
$$\bar{X}_n - \mu = O_p(1/\sqrt{n}) \text{ (CLT)}$$
$$\sqrt{n}(\bar{X}_n - \mu) = O_p(1)$$
There are many simple rules for manipulating $o_p(1)$ and $O_p(1)$ sequences, which can be deduced from the Continuous Mapping Theorem. For example:
$$o_p(1) + o_p(1) = o_p(1)$$
$$o_p(1) + O_p(1) = O_p(1)$$
$$O_p(1) + O_p(1) = O_p(1)$$
$$o_p(1)\,o_p(1) = o_p(1)$$
$$o_p(1)\,O_p(1) = o_p(1)$$
$$O_p(1)\,O_p(1) = O_p(1)$$
20.5 Delta Method

20.5.1 Probabilistic Taylor Expansion

Theorem 90. Let $\{X_n\}$ be a sequence of random variables such that $X_n = a + O_p(r_n)$ where $a \in \mathbb{R}$ and $0 < r_n \to 0$ as $n \to \infty$. If $g$ is a function with $s$ continuous derivatives at $a$ then
$$g(X_n) = \sum_{j=0}^{s} \frac{g^{(j)}(a)}{j!} (X_n - a)^j + o_p(r_n^s).$$
(Reference: Proposition 6.1.5 in P. Brockwell and R. Davis, Time Series: Theory and Methods, Springer, 1991.)

Consequently, for a sequence of $k$-vectors $\{X_n\}$ such that $X_n = a + O_p(r_n)$ where $a \in \mathbb{R}^k$ and $0 < r_n \to 0$ as $n \to \infty$,
$$g(X_n) = g(a) + \nabla g(a)'(X_n - a) + o_p(r_n). \tag{69}$$
Theorem 91 (Multivariate Delta Method). Suppose that $\{X_n\}$ is a sequence of random $k$-vectors such that
$$\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \Sigma).$$
Let $g: \mathbb{R}^k \to \mathbb{R}$ be a differentiable function. Then
$$\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, \nabla g(\theta)' \Sigma \nabla g(\theta)).$$
To show that this result holds, note that, using (69),
$$\begin{aligned}
Var(g(X_n)) &= Var(g(\theta) + \nabla g(\theta)'(X_n - \theta)) + o_p(r_n) \\
&= Var(g(\theta) + \nabla g(\theta)' X_n - \nabla g(\theta)'\theta) + o_p(r_n) \\
&= Var(\nabla g(\theta)' X_n) + o_p(r_n) \\
&= \nabla g(\theta)'\, Var(X_n)\, \nabla g(\theta) + o_p(r_n)
\end{aligned}$$
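The approximation $Var(g(X_n)) \approx \nabla g(\theta)'\,Var(X_n)\,\nabla g(\theta)$ can be checked numerically. Below is a small sketch under illustrative assumptions not taken from the notes ($g(x) = e^x$ and Normal data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, mu, sigma = 200, 50_000, 1.0, 0.5

# Simulated variance of g(Xbar_n) for g(x) = exp(x)
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(np.exp(xbar).var())

# Delta-method approximation: g'(mu)^2 * Var(Xbar_n) = exp(mu)^2 * sigma^2 / n
print(np.exp(mu) ** 2 * sigma**2 / n)
```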
21 Regression with Matrix Algebra

21.1 Multiple Regression in Matrix Notation
Recall the linear multiple regression model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + u_i \tag{70}$$
For each $i$, define the $1 \times (k+1)$ vector
$$x_i = (1, x_{i1}, x_{i2}, \ldots, x_{ik})$$
and the $(k+1) \times 1$ vector of parameters
$$\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$$
Then we can write the equation (70) as
$$y_i = x_i\beta + u_i$$
We can now write the linear multiple regression model in matrix notation for all $i = 1, \ldots, n$:
$$\underset{n \times (k+1)}{X} \equiv \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}$$
Define also $u$ as the $n \times 1$ vector of unobservable errors. Then the multiple regression for all $i = 1, \ldots, n$ can be written compactly as
$$y = X\beta + u$$
where $y$ is $(n \times 1)$, $X$ is $(n \times (k+1))$, $\beta$ is $((k+1) \times 1)$, $X\beta$ is $(n \times 1)$, and $u$ is $(n \times 1)$.

Ordinary Least Squares (OLS) principle: minimize the sum of squared residuals (SSR). Define the SSR as a function of the $(k+1) \times 1$ parameter vector $b$ as
$$SSR(b) \equiv \sum_{i=1}^n (y_i - x_i b)^2$$
The $(k+1) \times 1$ vector of OLS estimates $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)'$ minimizes $SSR(b)$ over all possible vectors $b$.
Mathematica demonstration: Least Squares
http://demonstrations.wolfram.com/LeastSquaresCriteriaForTheLeastSquaresRegressionLine/
$\hat{\beta}$ solves the first-order condition
$$\frac{\partial SSR(\hat{\beta})}{\partial b} = 0$$
with the solution
$$\sum_{i=1}^n x_i'(y_i - x_i\hat{\beta}) = 0$$
or equivalently
$$X'(y - X\hat{\beta}) = 0$$
which are called the "normal equations". Solving the FOC
$$X'(y - X\hat{\beta}) = 0$$
for $\hat{\beta}$ yields
$$X'y = (X'X)\hat{\beta}$$
$$\hat{\beta} = (X'X)^{-1}X'y$$
We need $(X'X)$ to be invertible, which is guaranteed by the assumption MLR.3 of no perfect multicollinearity (i.e. no exact linear relationship among the independent variables, giving us full rank of $X'X$).
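A minimal numerical sketch of this formula (the simulated design and coefficients are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # first column of ones
beta = np.array([1.0, 0.5, -2.0, 0.3])
y = X @ beta + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y; solving the normal equations avoids an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the true coefficient vector
```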
The $n \times 1$ vectors of fitted values and residuals are given by
$$\hat{y} = X\hat{\beta}$$
and
$$\hat{u} = y - \hat{y} = y - X\hat{\beta}$$
The sum of squared residuals can be written as
$$SSR(\hat{\beta}) = \sum_{i=1}^n \hat{u}_i^2 = \hat{u}'\hat{u} = (y - X\hat{\beta})'(y - X\hat{\beta})$$

21.2 MLR Assumptions in Matrix Notation
MLR.1 The model is linear in parameters:
$$y = X\beta + u \tag{71}$$
MLR.2 We have a random sample $X, y$ following (71).

MLR.3 There is no perfect multicollinearity, i.e. $\mathrm{rank}(X) = k + 1$.

MLR.4 The error $u$ has an expected value of zero conditional on $X$:
$$E[u|X] = 0$$
MLR.5 The error $u$ is homoskedastic:
$$Var[u|X] = \sigma_u^2 I_n$$
where $I_n$ is the $n \times n$ identity matrix.
21.3 Projection Matrices

Define the matrices
$$P = X(X'X)^{-1}X'$$
$$M = I_n - P$$
where $I_n$ is the $n \times n$ identity matrix. $P$ and $M$ are called projection matrices due to the property that for any matrix $Z$ which can be written as
$$Z = X\Gamma$$
for some matrix $\Gamma$,
$$PZ = PX\Gamma = X(X'X)^{-1}X'X\Gamma = X\Gamma = Z$$
and
$$MZ = (I_n - P)Z = Z - PZ = Z - Z = 0$$
The matrices $P$ and $M$ are symmetric and idempotent. By definition of $P$ and $M$,
$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Py$$
and
$$\hat{u} = y - \hat{y} = y - Py = My$$
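These algebraic properties are easy to verify numerically; a small sketch (simulated design matrix, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # P = X (X'X)^{-1} X'
M = np.eye(n) - P

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(P, P.T), np.allclose(M, M.T))       # symmetric
print(np.allclose(M @ X, 0))                          # M annihilates X (the Z = X case)
print(np.allclose(P @ y + M @ y, y))                  # y_hat + u_hat = y
```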
21.4 Unbiasedness of OLS

Theorem 92 (Unbiasedness of OLS). Under MLR.1–MLR.4, $E[\hat{\beta}|X] = \beta$.

Proof. Use MLR.1–MLR.3 to write
$$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = (X'X)^{-1}(X'X)\beta + (X'X)^{-1}X'u = \beta + (X'X)^{-1}X'u$$
Taking the expectation conditional on $X$ and using MLR.4 yields
$$E[\hat{\beta}|X] = \beta + (X'X)^{-1}X'E[u|X] = \beta$$
21.5 Variance-Covariance Matrix of the OLS Estimator

Theorem 93. Under MLR.1–MLR.5,
$$Var(\hat{\beta}|X) = \sigma_u^2 (X'X)^{-1}$$
Proof. Using the unbiasedness proof of $\hat{\beta}$ and MLR.5 we obtain
$$\begin{aligned}
Var(\hat{\beta}|X) &= Var(\beta + (X'X)^{-1}X'u \,|\, X) \\
&= Var((X'X)^{-1}X'u \,|\, X) \\
&= (X'X)^{-1}X'\, Var(u|X)\, X(X'X)^{-1} \\
&= (X'X)^{-1}X' \sigma_u^2 I_n X(X'X)^{-1} \\
&= \sigma_u^2 (X'X)^{-1}X'X(X'X)^{-1} \\
&= \sigma_u^2 (X'X)^{-1}
\end{aligned}$$
Mathematica demonstration: Regression with Transformations
http://demonstrations.wolfram.com/RegressionModelWithTransformations/
21.6 Gauss-Markov in Matrix Notation

Theorem 94 (Gauss-Markov). Under MLR.1–MLR.5, $\hat{\beta}$ is the best linear unbiased estimator.

Proof. Any other linear estimator of $\beta$ can be written as
$$\tilde{\beta} = A'y = A'(X\beta + u) = A'X\beta + A'u \tag{72}$$
where $A$ is an $n \times (k+1)$ matrix such that (for unbiasedness)
$$E[\tilde{\beta}|X] = \beta$$
$$A'X\beta + A'E[u|X] = \beta$$
$$A'X\beta = \beta$$
implying $A'X = I_{k+1}$.

From (72) we have
$$Var(\tilde{\beta}|X) = Var(A'u|X) = A'\,Var(u|X)\,A = \sigma_u^2 A'A$$
Using $A'X = I_{k+1}$, we have
$$\begin{aligned}
Var(\tilde{\beta}|X) - Var(\hat{\beta}|X) &= \sigma_u^2 [A'A - (X'X)^{-1}] \\
&= \sigma_u^2 [A'A - A'X(X'X)^{-1}X'A] \\
&= \sigma_u^2 A'[I_n - X(X'X)^{-1}X']A \\
&= \sigma_u^2 A'MA
\end{aligned}$$
where $M \equiv I_n - X(X'X)^{-1}X'$ is symmetric and idempotent (i.e. can be written as $MM$), implying that $\sigma_u^2 A'MA$ is positive semi-definite. Thus,
$$Var(\tilde{\beta}|X) \geq Var(\hat{\beta}|X)$$
for any other linear unbiased estimator $\tilde{\beta}$.
22 Maximum Likelihood

22.1 Method of Maximum Likelihood
If the distribution of $y_i$ is $F(y; \theta)$ where $F$ is a known distribution function and $\theta \in \Theta$ is an unknown $m \times 1$ vector, we say that the distribution is parametric and that $\theta$ is the parameter of the distribution $F$. The space $\Theta$ is the set of permissible values for $\theta$. In this setting the method of maximum likelihood (ML) is the appropriate technique for estimation and inference on $\theta$. ML can be used for models that are linear or non-linear in parameters. The trade-off: an assumption has to be made about the form of $F(y; \theta)$.

If the distribution $F$ is continuous then the density of $y_i$ can be written as $f(y; \theta)$ and the joint density of a random sample $(y_1, \ldots, y_n)$ is
$$f_n(y_1, \ldots, y_n; \theta) = \prod_{i=1}^n f(y_i; \theta)$$
The likelihood of the sample is this joint density evaluated at the observed sample values, viewed as a function of $\theta$,
$$L(\theta) = f_n(y_1, \ldots, y_n; \theta)$$
The log-likelihood function is its natural log
$$\ln L(\theta) = \sum_{i=1}^n \ln f(y_i; \theta)$$
If the distribution $F$ is discrete, the likelihood and log-likelihood are constructed by setting $f(y; \theta) = P(y_i = y; \theta)$.

The maximum likelihood estimator $\hat{\theta}_{MLE}$ is the parameter value which maximizes the likelihood (equivalently, which maximizes the log-likelihood). We can write this as
$$\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} \ln L(\theta)$$
In some simple cases, we can find an explicit expression for $\hat{\theta}_{MLE}$ as a function of the data (the OLS estimator is a special case under the Normality assumption on the residuals!), but these cases are relatively rare. More typically, $\hat{\theta}_{MLE}$ must be found by numerical methods.
22.2 Examples

22.2.1 Example 1

Suppose we have $n$ independent observations $y_1, \ldots, y_n$ from a $N(\mu, \sigma^2)$ distribution. What is the ML estimate of $\mu$ and $\sigma^2$?
Write the likelihood function as
$$L(\mu, \sigma^2 | y_1, \ldots, y_n) = \prod_{i=1}^n f(y_i) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^n \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right)$$
Its logarithm is more convenient to maximize:
$$l = \ln L(\mu, \sigma^2 | y_1, \ldots, y_n) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2$$
To compute the maximum we need the partial derivatives:
$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu)$$
$$\frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2$$
The maximum likelihood estimators are those values $\hat{\mu}$ and $\hat{\sigma}^2$ which set these two partials equal to zero. The first equation determines $\hat{\mu}$:
$$\sum_{i=1}^n y_i = n\hat{\mu}$$
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}$$
Now plug this $\hat{\mu}$ into the second equation to get
$$\frac{n}{2\hat{\sigma}^2} = \frac{1}{2\hat{\sigma}^4}\sum_{i=1}^n (y_i - \bar{y})^2$$
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2$$
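As a numerical cross-check of Example 1, the sketch below maximizes the log-likelihood with scipy (the data-generating values are illustrative assumptions) and recovers the closed-form estimators $\bar{y}$ and $\frac{1}{n}\sum_i (y_i - \bar{y})^2$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=1_000)

def neg_loglik(theta):
    mu, log_sigma2 = theta            # log-parameterize sigma^2 to keep it positive
    sigma2 = np.exp(log_sigma2)
    n = y.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + ((y - mu) ** 2).sum() / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))     # numerical MLE of (mu, sigma^2)
print(y.mean(), y.var())              # closed forms: ybar and (1/n) sum (y_i - ybar)^2
```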
22.2.2 Example 2

Consider the random variables $t_1, \ldots, t_n$ that are independent and follow an exponential distribution with a parameter $\lambda$, i.e.,
$$f(t; \lambda) = \lambda \exp(-\lambda t) \quad \text{for } t > 0$$
Then
$$L(\lambda | t_1, \ldots, t_n) = \lambda^n \exp\left( -\lambda \sum_{i=1}^n t_i \right)$$
$$\ell = \ln L(\lambda | t_1, \ldots, t_n) = n\ln\lambda - \lambda\sum_{i=1}^n t_i$$
$$\frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^n t_i = 0$$
$$\hat{\lambda} = \frac{n}{\sum_{i=1}^n t_i} = \frac{1}{\bar{t}}$$

22.2.3 Example 3
We will now show that LS estimates can be obtained as a special case of ML in the multiple regression model
$$y = X\beta + u, \qquad u \sim N(0, \sigma_u^2 I_n)$$
Conditional on $X$,
$$y \sim N(X\beta, \sigma^2 I_n)$$
and hence
$$f(y_i | x_i, \beta, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(y_i - x_i\beta)^2}{2\sigma^2} \right)$$
$$f(y | X, \beta, \sigma^2) = \prod_{i=1}^N (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(y_i - x_i\beta)^2}{2\sigma^2} \right) = (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right)$$
Rewrite as a log-likelihood:
$$\ell(y | X, \beta, \sigma^2) = -(N/2)\ln(2\pi\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}$$
Note that maximizing $\ell(y | X, \beta, \sigma^2)$ w.r.t. $\beta$ is equivalent to maximizing $-(y - X\beta)'(y - X\beta)$ w.r.t. $\beta$ (both yield the same argmax). Also note that
$$(y - X\beta)'(y - X\beta) = u'u = \sum_{i=1}^N u_i^2$$
which is the sum of squared residuals criterion function of OLS, so maximizing the log-likelihood amounts to minimizing this criterion. Hence maximizing $\ell(y | X, \beta, \sigma^2)$ to obtain $\hat{\beta}_{ML}$ yields the same result as minimizing $\sum_{i=1}^N u_i^2$ to obtain $\hat{\beta}_{OLS}$.
Mathematica demonstration: MLE
http://demonstrations.wolfram.com/MaximumLikelihoodEstimatorsWithNormallyDistributedError/
22.3 Information Matrix Equality

Define the negative of the expected Hessian
$$H = -E\left[ \frac{\partial^2}{\partial\theta\,\partial\theta'} \ell(y|\theta_0) \right] \tag{73}$$
Define the outer product matrix
$$\Omega = E\left[ \frac{\partial}{\partial\theta}\ell(y|\theta_0)\; \frac{\partial}{\partial\theta}\ell(y|\theta_0)' \right]$$

Theorem 95. For the Maximum Likelihood Estimator,
$$I_0 \equiv \Omega = H \tag{74}$$
The equality (74) is called the information matrix equality. The matrix $I_0$ is called the Fisher information matrix. We will further see that $I_0^{-1}$ is the variance-covariance matrix of the MLE.

Before deriving the information matrix equality, we will first introduce several useful concepts. The score of the log-likelihood is defined as
$$s_i(\theta) \equiv \nabla_\theta \ell_i(\theta)' = \left( \frac{\partial \ell_i}{\partial \theta_1}(\theta), \frac{\partial \ell_i}{\partial \theta_2}(\theta), \ldots, \frac{\partial \ell_i}{\partial \theta_m}(\theta) \right)'$$
An important property of the score is:
$$\begin{aligned}
E[s_i(\theta_0)|x_i] &= \int s_i(y, x_i; \theta_0) f(y, x_i; \theta_0)\,dy \\
&= \int \nabla_{\theta_0} \log f(y, x_i; \theta_0)\, f(y, x_i; \theta_0)\,dy \\
&= \int \frac{\nabla_{\theta_0} f(y, x_i; \theta_0)}{f(y, x_i; \theta_0)} f(y, x_i; \theta_0)\,dy \tag{75} \\
&= \int \nabla_{\theta_0} f(y, x_i; \theta_0)\,dy \\
&= \nabla_{\theta_0} \int f(y, x_i; \theta_0)\,dy \tag{76} \\
&= \nabla_{\theta_0} 1 = 0 \tag{77}
\end{aligned}$$
From (75) and (76) it follows that
$$\nabla_\theta f(y, x_i; \theta_0) = s_i(y, x_i; \theta_0) f(y, x_i; \theta_0) \tag{78}$$
Assuming $\ell_i(\theta)$ is twice continuously differentiable over $\Theta$, define
$$H_i(\theta) \equiv \nabla_\theta s_i(\theta) \tag{79}$$
Differentiate (75) and (77) w.r.t. $\theta$, and use the product rule to obtain
$$\nabla_\theta E[s_i(\theta_0)|x_i] = 0$$
$$\nabla_\theta \int s_i(y, x_i; \theta_0) f(y, x_i; \theta_0)\,dy = 0$$
$$\int \nabla_\theta \left[ s_i(y, x_i; \theta_0) f(y, x_i; \theta_0) \right] dy = 0$$
$$\int \nabla_\theta s_i(y, x_i; \theta_0) f(y, x_i; \theta_0)\,dy + \int s_i(y, x_i; \theta_0)\, \nabla_\theta f(y, x_i; \theta_0)'\,dy = 0$$
Use (78) and (79) to obtain
$$\int H_i(\theta_0) f(y, x_i; \theta_0)\,dy + \int s_i(y, x_i; \theta_0) s_i(y, x_i; \theta_0)' f(y, x_i; \theta_0)\,dy = 0$$
$$E[H_i(\theta_0)|x_i] + E\left[ s_i(y, x_i; \theta_0) s_i(y, x_i; \theta_0)' \,\middle|\, x_i \right] = 0$$
Using the Law of Iterated Expectations (LIE) yields
$$-H + \Omega = 0$$
$$H = \Omega$$
Note: in general, the LIE is stated as:
$$E[E[Y|X]] = E[Y]$$
22.4 Asymptotic Properties of MLE

When standardized, the log-likelihood is a sample average:
$$\frac{1}{n}\ell(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i; \theta) \xrightarrow{p} E[\ell(y; \theta)]$$
As $\hat{\theta}_{MLE}$ maximizes the left-hand side, it is an estimator of the maximizer of the right-hand side. Under regularity conditions (such as boundedness, smoothness and continuity of $\ell(\theta)$), convergence in probability is preserved under the argmax operation and its inverse.

Theorem 96 (MLE Consistency). Under regularity conditions, $\hat{\theta}_{MLE}$ is consistent:
$$\hat{\theta}_{MLE} \xrightarrow{p} \theta_0$$

Theorem 97 (MLE Asymptotic Normality). Under regularity conditions, $\hat{\theta}_{MLE}$ is asymptotically Normally distributed:
$$\sqrt{n}\left( \hat{\theta}_{MLE} - \theta_0 \right) \xrightarrow{d} N(0, nI_0^{-1})$$
Therefore, the asymptotic variance of $\hat{\theta}_{MLE}$ is
$$AVar(\hat{\theta}_{MLE}) = I_0^{-1}$$
Typically, to estimate the asymptotic variance of $\hat{\theta}_{MLE}$ we use an estimate based on the Hessian formula (73),
$$\hat{H} = -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\ell(y_i; \hat{\theta}_{MLE}) \tag{80}$$
We then set
$$\hat{I}_0 = \hat{H}$$
Asymptotic standard errors for $\hat{\theta}_{MLE}$ are then the square roots of the diagonal elements of $\hat{I}_0^{-1}$.

Theorem 98 (Cramer-Rao Lower Bound). If $\tilde{\theta}$ is an unbiased estimator of $\theta \in \mathbb{R}$ then $Var(\tilde{\theta}) \geq I_0^{-1}$.

The ML estimator achieves the Cramer-Rao lower bound and is therefore efficient in the class of unbiased estimators.
23 Generalized Method of Moments

23.1 Motivation
Recall the generic form of an econometric model
$$Y = f(X, U; \theta)$$
The multiple regression model estimated by OLS assumes that $f(\cdot)$ is a linear function in parameters. If $f(\cdot)$ is nonlinear, $\theta$ can be estimated by MLE as long as we impose a functional form assumption on the joint distribution of $Y, X, U$. In many cases, economic theory does not provide any basis for the full functional form of the joint distribution, but only one (or a few) of its moments. In such a case, the model can be estimated using the Method of Moments (MM), or the Generalized Method of Moments (GMM).

MM is based on the so-called analogy principle. If economic theory provides population moments of a model, by analogy we can specify corresponding sample moments and use these to estimate $\theta$.

We can show that the OLS estimator is a special case of the MM estimator. Consider the regression model
$$y = x'\beta + u \tag{81}$$
where $x$ and $\beta$ are $K \times 1$ vectors. Assume that
$$E[u|x] = 0 \tag{82}$$
The single conditional moment restriction (82) implies $K$ unconditional moment restrictions
$$E[xu] = 0 \tag{83}$$
since, using the Law of Iterated Expectations and (82),
$$E[xu] = E_x[E[xu|x]] = E_x[x\,E[u|x]] = E_x[x \cdot 0] = 0$$
From (81) and (83) we obtain the population moment condition
$$E[xu] = E[x(y - x'\beta)] = 0$$
The MM estimator solves the corresponding sample moment condition
$$\sum_{i=1}^n [x_i(y_i - x_i'\beta)] = 0 \tag{84}$$
Rearranging (84) to solve for $\beta$ yields
$$\hat{\beta}_{MM} = \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i y_i = (X'X)^{-1}X'y$$
which coincides with the OLS estimator.

Economic theory can generate moment conditions that can be used as a basis for estimation. Consider the model
$$y_i = E[y_i | x_i, \theta] + u_i \tag{85}$$
where the first right-hand side term measures the "anticipated" component of $y_i$, conditional on $x_i$, and the second term measures the "unanticipated" component. Here, $y_i$ can be e.g. a measure of demand. Under the assumptions of rational expectations and market clearing, we can conclude that
$$E\left[ y_i - E[y_i | x_i, \theta] \,\middle|\, I_i \right] = 0$$
where $I_i$ denotes all information available. By the Law of Iterated Expectations,
$$E\left[ z_i(y_i - E[y_i | x_i, \theta]) \right] = 0 \tag{86}$$
where $z_i$ is formed from any subset of $I_i$. This provides many moment conditions (86) that can be used for estimation. If $\dim(z_i) > \dim(\theta)$, then not all conditions (86) may be satisfied with exact equality in the sample. In such cases, instead of a direct analogy of (86) expressed as an MM estimator, we minimize a weighted average of (86). This leads to the Generalized Method of Moments estimator, which minimizes the quadratic form
$$Q_n(\theta) = \left[ \frac{1}{n}\sum_{i=1}^n z_i u_i \right]' W_n \left[ \frac{1}{n}\sum_{i=1}^n z_i u_i \right]$$
where $u_i = y_i - E[y_i | x_i, \theta]$.
23.2 GMM Principle

Assume the existence of $r$ moment conditions for $q$ parameters
$$E[h(w_i, \theta_0)] = 0$$
where $\theta$ is $(q \times 1)$, $h(\cdot)$ is $(r \times 1)$, with $r \geq q$, and $\theta_0$ denotes the true population value of $\theta$. The vector $w_i$ includes all exogenous and endogenous observable variables.

If $r = q$, the model is said to be just identified, and $\theta_0$ can be estimated by the MM estimator defined as the solution to
$$\frac{1}{n}\sum_{i=1}^n h(w_i, \theta) = 0 \tag{87}$$
Equivalently, the MM estimator minimizes
$$Q_n(\theta) = \left[ \frac{1}{n}\sum_{i=1}^n h(w_i, \theta) \right]' \left[ \frac{1}{n}\sum_{i=1}^n h(w_i, \theta) \right]$$
If $r > q$, the model is said to be overidentified, and (87) has no solution. Instead, the $\hat{\theta}_{GMM}$ estimator is chosen so that a quadratic form in (87) is as close to zero as possible. Specifically, $\hat{\theta}_{GMM}$ minimizes the quadratic form
$$Q_n(\theta) = \left[ \frac{1}{n}\sum_{i=1}^n h(w_i, \theta) \right]' W_n \left[ \frac{1}{n}\sum_{i=1}^n h(w_i, \theta) \right] \tag{88}$$
where $W_n$ is an $(r \times r)$ weight matrix. Differentiating (88) w.r.t. $\theta$ yields the FOC
$$\left[ \frac{1}{n}\sum_{i=1}^n \left. \frac{\partial h(w_i, \theta)'}{\partial\theta} \right|_{\theta = \hat{\theta}} \right] W_n \left[ \frac{1}{n}\sum_{i=1}^n h(w_i, \hat{\theta}) \right] = 0$$
These equations will generally be nonlinear in $\hat{\theta}$ and will be solved using numerical methods.
Let
$$G_0 = \mathrm{plim}\; \frac{1}{n}\sum_{i=1}^n \left. \frac{\partial h_i}{\partial\theta'} \right|_{\theta = \theta_0}$$
and
$$S_0 = \mathrm{plim}\; \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^n \left. h_i h_j' \right|_{\theta = \theta_0}$$
where $h_i = h(w_i, \theta)$. The corresponding consistent estimators are
$$\hat{G} = \frac{1}{n}\sum_{i=1}^n \left. \frac{\partial h_i}{\partial\theta'} \right|_{\theta = \hat{\theta}}, \qquad \hat{S} = \frac{1}{n}\sum_{i=1}^n \left. h_i h_i' \right|_{\theta = \hat{\theta}} \tag{89}$$
for iid data.
Different choices for the weighting matrix $W_n$ yield different GMM estimators. When $S_0$ is known, the most efficient GMM results from the choice of
$$W_n = S_0^{-1}$$
Then,
$$\sqrt{n}\left( \hat{\theta}_{GMM} - \theta_0 \right) \xrightarrow{d} N\left( 0, (G_0' S_0^{-1} G_0)^{-1} \right) \tag{90}$$
The optimal GMM estimator weights by the inverse of the variance matrix of the sample moment conditions. The proof of the result (90) follows from the Multivariate Delta Method and the Continuous Mapping Theorem.

In practice, $S_0$ is unknown and needs to be estimated by $\hat{S}$. The optimal GMM estimator can be obtained using a two-step procedure (a sketch in code follows below):

1. Set $W_n = I_r$ and obtain a suboptimal GMM estimator. Obtain $\hat{S}$ using (89).
2. Run the optimal GMM estimator with $W_n = \hat{S}^{-1}$.

Although replacing $S_0$ with $\hat{S}$ in (90) makes no difference asymptotically, since $\hat{S}$ is consistent, it will make a difference in finite samples. Depending on the application, using $\hat{S}$ may result in a small-sample bias. In general, adding moment restrictions improves asymptotic efficiency, as it reduces the limit variance $(G_0' S_0^{-1} G_0)^{-1}$ of the optimal GMM estimator, or at least leaves it unchanged. Nonetheless, the number of moment restrictions cannot exceed the number of observations.
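A minimal sketch of the two-step procedure for a linear moment condition $E[z(y - x\beta)] = 0$ (the simulated instrument design and true $\beta = 2$ are illustrative assumptions; the closed form exploits the linearity of the moments in $\beta$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
z = rng.normal(size=(n, 3))                                        # r = 3 moment conditions
x = z @ np.array([[1.0], [0.5], [0.2]]) + rng.normal(size=(n, 1))  # q = 1 parameter
y = 2.0 * x[:, 0] + rng.normal(size=n)                             # true beta = 2

def gmm_beta(W):
    # For linear moments the minimizer of Qn has the closed form
    # beta = (G'WG)^{-1} G'W m, with G = z'x/n and m = z'y/n
    G = z.T @ x / n
    m = z.T @ y / n
    return np.linalg.solve(G.T @ W @ G, G.T @ W @ m)

b1 = gmm_beta(np.eye(3))               # step 1: identity weight matrix
h = z * (y - x @ b1).reshape(-1, 1)    # moment contributions h_i evaluated at b1
S = h.T @ h / n                        # S_hat = (1/n) sum h_i h_i'
b2 = gmm_beta(np.linalg.inv(S))        # step 2: W_n = S_hat^{-1}
print(b1.ravel(), b2.ravel())          # both close to 2; b2 is the efficient estimator
```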
GMM defines a class of estimators, with different choices of $h(\cdot)$ corresponding to different members of the class. For example:

$h_i = x_i u_i$ yields the OLS estimator

$h_i = x_i u_i / V[u_i | x_i]$ yields the GLS estimator when errors are heteroskedastic

If complete distributional assumptions are made on the model, the most efficient estimator is the MLE. Thus the optimal choice of $h_i$ is the score
$$h_i = \frac{\partial \ln f(w_i; \theta)}{\partial\theta}$$
where $f(w_i; \theta)$ is the joint density of $w_i$.

Hypothesis tests on $\theta$ can be performed using the Wald test. There is also a general model specification test that can be used for overidentified models. Let
$$H_0: E[h(w, \theta_0)] = 0$$
which tests the initial population moment condition. The test can be implemented by measuring the closeness of $n^{-1}\sum_{i=1}^n \hat{h}_i$ to 0. For the special case of the optimal GMM, under the null, the test statistic
$$OIR = n\left[ \frac{1}{n}\sum_{i=1}^n \hat{h}_i \right]' \hat{S}^{-1} \left[ \frac{1}{n}\sum_{i=1}^n \hat{h}_i \right] \overset{a}{\sim} \chi^2_{r-q}$$
23.3 Example: Euler Equation Asset Pricing Model
Following Hansen and Singleton (1982), a representative agent is assumed to choose an optimal consumption path by maximizing the present discounted value of lifetime utility from consumption
$$\max \; \sum_{t=1}^{\infty} E\left[ \beta_0^t U(C_t) \,\middle|\, I_t \right]$$
subject to the budget constraint $C_t + P_t Q_t \leq V_t Q_{t-1} + W_t$, where $I_t$ denotes the information available at time $t$, $C_t$ denotes real consumption at $t$, $W_t$ denotes real labor income at $t$, $P_t$ denotes the price of a pure discount bond maturing at time $t+1$ that pays $V_{t+1}$, $Q_t$ represents the quantity of bonds held at $t$, and $\beta_0$ represents a time discount factor.

The first order condition for the maximization problem can be represented as the Euler equation
$$E\left[ \beta_0(1 + R_{t+1})\frac{U'(C_{t+1})}{U'(C_t)} - 1 \,\middle|\, I_t \right] = 0 \tag{91}$$
where $1 + R_{t+1} = \frac{V_{t+1}}{V_t}$ is the gross return on the bond at time $t+1$. Assuming utility has the power form
$$U(C) = \frac{C^{1-\gamma_0}}{1 - \gamma_0}$$
where $\gamma_0$ represents the intertemporal rate of substitution (risk aversion), then
$$\frac{U'(C_{t+1})}{U'(C_t)} = \left( \frac{C_{t+1}}{C_t} \right)^{-\gamma_0} \tag{92}$$
Using (92) in (91), the conditional moment equation becomes
$$E\left[ \beta_0(1 + R_{t+1})\left( \frac{C_{t+1}}{C_t} \right)^{-\gamma_0} - 1 \,\middle|\, I_t \right] = 0 \tag{93}$$
Define the nonlinear error term as
$$\varepsilon_{t+1} = \beta_0(1 + R_{t+1})\left( \frac{C_{t+1}}{C_t} \right)^{-\gamma_0} - 1 = a(z_{t+1}, \theta_0)$$
with $z_{t+1} = (R_{t+1}, C_{t+1}/C_t)'$ and $\theta_0 = (\beta_0, \gamma_0)'$. Then the conditional moment equation (93) may be represented as
$$E[\varepsilon_{t+1} | I_t] = E[a(z_{t+1}, \theta_0) | I_t] = 0 \tag{94}$$
Let
$$x_t = (1, C_t/C_{t-1}, C_{t-1}/C_{t-2}, R_t, R_{t-1})'$$
Since $x_t \in I_t$, (94) implies that
$$E[x_t \varepsilon_{t+1} | I_t] = E[x_t a(z_{t+1}, \theta_0) | I_t] = 0$$
and by the Law of Iterated Expectations,
$$E[x_t \varepsilon_{t+1}] = 0$$
For GMM estimation, define the nonlinear residual as
$$e_{t+1} = \beta(1 + R_{t+1})\left( \frac{C_{t+1}}{C_t} \right)^{-\gamma} - 1$$
Form the vector of moments
$$h(w_{t+1}, \theta) = x_t e_{t+1} = \begin{pmatrix} \beta(1 + R_{t+1})(C_{t+1}/C_t)^{-\gamma} - 1 \\ (C_t/C_{t-1})\left[ \beta(1 + R_{t+1})(C_{t+1}/C_t)^{-\gamma} - 1 \right] \\ (C_{t-1}/C_{t-2})\left[ \beta(1 + R_{t+1})(C_{t+1}/C_t)^{-\gamma} - 1 \right] \\ R_t\left[ \beta(1 + R_{t+1})(C_{t+1}/C_t)^{-\gamma} - 1 \right] \\ R_{t-1}\left[ \beta(1 + R_{t+1})(C_{t+1}/C_t)^{-\gamma} - 1 \right] \end{pmatrix}$$
There are $r = 5$ moment conditions to identify $q = 2$ model parameters, giving $r - q = 3$ overidentifying restrictions.
24 Testing of Nonlinear Hypotheses

A generic version of a hypothesis for a test of linear restrictions can be stated as:
$$H_0: R\beta_0 - r = 0$$
$$H_A: R\beta_0 - r \neq 0$$
for $h$ restrictions on $k$ parameters with $h \leq k$, where $R$ is an $(h \times k)$ matrix of full rank $h$, and $r$ is an $(h \times 1)$ vector of constants. Example: the $H_0$ that $\beta_1 = 1$ and $\beta_2 = 2 + \beta_3$ when $k = 4$ can be expressed with
$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 \end{pmatrix}, \qquad r = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$
A generic version of a hypothesis for a test of nonlinear restrictions can be stated as:
$$H_0: h(\theta_0) = 0$$
$$H_A: h(\theta_0) \neq 0$$
where $h(\cdot)$ is an $(h \times 1)$ vector-valued function of $\theta$. Example: $h(\theta) = \theta_1/\theta_2 - 1 = 0$. We assume that $h(\cdot)$ is such that the $h \times q$ matrix
$$R(\theta) = \frac{\partial h(\theta_0)}{\partial\theta'}$$
is of full rank $h$. This assumption is equivalent to linear independence of the linear restrictions $R\theta$ where $R = R(\theta)$.

We will present three different procedures for testing a nonlinear $H_0$:

1. Wald Test

2. Likelihood Ratio Test

3. Lagrange Multiplier Test

All three tests are asymptotically equivalent, in the sense that all three test statistics converge in distribution to the same random variable under $H_0$. If the number of restrictions is $r$, the limiting distribution is $\chi^2_r$. In practice we choose among these tests based on ease of computation and finite sample performance.

Let $\ell(\tilde{\theta}_{MLE})$ denote the log-likelihood function evaluated at the ML estimate of the restricted model. Let $\ell(\hat{\theta}_{MLE})$ denote the log-likelihood function evaluated at the ML estimate of the unrestricted model.
24.1 Wald Test

Intuition: to test $h(\theta_0) = 0$, obtain $\hat{\theta}$ (without imposing the restrictions $h$), and assess whether $h(\hat{\theta})$ is close to 0. The closeness is assessed using the Wald test statistic $W$, which is a quadratic form in the vector $h(\hat{\theta})$ and the inverse of the estimate of its covariance matrix. When
$$h(\hat{\theta}) \overset{a}{\sim} N(0, V[h(\hat{\theta})])$$
under $H_0$, then the Wald test statistic
$$W = h(\hat{\theta})'\left[ V[h(\hat{\theta})] \right]^{-1} h(\hat{\theta}) \overset{a}{\sim} \chi^2_h \tag{95}$$
Note that the covariance matrix inverse acts as a set of weights on the quadratic distance of $h(\hat{\theta})$ from zero. The Wald test is generic in that it does not require an underlying MLE procedure.

Using a first-order Taylor series expansion under $H_0$, $h(\hat{\theta})$ has the same limit distribution as $R(\theta_0)(\hat{\theta} - \theta_0)$. Then $h(\hat{\theta})$ is asymptotically Normal under $H_0$ with mean zero and variance matrix
$$V[h(\hat{\theta})] = R(\theta_0) V(\hat{\theta}) R(\theta_0)'$$
A consistent estimate is
$$\hat{V}[h(\hat{\theta})] = n^{-1}\hat{R}\hat{C}\hat{R}' \tag{96}$$
where $\hat{R} = R(\hat{\theta})$ when
$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, C_0)$$
and $\hat{C}$ is any consistent estimate of $C_0$. Using (96) in (95), the Wald test statistic can be expressed as
$$W = n\,h(\hat{\theta})'\left[ \hat{R}\hat{C}\hat{R}' \right]^{-1} h(\hat{\theta})$$
An equivalent expression is
$$W = h(\hat{\theta})'\left[ \hat{R}\hat{V}(\hat{\theta})\hat{R}' \right]^{-1} h(\hat{\theta})$$
where $\hat{V}(\hat{\theta}) = n^{-1}\hat{C}$ is the estimated asymptotic variance of $\hat{\theta}$. The Wald test can be implemented as a nonlinear $F$ test:
$$F = \frac{W}{h} \overset{a}{\sim} F[h, n-k]$$
This yields the regular $F$ test in linear multiple regression as a special case. Moreover, in the linear model, for a test of one restriction, $\sqrt{W}$ is equivalent to the $t$ statistic. One limitation of the Wald statistic is that $W$ is not invariant to reformulations of the restrictions. Finite sample performance of $W$ may be quite poor.
24.2 Likelihood Ratio Test

Define
$$\ell(\theta) \equiv \sum_{i=1}^n \ell_i(\theta)$$
The Likelihood Ratio test statistic is
$$LR = 2\left[ \ell(\hat{\theta}) - \ell(\tilde{\theta}) \right] = 2\ln\left( \frac{L(\hat{\theta})}{L(\tilde{\theta})} \right) \xrightarrow{d} \chi^2_r$$
where $\tilde{\theta}$ is the MLE in the model restricted under $H_0$ and $\hat{\theta}$ is the MLE of the unrestricted model. In the linear model, the $F$-statistic results as a special case of $LR$.
24.3 Lagrange Multiplier Test

The Lagrange Multiplier (LM) test is based on a vector of Lagrange multipliers from a constrained maximization problem. In practice, LM tests are computed based on a gradient (score) vector of the unrestricted $\ln L$ function evaluated at the restricted estimates. Consider the special case of $H_0: \theta_2 = 0$, where $\theta$ is partitioned as $\theta = (\theta_1, \theta_2)$. The vector of restricted estimates can then be expressed as $\tilde{\theta} = (\tilde{\theta}_1, 0)$.

The vector $\tilde{\theta}_1$ maximizes the restricted function $\ln L(\theta_1, 0)$ and so it satisfies the restricted likelihood equations
$$s_1(\tilde{\theta}_1, 0) = 0 \tag{97}$$
where $s_1$ is the vector whose components are the $k - h$ partial derivatives of $\ln L$ w.r.t. the elements of $\theta_1$.

If we partition $s(\tilde{\theta}) = (s_1(\tilde{\theta}), s_2(\tilde{\theta}))$, the first $k - h$ elements, which form the vector $s_1(\tilde{\theta})$, are equal to zero by (97). However, the $h$-vector $s_2(\tilde{\theta})$ is in general non-zero. The LM test is a statistic based on a quadratic form of $s_2(\tilde{\theta})$:
$$LM = s_2(\tilde{\theta})'\tilde{I}_{22}^{-1}s_2(\tilde{\theta})$$
where $\tilde{I}_{22}$ is the sub-block of the Fisher information matrix corresponding to $s_2(\tilde{\theta})$. The conditions (97) imply that this is equivalent to
$$LM = s(\tilde{\theta})'\tilde{I}^{-1}s(\tilde{\theta}) \tag{98}$$
As (98) does not depend on the partitioning $(\theta_1, \theta_2)$, the LM statistic is invariant to model reparametrization. Since (98) is based on the score function $s$, this LM statistic is also called the score test.

24.4 LM Test in Auxiliary Regressions

The LM test can be expressed as $nR^2$ in an auxiliary regression. Define the $(n \times k)$ matrix of scores
$$S = \begin{pmatrix} s_1(\tilde{\theta})' \\ \vdots \\ s_n(\tilde{\theta})' \end{pmatrix}$$
and let $\iota$ denote an $(n \times 1)$ column of ones. Then
$$\iota'\iota = n, \qquad S'\iota = \sum_{i=1}^n s_i(\tilde{\theta}), \qquad S'S = \sum_{i=1}^n s_i(\tilde{\theta})s_i(\tilde{\theta})'$$
Then, using the information matrix equality, we can rewrite the LM statistic as
$$LM = \left( \sum_{i=1}^n s_i(\tilde{\theta}) \right)'\left( \sum_{i=1}^n s_i(\tilde{\theta})s_i(\tilde{\theta})' \right)^{-1}\left( \sum_{i=1}^n s_i(\tilde{\theta}) \right) = \iota'S(S'S)^{-1}S'\iota \tag{99}$$
Now consider the auxiliary regression
$$\iota = S\gamma + \text{residual} \tag{100}$$
The OLS estimator is given by
$$\hat{\gamma} = (S'S)^{-1}S'\iota$$
and the predicted values by
$$\hat{\iota} = S\hat{\gamma} = S(S'S)^{-1}S'\iota \tag{101}$$
Using (99) and (101), the LM test can then be written as
$$\begin{aligned}
LM &= \iota'S(S'S)^{-1}S'\iota \\
&= n\,\frac{\iota'S(S'S)^{-1}S'\iota}{\iota'\iota} \\
&= n\,\frac{\iota'S(S'S)^{-1}(S'S)(S'S)^{-1}S'\iota}{\iota'\iota} \\
&= n\,\frac{\hat{\iota}'\hat{\iota}}{\iota'\iota} \\
&= n\,\frac{ESS}{TSS} = nR^2
\end{aligned}$$
where $ESS$ is the explained variation, $TSS$ is the total variation, and $R^2$ is the regular regression R-squared.
The auxiliary regression (100) was used above to make the link between the LM statistic and its $nR^2$ form simple and transparent. Various auxiliary regressions are used in place of (100) depending on the context. Examples:

Breusch-Pagan test for no heteroskedasticity:
$$\hat{u}_i^2 = x_i'\gamma + \text{residual}$$
Breusch-Godfrey test for no first order autocorrelation:
$$\hat{u}_t = \rho\hat{u}_{t-1} + x_t'\gamma + \text{residual}$$
Using the $nR^2$ test statistic, these are viewed as special cases of the LM test.
24.5 Test Comparison

All three tests, Wald, LR, and LM, are asymptotically distributed $\chi^2_h$ under $H_0$. The finite-sample distributions of these test statistics differ. For testing of linear restrictions in the linear model under residual Normality,
$$W \geq LR \geq LM$$
In practice, in this case the Wald test will reject $H_0$ more often than the LR test, which in turn will reject more often than the LM test.

Ease of estimation:

The Wald test only requires estimation of the model under $H_A$ and is best to use when the restricted model is difficult to estimate.

The LM test only requires estimation under $H_0$ and is best to use when the unrestricted model is difficult to estimate.

The lack of invariance to reparametrization is a major weakness of the Wald test.
25 Bootstrap Approximation
Exact finite-sample results are unavailable for most estimators and related test statistics. The statistical methods presented above rely on asymptotic approximations, which may or may not be reliable in finite samples. An alternative approximation is provided by the bootstrap (Efron 1979, 1982), which approximates the distribution of a statistic (estimator or a test) by numerical simulation. We can calculate so-called bootstrap standard errors, confidence intervals and critical values for hypothesis testing, instead of the asymptotic ones. Bootstrap approximation methods are used in place of asymptotic approximation for a number of reasons, including:

To avoid the analytical calculation of the estimated asymptotic variance in cases when the asymptotic sampling distribution is very difficult or infeasible to obtain, such as in multi-step estimators;

To gain better approximation accuracy than the asymptotic distribution (so-called "higher-order" refinement);

To bias-correct an estimator.

The justification for the bootstrap relies on asymptotic theory. In situations where the latter fails, the former can also fail. As with asymptotics, bootstrap inference is only exact in infinitely large samples.
25.1 The Empirical Distribution Function

The Empirical Distribution Function (EDF) for a sample $\{y_i\}_{i=1}^n$ is defined as
$$F_n(y) = \frac{1}{n}\sum_{i=1}^n 1\{y_i \leq y\}$$
where $1\{\omega\}$ is the indicator function
$$1\{\omega\} = \begin{cases} 1 & \text{if } \omega \text{ is true} \\ 0 & \text{if } \omega \text{ is false} \end{cases}$$
$F_n$ is a nonparametric estimate of the population distribution function $F$. $F_n$ is by construction a step function.

The EDF is a consistent estimator of the CDF. To see this, note that for any $y$, $1\{y_i \leq y\}$ is an iid random variable with expectation $F(y)$. Thus by the LLN,
$$F_n(y) \xrightarrow{p} F(y)$$
Furthermore, by the CLT,
$$\sqrt{n}(F_n(y) - F(y)) \xrightarrow{d} N(0, F(y)(1 - F(y)))$$
To see the effect of sample size on the EDF, the Figure below shows the EDF and true CDF for four random samples of size $n = 25, 50, 100$, and $500$. The random draws are from the $N(0,1)$ distribution. For $n = 25$ the EDF is only a crude approximation to the CDF, but the approximation appears to improve for the larger $n$. In general, as the sample size gets larger, the EDF step function gets uniformly close to the true CDF.

[Figure: Effect of sample size on EDF]
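In place of the figure, a small Python sketch (illustrative grid and seed) showing the same pattern: the maximum deviation of $F_n$ from the true $N(0,1)$ CDF shrinks as $n$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def edf(sample, y):
    # F_n(y) = (1/n) * #{ y_i <= y }, evaluated on a grid of y values
    return (sample[:, None] <= y[None, :]).mean(axis=0)

grid = np.linspace(-3, 3, 61)
for n in [25, 50, 100, 500]:
    sample = rng.normal(size=n)
    err = np.abs(edf(sample, grid) - norm.cdf(grid)).max()
    print(n, round(err, 3))   # max |F_n - F| on the grid decreases with n
```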
25.2 Bootstrap

The bootstrap treats the sample as if it were the population. Instead of drawing from a specified population distribution by a random number generator (as a Monte Carlo simulation would), the bootstrap draws with replacement from the sample, i.e. from the Empirical Distribution Function (EDF). In other words, the bootstrap assumes that the EDF is the population distribution function and the sample is thus "truly" representative of the population. This assumption may or may not be satisfied in practice.

Let $F$ denote a distribution function for the population of observations $(y_i, x_i)$. Let
$$T_n = T_n((y_1, x_1), \ldots, (y_n, x_n); F)$$
be a statistic of interest, for example an estimator $\hat{\theta}$ or a t-statistic $(\hat{\theta} - \theta)/s(\hat{\theta})$. Note that we write $T_n$ as a function of $F$. For example, the t-statistic is a function of $\hat{\theta}$, which itself is a function of $F$. The exact (unknown) CDF of $T_n$ when the data are sampled from the distribution $F$ is
$$G_n(u; F) = P(T_n \leq u | F)$$
Ideally, inference would be based on $G_n(u; F)$. This is generally impossible since $F$ is unknown. Asymptotic inference is based on approximating $G_n(u; F)$ with
$$G(u; F) = \lim_{n \to \infty} G_n(u; F)$$
When $G(u; F) = G(u)$ does not depend on $F$, we say that $T_n$ is asymptotically pivotal and use the distribution function $G(u)$ for inferential purposes.

The bootstrap constructs an alternative approximation. The unknown $F$ is replaced by the EDF estimate $F_n$. Plugged into $G_n(u; F)$ we obtain
$$G_n^*(u) = G_n(u; F_n)$$
We call $G_n^*$ the bootstrap distribution. Bootstrap inference is then based on $G_n^*(u)$. A random sample from $F_n$ is called the bootstrap data: denote these by $(y_i^*, x_i^*)$. The statistic
$$T_n^* = T_n((y_1^*, x_1^*), \ldots, (y_n^*, x_n^*); F_n)$$
is a random variable with distribution $G_n^*$. We call $T_n^*$ the bootstrap statistic.

The bootstrap algorithm is similar to our discussion of Monte Carlo simulation, but drawing from $F_n$ instead of some assumed $F$. The bootstrap sample size $n$ for each replication is the same as the original sample size. Hence, a bootstrap sample $\{(y_1^*, x_1^*), \ldots, (y_n^*, x_n^*)\}$ will typically contain multiple replications of the same observation. The bootstrap statistic $T_n^*$ is calculated for each bootstrap sample. This is repeated $B$ times. $B$ is known as the number of bootstrap replications. It is desirable for $B$ to be large, so long as the computational costs are reasonable. $B = 1000$ typically suffices.
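A minimal bootstrap sketch (the sample median of illustrative exponential data; the statistic and distribution are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=200)     # the observed sample
B = 1_000                                    # number of bootstrap replications

theta_hat = np.median(y)                     # statistic of interest
boot = np.empty(B)
for b in range(B):
    ystar = rng.choice(y, size=y.size, replace=True)   # draw from the EDF
    boot[b] = np.median(ystar)

se_boot = boot.std(ddof=1)                   # bootstrap standard error
ci = np.percentile(boot, [2.5, 97.5])        # percentile confidence interval
print(theta_hat, se_boot, ci)
```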
26 Elements of Bayesian Analysis
Bayesian methods provide a set of tools for statistical analysis that are an alternative to frequentist methods. Many researchers treat frequentist and Bayesian methods as complementary and use both in areas of their relative strength. We have examined the foundations of each approach earlier in the course. We will now first briefly characterize their similarities and differences, and then describe Bayesian methods in closer detail.

Frequentist Methods:

Before observing the data:

– Start with the formulation of a model that seeks to adequately describe the situation of interest

– Utilize a rule to estimate the unobservables (parameters and latent variables)

– Determine the asymptotic (or bootstrap) properties of the rule

After observing the data:

– Update the system by conditioning on the data

– Learn about the unobservables by running a max/min routine or via numerical methods such as MCMC

– Perform inference based on the asymptotic distribution of the estimation rule (form "confidence intervals")

– Perform prediction based on the model and the asymptotic distribution of the estimation rule

Bayesian Methods:

Before observing the data:

– Start with the formulation of a model that seeks to adequately describe the situation of interest

– Utilize a rule to estimate the unobservables (parameters and latent variables)

– Formulate a "prior" probability density over the unobservable quantities

After observing the data:

– Update the system by conditioning on the data using Bayes' rule, yielding a "posterior" probability density over the unobservables

– Learn about the unobservables by taking "draws" from the posterior density (direct, MCMC)

– Perform inference based on these draws (form "credible sets")

– Perform prediction based on the model and these draws by computing predictive distributions
Let $\pi(\theta)$ represent the prior state of knowledge about $\theta$. Let $Y_1, \ldots, Y_n \sim f(Y|\theta)$. Using Bayes' Rule, the posterior is
$$\pi(\theta | Y_1, \ldots, Y_n) \propto f(Y_1, \ldots, Y_n | \theta)\pi(\theta) = L(\theta | Y_1, \ldots, Y_n)\pi(\theta)$$
where $L(\theta | Y_1, \ldots, Y_n)$ is the likelihood function. The posterior distribution $\pi(\theta | Y)$ summarizes our current state of knowledge about $\theta$. $\pi(\theta | Y)$ is a distribution over the whole parameter space $\Theta$ of $\theta$. In contrast, a frequentist estimator is a function of the data, $\hat{\theta} = \hat{\theta}(Y)$, which yields a single value of $\theta$.
26.1 The Normal Linear Regression Model

Consider the multiple linear regression model
$$y = X\beta + \varepsilon$$
where $X$ is $n \times k$ with rank $k$. As in Maximum Likelihood analysis, assume
$$\varepsilon \sim N(0, \lambda^{-1}I_n)$$
Then, conditional on $X$,
$$f(y | X, \theta) = N(y | X\beta, \lambda^{-1}I_n)$$
where $\theta = (\beta, \lambda)$. Let the prior be given by
$$\pi(\beta) = N(\beta | \beta_0, \Lambda_0^{-1})$$
$$\pi(\lambda) = \mathrm{Gamma}(\lambda | \alpha, b)$$
Then,
$$\pi(\beta | \lambda, X, y) \propto \exp\left( -\frac{1}{2}(\beta - \bar{\beta})'\bar{\Lambda}(\beta - \bar{\beta}) \right) \propto N(\beta | \bar{\beta}, \bar{\Lambda}^{-1})$$
where
$$\bar{\Lambda} = \lambda X'X + \Lambda_0, \qquad \bar{\beta} = \bar{\Lambda}^{-1}\left( \lambda X'y + \Lambda_0\beta_0 \right)$$
and
$$\pi(\lambda | \beta, X, y) = \mathrm{Gamma}\left( \lambda \,\middle|\, n/2 + \alpha, \left[ (y - X\beta)'(y - X\beta)/2 + b \right]^{-1} \right)$$
26.2 Bernstein-Von Mises Theorem

Theorem 99 (Bernstein-Von Mises). The posterior distribution converges to a Normal distribution with covariance matrix equal to $1/n$ times the inverse of the information matrix.

This result says that under general conditions the posterior distribution for the unknown quantities in any problem is effectively independent of the prior distribution once the amount of information supplied by a sample of data is large enough. That is, in large samples, the choice of a prior distribution is not important in the sense that the information in the prior distribution gets dominated by the sample information. The Theorem implies that the asymptotic distribution of the posterior mean is the same as that of the MLE.
26.3 Posterior Sampling

In place of frequentist maximization of the likelihood, Bayesians obtain a sequence of realizations (or "draws") $\{\theta_r\}_{r=1}^R$ from the posterior, where $\theta_r \sim \pi(\theta | X)$. The analyst chooses the number of draws $R$. This sequence characterizes many posterior properties of interest, such as moments (mean, variance), or quantiles. Under weak conditions,
$$\frac{1}{R}\sum_{r=1}^R h(\theta_r | X) \xrightarrow{R \to \infty} \int h(\theta | y)\pi(\theta | X)\,d\theta$$
for any integrable real-valued function $h$. For example, the posterior mean is obtained as
$$\bar{\theta} = \frac{1}{R}\sum_{r=1}^R \theta_r$$
Inference is performed by constructing credible sets using quantiles $q$ of $\pi(\theta | X)$, e.g.
$$95\% \text{ CS} = [q_{2.5}, q_{97.5}]$$
Direct sampling:

– If a posterior density is easy to draw from (e.g. $N(\bar{\theta}, 1)$)

Indirect sampling (for complicated posteriors):

– Acceptance and Importance Sampling

– Markov chain Monte Carlo (MCMC): Gibbs sampling, Metropolis-Hastings algorithm
26.3.1 Gibbs Sampling

Gibbs sampling is used when a multivariate joint posterior density is difficult to sample, but can be split into a sequence of conditional posterior densities that are easy to sample. Obtain a sample from $p(\theta_1, \theta_2)$ by drawing in turn from $p(\theta_2 | \theta_1)$ and $p(\theta_1 | \theta_2)$.

In our $Y_1, \ldots, Y_n \sim N(\mu, \sigma^2)$ example above, we found two conditional posteriors:
$$\pi(\mu | Y_1, \ldots, Y_n, \sigma^2) = N\left( \frac{n\bar{Y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2},\; \left( n/\sigma^2 + 1/\sigma_0^2 \right)^{-1} \right)$$
$$\pi(\sigma^{-2} | Y_1, \ldots, Y_n, \mu) = \mathrm{Gamma}\left( n/2 + \alpha,\; (ns^2/2 + b)^{-1} \right)$$
These results provide two Gibbs blocks for posterior sampling of $\mu, \sigma^2 | Y_1, \ldots, Y_n$.

Reference: Greenberg (2012), Geweke et al (2011)
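A minimal Gibbs sampler sketch for this two-block scheme (the data-generating values and the prior settings $\mu_0$, $\sigma_0^2$, $\alpha$, $b$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=100)
n, ybar = y.size, y.mean()

mu0, s02, alpha, b = 0.0, 100.0, 2.0, 2.0   # hypothetical prior hyperparameters

R = 5_000
mu, sig2 = 0.0, 1.0
draws = np.empty((R, 2))
for r in range(R):
    # Block 1: mu | Y, sigma^2 is Normal
    prec = n / sig2 + 1.0 / s02
    m = (n * ybar / sig2 + mu0 / s02) / prec
    mu = rng.normal(m, np.sqrt(1.0 / prec))
    # Block 2: 1/sigma^2 | Y, mu is Gamma
    ssq = ((y - mu) ** 2).sum()
    lam = rng.gamma(shape=n / 2 + alpha, scale=1.0 / (ssq / 2 + b))
    sig2 = 1.0 / lam
    draws[r] = mu, sig2

print(draws[1000:].mean(axis=0))   # posterior means of (mu, sigma^2) after burn-in
```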
27 Markov Chain Monte Carlo

27.1 Acceptance Sampling
Consider a posterior $p(\theta | I)$ that is difficult to sample from, while a density $p(\theta | S)$ is easy to sample from. Let the ratio $p(\theta | I)/p(\theta | S)$ be bounded above by a constant $a$. Then draw $\theta$ from $p(\theta | S)$ and accept the draw with probability $\alpha = p(\theta | I)/[a\,p(\theta | S)]$.
Mathematica demonstration: Acceptance Sampling
http://demonstrations.wolfram.com/AcceptanceRejectionSampling/
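A minimal acceptance-sampling sketch (the Beta(2,5) target and Uniform(0,1) source are illustrative stand-ins for $p(\theta|I)$ and $p(\theta|S)$):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

a = beta.pdf(0.2, 2, 5)     # the density ratio is bounded by its value at the mode 0.2
draws = []
while len(draws) < 10_000:
    th = rng.uniform()                            # draw from the easy source density
    if rng.uniform() < beta.pdf(th, 2, 5) / a:    # accept w.p. p(th|I) / [a p(th|S)]
        draws.append(th)

print(np.mean(draws))       # close to the Beta(2,5) mean 2/7
```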
27.2 Metropolis-Hastings Algorithm

The purpose of the Metropolis–Hastings (MH) algorithm is to draw
$$\theta \sim p(\theta)$$
where $p(\theta)$ is a posterior distribution. MH uses a Markov process that is uniquely defined by its transition probabilities. Denote by $p(\theta' | \theta)$ the probability of transitioning from any given state $\theta$ to any other given state $\theta'$.

The Markov process has a unique stationary distribution $p(\theta)$ when the following two conditions are met:

1. Existence of the stationary distribution, for which a sufficient condition is detailed balance:
$$p(\theta)p(\theta' | \theta) = p(\theta')p(\theta | \theta')$$
which requires that each transition $\theta$ to $\theta'$ is "reversible".

2. Uniqueness of the stationary distribution, guaranteed by ergodicity of the Markov process, for which sufficient conditions are:

(a) Aperiodicity - the chain does not return to the same state at fixed intervals;

(b) Positive recurrence - the expected number of steps for returning to the same state is finite.

Detailed balance implies
$$\frac{p(\theta' | \theta)}{p(\theta | \theta')} = \frac{p(\theta')}{p(\theta)} \tag{102}$$
Now separate the transition into two sub-steps:

1. Proposal step, using the proposal distribution $g(\theta' | \theta)$

2. Acceptance-rejection step, using the acceptance distribution $A(\theta' | \theta)$

Thus:
$$p(\theta' | \theta) = g(\theta' | \theta)A(\theta' | \theta) \tag{103}$$
Using (103) in (102) yields
$$\frac{p(\theta')}{p(\theta)} = \frac{g(\theta' | \theta)A(\theta' | \theta)}{g(\theta | \theta')A(\theta | \theta')} \tag{104}$$
The acceptance probability that fulfills condition (104) is given by
$$A(\theta' | \theta) = \min\left( 1,\; \frac{p(\theta')\,g(\theta | \theta')}{p(\theta)\,g(\theta' | \theta)} \right)$$
with the normalization $A(\theta | \theta') = 1$.
The Metropolis-Hastings Algorithm:

1. Step 1: initialize $\theta^{(0)}$

2. Step 2: in each iteration $g = 1, \ldots, n_0 + G$:

(a) Propose: draw a proposal value $\theta^*$ from $g(\theta | \theta^{(g)}, y)$ and calculate the quantity (the acceptance probability or the probability of move)
$$\alpha(\theta^* | \theta^{(g)}, y) = \min\left\{ 1,\; \frac{p(\theta^* | y)\,g(\theta^{(g)} | \theta^*, y)}{p(\theta^{(g)} | y)\,g(\theta^* | \theta^{(g)}, y)} \right\}$$
(b) Move: set
$$\theta^{(g+1)} = \begin{cases} \theta^* & \text{with probability } \alpha(\theta^* | \theta^{(g)}, y) \\ \theta^{(g)} & \text{with probability } 1 - \alpha(\theta^* | \theta^{(g)}, y) \end{cases}$$
3. Step 3: Discard the draws from the first $n_0$ iterations and save the subsequent $G$ draws $\theta^{(n_0+1)}, \ldots, \theta^{(n_0+G)}$
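A minimal random-walk MH sketch of these steps (the standard Normal target and the step size are illustrative assumptions; with a symmetric proposal the $g$ terms cancel in the acceptance probability):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(th):
    return -0.5 * th**2            # hypothetical target: N(0,1) up to a constant

n0, G, tau = 1_000, 10_000, 1.0    # burn-in, saved draws, random-walk step size
th = 0.0                           # Step 1: initialize theta^(0)
draws = np.empty(n0 + G)
for g in range(n0 + G):
    prop = th + tau * rng.normal()                            # Step 2(a): propose
    alpha = min(1.0, np.exp(log_post(prop) - log_post(th)))   # acceptance probability
    if rng.uniform() < alpha:                                 # Step 2(b): move
        th = prop
    draws[g] = th

kept = draws[n0:]                  # Step 3: discard the burn-in draws
print(kept.mean(), kept.var())     # approximately 0 and 1
```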
Mathematica demonstration 1: Metropolis-Hastings algorithm
http://demonstrations.wolfram.com/MarkovChainMonteCarloSimulationUsingTheMetropolisAlgorithm/
Mathematica demonstration 2: Metropolis-Hastings algorithm 2
http://demonstrations.wolfram.com/TargetMotionWithTheMetropolisHastingsAlgorithm/
28 Neural Networks and Machine Learning

28.1 Artificial Neural Networks
Artificial Neural Networks (ANNs) are models that allow complex nonlinear relationships between the response variable and its predictors. A neural network is composed of observed and unobserved random variables, called neurons (also called nodes), organized in layers. The observed predictor variables form the "Input" layer, and the predictions form the "Output" layer. Intermediate layers contain unobserved random variables (so-called "hidden neurons").

The simplest ANN with no hidden layers is equivalent to a linear regression. In the ANN notation, the formula for the fitted regression model is
$$\hat{y} = a_1 + w_{1,1}x_1 + w_{2,1}x_2 + w_{3,1}x_3 + w_{4,1}x_4$$
The coefficients $w_{k,j}$ attached to the predictors $x_k$ are called weights. Once we add intermediate layer(s) with hidden neurons and activation functions, the ANN becomes non-linear. An example shown in the following figure is known as the feed-forward network (FFN).
The weights $w_{k,j}$ are selected in the ANN framework using a machine learning algorithm that minimizes a loss function, such as the Mean Squared Error (MSE) or the Sum of Squared Residuals (SSR). In the special case of linear regression, OLS provides an analytical solution to the learning algorithm that minimizes SSR. In general ANNs, the response variable is a nonlinear function of the predictors and hence OLS is not applicable as a method of model fit. A neural network with many hidden layers is called a deep neural network (DNN) and its training algorithm is called deep learning.
28.2 Feed-Forward Network

In an FFN each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are merged using a weighted linear combination. For example, the inputs into hidden neuron $j$ (blue dots) in the figure above are combined linearly to give
$$z_j = a_j + \sum_{k=1}^4 w_{k,j}x_k \tag{105}$$
In the hidden layer, the linear combination $z_j$ is transformed using a nonlinear activation function $s(z_j)$, which adds flexibility and complexity to the model. Without the activation function the model would be limited to a linear combination of predictors, i.e. multiple regression. Popular activation functions are the logistic (or sigmoid) function
$$\mathrm{logistic}(z_j) = \frac{1}{1 + e^{-z_j}} \tag{106}$$
and the tanh function
$$\tanh(z_j) = \frac{e^{z_j} - e^{-z_j}}{e^{z_j} + e^{-z_j}}$$
shown in the following Figure.
For building an ANN model, we need to pre-specify in advance: the number of hidden layers, the number of nodes in each hidden layer, and the functional form of the activation function. The parameters $a_1, a_2, a_3$ and $w_{1,1}, \ldots, w_{4,3}$ are "learned" from the data.
28.3 Machine Learning for NNs

Machine learning for neural networks, also known as "training" or "optimization", involves searching for the set of weights that best enable the network to model the patterns in the data. The goal of a learning algorithm is typically to minimize a loss function that quantifies the lack of fit of the network to the data. We broadly distinguish several types of learning:

Supervised learning

– We supply the ANN with inputs and outputs, as in the examples above

– The weights are modified to reduce the difference between the predicted and actual outputs using a loss function

– Examples: NN (auto)regression, face, speech, or handwriting recognition, spam detection

Unsupervised learning

– We supply the ANN with inputs only

– The ANN works only on the input values so that similar inputs create similar outputs

– Examples: K-means clustering, dimensionality reduction

We will focus here on supervised learning only, as it is the method of choice in typical economic applications. Supervised learning typically involves many cycles of so-called forward propagation and backpropagation.

The process of forward propagation in FFNs involves:

Computing $z_j$, as in (105), at every hidden neuron $j$

Applying the activation function $s(z_j)$ at each $j$, as in (106)

Constructing a linear combination of the $s(z_j)$'s to obtain the predicted output

Once the predicted output is obtained at the output layer, we compute the loss or "error" (predicted output minus the original output).
The goal of backpropagation is to adjust the weights in each layer to minimize the overall error (loss) at the output layer. One iteration of forward and backpropagation is called an epoch. Typically, many epochs (often tens of thousands) are required to train a neural network well.

The following steps are typically taken in FFN supervised learning for one vector of inputs $x_i = (x_{i,1}, \ldots, x_{i,K})$ and corresponding output $y_i$ (a sketch of one such epoch in code follows this list):

1. (Initialization): Assign random weights and "biases" (parameters $a_j$) to each of the neurons in the hidden layer

2. (Forward propagation start): Obtain $z_j = a_j + \sum_{k=1}^K w_{k,j}x_{i,k}$ at each hidden neuron

3. Apply the activation function at each hidden neuron, $s(z_j)$

4. Take this output and pass it onto the next layer of neurons, until the output layer (forward propagation end)

5. Calculate the error term at the output neuron(s) by $e_i = \hat{y}_i - y_i$, where $\hat{y}_i$ is the predicted output value and $y_i$ is the actual data value

6. (Backpropagation start): Obtain the loss function, typically using the square loss
$$L(e_i) = e_i^2 = (\hat{y}_i - y_i)^2 \tag{107}$$
In the FFN example (105) and (106) above,
$$L(e_i) = \left( \sum_{j=1}^3 w_j s(z_j) - y_i \right)^2 = \left( \sum_{j=1}^3 w_j s\left( a_j + \sum_{k=1}^4 w_{k,j}x_{i,k} \right) - y_i \right)^2$$
7. Using the chain rule, obtain $\frac{\partial L(e_i)}{\partial w}$. In the FFN example (105) and (106) above,
$$\frac{\partial L(e_i)}{\partial w_{k,j}} = 2e_i\,w_j\,x_{i,k}\,s'\left( a_j + \sum_{k=1}^4 w_{k,j}x_{i,k} \right)$$
and similarly for $\frac{\partial L(e_i)}{\partial w_j}$; update each weight by a step in the direction opposite to its partial derivative (gradient descent method) (backpropagation end)
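The sketch below runs these steps for a single observation in the 4-input, 3-hidden-neuron FFN of (105)–(106) (the data, learning rate, and number of epochs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, J = 4, 3                          # 4 inputs, 3 hidden neurons, as in the figure
x = rng.normal(size=K)               # one input vector x_i
y = 1.5                              # the corresponding output y_i

W = rng.normal(size=(K, J)) * 0.1    # step 1: random weights w_{k,j}
a = np.zeros(J)                      # and biases a_j
w_out = rng.normal(size=J) * 0.1     # output-layer weights w_j

def s(z):                            # logistic activation (106)
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for epoch in range(2_000):
    z = a + x @ W                    # steps 2-4: forward propagation, (105)
    h = s(z)
    y_hat = w_out @ h
    e = y_hat - y                    # step 5: error; step 6: loss is e^2, (107)
    grad_wout = 2 * e * h                   # step 7: chain rule, dL/dw_j
    grad_z = 2 * e * w_out * h * (1 - h)    # s'(z) = s(z)(1 - s(z))
    w_out -= lr * grad_wout                 # gradient descent updates
    a -= lr * grad_z
    W -= lr * np.outer(x, grad_z)           # dL/dw_{k,j} = 2 e w_j s'(z_j) x_{i,k}

print(y_hat, y)   # the prediction approaches the target over the epochs
```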
Backpropagation works well with relatively shallow networks with only a few hidden layers. As the network gets deeper, with more hidden layers, the training time can increase exponentially and the learning algorithm may not converge. The chain rule multiplication of terms in the last step above often results in numerical values of less than 1 that compound during backpropagation. Weight adjustment slows down, and the deepest layers closest to the inputs either train very slowly or do not move away from their starting positions. This phenomenon is known as the vanishing gradient problem. A typical remedy involves changes in the network structure, such as in the case of Recurrent Neural Networks (RNN) covered further below.
28.4 Cross-Validation

Machine learning methods typically use three different subsets of the complete data set:

1. A training set is used by the machine learning algorithm described above to train the model. The outputs are the fitted weights (model parameters) for any one given model structure (e.g. 2 hidden layers, 5 neurons each). Training the network minimizes the training loss, typically a mean-square error called the training MSE. An analyst often trains several or many networks with different structures. For the loss function $L(e_i)$ in (107) and a training set of size $M$ we obtain the training MSE as
$$L_M = \frac{1}{M}\sum_{i=1}^M L(e_i)$$
2. Each trained (optimized) model is then evaluated on a validation set, not used in training the model. Evaluating the loss function then results in the validation MSE for each model. We can compare the validation MSE for the different models (e.g. with different numbers of hidden layers and neurons) and select the model with the smallest validation MSE. The model structural and learning elements (such as the number of hidden layers, number of neurons, optimizer and epochs) are referred to as hyperparameters. Their selection using the validation MSE is referred to as hyperparameter tuning.

3. The selected model is then applied to a test set, not used in training or selecting the model. The test set is used only once at the end of the selection process. The aim is to get a measure of how the selected model will perform on future data that have not yet been observed. For example, if we train a neural net using stock market data for the past 6 months, we would like to see how well the model can predict the next month. Evaluating the model at the test set gives us an estimate of the model prediction error. The test MSE is
$$MSE(test) = \frac{1}{R}\sum_{i=1}^R (\hat{y}_i - y_i)^2$$
for the selected model predictions $\hat{y}_i$ based on the test set inputs $x_i$, and observed test set data on $y_i$.

The "validation" and "test" sets are commonly referred to as the hold-out (i.e. any set not used in training). If we only train one given model (e.g. 3 hidden layers, 10 neurons each) without further model selection, then we don't need to use the validation set, and the complete data set is only split into the training set and the test set. Similarly, if we don't plan to estimate the model prediction error, then we don't need to set aside a test set, and we only split the complete data set into the training set and the validation set.
28.4.1
Random-split Cross-Validation
A basic way of obtaining a validation set is to split the complete data set randomly into the
training set (blue) and the validation set (beige), typically of about equal size, as shown in
the …gure below
This approach is referred to as random-split cross-validation (CV). A drawback of this
approach is that the model prediction error can be highly variable depending on which
observations are included in the training set and which observations are included in the
validation set.
28.4.2
Leave-one-out Cross-Validation
Leave-one-out cross-validation places a single observation in the validation set while the
remainder of the data forms the training set. This is repeated so that each observation
serves as the validation set exactly once. Each time, a prediction error is obtained for
the validation observation i:

e_i = y_i - \hat{y}_i

For a complete data set of size n, the leave-one-out cross-validation MSE is then obtained
as

MSE(n) = \frac{1}{n} \sum_{i=1}^{n} e_i^2
The whole procedure requires training the given ANN model n times. This approach is more
stable than random-split CV, but it can be very time-consuming for large samples and
complex models. In the figure below, the leave-one-out cross-validation training set is
displayed in blue and the validation set in beige.
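A minimal sketch of the leave-one-out loop, again reusing the simulated x, y, n and the
mse helper from the sketches above, with the degree-3 polynomial standing in for the ANN:

    # Leave-one-out CV: observation i is held out, the model is refit on the
    # remaining n - 1 points, and e_i is the resulting prediction error.
    errors = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False                          # validation set = {i}
        coefs = np.polyfit(x[mask], y[mask], 3)  # train on the other n - 1 points
        errors[i] = y[i] - np.polyval(coefs, x[i])
    loocv_mse = np.mean(errors ** 2)             # MSE(n)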
28.4.3 k-fold Cross-Validation
k-fold cross-validation involves randomly dividing the set of observations into k groups,
or "folds", of approximately equal size. Each fold in turn forms the validation set while
the remaining k - 1 folds form the training set, so that a validation error MSE(j) is
obtained for each fold j = 1, ..., k. The k-fold cross-validation MSE is then obtained as

CV(k) = \frac{1}{k} \sum_{j=1}^{k} MSE(j)
Leave-one-out cross-validation is the special case with k = n. For k < n, fitting the
model only k times can result in substantial savings in computation time. In the figure
below, the k-fold cross-validation training set is displayed in blue and the validation
set in beige.
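A sketch of the k-fold loop under the same illustrative setup (the choice k = 5 is
arbitrary; 5 or 10 are common in practice):

    # k-fold CV: shuffle once, split into k folds, and let each fold serve as
    # the validation set while the other k - 1 folds form the training set.
    k = 5
    folds = np.array_split(np.random.default_rng(0).permutation(n), k)
    fold_mses = []
    for j in range(k):
        va = folds[j]
        tr = np.concatenate([folds[m] for m in range(k) if m != j])
        coefs = np.polyfit(x[tr], y[tr], 3)
        fold_mses.append(mse(y[va], np.polyval(coefs, x[va])))
    cv_k = np.mean(fold_mses)                    # CV(k)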
28.5 Model Fit
We can broadly distinguish three different categories of ANN models:
1. Overfit model - a model that is very complex and flexible, with many hidden layers
and neurons, which adapts very well to the training data set (low training MSE) but
performs poorly in the validation set (high validation MSE)
2. Underfit model - a model with minimal flexibility, with a small number of hidden
layers and neurons, which adapts poorly to the training sample (high training MSE) and
also performs poorly in the validation set (high validation MSE)
3. Good fit model - a model that suitably learns the training data set (relatively low
training MSE) and performs suitably well in the validation data (relatively low
validation MSE)
A model fit can be considered in the context of the bias-variance trade-off. An underfit
model has high bias and low variance: regardless of the specific samples in the training
data, it cannot learn the problem. An overfit model has low bias and high variance: the
model learns the training data too well, and its performance varies widely with new
unseen examples, or even with statistical noise added to examples in the training data
set. A good fit model balances bias and variance.
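The three categories can be diagnosed by comparing training and validation MSE as model
flexibility grows. A sketch under the same illustrative polynomial setup as above, where
degrees 1, 3 and 10 stand in for underfit, good-fit and overfit network structures:

    # Compare training vs. validation MSE across increasing flexibility.
    rng = np.random.default_rng(1)
    idx = rng.permutation(n)
    tr, va = idx[:n // 2], idx[n // 2:]
    for degree in (1, 3, 10):
        coefs = np.polyfit(x[tr], y[tr], degree)
        print(degree,
              mse(y[tr], np.polyval(coefs, x[tr])),   # training MSE
              mse(y[va], np.polyval(coefs, x[va])))   # validation MSE
    # Typical pattern: degree 1 has high training and validation MSE (underfit);
    # degree 10 attains the lowest training MSE but a higher validation MSE
    # (overfit); degree 3 keeps both relatively low (good fit).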
28.6 Recurrent Neural Networks
Recall the structure of Feedforward Neural Networks (FFNs). In the following figure, the
earlier FFN graph is turned counter-clockwise to facilitate the transition to recurrent
NNs. In general, FFNs do not account for sequential dependence in the hidden layer
structure. As such, FFNs are typically used for cross-sectional data with independent
inputs.
Recurrent Neural Networks (RNNs) are designed specifically for sequential data. In RNNs,
connections between nodes form a graph along a temporal sequence, which allows RNNs to
exhibit dynamic behavior. Typical uses of RNNs include:
- Speech processing (topic classification, sentiment analysis, language translation)
- Handwriting recognition
- Time series modelling and forecasting
In RNNs, the output at the current time step depends on the current input as well as on
the previous state via recurrent edges (feedback loops):
We can "unroll" the RNN scheme in time to obtain a directed graph representation
The input is a sequence of vectors fXt gL
t=1 : Each Xt feeds into the hidden layer, which
also has as input the hidden state vector At 1 from the previous element in the sequence,
and produces the current hidden state vector At : The output layer produces a sequence of
predictions Ot : As the network proceeds from t = 1 to t = L; the hidden nodes (or states)
accumulate a history of information used for prediction.
Suppose each vector X_t of the input sequence has p components, X_t = (X_{t1}, X_{t2},
\ldots, X_{tp})'. The hidden layer consists of K hidden neurons, A_t = (A_{t1}, A_{t2},
\ldots, A_{tK})'. The weights w_{kj} of the input layer are collected into a matrix W of
dimension K \times (p + 1), which includes the intercepts (or "biases"). The weights
u_{ks} of the hidden-to-hidden layer are collected into a matrix U of dimension
K \times K. The weights \beta_k for the output layer are collected into a vector B of
dimension K + 1, which includes the intercept.
The RNN model for the hidden layer is then given by

A_{tk} = g\left( w_{k0} + \sum_{j=1}^{p} w_{kj} X_{tj} + \sum_{s=1}^{K} u_{ks} A_{t-1,s} \right)

where g(\cdot) is an activation function. The RNN model for the output O_t is given by

O_t = \beta_0 + \sum_{k=1}^{K} \beta_k A_{tk}
The same sets of weights W, U and B are used as each element of the sequence is processed,
i.e. they are not functions of t.
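A minimal numpy sketch of the forward pass defined by the two equations above. The
dimensions p, K, L, the tanh choice for g, and the random untrained weights are
illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    p, K, L = 3, 4, 10              # input components, hidden neurons, sequence length
    X = rng.normal(size=(L, p))     # input sequence X_1, ..., X_L

    W = rng.normal(size=(K, p + 1)) # input weights w_kj, incl. intercepts w_k0
    U = rng.normal(size=(K, K))     # hidden-to-hidden weights u_ks
    B = rng.normal(size=K + 1)      # output weights beta_k, incl. intercept beta_0

    g = np.tanh                     # activation function g
    A = np.zeros(K)                 # initial hidden state A_0
    O = np.empty(L)
    for t in range(L):
        # A_tk = g(w_k0 + sum_j w_kj X_tj + sum_s u_ks A_{t-1,s})
        A = g(W[:, 0] + W[:, 1:] @ X[t] + U @ A)
        # O_t = beta_0 + sum_k beta_k A_tk
        O[t] = B[0] + B[1:] @ A
    # Note: the same W, U, B are applied at every t.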
There are many variations and enhancements of the base-case RNN design introduced above.
Examples include:
- GRU and LSTM node designs, which we briefly cover below;
- Bidirectional RNNs, which obtain information from past (backward) and future (forward)
states simultaneously, used in language translation;
- Recurrent multilayer perceptron (RMLP) networks, consisting of cascaded subnetworks,
each of which contains multiple layers of nodes, used in image recognition.
28.6.1 Training RNNs
RNNs can in principle be trained like FFNs using backpropagation. However, when
forecasting output at time T + 1, the chain rule for the derivative of the loss function
at time T applies the activation function g repeatedly: t + 1 times for the weight
adjustment at time T - t. This leads to the problem of a vanishing gradient when the
resulting per-step factors are smaller than 1 in magnitude, or an exploding gradient when
they are larger than 1. The problem can make base-case RNN backpropagation infeasible for
large T.
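A one-line numeric illustration: backpropagating through 100 time steps multiplies the
gradient by roughly the same per-step factor 100 times (the factors 0.9 and 1.1 below are
illustrative stand-ins for that per-step scaling):

    # Repeated multiplication drives the gradient to zero or to infinity.
    for factor in (0.9, 1.1):
        print(factor, factor ** 100)   # 0.9**100 ~ 2.7e-5, 1.1**100 ~ 1.4e+4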
Long Short-Term Memory (LSTM) networks and Gated Recurrent Unit (GRU) networks were
created to address this training problem. They have internal mechanisms called gates that
regulate the flow of information. The gates can learn which data in a sequence are
important to keep or to delete, so that only the relevant information is passed down the
long chain of sequence steps. Almost all state-of-the-art results based on RNNs are
achieved with the LSTM or GRU node structure. An excellent animated guide is under this
link.
Reference: sections 2.2, 5.1, and 10.5 of James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021)
Part IV
References
Balakrishnan, N. and V. B. Nevzorov (2003) A Primer on Statistical Distributions, Wiley.
Available online via U of T libraries.
Cameron, A. C. and P. K. Trivedi (2005) Microeconometrics: Methods and Applications,
Cambridge University Press. Available online via U of T libraries.
Aliprantis, C. D. and K. C. Border (2006) Infinite Dimensional Analysis: A Hitchhiker's
Guide, 3rd ed., Springer. Available online via U of T libraries.
Geweke, J., G. Koop, and H. van Dijk (2011) The Oxford Handbook of Bayesian Econometrics,
Oxford University Press. Available online.
Greenberg, E. (2012) Introduction to Bayesian Econometrics, Cambridge University Press.
Available online via U of T libraries.
James, G., D. Witten, T. Hastie, and R. Tibshirani (2021) An Introduction to Statistical
Learning, 2nd ed., Springer Texts in Statistics. Available online.
Judd, K. L. (1998) Numerical Methods in Economics, MIT Press. Available online via U of T
libraries.
Miranda, M. and P. Fackler (2002) Applied Computational Economics and Finance, MIT Press.
Ok, E. A. (2007) Real Analysis with Economic Applications, Princeton University Press.
Available online via U of T libraries. Errata are available at the book's website.
Ruppert, D. and D. S. Matteson (2015) Statistics and Data Analysis for Financial
Engineering: with R Examples, chapter 20: Bayesian Data Analysis and MCMC, Springer.
Available online via U of T libraries.
Simon, C. P. and L. Blume (1994) Mathematics for Economists, W. W. Norton.